I'm trying to write a dataframe to a CSV file. However, the columns F to N should be empty. This is the dataframe I'm using:
data = [['a'], ['b'], ['c'], ['d'], ['e'], ['o']]
dataFrame = pandas.DataFrame(data).transpose()
The letters are there to clarify which column the data should go under. For example, 'c' goes under column C. However, with the current code, 'o' ends up under column F. Is there a way to tell this dataFrame that it should skip columns F to N and write 'o' under column O?
I assume it's possible to write [], [], [] many times, but this seems a bit unnecessary. Is there a smarter way to create multiple empty lists separated by commas, like in the example above?
Thanks for reading. If anything is unclear please let me know!
With a Pandas dataframe, what you desire is not possible. The name Pandas is derived from "panel data". As such, it's built around NumPy arrays, one for each series or "column" of data. You can't have "placeholders" for series which should be skipped over when exporting to a CSV or Excel file.
You can explicitly set your index equal to your dataframe values and then use pd.DataFrame.reindex with a list of letters. If you have more than 26 columns, see Get Excel-Style Column Names from Column Number.
import pandas as pd
from string import ascii_lowercase
data = [['a'], ['b'], ['c'], ['d'], ['e'], ['o']]
df = pd.DataFrame(data)
df.index = df[0]
df = df.reindex(list(ascii_lowercase)).T.fillna('')
print(df[list('abcdefg') + list('mnopqrs')])
0  a  b  c  d  e  f  g  m  n  o  p  q  r  s
0  a  b  c  d  e              o
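If the goal is a CSV with uppercase headers A through O where F to N stay blank, a minimal sketch along the same lines (the output file name out.csv is just a placeholder) could be:
import pandas as pd
from string import ascii_uppercase
data = [['a'], ['b'], ['c'], ['d'], ['e'], ['o']]
df = pd.DataFrame(data)
df.index = df[0].str.upper()  # index each value by its target column letter
df = df.reindex(list(ascii_uppercase[:15])).T.fillna('')  # columns A..O, F..N left empty
df.to_csv('out.csv', index=False)  # 'o' lands under column O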
Related
I have a dataframe:
import pandas as pd
df = pd.DataFrame({'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
that I would like to slice into two new dataframes such that the first contains every nth value, while the second contains the remaining values not in the first.
For example, in the case of n=3, the second dataframe would keep two values from the original dataframe, skip one, keep two, skip one, etc. This slice is illustrated in the following image where the original dataframe values are blue, and these are split into a green set and a red set:
I have achieved this successfully using a combination of iloc and isin:
df1 = df.iloc[::3]
df2 = df[~df.val.isin(df1.val)]
but what I would like to know is:
Is this the most Pythonic way to achieve this? It seems inefficient and not particularly elegant to take what I want out of a dataframe and then get the rest by checking what is not in the new dataframe but is in the original. Instead, is there an iloc expression, like the one used to generate df1, which could do the second part of the slicing procedure and replace the isin line? Even better, is there a single expression that could execute the entire two-step slice in one step?
Take the index modulo 3 and compare it for inequality with 0 (the complement of the rows kept by the first slice):
import numpy as np

# for a default RangeIndex
df2 = df[df.index % 3 != 0]

# for any index
df2 = df[np.arange(len(df)) % 3 != 0]

print(df2)
val
1 b
2 c
4 e
5 f
7 h
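For completeness, both halves can be produced from a single boolean mask, which avoids the isin check entirely; a short sketch:
import numpy as np
import pandas as pd
df = pd.DataFrame({'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
mask = np.arange(len(df)) % 3 == 0  # True for every 3rd row (0, 3, 6, ...)
df1, df2 = df[mask], df[~mask]  # the kept rows and the remainder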
How can I add a column to a pandas dataframe with values 'A', 'B', 'C', 'A', 'B' etc? i.e. ABC repeating down the rows. Also I need to vary the letter that is assigned to the first row (i.e. it could start ABCAB..., BCABC... or CABCA...).
I can get as far as:
df.index % 3
which gets me the index as 0,1,2 etc, but I cannot see how to get that into a column with A, B, C.
Many thanks,
Julian
If I've understood your question correctly, you can create a list of the letters as follows, and then add that to your dataframe:
from itertools import cycle
from random import randint
letter_generator = cycle('ABC')
offset = randint(0, 2)
dataframe_length = 10 # or just use len(your_dataframe) to avoid hardcoding it
column = [next(letter_generator) for _ in range(dataframe_length + offset)]
column = column[offset:]
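To attach the generated list to your dataframe (assuming it is called df and dataframe_length matches len(df)):
df['col'] = column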
What I would do:
df['col'] = (df.index % 3).map({0: 'A', 1: 'B', 2: 'C'})
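If you also need to vary the letter on the first row, one sketch is to shift the index by an offset before taking the modulo (offset is an illustrative variable: 0 starts at 'A', 1 at 'B', 2 at 'C'):
offset = 1  # example: the column starts at 'B'
df['col'] = ((df.index + offset) % 3).map({0: 'A', 1: 'B', 2: 'C'})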
I have a CSV file which contains 436 columns and 14k rows.
The data inside the cells is strings.
For example, it looks like this:
A,A,A,B,B,C,C,,,,,
D,F,D,F,D,F,H,,,,,
My goal is to keep only the unique values in every row, like this:
A,B,C,,,,,,,,
D,F,H,,,,,,,,
The file is a csv/txt file. I can use a Jupyter notebook (with Python 3 or any other code you provide), since that is my working environment. Any help would be amazing!
I have also loaded the csv into a DataFrame in the notebook. What do you suggest?
First you have to read your csv file into a numpy array. Then for each row, I'd do something like:
import numpy as np

s = 'A,A,A,B,B,C,C'
f = s.split(',')
np.unique(np.array(f))
which gives array(['A', 'B', 'C'], dtype='<U1').
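A sketch of applying the same idea line by line over the whole file (the names input.csv and output.csv are placeholders; note that np.unique also sorts the values):
import numpy as np
with open('input.csv') as src, open('output.csv', 'w') as dst:
    for line in src:
        values = [v for v in line.strip().split(',') if v]  # drop empty cells
        unique = np.unique(np.array(values))  # sorted unique values
        dst.write(','.join(unique) + '\n')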
If you have the csv loaded as a dataframe df:
0 1 2 3 4 5 6
0 A A A B B C C
1 D F D F D F H
Iterate over the rows and find the unique values in each row:
unique_vals = []
for _, row in df.iterrows():
unique_vals.append(row.unique().tolist())
unique_vals
[['A', 'B', 'C'], ['D', 'F', 'H']]
You haven't mentioned the return data type so I've returned a list.
Edit: If the data set is too large, consider using the chunksize option in read_csv.
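An alternative sketch that stays in pandas, pads each row back to the original width so the trailing blanks are kept, and writes the result (unique_rows.csv is a placeholder name):
import pandas as pd
out = df.apply(lambda row: pd.Series(row.unique()), axis=1)
out = out.reindex(columns=range(df.shape[1]))  # pad with NaN up to the original width
out.to_csv('unique_rows.csv', index=False, header=False)  # NaN cells are written as blanks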
How can I merge several columns into one cell?
How can I convert a CSV file, which contains 1-by-X cells where 1 is the row count and X is a column count unknown to the user, into a new CSV file with a single cell combining all the data from the original CSV file?
Right now (one row, four columns; in fact, the number of columns will vary, as the data is extracted from a log file):
1 A B C D
What I want (one row, one column):
1 A
  B
  C
  D
The row index will not always be 1, as I have many similar rows like that.
Please refer to the original file and the expected new file at the following URL for details:
https://drive.google.com/drive/folders/1cX0o86nbLAV5Foj5avCK6E2oAvExzt5t
161.csv is the file as it is at present.
161-final.csv is what I want.
The number of rows will not change. However, the number of columns varies, as the data is extracted from a log file. In the end, each row should have only one column.
I am new to pandas. Would one approach be to count the columns and then merge them into one cell?
I really appreciate your help.
code:
import pandas as pd
import numpy as np
df = pd.DataFrame([['a', 'a', 'a', 'a'], ['b', 'b', 'b', 'b']])
print(df)
df1 = np.empty(df.shape[0], dtype=object)
df1[:] = df.values.tolist()
print(df1)
output:
0 1 2 3
0 a a a a
1 b b b b
[list(['a', 'a', 'a', 'a']) list(['b', 'b', 'b', 'b'])]
Not sure this is what you want, but you can manage to put the content of one row into a single column with DataFrame.groupby():
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4).reshape(1, 4))
df.columns = ['A', 'B', 'C', 'D']
df = df.groupby(df.index).apply(lambda x: x.values.ravel())
df = df.to_frame('unique_col')  # if needed
Output:
unique_col
0 [0, 1, 2, 3]
I am not sure it is possible to get the output as anything other than a list, as you showed in your example.
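If a plain string in the cell is preferred over a list, one sketch (reusing the same sample frame) is to join the row values, here with a newline to mirror the stacked layout in the question:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(4).reshape(1, 4), columns=['A', 'B', 'C', 'D'])
df['unique_col'] = df.astype(str).apply('\n'.join, axis=1)  # e.g. '0\n1\n2\n3'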
I used a "stupid" way to do that. It is the "pd.read_table(file)" do the magic.
import pandas as pd

def CsvMerge(file1, file2, output):
    ## Preparation for getFormated()
    ## Merge "Format.csv" and "text.csv.csv
    df1 = pd.read_csv(file1)
    ### This merges the columns into one column
    ### It also avoids a csv parsing error
    df2 = pd.read_table(file2)
    df3 = pd.concat([df1, df2], axis=1)
    print(df3)
    with open(output, 'w+', encoding="utf-8") as f:
        df3.to_csv(f, index=False)  # the with block closes the file
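A hypothetical example call (all three file names are placeholders):
CsvMerge('Format.csv', 'text.csv', 'merged.csv')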
This question is about filtering a NumPy ndarray according to some column values.
I have a fairly large NumPy ndarray (300000, 50) and I am filtering it according to values in some specific columns. It has a structured (named) dtype, so I can access each column by name.
The first column is named category_code and I need to filter the matrix to return only rows where category_code is in ("A", "B", "C").
The result would need to be another NumPy ndarray whose columns are still accessible by the dtype names.
Here is what I do now:
index = numpy.asarray([row['category_code'] in ('A', 'B', 'C') for row in data])
filtered_data = data[index]
A list comprehension like:
rows = [row for row in data if row['category_code'] in ('A', 'B', 'C')]
filtered_data = numpy.asarray(rows)
wouldn't work because the dtypes I originally had are no longer accessible.
Are there any better / more Pythonic way of achieving the same result?
Something that could look like:
filtered_data = data.where({'category_code': ('A', 'B', 'C')})
Thanks!
You can use the NumPy-based library, Pandas, which has a more generally useful implementation of ndarrays:
>>> # import the library
>>> import pandas as PD
Create some sample data as a Python dictionary whose keys are the column names and whose values are the column values as Python lists, one key/value pair per column:
>>> data = {'category_code': ['D', 'A', 'B', 'C', 'D', 'A', 'C', 'A'],
'value':[4, 2, 6, 3, 8, 4, 3, 9]}
>>> # convert to a Pandas 'DataFrame'
>>> D = PD.DataFrame(data)
To return just the rows in which category_code is either B or C: conceptually this is two steps, but it can easily be done in a single line:
>>> # step 1: create the index
>>> idx = (D.category_code== 'B') | (D.category_code == 'C')
>>> # then filter the data against that index:
>>> D.loc[idx]
category_code value
2 B 6
3 C 3
6 C 3
Note the difference between indexing in Pandas versus NumPy, the library upon which Pandas is built. In NumPy, you would just place the index inside the brackets, separating the dimensions with a "," and using ":" to indicate that you want all of the values (columns) in the other dimension:
>>> D[idx,:]
In Pandas, you use the data frame's loc indexer and place only the index inside the brackets:
>>> D.loc[idx]
If you can choose, I strongly recommend pandas: it has "column indexing" built-in plus a lot of other features. It is built on numpy.
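For reference, the filtering can also stay in plain NumPy on the structured array, which keeps the named dtype intact; a sketch with made-up sample data (np.isin needs NumPy 1.13+, np.in1d works on older versions):
import numpy as np
dt = np.dtype([('category_code', 'U1'), ('value', 'i4')])
data = np.array([('D', 4), ('A', 2), ('B', 6), ('C', 3)], dtype=dt)
filtered_data = data[np.isin(data['category_code'], ['A', 'B', 'C'])]
# filtered_data keeps the named columns: filtered_data['value'] -> array([2, 6, 3])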