How to read in csv correctly with pandas? - python

I have a csv-file which looks like this:
A B C
1 2 3 4 5 6 7
8 9 1 2 3 4 5
When I read in this file using this code:
df2 = pd.read_csv(r'path\to\file.csv',delimiter=';')
I get a pandas dataframe which consists of three columns named A, B and C.
The first five values of each row of my actual csv file are taken as the index, the last two end up in columns A and B, and in C I only get NaN values.
Instead I would like to get a dataframe with A, B and C as the first three columns and the rest as Unnamed columns. I think it is maybe due to a formatting issue with my csv-file but I do not know how to solve this.
Thanks a lot!

Try this:
df2 = pd.read_csv(r'path\to\file.csv', delimiter=' ', names=['A', 'B', 'C', 'D', 'E', 'F', 'G'], skiprows=1, index_col=False)
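To make that call reproducible without the file, here is a self-contained sketch using io.StringIO as a stand-in for path\to\file.csv:

```python
import io

import pandas as pd

# inline stand-in for the file: a 3-name header row over 7-value data rows
data = "A B C\n1 2 3 4 5 6 7\n8 9 1 2 3 4 5\n"

# skip the short header row and supply seven column names explicitly;
# index_col=False stops pandas from turning leading columns into the index
df2 = pd.read_csv(io.StringIO(data), delimiter=' ',
                  names=['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                  skiprows=1, index_col=False)
print(df2)
```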

Related

Split strings into different columns not working correctly

I am working with a large dataset with a column for reviews, which consists of strings such as "A,B,C", "A,B*,B", etc.
for example,
import pandas as pd
df = pd.DataFrame({'cat1': [1, 2, 3, 4, 5],
                   'review': ['A,B,C', 'A,B*,B,C', 'A,C', 'A,B,C,D', 'A,B,C,A,B']})
df2 = df["review"].str.split(",", expand=True)
df.join(df2)
I want to split that column up into separate columns for each letter, then add those columns into the original data frame. I used df2 = df["review"].str.split(",",expand = True) and df.join(df2) to do that.
However, when I use df["A"].unique() there are entries that should not be in the column. I only want 'A' to appear there, but there are also B and C. Also, B and B* are not splitting into two columns.
My dataset is quite large so I don't know how to properly illustrate this problem; I have tried to provide a small-scale example, but everything seems to work correctly in it.
I have tried to look through the original column with df['review'].unique() and all entries were entered correctly (no missing commas or anything like that), so I was wondering if there is something wrong with my approach that would make it fail on other datasets, or whether there is something wrong with my dataset.
Does anyone have any suggestions as to how I should troubleshoot?
when i use df["A"].unique() there are entries that should not be in the column. I only want 'A' to appear there
IIUC, you wanted to create dummy variables instead?
df2 = df.join(df['review'].str.get_dummies(sep=',').pipe(lambda x: x*[*x]).replace('',float('nan')))
Output:
   cat1     review  A    B   B*  C    D
0     1      A,B,C  A    B  NaN  C  NaN
1     2   A,B*,B,C  A    B   B*  C  NaN
2     3        A,C  A  NaN  NaN  C  NaN
3     4    A,B,C,D  A    B  NaN  C    D
4     5  A,B,C,A,B  A    B  NaN  C  NaN
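For readability, the one-liner can be unpacked into steps (a sketch on the sample frame from the question):

```python
import pandas as pd

df = pd.DataFrame({'cat1': [1, 2, 3, 4, 5],
                   'review': ['A,B,C', 'A,B*,B,C', 'A,C', 'A,B,C,D', 'A,B,C,A,B']})

# step 1: one 0/1 indicator column per distinct comma-separated token
dummies = df['review'].str.get_dummies(sep=',')
# step 2: multiply each 0/1 column by its own name, giving '' or the token
labelled = dummies * [*dummies]   # [*dummies] is just the list of column names
# step 3: replace empty strings with NaN and join back onto the original frame
df2 = df.join(labelled.replace('', float('nan')))
print(df2)
```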

selecting rows with min and max values of a defined column in pandas

I have the following dataframe:
A,B,C,D
10,1,2,3
1,4,7,3
10,5,2,3
40,7,9,3
9,9,9,9
I would like to create another dataframe, starting from the previous one, which has only two rows. The selection of these two rows is based on the minimum and maximum value in the column "A". I would like to get:
A,B,C,D
1,4,7,3
40,7,9,3
Do you think I should work with something like index.min and index.max and then select only those two rows and append them to a new dataframe? Do you have some other suggestions?
Thanks for any kind of help,
Best
IIUC you can simply subset the dataframe with an OR condition on df.A.min() and df.A.max():
df = df[(df.A==df.A.min())|(df.A==df.A.max())]
df
A B C D
1 1 4 7 3
3 40 7 9 3
Yes, you can use idxmin/idxmax and then use loc:
df.loc[df['A'].agg(['idxmin', 'idxmax'])]
Output:
A B C D
1 1 4 7 3
3 40 7 9 3
Note that this only gives one row for the min and one for the max. If you want all tied rows, you should use @CHRD's solution.
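Both approaches can be checked on the question's sample data (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 1, 10, 40, 9],
                   'B': [1, 4, 5, 7, 9],
                   'C': [2, 7, 2, 9, 9],
                   'D': [3, 3, 3, 3, 9]})

# boolean mask: keeps every row tied for the min or the max of A
by_mask = df[(df.A == df.A.min()) | (df.A == df.A.max())]

# idxmin/idxmax: exactly one row for the min and one for the max
by_idx = df.loc[df['A'].agg(['idxmin', 'idxmax'])]
print(by_mask)
print(by_idx)
```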

how to merge csv cells by pandas

I have the dataframe below.
I want to transform it into the one below, merging cells with the same value in a column.
Can anyone provide some sample code?
Try this:
df.loc[df.duplicated(['A', 'B']), ['A', 'B']] = ''
This finds the duplicated values and masks them with an empty string.
I/P:
A B C
0 1 a A
1 1 a B
2 2 b C
3 2 b A
O/P:
A B C
0 1 a A
1 B
2 2 b C
3 A
Note: you can't exactly merge cells using pandas; the idea is to suppress every value except the first record.
Based on the sample data generated by @mohamed thasin ah,
df.groupby(['A', 'B'], as_index=False).agg(', '.join)
A B C
0 1 a A, B
1 2 b C, A
so try:
df.groupby(['cd', 'ci', 'ui', 'module_behavior', 'feature_behavior', 'at']).agg(', '.join)
The output that you want seems to be an Excel file. If that is the case, I suggest:
df.groupby(['cn', 'ci', 'ui', 'module_behaviour', 'feature_behaviour', 'at']).apply(
    lambda x: x.sort_values('caseid')).to_excel('filename.xlsx')
Pandas will group by those columns and turn them into multilevel indexes, and to_excel saves the DataFrame to an Excel file with the default setting merge_cells=True.
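On a small sample like the one above, the two non-Excel approaches can be compared side by side (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': ['a', 'a', 'b', 'b'],
                   'C': ['A', 'B', 'C', 'A']})

# approach 1: blank out repeated A/B values so they read like merged cells
masked = df.copy()
masked.loc[masked.duplicated(['A', 'B']), ['A', 'B']] = ''

# approach 2: collapse each (A, B) group into one row, joining the C values
joined = df.groupby(['A', 'B'], as_index=False).agg(', '.join)
print(masked)
print(joined)
```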

How to merge every N rows in a csv file using python pandas

I have a long column of data where each row contains a single digit, and I want to merge every 5 rows into 1 using Python.
for example:
A Result
1 12335
2 23352
3 33525
3 35251
5 ...
2 ...
5 ...
1 ...
The first result contains rows 1 to 5 and the second result contains rows 2 to 6. Can someone help me with that? Any answer would be appreciated!
Something like this should work:
df = pd.DataFrame({'input': ['1', '2', '3', '4', '5', '6', '7']})
df['result'] = df['input']
# add the next four shifted values so each result holds 5 characters
for i in range(1, 5):
    df['result'] = df['result'] + df['input'].shift(-i)
This also works, but it is not very elegant (and note that it merges non-overlapping blocks of 5 rows rather than the rolling window the question asks for):
dz = (df.iloc[0::5, :].reset_index(drop=True)
      + df.iloc[1::5, :].reset_index(drop=True)
      + df.iloc[2::5, :].reset_index(drop=True)
      + df.iloc[3::5, :].reset_index(drop=True)
      + df.iloc[4::5, :].reset_index(drop=True))
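For the rolling result the question describes, here is a window-join sketch, assuming the digits are stored as strings (the sample digits are taken from the question's example):

```python
import pandas as pd

# digits from the question's example, stored as strings
df = pd.DataFrame({'A': ['1', '2', '3', '3', '5', '2', '5', '1']})

n = 5
# join each window of n consecutive characters into one string;
# windows that would run past the end are left as None
df['Result'] = [''.join(df['A'][i:i + n]) if i + n <= len(df) else None
                for i in range(len(df))]
print(df)
```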

Pandas indexing and accessing columns by names

I am trying to access pandas dataframe by column names after indexing the df with a specific column and it returns incorrect column values.
import pandas as pd
rs =pd.read_csv('rs.txt', header="infer", sep="\t", names=['id', 'exp','fov','cycle', 'color', 'values'], index_col=2)
rs.cycle.head()
I am indexing the df here with 'fov' and I want to access the 'cycle' column, it gives me the color column instead. I think I am missing something here?
EDIT
The first few lines of the input file are:
6 3 1 G 0.96593
6 3 1 O 0.88007
6 3 1 R 0.94305
6 3 2 B 0.90554
6 3 2 G 0.93146
I think the problem arises because your data file has 5 columns while your names list has 6 elements. To verify, check the first few values in the id column: these will all be 6 if I am right, and the first few items in the exp column will have the value 3.
To fix this, read your input file like so:
rs = pd.read_csv('rs.txt', header="infer", sep="\t", names=['exp', 'fov', 'cycle', 'color', 'values'], index_col=2)
Pandas will automatically insert row identifiers.
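The mismatch can be reproduced inline (a sketch, with io.StringIO standing in for rs.txt):

```python
import io

import pandas as pd

# five tab-separated columns per row, as in the question's sample
data = ("6\t3\t1\tG\t0.96593\n"
        "6\t3\t1\tO\t0.88007\n"
        "6\t3\t2\tB\t0.90554\n")

# six names for five columns: every value lands one name too far left,
# so 'cycle' ends up holding the color letters and 'values' is all NaN
bad = pd.read_csv(io.StringIO(data), sep='\t',
                  names=['id', 'exp', 'fov', 'cycle', 'color', 'values'])

# five names match the five columns, and index_col=2 makes 'cycle' the index
good = pd.read_csv(io.StringIO(data), sep='\t',
                   names=['exp', 'fov', 'cycle', 'color', 'values'],
                   index_col=2)
print(bad)
print(good)
```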
