Grouping data on column value - python

Hi, I have data (in an Excel file and a text file as well) like:
C1 C2 C3
1 p a
1 q b
2 r c
2 s d
And I want the output like:
C1 C2 C3
1 p,q a,b
2 r,s c,d
How can I group the data based on column values?
I am open to anything: any library, any language, any tool, like Python, Bash, or even Excel.
I think we can do this using pandas in Python, but I haven't used it before.
Any leads appreciated.

First use pandas.read_excel; the output is a DataFrame:
df = pd.read_excel('file.xlsx')
Then you can use groupby and aggregate each group with ','.join:
df = df.groupby('C1').agg(','.join).reset_index()
print (df)
C1 C2 C3
0 1 p,q a,b
1 2 r,s c,d
If there are more columns in df and you need only C2 and C3, select them as a list (note the double brackets):
df = df.groupby('C1')[['C2','C3']].agg(','.join).reset_index()
print (df)
C1 C2 C3
0 1 p,q a,b
1 2 r,s c,d
To save to an Excel file use DataFrame.to_excel, omitting the index:
df.to_excel('file.xlsx', index=False)
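The question mentions a plain text file as well; the same groupby works after reading it with pandas.read_csv. A minimal sketch, assuming a whitespace-delimited file with a header row (file.txt and out.txt are hypothetical names):
import pandas as pd

# assumption: whitespace-separated text file with header C1 C2 C3
df = pd.read_csv('file.txt', sep=r'\s+')
df = df.groupby('C1').agg(','.join).reset_index()
df.to_csv('out.txt', sep=' ', index=False)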

Related

Pandas: Join data to a single row to new columns

I'm new to pandas and have been having trouble using the merge, join and concatenate functions on a single row of data.
I'm iterating over a handful of rows in a table, and in each iteration I add some data I've found to the row I'm handling. I know, blasphemy! Thou shalt not iterate. Each iteration results in a call to a server, so I need to control flow. There aren't that many rows, and it's just for my own use. I promise I'll not iterate when I shouldn't.
That aside, my basic question is this: How do I add data to a given row where the new data has priority over existing data and has new columns?
Let's suppose I have a DataFrame df that I'm iterating over by row:
> df
c1 c2 c3
0 a b c
1 d e f
and when iterating on row 0, I get some new data that I want to add to row 0. That new data is in df_a:
> df_a
c4 c5 c6
0 g h i
I want to add data from df_a to row 0 of df so df is now:
> df
c1 c2 c3 c4 c5 c6
0 a b c g h i
1 d e f NaN NaN NaN
Next I iterate on row 1 and I get some columns which overlap and some which don't in df_b:
> df_b
c5 c7 c8
0 j k l
And again I want to add this data to row 1 so df now has
> df
c1 c2 c3 c4 c5 c6 c7 c8
0 a b c g h i NaN NaN
1 d e f NaN j NaN k l
I can't list column names because I don't know what they'll be and new ones can appear beyond my control. Rows don't have a key because the whole thing gets thrown away after I disconnect. Data I find during each iteration always overwrites what's currently in df.
Thanks in advance!
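This question has no answer above, so here is a minimal sketch of one way to do it, assuming each iteration hands you a one-row DataFrame (absorb_row is a hypothetical helper). Assignment through df.loc creates missing columns on the fly and overwrites existing values, which gives the new data priority:
import pandas as pd

df = pd.DataFrame({'c1': ['a', 'd'], 'c2': ['b', 'e'], 'c3': ['c', 'f']})

def absorb_row(df, i, new_row):
    # Write each found value into row i; .loc creates any missing
    # column (filled with NaN in other rows) and overwrites values
    # already present in that row.
    for col, val in new_row.items():
        df.loc[i, col] = val

df_a = pd.DataFrame({'c4': ['g'], 'c5': ['h'], 'c6': ['i']})
absorb_row(df, 0, df_a.loc[0])

df_b = pd.DataFrame({'c5': ['j'], 'c7': ['k'], 'c8': ['l']})
absorb_row(df, 1, df_b.loc[0])
print(df)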

Re-index pandas dataframe by union of two columns

Probably a duplicate, but I'm not even sure what to search for.
If I have a pandas dataframe like so:
index RH LH Data1 Data2 . . .
1 A1 A2 A B
2 B1 NaN C D
3 NaN C2 E F
And I want to re-index as so:
index Data1 Data2
A1 A B
A2 A B
B1 C D
C2 E F
Is there a simple-ish way to do this? Or should I just do a pair of for loops?
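Both answers below can be tested against this reconstruction of the sample frame (a sketch, assuming NaN for the blank cells and just the two data columns shown):
import numpy as np
import pandas as pd

df = pd.DataFrame({'RH': ['A1', 'B1', np.nan],
                   'LH': ['A2', np.nan, 'C2'],
                   'Data1': ['A', 'C', 'E'],
                   'Data2': ['B', 'D', 'F']})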
You can use DataFrame.set_index with every column except RH and LH, reshape with DataFrame.stack, remove the now-redundant last index level with reset_index(drop=True), convert the remaining levels back to columns, and create the final index with DataFrame.set_index:
cols = df.columns.difference(['RH','LH']).tolist()
df = (df.set_index(cols)
        .stack()
        .reset_index(len(cols), drop=True)
        .reset_index(name='idx')
        .set_index('idx'))
print (df)
Data1 Data2
idx
A1 A B
A2 A B
B1 C D
C2 E F
Or use DataFrame.melt with DataFrame.dropna, drop the variable column, and finally set the idx column as the index:
df = (df.melt(cols, value_name='idx')
        .dropna(subset=['idx'])
        .drop('variable', axis=1)
        .set_index('idx'))
print (df)
Data1 Data2
idx
A1 A B
B1 C D
A2 A B
C2 E F

Count occurrences of certain string in entire pandas dataframe

I have following dataframe in pandas
C1 C2 C3
10 a b
10 a b
? c c
? ? b
10 a b
10 ? ?
I want to count the occurrences of ? in all the columns. My desired output is the column-wise sum of occurrences.
Use:
m = df.eq('?').sum()
pd.DataFrame([m.values], columns=m.index)
C1 C2 C3
0 2 2 1
Or better:
df.eq('?').sum().to_frame().T  # thanks @user3483203
C1 C2 C3
0 2 2 1
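For reference, a self-contained version built from the sample data in the question; since the title asks about the entire dataframe, a second .sum() gives that grand total:
import pandas as pd

df = pd.DataFrame({'C1': ['10', '10', '?', '?', '10', '10'],
                   'C2': ['a', 'a', 'c', '?', 'a', '?'],
                   'C3': ['b', 'b', 'c', 'b', 'b', '?']})

print(df.eq('?').sum().to_frame().T)  # column-wise counts
print(df.eq('?').sum().sum())         # 5, total for the whole frame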

Pandas loop through Excel sheets and append to df

I am trying to loop through an Excel file and append the data from multiple sheets into a single data frame.
So far I have:
master_df = pd.DataFrame()
for sheet in target_sheets:
df1 = file.parse(sheet, skiprows=4)
master_df.append(df1, ignore_index=True)
But then when I call master_df.head() it returns an empty DataFrame.
The data on these sheets is in the same format and relates to each other.
So I would like to join them like this:
Sheet 1 contains:
A1
B1
C1
Sheet 2 contains:
A2
B2
C2
Sheet 3:
A3
B3
C3
End result:
A1
B1
C1
A2
B2
C2
A3
B3
C3
Is my logic correct, and how can I achieve this?
The code below will work even if you don't know the exact sheet names in the Excel file. You can try this:
import pandas as pd

xls = pd.ExcelFile('myexcel.xls')
frames = []
for sheet in xls.sheet_names:
    df = pd.read_excel('myexcel.xls', sheet_name=sheet)
    frames.append(df)
# DataFrame.append returns a new frame instead of modifying in place,
# so collect the per-sheet frames and concatenate them once
out_df = pd.concat(frames, ignore_index=True)
print(out_df)  # out_df will have data from all the sheets
Let me know if this helps.
Simply use pd.concat():
pd.concat([pd.read_excel(file, sheet_name=sheet) for sheet in ['Sheet1','Sheet2','Sheet3']], axis=1)
For example, this will yield:
A1 B1 C1 A2 B2 C2 A3 B3 C3
0 1 2 3 1 2 3 1 2 3
1 4 5 6 4 5 6 4 5 6
2 7 8 9 7 8 9 7 8 9
The output desired in the question is obtained by setting axis=0.
import pandas as pd
df2 = pd.concat([pd.read_excel(io="projects.xlsx", sheet_name=sheet) for sheet in ['JournalArticles','Proposals','Books']], axis=0)
df2
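If you also need to remember which sheet each row came from, pd.concat accepts a keys argument that adds the sheet name as an index level (a sketch, assuming the same workbook and sheet names as above):
sheets = ['JournalArticles', 'Proposals', 'Books']
df2 = pd.concat(
    [pd.read_excel('projects.xlsx', sheet_name=sheet) for sheet in sheets],
    keys=sheets,
    names=['sheet', None]
)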

Concatenate pandas dataframes with varying rows per index

I have two dataframes df1 and df2 with key as index.
dict_1={'key':[1,1,1,2,2,3], 'col1':['a1','b1','c1','d1','e1','f1']}
df1 = pd.DataFrame(dict_1).set_index('key')
dict_2={'key':[1,1,2], 'col2':['a2','b2','c2']}
df2 = pd.DataFrame(dict_2).set_index('key')
df1:
col1
key
1 a1
1 b1
1 c1
2 d1
2 e1
3 f1
df2
col2
key
1 a2
1 b2
2 c2
Note that there are unequal numbers of rows for each index. I want to concatenate these two dataframes such that I have the following dataframe (say df3).
df3
col1 col2
key
1 a1 a2
1 b1 b2
2 d1 c2
i.e. concatenate the two columns so that, for each index, the new dataframe has as many rows as whichever of df1 and df2 has fewer.
I tried
pd.concat([df1,df2],axis=1)
but I get the following error:
ValueError: Shape of passed values is (2,17), indices imply (2,7)
My question: How can I concatenate df1 and df2 to get df3? Should I use DataFrame.merge instead? If so, how?
Merge/join alone will get you a lot of (hard to get rid of) duplicates. But a little trick will help:
df1['count1'] = 1
df1['count1'] = df1['count1'].groupby(df1.index).cumsum()
df1
Out[198]:
col1 count1
key
1 a1 1
1 b1 2
1 c1 3
2 d1 1
2 e1 2
3 f1 1
The same thing for df2:
df2['count2'] = 1
df2['count2'] = df2['count2'].groupby(df2.index).cumsum()
And finally:
df_aligned = df1.reset_index().merge(df2.reset_index(), left_on = ['key','count1'], right_on = ['key', 'count2'])
df_aligned
Out[199]:
key col1 count1 col2 count2
0 1 a1 1 a2 1
1 1 b1 2 b2 2
2 2 d1 1 c2 1
Now you can restore the key index with set_index('key') and drop the no-longer-needed count1 and count2 columns.
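That cleanup step, written out:
df3 = (df_aligned
       .drop(columns=['count1', 'count2'])
       .set_index('key'))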
The biggest problem, and the reason you are not going to be able to line up the two in the way that you want, is that your keys are duplicative. How are you going to line up the a1 value in df1 with the a2 value in df2 when a1, a2, b1, b2, and c1 all share the same key?
Using merge is what you'll want if you can resolve the key issues:
df3 = df1.merge(df2, left_index=True, right_index=True, how='inner')
You can use inner, outer, left or right for how.
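One way to resolve those key issues, equivalent to the count1/count2 trick in the other answer but using cumcount, is to make each (key, occurrence) pair unique before joining (a sketch, not from either answer):
# number the repeats of each key and append that as a second index level
df1a = df1.set_index(df1.groupby(level=0).cumcount(), append=True)
df2a = df2.set_index(df2.groupby(level=0).cumcount(), append=True)
# inner join keeps only (key, occurrence) pairs present in both frames
df3 = df1a.join(df2a, how='inner').reset_index(level=1, drop=True)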
