I have a dynamic number of columns in my dataframe, and a single record can span more than one row. The first two columns are the key columns. Where the key columns match, I have to append each row of data onto a single row, creating as many columns as required.
The input dataframe is below (c1, c2, etc. each sit in their own column):
row 1: A 1 c1 c2 c3.. c20
row 2: A 1 c21....c25
row 3: A 1 c26.... c35
row 4: A 2 d1 d2... d21
row 5: A 2 d22....d27
I tried df.groupby(___first 2 column names____).first().reset_index(), but it returns only the first row of each group because of first(). Is there a function in pandas to do this?
Required output (dataframe):
row 1: A 1 c1 c2...c35 (each value in 1 column)
row 2: A 2 d1...d27 (each value in 1 column)
Use GroupBy.cumcount to build a counter Series, then DataFrame.set_index with DataFrame.unstack to reshape, DataFrame.sort_index to order the columns, and finally flatten the MultiIndex in a list comprehension:
print (df)
a b c d e f
row1: A 1 c1 c2 c3 c20
row2: A 1 c21 c22 c23 c24
row3: A 1 c26 c27 c28 c29
row4: A 2 d1 d2 d21 d22
row5: A 2 d22 d27 d28 d29
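For a runnable sketch, the printed frame can be rebuilt from the values shown above:

import pandas as pd

df = pd.DataFrame({'a': ['A'] * 5,
                   'b': [1, 1, 1, 2, 2],
                   'c': ['c1', 'c21', 'c26', 'd1', 'd22'],
                   'd': ['c2', 'c22', 'c27', 'd2', 'd27'],
                   'e': ['c3', 'c23', 'c28', 'd21', 'd28'],
                   'f': ['c20', 'c24', 'c29', 'd22', 'd29']})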
# counter within each (a, b) group: 0, 1, 2, ...
s = df.groupby(['a','b']).cumcount()
# move the counter into the index, reshape each group to one row,
# and order the columns by the counter
df1 = df.set_index(['a', 'b', s]).unstack().sort_index(level=1, axis=1)
# flatten the (name, counter) MultiIndex into names like c0, d0, e0
df1.columns = [f'{x}{y}' for x, y in df1.columns]
df1 = df1.reset_index()
print (df1)
a b c0 d0 e0 f0 c1 d1 e1 f1 c2 d2 e2 f2
0 A 1 c1 c2 c3 c20 c21 c22 c23 c24 c26 c27 c28 c29
1 A 2 d1 d2 d21 d22 d22 d27 d28 d29 NaN NaN NaN NaN
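For comparison, a plain-loop sketch of the same reshape; it flattens each group's data columns in row-major order and is slower on large frames, but easy to follow (the positional column labels after the keys are an assumption):

rows = []
for (a, b), g in df.groupby(['a', 'b'], sort=False):
    # flatten the non-key columns of the group row by row
    rows.append([a, b, *g.iloc[:, 2:].to_numpy().ravel()])

# pd.DataFrame pads ragged rows with NaN automatically
df1 = pd.DataFrame(rows).rename(columns={0: 'a', 1: 'b'})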
I have three data frames as follows:
df1
col1 CAND_SNP
1 a1
1 a2
1 a3
1 a4
2 b1
3 c1
3 c2
3 c3
df2
col1 LEAD_SNP
1 a1
2 b1
3 c1
df3
snp col2
a3 x1
a21 x2
a31 x3
a41 x4
b11 x5
c11 x6
c21 x7
c31 x8
I need to match CAND_SNP of df1 with snp of df3 to populate a new column in df2 with the values "Yes" or "No". The match needs to be groupwise over col1 of df1. In the example above there are 3 groups in col1 of df1; if any of a group's CAND_SNP values matches snp of df3, the new column of df2 should be "Yes" for that group, as below. Any help?
df2
col1 LEAD_SNP col3
1 a1 Yes
2 b1 No
3 c1 No
If I understand correctly, you can group df1 by col1 and look up whether each CAND_SNP value exists in the snp column of df3. Then merge with df2:
df1['col3'] = df1.groupby('col1')['CAND_SNP'].apply(lambda s: s.isin(df3['snp']))
df2 = df2.merge(df1.groupby('col1')['col3'].any(), left_on='col1', right_index=True, how='left')
And if you need 'Yes'/'No' as values, map the booleans and assign the result back:
df2['col3'] = df2['col3'].map({True: 'Yes', False: 'No'})
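Put together, a runnable sketch (frames copied from the question; a plain isin call stands in for the groupby/apply above, since isin already works elementwise):

import pandas as pd

df1 = pd.DataFrame({'col1': [1, 1, 1, 1, 2, 3, 3, 3],
                    'CAND_SNP': ['a1', 'a2', 'a3', 'a4', 'b1', 'c1', 'c2', 'c3']})
df2 = pd.DataFrame({'col1': [1, 2, 3], 'LEAD_SNP': ['a1', 'b1', 'c1']})
df3 = pd.DataFrame({'snp': ['a3', 'a21', 'a31', 'a41', 'b11', 'c11', 'c21', 'c31'],
                    'col2': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8']})

# flag each CAND_SNP found in df3, reduce to one flag per group, merge
df1['col3'] = df1['CAND_SNP'].isin(df3['snp'])
df2 = df2.merge(df1.groupby('col1')['col3'].any(),
                left_on='col1', right_index=True, how='left')
df2['col3'] = df2['col3'].map({True: 'Yes', False: 'No'})
print(df2)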
I have a use case where it naturally fits to compute each cell of a pd.DataFrame as a function of the corresponding index and column, i.e.
import pandas as pd
import numpy as np
data = np.empty((3, 3))
data[:] = np.nan
df = pd.DataFrame(data=data, index=[1, 2, 3], columns=['a', 'b', 'c'])
print(df)
> a b c
>1 NaN NaN NaN
>2 NaN NaN NaN
>3 NaN NaN NaN
and I'd like (this is only a mock example) to get a result where each cell is f(index, column):
> a b c
>1 a1 b1 c1
>2 a2 b2 c2
>3 a3 b3 c3
To accomplish this I need an alternative to apply or applymap where the lambda receives the coordinates, i.e. the index and column:
def my_cell_map(ix, col):
    return col + str(ix)
It is possible to use numpy here: add the index values to the column names with broadcasting, then pass the result to the DataFrame constructor:
a = df.columns.to_numpy() + df.index.astype(str).to_numpy()[:, None]
df = pd.DataFrame(a, index=df.index, columns=df.columns)
print (df)
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
EDIT: To process by column names, it is possible to use x.name together with the index values:
def f(x):
    return x.name + x.index.astype(str)
df = df.apply(f)
print (df)
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
EDIT1: For your function it is necessary to use another lambda that loops over the index values:
def my_cell_map(ix, col):
    return col + str(ix)

def f(x):
    return x.index.map(lambda y: my_cell_map(y, x.name))
df = df.apply(f)
print (df)
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
EDIT2: It is also possible to loop over the index and column values and set each cell with loc, but on a large DataFrame performance will be slow:
for c in df.columns:
    for i in df.index:
        df.loc[i, c] = my_cell_map(i, c)
print (df)
a b c
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
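A related variant of the broadcasting idea uses numpy.char.add, which concatenates string arrays elementwise (a sketch; it assumes the index and columns can be cast to str):

import numpy as np
import pandas as pd

df = pd.DataFrame(index=[1, 2, 3], columns=['a', 'b', 'c'])

# broadcasting pairs every index value with every column name
a = np.char.add(df.columns.to_numpy(dtype=str),
                df.index.to_numpy(dtype=str)[:, None])
df = pd.DataFrame(a, index=df.index, columns=df.columns)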
I have N dataframes:
df1:
time data
1.0 a1
2.0 b1
3.0 c1
df2:
time data
1.0 a2
2.0 b2
3.0 c2
df3:
time data
1.0 a3
2.0 b3
3.0 c3
I want to merge all of them on time, thus getting
time data1 data2 data3
1.0 a1 a2 a3
2.0 b1 b2 b3
3.0 c1 c2 c3
I can assure the time values are the same in all dataframes.
How can I do this in pandas?
One idea is to use concat on the list of DataFrames; it is only necessary to create an index from time for each DataFrame first. To avoid duplicated column names, the keys parameter is added, but it creates a MultiIndex in the output, so map with format is used to flatten it:
dfs = [df1, df2, df3]
dfs = [x.set_index('time') for x in dfs]
df = pd.concat(dfs, axis=1, keys=range(1, len(dfs) + 1))
df.columns = df.columns.map('{0[1]}{0[0]}'.format)
df = df.reset_index()
print (df)
time data1 data2 data3
0 1.0 a1 a2 a3
1 2.0 b1 b2 b3
2 3.0 c1 c2 c3
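An alternative sketch uses functools.reduce with merge, renaming the data column up front so the names do not collide (the data1/data2/... names mirror the output above):

from functools import reduce

dfs = [x.rename(columns={'data': f'data{i}'})
       for i, x in enumerate([df1, df2, df3], 1)]
df = reduce(lambda left, right: left.merge(right, on='time'), dfs)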
df1=
A B C D
a1 b1 c1 1
a2 b2 c2 2
a3 b3 c3 4
df2=
A B C D
a1 b1 c1 2
a2 b2 c2 1
I want to compare the values of column 'D' in both dataframes. If both dataframes had the same number of rows, I would just do this:
newDF = df1['D']-df2['D']
However, there are times when the numbers of rows differ. I want a result dataframe like this:
resultDF=
A B C D_df1 D_df2 Diff
a1 b1 c1 1 2 -1
a2 b2 c2 2 1 1
EDIT: only if a row's A, B, C values in df1 match a row in df2 should the two D values be compared; repeat for every row.
Use merge and DataFrame.eval:
df1.merge(df2, on=['A','B','C'], suffixes=['_df1','_df2']).eval('Diff=D_df1 - D_df2')
Out[314]:
A B C D_df1 D_df2 Diff
0 a1 b1 c1 1 2 -1
1 a2 b2 c2 2 1 1
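If some (A, B, C) combinations exist in only one frame, an outer merge keeps them and Diff becomes NaN where either D value is missing (a sketch; the sample data has no such rows):

res = (df1.merge(df2, on=['A', 'B', 'C'], how='outer',
                 suffixes=['_df1', '_df2'])
          .eval('Diff = D_df1 - D_df2'))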
Suppose I have a main dataframe
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 4
1 B1 B2 B3 5
2 C1 C2 C3 6
I also have 3 dataframes
df_1
Cri1 Cri2 Cri3 value
0 A1 A2 A3 1
1 B1 B2 B3 2
df_2
Cri1 Cri2 Cri3 value
0 A1 A2 A3 9
1 C1 C2 C3 10
df_3
Cri1 Cri2 Cri3 value
0 B1 B2 B3 15
1 C1 C2 C3 17
What I want is to add the value from each df to total in main_df, matching rows on the Cri columns,
i.e. main_df will become
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
Of course I can do it using a for loop, but in the end I want to apply the method to a large amount of data, say 50,000 rows in each dataframe.
Is there another way to solve it?
Thank you!
First you should align your numeric column names. In this case:
main_df = main_df.rename(columns={'total': 'value'})
Then you have a couple of options.
concat + groupby
You can concatenate and then perform a groupby with sum:
res = pd.concat([main_df, df_1, df_2, df_3])\
.groupby(['Cri1', 'Cri2', 'Cri3']).sum()\
.reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
set_index + reduce / add
Alternatively, you can create a list of dataframes indexed by your criteria columns. Then use functools.reduce with pd.DataFrame.add to sum these dataframes.
from functools import reduce
dfs = [df.set_index(['Cri1', 'Cri2', 'Cri3']) for df in [main_df, df_1, df_2, df_3]]
res = reduce(lambda x, y: x.add(y, fill_value=0), dfs).reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14.0
1 B1 B2 B3 22.0
2 C1 C2 C3 33.0
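Since fill_value=0 upcasts value to float, the integer dtype can be restored afterwards if no NaN remains, e.g.:

res['value'] = res['value'].astype(int)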