Flatten a pandas dataframe containing lists - python

I have a pandas DataFrame that contains lists in some entries:
pandas = pd.DataFrame([[1,[2,3],[4,5]],[9,[2,3],[4,5]]], columns=['A','B','C'])
I would like to know how one can flatten this dataframe to
pandas_flat = pd.DataFrame([[1,2,3,4,5],[9,2,3,4,5]], columns=['A','B_0','B_1','C_0','C_1'])
where the column names are adapted accordingly.
The next level is to flatten a dataframe whose lists vary in size within one column. How do I flatten them, padding with a fill_value, as follows?
pandas_1 = pd.DataFrame([[1,[2,3,3],[4,5]],[9,[2,3],[4,5]]],columns = ['A','B','C'])
-->
fill_value = 0
pandas_flat_1 = pd.DataFrame([[1,2,3,3,4,5],[9,2,3,0,4,5]], columns=['A','B_0','B_1','B_2','C_0','C_1'])
--------------- Solution
For the first dataframe (here with the first column renamed to 'Samples')
pandas = pd.DataFrame([[1,[2,3],[4,5]],[9,[2,3],[4,5]]], columns=['Samples','B','C'])
we have
from collections import Counter

df = pandas.T.apply(lambda x: x.explode())
groups = df.groupby(level=0)
ids = groups.cumcount()
#df.index = (df.index + '_' + ids.astype(str)).mask(groups.size() < 2, df.index.to_series())
new_df = df.T

# Rename duplicated column names by numbering them (B -> B1, B2, ...)
m = list(new_df.columns)
d = {a: list(range(1, b + 1)) if b > 1 else '' for a, b in Counter(m).items()}
new_df.columns = [i + str(d[i].pop(0)) if len(d[i]) else i for i in m]
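For the sample frame above, print(new_df) should show something like the following (note that this variant numbers duplicate names from 1 and omits the underscore):
  Samples B1 B2 C1 C2
0       1  2  3  4  5
1       9  2  3  4  5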

In the first case you could use Series.explode with DataFrame.apply and DataFrame.transpose; to rename the columns, use GroupBy.cumcount:
df = pandas.T.apply(lambda x: x.explode())
ids = df.groupby(level=0).cumcount()
df.index = df.index + '_' + ids.astype(str)
new_df = df.T
print(new_df)
  A_0 B_0 B_1 C_0 C_1
0   1   2   3   4   5
1   9   2   3   4   5
EDIT:
df = pandas.T.apply(lambda x: x.explode())
groups = df.groupby(level=0)
ids = groups.cumcount()
# keep the original name for columns that yielded only a single value
df.index = (df.index + '_' + ids.astype(str)).mask(groups.size() < 2, df.index.to_series())
new_df = df.T
print(new_df)
   A B_0 B_1 C_0 C_1
0  1   2   3   4   5
1  9   2   3   4   5
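The varying-length case from the question is not covered above. A minimal sketch for it (assuming a scalar fill_value and that every cell in a list column is a list) pads each list to the longest length in its column before expanding:
import pandas as pd

pandas_1 = pd.DataFrame([[1, [2, 3, 3], [4, 5]], [9, [2, 3], [4, 5]]],
                        columns=['A', 'B', 'C'])
fill_value = 0

parts = []
for col in pandas_1.columns:
    s = pandas_1[col]
    if s.apply(lambda v: isinstance(v, list)).any():
        # pad every list in this column to the column's longest length
        width = s.apply(len).max()
        padded = s.apply(lambda v: v + [fill_value] * (width - len(v)))
        # one column per list position, named B_0, B_1, ...
        expanded = pd.DataFrame(padded.tolist(), index=s.index)
        expanded.columns = [f'{col}_{i}' for i in expanded.columns]
        parts.append(expanded)
    else:
        parts.append(s)

pandas_flat_1 = pd.concat(parts, axis=1)
print(pandas_flat_1)
#    A  B_0  B_1  B_2  C_0  C_1
# 0  1    2    3    3    4    5
# 1  9    2    3    0    4    5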

Related

How to concat columns of similar dataframes with new names

I have two dataframes with similar columns:
df1 = (a, b, c, d)
df2 = (a, b, c, d)
I want to concat or merge some of their columns, like below, into df3:
df3 = (a_1, a_2, b_1, b_2)
How can I put them beside each other as they are (without any change), and how can I merge them on a shared key like d? I tried to add them to a list and concat them, but I don't know how to give them new names. I don't want multi-level column names.
for ii, tdf in enumerate(mydfs):
    tdf = tdf.sort_values(by="fid", ascending=False)
    for _col in ["fid", "pred_text1"]:
        new_col = _col + str(ii)
        dfs.append(tdf[_col])
df = pd.concat(dfs, axis=1)
Without a look at your actual dataframes it is hard to be exact, so I am generating sample dataframes to show how the code works:
import pandas as pd
import re

df1 = pd.DataFrame({"a": [1, 2, 4], "b": [2, 4, 5], "c": [5, 6, 7], "d": [1, 2, 3]})
df2 = pd.DataFrame({"a": [6, 7, 5], "b": [3, 4, 8], "c": [6, 3, 9], "d": [1, 2, 3]})

# merge on the shared key, then rename pandas' _x/_y suffixes to _1/_2
mergedDf = (df1.merge(df2, how="left", on="d")
               .rename(columns=lambda x: re.sub(r"(.+)_x", r"\1_1", x))
               .rename(columns=lambda x: re.sub(r"(.+)_y", r"\1_2", x)))
mergedDf
which results in:
   a_1  b_1  c_1  d  a_2  b_2  c_2
0    1    2    5  1    6    3    6
1    2    4    6  2    7    4    3
2    4    5    7  3    5    8    9
If you are interested in dropping other columns you can use code below:
mergedDf.iloc[:, ~mergedDf.columns.str.startswith("c")]
which results in:
   a_1  b_1  d  a_2  b_2
0    1    2  1    6    3
1    2    4  2    7    4
2    4    5  3    5    8
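For the first half of the question (putting the frames beside each other without merging on a key), a minimal sketch using pd.concat with add_suffix, assuming both frames share the same row order:
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 4], "b": [2, 4, 5], "c": [5, 6, 7], "d": [1, 2, 3]})
df2 = pd.DataFrame({"a": [6, 7, 5], "b": [3, 4, 8], "c": [6, 3, 9], "d": [1, 2, 3]})

# suffix every column name, then align the frames positionally
df3 = pd.concat([df1.add_suffix("_1"), df2.add_suffix("_2")], axis=1)
print(df3.columns.tolist())
# ['a_1', 'b_1', 'c_1', 'd_1', 'a_2', 'b_2', 'c_2', 'd_2']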

Subtracting multiple columns between dataframes based on key

I have two dataframes, example:
Df1 -
A B C D
x j 5 2
y k 7 3
z l 9 4
Df2 -
A B C D
z o 1 1
x p 2 1
y q 3 1
I want to subtract columns C and D of Df2 from columns C and D of Df1, matching rows on the key in column A.
I also want to ensure that column B remains untouched, for example:
Df3 -
A B C D
x j 3 1
y k 4 2
z l 8 3
I found an almost perfect answer in the following thread:
Subtracting columns based on key column in pandas dataframe
However, the answer does not explain what to do when the primary df has other columns (such as column B) that should not take part in the index or the operation.
Is somebody able to advise?
I was originally using a loop that finds the value in the other df and subtracts it, but this takes too long with the size of data I am working with.
The idea is to specify column(s) to match on and column(s) to subtract, move all remaining column names into a MultiIndex, and then subtract:
match = ['A']
cols = ['C', 'D']
df1 = Df1.set_index(match + Df1.columns.difference(match + cols).tolist())
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index()
print(df)
   A  B  C  D
0  x  j  3  1
1  y  k  4  2
2  z  l  8  3
Or, to fall back to the original Df1 values for anything not matched:
match = ['A']
cols = ['C', 'D']
df1 = Df1.set_index(match)
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index().fillna(Df1)
print(df)
   A  B  C  D
0  x  j  3  1
1  y  k  4  2
2  z  l  8  3
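An equivalent sketch that avoids the temporary MultiIndex (assuming the keys in column A are unique in Df2) reorders Df2's rows to Df1's keys and subtracts the plain arrays:
import pandas as pd

Df1 = pd.DataFrame({'A': ['x', 'y', 'z'], 'B': ['j', 'k', 'l'],
                    'C': [5, 7, 9], 'D': [2, 3, 4]})
Df2 = pd.DataFrame({'A': ['z', 'x', 'y'], 'B': ['o', 'p', 'q'],
                    'C': [1, 2, 3], 'D': [1, 1, 1]})

cols = ['C', 'D']
# align Df2's rows to Df1's key order, then subtract positionally
sub = Df2.set_index('A').loc[Df1['A'], cols].to_numpy()
Df3 = Df1.copy()
Df3[cols] = Df1[cols].to_numpy() - sub
print(Df3)
#    A  B  C  D
# 0  x  j  3  1
# 1  y  k  4  2
# 2  z  l  8  3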

Python - lookup value in dataframe and return random corresponding value

df1 has a lot of NaN values.
I have compiled df2 with all unique values for code and name.
I need to replace the NaN code values in df1 with a random code value from df2 where df1 and df2 match on name.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(columns=['ID', 'name', 'code'])
df1.ID = [1, 2, 3, 4]
df1.name = ['A', 'A', 'B', 'B']
df1.code = [np.nan, np.nan, np.nan, np.nan]

df2 = pd.DataFrame(columns=['name', 'code'])
df2.name = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
df2.code = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
You could use random.sample and pass 2 after joining the values for each group into a list. Then merge back into the initial dataframe, explode the list, and drop_duplicates():
import random
import numpy as np

df2 = df2.groupby('name')['code'].apply(lambda x: random.sample(list(x), 2)).reset_index()
df3 = df1[['ID', 'name']].merge(df2).explode('code').drop_duplicates(['name', 'code']).reset_index(drop=True)
df3['ID'] = np.flatnonzero(df3['ID']) + 1
Out[1]:
  ID name code
0  1    A    d
1  2    A    a
2  3    B    h
3  4    B    f
You could create a dictionary where the keys are names and the values are the possible codes, then for each name in df1 sample from the corresponding value:
import random

lookup = df2.groupby('name')['code'].apply(list).to_dict()
df1['code'] = df1['code'].fillna(pd.Series([random.choice(lookup[name]) for name in df1['name']],
                                           index=df1.index))
print(df1)
Output
  ID name code
0  1    A    b
1  2    A    b
2  3    B    g
3  4    B    g
If sampling without replacement is needed, you could do:
lst = [s for k, g in df1.groupby('name', as_index=False) for s in random.sample(lookup[k], len(g))]
df1['code'] = df1['code'].fillna(pd.Series(lst, index=df1.index))
print(df1)
Output
  ID name code
0  1    A    d
1  2    A    a
2  3    B    e
3  4    B    h
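Both approaches draw from Python's random module, so the output changes on every run. A small follow-up sketch, assuming every code in df1 starts out as NaN (as in the sample) and that repeatable draws are wanted:
import random

random.seed(0)  # fix the seed so the random choices are repeatable

lookup = df2.groupby('name')['code'].apply(list).to_dict()
# with nothing to preserve, the fillna step collapses to a plain map
df1['code'] = df1['name'].map(lambda n: random.choice(lookup[n]))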

Writing function to filter and rename multiple dataframe columns based on variable input

For a given dataframe df, imported from a CSV file and containing redundant columns, I would like to write a function that performs recursive filtering and subsequent renaming of df.columns, based on the number of arguments given.
Ideally the function should perform as follows.
When the input is (df, 'string1a', 'string1b', 'new_col_name1'), then:
filter1 = [col for col in df.columns if 'string1a' in col and 'string1b' in col]
df_out = df[filter1]
df_out.columns = ['new_col_name1']
return df_out
Whereas, when the input is (df, 'string1a', 'string1b', 'new_col_name1', 'string2a', 'string2b', 'new_col_name2', 'string3a', 'string3b', 'new_col_name3'), the function should return:
filter1 = [col for col in df.columns if 'string1a' in col and 'string1b' in col]
filter2 = [col for col in df.columns if 'string2a' in col and 'string2b' in col]
filter3 = [col for col in df.columns if 'string3a' in col and 'string3b' in col]
df_out = df[filter1 + filter2 + filter3]
df_out.columns = ['new_col_name1', 'new_col_name2', 'new_col_name3']
return df_out
I think you can use a dictionary to define the values and then apply np.logical_and.reduce, because you need to check multiple values per new name:
import numpy as np
import pandas as pd

df = pd.DataFrame({'aavfb': list('abcdef'),
                   'cedf': [4, 5, 4, 5, 5, 4],
                   'd': [7, 8, 9, 4, 2, 3],
                   'c': [1, 3, 5, 7, 1, 0],
                   'abds': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')})
print(df)
   F aavfb  abds  c  cedf  d
0  a     a     5  1     4  7
1  a     b     3  3     5  8
2  a     c     6  5     4  9
3  b     d     9  7     5  4
4  b     e     2  1     5  2
5  b     f     4  0     4  3
def rename1(df, d):
    # loop over the dict
    for k, v in d.items():
        # mask of columns whose names contain all values in the list
        m = np.logical_and.reduce([df.columns.str.contains(x) for x in v])
        # set new column names by the mask
        df.columns = np.where(m, k, df.columns)
    # keep only the columns renamed to keys of the dict
    return df.loc[:, df.columns.isin(d.keys())]
d = {'new_col_name1': ['a', 'b'],
     'new_col_name2': ['c', 'd']}
print(rename1(df, d))
  new_col_name1 new_col_name1 new_col_name2
0             a             5             4
1             b             3             5
2             c             6             4
3             d             9             5
4             e             2             5
5             f             4             4
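If you want the exact calling convention from the question, a sketch of a thin variadic wrapper over the same idea (assuming the arguments always come in triples of string_a, string_b, new_name, and that each filter may match several columns, in which case the new name is repeated):
def filter_rename(df, *args):
    # group the flat argument list into (substr_a, substr_b, new_name) triples
    triples = [args[i:i + 3] for i in range(0, len(args), 3)]
    filters, names = [], []
    for a, b, new_name in triples:
        matched = [col for col in df.columns if a in col and b in col]
        filters.extend(matched)
        names.extend([new_name] * len(matched))
    out = df[filters].copy()
    out.columns = names
    return out

# e.g. filter_rename(df, 'a', 'b', 'new_col_name1', 'c', 'd', 'new_col_name2')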

Pandas: set the value of a column in a row to be the value stored in a different df at the index of its other rows

>>> df
   0  1
0  0  0
1  1  1
2  2  1
>>> df1
   0  1  2
0  A  B  C
1  D  E  F
>>> crazy_magic()
>>> df
   0  1  3
0  0  0  A  # df1[0][0]
1  1  1  E  # df1[1][1]
2  2  1  F  # df1[2][1]
Is there a way to achieve this without a for loop?
import pandas as pd
df = pd.DataFrame([[0,0],[1,1],[2,1]])
df1 = pd.DataFrame([['A', 'B', 'C'],['D', 'E', 'F']])
df2 = df1.reset_index(drop=False)
# index 0 1 2
# 0 0 A B C
# 1 1 D E F
df3 = pd.melt(df2, id_vars=['index'])
# index variable value
# 0 0 0 A
# 1 1 0 D
# 2 0 1 B
# 3 1 1 E
# 4 0 2 C
# 5 1 2 F
result = pd.merge(df, df3, left_on=[0,1], right_on=['variable', 'index'])
result = result[[0, 1, 'value']]
print(result)
yields
   0  1 value
0  0  0     A
1  1  1     E
2  2  1     F
My reasoning goes as follows:
We want to use two columns of df as coordinates.
The word "coordinates" reminds me of pivot: if you have two columns whose values represent coordinates and a third column representing values, and you want to convert that to a grid, then pivot is the tool to use.
But df does not have a third column of values. The values are in df1. In fact df1 looks like the result of a pivot operation. So instead of pivoting df, we want to unpivot df1.
pd.melt is the function to use when you want to unpivot.
So I tried melting df1. Comparison with other uses of pd.melt led me to conclude df1 needed the index as a column. That's the reason for defining df2. So we melt df2.
Once you get that far, visually comparing df3 to df leads you naturally to the use of pd.merge.
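A shorter alternative sketch, assuming (as in the sample) that the labels in df are valid positional indices into df1, uses NumPy fancy indexing instead of melt + merge:
import pandas as pd

df = pd.DataFrame([[0, 0], [1, 1], [2, 1]])
df1 = pd.DataFrame([['A', 'B', 'C'], ['D', 'E', 'F']])

# df[0] selects the column of df1, df[1] selects the row
df[3] = df1.to_numpy()[df[1], df[0]]
print(df)
#    0  1  3
# 0  0  0  A
# 1  1  1  E
# 2  2  1  F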
