Combining Pandas Dataframes - python

I am pretty new to Pandas, so please bear with me. I have a df like this one:
DF1
column1 column2(ids)
a [1,2,13,4,9]
b [20,14,10,18,17]
c [6,8,12,16,19]
d [11,3,15,7,5]
Each number in each list corresponds to an id in a second dataframe.
DF2
id value_to_change
1 x1
2 x2
3 x3
4 x4
5 x5
6 x6
7 x7
8 x8
9 x9
. .
. .
. .
20 x20
STEP 1
I want to iterate over each list and select the rows in DF2 with the matching ids, AND create 4 dataframes, since I have 4 rows in DF1.
How to do this?
So for instance, for the first row, after applying the logic I would get this back:
id value_to_change
1 x1
2 x2
13 x13
4 x4
9 x9
The second row would give me:
id value_to_change
20 x20
14 x14
10 x10
18 x18
17 x17
And so on...
STEP 2
Once I have these 4 dataframes, I pass them as arguments to a logic which returns me 4 dataframes.
2) How could I combine them into a sorted final one?
DF3
id new_value
1 y1
2 y2
3 y3
4 y4
5 y5
6 y6
7 y7
8 y8
9 y9
. .
. .
. .
20 y20
How could I go about this?

It would be much easier and more efficient to use a single dataframe, like so:
Initialization
df1 = pd.DataFrame({'label': ['A', 'B', 'C', 'D'],
                    'ids': [[1,2,13,4,9], [20,14,10,18,17], [6,8,12,16,19], [11,3,15,7,5]]})
# df2 with the id -> value mapping (reconstructed from the question)
df2 = pd.DataFrame({'ids': list(range(1, 21)), 'val': [f'x{i}' for i in range(1, 21)]})

# Some custom function for dataframe operations
def my_func(x):
    x = x.copy()  # work on a copy so the original group is not mutated
    x['value_to_change'] = x.value_to_change.str.replace('x', 'y')
    return x
Dataframe Operations
df1 = df1.explode('ids')
df1['value_to_change'] = df1['ids'].map(dict(zip(df2.ids, df2.val)))
# group_keys=False keeps the apply result in the same row order as df1
df1['new_value'] = df1.groupby('label', group_keys=False).apply(my_func)['value_to_change'].values
Output
label ids value_to_change new_value
0 A 1 x1 y1
0 A 2 x2 y2
0 A 13 x13 y13
0 A 4 x4 y4
0 A 9 x9 y9
1 B 20 x20 y20
1 B 14 x14 y14
1 B 10 x10 y10
1 B 18 x18 y18
1 B 17 x17 y17
2 C 6 x6 y6
2 C 8 x8 y8
2 C 12 x12 y12
2 C 16 x16 y16
2 C 19 x19 y19
3 D 11 x11 y11
3 D 3 x3 y3
3 D 15 x15 y15
3 D 7 x7 y7
3 D 5 x5 y5

This code will help with the first part of the problem.
import pandas as pd
df1 = pd.DataFrame([[[1,2,4,5]], [[3,4,1]]], columns=["column2(ids)"])
df2 = pd.DataFrame([[1,"x1"],[2,"x2"],[3,"x3"],[4,"x4"],[5,"x5"]],
                   columns=["id", "value_to_change"])
rows = []
for row in df1.iterrows():
    s = row[1].iloc[0]  # the list of ids in this row
    for item in s:
        val = df2.loc[df2['id'] == item, 'value_to_change'].item()
        rows.append([item, val])
# DataFrame.append was removed in pandas 2.0, so collect the rows and build df3 once
df3 = pd.DataFrame(rows, columns=["id", "value_to_change"])
df3
Note that in the line s = row[1].iloc[0] you need to choose the position according to your dataframe; in my case the list was in the first column.
- For the second part you can use pd.concat: Documentation
- For sorting, df.sort_values: Documentation
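As a minimal sketch of those two steps (the two frames below are hypothetical stand-ins for the ones returned by your logic):
```python
import pandas as pd

# Hypothetical result frames standing in for the ones produced by your logic
dfs = [
    pd.DataFrame({"id": [13, 1], "new_value": ["y13", "y1"]}),
    pd.DataFrame({"id": [20, 10], "new_value": ["y20", "y10"]}),
]

# Combine with pd.concat, then sort by id to get the final DF3
df3 = pd.concat(dfs, ignore_index=True).sort_values("id").reset_index(drop=True)
print(df3)
```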

Use .loc and .isin to get a new DataFrame with the required rows from df2
Do your logic on these 4 dataframes
Combine the resulting 4 dataframes using pandas.concat()
Sort the combined dataframe by ids using .sort_values()
Code:
import pandas as pd
df1 = pd.DataFrame({'column1 ': ['A', 'B', 'C', 'D'], 'ids': [[1,2,13,4,9], [20,14,10,18,17], [6,8,12,16,19],[11,3,15,7,5]]})
df2 = pd.DataFrame({'ids': list(range(1,21)), 'val': [f'x{x}' for x in range(1,21)]})
df_list=[]
for id_list in df1['ids'].values:
    df_list.append(df2.loc[df2['ids'].isin(id_list)])
# do logic on each DF in df_list
# assuming df_list now contains the resulting dataframes
df3 = pd.concat(df_list)
df3 = df3.sort_values('ids')

First things first, this code should do what you want.
import pandas as pd
idxs = [
    [0, 2],
    [1, 3],
]
df_idxs = pd.DataFrame({'idxs': idxs})
df = pd.DataFrame(
    {'data': ['a', 'b', 'c', 'd']}
)
frames = []
for _, idx in df_idxs.iterrows():
    rows = idx['idxs']
    frame = df.loc[rows]
    # some logic
    print(frame)
    # collect
    frames.append(frame)
pd.concat(frames)
Note that pandas automatically creates a range index if none is passed. If you want to select on a different column, set that one as the index, or use
df.loc[df.data.isin(rows)]
The pandas doc on split-apply-combine may also interest you: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
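For comparison, the same select/process/collect pattern can be written as split-apply-combine (a toy sketch, not part of the answer above):
```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b", "b"], "data": [1, 2, 3, 4]})

# "Split" on group, "apply" per-group logic, "combine" back into one column
result = df.groupby("group")["data"].transform(lambda s: s - s.min())
print(result.tolist())  # [0, 1, 0, 1]
```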

Related

Split rows to create new rows in Pandas Dataframe with same other row values

I have a pandas dataframe in which one column of text strings contains multiple comma-separated values. I want to split each field and create a new row per entry only where the number of commas is >= 2. For example, a should become b:
In [7]: a
Out[7]:
var1 var2 var3
0 a,b,c,d 1 X1
1 a,b,c,d 1 X2
2 a,b,c,d 1 X3
3 a,b,c,d 1
4 e,f,g 2 Y1
5 e,f,g 2 Y2
6 e,f,g 2
7 h,i 3 Z1
In [8]: b
Out[8]:
var1 var2 var3
0 a,d 1 X1
1 b,d 1 X2
3 c,d 1 X3
4 e,g 2 Y1
5 f,g 2 Y2
6 h,i 3 Z1
You could use a custom function:
def custom_split(r):
    if pd.notna(r['var3']):  # rows with an empty var3 return None and are dropped below
        s = r['var1']
        i = int(r['var3'][1:]) - 1  # 'X1' -> index 0, 'X2' -> index 1, ...
        l = s.split(',')
        return l[i] + ',' + l[-1]

df['var1'] = df.apply(custom_split, axis=1)
df = df.dropna()
output:
var1 var2 var3
0 a,d 1 X1
1 b,d 1 X2
2 c,d 1 X3
4 e,g 2 Y1
5 f,g 2 Y2
7 h,i 3 Z1
Alternatively, using a per-group cumulative count instead of parsing var3:
df['cc'] = df.groupby('var1')['var1'].cumcount()
df['var1'] = df['var1'].str.split(',')
df['var1'] = df[['cc','var1']].apply(lambda x: x['var1'][x['cc']] + ',' + x['var1'][-1], axis=1)
df = df.dropna().drop(columns=['cc']).reset_index(drop=True)
df
You can do so by splitting var1 on the comma into lists. The integer in var3 minus 1 can be interpreted as the index of which item in the list in var1 to keep:
import pandas as pd
import io
data = ''' var1 var2 var3
0 a,b,c,d 1 X1
1 a,b,c,d 1 X2
2 a,b,c,d 1 X3
3 a,b,c,d 1
4 e,f,g 2 Y1
5 e,f,g 2 Y2
6 e,f,g 2
7 h,i 3 Z1'''
df = pd.read_csv(io.StringIO(data), sep = r'\s\s+', engine='python')
df['var1'] = df["var1"].str.split(',').apply(lambda x: [[i,x[-1]] for i in x[:-1]]) #split the string to list and create combinations of all items with the last item in the list
df = df[df['var3'].notnull()] # drop rows where var3 is None
df['var1'] = df.apply(lambda x: x['var1'][0 if not x['var3'] else int(x['var3'][1:])-1], axis=1) #keep only the element in the list in var1 where the index is the integer in var3 minus 1
Output:
   var1        var2  var3
0  ['a', 'd']     1    X1
1  ['b', 'd']     1    X2
2  ['c', 'd']     1    X3
4  ['e', 'g']     2    Y1
5  ['f', 'g']     2    Y2
7  ['h', 'i']     3    Z1
Run df['var1'] = df['var1'].str.join(',') to reconvert var1 to a string.

Pandas Group by Multiple Columns and Levels of Values and Append Results to the Original Data Frame

I have a data frame and would like to group it by a few columns and different levels of values. Also, I want to append the group by results to the original data frame.
This is the original data frame:
AAA BBB CCC
x1 y1 yes
x1 y1 yes
x1 y1 no
x1 y2 no
x2 y2 yes
x2 y2 no
This is what I want:
AAA BBB CCC Yes No
x1 y1 yes 2 1
x1 y1 yes 2 1
x1 y1 no 2 1
x1 y2 no 0 1
x2 y2 yes 1 1
x2 y2 no 1 1
The idea here is that I want to group by AAA and BBB and count yes/no in CCC for each group. Then, I want to add the count values into 2 new columns, Yes and No.
Thanks in advance!
One way is to:
group by AAA and BBB
get the value_counts() of CCC for each group
unstack the innermost value-count index (which consists of yes and no) into the columns
merge the counts with the original DataFrame
counts = (df.groupby(['AAA', 'BBB'])['CCC']
            .value_counts()
            .unstack()
            .fillna(0)
            .astype(int))
counts.columns = counts.columns.str.title()
pd.merge(df, counts, left_on=['AAA', 'BBB'], right_index=True)
AAA BBB CCC No Yes
0 x1 y1 yes 1 2
1 x1 y1 yes 1 2
2 x1 y1 no 1 2
3 x1 y2 no 1 0
4 x2 y2 yes 1 1
5 x2 y2 no 1 1
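For what it's worth, pd.crosstab can build the same counts table in one call (a sketch on the same data, not from the answer above):
```python
import pandas as pd

df = pd.DataFrame({
    "AAA": ["x1", "x1", "x1", "x1", "x2", "x2"],
    "BBB": ["y1", "y1", "y1", "y2", "y2", "y2"],
    "CCC": ["yes", "yes", "no", "no", "yes", "no"],
})

# Cross-tabulate (AAA, BBB) pairs against CCC values, then merge back per row
counts = pd.crosstab([df["AAA"], df["BBB"]], df["CCC"])
counts.columns = counts.columns.str.title()  # 'no'/'yes' -> 'No'/'Yes'
out = df.merge(counts, left_on=["AAA", "BBB"], right_index=True)
print(out)
```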

pandas for each group calculate ratio of two categories, and append as a new column to dataframe using .pipe()

I have a pandas dataframe like the following:
import pandas as pd
pd.DataFrame({"AAA":["x1","x1","x1","x2","x2","x2"],
"BBB":["y1","y1","y2","y2","y2","y1"],
"CCC":["t1","t2","t3","t1","t1","t1"],
"DDD":[10,11,18,17,21,30]})
Out[1]:
AAA BBB CCC DDD
0 x1 y1 t1 10
1 x1 y1 t2 11
2 x1 y2 t3 18
3 x2 y2 t1 17
4 x2 y2 t1 21
5 x2 y1 t1 30
The problem
What I want is to group on column AAA so I have 2 groups - x1, x2.
I want then calculate the ratio of y1 to y2 in column BBB for each group.
And assign this output to a new column Ratio of BBB
The desired output
So I want this as my output.
pd.DataFrame({"AAA":["x1","x1","x1","x2","x2","x2"],
"BBB":["y1","y1","y2","y2","y2","y1"],
"CCC":["t1","t2","t3","t1","t1","t1"],
"DDD":[10,11,18,17,21,30],
"Ratio of BBB":[0.33,0.33,0.33,0.66,0.66,0.66]})
Out[2]:
AAA BBB CCC DDD Ratio of BBB
0 x1 y1 t1 10 0.33
1 x1 y1 t2 11 0.33
2 x1 y2 t3 18 0.33
3 x2 y2 t1 17 0.66
4 x2 y2 t1 21 0.66
5 x2 y1 t1 30 0.66
Current status
I have currently achieved it like so:
def f(df):
    df["y1"] = sum(df["BBB"] == "y1")
    df["y2"] = sum(df["BBB"] == "y2")
    df["Ratio of BBB"] = df["y2"] / df["y1"]
    return df
df.groupby(df.AAA).apply(f)
What I want to achieve
Is there any way to achieve this with the .pipe() function?
I was thinking something like this:
df = (df
.groupby(df.AAA) # groupby a column not included in the current series (df.colname)
.BBB
.value_counts()
.pipe(lambda series: series["BBB"] == "y2" / series["BBB"] == "y1")
)
Edit: one solution using pipe()
N.B.: user jpp made a clear comment below that the unstack / merge / reset_index operations are unnecessary and expensive. However, since I initially intended to use this method, I thought I would share it here!
df = (df
.groupby(df.AAA) # groupby the column
.BBB # select the column with values to calculate ('BBB' with y1 & y2)
.value_counts() # calculate the values (# of y1 per group, # of y2 per group)
.unstack() # turn the rows into columns (y1, y2)
.pipe(lambda df: df["y1"]/df["y2"]) # calculate the ratio of y1:y2 (outputs a Series)
.rename("ratio") # rename the series 'ratio' so it will be ratio column in output df
.reset_index() # turn the groupby series into a dataframe
.merge(df) # merge with the original dataframe filling in the columns with the key (AAA)
)
Looks like you want the ratio of y2 to the total instead. Use groupby + value_counts:
v = df.groupby('AAA').BBB.value_counts().unstack()
df['RATIO'] = df.AAA.map(v.y2 / (v.y2 + v.y1))
AAA BBB CCC DDD RATIO
0 x1 y1 t1 10 0.333333
1 x1 y1 t2 11 0.333333
2 x1 y2 t3 18 0.333333
3 x2 y2 t1 17 0.666667
4 x2 y2 t1 21 0.666667
5 x2 y1 t1 30 0.666667
To generalise for many groups, you may use
df['RATIO'] = df.AAA.map(v.y2 / v.sum(axis=1))
Using groupby + transform with a custom function:
def ratio(x):
    counts = x.value_counts()
    return counts['y2'] / counts.sum()
df['Ratio of BBB'] = df.groupby('AAA')['BBB'].transform(ratio)
print(df)
AAA BBB CCC DDD Ratio of BBB
0 x1 y1 t1 10 0.333333
1 x1 y1 t2 11 0.333333
2 x1 y2 t3 18 0.333333
3 x2 y2 t1 17 0.666667
4 x2 y2 t1 21 0.666667
5 x2 y1 t1 30 0.666667

Aggregating cells/column in pandas dataframe

I have a dataframe that is like this
Index Z1 Z2 Z3 Z4
0 A(Z1W1) A(Z2W1) A(Z3W1) B(Z4W2)
1 A(Z1W3) B(Z2W1) A(Z3W2) B(Z4W3)
2 B(Z1W1) A(Z3W4) B(Z4W4)
3 B(Z1W2)
I want to convert it to
Index Z1 Z2 Z3 Z4
0 A(Z1W1,Z1W3) A(Z2W1) A(Z3W1,Z3W2,Z3W4) B(Z4W2,Z4W3,Z4W4)
1 B(Z1W1,Z1W2) B(Z2W1)
Basically I want to aggregate the values of different cell to one cell as shown above
Edit 1
Actual column names are either two words or 3 words names and not A B
For example Nut Butter instead of A
Things are getting interesting :-)
s=df.stack().replace({'[(|)]':' '},regex=True).str.strip().str.split(' ',expand=True)
v=('('+s.groupby([s.index.get_level_values(1),s[0]])[1].apply(','.join)+')').unstack().apply(lambda x : x.name+x.astype(str)).T
v[~v.apply(lambda x : x.str.contains('None'))].apply(lambda x : sorted(x,key=pd.isnull)).reset_index(drop=True)
Out[1865]:
Z1 Z2 Z3 Z4
0 A(Z1W1,Z1W3) A(Z2W1) A(Z3W1,Z3W2,Z3W4) B(Z4W2,Z4W3,Z4W4)
1 B(Z1W1,Z1W2) B(Z2W1) NaN NaN
Update
Change
#s=df.stack().replace({'[(|)]':' '},regex=True).str.strip().str.split(' ',expand=True)
to
s=df.stack().str.split('(',expand=True)
s[1]=s[1].replace({'[(|)]':' '},regex=True).str.strip()
General idea:
split string values
regroup and join strings
apply to all columns
Update 1
# I had to add the parameter as_index=False to groupby(0)
# to get exactly the same output as asked
Let's try one column:
def str_regroup(s):
    return s.str.extract(r"(\w)\((.+)\)", expand=True).groupby(0, as_index=False).apply(
        lambda x: '{}({})'.format(x.name, ', '.join(x[1])))
str_regroup(df.Z1)
output
A A(Z1W1, Z1W3)
B B(Z1W1, Z1W2)
then apply to all columns
df.apply(str_regroup)
output
Z1 Z2 Z3 Z4
0 A(Z1W1, Z1W3) A(Z2W1) A(Z3W1, Z3W2, Z3W4) B(Z4W2, Z4W3, Z4W4)
1 B(Z1W1, Z1W2) B(Z2W1)
Update 2
Performance on 100 000 sample rows
928 ms for this apply version
1.55 s for the stack() version by @Wen
You could use the following approach:
Melt df to get:
In [194]: melted = pd.melt(df, var_name='col'); melted
Out[194]:
col value
0 Z1 A(Z1W1)
1 Z1 A(Z1W3)
2 Z1 B(Z1W1)
3 Z1 B(Z1W2)
4 Z2 A(Z2W1)
5 Z2 B(Z2W1)
6 Z2
7 Z2
8 Z3 A(Z3W1)
9 Z3 A(Z3W2)
10 Z3 A(Z3W4)
11 Z3
12 Z4 B(Z4W2)
13 Z4 B(Z4W3)
14 Z4 B(Z4W4)
15 Z4
Use regex to extract row and value columns:
In [195]: melted[['row','value']] = melted['value'].str.extract(r'(.*)\((.*)\)', expand=True); melted
Out[195]:
col value row
0 Z1 Z1W1 A
1 Z1 Z1W3 A
2 Z1 Z1W1 B
3 Z1 Z1W2 B
4 Z2 Z2W1 A
5 Z2 Z2W1 B
6 Z2 NaN NaN
7 Z2 NaN NaN
8 Z3 Z3W1 A
9 Z3 Z3W2 A
10 Z3 Z3W4 A
11 Z3 NaN NaN
12 Z4 Z4W2 B
13 Z4 Z4W3 B
14 Z4 Z4W4 B
15 Z4 NaN NaN
Group by col and row and join the values together:
In [185]: result = melted.groupby(['col', 'row'])['value'].agg(','.join)
In [186]: result
Out[186]:
col row
Z1 A Z1W1,Z1W3
B Z1W1,Z1W2
Z2 A Z2W1
B Z2W1
Z3 A Z3W1,Z3W2,Z3W4
Z4 B Z4W2,Z4W3,Z4W4
Name: value, dtype: object
Add the row values to the value values:
In [188]: result['value'] = result['row'] + '(' + result['value'] + ')'
In [189]: result
Out[189]:
row value
col
Z1 A A(Z1W1,Z1W3)
Z1 B B(Z1W1,Z1W2)
Z2 A A(Z2W1)
Z2 B B(Z2W1)
Z3 A A(Z3W1,Z3W2,Z3W4)
Z4 B B(Z4W2,Z4W3,Z4W4)
Overwrite the row column values with groupby/cumcount values to set up the upcoming pivot:
In [191]: result['row'] = result.groupby(level='col').cumcount()
In [192]: result
Out[192]:
row value
col
Z1 0 A(Z1W1,Z1W3)
Z1 1 B(Z1W1,Z1W2)
Z2 0 A(Z2W1)
Z2 1 B(Z2W1)
Z3 0 A(Z3W1,Z3W2,Z3W4)
Z4 0 B(Z4W2,Z4W3,Z4W4)
Pivoting produces the desired result:
result = result.pivot(index='row', columns='col', values='value')
import pandas as pd
df = pd.DataFrame({
'Z1': ['A(Z1W1)', 'A(Z1W3)', 'B(Z1W1)', 'B(Z1W2)'],
'Z2': ['A(Z2W1)', 'B(Z2W1)', '', ''],
'Z3': ['A(Z3W1)', 'A(Z3W2)', 'A(Z3W4)', ''],
'Z4': ['B(Z4W2)', 'B(Z4W3)', 'B(Z4W4)', '']}, index=[0, 1, 2, 3],)
melted = pd.melt(df, var_name='col').dropna()
melted[['row','value']] = melted['value'].str.extract(r'(.*)\((.*)\)', expand=True)
result = melted.groupby(['col', 'row'])['value'].agg(','.join)
result = result.reset_index('row')
result['value'] = result['row'] + '(' + result['value'] + ')'
result['row'] = result.groupby(level='col').cumcount()
result = result.reset_index()
result = result.pivot(index='row', columns='col', values='value')
print(result)
yields
col Z1 Z2 Z3 Z4
row
0 A(Z1W1,Z1W3) A(Z2W1) A(Z3W1,Z3W2,Z3W4) B(Z4W2,Z4W3,Z4W4)
1 B(Z1W1,Z1W2) B(Z2W1) NaN NaN

Copy columns values matching multiple columns patterns in Pandas

I happen to have the following DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({ 'Prod1': ['10','','10','','',''],
'Prod2': ['','5','5','','','5'],
'Prod3': ['','','','8','8','8'],
'String1': ['','','','','',''],
'String2': ['','','','','',''],
'String3': ['','','','','',''],
'X1': ['x1','x2','x3','x4','x5','x6'],
'X2': ['','','y1','','','y2']
})
print(df)
Prod1 Prod2 Prod3 String1 String2 String3 X1 X2
0 10 x1
1 5 x2
2 10 5 x3 y1
3 8 x4
4 8 x5
5 5 8 x6 y2
It's a schematic table of Products with associated Strings; the actual Strings are in columns (X1, X2), but they should eventually move to (String1, String2, String3) based on whether the corresponding product has a value or not.
For instance:
row 0 has a value on Prod1, hence x1 should move to String1.
row 1 has a value on Prod2, hence x2 should move to String2.
In the actual dataset, mostly each Prod has a single String, but there are rows where multiple values are found in the Prods, and the String columns should be filled giving priority to the left. The final result should look like:
Prod1 Prod2 Prod3 String1 String2 String3 X1 X2
0 10 x1
1 5 x2
2 10 5 x3 y1
3 8 x4
4 8 x5
5 5 8 x6 y2
I was thinking about nested column/row loops, but I'm still not familiar enough with pandas to get to the solution.
Thank you very much in advance for any suggestion!
I break it down into steps:
# mark which String slots should get a value (True where the Prod has one)
df[['String1', 'String2', 'String3']] = (df[['Prod1', 'Prod2', 'Prod3']] != '')
# stack the marked slots and the available X values in the same row order
df1 = df[['String1', 'String2', 'String3']].replace({False: np.nan}).stack().to_frame()
df1[0] = df[['X1', 'X2']].replace({'': np.nan}).stack().values
# unstack back: each row's X values fill its left-most marked String slots
df[['String1', 'String2', 'String3']] = df1[0].unstack()
df.replace({None: ''})
Out[1036]:
Prod1 Prod2 Prod3 String1 String2 String3 X1 X2
0 10 x1 x1
1 5 x2 x2
2 10 5 x3 y1 x3 y1
3 8 x4 x4
4 8 x5 x5
5 5 8 x6 y2 x6 y2
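To see the same stack/unstack trick end to end on a smaller frame (a runnable sketch, with the NaN dropping made explicit for newer pandas versions of stack()):
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Prod1": ["10", "", "10"],
    "Prod2": ["", "5", "5"],
    "String1": ["", "", ""],
    "String2": ["", "", ""],
    "X1": ["x1", "x2", "x3"],
    "X2": ["", "", "y1"],
})

# Mark which String slots should receive a value (True where the Prod has one)
df[["String1", "String2"]] = df[["Prod1", "Prod2"]] != ""
# Stack the marked slots and the available X values in the same row order
slots = df[["String1", "String2"]].replace({False: np.nan}).stack().dropna().to_frame()
slots[0] = df[["X1", "X2"]].replace({"": np.nan}).stack().dropna().values
# Unstack back: each row's X values land in its left-most marked String slots
df[["String1", "String2"]] = slots[0].unstack()
df = df.fillna("")
print(df)
```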

Categories

Resources