Copy column values matching multiple column patterns in Pandas - python

I happen to have the following DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Prod1': ['10', '', '10', '', '', ''],
                   'Prod2': ['', '5', '5', '', '', '5'],
                   'Prod3': ['', '', '', '8', '8', '8'],
                   'String1': ['', '', '', '', '', ''],
                   'String2': ['', '', '', '', '', ''],
                   'String3': ['', '', '', '', '', ''],
                   'X1': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
                   'X2': ['', '', 'y1', '', '', 'y2']
                   })
print(df)
  Prod1 Prod2 Prod3 String1 String2 String3  X1  X2
0    10                                      x1
1           5                                x2
2    10     5                                x3  y1
3                 8                          x4
4                 8                          x5
5           5     8                          x6  y2
It's a schematic table of Products with associated Strings; the actual Strings are in columns (X1, X2), but they should eventually move to (String1, String2, String3) based on whether the corresponding product has a value or not.
For instance:
row 0 has a value on Prod1, hence x1 should move to String1.
row 1 has a value on Prod2, hence x2 should move to String2.
In the actual dataset each Prod mostly has a single String, but there are rows where multiple Prods hold values, and then the String columns should be filled giving priority to the left. The final result should look like:
  Prod1 Prod2 Prod3 String1 String2 String3  X1  X2
0    10                  x1
1           5                    x2
2    10     5            x3      y1
3                 8                      x4
4                 8                      x5
5           5     8              x6      y2
I was thinking about nested column/row loops, but I'm still not familiar enough with pandas to get to the solution.
Thank you very much in advance for any suggestion!

I'll break down the steps:
df[['String1', 'String2', 'String3']] = (df[['Prod1', 'Prod2', 'Prod3']] != '')
df1 = df[['String1', 'String2', 'String3']].replace({False: np.nan}).stack().to_frame()
df1[0] = df[['X1', 'X2']].replace({'': np.nan}).stack().values
df[['String1', 'String2', 'String3']] = df1[0].unstack()
df.replace({None: ''})
Out[1036]:
  Prod1 Prod2 Prod3 String1 String2 String3  X1  X2
0    10                  x1                  x1
1           5                    x2          x2
2    10     5            x3      y1          x3  y1
3                 8                      x4  x4
4                 8                      x5  x5
5           5     8              x6      y2  x6  y2
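For reference, the steps above can be rolled into one self-contained sketch (same idea: stack the non-empty Prod slots and the non-empty X values in parallel; set_axis is used here so the stacked index already carries the String column names):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Prod1': ['10', '', '10', '', '', ''],
                   'Prod2': ['', '5', '5', '', '', '5'],
                   'Prod3': ['', '', '', '8', '8', '8'],
                   'String1': [''] * 6, 'String2': [''] * 6, 'String3': [''] * 6,
                   'X1': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
                   'X2': ['', '', 'y1', '', '', 'y2']})

# True where a Prod column holds a value; rename so the stacked index
# already uses the target String column names
mask = (df[['Prod1', 'Prod2', 'Prod3']] != '').set_axis(
    ['String1', 'String2', 'String3'], axis=1)

# Stacking drops the empty slots, leaving one entry per String cell to fill,
# ordered left-to-right within each row
slots = mask.where(mask).stack().dropna()

# The non-empty X values stack in the same left-to-right order, so a
# positional assignment lines them up
slots[:] = df[['X1', 'X2']].replace({'': np.nan}).stack().dropna().values

df[['String1', 'String2', 'String3']] = slots.unstack().fillna('')
print(df)
```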

Related

Combining Pandas Dataframes

I am pretty new to Pandas, so please bear with me. I have a df like this one:
DF1
column1 column2(ids)
a [1,2,13,4,9]
b [20,14,10,18,17]
c [6,8,12,16,19]
d [11,3,15,7,5]
Each number in each list corresponds to an id in a second dataframe.
DF2
id. value_to_change.
1 x1
2 x2
3 x3
4 x4
5 x5
6 x6
7 x7
8 x8
9 x9
. .
. .
. .
20 x20
STEP1
I want to iterate over each list and select the rows in DF2 with the matching ids, AND create 4 dataframes, since I have 4 rows in DF1.
How can I do this?
So for instance, for the first row, after applying the logic I would get this back:
id. value_to_change
1 x1
2 x2
13 x13
4 x4
9 x9
The second row would give me
id. value_to_change
20 x20
14 x14
10 x10
18 x18
17 x17
And so on...
STEP 2
Once I have these 4 dataframes, I pass them as arguments to some logic which returns 4 dataframes.
How could I combine them into a single, sorted final one?
DF3
id. new_value
1 y1
2 y2
3 y3
4 y4
5 y5
6 y6
7 y7
8 y8
9 y9
. .
. .
. .
20 y20
How could I go about this?
It would be much easier and more efficient to use a single dataframe, like so:
Initialization
df1 = pd.DataFrame({'label': ['A', 'B', 'C', 'D'],
                    'ids': [[1,2,13,4,9], [20,14,10,18,17],
                            [6,8,12,16,19], [11,3,15,7,5]]})
# Some custom function for dataframe operations
def my_func(x):
    x['value_to_change'] = x.value_to_change.str.replace('x', 'y')
    return x
Dataframe Operations
df1 = df1.explode('ids')
df1['value_to_change'] = df1['ids'].map(dict(zip(df2.ids, df2.val)))
df1['new_value'] = df1.groupby('label').apply(my_func)['value_to_change']
Output
label ids value_to_change new_value
0 A 1 x1 y1
0 A 2 x2 y2
0 A 13 x13 y13
0 A 4 x4 y4
0 A 9 x9 y9
1 B 20 x20 y20
1 B 14 x14 y14
1 B 10 x10 y10
1 B 18 x18 y18
1 B 17 x17 y17
2 C 6 x6 y6
2 C 8 x8 y8
2 C 12 x12 y12
2 C 16 x16 y16
2 C 19 x19 y19
3 D 11 x11 y11
3 D 3 x3 y3
3 D 15 x15 y15
3 D 7 x7 y7
3 D 5 x5 y5
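A self-contained version of the approach above (a sketch; the df2 columns ids/val are assumed from the setup, and the per-group my_func is inlined because the example transformation is row-wise, so no groupby is needed for it):

```python
import pandas as pd

df1 = pd.DataFrame({'label': ['A', 'B', 'C', 'D'],
                    'ids': [[1, 2, 13, 4, 9], [20, 14, 10, 18, 17],
                            [6, 8, 12, 16, 19], [11, 3, 15, 7, 5]]})
df2 = pd.DataFrame({'ids': list(range(1, 21)),
                    'val': [f'x{i}' for i in range(1, 21)]})

# One row per id, then look up each id's value in df2
df1 = df1.explode('ids')
df1['value_to_change'] = df1['ids'].map(dict(zip(df2.ids, df2.val)))
# Example transformation: x<n> -> y<n>
df1['new_value'] = df1['value_to_change'].str.replace('x', 'y')
print(df1)
```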
This code will help with the first part of the problem.
import pandas as pd
df1 = pd.DataFrame([[[1,2,4,5]], [[3,4,1]]], columns=["column2(ids)"])
df2 = pd.DataFrame([[1,"x1"], [2,"x2"], [3,"x3"], [4,"x4"], [5,"x5"]],
                   columns=["id", "value_to_change"])
df3 = pd.DataFrame(columns=["id", "value_to_change"])
for row in df1.iterrows():
    s = row[1][0]
    for item in s:
        val = df2.loc[df2['id'] == item, 'value_to_change'].item()
        df_temp = pd.DataFrame([[item, val]], columns=["id", "value_to_change"])
        # DataFrame.append was removed in pandas 2.0; concat does the same job
        df3 = pd.concat([df3, df_temp], ignore_index=True)
df3
Note that in the line s = row[1][0] you need to choose the index according to your dataframe; in my case it was [1][0].
- For the second part you can use pd.concat: Documentation
- For sorting, df.sort_values: Documentation
Use .loc and .isin to get a new dataframe with the required rows from df2
Do your logic on these 4 dataframes
Combine the resulting 4 dataframes using pandas.concat()
Sort the dataframe by ids using .sort_values()
Code:
import pandas as pd
df1 = pd.DataFrame({'column1': ['A', 'B', 'C', 'D'],
                    'ids': [[1,2,13,4,9], [20,14,10,18,17],
                            [6,8,12,16,19], [11,3,15,7,5]]})
df2 = pd.DataFrame({'ids': list(range(1,21)), 'val': [f'x{x}' for x in range(1,21)]})
df_list = []
for id_list in df1['ids'].values:
    df_list.append(df2.loc[df2['ids'].isin(id_list)])
# do logic on each DF in df_list
# assuming df_list now contains the resulting dataframes
df3 = pd.concat(df_list)
df3 = df3.sort_values('ids')
First things first, this code should do what you want.
import pandas as pd
idxs = [
    [0, 2],
    [1, 3],
]
df_idxs = pd.DataFrame({'idxs': idxs})
df = pd.DataFrame({'data': ['a', 'b', 'c', 'd']})
frames = []
for _, idx in df_idxs.iterrows():
    rows = idx['idxs']
    frame = df.loc[rows]
    # some logic
    print(frame)
    # collect
    frames.append(frame)
pd.concat(frames)
Note that pandas automatically creates a range index if none is passed. If you want to select on a different column, set that one as the index, or use df.loc[df.data.isin(rows)].
The pandas doc on split-apply-combine may also interest you: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

Using key while sorting values for just one column

Let's say we have a df like below:
df = pd.DataFrame({'A':['y2','x3','z1','z1'],'B':['y2','x3','a2','z1']})
A B
0 y2 y2
1 x3 x3
2 z1 a2
3 z1 z1
If we wanted to sort the values on just the numbers in column A, we can do:
df.sort_values(by='A',key=lambda x: x.str[1])
A B
3 z1 z1
2 z1 a2
0 y2 y2
1 x3 x3
If we wanted to sort by both columns A and B, but have the key only apply to column A, is there a way to do that?
df.sort_values(by=['A','B'],key=lambda x: x.str[1])
Expected output:
A B
2 z1 a2
3 z1 z1
0 y2 y2
1 x3 x3
You can sort by B, then sort by A with a stable method:
(df.sort_values('B')
.sort_values('A', key=lambda x: x.str[1], kind='mergesort')
)
Output:
A B
2 z1 a2
3 z1 z1
0 y2 y2
1 x3 x3
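If your pandas predates the key argument of sort_values (added in 1.1.0), a temporary sort-key column gives the same result (a sketch; _k is a throwaway helper name):

```python
import pandas as pd

df = pd.DataFrame({'A': ['y2', 'x3', 'z1', 'z1'], 'B': ['y2', 'x3', 'a2', 'z1']})

# Sort on the digit of A first, with B as the tie-breaker, then drop the helper
out = (df.assign(_k=df['A'].str[1])
         .sort_values(['_k', 'B'])
         .drop(columns='_k'))
print(out)
```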

Pandas Group by Multiple Columns and Levels of Values and Append Results to the Original Data Frame

I have a data frame and would like to group it by a few columns and different levels of values. Also, I want to append the group by results to the original data frame.
This is the original data frame:
AAA BBB CCC
x1 y1 yes
x1 y1 yes
x1 y1 no
x1 y2 no
x2 y2 yes
x2 y2 no
This is what I want:
AAA BBB CCC Yes No
x1 y1 yes 2 1
x1 y1 yes 2 1
x1 y1 no 2 1
x1 y2 no 0 1
x2 y2 yes 1 1
x2 y2 no 1 1
The idea here is that I want to group by AAA and BBB and count yes/no in CCC for each group. Then, I want to add the count values into 2 new columns, Yes and No.
Thanks in advance!
One way is to:
group by AAA and BBB
get the value_counts() of CCC for each group
unstack the innermost value-count index (which consists of yes and no) into the columns
merge the counts with the original DataFrame
counts = (df.groupby(['AAA', 'BBB'])['CCC']
            .value_counts()
            .unstack()
            .fillna(0)
            .astype(int))
counts.columns = counts.columns.str.title()
pd.merge(df, counts, left_on=['AAA', 'BBB'], right_index=True)
AAA BBB CCC No Yes
0 x1 y1 yes 1 2
1 x1 y1 yes 1 2
2 x1 y1 no 1 2
3 x1 y2 no 1 0
4 x2 y2 yes 1 1
5 x2 y2 no 1 1
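The same two columns can also be produced without a merge, by broadcasting the per-group counts back onto the rows with groupby + transform (a minimal sketch on the question's data):

```python
import pandas as pd

df = pd.DataFrame({'AAA': ['x1', 'x1', 'x1', 'x1', 'x2', 'x2'],
                   'BBB': ['y1', 'y1', 'y1', 'y2', 'y2', 'y2'],
                   'CCC': ['yes', 'yes', 'no', 'no', 'yes', 'no']})

grp = df.groupby(['AAA', 'BBB'])['CCC']
# transform returns one value per original row, so no merge step is needed
df['Yes'] = grp.transform(lambda s: (s == 'yes').sum())
df['No'] = grp.transform(lambda s: (s == 'no').sum())
print(df)
```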

Splitting values in a particular cell into rows in a data frame

I am trying to manipulate a data frame into the output data frame format. There are multiple values in a particular cell separated by ','. When I use .stack() to convert a number of values to rows, the remaining empty cells are filled with NaN. Is there any generic solution in pandas to handle this?
Input data frame:
x1 y1 x2 x3 x4
abc x or y v1,v2,v3 l1,l2,l3 self
abc z no1,no2,no3 e1,e2,e3 self
Output data frame:
x1 y1 x2 x3 x4
abc x v1 l1 self
v2 l2
v3 l3
y v1 l1 self
v2 l2
v3 l3
abc z no1 e1 self
no2 e2
no3 e3
(df.set_index(df.index)
   .apply(lambda x: x.str.split(",").apply(pd.Series).stack())
   .reset_index(drop=True)
   .fillna(""))
Output:
    x1   x2  x3    x4
0  abc   v1  l1  self
1        v2  l2
2        v3  l3
3  abc  no1  e1  self
4       no2  e2
5       no3  e3
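On pandas ≥ 1.3, DataFrame.explode accepts a list of columns, which handles parallel comma-separated cells directly (a sketch; note it repeats x1/x4 on every row rather than leaving them blank as in the display above):

```python
import pandas as pd

df = pd.DataFrame({'x1': ['abc', 'abc'],
                   'y1': ['x or y', 'z'],
                   'x2': ['v1,v2,v3', 'no1,no2,no3'],
                   'x3': ['l1,l2,l3', 'e1,e2,e3'],
                   'x4': ['self', 'self']})

# Split the multi-valued cells into equal-length lists, then explode both together
out = (df.assign(x2=df['x2'].str.split(','), x3=df['x3'].str.split(','))
         .explode(['x2', 'x3'])
         .reset_index(drop=True))
print(out)
```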

pandas for each group calculate ratio of two categories, and append as a new column to dataframe using .pipe()

I have a pandas dataframe like the following:
import pandas as pd
pd.DataFrame({"AAA": ["x1","x1","x1","x2","x2","x2"],
              "BBB": ["y1","y1","y2","y2","y2","y1"],
              "CCC": ["t1","t2","t3","t1","t1","t1"],
              "DDD": [10,11,18,17,21,30]})
Out[1]:
AAA BBB CCC DDD
0 x1 y1 t1 10
1 x1 y1 t2 11
2 x1 y2 t3 18
3 x2 y2 t1 17
4 x2 y2 t1 21
5 x2 y1 t1 30
The problem
What I want is to group on column AAA so I have 2 groups - x1, x2.
I want then calculate the ratio of y1 to y2 in column BBB for each group.
And assign this output to a new column Ratio of BBB
The desired output
So I want this as my output.
pd.DataFrame({"AAA": ["x1","x1","x1","x2","x2","x2"],
              "BBB": ["y1","y1","y2","y2","y2","y1"],
              "CCC": ["t1","t2","t3","t1","t1","t1"],
              "DDD": [10,11,18,17,21,30],
              "Ratio of BBB": [0.33,0.33,0.33,0.66,0.66,0.66]})
Out[2]:
AAA BBB CCC DDD Ratio of BBB
0 x1 y1 t1 10 0.33
1 x1 y1 t2 11 0.33
2 x1 y2 t3 18 0.33
3 x2 y2 t1 17 0.66
4 x2 y2 t1 21 0.66
5 x2 y1 t1 30 0.66
Current status
I have currently achieved it like so:
def f(df):
    df["y1"] = sum(df["BBB"] == "y1")
    df["y2"] = sum(df["BBB"] == "y2")
    df["Ratio of BBB"] = df["y2"] / df["y1"]
    return df

df.groupby(df.AAA).apply(f)
What I want to achieve
Is there anyway to achieve this with the .pipe() function?
I was thinking something like this:
df = (df
      .groupby(df.AAA)  # groupby a column not included in the current series (df.colname)
      .BBB
      .value_counts()
      .pipe(lambda series: series["BBB"] == "y2" / series["BBB"] == "y1")
      )
Edit: One solution using pipe()
N.B.: User jpp made a clear comment below:
unstack / merge / reset_index operations are unnecessary and expensive
However, since I initially intended to use this method, I thought I would share it here!
df = (df
      .groupby(df.AAA)      # group by the column
      .BBB                  # select the column with values to calculate ('BBB' with y1 & y2)
      .value_counts()       # count the values (# of y1 per group, # of y2 per group)
      .unstack()            # turn the rows into columns (y1, y2)
      .pipe(lambda df: df["y1"] / df["y2"])  # calculate the ratio of y1:y2 (outputs a Series)
      .rename("ratio")      # name the series 'ratio' so it becomes the ratio column
      .reset_index()        # turn the groupby series into a dataframe
      .merge(df)            # merge with the original dataframe on the key (AAA)
      )
Looks like you want the ratio of y2 to the total instead. Use groupby + value_counts:
v = df.groupby('AAA').BBB.value_counts().unstack()
df['RATIO'] = df.AAA.map(v.y2 / (v.y2 + v.y1))
AAA BBB CCC DDD RATIO
0 x1 y1 t1 10 0.333333
1 x1 y1 t2 11 0.333333
2 x1 y2 t3 18 0.333333
3 x2 y2 t1 17 0.666667
4 x2 y2 t1 21 0.666667
5 x2 y1 t1 30 0.666667
To generalise for many groups, you may use
df['RATIO'] = df.AAA.map(v.y2 / v.sum(axis=1))
Using groupby + transform with a custom function:
def ratio(x):
    counts = x.value_counts()
    return counts['y2'] / counts.sum()

df['Ratio of BBB'] = df.groupby('AAA')['BBB'].transform(ratio)
print(df)
AAA BBB CCC DDD Ratio of BBB
0 x1 y1 t1 10 0.333333
1 x1 y1 t2 11 0.333333
2 x1 y2 t3 18 0.333333
3 x2 y2 t1 17 0.666667
4 x2 y2 t1 21 0.666667
5 x2 y1 t1 30 0.666667
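A compact alternative is pd.crosstab with normalize='index', which yields each group's y1/y2 shares in one call; the requested ratio is then the y2 share (a sketch on the question's data):

```python
import pandas as pd

df = pd.DataFrame({'AAA': ['x1', 'x1', 'x1', 'x2', 'x2', 'x2'],
                   'BBB': ['y1', 'y1', 'y2', 'y2', 'y2', 'y1'],
                   'CCC': ['t1', 't2', 't3', 't1', 't1', 't1'],
                   'DDD': [10, 11, 18, 17, 21, 30]})

# Row-normalised counts: each row of `shares` sums to 1
shares = pd.crosstab(df['AAA'], df['BBB'], normalize='index')
df['Ratio of BBB'] = df['AAA'].map(shares['y2'])
print(df)
```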
