I am pretty new to pandas, so please bear with me. I have a df like this one:
DF1
column1 column2(ids)
a [1,2,13,4,9]
b [20,14,10,18,17]
c [6,8,12,16,19]
d [11,3,15,7,5]
Each number in each list corresponds to a value in the id column of a second dataframe.
DF2
id. value_to_change.
1 x1
2 x2
3 x3
4 x4
5 x5
6 x6
7 x7
8 x8
9 x9
. .
. .
. .
20 x20
STEP 1
I want to iterate each list and select the rows in DF2 with the matching ids, AND create 4 dataframes since I have 4 rows in DF1.
How can I do this?
For instance, after applying the logic to the first row I would get this back:
id. value_to_change
1 x1
2 x2
13 x13
4 x4
9 x9
The second row would give me
id. value_to_change
20 x20
14 x14
10 x10
18 x18
17 x17
And so on...
STEP 2
Once I have these 4 dataframes, I pass them as arguments to some logic that returns 4 new dataframes.
How could I combine those into a single, sorted final one?
DF3
id. new_value
1 y1
2 y2
3 y3
4 y4
5 y5
6 y6
7 y7
8 y8
9 y9
. .
. .
. .
20 y20
How could I go about this?
It would be much easier and more efficient to use a single dataframe, like so:
Initialization
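import pandas as pd
df1 = pd.DataFrame({'label': ['A', 'B', 'C', 'D'],
                    'ids': [[1,2,13,4,9], [20,14,10,18,17], [6,8,12,16,19], [11,3,15,7,5]]})
# df2 is the lookup table from the question; the column names 'ids' and 'val' are assumed here
df2 = pd.DataFrame({'ids': list(range(1, 21)), 'val': [f'x{i}' for i in range(1, 21)]})
# Some custom function for dataframe operations
def my_func(x):
    x['value_to_change'] = x.value_to_change.str.replace('x', 'y')
    return x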
Dataframe Operations
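df1 = df1.explode('ids')
# df1 is already exploded here, so the ids can be mapped directly to their values in df2
df1['value_to_change'] = df1['ids'].map(dict(zip(df2.ids, df2.val)))
# group_keys=False keeps the original row index so the result lines up with df1
df1['new_value'] = df1.groupby('label', group_keys=False).apply(my_func)['value_to_change']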
Output
label ids value_to_change new_value
0 A 1 x1 y1
0 A 2 x2 y2
0 A 13 x13 y13
0 A 4 x4 y4
0 A 9 x9 y9
1 B 20 x20 y20
1 B 14 x14 y14
1 B 10 x10 y10
1 B 18 x18 y18
1 B 17 x17 y17
2 C 6 x6 y6
2 C 8 x8 y8
2 C 12 x12 y12
2 C 16 x16 y16
2 C 19 x19 y19
3 D 11 x11 y11
3 D 3 x3 y3
3 D 15 x15 y15
3 D 7 x7 y7
3 D 5 x5 y5
This code will help with the first part of the problem.
import pandas as pd
df1 = pd.DataFrame([[[1,2,4,5]],[[3,4,1]]], columns=["column2(ids)"])
df2 = pd.DataFrame([[1,"x1"],[2,"x2"],[3,"x3"],[4,"x4"],[5,"x5"]], columns=["id", "value_to_change"])
pieces = []  # one single-row dataframe per matched id
for _, row in df1.iterrows():
    s = row.iloc[0]  # the list of ids stored in this row
    for item in s:
        val = df2.loc[df2['id'] == item, 'value_to_change'].item()
        pieces.append(pd.DataFrame([[item, val]], columns=["id", "value_to_change"]))
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
df3 = pd.concat(pieces, ignore_index=True)
df3
Note: in the line s = row.iloc[0], choose the position that matches your own dataframe; in my case the list of ids is in the first (and only) column.
- For the second part, you can use pd.concat.
- For sorting, use df.sort_values.
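As a rough illustration of those two pointers, the sketch below concatenates two small result frames and sorts them by id. The frame names and values are made up for the example; they stand in for whatever your own logic returns.

```python
import pandas as pd

# Hypothetical outputs of the per-row logic (names and values are illustrative only)
result_a = pd.DataFrame({"id": [13, 1, 4], "new_value": ["y13", "y1", "y4"]})
result_b = pd.DataFrame({"id": [20, 10], "new_value": ["y20", "y10"]})

# Stack the frames, sort by id, then renumber the rows
combined = (
    pd.concat([result_a, result_b], ignore_index=True)
    .sort_values("id")
    .reset_index(drop=True)
)
print(combined)
```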
Use .loc and .isin to get new Dataframe with required rows in df2
Do your logic on these 4 dataframes
combine the resulting 4 dataframes using pandas.concat()
sort the dataframe by ids using .sort_values()
Code:
import pandas as pd
df1 = pd.DataFrame({'column1': ['A', 'B', 'C', 'D'], 'ids': [[1,2,13,4,9], [20,14,10,18,17], [6,8,12,16,19],[11,3,15,7,5]]})
df2 = pd.DataFrame({'ids': list(range(1,21)), 'val': [f'x{x}' for x in range(1,21)]})
df_list = []
for id_list in df1['ids'].values:
    df_list.append(df2.loc[df2['ids'].isin(id_list)])
# do logic on each DF in df_list
# assuming df_list now contains the resulting dataframes
df3 = pd.concat(df_list)
df3 = df3.sort_values('ids')
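One thing to be aware of: .isin returns the matching rows in df2's own order, not in the order of the id list. If the intermediate dataframes need to follow the list order (as in the question's example), a possible sketch is to look the ids up by label instead. This assumes every id in the list exists in df2; otherwise .loc raises a KeyError.

```python
import pandas as pd

df2 = pd.DataFrame({"ids": list(range(1, 21)),
                    "val": [f"x{i}" for i in range(1, 21)]})
id_list = [1, 2, 13, 4, 9]

# Boolean mask: rows come back in df2's order
by_isin = df2.loc[df2["ids"].isin(id_list)]

# Label lookup: rows come back in the order of id_list
by_label = df2.set_index("ids").loc[id_list].reset_index()

print(by_isin["ids"].tolist())
print(by_label["ids"].tolist())
```

Since the final dataframe is sorted by id anyway, the difference only matters if the per-dataframe logic depends on row order.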
First things first, this code should do what you want.
import pandas as pd
idxs = [
[0,2],
[1,3],
]
df_idxs = pd.DataFrame({'idxs': idxs})
df = pd.DataFrame(
{'data': ['a', 'b', 'c', 'd']}
)
frames = []
for _, idx in df_idxs.iterrows():
    rows = idx['idxs']
    frame = df.loc[rows]
    # some logic
    print(frame)
    # collect
    frames.append(frame)
pd.concat(frames)
Note that pandas automatically creates a range index if none is passed. If you want to select on a different column, set that one as the index, or use df.loc[df.data.isin(rows)].
The pandas doc on split-apply-combine may also interest you: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
Let's say we have a df like below:
df = pd.DataFrame({'A':['y2','x3','z1','z1'],'B':['y2','x3','a2','z1']})
A B
0 y2 y2
1 x3 x3
2 z1 a2
3 z1 z1
If we wanted to sort the values on just the numbers in column A, we can do:
df.sort_values(by='A',key=lambda x: x.str[1])
A B
3 z1 z1
2 z1 a2
0 y2 y2
1 x3 x3
If we wanted to sort by both columns A and B, but have the key only apply to column A, is there a way to do that?
df.sort_values(by=['A','B'],key=lambda x: x.str[1])
Expected output:
A B
2 z1 a2
3 z1 z1
0 y2 y2
1 x3 x3
You can sort by B, then sort by A with a stable method:
(df.sort_values('B')
.sort_values('A', key=lambda x: x.str[1], kind='mergesort')
)
Output:
A B
2 z1 a2
3 z1 z1
0 y2 y2
1 x3 x3
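Another option, offered as a sketch rather than the definitive way: when several columns are passed to by, sort_values applies key to each column separately, and the Series it receives carries the column name. The key can therefore transform only column A and pass B through unchanged.

```python
import pandas as pd

df = pd.DataFrame({'A': ['y2', 'x3', 'z1', 'z1'],
                   'B': ['y2', 'x3', 'a2', 'z1']})

# Apply the digit extraction only to column A; leave every other column as-is
result = df.sort_values(
    by=['A', 'B'],
    key=lambda col: col.str[1] if col.name == 'A' else col,
)
print(result)
```

This avoids the double sort, at the cost of a slightly less obvious key function.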
I have a data frame and would like to group it by a few columns and different levels of values. Also, I want to append the group by results to the original data frame.
This is the original data frame:
AAA BBB CCC
x1 y1 yes
x1 y1 yes
x1 y1 no
x1 y2 no
x2 y2 yes
x2 y2 no
This is what I want:
AAA BBB CCC Yes No
x1 y1 yes 2 1
x1 y1 yes 2 1
x1 y1 no 2 1
x1 y2 no 0 1
x2 y2 yes 1 1
x2 y2 no 1 1
The idea here is that I want to group by AAA and BBB and count yes/no in CCC for each group. Then, I want to add the count values into 2 new columns, Yes and No.
Thanks in advance!
One way is to:
group by AAA and BBB
get the value_counts() of CCC for each group
unstack the innermost value-count index (which consists of yes and no) into the columns
merge the counts with the original DataFrame
counts = (df.groupby(['AAA', 'BBB'])['CCC']
.value_counts()
.unstack()
.fillna(0)
.astype(int))
counts.columns = counts.columns.str.title()
pd.merge(df, counts, left_on=['AAA', 'BBB'], right_index=True)
AAA BBB CCC No Yes
0 x1 y1 yes 1 2
1 x1 y1 yes 1 2
2 x1 y1 no 1 2
3 x1 y2 no 1 0
4 x2 y2 yes 1 1
5 x2 y2 no 1 1
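If you would rather avoid the merge, a possible alternative is to count each answer with groupby + transform, which returns a result aligned with the original rows. This is only a sketch; the dataframe below is rebuilt from the question's table.

```python
import pandas as pd

df = pd.DataFrame({
    "AAA": ["x1", "x1", "x1", "x1", "x2", "x2"],
    "BBB": ["y1", "y1", "y1", "y2", "y2", "y2"],
    "CCC": ["yes", "yes", "no", "no", "yes", "no"],
})

keys = [df["AAA"], df["BBB"]]
# Summing a boolean mask per group counts the matching rows in that group
df["Yes"] = df["CCC"].eq("yes").groupby(keys).transform("sum")
df["No"] = df["CCC"].eq("no").groupby(keys).transform("sum")
print(df)
```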
I am trying to manipulate a data frame into the output data frame format. There are multiple values in a particular cell separated by ','. When I use .stack() to convert a number of values to rows, the remaining empty cells are filled with NaN. Is there any generic solution in pandas to handle this?
Input data frame:
x1 y1 x2 x3 x4
abc x or y v1,v2,v3 l1,l2,l3 self
abc z no1,no2,no3 e1,e2,e3 self
Output data frame:
x1 y1 x2 x3 x4
abc x v1 l1 self
v2 l2
v3 l3
y v1 l1 self
v2 l2
v3 l3
abc z no1 e1 self
no2 e2
no3 e3
df.set_index(df.index).apply(lambda x: x.str.split(",").apply(pd.Series).stack()).reset_index(drop=True).fillna("")
Output:
x1 x2 x3 x4
0 abc v1 l1 self
1 v2 l2
2 v3 l3
3 abc no1 e1 self
4 no2 e2
5 no3 e3
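A different sketch of the same idea uses str.split followed by DataFrame.explode, which accepts several columns at once in pandas 1.3 and later, provided the lists in each row have the same length. As in the output above, the y1 column is left out, and the choice of which columns to split and which to blank is an assumption made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "x1": ["abc", "abc"],
    "x2": ["v1,v2,v3", "no1,no2,no3"],
    "x3": ["l1,l2,l3", "e1,e2,e3"],
    "x4": ["self", "self"],
})

list_cols = ["x2", "x3"]    # columns holding comma-separated values
scalar_cols = ["x1", "x4"]  # columns to show only on the first row of each group

out = df.copy()
for col in list_cols:
    out[col] = out[col].str.split(",")

# One output row per list element; the original row label is repeated
out = out.explode(list_cols)
repeated = out.index.duplicated()  # True for every row after the first of each group
out = out.reset_index(drop=True)
out.loc[repeated, scalar_cols] = ""
print(out)
```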
I have a pandas dataframe like the following:
import pandas as pd
pd.DataFrame({"AAA":["x1","x1","x1","x2","x2","x2"],
"BBB":["y1","y1","y2","y2","y2","y1"],
"CCC":["t1","t2","t3","t1","t1","t1"],
"DDD":[10,11,18,17,21,30]})
Out[1]:
AAA BBB CCC DDD
0 x1 y1 t1 10
1 x1 y1 t2 11
2 x1 y2 t3 18
3 x2 y2 t1 17
4 x2 y2 t1 21
5 x2 y1 t1 30
The problem
What I want is to group on column AAA so I have 2 groups - x1, x2.
I then want to calculate the ratio of y1 to y2 in column BBB for each group.
And assign this output to a new column Ratio of BBB
The desired output
So I want this as my output.
pd.DataFrame({"AAA":["x1","x1","x1","x2","x2","x2"],
"BBB":["y1","y1","y2","y2","y2","y1"],
"CCC":["t1","t2","t3","t1","t1","t1"],
"DDD":[10,11,18,17,21,30],
"Ratio of BBB":[0.33,0.33,0.33,0.66,0.66,0.66]})
Out[2]:
AAA BBB CCC DDD Ratio of BBB
0 x1 y1 t1 10 0.33
1 x1 y1 t2 11 0.33
2 x1 y2 t3 18 0.33
3 x2 y2 t1 17 0.66
4 x2 y2 t1 21 0.66
5 x2 y1 t1 30 0.66
Current status
I have currently achieved it like so:
def f(df):
    df["y1"] = sum(df["BBB"] == "y1")
    df["y2"] = sum(df["BBB"] == "y2")
    df["Ratio of BBB"] = df["y2"] / df["y1"]
    return df
df.groupby(df.AAA).apply(f)
What I want to achieve
Is there any way to achieve this with the .pipe() function?
I was thinking something like this:
df = (df
.groupby(df.AAA) # groupby a column not included in the current series (df.colname)
.BBB
.value_counts()
.pipe(lambda series: series["BBB"] == "y2" / series["BBB"] == "y1")
)
Edit: One solution using pipe()
N.B.: User jpp made a clear point in a comment below:
unstack / merge / reset_index operations are unnecessary and expensive
However, since I initially intended to use this method, I thought I would share it here!
df = (df
.groupby(df.AAA) # groupby the column
.BBB # select the column with values to calculate ('BBB' with y1 & y2)
.value_counts() # calculate the values (# of y1 per group, # of y2 per group)
.unstack() # turn the rows into columns (y1, y2)
.pipe(lambda df: df["y1"]/df["y2"]) # calculate the ratio of y1:y2 (outputs a Series)
.rename("ratio") # rename the series 'ratio' so it will be ratio column in output df
.reset_index() # turn the groupby series into a dataframe
.merge(df) # merge with the original dataframe filling in the columns with the key (AAA)
)
Looks like you want the ratio of y2 to the total instead. Use groupby + value_counts:
v = df.groupby('AAA').BBB.value_counts().unstack()
df['RATIO'] = df.AAA.map(v.y2 / (v.y2 + v.y1))
AAA BBB CCC DDD RATIO
0 x1 y1 t1 10 0.333333
1 x1 y1 t2 11 0.333333
2 x1 y2 t3 18 0.333333
3 x2 y2 t1 17 0.666667
4 x2 y2 t1 21 0.666667
5 x2 y1 t1 30 0.666667
To generalise for many groups, you may use
df['RATIO'] = df.AAA.map(v.y2 / v.sum(axis=1))
Using groupby + transform with a custom function:
def ratio(x):
    counts = x.value_counts()
    return counts['y2'] / counts.sum()
df['Ratio of BBB'] = df.groupby('AAA')['BBB'].transform(ratio)
print(df)
AAA BBB CCC DDD Ratio of BBB
0 x1 y1 t1 10 0.333333
1 x1 y1 t2 11 0.333333
2 x1 y2 t3 18 0.333333
3 x2 y2 t1 17 0.666667
4 x2 y2 t1 21 0.666667
5 x2 y1 t1 30 0.666667
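Since the share of y2 in a group is just the mean of a 0/1 indicator, a shorter variant is possible without a custom function. The sketch below rebuilds the question's dataframe and assumes the share of y2 is what you are after; it also returns 0 rather than raising for a group that contains no y2 at all.

```python
import pandas as pd

df = pd.DataFrame({"AAA": ["x1", "x1", "x1", "x2", "x2", "x2"],
                   "BBB": ["y1", "y1", "y2", "y2", "y2", "y1"],
                   "CCC": ["t1", "t2", "t3", "t1", "t1", "t1"],
                   "DDD": [10, 11, 18, 17, 21, 30]})

# 1.0 where BBB is y2, else 0.0; the per-group mean is the share of y2
is_y2 = df["BBB"].eq("y2").astype(float)
df["Ratio of BBB"] = is_y2.groupby(df["AAA"]).transform("mean")
print(df)
```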