I have a dataframe:
lft rel rgt num
0 t3 r3 z2 3
1 t1 r3 x1 9
2 x2 r3 t2 8
3 x4 r1 t2 4
4 t1 r1 z3 1
5 x1 r1 t2 2
6 x2 r2 t4 4
7 z3 r2 t4 5
8 t4 r3 x3 4
9 z1 r2 t3 4
And a reference dictionary:
replacement_dict = {
    'X1' : ['x1', 'x2', 'x3', 'x4'],
    'Y1' : ['y1', 'y2'],
    'Z1' : ['z1', 'z2', 'z3']
}
My goal is to replace all occurrences of the values in replacement_dict['X1'] with 'X1', and then merge the rows together. For example, any instance of 'x1', 'x2', 'x3' or 'x4' will be replaced by 'X1', and so on.
I can do this by selecting the rows that contain any of these strings and replacing them with 'X1':
keys = replacement_dict.keys()
for key in keys:
    DF.loc[DF['lft'].isin(replacement_dict[key]), 'lft'] = key
    DF.loc[DF['rgt'].isin(replacement_dict[key]), 'rgt'] = key
giving:
lft rel rgt num
0 t3 r3 Z1 3
1 t1 r3 X1 9
2 X1 r3 t2 8
3 X1 r1 t2 4
4 t1 r1 Z1 1
5 X1 r1 t2 2
6 X1 r2 t4 4
7 Z1 r2 t4 5
8 t4 r3 X1 4
9 Z1 r2 t3 4
Now, if I select all the rows containing 'X1' and merge them, I should end up with:
lft rel rgt num
0 X1 r3 t2 8
1 X1 r1 t2 6
2 X1 r2 t4 4
3 t1 r3 X1 9
4 t4 r3 X1 4
So the three columns ['lft', 'rel', 'rgt'] are unique, while the 'num' column is summed for each group of identical rows. Row 1 above, ['X1', 'r1', 't2', 6], is the sum of the two rows ['X1', 'r1', 't2', 4] and ['X1', 'r1', 't2', 2].
I can do this easily for a small number of rows, but I am working with a dataframe with 6 million rows and a replacement dictionary with 60,000 keys. This is taking forever using simple row-wise extraction and replacement.
How can this (specifically the last part) be scaled efficiently? Is there a pandas trick that someone can recommend?
Reverse the replacement_dict mapping and map() this new mapping onto each of the lft and rgt columns to substitute values (e.g. x1 -> X1, y2 -> Y1, etc.). Since some values in the lft and rgt columns don't exist in the mapping (e.g. t1, t2, etc.), call fillna() to fill in those values.[1]
You may also stack() the columns whose values need to be replaced (lft and rgt), call map+fillna and unstack() back but because there are only 2 columns, it may not be worth the trouble for this particular case.
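A minimal sketch of that stack/unstack variant (reverse_map is the same value-to-key mapping built in the snippet below):
# stack lft and rgt into one Series, map with a fillna fallback, unstack back
reverse_map = {v : k for k, li in replacement_dict.items() for v in li}
stacked = df[['lft', 'rgt']].stack()
df[['lft', 'rgt']] = stacked.map(reverse_map).fillna(stacked).unstack()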
The second part of the question may be answered by summing num values after grouping by lft, rel and rgt columns; so groupby().sum() should do the trick.
# reverse replacement map
reverse_map = {v : k for k, li in replacement_dict.items() for v in li}
# substitute values in lft column using reverse_map
df['lft'] = df['lft'].map(reverse_map).fillna(df['lft'])
# substitute values in rgt column using reverse_map
df['rgt'] = df['rgt'].map(reverse_map).fillna(df['rgt'])
# sum values in num column by groups
result = df.groupby(['lft', 'rel', 'rgt'], as_index=False)['num'].sum()
[1]: map() + fillna() may perform better for your use case than replace() because under the hood, map() uses a Cython-optimized take_nd() method that performs particularly well when there are a lot of values to replace, while replace() uses a replace_list() method that loops in Python. So if replacement_dict is particularly large (which it is in your case), the difference in performance will be huge, but if replacement_dict is small, replace() may outperform map().
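A rough way to check this on your own data is a timing sketch like the one below (illustrative only; the numbers depend on your machine, data and pandas version):
import timeit

t_map = timeit.timeit(lambda: df['lft'].map(reverse_map).fillna(df['lft']), number=10)
t_rep = timeit.timeit(lambda: df['lft'].replace(reverse_map), number=10)
print(f'map+fillna: {t_map:.3f}s, replace: {t_rep:.3f}s')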
If you flip the keys and values of your replacement_dict, things become a lot easier:
new_replacement_dict = {
    v: key
    for key, values in replacement_dict.items()
    for v in values
}
cols = ["lft", "rel", "rgt"]
df[cols] = df[cols].replace(new_replacement_dict)
df.groupby(cols).sum()
Try this; I commented the steps:
# reverse the dict to dissolve the lists of values
reversed_dict = {v: k for k, val in replacement_dict.items() for v in val}
# replace the values
cols = ['lft', 'rel', 'rgt']
df[cols] = df[cols].replace(reversed_dict)
# filter rows where X1 appears in any of the columns
df_filtered = df[df.eq('X1').any(axis=1)]
# sum the duplicate rows
out = df_filtered.groupby(cols).sum().reset_index()
print(out)
Output:
lft rel rgt num
0 X1 r1 t2 6
1 X1 r2 t4 4
2 X1 r3 t2 8
3 t1 r3 X1 9
4 t4 r3 X1 4
Pandas has a built-in function replace that is faster than going through the whole dataframe with .loc.
You can also pass it a list of values to replace, which makes our dictionary a good fit for it:
keys = replacement_dict.keys()
# loop through every key in our dictionary and replace its list of values
for key in keys:
    DF = DF.replace(to_replace=replacement_dict[key], value=key)
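If looping over 60,000 keys is still slow, a single replace() call over a flattened value-to-key mapping (the same reversal other answers build) avoids re-scanning the frame once per key; a sketch:
# flatten {key: [values]} into {value: key} and replace in one pass
flat_map = {v: k for k, vals in replacement_dict.items() for v in vals}
DF = DF.replace(flat_map)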
Here's a way to do what your question asks:
df[['lft','rgt']] = ( df[['lft','rgt']]
                      .replace({it: k for k, v in replacement_dict.items() for it in v}) )
df = ( df[(df.lft == 'X1') | (df.rgt == 'X1')]
       .groupby(['lft','rel','rgt']).sum().reset_index() )
Output:
lft rel rgt num
0 X1 r1 t2 6
1 X1 r2 t4 4
2 X1 r3 t2 8
3 t1 r3 X1 9
4 t4 r3 X1 4
Explanation:
replace() uses a reversed version of the dictionary to replace items from the lists in the original dict with their corresponding keys, in the relevant df columns lft and rgt
after filtering for rows with 'X1' in either lft or rgt, use groupby(), sum() and reset_index() to sum the num column over unique lft, rel, rgt group keys and restore the group keys from index levels to columns.
As an alternative, we can use query() to select only rows containing 'X1':
df[['lft','rgt']] = ( df[['lft','rgt']]
                      .replace({it: k for k, v in replacement_dict.items() for it in v}) )
df = ( df.query("lft=='X1' or rgt=='X1'")
       .groupby(['lft','rel','rgt']).sum().reset_index() )
Lots of great answers. I avoid the need for the dict and use df.apply() like this to generate new data.
import io
import pandas as pd
# create the data
x = '''
lft rel rgt num
t3 r3 z2 3
t1 r3 x1 9
x2 r3 t2 8
x4 r1 t2 4
t1 r1 z3 1
x1 r1 t2 2
x2 r2 t4 4
z3 r2 t4 5
t4 r3 x3 4
z1 r2 t3 4
'''
data = io.StringIO(x)
df = pd.read_csv(data, sep=' ')
print(df)
replacement_dict = {
'X1' : ['x1', 'x2', 'x3', 'x4'],
'Y1' : ['y1', 'y2'],
'Z1' : ['z1', 'z2', 'z3']
}
def replace(x):
    # derive the key to check from the first character, e.g. 'x2' -> 'X1'
    # (this assumes every value maps to its first letter + '1')
    key_check = x[0] + '1'
    key_check = key_check.upper()
    return key_check
df['new'] = df['lft'].apply(replace)
df
returns this:
lft rel rgt num
0 t3 r3 z2 3
1 t1 r3 x1 9
2 x2 r3 t2 8
3 x4 r1 t2 4
4 t1 r1 z3 1
5 x1 r1 t2 2
6 x2 r2 t4 4
7 z3 r2 t4 5
8 t4 r3 x3 4
9 z1 r2 t3 4
lft rel rgt num new
0 t3 r3 z2 3 T1
1 t1 r3 x1 9 T1
2 x2 r3 t2 8 X1
3 x4 r1 t2 4 X1
4 t1 r1 z3 1 T1
5 x1 r1 t2 2 X1
6 x2 r2 t4 4 X1
7 z3 r2 t4 5 Z1
8 t4 r3 x3 4 T1
9 z1 r2 t3 4 Z1
Related
I am trying to manipulate a data frame into the output data frame format. There are multiple values in a particular cell separated by ','. When I use .stack() to convert a number of values to rows, the remaining empty cells are filled with NaN. Is there any generic solution in pandas to handle this?
Input data frame:
x1 y1 x2 x3 x4
abc x or y v1,v2,v3 l1,l2,l3 self
abc z no1,no2,no3 e1,e2,e3 self
Output data frame:
x1 y1 x2 x3 x4
abc x v1 l1 self
v2 l2
v3 l3
y v1 l1 self
v2 l2
v3 l3
abc z no1 e1 self
no2 e2
no3 e3
(df.set_index(df.index)
   .apply(lambda x: x.str.split(",").apply(pd.Series).stack())
   .reset_index(drop=True)
   .fillna(""))
Output:
x1 x2 x3 x4
0 abc v1 l1 self
1 v2 l2
2 v3 l3
3 abc no1 e1 self
4 no2 e2
5 no3 e3
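As a side note, on pandas >= 1.3 a similar result can be sketched with DataFrame.explode, which accepts several columns as long as their per-row lists have equal lengths (assuming here that x2 and x3 are the comma-separated columns):
# split both columns into lists, then explode them in lockstep
tmp = df.copy()
tmp[['x2', 'x3']] = tmp[['x2', 'x3']].apply(lambda c: c.str.split(','))
out = tmp.explode(['x2', 'x3'], ignore_index=True)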
The problem is simple, and so must be the solution, but I am not able to find it.
I want to find which row and column in Pandas DataFrame has minimum value and how much is it.
I have tried the following code (in addition to various combinations):
df = pd.DataFrame(data=[[4,5,6],[2,1,3],[7,0,5],[2,5,3]],
                  index=['R1','R2','R3','R4'],
                  columns=['C1','C2','C3'])
print(df)
print(df.loc[df.idxmin(axis=0), df.idxmin(axis=1)])
The dataframe (df) being searched is:
C1 C2 C3
R1 4 5 6
R2 2 1 3
R3 7 0 5
R4 2 5 3
Output for the loc command:
C1 C2 C2 C1
R2 2 1 1 2
R3 7 0 0 7
R2 2 1 1 2
What I need is:
C2
R3 0
How can I get this simple result?
Use:
# stack() turns the frame into a Series indexed by (row, column) pairs;
# idxmin() then returns the (row, column) labels of the minimum value
a, b = df.stack().idxmin()
print(df.loc[[a], [b]])
C2
R3 0
A variant of @John Zwinck's solution that also works with missing values: use numpy.nanargmin:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[4,5,6],[2,np.nan,3],[7,0,5],[2,5,3]],
                  index=['R1','R2','R3','R4'],
                  columns=['C1','C2','C3'])
print(df)
C1 C2 C3
R1 4 5.0 6
R2 2 NaN 3
R3 7 0.0 5
R4 2 5.0 3
#https://stackoverflow.com/a/3230123
ri, ci = np.unravel_index(np.nanargmin(df.values), df.shape)
print(df.iloc[[ri], [ci]])
C2
R3 0.0
I'd get the index this way:
np.unravel_index(np.argmin(df.values), df.shape)
This is much faster than df.stack().idxmin().
It gives you a tuple such as (2, 1) in your example. Pass that to df.iloc[] to get the value.
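For example, a small sketch using the sample df above:
import numpy as np

ri, ci = np.unravel_index(np.argmin(df.values), df.shape)
print(df.iloc[ri, ci])               # 0, the minimum value
print(df.index[ri], df.columns[ci])  # R3 C2, its row and column labels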
Or min+min+dropna+T+dropna+T:
>>> df[df==df.min(axis=1).min()].dropna(how='all').T.dropna().T
C2
R3 0.0
I have a pandas dataframe like the following:
import pandas as pd
pd.DataFrame({"AAA":["x1","x1","x1","x2","x2","x2"],
"BBB":["y1","y1","y2","y2","y2","y1"],
"CCC":["t1","t2","t3","t1","t1","t1"],
"DDD":[10,11,18,17,21,30]})
Out[1]:
AAA BBB CCC DDD
0 x1 y1 t1 10
1 x1 y1 t2 11
2 x1 y2 t3 18
3 x2 y2 t1 17
4 x2 y2 t1 21
5 x2 y1 t1 30
The problem
What I want is to group on column AAA so I have 2 groups - x1, x2.
I then want to calculate the ratio of y1 to y2 in column BBB for each group,
and assign this output to a new column, Ratio of BBB.
The desired output
So I want this as my output.
pd.DataFrame({"AAA":["x1","x1","x1","x2","x2","x2"],
"BBB":["y1","y1","y2","y2","y2","y1"],
"CCC":["t1","t2","t3","t1","t1","t1"],
"DDD":[10,11,18,17,21,30],
"Ratio of BBB":[0.33,0.33,0.33,0.66,0.66,0.66]})
Out[2]:
AAA BBB CCC DDD Ratio of BBB
0 x1 y1 t1 10 0.33
1 x1 y1 t2 11 0.33
2 x1 y2 t3 18 0.33
3 x2 y2 t1 17 0.66
4 x2 y2 t1 21 0.66
5 x2 y1 t1 30 0.66
Current status
I have currently achieved it like so:
def f(df):
    df["y1"] = sum(df["BBB"] == "y1")
    df["y2"] = sum(df["BBB"] == "y2")
    df["Ratio of BBB"] = df["y2"] / df["y1"]
    return df

df.groupby(df.AAA).apply(f)
What I want to achieve
Is there any way to achieve this with the .pipe() function?
I was thinking something like this:
df = (df
      .groupby(df.AAA)  # groupby a column not included in the current series (df.colname)
      .BBB
      .value_counts()
      .pipe(lambda series: series["BBB"] == "y2" / series["BBB"] == "y1")
      )
Edit: One solution using pipe()
N.B.: user jpp made a clear comment below:
unstack / merge / reset_index operations are unnecessary and expensive
However, since I initially intended to use this method, I thought I would share it here!
df = (df
      .groupby(df.AAA)       # groupby the column
      .BBB                   # select the column with values to calculate ('BBB' with y1 & y2)
      .value_counts()        # count the values (# of y1 per group, # of y2 per group)
      .unstack()             # turn the rows into columns (y1, y2)
      .pipe(lambda df: df["y1"] / df["y2"])  # calculate the ratio of y1:y2 (outputs a Series)
      .rename("ratio")       # name the series 'ratio' so it becomes the ratio column in the output df
      .reset_index()         # turn the groupby series into a dataframe
      .merge(df)             # merge with the original dataframe, filling in the columns with the key (AAA)
      )
Looks like you want the ratio of y2 to the total instead. Use groupby + value_counts:
v = df.groupby('AAA').BBB.value_counts().unstack()
df['RATIO'] = df.AAA.map(v.y2 / (v.y2 + v.y1))
AAA BBB CCC DDD RATIO
0 x1 y1 t1 10 0.333333
1 x1 y1 t2 11 0.333333
2 x1 y2 t3 18 0.333333
3 x2 y2 t1 17 0.666667
4 x2 y2 t1 21 0.666667
5 x2 y1 t1 30 0.666667
To generalise for many groups, you may use
df['RATIO'] = df.AAA.map(v.y2 / v.sum(axis=1))
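For reference, the intermediate v is the per-group count table produced by unstack():
v = df.groupby('AAA').BBB.value_counts().unstack()
print(v)
# BBB  y1  y2
# AAA
# x1    2   1
# x2    1   2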
Using groupby + transform with a custom function:
def ratio(x):
    counts = x.value_counts()
    return counts['y2'] / counts.sum()

df['Ratio of BBB'] = df.groupby('AAA')['BBB'].transform(ratio)
print(df)
AAA BBB CCC DDD Ratio of BBB
0 x1 y1 t1 10 0.333333
1 x1 y1 t2 11 0.333333
2 x1 y2 t3 18 0.333333
3 x2 y2 t1 17 0.666667
4 x2 y2 t1 21 0.666667
5 x2 y1 t1 30 0.666667
I am trying to iterate through the rows of a pandas data frame, checking each row for repeated values. If a value is repeated, I want to count how many times it repeats, disregarding the first occurrence, and record the count in a new column.
Input:
pd.DataFrame(
    [['K1', 'K2', 'K1', 'R3', 'R1', 'K3'],
     ['K2', 'K4', 'K4', 'R2', 'R2', 'R2']],
    columns=list('ASDFEI')
)
A S D F E I
0 K1 K2 K1 R3 R1 K3
1 K2 K4 K4 R2 R2 R2
In the first row, only K1 is repeated once, so the count is 1. In the second row, K4 is repeated once and R2 is repeated twice, so the count is 3.
IIUC, you can stack your frame and call groupby + value_counts
df['Count'] = (df.stack().groupby(level=0).value_counts() - 1).groupby(level=0).sum()
df
A S D F E I Count
0 K1 K2 K1 R3 R1 K3 1
1 K2 K4 K4 R2 R2 R2 3
Or, using insert (as shown by @Anton vBR),
df.insert(
    0, 'Count', (df.stack().groupby(level=0).value_counts() - 1).groupby(level=0).sum()
)
df
Count A S D F E I
0 1 K1 K2 K1 R3 R1 K3
1 3 K2 K4 K4 R2 R2 R2
This should work:
# insert a Count column with the count of duplicates (keep='first' is the default)
df.insert(0, 'Count', df.T.apply(pd.Series.duplicated).sum())
print(df)
Returns
Count A S D F E I
0 1 K1 K2 K1 R3 R1 K3
1 3 K2 K4 K4 R2 R2 R2
Update: You can create a boolean mask with pd.Series.isin() and ~ to filter away undesired results.
Use axis=1 to iterate over rows
Use sum(axis=1) to calculate sum of rows
Use astype(int) to convert the float result to int
import numpy as np

# create a new Series with the count of duplicates (keep='first' is the default)
newcol = (df.apply(lambda x: x[~x.isin(['TK', np.nan])]
                   .duplicated(), axis=1).sum(axis=1).astype(int))
# Insert column
df.insert(0,'Count', newcol)
print(df)
Returns:
Count A S D F E I
0 1 K1 TK K1 R3 TK K3
1 2 K2 NaN NaN R2 R2 R2
Given the following Data Frame:
tdf1 = pd.DataFrame({'A': ['r1', 'r1', 'r1', 'r2', 'r2', 'r2', 'r3'],
                     'B': ['t1', 't1', 't2', 't3', 't4', 't4', 't5']})
>>> tdf1
A B
0 r1 t1
1 r1 t1
2 r1 t2
3 r2 t3
4 r2 t4
5 r2 t4
6 r3 t5
I want to group the data by column A and create a column C that contains all elements from each group. So the resulting Data Frame should look like this:
>>> res
A B C
0 r1 t1 t1t2
1 r1 t1 t1t2
2 r1 t2 t1t2
3 r2 t3 t3t4
4 r2 t4 t3t4
5 r2 t4 t3t4
6 r3 t5 t5
I hoped that the following would do most of the work required:
tdf1.groupby('A')['B'].transform(lambda x: x.unique())
But instead of getting a set of unique values for each group, I just get column B repeated. It looks like x.unique() is applied to each cell instead of to all cells in the group.
However, if column B has numbers and I use x.sum() instead of x.unique(), the results are as expected: all cells in each group contain the sum of the group.
Is this a bug, or am I missing something?
I do not think this is a bug; transform broadcasts the result it gets to the same size as the group. When you send it a list of unique elements, it repeats the list so that it matches the size of the group; hence for the first group you get ['t1', 't2', 't1'], and then each element is assigned at each index.
If you want a string like 't1t2' in the resulting column, you should use str.join to join the result and provide that to transform. Example -
tdf1['C'] = tdf1.groupby('A')['B'].transform(lambda x: ''.join(x.unique()))
Demo -
In [9]: tdf1
Out[9]:
A B
0 r1 t1
1 r1 t1
2 r1 t2
3 r2 t3
4 r2 t4
5 r2 t4
6 r3 t5
In [10]: tdf1.groupby('A')['B'].transform(lambda x: ''.join(x.unique()))
Out[10]:
0 t1t2
1 t1t2
2 t1t2
3 t3t4
4 t3t4
5 t3t4
6 t5
Name: B, dtype: object
If you want the elements of column 'C' to be a list of the unique elements of the group, then you need to wrap x.unique() in another list. Example -
tdf1['C'] = tdf1.groupby('A')['B'].transform(lambda x: [x.unique()])
Demo -
In [11]: tdf1.groupby('A')['B'].transform(lambda x: [x.unique()])
Out[11]:
0 [t1, t2]
1 [t1, t2]
2 [t1, t2]
3 [t3, t4]
4 [t3, t4]
5 [t3, t4]
6 [t5]
Name: B, dtype: object
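As a side note, broadcasting a list through transform relies on version-specific pandas behavior; a sketch that avoids it is to aggregate the unique values once and map them back onto the grouping column:
# one array of unique values per group, mapped back by group key
uniq = tdf1.groupby('A')['B'].unique()
tdf1['C'] = tdf1['A'].map(uniq)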