Pandas: Add new column based on comparison of two DFs

Pandas: Add new column based on comparison of two DFs - python

I have 2 dataframes that I am wanting to compare one to the other and add a 'True/False' to a new column in the first based on the comparison.
My data resembles:
DF1:
cat sub-cat low high
3 3 1 208 223
4 3 1 224 350
8 4 1 223 244
9 4 1 245 350
13 5 1 232 252
14 5 1 253 350
DF2:
Cat Sub-Cat Rating
0 5 1 246
1 5 2 239
2 8 1 203
3 8 2 218
4 K 1 149
5 K 2 165
6 K 1 171
7 K 2 185
8 K 1 157
9 K 2 171
Desired result would be for DF2 to have an additional column with a True or False depending on if, based on the cat and sub-cat, that the rating is between the low.min() and high.max() or Null if no matches found to compare.
Have been running rounds with this for far too long with no results to speak of.
Thank you in advance for any assistance.
Update:
First row would look something like:
Cat Sub-Cat Rating In-Spec
0 5 1 246 True
As it falls within the min low and the max high.
Example: There are two rows in DF1 for cat = 5 and sub-cat = 2. I need to get the minimum low and the maximum high from those 2 rows and then check if the rating from row 0 in DF2 falls within the minimum low and maximum high from the two matching rows in DF1

join post groupby.agg
d2 = DF2.join(
DF1.groupby(
['cat', 'sub-cat']
).agg(dict(low='min', high='max')),
on=['Cat', 'Sub-Cat']
)
d2
Cat Sub-Cat Rating high low
0 5 1 246 350.0 232.0
1 5 2 239 NaN NaN
2 8 1 203 NaN NaN
3 8 2 218 NaN NaN
4 K 1 149 NaN NaN
5 K 2 165 NaN NaN
6 K 1 171 NaN NaN
7 K 2 185 NaN NaN
8 K 1 157 NaN NaN
9 K 2 171 NaN NaN
assign with .loc
DF2.loc[d2.eval('low <= Rating <= high'), 'In-Spec'] = True
DF2
Cat Sub-Cat Rating In-Spec
0 5 1 246 True
1 5 2 239 NaN
2 8 1 203 NaN
3 8 2 218 NaN
4 K 1 149 NaN
5 K 2 165 NaN
6 K 1 171 NaN
7 K 2 185 NaN
8 K 1 157 NaN
9 K 2 171 NaN

To add a new column based on a boolean expression would involve something along the lines of:
temp = boolean code involving inequality
df2['new column name'] = temp
However I'm not sure I understand, the first row in your DF2 table for instance, has a rating of 246, which means it's true for row 13 of DF1, but false for row 14. What would you like it to return?

You can do it like this
df2['In-Spec'] = 'False'
df2['In-Spec'][(df2['Rating'] > df1['low']) & (df2['Rating'] < df1['high'])] = 'True'
But which rows should be compared with each others? Do you want them to compare by their index or by their cat & subcat names?

Related

Check if column contains (/,-,_, *or~) and split in another column - Pandas

I have a column with numbers and one of these characters between them -,/,*,~,_. I need to check if values contain any of the characters, then split the value in another column. Is there a different solution than shown below? In the end, columns subnumber1, subnumber2 ...subnumber5 will be merged in one column and column "number5" will be without characters. Those two columns I need to use in further process. I'm a newbie in Python so any advice is welcome.
if gdf['column_name'].str.contains('~').any():
gdf[['number1', 'subnumber1']] = gdf['column_name'].str.split('~', expand=True)
gdf
if gdf['column_name'].str.contains('^').any():
gdf[['number2', 'subnumber2']] = gdf['column_name'].str.split('^', expand=True)
gdf
Input column:
column_name
152/6*3
163/1-6
145/1
163/6^3
output:
number5 |subnumber1 |subnumber2
152 | 6 | 3
163 | 1 | 6
145 | 1 |
163 | 6 | 3

Use Series.str.split with list of possible separators and create new DataFrame:
import re
L = ['-','/','*','~','_','^', '.']
#some values like `^.` are escape
pat = '|'.join(re.escape(x) for x in L)
df = df['column_name'].str.split(pat, expand=True).add_prefix('num')
print (df)
num0 num1 num2
0 152 6 3
1 163 1 6
2 145 1 None
3 163 6 3
EDIT: If need match values before value use:
L = ["\-_",'\^|\*','~','/']
for val in L:
df[f'before {val}'] = df['column_name'].str.extract(rf'(\d+){[val]}')
#for last value not exist separator, so match $ for end of string
df['last'] = df['column_name'].str.extract(rf'(\d+)$')
print (df)
column_name before \-_ before \^|\* before ~ before / last
0 152/2~3_4*5 3 4 2 152 5
1 152/2~3-4^5 4 4 2 152 5
2 152/6*3 NaN 6 NaN 152 3
3 163/1-6 NaN NaN NaN 163 6
4 145/1 NaN NaN NaN 145 1
5 163/6^3 6 6 NaN 163 3

Use str.split:
df['column_name'].str.split(r'[*,-/^_]', expand=True)
output:
0 1 2
0 152 6 3
1 163 1 6
2 145 1 None
3 163 6 3
Or, if you know in advance that you have 3 numbers, use str.extract and named capturing groups:
regex = '(?P<number5>\d+)\D*(?P<subnumber1>\d*)\D*(?P<subnumber2>\d*)'
df['column_name'].str.extract(regex)
output:
number5 subnumber1 subnumber2
0 152 6 3
1 163 1 6
2 145 1
3 163 6 3

Add a portion of a dataframe to another dataframe

Suppose to have two dataframes, df1 and df2, with equal number of columns, but different number of rows, e.g:
df1 = pd.DataFrame([(1,2),(3,4),(5,6),(7,8),(9,10),(11,12)], columns=['a','b'])
a b
1 1 2
2 3 4
3 5 6
4 7 8
5 9 10
6 11 12
df2 = pd.DataFrame([(100,200),(300,400),(500,600)], columns=['a','b'])
a b
1 100 200
2 300 400
3 500 600
I would like to add df2 to the df1 tail (df1.loc[df2.shape[0]:]), thus obtaining:
a b
1 1 2
2 3 4
3 5 6
4 107 208
5 309 410
6 511 612
Any idea?
Thanks!

If there is more rows in df1 like in df2 rows is possible use DataFrame.iloc with convert values to numpy array for avoid alignment (different indices create NaNs):
df1.iloc[-df2.shape[0]:] += df2.to_numpy()
print (df1)
a b
0 1 2
1 3 4
2 5 6
3 107 208
4 309 410
5 511 612
For general solution working with any number of rows with unique indices in both Dataframe with rename and DataFrame.add:
df = df1.add(df2.rename(dict(zip(df2.index[::-1], df1.index[::-1]))), fill_value=0)
print (df)
a b
0 1.0 2.0
1 3.0 4.0
2 5.0 6.0
3 107.0 208.0
4 309.0 410.0
5 511.0 612.0

Compute percentage changes with next row Pandas

I want to compute the percentage change with the next n row. I've tried pct_change() but I don't get the expected results
For example, with n=1
close return_n
0 100 1.00%
1 101 -0.99%
2 100 -1.00%
3 99 -4.04%
4 95 7.37%
5 102 NaN
With n=2
close return_n
0 100 0.00%
1 101 -1.98%
2 100 -5.00%
3 99 3.03%
4 95 NaN
5 102 NaN

Use pct_change and shift:
N = 2
df['return_n'] = df['close'].pct_change(N).mul(100).round(2).shift(-N)
print(df)
# Output:
close return_n
0 100 0.00
1 101 -1.98
2 100 -5.00
3 99 3.03
4 95 NaN
5 102 NaN

You can do shift with pct_change
n = 2
df['new'] = df.close.pct_change(periods=n).shift(-n)
df
Out[247]:
close return_n new
0 100 1.00% 0.000000
1 101 -0.99% -0.019802
2 100 -1.00% -0.050000
3 99 -4.04% 0.030303
4 95 7.37% NaN
5 102 NaN NaN

Pandas DataFrame - Creating dynamic number of columns

Data:
qid qualid val
0 1845631864 227 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1 1899053658 44 1,3,3,2,2,2,3,3,4,4,4,5,5,5,5,5,5,5
2 1192887045 197 704
3 1833579269 194 139472
4 1497352469 30 120026,170154,152723,90407,63119,80077,178871,...
Problem:
Numbers separated by commas in column val need to represented in different columns for each row.
I don't know if Pandas allows for it, but ideally, one would want to create exact n number of columns for each row, where n is the number of elements in column val.
If that is not possible, the greatest number of elements in val should be the number of columns and rows where elements are lesser than that should consist of NaNs.
Example Solution 1 for Above Problem:
qid qualid val1 val2 val3 valn-3 valn-2 valn-1 valn
0 1845631864 227 0 0 0 ...... 0 0 0 0
1 1899053658 44 1 3 3 ...... 5
2 1192887045 197 704
3 1833579269 194 139472
4 1497352469 30 120026 170154 152723.....63119 80077 178871 12313
Alternate Solution 2 for Above Problem:
qid qualid val1 val2 val3 valn-3 valn-2 valn-1 valn
0 1845631864 227 0 0 0 ...... 0 0 0 0
1 1899053658 44 1 3 3 ...... 5 NaN NaN NaN
2 1192887045 197 704 NaN NaN ...... NaN NaN NaN NaN
3 1833579269 194 139472 NaN NaN ...... NaN NaN NaN
4 1497352469 30 120026 170154 152723.....63119 80077 178871 12313

You can check str.split
pd.concat([df,df.val.str.split(',',expand=True).add_prefix('Val_')],axis=1)
Out[29]:
qid qualid ... Val_16 Val_17
0 1845631864 227 ... 0 0
1 1899053658 44 ... 5 5
2 1192887045 197 ... None None
3 1833579269 194 ... None None
4 1497352469 30 ... None None

Drop rows after maximum value in a grouped Pandas dataframe

I've got a date-ordered dataframe that can be grouped. What I am attempting to do is groupby a variable (Person), determine the maximum (weight) for each group (person), and then drop all rows that come after (date) the maximum.
Here's an example of the data:
df = pd.DataFrame({'Person': 1,1,1,1,1,2,2,2,2,2],'Date': '1/1/2015','2/1/2015','3/1/2015','4/1/2015','5/1/2015','6/1/2011','7/1/2011','8/1/2011','9/1/2011','10/1/2011'], 'MonthNo':[1,2,3,4,5,1,2,3,4,5], 'Weight':[100,110,115,112,108,205,210,211,215,206]})
Date MonthNo Person Weight
0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
3 4/1/2015 4 1 112
4 5/1/2015 5 1 108
5 6/1/2011 1 2 205
6 7/1/2011 2 2 210
7 8/1/2011 3 2 211
8 9/1/2011 4 2 215
9 10/1/2011 5 2 206
Here's what I want the result to look like:
Date MonthNo Person Weight
0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
5 6/1/2011 1 2 205
6 7/1/2011 2 2 210
7 8/1/2011 3 2 211
8 9/1/2011 4 2 215
I think its worth noting, there can be disjoint start dates and the maximum may appear at different times.
My idea was to find the maximum for each group, obtain the MonthNo the maximum was in for that group, and then discard any rows with MonthNo greater Max Weight MonthNo. So far I've been able to obtain the max by group, but cannot get past doing a comparison based on that.
Please let me know if I can edit/provide more information, haven't posted many questions here! Thanks for the help, sorry if my formatting/question isn't clear.

Using idxmax with groupby
df.groupby('Person',sort=False).apply(lambda x : x.reset_index(drop=True).iloc[:x.reset_index(drop=True).Weight.idxmax()+1,:])
Out[131]:
Date MonthNo Person Weight
Person
1 0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
2 0 6/1/2011 1 2 205
1 7/1/2011 2 2 210
2 8/1/2011 3 2 211
3 9/1/2011 4 2 215

You can use groupby.transform with idxmax. The first 2 steps may not be necessary depending on how your dataframe is structured.
# convert Date to datetime
df['Date'] = pd.to_datetime(df['Date'])
# sort by Person and Date to make index usable for next step
df = df.sort_values(['Person', 'Date']).reset_index(drop=True)
# filter for index less than idxmax transformed by group
df = df[df.index <= df.groupby('Person')['Weight'].transform('idxmax')]
print(df)
Date MonthNo Person Weight
0 2015-01-01 1 1 100
1 2015-02-01 2 1 110
2 2015-03-01 3 1 115
5 2011-06-01 1 2 205
6 2011-07-01 2 2 210
7 2011-08-01 3 2 211
8 2011-09-01 4 2 215

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas: Add new column based on comparison of two DFs - python

You can do it like this df2['In-Spec'] = 'False' df2['In-Spec'][(df2['Rating'] > df1['low']) & (df2['Rating'] < df1['high'])] = 'True' But which rows should be compared with each others? Do you want them to compare by their index or by their cat & subcat names?

Related

Check if column contains (/,-,_, *or~) and split in another column - Pandas

Add a portion of a dataframe to another dataframe

Compute percentage changes with next row Pandas

Pandas DataFrame - Creating dynamic number of columns

Drop rows after maximum value in a grouped Pandas dataframe

Categories

Resources