pandas subtracting two grouped dataframes of different size - python

I have two dataframes:
My stock solutions (df1):
pH   salt_conc
5.5  0            23596.0
     200          19167.0
     400          17052.5
6.0  0            37008.5
     200          27652.0
     400          30385.5
6.5  0            43752.5
     200          41146.0
     400          39965.0
And my measurements after I did something (df2):
pH   salt_conc  id
5.5  0          8     20953.0
                11    24858.0
     200        3     20022.5
     400        13    17691.0
                20    18774.0
6.0  0          14    38639.0
     200        1     37223.5
                2     36597.0
                7     37039.0
                10    37088.5
                15    35968.5
                16    36344.5
                17    34894.0
                18    36388.5
     400        9     33386.0
6.5  0          4     41401.5
                12    44933.5
     200        5     43074.5
     400        6     42210.5
                19    41332.5
I would like to normalize each measurement in the second dataframe (df2) by the corresponding stock solution (df1) from which I took the sample.
Any suggestions?

Figured it out with the help of this post:
SO: Binary operation broadcasting across multiindex
I had to reset the index of both grouped dataframes and set it again.
df_initial = df_initial.reset_index().set_index(['pH','salt_conc'])
df_second = df_second.reset_index().set_index(['pH','salt_conc'])
Now I can do any calculation I want.
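Once `reset_index()` has turned the grouping keys into ordinary columns, one way to carry out the normalization is a left merge on `pH` and `salt_conc` followed by a division. The sketch below is an alternative to index-based broadcasting, not necessarily the code used above. The column names `stock_signal`, `signal`, and `normalized` are assumptions, since the value columns in the post are unnamed, and only a subset of the rows is reproduced.

```python
import pandas as pd

# Stock solutions: one value per (pH, salt_conc). Subset of df1 from the question.
stock = pd.DataFrame({
    "pH": [5.5, 5.5, 6.0],
    "salt_conc": [0, 200, 0],
    "stock_signal": [23596.0, 19167.0, 37008.5],  # hypothetical column name
})

# Measurements: several samples (id) per (pH, salt_conc). Subset of df2.
meas = pd.DataFrame({
    "pH": [5.5, 5.5, 5.5, 6.0],
    "salt_conc": [0, 0, 200, 0],
    "id": [8, 11, 3, 14],
    "signal": [20953.0, 24858.0, 20022.5, 38639.0],  # hypothetical column name
})

# Attach the matching stock value to every measurement row.
merged = meas.merge(stock, on=["pH", "salt_conc"], how="left")

# Normalize each measurement by the stock solution it was taken from.
merged["normalized"] = merged["signal"] / merged["stock_signal"]

print(merged[["pH", "salt_conc", "id", "normalized"]])
```

Because the merge is a left join, every measurement row is kept, and a measurement without a matching stock solution would show up as NaN rather than being dropped silently.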

Related

Fastest way to access a dataframe cell by column values?

I have the following dataframe :
   time  bk1_lvl0_id  bk2_lvl0_id  pr_ss  order_upto_level  initial_inventory  leadtime1  leadtime2  adjusted_leadtime
0  2020         1000            3     16                18                 17          3   0.100000                  1
1  2020        10043            3     65                78                 72         12   0.400000                  1
2  2020         1005            3      0                 1                  1          9   0.300000                  1
3  2020         1009            3    325               363                344         21   0.700000                  1
4  2020          102            3      0                 1                  1          7   0.233333                  1
I want a function that returns pr_ss for a given pair of keys, for example (bk1_lvl0_id=1000, bk2_lvl0_id=3).
This is the code I've tried, but it is slow:
def get_safety_stock(df, bk1, bk2):
    ## a function that returns the safety stock for any given (bk1, bk2)
    for index, row in df.iterrows():
        if (row["bk1_lvl0_id"] == bk1) and (row["bk2_lvl0_id"] == bk2):
            return int(row["pr_ss"])
            break
If your dataframe has no duplicate rows for a given (bk1_lvl0_id, bk2_lvl0_id) pair, you can write the function as follows:
def get_safety_stock(df, bk1, bk2):
    return df.loc[df.bk1_lvl0_id.eq(bk1) & df.bk2_lvl0_id.eq(bk2), 'pr_ss'].iloc[0]
Note that this takes the first value of the resulting Series by position (.iloc[0]), which is not an issue if there are no duplicates in the data. A plain [0] would look up the index label 0 instead and fail for any match that is not in the first row. If you want all matches, remove the .iloc[0] from the end and you get the whole Series. The function can be called as follows:
get_safety_stock(df, 1000, 3)
>>>16
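If the function is called many times, it may be cheaper to index the two key columns once and then look values up by key. The following is a minimal sketch of that idea, assuming the (bk1_lvl0_id, bk2_lvl0_id) pairs are unique. The helper name `build_lookup` and the variable `lookup` are hypothetical, and only the three relevant columns of the sample data are reproduced.

```python
import pandas as pd

df = pd.DataFrame({
    "bk1_lvl0_id": [1000, 10043, 1005, 1009, 102],
    "bk2_lvl0_id": [3, 3, 3, 3, 3],
    "pr_ss": [16, 65, 0, 325, 0],
})

def build_lookup(df):
    # Hypothetical helper: index pr_ss by the two key columns, once.
    # Sorting the index keeps key lookups efficient.
    return df.set_index(["bk1_lvl0_id", "bk2_lvl0_id"])["pr_ss"].sort_index()

def get_safety_stock(lookup, bk1, bk2):
    # With a unique index, .loc with a full key tuple returns a single value.
    return int(lookup.loc[(bk1, bk2)])

lookup = build_lookup(df)
print(get_safety_stock(lookup, 1000, 3))
```

The cost of building the index is paid once, so this only helps when the same dataframe is queried repeatedly. A missing key raises a `KeyError` here instead of returning `None` as the loop-based version does.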

Get maximum relative difference between row-values and row-mean in new pandas dataframe column

I want to have an extra column with the maximum relative difference [-] of the row-values and the mean of these rows:
The df is filled with energy use data for several years.
The theoretical formula that should get me this is as follows:
df['max_rel_dif'] = MAX [ ABS(highest energy use – mean energy use), ABS(lowest energy use – mean energy use)] / mean energy use
Initial dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014
0 23 22631 21954.0 22314.0 22032 21843
1 43 27456 29654.0 28159.0 28654 2000
2 36 61200 NaN NaN 31895 1600
3 87 87621 86542.0 87542.0 88456 86961
4 90 58951 57486.0 2000.0 0 0
5 98 24587 25478.0 NaN 24896 25461
Desired dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014 max_rel_dif
0 23 22631 21954.0 22314.0 22032 21843 0.02149
1 43 27456 29654.0 28159.0 28654 2000 0.91373
2 36 61200 NaN NaN 31895 1600 0.94931
3 87 87621 86542.0 87542.0 88456 86961 0.01179
4 90 58951 57486.0 2000.0 0 0 1.48870
5 98 24587 25478.0 NaN 24896 25461 0.02065
tried code:
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [23,43,36,87,90,98],
"y_2010": [22631,27456,61200,87621,58951,24587],
"y_2011": [21954,29654,np.nan,86542,57486,25478],
"y_2012": [22314,28159,np.nan,87542,2000,np.nan],
"y_2013": [22032,28654,31895,88456,0,24896,],
"y_2014": [21843,2000,1600,86961,0,25461]})

print(df)

a = df.loc[:, ['y_2010','y_2011','y_2012','y_2013', 'y_2014']]


# calculate mean
mean = a.mean(1)
# calculate max_rel_dif
df['max_rel_dif'] = (((df.max(axis=1).sub(mean)).abs(),(df.min(axis=1).sub(mean)).abs()).max()).div(mean)
# AttributeError: 'tuple' object has no attribute 'max'
-> I'm obviously doing the wrong thing with the tuple. I just don't know how to get the maximum values
from the tuples and then divide them by the mean in the proper Pythonic way.
I feel like the whole function can be
s=df.filter(like='y')
s.sub(s.mean(1),axis=0).abs().max(1)/s.mean(1)
0 0.021494
1 0.913736
2 0.949311
3 0.011800
4 1.488707
5 0.020653
dtype: float64
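For comparison, the formula from the question can also be written out term by term. The sketch below uses `np.maximum` to take the element-wise maximum of the two absolute deviations, which is the step the tuple in the attempted code could not perform. It should give the same numbers as the one-liner above; the variable names `years`, `dev_high`, and `dev_low` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [23, 43, 36, 87, 90, 98],
    "y_2010": [22631, 27456, 61200, 87621, 58951, 24587],
    "y_2011": [21954, 29654, np.nan, 86542, 57486, 25478],
    "y_2012": [22314, 28159, np.nan, 87542, 2000, np.nan],
    "y_2013": [22032, 28654, 31895, 88456, 0, 24896],
    "y_2014": [21843, 2000, 1600, 86961, 0, 25461],
})

years = df.filter(like="y_")   # only the yearly columns, not ID
mean = years.mean(axis=1)      # NaN values are skipped by default

# Absolute deviation of the row maximum and row minimum from the row mean.
dev_high = (years.max(axis=1) - mean).abs()
dev_low = (years.min(axis=1) - mean).abs()

# Element-wise maximum of the two deviations, relative to the row mean.
df["max_rel_dif"] = np.maximum(dev_high, dev_low) / mean

print(df["max_rel_dif"].round(5))
```

Selecting the yearly columns first matters: the attempted code called `df.max(axis=1)` on the full frame, which would also include the ID column in the row maximum.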

How to insert rows with 0 data for missing quarters into a pandas dataframe?

I have a dataframe with specific Quota values for given quarters (YYYY-Qx format), and need to visualize them with some linecharts. However, some of the quarters are missing (as there was no Quota during those quarters).
Period Quota
2017-Q1 500
2017-Q3 600
2018-Q2 700
I want to add them (starting at 2017-Q1 until today, so 2019-Q2) to the dataframe with a default value of 0 in the Quota column. A desired output would be the following:
Period Quota
2017-Q1 500
2017-Q2 0
2017-Q3 600
2017-Q4 0
2018-Q1 0
2018-Q2 700
2018-Q3 0
2018-Q4 0
2019-Q1 0
2019-Q2 0
I tried
df['Period'] = pd.to_datetime(df['Period']).dt.to_period('Q')
And then resampling the df with 'Q' frequency, but I must be doing something wrong, as it doesn't help with anything.
Any help would be much appreciated.
Use:
df.index = pd.to_datetime(df['Period']).dt.to_period('Q')
end = pd.Period(pd.Timestamp.now(), freq='Q')  # pd.datetime was removed in pandas 2.0
df = (df['Quota'].reindex(pd.period_range(df.index.min(), end), fill_value=0)
.rename_axis('Period')
.reset_index()
)
df['Period'] = df['Period'].dt.strftime('%Y-Q%q')
print (df)
Period Quota
0 2017-Q1 500
1 2017-Q2 0
2 2017-Q3 600
3 2017-Q4 0
4 2018-Q1 0
5 2018-Q2 700
6 2018-Q3 0
7 2018-Q4 0
8 2019-Q1 0
9 2019-Q2 0
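Because the snippet above uses the current date as the end of the range, its output changes over time. A minimal sketch with a fixed end quarter, which reproduces the table above, might look like this. The fixed `end` value and the variable names are assumptions made for illustration, and the hyphen is stripped before parsing only to stay with the plain `2017Q1` spelling.

```python
import pandas as pd

df = pd.DataFrame({"Period": ["2017-Q1", "2017-Q3", "2018-Q2"],
                   "Quota": [500, 600, 700]})

# Parse the quarter labels into a quarterly PeriodIndex ("2017-Q1" -> "2017Q1").
periods = pd.PeriodIndex(df["Period"].str.replace("-", "", regex=False), freq="Q")
quota = pd.Series(df["Quota"].to_numpy(), index=periods)

# Fixed end quarter (assumption) instead of "today".
end = pd.Period("2019Q2", freq="Q")
full_range = pd.period_range(periods.min(), end, freq="Q")

# Quarters missing from the data get a Quota of 0.
filled = quota.reindex(full_range, fill_value=0)

# Format the periods back into the original YYYY-Qx labels.
result = pd.DataFrame({
    "Period": [f"{p.year}-Q{p.quarter}" for p in filled.index],
    "Quota": filled.to_numpy(),
})
print(result)
```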
# An alternate solution based on a left join
qtr = ['Q1', 'Q2', 'Q3', 'Q4']
finl = []
for i in range(2017, 2020):
    for j in qtr:
        finl.append(str(i) + '_' + j)
df1 = pd.DataFrame({'year_qtr': finl}).reset_index(drop=True)
df1.head(2)
original_value = ['2017_Q1', '2017_Q3', '2018_Q2']
df_original = pd.DataFrame({'year_qtr': original_value,
                            'value': [500, 600, 700]}).reset_index(drop=True)
final = pd.merge(df1, df_original, how='left', left_on=['year_qtr'], right_on=['year_qtr'])
final.fillna(0)
Output
year_qtr value
0 2017_Q1 500.0
1 2017_Q2 0.0
2 2017_Q3 600.0
3 2017_Q4 0.0
4 2018_Q1 0.0
5 2018_Q2 700.0
6 2018_Q3 0.0
7 2018_Q4 0.0
8 2019_Q1 0.0
9 2019_Q2 0.0
10 2019_Q3 0.0
11 2019_Q4 0.0

Scale values of a particular column of python dataframe between 1-10

I have a dataframe which contains YouTube video view counts, and I want to scale these values into the range 1-10.
Below is a sample of what the values look like. How do I normalize them into the range 1-10, or is there a more efficient way to do this?
rating
4394029
274358
473691
282858
703750
255967
3298456
136643
796896
2932
220661
48688
4661584
2526119
332176
7189818
322896
188162
157437
1153128
788310
1307902
One possibility is scaling by the maximum. This maps the largest value to 10, but the smallest value maps to exactly 1 only if it is 0.
1 + df / df.max() * 9
rating
0 6.500315
1 1.343433
2 1.592952
3 1.354073
4 1.880933
5 1.320412
6 5.128909
7 1.171046
8 1.997531
9 1.003670
10 1.276217
11 1.060946
12 6.835232
13 4.162121
14 1.415808
15 10.000000
16 1.404192
17 1.235536
18 1.197075
19 2.443451
20 1.986783
21 2.637193
Similar solution by Wen (now deleted):
1 + (df - df.min()) * 9 / (df.max() - df.min())
rating
0 6.498887
1 1.339902
2 1.589522
3 1.350546
4 1.877621
5 1.316871
6 5.126922
7 1.167444
8 1.994266
9 1.000000
10 1.272658
11 1.057299
12 6.833941
13 4.159739
14 1.412306
15 10.000000
16 1.400685
17 1.231960
18 1.193484
19 2.440368
20 1.983514
21 2.634189
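The min-max formula generalizes to any target range. A small helper along these lines maps the smallest value to `lo` and the largest to `hi`. The name `scale_to_range` and its parameters are hypothetical and not part of pandas, and the sketch refuses constant columns, since the denominator would be zero.

```python
import pandas as pd

def scale_to_range(s, lo=1.0, hi=10.0):
    # Hypothetical helper: linear min-max scaling of a Series into [lo, hi].
    span = s.max() - s.min()
    if span == 0:
        raise ValueError("cannot scale a constant column")
    return lo + (s - s.min()) * (hi - lo) / span

# A few of the view counts from the question, including its min and max.
views = pd.Series([4394029, 274358, 2932, 7189818], name="rating")
scaled = scale_to_range(views)
print(scaled)
```

With the defaults this is the same computation as `1 + (df - df.min()) * 9 / (df.max() - df.min())`, applied to a single column.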

h2o python prefix an existing column in an h2o data frame with a string

How do I prefix an existing column in an h2o data frame with a string value in Python? The column is numerical to begin with. I have been able to do this in the R version of H2O, but I can't get it right in the Python version.
In R this seems to work.
h2o.init()
df = as.h2o(mtcars)
df['mpg']=h2o.ascharacter(df['mpg'])
df['mpg']=h2o.sub('','hey--------',df['mpg'])
df
However, when I try to do this in python I get a variety of errors. Sometimes I'm able to adjust the numerical column to a string without an error but then when I go and look at the data frame I receive an error. I'll post the code if needed. Given that they are the same functions I imagine it should be relatively easy but I must be missing something.
EDITED
(didn't answer original question the first time, answering it now)
This is how you would convert a numerical column to a column with string values and then replace those values.
import h2o
prostate = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv"
h2o.init()
df = h2o.import_file(prostate)
# creating your example column with all values equal to 23
df['mpg'] = 23
df['mpg'] = df['mpg'].ascharacter()
df[1,'mpg'] # see that it is now a string
df['mpg']=df['mpg'].sub('23', 'please-help-me----23')
df
Out[16]:
  ID    CAPSULE    AGE    RACE    DPROS    DCAPS    PSA    VOL    GLEASON  mpg
----  ---------  -----  ------  -------  -------  -----  -----  ---------  --------------------
   1          0     65       1        2        1    1.4      0          6  please-help-me----23
   2          0     72       1        3        2    6.7      0          7  please-help-me----23
   3          0     70       1        1        2    4.9      0          6  please-help-me----23
   4          0     76       2        2        1   51.2     20          7  please-help-me----23
   5          0     69       1        1        1   12.3   55.9          6  please-help-me----23
   6          1     71       1        3        2    3.3      0          8  please-help-me----23
   7          0     68       2        4        2   31.9      0          7  please-help-me----23
   8          0     61       2        4        2   66.7   27.2          7  please-help-me----23
   9          0     69       1        1        1    3.9     24          7  please-help-me----23
  10          0     68       2        1        2     13      0          6  please-help-me----23
[380 rows x 10 columns]
(answering the wrong question below:)
You have to pass a new list of column names (the same length as your original column list).
df.columns = new_column_list
For example, I can rename the column ID to NEW:
import h2o
prostate = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv"
h2o.init()
df = h2o.import_file(prostate)
print(df.columns)
columns = df.columns  # take the current list of column names
columns[0] = 'NEW'
df.columns = columns
print(df.columns)
which will show:
Checking whether there is an H2O instance running at http://localhost:54321. connected.
-------------------------- ------------------------------
H2O cluster uptime: 9 hours 31 mins
H2O cluster version: 3.10.4.8
H2O cluster version age: 1 month and 6 days
H2O cluster name: H2O_from_python_laurend_tzhifp
H2O cluster total nodes: 1
H2O cluster free memory: 3.276 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: locked, healthy
H2O connection url: http://localhost:54321
H2O connection proxy:
H2O internal security: False
Python version: 3.5.1 final
-------------------------- ------------------------------
Parse progress: |████████████████████████████████████████████████████████████████████████████| 100%
['ID', 'CAPSULE', 'AGE', 'RACE', 'DPROS', 'DCAPS', 'PSA', 'VOL', 'GLEASON']
['NEW', 'CAPSULE', 'AGE', 'RACE', 'DPROS', 'DCAPS', 'PSA', 'VOL', 'GLEASON']
