Pandas pd.merge gives nan - python

I have two dataframes which I need to merge/join based on a column. When I try to join/merge them, the new columns give NaN.
Basically, I need to perform Left Join on the dataframes, considering df_user as the dataframe on the Left.
PS: The join column has the same datatype in both dataframes.
Please find the dataframes below -
df_user.dtypes
App category
Sentiment int8
Sentiment_Polarity float64
Sentiment_Subjectivity float64
df_play.dtypes
App category
Category category
Rating float64
Reviews float64
Size float64
Installs int64
Type int8
Price float64
Content Rating int8
Installs_Cat int8
df_play.head()
App Category Rating Reviews Size Installs Type Price Content Rating Installs_Cat
0 SPrapBook ART_AND_DESIGN 4.1 159 19 10000 0 0 0 9
1 U Launcher ART_AND_DESIGN 4.5 87510 25 5000000 0 0 0 14
2 Sketch - ART_AND_DESIGN 4.3 215644 2.8 50000000 0 0 1 16
3 Pixel Dra ART_AND_DESIGN 4.4 967 5.6 100000 0 0 0 11
4 Paper flo ART_AND_DESIGN 3.8 167 19 50000 0 0 0 10
df_user.head()
App Sentiment Sentiment_Polarity Sentiment_Subjectivity
0 10 Best Foods for You 2 1.00 0.533333
1 10 Best Foods for You 2 0.25 0.288462
3 10 Best Foods for You 2 0.40 0.875000
4 10 Best Foods for You 2 1.00 0.300000
5 10 Best Foods for You 2 1.00 0.300000
I tried both of the snippets below -
result = pd.merge(df_user, df_play, how='left', on='App')
result = df_user.join(df_play.set_index('App'), on='App', how='left', rsuffix='_y')
But all I got was -
App Sentiment Sentiment_Polarity Sentiment_Subjectivity Category Rating Reviews Size Installs Type Price Content Rating Installs_Cat
0 10 Best Foods for You 2 1.00 0.533333 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 10 Best Foods for You 2 0.25 0.288462 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 10 Best Foods for You 2 0.40 0.875000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 10 Best Foods for You 2 1.00 0.300000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 10 Best Foods for You 2 1.00 0.300000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Please excuse me for the formatting.
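A merge that returns only NaN in the new columns usually means the key values do not actually match even though the dtypes do; hidden whitespace or categorical columns with differing categories are common culprits. A quick diagnostic sketch (assuming the frames above):
# How many App values from df_user actually appear in df_play?
print(df_user['App'].isin(df_play['App']).sum())
# If that prints 0, normalize the keys to plain stripped strings and retry:
df_user['App'] = df_user['App'].astype(str).str.strip()
df_play['App'] = df_play['App'].astype(str).str.strip()
result = pd.merge(df_user, df_play, how='left', on='App')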

Related

Append Columns C and D to A and B in Pandas DataFrame

I have a data frame which is 12 columns by 12 rows. I want to append columns C and D, E and F, G and H, and I and J below columns A and B, making it 2x72. When I try to do it with:
df = dffvstats.iloc[:,[0,1]]
df2 = dffvstats.iloc[:,[2,3]]
df.append(df2)
it gives me:
0 1 2 3
0 Index DJIA S&P500 NaN NaN
1 Market Cap 2521.74B NaN NaN
2 Income 86.80B NaN NaN
3 Sales 347.15B NaN NaN
4 Book/sh 3.87 NaN NaN
5 Cash/sh 3.76 NaN NaN
6 Dividend 0.88 NaN NaN
7 Dividend % 0.57% NaN NaN
8 Employees 147000 NaN NaN
9 Optionable Yes NaN NaN
10 Shortable Yes NaN NaN
11 Recom 1.90 NaN NaN
0 NaN NaN P/E 30.09
1 NaN NaN Forward P/E 27.12
2 NaN NaN PEG 1.53
3 NaN NaN P/S 7.26
4 NaN NaN P/B 39.70
5 NaN NaN P/C 40.87
6 NaN NaN P/FCF 31.35
7 NaN NaN Quick Ratio 1.00
8 NaN NaN Current Ratio 1.10
9 NaN NaN Debt/Eq 1.89
10 NaN NaN LT Debt/Eq 1.65
11 NaN NaN SMA20 3.47%
Anyone know how to do this?
If your dffvstats dataframe looks like:
>>> dffvstats
0 1 2 3 # 4 5 6 7 and so on
0 Index DJIA S&P500 P/E 30.09
1 Market Cap 2521.74B Forward P/E 27.12
2 Income 86.80B PEG 1.53
3 Sales 347.15B P/S 7.26
4 Book/sh 3.87 P/B 39.70
5 Cash/sh 3.76 P/C 40.87
6 Dividend 0.88 P/FCF 31.35
7 Dividend % 0.57% Quick Ratio 1.00
8 Employees 147000 Current Ratio 1.10
9 Optionable Yes Debt/Eq 1.89
10 Shortable Yes LT Debt/Eq 1.65
11 Recom 1.90 SMA20 3.47%
And you want this:
>>> out
0 1
0 Index DJIA S&P500
1 Market Cap 2521.74B
2 Income 86.80B
3 Sales 347.15B
4 Book/sh 3.87
5 Cash/sh 3.76
6 Dividend 0.88
7 Dividend % 0.57%
8 Employees 147000
9 Optionable Yes
10 Shortable Yes
11 Recom 1.90
12 P/E 30.09
13 Forward P/E 27.12
14 PEG 1.53
15 P/S 7.26
16 P/B 39.70
17 P/C 40.87
18 P/FCF 31.35
19 Quick Ratio 1.00
20 Current Ratio 1.10
21 Debt/Eq 1.89
22 LT Debt/Eq 1.65
23 SMA20 3.47%
Try:
import numpy as np

out = pd.DataFrame(np.concatenate([dffvstats.loc[:, cols].values
                                   for cols in zip(dffvstats.columns[::2],
                                                   dffvstats.columns[1::2])]))
Explanation:
Start by getting the columns pairwise with zip:
>>> list(dffvstats.columns[::2]) # even columns
[0, 2]
>>> list(dffvstats.columns[1::2]) # odd columns
[1, 3]
>>> list(zip(dffvstats.columns[::2], dffvstats.columns[1::2]))
[(0, 1), (2, 3)]
Split the dataframe into sub-arrays:
tmp = [dffvstats.loc[:, cols].values
       for cols in zip(dffvstats.columns[::2], dffvstats.columns[1::2])]
>>> tmp[0]
array([['Index', 'DJIA S&P500'],
['Market Cap', '2521.74B'],
['Income', '86.80B'],
['Sales', '347.15B'],
['Book/sh', '3.87'],
['Cash/sh', '3.76'],
['Dividend', '0.88'],
['Dividend %', '0.57%'],
['Employees', '147000'],
['Optionable', 'Yes'],
['Shortable', 'Yes'],
['Recom', '1.90']], dtype=object)
>>> tmp[1]
array([['P/E', '30.09'],
['Forward P/E', '27.12'],
['PEG', '1.53'],
['P/S', '7.26'],
['P/B', '39.70'],
['P/C', '40.87'],
['P/FCF', '31.35'],
['Quick Ratio', '1.00'],
['Current Ratio', '1.10'],
['Debt/Eq', '1.89'],
['LT Debt/Eq', '1.65'],
['SMA20', '3.47%']], dtype=object)
Concatenate all the small arrays into one big dataframe with np.concatenate instead of pd.concat, because you have a list of ndarrays rather than DataFrames.
Try with pd.concat:
pd.concat([df[['A', 'B']],
           df[['C', 'D']].rename(columns={'C': 'A', 'D': 'B'}),
           df[['E', 'F']].rename(columns={'E': 'A', 'F': 'B'}),
           df[['G', 'H']].rename(columns={'G': 'A', 'H': 'B'}),
           df[['I', 'J']].rename(columns={'I': 'A', 'J': 'B'})])
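With more pairs this gets tedious to write out; the same idea can be generalized over a list of column pairs (a sketch, assuming columns named A through J):
pairs = [('A', 'B'), ('C', 'D'), ('E', 'F'), ('G', 'H'), ('I', 'J')]
out = pd.concat([df[list(p)].set_axis(['A', 'B'], axis=1) for p in pairs],
                ignore_index=True)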
The most efficient way is to reshape with numpy:
a = dffvstats.values
pd.DataFrame(np.c_[a[:, ::2].flatten('F'),
                   a[:, 1::2].flatten('F')],
             columns=['A', 'B'])
It concatenates all the even columns together, does the same with the odd columns, and assembles both into a two-column array to form a dataframe.
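For reference, flatten('F') flattens in column-major (Fortran) order, which is exactly what stacks each selected column on top of the next. A minimal illustration:
import numpy as np
a = np.array([[0, 1, 2, 3],
              [4, 5, 6, 7]])
a[:, ::2].flatten('F')   # array([0, 4, 2, 6]) - even columns stacked vertically
a[:, 1::2].flatten('F')  # array([1, 5, 3, 7]) - odd columns stacked vertically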

How to get max of a slice of a dataframe based on column values?

I'm looking to make a new column, MaxPriceBetweenEntries based on the max() of a slice of the dataframe
idx Price EntryBar ExitBar
0 10.00 0 1
1 11.00 NaN NaN
2 10.15 2 4
3 12.14 NaN NaN
4 10.30 NaN NaN
turned into
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 10.00 0 1 11.00
1 11.00 NaN NaN NaN
2 10.15 2 4 12.14
3 12.14 NaN NaN NaN
4 10.30 NaN NaN NaN
I can get all the rows with an EntryBar or ExitBar value with df.loc[df["EntryBar"].notnull()] and df.loc[df["ExitBar"].notnull()], but I can't use that to set a new column:
df.loc[df["EntryBar"].notnull(),"MaxPriceBetweenEntries"] = df.loc[df["EntryBar"]:df["ExitBar"]]["Price"].max()
but that's effectively a guess at this point, because nothing I'm trying works. Ideally the solution wouldn't involve a loop directly because there may be millions of rows.
You can group by the cumulative sum of non-null entries and take the max, using np.where() to apply the result only to the non-null rows:
import numpy as np

df['MaxPriceBetweenEntries'] = np.where(df['EntryBar'].notnull(),
                                        df.groupby(df['EntryBar'].notnull().cumsum())['Price'].transform('max'),
                                        np.nan)
df
Out[1]:
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN
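To see why the grouping key works: df['EntryBar'].notnull().cumsum() increments at every row that starts a new entry, so each entry row and the rows up to the next entry share one group label. On the sample data:
df['EntryBar'].notnull().cumsum()
# 0    1
# 1    1
# 2    2
# 3    2
# 4    2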
Let's try groupby() and where:
s = df['EntryBar'].notna()
df['MaxPriceBetweenEntries'] = df.groupby(s.cumsum())['Price'].transform('max').where(s)
Output:
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN
You can forward-fill the null values, group by entry, and get the max of each group's Price. Use that as the right side of a left join and you should be in business.
df.merge(df.ffill().groupby('EntryBar')['Price'].max().reset_index(name='MaxPriceBetweenEntries'),
         on='EntryBar',
         how='left')
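The right-hand side of that merge is just the per-entry maxima; on the sample data it evaluates to (shown for clarity):
df.ffill().groupby('EntryBar')['Price'].max().reset_index(name='MaxPriceBetweenEntries')
#    EntryBar  MaxPriceBetweenEntries
# 0       0.0                   11.00
# 1       2.0                   12.14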
Try:
df.loc[df['ExitBar'].notna(), 'Max'] = df.groupby(df['ExitBar'].ffill()).Price.max().values
df
Out[74]:
idx Price EntryBar ExitBar Max
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN

How To Map Column Values where two others match? "Reindexing only valid with uniquely valued Index objects"?

I have one DataFrame, df, I have four columns shown below:
IDP1 IDP1Number IDP2 IDP2Number
1 100 1 NaN
3 110 2 150
5 120 3 NaN
7 140 4 160
9 150 5 190
NaN NaN 6 130
NaN NaN 7 NaN
NaN NaN 8 200
NaN NaN 9 90
NaN NaN 10 NaN
I want to map values from df.IDP1Number into IDP2Number by matching IDP1 against IDP2: wherever an IDP2 value also appears in IDP1, replace IDP2Number with the corresponding IDP1Number; otherwise leave IDP2Number alone.
The error message that appears reads, "Reindexing only valid with uniquely valued Index objects".
The Dataframe below is what I wish to have:
IDP1 IDP1Number IDP2 IDP2Number
1 100 1 100
3 110 2 150
5 120 3 110
7 140 4 160
9 150 5 120
NaN NaN 6 130
NaN NaN 7 140
NaN NaN 8 200
NaN NaN 9 150
NaN NaN 10 NaN
Here's a way to do it:
# filter the data and create a mapping dict
maps = (df.query("IDP1.notna()")[['IDP1', 'IDP1Number']]
          .set_index('IDP1')['IDP1Number'].to_dict())
# create the new column using an if/else condition
df['IDP2Number'] = df.apply(lambda x: maps.get(x['IDP2'], None)
                            if (pd.isna(x['IDP2Number']) or x['IDP2'] in maps)
                            else x['IDP2Number'], axis=1)
print(df)
IDP1 IDP1Number IDP2 IDP2Number
0 1.0 100.0 1 100.0
1 3.0 110.0 2 150.0
2 5.0 120.0 3 110.0
3 7.0 140.0 4 160.0
4 9.0 150.0 5 120.0
5 NaN NaN 6 130.0
6 NaN NaN 7 140.0
7 NaN NaN 8 200.0
8 NaN NaN 9 150.0
9 NaN NaN 10 NaN
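A vectorized alternative that avoids apply (a sketch, assuming the same columns): build a Series indexed by IDP1, map it over IDP2, and fall back to the existing IDP2Number where there is no match:
mapping = df.dropna(subset=['IDP1']).set_index('IDP1')['IDP1Number']
df['IDP2Number'] = df['IDP2'].map(mapping).fillna(df['IDP2Number'])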

rolling moving average and std dev by multiple columns dynamically

I have a dataframe like this
import pandas as pd
import numpy as np
raw_data = {'Country':['UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','US','US','US','US','US','US'],
'Product':['A','A','A','A','B','B','B','B','B','B','B','B','C','C','C','D','D','D','D','D','D'],
'Week': [1,2,3,4,1,2,3,4,5,6,7,8,1,2,3,1,2,3,4,5,6],
'val': [5,4,3,1,5,6,7,8,9,10,11,12,5,5,5,5,6,7,8,9,10]
}
df2 = pd.DataFrame(raw_data, columns = ['Country','Product','Week', 'val'])
print(df2)
I want to calculate the moving average and std dev of the val column by Country and Product, for windows of 3 weeks, 5 weeks, 7 weeks, etc.
Wanted dataframe:
Country, Product, Week, val, 3wks_avg, 3wks_std, 5wks_avg, 5wks_std, ... etc.
Like WenYoBen suggested, we can create a list of all the window sizes you want, and then dynamically create your wanted columns with GroupBy.rolling:
weeks = [3, 5, 7]
for week in weeks:
    df2[[f'{week}wks_avg', f'{week}wks_std']] = (
        df2.groupby(['Country', 'Product']).rolling(window=week, on='Week')['val']
           .agg(['mean', 'std']).reset_index(drop=True)
    )
Country Product Week val 3wks_avg 3wks_std 5wks_avg 5wks_std 7wks_avg 7wks_std
0 UK A 1 5 nan nan nan nan nan nan
1 UK A 2 4 nan nan nan nan nan nan
2 UK A 3 3 4.00 1.00 nan nan nan nan
3 UK A 4 1 2.67 1.53 nan nan nan nan
4 UK B 1 5 nan nan nan nan nan nan
5 UK B 2 6 nan nan nan nan nan nan
6 UK B 3 7 6.00 1.00 nan nan nan nan
7 UK B 4 8 7.00 1.00 nan nan nan nan
8 UK B 5 9 8.00 1.00 7.00 1.58 nan nan
9 UK B 6 10 9.00 1.00 8.00 1.58 nan nan
10 UK B 7 11 10.00 1.00 9.00 1.58 8.00 2.16
11 UK B 8 12 11.00 1.00 10.00 1.58 9.00 2.16
12 UK C 1 5 nan nan nan nan nan nan
13 UK C 2 5 nan nan nan nan nan nan
14 UK C 3 5 5.00 0.00 nan nan nan nan
15 US D 1 5 nan nan nan nan nan nan
16 US D 2 6 nan nan nan nan nan nan
17 US D 3 7 6.00 1.00 nan nan nan nan
18 US D 4 8 7.00 1.00 nan nan nan nan
19 US D 5 9 8.00 1.00 7.00 1.58 nan nan
20 US D 6 10 9.00 1.00 8.00 1.58 nan nan
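The leading nan rows appear because rolling only emits a value once a full window is available. If partial windows are acceptable, rolling also takes a min_periods argument (an aside, not part of the answer above):
df2.groupby(['Country', 'Product']).rolling(window=3, on='Week', min_periods=1)['val'].mean()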
This is how you would get the moving average for 3 weeks :
df2['3weeks_avg'] = list(df2.groupby(['Country', 'Product']).rolling(3).mean()['val'])
Apply the same principle for the other columns you want to compute.
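For example, extending that pattern to every window size might look like this (a sketch using the question's df2; it relies on the groupby result keeping the frame's row order):
for w in [3, 5, 7]:
    df2[f'{w}wks_avg'] = list(df2.groupby(['Country', 'Product']).rolling(w).mean()['val'])
    df2[f'{w}wks_std'] = list(df2.groupby(['Country', 'Product']).rolling(w).std()['val'])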
IIUC, you may try this:
wks = ['Week_3', 'Week_5', 'Week_7']
df_calc = (df2.groupby(['Country', 'Product']).expanding().val
              .agg(['mean', 'std']).rename(lambda x: f'Week_{x+1}', level=-1)
              .query('ilevel_2 in @wks').unstack())
Out[246]:
mean std
Week_3 Week_5 Week_7 Week_3 Week_5 Week_7
Country Product
UK A 4.0 NaN NaN 1.0 NaN NaN
B NaN 5.0 6.0 NaN NaN 1.0
You will want to use a groupby-transform to get the rolling moments of your data. The following should compute what you are looking for:
weeks = [3, 5, 7]              # time intervals to compute
df2 = df2.sort_values('Week')  # order by time
for i in weeks:                # loop through the window sizes
    # i-week rolling mean and std within each Country/Product group
    df2['{}wks_avg'.format(i)] = df2.groupby(['Country', 'Product'])['val'].transform(lambda x: x.rolling(i).mean())
    df2['{}wks_std'.format(i)] = df2.groupby(['Country', 'Product'])['val'].transform(lambda x: x.rolling(i).std())
Here is what the resulting dataframe will look like.
print(df2.dropna().head().to_string())
Country Product Week val 3wks_avg 3wks_std 5wks_avg 5wks_std 7wks_avg 7wks_std
17 US D 3 7 6.0 1.0 6.0 1.0 6.0 1.0
6 UK B 3 7 6.0 1.0 6.0 1.0 6.0 1.0
14 UK C 3 5 5.0 0.0 5.0 0.0 5.0 0.0
2 UK A 3 3 4.0 1.0 4.0 1.0 4.0 1.0
7 UK B 4 8 7.0 1.0 7.0 1.0 7.0 1.0

Compare two dataframes, one column, and add certain values on match?

So I have two dataframes
eqdf
symbol qty
0 DABIND 1
1 INFTEC 6
2 DISHTV 8
3 HINDAL 40
4 NATMIN 5
5 POWGRI 40
6 CHEPET 6
premdf
share strike lprice premperc d_strike
0 HINDAL 250.0 237.90 1.975620 5.086171
1 RELIND 1280.0 1254.30 1.642350 2.048952
2 POWGRI 205.0 201.15 1.118568 1.913995
I want to compare the columns premdf['share'] and eqdf['symbol'], and where there is a match, append the premperc, d_strike, and strike values to the end of the matching eqdf row.
I have tried
eqdf.loc[eqdf['symbol']==premdf['share'],eqdf['premperc'] == premdf['premperc']]
I keep getting errors:
ValueError: Can only compare identically-labeled Series objects
Expected Output:
eqdf
symbol qty premperc d_strike strike
0 DABIND 1 NaN NaN NaN
1 INFTEC 6 NaN NaN NaN
2 DISHTV 8 NaN NaN NaN
3 HINDAL 40 1.975620 5.086171 250.0
4 NATMIN 5 NaN NaN NaN
5 POWGRI 40 1.118568 1.913995 205.0
6 CHEPET 6 NaN NaN NaN
What is the correct way to do this?
Thanks
Rename and merge:
eqdf.merge(premdf.rename(columns={'share': 'symbol'}), how='left')
symbol qty strike lprice premperc d_strike
0 DABIND 1 NaN NaN NaN NaN
1 INFTEC 6 NaN NaN NaN NaN
2 DISHTV 8 NaN NaN NaN NaN
3 HINDAL 40 250.0 237.90 1.975620 5.086171
4 NATMIN 5 NaN NaN NaN NaN
5 POWGRI 40 205.0 201.15 1.118568 1.913995
6 CHEPET 6 NaN NaN NaN NaN
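Equivalently, merge accepts left_on/right_on if you would rather not rename, at the cost of dropping the duplicated key column afterwards:
eqdf.merge(premdf, left_on='symbol', right_on='share', how='left').drop(columns='share')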
