Pandas pd.merge gives nan - python

I have two dataframes which I need to merge/join based on a column. When I try to join/merge them, the new columns give NaN.
Basically, I need to perform Left Join on the dataframes, considering df_user as the dataframe on the Left.
PS: The join column has the same datatype in both dataframes.
Please find the dataframes below -
df_user.dtypes
App category
Sentiment int8
Sentiment_Polarity float64
Sentiment_Subjectivity float64
df_play.dtypes
App category
Category category
Rating float64
Reviews float64
Size float64
Installs int64
Type int8
Price float64
Content Rating int8
Installs_Cat int8
df_play.head()
App Category Rating Reviews Size Installs Type Price Content Rating Installs_Cat
0 SPrapBook ART_AND_DESIGN 4.1 159 19 10000 0 0 0 9
1 U Launcher ART_AND_DESIGN 4.5 87510 25 5000000 0 0 0 14
2 Sketch - ART_AND_DESIGN 4.3 215644 2.8 50000000 0 0 1 16
3 Pixel Dra ART_AND_DESIGN 4.4 967 5.6 100000 0 0 0 11
4 Paper flo ART_AND_DESIGN 3.8 167 19 50000 0 0 0 10
df_user.head()
App Sentiment Sentiment_Polarity Sentiment_Subjectivity
0 10 Best Foods for You 2 1.00 0.533333
1 10 Best Foods for You 2 0.25 0.288462
3 10 Best Foods for You 2 0.40 0.875000
4 10 Best Foods for You 2 1.00 0.300000
5 10 Best Foods for You 2 1.00 0.300000
I tried both of the snippets below -
result = pd.merge(df_user, df_play, how='left', on='App')
result = df_user.join(df_play.set_index('App'), on='App', how='left', rsuffix='_y')
But all I got was -
App Sentiment Sentiment_Polarity Sentiment_Subjectivity Category Rating Reviews Size Installs Type Price Content Rating Installs_Cat
0 10 Best Foods for You 2 1.00 0.533333 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 10 Best Foods for You 2 0.25 0.288462 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 10 Best Foods for You 2 0.40 0.875000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 10 Best Foods for You 2 1.00 0.300000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 10 Best Foods for You 2 1.00 0.300000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Please excuse me for the formatting.
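A merge that returns only NaN in the new columns usually means the key values do not actually match even though the dtypes do; hidden whitespace or categorical columns with differing categories are common culprits. A quick diagnostic sketch (assuming the frames above):
# How many App values from df_user actually appear in df_play?
print(df_user['App'].isin(df_play['App']).sum())
# If that prints 0, normalize the keys to plain stripped strings and retry:
df_user['App'] = df_user['App'].astype(str).str.strip()
df_play['App'] = df_play['App'].astype(str).str.strip()
result = pd.merge(df_user, df_play, how='left', on='App')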

Related

Append Columns C and D to A and B in Pandas DataFrame

I have a data frame which is 12 columns by 12 rows. I want to append columns C and D, E and F, G and H, and I and J below columns A and B, making it 2x72. When I try to do it with:
df = dffvstats.iloc[:,[0,1]]
df2 = dffvstats.iloc[:,[2,3]]
df.append(df2)
it gives me:
0 1 2 3
0 Index DJIA S&P500 NaN NaN
1 Market Cap 2521.74B NaN NaN
2 Income 86.80B NaN NaN
3 Sales 347.15B NaN NaN
4 Book/sh 3.87 NaN NaN
5 Cash/sh 3.76 NaN NaN
6 Dividend 0.88 NaN NaN
7 Dividend % 0.57% NaN NaN
8 Employees 147000 NaN NaN
9 Optionable Yes NaN NaN
10 Shortable Yes NaN NaN
11 Recom 1.90 NaN NaN
0 NaN NaN P/E 30.09
1 NaN NaN Forward P/E 27.12
2 NaN NaN PEG 1.53
3 NaN NaN P/S 7.26
4 NaN NaN P/B 39.70
5 NaN NaN P/C 40.87
6 NaN NaN P/FCF 31.35
7 NaN NaN Quick Ratio 1.00
8 NaN NaN Current Ratio 1.10
9 NaN NaN Debt/Eq 1.89
10 NaN NaN LT Debt/Eq 1.65
11 NaN NaN SMA20 3.47%
Anyone know how to do this?
If your dffvstats dataframe looks like:
>>> dffvstats
0 1 2 3 # 4 5 6 7 and so on
0 Index DJIA S&P500 P/E 30.09
1 Market Cap 2521.74B Forward P/E 27.12
2 Income 86.80B PEG 1.53
3 Sales 347.15B P/S 7.26
4 Book/sh 3.87 P/B 39.70
5 Cash/sh 3.76 P/C 40.87
6 Dividend 0.88 P/FCF 31.35
7 Dividend % 0.57% Quick Ratio 1.00
8 Employees 147000 Current Ratio 1.10
9 Optionable Yes Debt/Eq 1.89
10 Shortable Yes LT Debt/Eq 1.65
11 Recom 1.90 SMA20 3.47%
And you want this:
>>> out
0 1
0 Index DJIA S&P500
1 Market Cap 2521.74B
2 Income 86.80B
3 Sales 347.15B
4 Book/sh 3.87
5 Cash/sh 3.76
6 Dividend 0.88
7 Dividend % 0.57%
8 Employees 147000
9 Optionable Yes
10 Shortable Yes
11 Recom 1.90
12 P/E 30.09
13 Forward P/E 27.12
14 PEG 1.53
15 P/S 7.26
16 P/B 39.70
17 P/C 40.87
18 P/FCF 31.35
19 Quick Ratio 1.00
20 Current Ratio 1.10
21 Debt/Eq 1.89
22 LT Debt/Eq 1.65
23 SMA20 3.47%
Try:
import numpy as np

out = pd.DataFrame(np.concatenate([dffvstats.loc[:, cols].values
                                   for cols in zip(dffvstats.columns[::2],
                                                   dffvstats.columns[1::2])]))
Explanation:
Start by getting the columns pairwise with zip:
>>> list(dffvstats.columns[::2]) # even columns
[0, 2]
>>> list(dffvstats.columns[1::2]) # odd columns
[1, 3]
>>> list(zip(dffvstats.columns[::2], dffvstats.columns[1::2]))
[(0, 1), (2, 3)]
Split the dataframe into sub-arrays:
tmp = [dffvstats.loc[:, cols].values
       for cols in zip(dffvstats.columns[::2], dffvstats.columns[1::2])]
>>> tmp[0]
array([['Index', 'DJIA S&P500'],
['Market Cap', '2521.74B'],
['Income', '86.80B'],
['Sales', '347.15B'],
['Book/sh', '3.87'],
['Cash/sh', '3.76'],
['Dividend', '0.88'],
['Dividend %', '0.57%'],
['Employees', '147000'],
['Optionable', 'Yes'],
['Shortable', 'Yes'],
['Recom', '1.90']], dtype=object)
>>> tmp[1]
array([['P/E', '30.09'],
['Forward P/E', '27.12'],
['PEG', '1.53'],
['P/S', '7.26'],
['P/B', '39.70'],
['P/C', '40.87'],
['P/FCF', '31.35'],
['Quick Ratio', '1.00'],
['Current Ratio', '1.10'],
['Debt/Eq', '1.89'],
['LT Debt/Eq', '1.65'],
['SMA20', '3.47%']], dtype=object)
Concatenate all the small arrays into one big dataframe with np.concatenate instead of pd.concat, because you have a list of ndarrays rather than DataFrames.
Try with pd.concat:
pd.concat([df[['A', 'B']],
           df[['C', 'D']].rename(columns={'C': 'A', 'D': 'B'}),
           df[['E', 'F']].rename(columns={'E': 'A', 'F': 'B'}),
           df[['G', 'H']].rename(columns={'G': 'A', 'H': 'B'}),
           df[['I', 'J']].rename(columns={'I': 'A', 'J': 'B'})])
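With more pairs this gets tedious to write out; the same idea can be generalized over a list of column pairs (a sketch, assuming columns named A through J):
pairs = [('A', 'B'), ('C', 'D'), ('E', 'F'), ('G', 'H'), ('I', 'J')]
out = pd.concat([df[list(p)].set_axis(['A', 'B'], axis=1) for p in pairs],
                ignore_index=True)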
The most efficient way is to reshape with numpy:
a = dffvstats.values
pd.DataFrame(np.c_[a[:, ::2].flatten('F'),
                   a[:, 1::2].flatten('F')],
             columns=['A', 'B'])
It concatenates all the even columns together, does the same with the odd columns, and assembles both into a two-column array to form a dataframe.
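For reference, flatten('F') flattens in column-major (Fortran) order, which is exactly what stacks each selected column on top of the next. A minimal illustration:
import numpy as np
a = np.array([[0, 1, 2, 3],
              [4, 5, 6, 7]])
a[:, ::2].flatten('F')   # array([0, 4, 2, 6]) - even columns stacked vertically
a[:, 1::2].flatten('F')  # array([1, 5, 3, 7]) - odd columns stacked vertically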

How to get max of a slice of a dataframe based on column values?

I'm looking to make a new column, MaxPriceBetweenEntries based on the max() of a slice of the dataframe
idx Price EntryBar ExitBar
0 10.00 0 1
1 11.00 NaN NaN
2 10.15 2 4
3 12.14 NaN NaN
4 10.30 NaN NaN
turned into
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 10.00 0 1 11.00
1 11.00 NaN NaN NaN
2 10.15 2 4 12.14
3 12.14 NaN NaN NaN
4 10.30 NaN NaN NaN
I can get all the rows with an EntryBar or ExitBar value with df.loc[df["EntryBar"].notnull()] and df.loc[df["ExitBar"].notnull()], but I can't use that to set a new column:
df.loc[df["EntryBar"].notnull(),"MaxPriceBetweenEntries"] = df.loc[df["EntryBar"]:df["ExitBar"]]["Price"].max()
but that's effectively a guess at this point, because nothing I'm trying works. Ideally the solution wouldn't involve a loop directly because there may be millions of rows.
You can group by the cumulative sum of non-null entries and take the max, using np.where() to apply the result only to the non-null rows:
import numpy as np

df['MaxPriceBetweenEntries'] = np.where(df['EntryBar'].notnull(),
                                        df.groupby(df['EntryBar'].notnull().cumsum())['Price'].transform('max'),
                                        np.nan)
df
Out[1]:
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN
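To see why the grouping key works: df['EntryBar'].notnull().cumsum() increments at every row that starts a new entry, so each entry row and the rows up to the next entry share one group label. On the sample data:
df['EntryBar'].notnull().cumsum()
# 0    1
# 1    1
# 2    2
# 3    2
# 4    2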
Let's try groupby() and where:
s = df['EntryBar'].notna()
df['MaxPriceBetweenEntries'] = df.groupby(s.cumsum())['Price'].transform('max').where(s)
Output:
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN
You can forward-fill the null values, group by entry, and get the max of each group's Price. Use that as the right side of a left join and you should be in business.
df.merge(df.ffill().groupby('EntryBar')['Price'].max().reset_index(name='MaxPriceBetweenEntries'),
         on='EntryBar',
         how='left')
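The right-hand side of that merge is just the per-entry maxima; on the sample data it evaluates to (shown for clarity):
df.ffill().groupby('EntryBar')['Price'].max().reset_index(name='MaxPriceBetweenEntries')
#    EntryBar  MaxPriceBetweenEntries
# 0       0.0                   11.00
# 1       2.0                   12.14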
Try:
df.loc[df['ExitBar'].notna(), 'Max'] = df.groupby(df['ExitBar'].ffill()).Price.max().values
df
Out[74]:
idx Price EntryBar ExitBar Max
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN

How To Map Column Values where two others match? "Reindexing only valid with uniquely valued Index objects"?

I have one DataFrame, df, I have four columns shown below:
IDP1 IDP1Number IDP2 IDP2Number
1 100 1 NaN
3 110 2 150
5 120 3 NaN
7 140 4 160
9 150 5 190
NaN NaN 6 130
NaN NaN 7 NaN
NaN NaN 8 200
NaN NaN 9 90
NaN NaN 10 NaN
I want to map values from df.IDP1Number into IDP2Number by matching IDP1 against IDP2: wherever an IDP2 value also appears in IDP1, replace IDP2Number with the corresponding IDP1Number; otherwise leave IDP2Number alone.
The error message that appears reads, "Reindexing only valid with uniquely valued Index objects".
The Dataframe below is what I wish to have:
IDP1 IDP1Number IDP2 IDP2Number
1 100 1 100
3 110 2 150
5 120 3 110
7 140 4 160
9 150 5 120
NaN NaN 6 130
NaN NaN 7 140
NaN NaN 8 200
NaN NaN 9 150
NaN NaN 10 NaN
Here's a way to do it:
# filter the data and create a mapping dict
maps = (df.query("IDP1.notna()")[['IDP1', 'IDP1Number']]
          .set_index('IDP1')['IDP1Number'].to_dict())
# create the new column using an if/else condition
df['IDP2Number'] = df.apply(lambda x: maps.get(x['IDP2'], None)
                            if (pd.isna(x['IDP2Number']) or x['IDP2'] in maps)
                            else x['IDP2Number'], axis=1)
print(df)
IDP1 IDP1Number IDP2 IDP2Number
0 1.0 100.0 1 100.0
1 3.0 110.0 2 150.0
2 5.0 120.0 3 110.0
3 7.0 140.0 4 160.0
4 9.0 150.0 5 120.0
5 NaN NaN 6 130.0
6 NaN NaN 7 140.0
7 NaN NaN 8 200.0
8 NaN NaN 9 150.0
9 NaN NaN 10 NaN
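A vectorized alternative that avoids apply (a sketch, assuming the same columns): build a Series indexed by IDP1, map it over IDP2, and fall back to the existing IDP2Number where there is no match:
mapping = df.dropna(subset=['IDP1']).set_index('IDP1')['IDP1Number']
df['IDP2Number'] = df['IDP2'].map(mapping).fillna(df['IDP2Number'])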

rolling moving average and std dev by multiple columns dynamically

I have a dataframe like this
import pandas as pd
import numpy as np
raw_data = {'Country':['UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','US','US','US','US','US','US'],
'Product':['A','A','A','A','B','B','B','B','B','B','B','B','C','C','C','D','D','D','D','D','D'],
'Week': [1,2,3,4,1,2,3,4,5,6,7,8,1,2,3,1,2,3,4,5,6],
'val': [5,4,3,1,5,6,7,8,9,10,11,12,5,5,5,5,6,7,8,9,10]
}
df2 = pd.DataFrame(raw_data, columns = ['Country','Product','Week', 'val'])
print(df2)
I want to calculate the moving average and std dev of the val column by Country and Product, for windows of 3 weeks, 5 weeks, 7 weeks, etc.
Wanted dataframe:
Country, Product, Week, val, 3wks_avg, 3wks_std, 5wks_avg, 5wks_std, ... etc.
Like WenYoBen suggested, we can create a list of all the window sizes you want, and then dynamically create your wanted columns with GroupBy.rolling:
weeks = [3, 5, 7]
for week in weeks:
    df2[[f'{week}wks_avg', f'{week}wks_std']] = (
        df2.groupby(['Country', 'Product']).rolling(window=week, on='Week')['val']
           .agg(['mean', 'std']).reset_index(drop=True)
    )
Country Product Week val 3wks_avg 3wks_std 5wks_avg 5wks_std 7wks_avg 7wks_std
0 UK A 1 5 nan nan nan nan nan nan
1 UK A 2 4 nan nan nan nan nan nan
2 UK A 3 3 4.00 1.00 nan nan nan nan
3 UK A 4 1 2.67 1.53 nan nan nan nan
4 UK B 1 5 nan nan nan nan nan nan
5 UK B 2 6 nan nan nan nan nan nan
6 UK B 3 7 6.00 1.00 nan nan nan nan
7 UK B 4 8 7.00 1.00 nan nan nan nan
8 UK B 5 9 8.00 1.00 7.00 1.58 nan nan
9 UK B 6 10 9.00 1.00 8.00 1.58 nan nan
10 UK B 7 11 10.00 1.00 9.00 1.58 8.00 2.16
11 UK B 8 12 11.00 1.00 10.00 1.58 9.00 2.16
12 UK C 1 5 nan nan nan nan nan nan
13 UK C 2 5 nan nan nan nan nan nan
14 UK C 3 5 5.00 0.00 nan nan nan nan
15 US D 1 5 nan nan nan nan nan nan
16 US D 2 6 nan nan nan nan nan nan
17 US D 3 7 6.00 1.00 nan nan nan nan
18 US D 4 8 7.00 1.00 nan nan nan nan
19 US D 5 9 8.00 1.00 7.00 1.58 nan nan
20 US D 6 10 9.00 1.00 8.00 1.58 nan nan
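The leading nan rows appear because rolling only emits a value once a full window is available. If partial windows are acceptable, rolling also takes a min_periods argument (an aside, not part of the answer above):
df2.groupby(['Country', 'Product']).rolling(window=3, on='Week', min_periods=1)['val'].mean()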
This is how you would get the moving average for 3 weeks :
df2['3weeks_avg'] = list(df2.groupby(['Country', 'Product']).rolling(3).mean()['val'])
Apply the same principle for the other columns you want to compute.
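For example, extending that pattern to every window size might look like this (a sketch using the question's df2; it relies on the groupby result keeping the frame's row order):
for w in [3, 5, 7]:
    df2[f'{w}wks_avg'] = list(df2.groupby(['Country', 'Product']).rolling(w).mean()['val'])
    df2[f'{w}wks_std'] = list(df2.groupby(['Country', 'Product']).rolling(w).std()['val'])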
IIUC, you may try this:
wks = ['Week_3', 'Week_5', 'Week_7']
df_calc = (df2.groupby(['Country', 'Product']).expanding().val
              .agg(['mean', 'std']).rename(lambda x: f'Week_{x+1}', level=-1)
              .query('ilevel_2 in @wks').unstack())
Out[246]:
mean std
Week_3 Week_5 Week_7 Week_3 Week_5 Week_7
Country Product
UK A 4.0 NaN NaN 1.0 NaN NaN
B NaN 5.0 6.0 NaN NaN 1.0
You will want to use a groupby-transform to get the rolling moments of your data. The following should compute what you are looking for:
weeks = [3, 5, 7]              # time intervals to compute
df2 = df2.sort_values('Week')  # order by time
for i in weeks:                # loop through the window sizes
    # i-week rolling mean and std within each Country/Product group
    df2['{}wks_avg'.format(i)] = df2.groupby(['Country', 'Product'])['val'].transform(lambda x: x.rolling(i).mean())
    df2['{}wks_std'.format(i)] = df2.groupby(['Country', 'Product'])['val'].transform(lambda x: x.rolling(i).std())
Here is what the resulting dataframe will look like.
print(df2.dropna().head().to_string())
Country Product Week val 3wks_avg 3wks_std 5wks_avg 5wks_std 7wks_avg 7wks_std
17 US D 3 7 6.0 1.0 6.0 1.0 6.0 1.0
6 UK B 3 7 6.0 1.0 6.0 1.0 6.0 1.0
14 UK C 3 5 5.0 0.0 5.0 0.0 5.0 0.0
2 UK A 3 3 4.0 1.0 4.0 1.0 4.0 1.0
7 UK B 4 8 7.0 1.0 7.0 1.0 7.0 1.0

Compare two dataframes, one column, and add certain values on match?

So I have two dataframes
eqdf
symbol qty
0 DABIND 1
1 INFTEC 6
2 DISHTV 8
3 HINDAL 40
4 NATMIN 5
5 POWGRI 40
6 CHEPET 6
premdf
share strike lprice premperc d_strike
0 HINDAL 250.0 237.90 1.975620 5.086171
1 RELIND 1280.0 1254.30 1.642350 2.048952
2 POWGRI 205.0 201.15 1.118568 1.913995
I want to compare the columns premdf['share'] and eqdf['symbol'], and where there is a match, append the premperc, d_strike, and strike values to the end of the matching eqdf row.
I have tried
eqdf.loc[eqdf['symbol']==premdf['share'],eqdf['premperc'] == premdf['premperc']]
I keep getting errors:
ValueError: Can only compare identically-labeled Series objects
Expected Output:
eqdf
symbol qty premperc d_strike strike
0 DABIND 1 NaN NaN NaN
1 INFTEC 6 NaN NaN NaN
2 DISHTV 8 NaN NaN NaN
3 HINDAL 40 1.975620 5.086171 250.0
4 NATMIN 5 NaN NaN NaN
5 POWGRI 40 1.118568 1.913995 205.0
6 CHEPET 6 NaN NaN NaN
What is the correct way to do this?
Thanks
Rename and merge:
eqdf.merge(premdf.rename(columns={'share': 'symbol'}), how='left')
symbol qty strike lprice premperc d_strike
0 DABIND 1 NaN NaN NaN NaN
1 INFTEC 6 NaN NaN NaN NaN
2 DISHTV 8 NaN NaN NaN NaN
3 HINDAL 40 250.0 237.90 1.975620 5.086171
4 NATMIN 5 NaN NaN NaN NaN
5 POWGRI 40 205.0 201.15 1.118568 1.913995
6 CHEPET 6 NaN NaN NaN NaN
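Equivalently, merge accepts left_on/right_on if you would rather not rename, at the cost of dropping the duplicated key column afterwards:
eqdf.merge(premdf, left_on='symbol', right_on='share', how='left').drop(columns='share')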
