Count total, total nulls, mean and median per group - Python

Let us say I have a data frame with a column called values, and for this column, I want to calculate the total observations, total null observations, mean and median values per group.
I.e.,
import numpy as np
import pandas as pd

mydf = pd.DataFrame({'date_ym': ['2018-01', '2018-01', '2018-01', '2018-01', '2018-02', '2018-02', '2018-03'],
                     'category': ['A', 'A', 'A', 'B', 'A', 'B', 'B'],
                     'values': [np.nan, 4.0, 5.1, np.nan, 6.2, np.nan, np.nan]})
mydf
Out[134]:
  category  date_ym  values
0        A  2018-01     NaN
1        A  2018-01     4.0
2        A  2018-01     5.1
3        B  2018-01     NaN
4        A  2018-02     6.2
5        B  2018-02     NaN
6        B  2018-03     NaN
If I use groupby and agg, I get the following output:
mydf.groupby(['date_ym','category']).agg(['count', 'mean', 'median']).reset_index()
Out[135]:
   date_ym category values
                     count  mean median
0  2018-01        A      2  4.55   4.55
1  2018-01        B      0   NaN    NaN
2  2018-02        A      1  6.20   6.20
3  2018-02        B      0   NaN    NaN
4  2018-03        B      0   NaN    NaN
But the output I'd really want is as follows:
   date_ym category values
                     count countNAs  mean median
0  2018-01        A      2        1  4.55   4.55
1  2018-01        B      0        1   NaN    NaN
2  2018-02        A      1        0  6.20   6.20
3  2018-02        B      0        1   NaN    NaN
4  2018-03        B      0        1   NaN    NaN

You can pass a custom function to agg:
def countNAs(x):
    return x.isnull().sum()

mydf.groupby(['date_ym', 'category']).agg(['count', countNAs, 'mean', 'median']).reset_index()
Out[647]:
   date_ym category values
                     count countNAs  mean median
0  2018-01        A      2      1.0  4.55   4.55
1  2018-01        B      0      1.0   NaN    NaN
2  2018-02        A      1      0.0  6.20   6.20
3  2018-02        B      0      1.0   NaN    NaN
4  2018-03        B      0      1.0   NaN    NaN

This is not the most straightforward approach, but it does the job.
# 'size' counts all rows, including NaNs; 'count' counts only non-null rows
mydf = mydf.groupby(['date_ym', 'category']).agg(['size', 'count', 'mean', 'median']).reset_index()
# flatten the MultiIndex column names
mydf.columns = ['date_ym', 'category', 'countNA', 'count', 'mean', 'median']
# countNA = size - count
mydf['countNA'] = mydf['countNA'] - mydf['count']
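As a further option, on pandas 0.25+ named aggregation yields flat column names directly. A minimal sketch of the same computation (the lambda plays the role of the countNAs helper above):
out = (mydf.groupby(['date_ym', 'category'])['values']
           .agg(count='count',
                countNAs=lambda x: x.isnull().sum(),  # null rows per group
                mean='mean',
                median='median')
           .reset_index())
This avoids both the custom function's name leaking into the columns and the manual column renaming.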

Related

Append Columns C and D to A and B in Pandas DataFrame

I have a data frame which is 12 columns by 12 rows. I want to append columns C&D, E&F, G&H, and I&J below columns A&B, making it 2 columns by 72 rows. When I try to do it with:
df = dffvstats.iloc[:,[0,1]]
df2 = dffvstats.iloc[:,[2,3]]
df.append(df2)
it gives me:
             0            1              2      3
0        Index  DJIA S&P500            NaN    NaN
1   Market Cap     2521.74B            NaN    NaN
2       Income       86.80B            NaN    NaN
3        Sales      347.15B            NaN    NaN
4      Book/sh         3.87            NaN    NaN
5      Cash/sh         3.76            NaN    NaN
6     Dividend         0.88            NaN    NaN
7   Dividend %        0.57%            NaN    NaN
8    Employees       147000            NaN    NaN
9   Optionable          Yes            NaN    NaN
10   Shortable          Yes            NaN    NaN
11       Recom         1.90            NaN    NaN
0          NaN          NaN            P/E  30.09
1          NaN          NaN    Forward P/E  27.12
2          NaN          NaN            PEG   1.53
3          NaN          NaN            P/S   7.26
4          NaN          NaN            P/B  39.70
5          NaN          NaN            P/C  40.87
6          NaN          NaN          P/FCF  31.35
7          NaN          NaN    Quick Ratio   1.00
8          NaN          NaN  Current Ratio   1.10
9          NaN          NaN        Debt/Eq   1.89
10         NaN          NaN     LT Debt/Eq   1.65
11         NaN          NaN          SMA20  3.47%
Does anyone know how to do this?
If your dffvstats dataframe looks like:
>>> dffvstats
             0            1              2      3  # 4 5 6 7 and so on
0        Index  DJIA S&P500            P/E  30.09
1   Market Cap     2521.74B    Forward P/E  27.12
2       Income       86.80B            PEG   1.53
3        Sales      347.15B            P/S   7.26
4      Book/sh         3.87            P/B  39.70
5      Cash/sh         3.76            P/C  40.87
6     Dividend         0.88          P/FCF  31.35
7   Dividend %        0.57%    Quick Ratio   1.00
8    Employees       147000  Current Ratio   1.10
9   Optionable          Yes        Debt/Eq   1.89
10   Shortable          Yes     LT Debt/Eq   1.65
11       Recom         1.90          SMA20  3.47%
And you want this:
>>> out
                0            1
0           Index  DJIA S&P500
1      Market Cap     2521.74B
2          Income       86.80B
3           Sales      347.15B
4         Book/sh         3.87
5         Cash/sh         3.76
6        Dividend         0.88
7      Dividend %        0.57%
8       Employees       147000
9      Optionable          Yes
10      Shortable          Yes
11          Recom         1.90
12            P/E        30.09
13    Forward P/E        27.12
14            PEG         1.53
15            P/S         7.26
16            P/B        39.70
17            P/C        40.87
18          P/FCF        31.35
19    Quick Ratio         1.00
20  Current Ratio         1.10
21        Debt/Eq         1.89
22     LT Debt/Eq         1.65
23          SMA20        3.47%
Try:
out = pd.DataFrame(np.concatenate([dffvstats.loc[:, cols].values
                                   for cols in zip(dffvstats.columns[::2],
                                                   dffvstats.columns[1::2])]))
Explanation:
Start by pairing the columns with zip:
>>> list(dffvstats.columns[::2])   # even columns
[0, 2]
>>> list(dffvstats.columns[1::2])  # odd columns
[1, 3]
>>> list(zip(dffvstats.columns[::2], dffvstats.columns[1::2]))
[(0, 1), (2, 3)]
Split the dataframe into sub-arrays, one per column pair:
tmp = [dffvstats.loc[:, cols].values
       for cols in zip(dffvstats.columns[::2], dffvstats.columns[1::2])]
>>> tmp[0]
array([['Index', 'DJIA S&P500'],
       ['Market Cap', '2521.74B'],
       ['Income', '86.80B'],
       ['Sales', '347.15B'],
       ['Book/sh', '3.87'],
       ['Cash/sh', '3.76'],
       ['Dividend', '0.88'],
       ['Dividend %', '0.57%'],
       ['Employees', '147000'],
       ['Optionable', 'Yes'],
       ['Shortable', 'Yes'],
       ['Recom', '1.90']], dtype=object)
>>> tmp[1]
array([['P/E', '30.09'],
       ['Forward P/E', '27.12'],
       ['PEG', '1.53'],
       ['P/S', '7.26'],
       ['P/B', '39.70'],
       ['P/C', '40.87'],
       ['P/FCF', '31.35'],
       ['Quick Ratio', '1.00'],
       ['Current Ratio', '1.10'],
       ['Debt/Eq', '1.89'],
       ['LT Debt/Eq', '1.65'],
       ['SMA20', '3.47%']], dtype=object)
Finally, concatenate all the small arrays into one big one with np.concatenate rather than pd.concat, because tmp is a list of ndarrays, not DataFrames.
Try with pd.concat (each pair must be renamed to a common set of column labels before concatenating):
pd.concat([df[['A', 'B']],
           df[['C', 'D']].rename(columns={'C': 'A', 'D': 'B'}),
           df[['E', 'F']].rename(columns={'E': 'A', 'F': 'B'}),
           df[['G', 'H']].rename(columns={'G': 'A', 'H': 'B'}),
           df[['I', 'J']].rename(columns={'I': 'A', 'J': 'B'})])
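With more column pairs, the manual renames get tedious. A loop-based sketch of the same idea, assuming the pairs sit in adjacent columns, 'A'/'B' are the target labels, and a pandas recent enough that set_axis returns a copy:
pd.concat([df.iloc[:, i:i + 2].set_axis(['A', 'B'], axis=1)  # relabel each pair to A/B
           for i in range(0, df.shape[1], 2)],
          ignore_index=True)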
The most efficient way to reshape is with numpy:
a = dffvstats.values
pd.DataFrame(np.c_[a[:, ::2].flatten('F'),
                   a[:, 1::2].flatten('F')],
             columns=['A', 'B'])
This concatenates all the even columns together, does the same with the odd ones, and assembles both into a two-column array to form a dataframe.

Sales data: plot the number of orders for each item over time

I have a dataframe with time-series data and want to plot the number of orders for each item over time.
         date  item  ordered
1  01-05-2020     1        1
2  01-05-2020     1       23
3  03-06-2020     2        4
4  03-07-2020     2        5
5  04-09-2020     3        4
df_new = df.groupby(['date', 'item'])['ordered'].sum().reset_index()
df_new.plot()
Use DataFrame.pivot_table before plotting; also, don't convert the DatetimeIndex to a column with reset_index before plotting:
df_new = df.pivot_table(index='date', columns='item', values='ordered', aggfunc='sum')
print (df_new)
item           1    2    3
date
01-05-2020  24.0  NaN  NaN
03-06-2020   NaN  4.0  NaN
03-07-2020   NaN  5.0  NaN
04-09-2020   NaN  NaN  4.0
df_new.plot()
Your solution:
df_new = df.groupby(['date','item'])['ordered'].sum().unstack()
print (df_new)
item           1    2    3
date
01-05-2020  24.0  NaN  NaN
03-06-2020   NaN  4.0  NaN
03-07-2020   NaN  5.0  NaN
04-09-2020   NaN  NaN  4.0
df_new.plot()
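If date is still stored as strings, the x-axis will be treated as categorical labels. A minimal sketch that parses the dates first, assuming a day-first format (i.e. 01-05-2020 means 1 May 2020), so that plot() gets a real DatetimeIndex:
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')  # assumption: day-first dates
df_new = df.pivot_table(index='date', columns='item', values='ordered', aggfunc='sum')
df_new.plot()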

How to get max of a slice of a dataframe based on column values?

I'm looking to make a new column, MaxPriceBetweenEntries, based on the max() of a slice of the dataframe:
idx  Price  EntryBar  ExitBar
0    10.00         0        1
1    11.00       NaN      NaN
2    10.15         2        4
3    12.14       NaN      NaN
4    10.30       NaN      NaN
turned into
idx  Price  EntryBar  ExitBar  MaxPriceBetweenEntries
0    10.00         0        1                   11.00
1    11.00       NaN      NaN                     NaN
2    10.15         2        4                   12.14
3    12.14       NaN      NaN                     NaN
4    10.30       NaN      NaN                     NaN
I can get all the rows with an EntryBar or ExitBar value with df.loc[df["EntryBar"].notnull()] and df.loc[df["ExitBar"].notnull()], but I can't use that to set a new column:
df.loc[df["EntryBar"].notnull(),"MaxPriceBetweenEntries"] = df.loc[df["EntryBar"]:df["ExitBar"]]["Price"].max()
but that's effectively a guess at this point, because nothing I'm trying works. Ideally the solution wouldn't involve a loop directly because there may be millions of rows.
You can group by the cumulative sum of non-null entries and take the max, using np.where() to only apply it to the non-null rows:
df['MaxPriceBetweenEntries'] = np.where(df['EntryBar'].notnull(),
                                        df.groupby(df['EntryBar'].notnull().cumsum())['Price'].transform('max'),
                                        np.nan)
df
Out[1]:
   idx  Price  EntryBar  ExitBar  MaxPriceBetweenEntries
0    0  10.00       0.0      1.0                   11.00
1    1  11.00       NaN      NaN                     NaN
2    2  10.15       2.0      4.0                   12.14
3    3  12.14       NaN      NaN                     NaN
4    4  10.30       NaN      NaN                     NaN
Let's try groupby() and where:
s = df['EntryBar'].notna()
df['MaxPriceBetweenEntries'] = df.groupby(s.cumsum())['Price'].transform('max').where(s)
Output:
   idx  Price  EntryBar  ExitBar  MaxPriceBetweenEntries
0    0  10.00       0.0      1.0                   11.00
1    1  11.00       NaN      NaN                     NaN
2    2  10.15       2.0      4.0                   12.14
3    3  12.14       NaN      NaN                     NaN
4    4  10.30       NaN      NaN                     NaN
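This is easy to check end to end. A self-contained repro sketch, reconstructing the question's frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'idx': range(5),
                   'Price': [10.00, 11.00, 10.15, 12.14, 10.30],
                   'EntryBar': [0, np.nan, 2, np.nan, np.nan],
                   'ExitBar': [1, np.nan, 4, np.nan, np.nan]})

s = df['EntryBar'].notna()                    # True on entry rows
# each entry starts a new group; max Price within the group, kept only on entry rows
df['MaxPriceBetweenEntries'] = df.groupby(s.cumsum())['Price'].transform('max').where(s)
print(df)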
You can forward fill the null values, group by entry and get the max of that group's Price. Use that as the right side of a left join and you should be in business.
df.merge(df.ffill().groupby('EntryBar')['Price'].max().reset_index(name='MaxPriceBetweenEntries'),
         on='EntryBar',
         how='left')
Try:
df.loc[df['ExitBar'].notna(), 'Max'] = df.groupby(df['ExitBar'].ffill()).Price.max().values
df
Out[74]:
   idx  Price  EntryBar  ExitBar    Max
0    0  10.00       0.0      1.0  11.00
1    1  11.00       NaN      NaN    NaN
2    2  10.15       2.0      4.0  12.14
3    3  12.14       NaN      NaN    NaN
4    4  10.30       NaN      NaN    NaN

Pandas pd.merge gives NaN

I have two dataframes that I need to merge/join on a column. When I try to join/merge them, the new columns all come back as NaN.
Basically, I need to perform Left Join on the dataframes, considering df_user as the dataframe on the Left.
PS: The join column has the same datatype in both dataframes.
Please find the dataframes below:
df_user.dtypes
App                       category
Sentiment                     int8
Sentiment_Polarity         float64
Sentiment_Subjectivity     float64

df_play.dtypes
App               category
Category          category
Rating             float64
Reviews            float64
Size               float64
Installs             int64
Type                  int8
Price              float64
Content Rating        int8
Installs_Cat          int8
df_play.head()
          App        Category  Rating  Reviews  Size  Installs  Type  Price  Content  Installs_Cat
0   SPrapBook  ART_AND_DESIGN     4.1      159   19      10000     0      0        0             9
1  U Launcher  ART_AND_DESIGN     4.5    87510   25    5000000     0      0        0            14
2    Sketch -  ART_AND_DESIGN     4.3   215644  2.8   50000000     0      0        1            16
3   Pixel Dra  ART_AND_DESIGN     4.4      967  5.6     100000     0      0        0            11
4   Paper flo  ART_AND_DESIGN     3.8      167   19      50000     0      0        0            10
df_user.head()
                     App  Sentiment  Sentiment_Polarity  Sentiment_Subjectivity
0  10 Best Foods for You          2                1.00                0.533333
1  10 Best Foods for You          2                0.25                0.288462
3  10 Best Foods for You          2                0.40                0.875000
4  10 Best Foods for You          2                1.00                0.300000
5  10 Best Foods for You          2                1.00                0.300000
I tried both of the snippets below:
result = pd.merge(df_user, df_play, how='left', on='App')
result = df_user.join(df_play.set_index('App'),on='App',how='left',rsuffix='_y')
But all I got was:
                     App  Sentiment  Sentiment_Polarity  Sentiment_Subjectivity  Category  Rating  Reviews  Size  Installs  Type  Price  Content Rating  Installs_Cat
0  10 Best Foods for You          2                1.00                0.533333       NaN     NaN      NaN   NaN       NaN   NaN    NaN             NaN           NaN
1  10 Best Foods for You          2                0.25                0.288462       NaN     NaN      NaN   NaN       NaN   NaN    NaN             NaN           NaN
2  10 Best Foods for You          2                0.40                0.875000       NaN     NaN      NaN   NaN       NaN   NaN    NaN             NaN           NaN
3  10 Best Foods for You          2                1.00                0.300000       NaN     NaN      NaN   NaN       NaN   NaN    NaN             NaN           NaN
4  10 Best Foods for You          2                1.00                0.300000       NaN     NaN      NaN   NaN       NaN   NaN    NaN             NaN           NaN
Please excuse me for the formatting.
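A left join that comes back all-NaN on the right side usually means the keys never actually match (stray whitespace in the strings, or two category columns with different category codes are common causes). A hedged diagnostic sketch, assuming App is the join key as above:
# how many App values appear in both frames? (if 0, nothing can match)
common = set(df_user['App'].astype(str)) & set(df_play['App'].astype(str))
print(len(common))

# normalize the keys: cast the category dtype to plain strings, strip whitespace, retry
df_user['App'] = df_user['App'].astype(str).str.strip()
df_play['App'] = df_play['App'].astype(str).str.strip()
result = pd.merge(df_user, df_play, how='left', on='App')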

Combine_first and null values in Pandas

df1:
       0     1
0    nan  3.00
1  -4.00   nan
2    nan  7.00
df2:
        0    1     2
1  -42.00  nan  8.00
2   -5.00  nan  4.00
df3 = df1.combine_first(df2)
df3:
       0     1     2
0    nan  3.00   nan
1  -4.00   nan  8.00
2  -5.00  7.00  4.00
This is what I'd like df3 to be:
       0     1     2
0    nan  3.00   nan
1  -4.00   nan  8.00
2    nan  7.00  4.00
(The difference is at df3.iloc[2, 0].)
That is, if the column and index are the same for any cell in both df1 and df2, I'd like df1's value to prevail, even if that value is nan. combine_first does that, except when the value in df1 is nan.
Here's a bit of a hacky way to do it. First, align df2 with df1, which creates a frame indexed with the union of df1/df2, filled with df2's values. Then assign back df1's values.
In [325]: df3, _ = df2.align(df1)
In [327]: df3.loc[df1.index, df1.columns] = df1
In [328]: df3
Out[328]:
    0    1    2
0 NaN    3  NaN
1  -4  NaN    8
2 NaN    7    4
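For reference, a self-contained sketch of the same trick, reconstructing df1 and df2 as shown in the question:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({0: [np.nan, -4.0, np.nan], 1: [3.0, np.nan, 7.0]})
df2 = pd.DataFrame({0: [-42.0, -5.0], 1: [np.nan, np.nan], 2: [8.0, 4.0]}, index=[1, 2])

# align df2 to the union of both indexes/columns, keeping df2's values
df3, _ = df2.align(df1)
# then overwrite every cell covered by df1, including its NaNs
df3.loc[df1.index, df1.columns] = df1
print(df3)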
