How does the pandas groupby() function make a difference in this code? - python

import pandas as pd
data = {'Company': ['GOOG', 'MSFT', 'FB', 'GOOG', 'MSFT', 'FB'],
        'Dates': ["1970-01-01 01:00:00", "1970-01-01 01:00:02", "1970-01-01 01:00:03",
                  "1970-01-01 01:00:04", "1970-01-01 01:00:05", "1970-01-01 01:00:06"]}
df = pd.DataFrame(data)
df["Dates"] = pd.to_datetime(df["Dates"])
df.Dates.diff().dt.total_seconds() / 3600
This code gives me the output
0         NaN
1    0.000556
2    0.000278
3    0.000278
4    0.000278
5    0.000278
Name: Dates, dtype: float64
and
df.groupby("Company").Dates.diff().dt.total_seconds() / 3600
gives me the output
0         NaN
1         NaN
2         NaN
3    0.001111
4    0.000833
5    0.000833
Name: Dates, dtype: float64
Can you explain what groupby function does here?

You get three NaNs because there are three different company names in df: groupby splits the DataFrame into three groups, runs diff within each group, and concatenates the results back in the original row order. The first row of each group has no previous row to diff against, hence one NaN per group.
Detail:
df["Dates"] = pd.to_datetime(df["Dates"])
for x, y in df.groupby('Company'):
    print(y)
    print(y['Dates'].diff().dt.total_seconds())
Company Dates
2 FB 1970-01-01 01:00:03
5 FB 1970-01-01 01:00:06
2 NaN
5 3.0
Name: Dates, dtype: float64
Company Dates
0 GOOG 1970-01-01 01:00:00
3 GOOG 1970-01-01 01:00:04
0 NaN
3 4.0
Name: Dates, dtype: float64
Company Dates
1 MSFT 1970-01-01 01:00:02
4 MSFT 1970-01-01 01:00:05
1 NaN
4 3.0
Name: Dates, dtype: float64


Pandas downsampling: more time intervals?

I'm doing some resampling on data and I was wondering why resampling 1min data to 5min data creates MORE time intervals than my original dataset?
Also, why does it resample until 2018-12-11 (11 days longer!) than the original dataset?
1-min data:
Result of resampling to 5-min intervals:
This is how I do the resampling:
df1.loc[:,'qKfz_gesamt'].resample('5min').mean()
I was wondering why resampling 1min data to 5min data creates MORE time intervals than my original dataset?
The problem is that resample always produces consecutive 5-minute intervals between the first and last timestamp; for intervals with no values in the original data, NaNs are created:
df1 = pd.DataFrame({'qKfz_gesamt': range(4)},
                   index=pd.to_datetime(['2018-11-25 00:00:00', '2018-11-25 00:01:00',
                                         '2018-11-25 00:02:00', '2018-11-25 00:15:00']))
print (df1)
qKfz_gesamt
2018-11-25 00:00:00 0
2018-11-25 00:01:00 1
2018-11-25 00:02:00 2
2018-11-25 00:15:00 3
print (df1['qKfz_gesamt'].resample('5min').mean())
2018-11-25 00:00:00 1.0
2018-11-25 00:05:00 NaN
2018-11-25 00:10:00 NaN
2018-11-25 00:15:00 3.0
Freq: 5T, Name: qKfz_gesamt, dtype: float64
print (df1['qKfz_gesamt'].resample('5min').mean().dropna())
2018-11-25 00:00:00 1.0
2018-11-25 00:15:00 3.0
Name: qKfz_gesamt, dtype: float64
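To see which 5-minute bins are actually empty before dropping them, counting the original rows per bin is a quick check; a sketch using the same toy frame:

```python
import pandas as pd

df1 = pd.DataFrame({'qKfz_gesamt': range(4)},
                   index=pd.to_datetime(['2018-11-25 00:00:00', '2018-11-25 00:01:00',
                                         '2018-11-25 00:02:00', '2018-11-25 00:15:00']))

# number of original rows falling into each 5-minute bin;
# empty bins show up as 0 rather than NaN
counts = df1['qKfz_gesamt'].resample('5min').count()
print(counts)

# keep only bins that contain at least one original observation
means = df1['qKfz_gesamt'].resample('5min').mean()
print(means[counts > 0])
```

This is equivalent to the dropna() approach above, but makes the gap structure visible first.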
why does it resample until 2018-12-11 (11 days longer!) than the original dataset?
You need to filter by the maximum value of the index:
rng = pd.date_range('2018-11-25', periods=10)
df1 = pd.DataFrame({'a': range(10)}, index=rng)
print (df1)
a
2018-11-25 0
2018-11-26 1
2018-11-27 2
2018-11-28 3
2018-11-29 4
2018-11-30 5
2018-12-01 6
2018-12-02 7
2018-12-03 8
2018-12-04 9
df1 = df1.loc[:'2018-11-30']
print (df1)
a
2018-11-25 0
2018-11-26 1
2018-11-27 2
2018-11-28 3
2018-11-29 4
2018-11-30 5
Or:
df1 = df1.loc[df1.index <= '2018-11-30']
print (df1)
a
2018-11-25 0
2018-11-26 1
2018-11-27 2
2018-11-28 3
2018-11-29 4
2018-11-30 5
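Putting the two pieces together: filter out the out-of-range tail first, then resample, so the resampled index stops at the filtered maximum. A minimal sketch on the same toy frame:

```python
import pandas as pd

rng = pd.date_range('2018-11-25', periods=10)
df1 = pd.DataFrame({'a': range(10)}, index=rng)

# drop everything after 2018-11-30, then resample to 2-day bins;
# the resampled index now ends at the filtered maximum
trimmed = df1.loc[:'2018-11-30']
out = trimmed['a'].resample('2D').mean()
print(out)
```

Resampling the unfiltered frame instead would extend the bins out to the last (possibly stray) timestamp, which is exactly the "11 days longer" symptom.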

How would I get the values from a previously assigned index

I have two data frames; the first column is formed by taking the index values from the other data frame. This is tested and successfully returns 5 entries.
The second line executes, but assigns NaN to all rows in the "StartPrice" column:
df = pd.DataFrame()
df["StartBar"] = df_rs["HighTrendStart"].dropna().index # Works
df["StartPrice"] = df_rs["HighTrendStart"].loc[df["StartBar"]] # Assigns Nan's to all rows
As pointed out by #YOBEN_S, the indexes do not match.
Date
2020-05-01 00:00:00 NaN
2020-05-01 00:15:00 NaN
2020-05-01 00:30:00 NaN
2020-05-01 00:45:00 NaN
2020-05-01 01:00:00 NaN
Freq: 15T, Name: HighTrendStart, dtype: float64
0 2020-05-01 02:30:00
1 2020-05-01 06:30:00
2 2020-05-01 13:45:00
3 2020-05-01 16:15:00
4 2020-05-01 20:00:00
Name: StartBar, dtype: datetime64[ns]
The indexes do not match when you assign values from a different DataFrame; pandas aligns on the index, producing NaNs. Strip the index so the assignment happens by position:
df["StartPrice"] = df_rs["HighTrendStart"].loc[df["StartBar"]].to_numpy()
For example
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6]})
s = pd.Series([1, 2, 3, 4, 5, 6], index=list('abcdef'))
df
Out[190]:
a
0 1
1 2
2 3
3 4
4 5
5 6
s
Out[191]:
a 1
b 2
c 3
d 4
e 5
f 6
dtype: int64
df['New']=s
df
Out[193]:
a New
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 NaN
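To continue the example above: plain assignment aligns on index labels (0..5 vs. 'a'..'f' share nothing, hence all NaN), while converting to a NumPy array drops the index so values land by position. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6]})
s = pd.Series([1, 2, 3, 4, 5, 6], index=list('abcdef'))

# plain assignment aligns on index -> all NaN, since the labels don't overlap
df['aligned'] = s

# .to_numpy() strips the index, so values are assigned by position
df['positional'] = s.to_numpy()
print(df)
```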

pandas replace NaT with strings in columns when trying to get timedelta object days

I have the following df,
A B
3 days NaT
NaT 1 days
4 days 3 days
NaT NaT
the dtype of A and B is timedelta64[ns]. I am trying to get the days from each timedelta in the two columns, so first I tried to remove all the rows where A and B are both NaT,
daydelta = df.dropna(subset=['A', 'B'], how='all')
and then get days on each column value,
daydelta[['A', 'B']] = daydelta[['A', 'B']].applymap(lambda x: int(Timedelta(x).days))
but it failed since NaT has no days attribute. I am wondering how to get the days from each timedelta value, replacing NaT with a string where the timedelta does not exist.
Use dt.days, which works with NaT too:
print (df['A'].dt.days)
0 3.0
1 NaN
2 4.0
3 NaN
Name: A, dtype: float64
df[['A', 'B']] = df[['A', 'B']].apply(lambda x: x.dt.days)
print (df)
A B
0 3.0 NaN
1 NaN 1.0
2 4.0 3.0
3 NaN NaN
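If string placeholders are really wanted in place of the NaT-derived NaNs, they can be filled after extracting the days. A sketch; the 'missing' placeholder is an arbitrary choice, and note the resulting column dtype becomes object:

```python
import pandas as pd

df = pd.DataFrame({
    'A': pd.to_timedelta(['3 days', None, '4 days', None]),
    'B': pd.to_timedelta([None, '1 days', '3 days', None]),
})

# extract whole days; NaT becomes NaN, so the columns are float
days = df[['A', 'B']].apply(lambda x: x.dt.days)

# nullable Int64 keeps integers while representing missing entries as <NA> ...
days_int = days.astype('Int64')

# ... or replace missing entries with a placeholder string (dtype: object)
days_str = days_int.astype(object).where(days.notna(), 'missing')
print(days_str)
```

Keeping the nullable Int64 version is usually preferable to object columns for any further computation.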

how to groupby and calculate the percentage of non missing values in each column in pandas?

I have the following datadrame
var loyal_date
1 2017-01-17
1 2017-01-03
1 2017-01-11
1 NaT
1 NaT
2 2017-01-15
2 2017-01-07
2 NaT
2 NaT
2 NaT
I need to group by the var column and find the percentage of non-missing values in the loyal_date column for each group. Is there any way to do it using a lambda function?
try this:
In [59]: df
Out[59]:
var loyal_date
0 1 2017-01-17
1 1 2017-01-03
2 1 2017-01-11
3 1 NaT
4 1 NaT
5 2 2017-01-15
6 2 2017-01-07
7 2 NaT
8 2 NaT
9 2 NaT
In [60]: df.groupby('var')['loyal_date'].apply(lambda x: x.notnull().sum()/len(x)*100)
Out[60]:
var
1 60.0
2 40.0
Name: loyal_date, dtype: float64
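Since notnull() returns booleans, taking their mean gives the fraction directly; a slightly shorter equivalent, sketched on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'var': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'loyal_date': pd.to_datetime(['2017-01-17', '2017-01-03', '2017-01-11',
                                  None, None, '2017-01-15', '2017-01-07',
                                  None, None, None]),
})

# the mean of a boolean mask is the fraction of True values
pct = df.groupby('var')['loyal_date'].apply(lambda x: x.notna().mean() * 100)
print(pct)
```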

pandas group by date, assign value to a column

I have a DataFrame with columns = ['date','id','value'], where id represents different products. Assume that we have n products. I am looking to create a new dataframe with columns = ['date', 'valueid1', ..., 'valueidn'], where the values are assigned to the corresponding date row if they exist, and NaN is assigned if they don't. Many thanks
assuming you have the following DF:
In [120]: df
Out[120]:
date id value
0 2001-01-01 1 10
1 2001-01-01 2 11
2 2001-01-01 3 12
3 2001-01-02 3 20
4 2001-01-03 1 20
5 2001-01-04 2 30
you can use pivot_table() method:
In [121]: df.pivot_table(index='date', columns='id', values='value')
Out[121]:
id 1 2 3
date
2001-01-01 10.0 11.0 12.0
2001-01-02 NaN NaN 20.0
2001-01-03 20.0 NaN NaN
2001-01-04 NaN 30.0 NaN
or
In [122]: df.pivot_table(index='date', columns='id', values='value', fill_value=0)
Out[122]:
id 1 2 3
date
2001-01-01 10 11 12
2001-01-02 0 0 20
2001-01-03 20 0 0
2001-01-04 0 30 0
I think you need pivot:
df = df.pivot(index='date', columns='id', values='value')
Sample:
df = pd.DataFrame({'date':pd.date_range('2017-01-01', periods=5),
'id':[4,5,6,4,5],
'value':[7,8,9,1,2]})
print (df)
date id value
0 2017-01-01 4 7
1 2017-01-02 5 8
2 2017-01-03 6 9
3 2017-01-04 4 1
4 2017-01-05 5 2
df = df.pivot(index='date', columns='id', values='value')
#alternative solution
#df = df.set_index(['date','id'])['value'].unstack()
print (df)
id 4 5 6
date
2017-01-01 7.0 NaN NaN
2017-01-02 NaN 8.0 NaN
2017-01-03 NaN NaN 9.0
2017-01-04 1.0 NaN NaN
2017-01-05 NaN 2.0 NaN
but if you get:
ValueError: Index contains duplicate entries, cannot reshape
it is necessary to use an aggregating function like mean or sum with groupby or pivot_table:
df = pd.DataFrame({'date':['2017-01-01', '2017-01-02',
'2017-01-03','2017-01-05','2017-01-05'],
'id':[4,5,6,4,4],
'value':[7,8,9,1,2]})
df.date = pd.to_datetime(df.date)
print (df)
date id value
0 2017-01-01 4 7
1 2017-01-02 5 8
2 2017-01-03 6 9
3 2017-01-05 4 1 <- duplicity 2017-01-05 4
4 2017-01-05 4 2 <- duplicity 2017-01-05 4
df = df.groupby(['date', 'id'])['value'].mean().unstack()
#alternative solution (same result as the groupby, only slower on big DataFrames)
#df = df.pivot_table(index='date', columns='id', values='value', aggfunc='mean')
print (df)
id 4 5 6
date
2017-01-01 7.0 NaN NaN
2017-01-02 NaN 8.0 NaN
2017-01-03 NaN NaN 9.0
2017-01-05 1.5 NaN NaN <- 1.5 is mean (1 + 2)/2
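The two approaches agree; a quick sketch verifying that groupby + unstack and pivot_table with aggfunc='mean' produce the same frame on the duplicated data:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2017-01-01', '2017-01-02',
                                           '2017-01-03', '2017-01-05',
                                           '2017-01-05']),
                   'id': [4, 5, 6, 4, 4],
                   'value': [7, 8, 9, 1, 2]})

via_groupby = df.groupby(['date', 'id'])['value'].mean().unstack()
via_pivot = df.pivot_table(index='date', columns='id',
                           values='value', aggfunc='mean')

# both resolve the duplicate (2017-01-05, id 4) to the mean of 1 and 2
assert via_groupby.equals(via_pivot)
print(via_groupby.loc['2017-01-05', 4])  # 1.5
```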
