Setting the first value of a groupby to NaN - Python

I have a timeseries for different categories
cat date price
A 2000-01-01 100
A 2000-02-01 101
...
A 2010-12-01 140
B 2000-01-01 10
B 2000-02-01 10.4
...
B 2010-12-01 11.1
...
Z 2010-12-01 13.1
I need to compute returns on all assets, which is very quick using
df['ret'] = df['price'] / df['price'].shift(1) - 1
However, that also computes incorrect returns for the first element of each company (besides A), based on the last observation of the previous company. Therefore, I want to set the first observation in each category to NaN.
It is easy to get these observations using
df.groupby('cat')['ret'].first()
but I am a bit lost on how to set them.
df.groupby('cat')['ret'].first() = np.NaN
and
df.loc[df.groupby('cat')['ret'].first(), 'ret']=np.NaN
did not lead anywhere.

For setting the first value per group to a missing value, use Series.duplicated:
df.loc[~df['cat'].duplicated(), 'ret']=np.NaN
But it seems you need DataFrame.sort_values with GroupBy.pct_change:
df = df.sort_values(['cat','date'])
df['ret1'] = df.groupby('cat')['price'].pct_change()
Your solution should be changed to use DataFrameGroupBy.shift:
df['ret2'] = df['price'] / df.groupby('cat')['price'].shift(1) - 1
print (df)
cat date price ret1 ret2
0 A 2000-01-01 100.0 NaN NaN
1 A 2000-02-01 101.0 0.010000 0.010000
2 A 2010-12-01 140.0 0.386139 0.386139
3 B 2000-01-01 10.0 NaN NaN
4 B 2000-02-01 10.4 0.040000 0.040000
5 B 2010-12-01 11.1 0.067308 0.067308
6 Z 2010-12-01 13.1 NaN NaN

Try this
df.sort_values('date').groupby('cat')['price'].pct_change()
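For completeness, here is a minimal, self-contained sketch of the sort-then-groupby approach from the answers above, using made-up toy data rather than the asker's actual frame:
import pandas as pd

# hypothetical toy data in the same shape as the question
df = pd.DataFrame({'cat':   ['A', 'A', 'A', 'B', 'B'],
                   'date':  pd.to_datetime(['2000-01-01', '2000-02-01', '2000-03-01',
                                            '2000-01-01', '2000-02-01']),
                   'price': [100.0, 101.0, 103.0, 10.0, 10.4]})

# sort within each category, then compute per-category returns;
# the first row of each category comes out as NaN automatically
df = df.sort_values(['cat', 'date'])
df['ret'] = df.groupby('cat')['price'].pct_change()
print(df)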

Related

assign multiple rolling sums at once with pandas

I have a dataframe with counts in one column and I would like to assign several cumulative sums of this column at once. I tried the below code but unfortunately it gives me only the last cumulative sum for all columns.
d = pd.DataFrame({'counts':[242,99,2,13,0]})
kwargs = {f"cumulative_{i}" : lambda x: x['counts'].shift(1).rolling(i).sum() for i in range(1,4)}
d.assign(**kwargs)
this is what it gives me
counts cumulative_1 cumulative_2 cumulative_3
0 242 NaN NaN NaN
1 99 NaN NaN NaN
2 2 NaN NaN NaN
3 13 343.0 343.0 343.0
4 0 114.0 114.0 114.0
but I would like to get this
counts cumulative_1 cumulative_2 cumulative_3
0 242 NaN NaN NaN
1 99 242.0 NaN NaN
2 2 99.0 341.0 NaN
3 13 2.0 101.0 343.0
4 0 13.0 15.0 114.0
what can I change to get the above?
The variable i used in the lambda lives in the enclosing scope and is not captured at definition time; it is looked up when the lambda is called, i.e. it always evaluates to 3, its value when the loop ends. To capture i at definition time, you can define a wrapper function that takes i on each iteration of the loop and returns a lambda that picks up the correct i from its enclosing environment:
def roll_i(i):
    return lambda x: x['counts'].shift(1).rolling(i).sum()
kwargs = {f"cumulative_{i}" : roll_i(i) for i in range(1,4)}
d.assign(**kwargs)
counts cumulative_1 cumulative_2 cumulative_3
0 242 NaN NaN NaN
1 99 242.0 NaN NaN
2 2 99.0 341.0 NaN
3 13 2.0 101.0 343.0
4 0 13.0 15.0 114.0
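A shorter alternative, if you prefer to keep the dict comprehension (a sketch I'm adding, not part of the original answer), is to bind i as a default argument, which is evaluated when each lambda is defined rather than when it is called:
kwargs = {f"cumulative_{i}": (lambda x, i=i: x['counts'].shift(1).rolling(i).sum())
          for i in range(1, 4)}
d.assign(**kwargs)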

Calculating groupby change with pandas for dataframe with other string type columns

I came across this question which is pretty similar to what I'm trying to do:
python pandas groupby calculate change
The only problem is that my dataframe is much more complex, as it has a bunch more value columns that I also want to calculate the differences for, and a few columns of string type which I need to keep, but I obviously can't calculate the numerical difference of those.
Group Date Value Leader Quantity
A 01-02-2016 16.0 John 1
A 01-03-2016 15.0 John 1
B 01-02-2016 16.0 Phillip 1
B 01-03-2016 13.0 Phillip 1
C 01-02-2016 16.0 Bob 1
C 01-03-2016 16.0 Bob 1
Is there a way to alter the code so that I can just make the difference apply to the float type values, rather than having to specify which columns are the float type ones by using loc/iloc? So I'd get something like this:
Date Group Change in Value Leader Change in Quantity
2016-01-02 A NaN John NaN
2016-01-03 A -0.062500 John 0
2016-01-02 B NaN Phillip NaN
2016-01-03 B -0.187500 Phillip 0
2016-01-02 C NaN Bob NaN
2016-01-03 C 0.000000 Bob 0
Additionally, would it be possible to change the pct_change to diff? So ideally I'd get something like this:
Date Group Leader Change in Value Change in Quantity
2016-01-02 A John NaN NaN
2016-01-03 A John -1.0 0.0
2016-01-02 B Phillip NaN NaN
2016-01-03 B Phillip -3.0 0.0
2016-01-02 C Bob NaN NaN
2016-01-03 C Bob 0.0 0.0
Extra details about my actual dataset:
For each group, there are two rows (there are only two dates being considered)
Ideally I want to then be able to slice through the rows so I delete all the ones with the NaN values
I need all the numerical values to display as floats for consistency
Thanks in advance!
Use select_dtypes and join
df1 = df.select_dtypes('number')
df_final = df.drop(columns=df1.columns).join(df1.groupby(df['Group'])
                                                .pct_change().add_prefix('Change_in_'))
Out[10]:
Group Date Leader Change_in_Value Change_in_Quantity
0 A 01-02-2016 John NaN NaN
1 A 01-03-2016 John -0.0625 0.0
2 B 01-02-2016 Phillip NaN NaN
3 B 01-03-2016 Phillip -0.1875 0.0
4 C 01-02-2016 Bob NaN NaN
5 C 01-03-2016 Bob 0.0000 0.0
Using diff: just replace pct_change with diff:
df1 = df.select_dtypes('number')
df_final = df.drop(columns=df1.columns).join(df1.groupby(df['Group'])
                                                .diff().add_prefix('Change_in_'))
Out[15]:
Group Date Leader Change_in_Value Change_in_Quantity
0 A 01-02-2016 John NaN NaN
1 A 01-03-2016 John -1.0 0.0
2 B 01-02-2016 Phillip NaN NaN
3 B 01-03-2016 Phillip -3.0 0.0
4 C 01-02-2016 Bob NaN NaN
5 C 01-03-2016 Bob 0.0 0.0
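As a follow-up to the asker's note about deleting the rows with NaN values, one option (a sketch assuming df_final from above, not part of the original answer) is to drop rows that are null in the change columns:
change_cols = [c for c in df_final.columns if c.startswith('Change_in_')]
df_final = df_final.dropna(subset=change_cols)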
You can just do:
cols = []
for col in df.columns:
    if str(col).startswith('Value'):
        cols.append(col)
for i in range(len(cols) - 1):
    df["Change " + str(i)] = (df["Value " + str(i)] - df["Value " + str(i)].shift(-1)) / df["Value " + str(i)]

Python/Pandas - Is there a way to make mean() return NaN when there is only one value to calculate?

This is a hypothetical example of what my data frame looks like:
>>df
1A 1B 1C 2A 2B 2C 3A 3B 3C
P1 11 13 15 11 9.7 12 12.3 22.6 22.4
P2 11 0 15 0 0 12 0 0 0
P3 NaN 25 12 NaN NaN 12 NaN NaN NaN
P4 11 NaN 12 9 NaN NaN NaN NaN NaN
P5 11 NaN NaN NaN 12 NaN NaN NaN 12.3
I'm currently averaging every three columns in each row by using
df_avg = df.groupby(np.arange(len(df.columns))//3, axis=1).mean()
Which will give me the resulting data frame,
0 1 2
P1 13.0 10.9 19.1
P2 8.7 4.0 0.0
P3 18.5 12.0 NaN
P4 11.5 9.0 NaN
P5 11.0 12.0 12.3
The mean() appears to divide by two when there are only 2 values and 1 NaN, which is what I want.
Plus, those with 3 NaN returns NaN and that is also good.
However, the last value of P5 is 12.3, which comes from having two NaN and 12.3 as the only value (the same happens in other cases).
That is not an average, and I wish to remove any sites with 2 NaNs or make them return NaN.
What would be the best way to preserve "average every group of 3 cells", "divide groups with 2 values and one NaN by 2", and "groups with three NaN should return NaN", while also making groups with only one real value and two NaN return NaN?
One way I could think of was to use the output of np.arange(len(df.columns))//3 to make a new row, then write a function that uses groupby and mean with the conditions I want; however, my skills aren't quite up to working out how that code should roughly look, and it doesn't seem like the easiest way to do this either.
Sorry for the hassle and thank you in advance.
In your case, use min_count:
g=df.groupby(np.arange(len(df.columns))//3, axis=1)
g.sum(min_count=2)/g.count()
Out[213]:
0 1 2
P1 13.000000 10.9 19.1
P2 8.666667 4.0 0.0
P3 18.500000 NaN NaN
P4 11.500000 NaN NaN
P5 NaN NaN NaN
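For context (an illustration I'm adding, not from the original answer): Series.sum with min_count returns NaN whenever fewer than min_count non-NA values are present, which is what turns the one-value groups into NaN before the division by the count:
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 12.3])
print(s.sum(min_count=2))   # nan, only one non-NA value is present
print(s.sum(min_count=1))   # 12.3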
We can get booleans back with DataFrame.isna and check whether the sum over the row axis (axis=1) is greater than or equal to (ge) 2; in other words, whether the number of NaN per group is >= 2. If so, we mask them with NaN:
grps = df.groupby(np.arange(df.shape[1])//3, axis=1)
mask = grps.apply(lambda x: x.isna().sum(axis=1)).ge(2)
df = grps.mean().mask(mask)
0 1 2
P1 13.00 10.90 19.10
P2 8.67 4.00 0.00
P3 18.50 nan nan
P4 11.50 nan nan
P5 nan nan nan
I'm not sure there's a built-in function. Here's a quick fix:
m = (df.T.groupby(np.arange(len(df.columns))//3)  # transpose and groupby because
         .agg(['count', 'mean'])                  # agg only allows groupby with axis=0
         .swaplevel(0, 1, axis=1)                 # make 'count' and 'mean' the first level for easy access
         .T                                       # transpose back
    )
df_avg = m.loc['mean'].mask(m.loc['count']==1, np.nan)
Output:
0 1 2
P1 13.000000 10.9 19.1
P2 8.666667 4.0 0.0
P3 18.500000 NaN NaN
P4 11.500000 NaN NaN
P5 NaN NaN NaN
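A variant of the same idea (a sketch I'm adding, assuming the same df and the groupby(..., axis=1) support used throughout this question) masks the group means wherever a group has fewer than two non-NA values:
g = df.groupby(np.arange(len(df.columns)) // 3, axis=1)
df_avg = g.mean().mask(g.count() < 2)   # g.count() counts non-NA values per group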

Counting Number of Occurrences Between Dates (Given an ID value) From Another Dataframe

Pandas: select DF rows based on another DF is the closest answer I can find to my question, but I don't believe it quite solves it.
Anyway, I am working with two very large pandas dataframes (so speed is a consideration), df_emails and df_trips, both of which are already sorted by CustID and then by date.
df_emails includes the date we sent a customer an email and it looks like this:
CustID DateSent
0 2 2018-01-20
1 2 2018-02-19
2 2 2018-03-31
3 4 2018-01-10
4 4 2018-02-26
5 5 2018-02-01
6 5 2018-02-07
df_trips includes the dates a customer came to the store and how much they spent, and it looks like this:
CustID TripDate TotalSpend
0 2 2018-02-04 25
1 2 2018-02-16 100
2 2 2018-02-22 250
3 4 2018-01-03 50
4 4 2018-02-28 100
5 4 2018-03-21 100
6 8 2018-01-07 200
Basically, what I need to do is find the number of trips and total spend for each customer in between each email sent. If it is the last time an email is sent for a given customer, I need to find the total number of trips and total spend after the email, but before the end of the data (2018-04-01). So the final dataframe would look like this:
CustID DateSent NextDateSentOrEndOfData TripsBetween TotalSpendBetween
0 2 2018-01-20 2018-02-19 2.0 125.0
1 2 2018-02-19 2018-03-31 1.0 250.0
2 2 2018-03-31 2018-04-01 0.0 0.0
3 4 2018-01-10 2018-02-26 0.0 0.0
4 4 2018-02-26 2018-04-01 2.0 200.0
5 5 2018-02-01 2018-02-07 0.0 0.0
6 5 2018-02-07 2018-04-01 0.0 0.0
Though I have tried my best to do this in a Python/Pandas friendly way, the only accurate solution I have been able to implement is through an np.where, shifting, and looping. The solution looks like this:
df_emails["CustNthVisit"] = df_emails.groupby("CustID").cumcount()+1
df_emails["CustTotalVisit"] = df_emails.groupby("CustID")["CustID"].transform('count')
df_emails["NextDateSentOrEndOfData"] = pd.to_datetime(df_emails["DateSent"].shift(-1)).where(df_emails["CustNthVisit"] != df_emails["CustTotalVisit"], pd.to_datetime('04-01-2018'))
for i in df_emails.index:
    df_emails.at[i, "TripsBetween"] = len(df_trips[(df_trips["CustID"] == df_emails.at[i, "CustID"]) & (df_trips["TripDate"] > df_emails.at[i, "DateSent"]) & (df_trips["TripDate"] < df_emails.at[i, "NextDateSentOrEndOfData"])])
for i in df_emails.index:
    df_emails.at[i, "TotalSpendBetween"] = df_trips[(df_trips["CustID"] == df_emails.at[i, "CustID"]) & (df_trips["TripDate"] > df_emails.at[i, "DateSent"]) & (df_trips["TripDate"] < df_emails.at[i, "NextDateSentOrEndOfData"])].TotalSpend.sum()
df_emails.drop(['CustNthVisit',"CustTotalVisit"], axis=1, inplace=True)
However, a %%timeit has revealed that this takes 10.6ms on just the seven rows shown above, which makes this solution pretty much infeasible on my actual datasets of about 1,000,000 rows. Does anyone know a solution here that is faster and thus feasible?
Add the next date column to emails
df_emails["NextDateSent"] = df_emails.groupby("CustID").shift(-1)
Sort for merge_asof and then merge to nearest to create a trip lookup table
df_emails = df_emails.sort_values("DateSent")
df_trips = df_trips.sort_values("TripDate")
df_lookup = pd.merge_asof(df_trips, df_emails, by="CustID", left_on="TripDate",right_on="DateSent", direction="backward")
Aggregate the lookup table for the data you want.
df_lookup = df_lookup.loc[:, ["CustID", "DateSent", "TotalSpend"]].groupby(["CustID", "DateSent"]).agg(["count","sum"])
Left join it back to the email table.
df_merge = df_emails.join(df_lookup, on=["CustID", "DateSent"]).sort_values("CustID")
I choose to leave NaNs as NaNs because I don't like filling default values (you can always do that later if you prefer, but you can't easily distinguish between things that existed vs things that didn't if you put defaults in early)
CustID DateSent NextDateSent (TotalSpend, count) (TotalSpend, sum)
0 2 2018-01-20 2018-02-19 2.0 125.0
1 2 2018-02-19 2018-03-31 1.0 250.0
2 2 2018-03-31 NaT NaN NaN
3 4 2018-01-10 2018-02-26 NaN NaN
4 4 2018-02-26 NaT 2.0 200.0
5 5 2018-02-01 2018-02-07 NaN NaN
6 5 2018-02-07 NaT NaN NaN
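If you want the exact output shape from the question, a possible cleanup (my own sketch, assuming df_merge from above and the 2018-04-01 cut-off stated in the question) is to flatten the column names and fill the gaps afterwards:
df_merge.columns = ["CustID", "DateSent", "NextDateSentOrEndOfData",
                    "TripsBetween", "TotalSpendBetween"]
df_merge["NextDateSentOrEndOfData"] = df_merge["NextDateSentOrEndOfData"].fillna(pd.Timestamp("2018-04-01"))
df_merge[["TripsBetween", "TotalSpendBetween"]] = df_merge[["TripsBetween", "TotalSpendBetween"]].fillna(0)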
This would be an easy case of merge_asof had I been able to handle the max_date, so I go a long way:
max_date = pd.to_datetime('2018-04-01')
# set_index for easy extraction by id
df_emails.set_index('CustID', inplace=True)
# we want this later in the final output
df_emails['NextDateSentOrEndOfData'] = df_emails.groupby('CustID').shift(-1).fillna(max_date)
# cuts function for groupby
def cuts(df):
    custID = df.CustID.iloc[0]
    bins = list(df_emails.loc[[custID], 'DateSent']) + [max_date]
    return pd.cut(df.TripDate, bins=bins, right=False)
# bin the dates:
s = df_trips.groupby('CustID', as_index=False, group_keys=False).apply(cuts)
# aggregate the info:
new_df = (df_trips.groupby([df_trips.CustID, s])
                  .TotalSpend.agg(['sum', 'size'])
                  .reset_index()
         )
# get the right limit:
new_df['NextDateSentOrEndOfData'] = new_df.TripDate.apply(lambda x: x.right)
# drop the unnecessary info
new_df.drop('TripDate', axis=1, inplace=True)
# merge:
df_emails.reset_index().merge(new_df,
                              on=['CustID', 'NextDateSentOrEndOfData'],
                              how='left')
Output:
CustID DateSent NextDateSentOrEndOfData sum size
0 2 2018-01-20 2018-02-19 125.0 2.0
1 2 2018-02-19 2018-03-31 250.0 1.0
2 2 2018-03-31 2018-04-01 NaN NaN
3 4 2018-01-10 2018-02-26 NaN NaN
4 4 2018-02-26 2018-04-01 200.0 2.0
5 5 2018-02-01 2018-02-07 NaN NaN
6 5 2018-02-07 2018-04-01 NaN NaN

Pandas corr() returning NaN too often

I'm attempting to run what I think should be a simple correlation function on a dataframe but it is returning NaN in places where I don't believe it should.
Code:
# setup
import pandas as pd
import io
csv = io.StringIO(u'''
id date num
A 2018-08-01 99
A 2018-08-02 50
A 2018-08-03 100
A 2018-08-04 100
A 2018-08-05 100
B 2018-07-31 500
B 2018-08-01 100
B 2018-08-02 100
B 2018-08-03 0
B 2018-08-05 100
B 2018-08-06 500
B 2018-08-07 500
B 2018-08-08 100
C 2018-08-01 100
C 2018-08-02 50
C 2018-08-03 100
C 2018-08-06 300
''')
df = pd.read_csv(csv, sep = '\t')
# Format manipulation
df = df[df['num'] > 50]
df = df.pivot(index = 'date', columns = 'id', values = 'num')
df = pd.DataFrame(df.to_records())
# Main correlation calculations
print(df.iloc[:, 1:].corr())
Subject DataFrame:
A B C
0 NaN 500.0 NaN
1 99.0 100.0 100.0
2 NaN 100.0 NaN
3 100.0 NaN 100.0
4 100.0 NaN NaN
5 100.0 100.0 NaN
6 NaN 500.0 300.0
7 NaN 500.0 NaN
8 NaN 100.0 NaN
corr() Result:
A B C
A 1.0 NaN NaN
B NaN 1.0 1.0
C NaN 1.0 1.0
According to the (limited) documentation on the function, it should exclude "NA/null values". Since there are overlapping values for each column, should the result not all be non-NaN?
There are good discussions here and here, but neither answered my question. I've tried the float64 idea discussed here, but that failed as well.
#hellpanderr's comment brought up a good point, I'm using 0.22.0
Bonus question - I'm no mathematician, but how is there a 1:1 correlation between B and C in this result?
The result seems to be an artefact of the data you work with. As you write, NAs are ignored, so it basically boils down to:
df[['B', 'C']].dropna()
B C
1 100.0 100.0
6 500.0 300.0
So, there are only two values per column left for the calculation, which should therefore lead to correlation coefficients of 1:
df[['B', 'C']].dropna().corr()
B C
B 1.0 1.0
C 1.0 1.0
So, where do the NAs then come from for the remaining combinations?
df[['A', 'B']].dropna()
A B
1 99.0 100.0
5 100.0 100.0
df[['A', 'C']].dropna()
A C
1 99.0 100.0
3 100.0 100.0
So, also here you end up with only two values per column. The difference is that in these pairings the columns B and C contain only one distinct value (100), which gives a standard deviation of 0:
df[['A', 'C']].dropna().std()
A 0.707107
C 0.000000
When the correlation coefficient is calculated, you divide by the standard deviations, so a standard deviation of zero leads to NaN.
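A minimal illustration of that last point (my own example, not from the post): correlating anything against a constant series divides by a standard deviation of zero and yields NaN:
import pandas as pd

a = pd.Series([99.0, 100.0])
c = pd.Series([100.0, 100.0])
print(c.std())    # 0.0
print(a.corr(c))  # nan, the Pearson formula divides by the standard deviations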
