I have a dataframe like this
df = pd.DataFrame({'id': [205, 205, 205, 211, 211, 211],
                   'date': pd.to_datetime(['2019-12-01', '2020-01-01', '2020-02-01',
                                           '2019-12-01', '2020-01-01', '2020-03-01'])})
df
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
3 211 2019-12-01
4 211 2020-01-01
5 211 2020-03-01
where the date column consists of consecutive months for id 205 but not for id 211.
I want to keep only the ids for which I have monthly data without gaps. In this example I want:
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
Here is how I am collecting the ids to keep:
keep_id = []
for num in pd.unique(df['id']):
    temp = (df.loc[df['id']==num,'date'].dt.year - df.loc[df['id']==num,'date'].shift(1).dt.year) * 12 + df.loc[df['id']==num,'date'].dt.month - df.loc[df['id']==num,'date'].shift(1).dt.month
    temp.values[0] = 1.0  # correct the first entry, which is NaN after the shift
    if (temp == 1.).all():
        keep_id.append(num)
where the expression (year - previous year) * 12 + (month - previous month) computes the difference in months from the previous date for every id.
This seems to work when tested on a small portion of df, but I'm sure there is a better way of doing this, maybe using the .groupby() method.
Since df is made of millions of observations my code takes too much time (and I'd like to learn a more efficient and pythonic way of doing this)
What you want to do is use groupby-filter rather than a groupby apply.
df.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())
provides exactly:
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
And indeed, I would keep the index as it is rather than resetting it; it retains useful information about the original rows.
Both this answer and Michael's are correct in terms of output, and their performance is very similar as well:
%timeit df.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())
1.48 ms ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
and
%timeit df[df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))]
1.7 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For most operations, this difference is negligible.
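A further hedged sketch of my own (not from either answer): the 32-day Timedelta is a heuristic, so if you want to enforce strictly consecutive calendar months you can encode each date as an absolute month number and require every within-id gap to be exactly one.
# encode each date as an absolute month number, then require every
# within-id gap to be exactly 1 month (no 32-day heuristic)
months = df['date'].dt.year * 12 + df['date'].dt.month
gaps = months.groupby(df['id']).diff()          # NaN for the first row of each id
keep = gaps.fillna(1).eq(1).groupby(df['id']).transform('all')
df[keep]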
You can use the following approach. It was only about 3x faster in my tests.
df[df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))]
Out:
    id       date
0  205 2019-12-01
1  205 2020-01-01
2  205 2020-02-01
Related
I have a dataframe that has similar ids with spatiotemporal data like below:
car_id lat long
xxx 32 150
xxx 33 160
yyy 20 140
yyy 22 140
zzz 33 70
zzz 33 80
. . .
I want to replace car_id with car_1, car_2, car_3, and so on. However, my dataframe is large and it's not possible to do it manually by name, so first I made a list of all unique values in the car_id column and a list of the names they should be replaced with:
u_values = [i for i in df['car_id'].unique()]
r = ['car'+str(i) for i in range(len(u_values))]
Now I'm not sure how to map each unique value in the car_id column to the corresponding entry of the name list, so the result looks like this:
car_id lat long
car_1 32 150
car_1 33 160
car_2 20 140
car_2 22 140
car_3 33 70
car_3 33 80
. . .
The answers so far seem a little complicated to me, so here's another suggestion. This creates a dictionary that has the old name as the keys and the new name as the values. That can be used to map the old values to new values.
r={k:'car_{}'.format(i) for i,k in enumerate(df['car_id'].unique())}
df['car_id'] = df['car_id'].map(r)
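One small note: enumerate starts at 0, so this produces car_0, car_1, and so on. If you want the numbering to start at 1 as in the desired output, you can pass a start value (a minor tweak of mine, not part of the original answer):
# start counting at 1 so the first unique id becomes 'car_1'
r = {k: f'car_{i}' for i, k in enumerate(df['car_id'].unique(), start=1)}
df['car_id'] = df['car_id'].map(r)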
Edit: the answer using factorize is probably better, even though I think this one is a bit easier to read.
Create a mapping from u_values to r and map it to car_id column. Also simplify the definition of u_values and r by using tolist() method and f-strings, respectively.
u_values = df['car_id'].unique().tolist()
r = [f'car_{i}' for i in range(len(u_values))]
mapping = pd.Series(r, index=u_values)
df['car_id'] = df['car_id'].map(mapping)
That said, vectorized string concatenation is enough for this task; the factorize() method encodes the strings as integer codes.
df['car_id'] = 'car_' + pd.Series(df['car_id'].factorize()[0], dtype='string')
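For reference, here is a small illustration of what factorize() returns; the integer codes in the first element are what gets concatenated with the 'car_' prefix (a toy example of my own):
import pandas as pd

codes, uniques = pd.factorize(pd.Series(['xxx', 'xxx', 'yyy', 'zzz']))
print(codes)    # [0 0 1 2] -> one integer code per row
print(uniques)  # the distinct values, in order of first appearance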
When I timed some of these methods (I omitted Juan Manuel Rivera's solution because replace is very slow and takes forever on larger data), the map() implementation built on the OP's code turned out to be the fastest.
The factorize() implementation, while concise, is not fast after all. Also I agree with pasnik that their solution is the easiest to read.
# a dataframe with 500k rows and 100k unique car_ids
df = pd.DataFrame({'car_id': np.random.default_rng().choice(100000, size=500000)})
%timeit u_values = df['car_id'].unique().tolist(); r = [f'car_{i}' for i in range(len(u_values))]; mapping = pd.Series(r, index=u_values); df.assign(car_id=df['car_id'].map(mapping))
# 136 ms ± 2.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.assign(car_id = 'car_' + pd.Series(df['car_id'].factorize()[0], dtype='string'))
# 602 ms ± 19.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit r={k:'car_{}'.format(i) for i,k in enumerate(df['car_id'].unique())}; df.assign(car_id=df['car_id'].map(r))
# 196 ms ± 3.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
It may be easier if you use a dictionary to maintain the relation between each unique value (xxx, yyy, ...) and the new id you want (car_1, car_2, car_3, ...):
newIdDict = {}
idCounter = 1
for i in df['car_id'].unique():
    if i not in newIdDict:
        newIdDict[i] = 'car_' + str(idCounter)
        idCounter += 1
Then, you can use the pandas replace function to change the values in the car_id column:
df['car_id'] = df['car_id'].replace(newIdDict)
Take into account that if you applied replace to the whole dataframe instead of just the car_id column, any matching xxx or yyy values in the lat and long columns would be modified as well.
Here's the thing, I have the dataset below where date is the index:
date value
2020-01-01 100
2020-02-01 140
2020-03-01 156
2020-04-01 161
2020-05-01 170
.
.
.
And I want to transform it in this other dataset:
value_t0 value_t1 value_t2 value_t3 value_t4 ...
100 NaN NaN NaN NaN ...
140 100 NaN NaN NaN ...
156 140 100 NaN NaN ...
161 156 140 100 NaN ...
170 161 156 140 100 ...
First I thought about using pandas.pivot_table, but that would just give a different layout grouped by some column, which is not exactly what I want. Later I thought about using pandasql and applying a CASE WHEN, but that would require typing dozens of lines of code. So I'm stuck here.
try this:
new_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})
The Series .shift(n) method can produce a single column of your desired output by shifting everything down n rows and filling the top with NaNs. So we build a new dataframe by feeding it a dictionary of the form {column name: column data, ...}, using a dictionary comprehension to iterate over the shifts of your original column.
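To make the building block concrete, here is what a single shift does on the sample values (just an illustration):
import pandas as pd

s = pd.Series([100, 140, 156, 161, 170], name='value')
print(s.shift(1))
# 0      NaN
# 1    100.0
# 2    140.0
# 3    156.0
# 4    161.0
# Name: value, dtype: float64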
I think the best approach is to use numpy:
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0], 1)] = np.nan
new_df = pd.DataFrame(new_values).add_prefix('value_t')
Times for 5000 rows
%%timeit
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0],1)] = np.nan
new_df = pd.DataFrame(new_values).add_prefix('value_t')
556 ms ± 35.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
new_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})
1.31 s ± 36.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Time without add_prefix
%%timeit
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0],1)] = np.nan
new_df = pd.DataFrame(new_values)
357 ms ± 8.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
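One caveat worth adding (my note, not part of the answer): both solutions materialize an n x n block, so memory grows quadratically with the number of rows.
# back-of-the-envelope: for 5,000 rows of float64, the full square matrix alone is
# 5000 * 5000 * 8 bytes, roughly 200 MB, before any pandas overhead
n = 5000
print(n * n * 8 / 1e6, "MB")  # 200.0 MB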
I'm trying to get the mean of each column while grouping by id, BUT only the middle 50% of values, between the 25% quantile and the 75% quantile, should be used for the calculation (so ignore the lowest 25% of values and the highest 25%).
The data:
ID Property3 Property2 Property1
1 10.2 ... ...
1 20.1
1 51.9
1 15.8
1 12.5
...
1203 104.4
1203 11.5
1203 19.4
1203 23.1
What I tried:
data.groupby('id').quantile(0.75).mean()
# data.groupby('id').agg(lambda grp: grp.quantile(0.25, 0.75)).mean()  # something like that?
CW 67.089733
fd 0.265917
fd_maxna -1929.522001
fd_maxv -1542.468399
fd_sumna -1928.239954
fd_sumv -1488.165382
planc -13.165445
slope 13.654163
Something like that, but as far as I know GroupBy.quantile doesn't accept an in-between range, and I don't know how to also remove the lower 25%.
This also doesn't return a dataframe.
What I want
Ideally, I would like to have a dataframe as follows:
ID Property3 Property2 Property1
1 37.8 5.6 2.3
2 33.0 1.5 10.4
3 34.9 91.5 10.3
4 33.0 10.3 14.3
where only the data between the 25% quantile and the 75% quantile are used for the mean calculation, i.e. only the middle 50%.
Using GroupBy.apply here can be slow, so let's avoid it. I suppose this is your data frame:
print(df)
ID Property3 Property2 Property1
0 1 10.2 58.337589 45.083237
1 1 20.1 70.844807 29.423138
2 1 51.9 67.126043 90.558225
3 1 15.8 17.478715 41.492485
4 1 12.5 18.247211 26.449900
5 1203 104.4 113.728439 130.698964
6 1203 11.5 29.659894 45.991533
7 1203 19.4 78.910591 40.049054
8 1203 23.1 78.395974 67.345487
So I would use GroupBy.cumcount + DataFrame.pivot_table to calculate the quantiles without using apply:
df['aux'] = df.groupby('ID').cumcount()
new_df = df.pivot_table(columns='ID', index='aux', values=['Property1', 'Property2', 'Property3'])
print(new_df)
Property1 Property2 Property3
ID 1 1203 1 1203 1 1203
aux
0 45.083237 130.698964 58.337589 113.728439 10.2 104.4
1 29.423138 45.991533 70.844807 29.659894 20.1 11.5
2 90.558225 40.049054 67.126043 78.910591 51.9 19.4
3 41.492485 67.345487 17.478715 78.395974 15.8 23.1
4 26.449900 NaN 18.247211 NaN 12.5 NaN
# remove the aux column
df = df.drop('aux', axis=1)
Now we calculate the mean using boolean indexing:
new_df[(new_df.quantile(0.75)>new_df)&( new_df>new_df.quantile(0.25) )].mean()
ID
Property1 1 59.963006
1203 70.661294
Property2 1 49.863814
1203 45.703292
Property3 1 15.800000
1203 21.250000
dtype: float64
or create DataFrame with the mean:
mean_df=( new_df[(new_df.quantile(0.75)>new_df)&( new_df>new_df.quantile(0.25) )].mean()
.rename_axis(index=['Property','ID'])
.unstack('Property') )
print(mean_df)
Property Property1 Property2 Property3
ID
1 41.492485 58.337589 15.80
1203 56.668510 78.653283 21.25
Measure times:
%%timeit
df['aux']=df.groupby('ID').cumcount()
new_df=df.pivot_table(columns='ID',index='aux',values=['Property1','Property2','Property3'])
df=df.drop('aux',axis=1)
( new_df[(new_df.quantile(0.75)>new_df)&( new_df>new_df.quantile(0.25) )].mean()
.rename_axis(index=['Property','ID'])
.unstack('Property') )
25.2 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
def mean_of_25_to_75_pct(s: pd.Series):
    low, high = s.quantile(.25), s.quantile(.75)
    return s.loc[(s >= low) & (s < high)].mean()

df.groupby("ID").apply(lambda x: x.apply(mean_of_25_to_75_pct))
33 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
def filter_mean(df):
    bounds = df.quantile([.25, .75])
    mask = (df < bounds.loc[0.75]) & (df > bounds.loc[0.25])
    return df[mask].mean()

means = df.groupby("ID").apply(filter_mean)
23 ms ± 809 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
It is almost as fast even on this small data frame; on larger data frames, such as your original one, it should be much faster than the other proposed methods (see: when to use apply).
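As yet another sketch (my own, not part of the answer above): the per-ID quantile bounds can also be broadcast back onto the rows with transform, after which a plain mask-and-mean avoids both apply and the pivot.
# broadcast per-ID 25%/75% bounds onto every row, mask, then take the group mean
vals = df.drop(columns='ID')
grouped = vals.groupby(df['ID'])
lower = grouped.transform(lambda s: s.quantile(0.25))
upper = grouped.transform(lambda s: s.quantile(0.75))
means = vals.where((vals > lower) & (vals < upper)).groupby(df['ID']).mean()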
You can use the quantile function to return multiple quantiles. Then, you can filter out values based on this, and compute the mean:
def filter_mean(df):
    bounds = df.quantile([.25, .75])
    mask = (df < bounds.loc[0.75]) & (df > bounds.loc[0.25])
    return df[mask].mean()

means = data.groupby("id").apply(filter_mean)
Please try this.
def mean_of_25_to_75_pct(s: pd.Series):
    low, high = s.quantile(.25), s.quantile(.75)
    return s.loc[(s >= low) & (s < high)].mean()

data.groupby("id").apply(lambda x: x.apply(mean_of_25_to_75_pct))
You could use scipy's ready-made function for the trimmed mean, trim_mean():
from scipy import stats
means = data.groupby("id").apply(stats.trim_mean, 0.25)
If you insist on getting a dataframe, you could:
data.groupby("id").agg(lambda x: stats.trim_mean(x, 0.25)).reset_index()
I am calculating 48 derived pandas columns by iterating and calculating one column at a time, but I need to speed up the process. What is the best way to make this faster and more efficient? Each column calculates the closing price as a percentage of the period's (T, T-1, T-2, etc.) high-low range.
The code I am currently using is:
# get the last x closes as a percentage of the period's high-low range
for i in range(1, 49, 1):
    df.loc[:, 'Close_T_period_' + str(i)] = ((df['BidClose'].shift(i).values
                                              - df['BidLow'].shift(i).values) /
                                             (df['BidHigh'].shift(i).values - df['BidLow'].shift(i).values))
Input dataframe sample:
BidOpen BidHigh BidLow BidClose AskOpen AskHigh AskLow AskClose Volume
Date
2019-09-27 09:00:00 1.22841 1.22919 1.22768 1.22893 1.22850 1.22927 1.22777 1.22900 12075.0
2019-09-27 10:00:00 1.22893 1.23101 1.22861 1.23058 1.22900 1.23110 1.22870 1.23068 16291.0
2019-09-27 11:00:00 1.23058 1.23109 1.22971 1.23076 1.23068 1.23119 1.22979 1.23087 10979.0
2019-09-27 12:00:00 1.23076 1.23308 1.23052 1.23232 1.23087 1.23314 1.23062 1.23241 16528.0
2019-09-27 13:00:00 1.23232 1.23247 1.23163 1.23217 1.23241 1.23256 1.23172 1.23228 14106.0
Output dataframe sample:
BidOpen BidHigh BidLow BidClose ... Close_T_period_45 Close_T_period_46 Close_T_period_47 Close_T_period_48
Date ...
2019-09-27 09:00:00 1.22841 1.22919 1.22768 1.22893 ... 0.682635 0.070796 0.128940 0.794521
2019-09-27 10:00:00 1.22893 1.23101 1.22861 1.23058 ... 0.506024 0.682635 0.070796 0.128940
2019-09-27 11:00:00 1.23058 1.23109 1.22971 1.23076 ... 0.774920 0.506024 0.682635 0.070796
2019-09-27 12:00:00 1.23076 1.23308 1.23052 1.23232 ... 0.212500 0.774920 0.506024 0.682635
2019-09-27 13:00:00 1.23232 1.23247 1.23163 1.23217 ... 0.378882 0.212500 0.774920 0.506024
Short Answer (faster implementation)
The following code is about 6x faster:
import numpy as np

def my_shift(x, i):
    first = np.array([np.nan] * i)
    return np.append(first, x[:-i])

result = (df2['BidClose'].values - df2['BidLow'].values) / (df2['BidHigh'].values - df2['BidLow'].values)

for i in range(1, 49, 1):
    df2.loc[:, 'Close_T_period_' + str(i)] = my_shift(result, i)
Long Answer (explanation)
The two main bottlenecks in your code are:
1. In every iteration you recalculate the same values; the only difference is that each time they are shifted differently.
2. The pandas shift operation is relatively slow for this purpose.
My code simply addresses these two issues. Basically, I calculate the result just once and use the loop only for shifting (issue #1), and I implement my own shift function that prepends i NaN values to the array and drops its last i values (issue #2).
Execution time
With a dataframe with 5000 rows the time benchmark give:
42 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
with my solution I obtained:
7.62 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
UPDATE
I tried to implement a solution with apply:
result = ((df2['BidClose'].values - df2['BidLow'].values)/(df2['BidHigh'].values - df2['BidLow'].values))
df3 = df.reindex(df2.columns.tolist() +[f'Close_T_period_{i}' for i in range(1, 2000)], axis=1)
df3.iloc[:, 9:] = df3.iloc[:, 9:].apply(lambda row: my_shift(result, int(row.name.split('_')[-1])))
In my tests this solution seems slightly slower than the first one.
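A further sketch of my own (not part of the answer above): the ratio can be computed once and all 48 lag columns assembled in a single pd.concat, which keeps pandas' shift but avoids inserting columns into the frame one at a time.
import pandas as pd

# compute the close-within-range ratio once, then build every lag column in one go
ratio = (df['BidClose'] - df['BidLow']) / (df['BidHigh'] - df['BidLow'])
lags = pd.concat({f'Close_T_period_{i}': ratio.shift(i) for i in range(1, 49)}, axis=1)
df = pd.concat([df, lags], axis=1)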
I have a DF like this:
id company duration
0 Other Company 5
0 Other Company 19
0 X Company 7
1 Other Company 24
1 Other Company 6
1 X Company 12
2 X Company 9
3 Other Company 30
3 X Company 16
I need to group the DF by id and company and then sum the duration within each group. In the end I need only the rows for 'X Company'. This is what I did:
import pandas as pd
jobs = pd.read_csv("data/jobs.csv")
time_in_company = jobs.groupby(['id','company'])['duration'].agg(sum)
And got this:
id company duration
0 Other Company 24
0 X Company 7
1 Other Company 30
1 X Company 12
2 X Company 9
3 Other Company 30
3 X Company 16
Now I need to remove all entries for 'Other Company'. I already tried time_in_company.drop('Any Company'), which returns KeyError: 'Any Company'.
I tried .set_index('company') in order to try something else, but it tells me 'Series' object has no attribute 'set_index'.
I tried to use .filter() on the groupby, but I need the .agg(sum) (and it didn't work anyway).
Can someone shed some light on this for me? Thanks in advance.
Does this help?
time_in_company= time_in_company.reset_index(level='company')
time_in_company[time_in_company['company'] != "Other Company"]
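Alternatively (a sketch of mine, not part of the answer above): since the grouped result has a MultiIndex of (id, company), you can pick out the 'X Company' level directly with .xs():
# keep only the 'X Company' entries of the grouped Series, indexed by id
time_in_company.xs('X Company', level='company')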
First use DataFrame.query() to remove the 'X Company' rows, then group the remaining df, like:
import numpy as np
import pandas as pd
ids = [0,0,0,1,1,1,2,3,3]
company = ['Other Company','Other Company','X Company','Other Company','Other Company','X Company','X Company','Other Company','X Company']
duration = [5,19,7,24,6,12,9,30,16]
df = pd.DataFrame({'ids':ids,'company':company,'duration':duration})
df.query("company=='Other Company'").groupby(['ids','company'])['duration'].agg(sum)
You get:
ids company
0 Other Company 24
1 Other Company 30
3 Other Company 30
Name: duration, dtype: int64
EDIT: Additionally, you can use a combination of DataFrame.where(), dropna() and pivot_table():
df.where(df['company']=='Other Company').dropna().pivot_table(['duration'],index=['ids','company'],aggfunc='sum')
You get:
duration
ids company
0.0 Other Company 24.0
1.0 Other Company 30.0
3.0 Other Company 30.0
Nonetheless, the first one is faster:
2.03 ms ± 62.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.87 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)