Suppose I have the following DataFrame:
df = pd.DataFrame({'city': ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd'],
'year': [2013, 2014, 2016, 2015, 2016, 2013, 2016, 2017, 2018],
'value': [10, 12, 16, 20, 21, 11, 15, 13, 16]})
And I want to find, for each city and year, what was the percentage change of value compared to the year before. My final dataframe would be:
city year value
a 2013 NaN
a 2014 0.20
a 2016 NaN
b 2015 NaN
b 2016 0.05
c 2013 NaN
d 2016 NaN
d 2017 -0.14
d 2018 0.23
I tried to use a group in city and then use apply but it didn't work:
df.groupby('city').apply(lambda x: x.sort_values('year')['value'].pct_change()).reset_index()
It didn't work because I couldn't get the year and also because this way I was considereing that I had all years for all cities, but that is not true.
EDIT: I'm not very concerned with efficiency, so any solution that solves the problem is valid for me.
Let's try lazy groupby(), use pct_change for the changes and diff to detect year jump:
groups = df.sort_values('year').groupby(['city'])
df['pct_chg'] = (groups['value'].pct_change()
.where(groups['year'].diff()==1)
)
Output:
city year value pct_chg
0 a 2013 10 NaN
1 a 2014 12 0.200000
2 a 2016 16 NaN
3 b 2015 20 NaN
4 b 2016 21 0.050000
5 c 2013 11 NaN
6 d 2016 15 NaN
7 d 2017 13 -0.133333
8 d 2018 16 0.230769
Although #Quang's answer is much more elegantly written and concise, I just add another approach using indexing.
sorted_df = df.sort_values(by=['city', 'year'])
sorted_df.loc[((sorted_df.year.diff() == 1) &
(sorted_df.city == sorted_df.city.shift(1))), 'pct_chg'] = sorted_df.value.pct_change()
my approach is faster as you can see below run on your df, but the syntax is not as pretty.
%timeit #mine
1.44 ms ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit ##Quang's
2.23 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Related
I have a pandas dataframe like there is longer gaps in time and I want to slice them into smaller dataframes where time "clusters" are together
Time Value
0 56610.41341 8.55
1 56587.56394 5.27
2 56590.62965 6.81
3 56598.63790 5.47
4 56606.52203 6.71
5 56980.44206 4.75
6 56592.53327 6.53
7 57335.52837 0.74
8 56942.59094 6.96
9 56921.63669 9.16
10 56599.52053 6.14
11 56605.50235 5.20
12 57343.63828 3.12
13 57337.51641 3.17
14 56593.60374 5.69
15 56882.61571 9.50
I tried sorting this and taking time difference of two consecutive points with
df = df.sort_values("Time")
df['t_dif'] = df['Time'] - df['Time'].shift(-1)
And it gives
Time Value t_dif
1 56587.56394 5.27 -3.06571
2 56590.62965 6.81 -1.90362
6 56592.53327 6.53 -1.07047
14 56593.60374 5.69 -5.03416
3 56598.63790 5.47 -0.88263
10 56599.52053 6.14 -5.98182
11 56605.50235 5.20 -1.01968
4 56606.52203 6.71 -3.89138
0 56610.41341 8.55 -272.20230
15 56882.61571 9.50 -39.02098
9 56921.63669 9.16 -20.95425
8 56942.59094 6.96 -37.85112
5 56980.44206 4.75 -355.08631
7 57335.52837 0.74 -1.98804
13 57337.51641 3.17 -6.12187
12 57343.63828 3.12 NaN
Lets say I want to slice this dataframe to smaller dataframes where time difference between two consecutive points is smaller than 40 how would I go by doing this?
I could loop the rows but this is frowned upon so is there a smarter solution?
Edit: Here is a example:
df1:
Time Value t_dif
1 56587.56394 5.27 -3.06571
2 56590.62965 6.81 -1.90362
6 56592.53327 6.53 -1.07047
14 56593.60374 5.69 -5.03416
3 56598.63790 5.47 -0.88263
10 56599.52053 6.14 -5.98182
11 56605.50235 5.20 -1.01968
4 56606.52203 6.71 -3.89138
df2:
0 56610.41341 8.55 -272.20230
df3:
15 56882.61571 9.50 -39.02098
9 56921.63669 9.16 -20.95425
8 56942.59094 6.96 -37.85112
...
etc.
I think you can just
df1 = df[df['t_dif']<30]
df2 = df[df['t_dif']>=30]
def split_dataframe(df, value):
df = df.sort_values("Time")
df = df.reset_index()
df['t_dif'] = (df['Time'] - df['Time'].shift(-1)).abs()
indxs = df.index[df['t_dif'] > value].tolist()
indxs.append(-1)
indxs.append(len(df))
indxs.sort()
frames = []
for i in range(1, len(indxs)):
val = df.iloc[indxs[i] + 1: indxs[i]]
frames.append(val)
return frames
Returns the correct dataframes as a list
I have a DF like this:
id company duration
0 Other Company 5
0 Other Company 19
0 X Company 7
1 Other Company 24
1 Other Company 6
1 X Company 12
2 X Company 9
3 Other Company 30
3 X Company 16
I need to group the DF by ID and Company and then sum the duration in each. In the end I need only the values with 'X Company'. This is what I did:
import pandas as pd
jobs = pd.read_csv("data/jobs.csv")
time_in_company = jobs.groupby(['id','company'])['duration'].agg(sum)
And got this:
id company duration
0 Other Company 24
0 X Company 7
1 Other Company 30
1 X Company 12
2 X Company 9
3 Other Company 30
3 X Company 16
Now I need remove all entrys from 'Other Company'. Already tried using time_in_company.drop('Any Company') #Return KeyError 'Any Company'
Tried to .set_index('company'), in order to try something else, but it tells me 'Series' object has no attribute 'set_index'
Tried to use a .filter() in the groupby but I need the .agg(sum). (And it didn't work anyway..
Can someone shed some light in the issue for me? Thanks in advance.
Does this help?
time_in_company= time_in_company.reset_index(level='company')
time_in_company [time_in_company ['company']!="Other Company"]
First use pd.query() to remove the 'X Company' rows, than groupby the remaining df like:
import numpy as np
import pandas as pd
ids = [0,0,0,1,1,1,2,3,3]
company = ['Other Company','Other Company','X Company','Other Company','Other Company','X Company','X Company','Other Company','X Company']
duration = [5,19,7,24,6,12,9,30,16]
df = pd.DataFrame({'ids':ids,'company':company,'duration':duration})
df.query("company=='Other Company'").groupby(['ids','company'])['duration'].agg(sum)
You get:
ids company
0 Other Company 24
1 Other Company 30
3 Other Company 30
Name: duration, dtype: int64
EDIT: Additionally you can use a combination of pd.where(), dropna()and pd.pivot_table() with:
df.where(df['company']=='Other Company').dropna().pivot_table(['duration'],index=['ids','company'],aggfunc='sum')
You get:
duration
ids company
0.0 Other Company 24.0
1.0 Other Company 30.0
3.0 Other Company 30.0
Nonetheless, the firs one is faster:
2.03 ms ± 62.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.87 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Hello! I have a dataframe with year (1910 ~ 2014), name, count (number of occurrence of each name) as columns. I want to create a new dataframe that shows the name with highest occurrence per year, and I'm not entirely sure about how to do this. Thanks!
Vectorized way
group = df.groupby('year')
df.loc[group['count'].agg('idxmax')]
Try this:
d = {'year': [1910, 1910, 1910,1920,1920,1920], 'name': ["Virginia", "Mary", "Elizabeth","Virginia", "Mary", "Elizabeth"], 'count': [848, 420, 747, 1048, 221, 147]}
df = pd.DataFrame(data=d)
df_results = pd.DataFrame(columns=df.columns)
years = pd.unique(df['year'])
for year in years:
tmp_df = df.loc[df['year'] == year]
tmp_df = tmp_df.sort_values(by='year')
df_results = df_results.append(tmp_df.iloc[0])
I suppose groupby & apply is good approach:
df = pd.DataFrame({
'Year': ['1910', '1910', '1911', '1911', '1911', '2014', '2014'],
'Name': ['Mary', 'Virginia', 'Elizabeth', 'Mary', 'Ann', 'Virginia', 'Elizabeth'],
'Count': [848, 270, 254, 360, 451, 81, 380]
})
df
Out:
Year Name Count
0 1910 Mary 848
1 1910 Virginia 270
2 1911 Elizabeth 254
3 1911 Mary 360
4 1911 Ann 451
5 2014 Virginia 81
6 2014 Elizabeth 380
df.groupby(['Year']).apply(lambda x: x.sort_values('Count', ascending=False).head(1))
Out:
Year Name Count
Year
1910 0 1910 Mary 848
1911 4 1911 Ann 451
2014 6 2014 Elizabeth 380
Also you can change head(1) by head(n) to get n most frequent names per year:
df.groupby(['Year']).apply(lambda x: x.sort_values('Count', ascending=False).head(2))
Out:
Year Name Count
Year
1910 0 1910 Mary 848
1 1910 Virginia 270
1911 4 1911 Ann 451
3 1911 Mary 360
2014 6 2014 Elizabeth 380
5 2014 Virginia 81
If you don't like new additional index, drop it via .reset_index(level=0, drop=True):
top_names = df.groupby(['Year']).apply(lambda x: x.sort_values('Count', ascending=False).head(1))
top_names.reset_index(level=0, drop=True)
Out:
Year Name Count
0 1910 Mary 848
4 1911 Ann 451
6 2014 Elizabeth 380
another way of doing this is sort the values of count and de-duplicate the Year column(faster too):
df.sort_values('Count', ascending=False).drop_duplicates(['Year'])
time results are below, you can try applying any method and see howmuch time each takes and apply accordingly:
%timeit df.sort_values('Count', ascending=False).drop_duplicates(['Year'])
result: 917 µs ± 13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.loc[df.groupby('Year')['Count'].agg('idxmax')]
result: 1.06 ms ± 5.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.loc[df.groupby('Year')['Count'].idxmax(), :]
result: 1.13 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I have two DataFrames df and evol as follows (simplified for the example):
In[6]: df
Out[6]:
data year_final year_init
0 12 2023 2012
1 34 2034 2015
2 9 2019 2013
...
In[7]: evol
Out[7]:
evolution
year
2000 1.474946
2001 1.473874
2002 1.079157
...
2037 1.463840
2038 1.980807
2039 1.726468
I would like to operate the following operation in a vectorized way (current for loop implementation is just too long when I have Gb of data):
for index, row in df.iterrows():
for year in range(row['year_init'], row['year_final']):
factor = evol.at[year, 'evolution']
df.at[index, 'data'] += df.at[index, 'data'] * factor
Complexity comes from the fact that the range of year is not the same on each row...
In the above example the ouput would be:
data year_final year_init
0 163673 2023 2012
1 594596046 2034 2015
2 1277 2019 2013
(full evol dataframe for testing purpose:)
evolution
year
2000 1.474946
2001 1.473874
2002 1.079157
2003 1.876762
2004 1.541348
2005 1.581923
2006 1.869508
2007 1.289033
2008 1.924791
2009 1.527834
2010 1.762448
2011 1.554491
2012 1.927348
2013 1.058588
2014 1.729124
2015 1.025824
2016 1.117728
2017 1.261009
2018 1.705705
2019 1.178354
2020 1.158688
2021 1.904780
2022 1.332230
2023 1.807508
2024 1.779713
2025 1.558423
2026 1.234135
2027 1.574954
2028 1.170016
2029 1.767164
2030 1.995633
2031 1.222417
2032 1.165851
2033 1.136498
2034 1.745103
2035 1.018893
2036 1.813705
2037 1.463840
2038 1.980807
2039 1.726468
One vectorization approach using only pandas is to do a cartesian join between the two frames and subset. Would start out like:
df['dummy'] = 1
evol['dummy'] = 1
combined = df.merge(evol, on='dummy')
# filter date ranges, multiply etc
This will likely be faster than what you are doing, but is memory inefficient and might blow up on your real data.
If you can take on the numba dependency, something like this should be very fast - essentially a compiled version of what you are doing now. Something similar would be possible in cython as well. Note that this requires that the evol dataframe is sorted and contigous by year, that could be relaxed with modification.
import numba
#numba.njit
def f(data, year_final, year_init, evol_year, evol_factor):
data = data.copy()
for i in range(len(data)):
year_pos = np.searchsorted(evol_year, year_init[i])
n_years = year_final[i] - year_init[i]
for offset in range(n_years):
data[i] += data[i] * evol_factor[year_pos + offset]
return data
f(df['data'].values, df['year_final'].values, df['year_init'].values, evol.index.values, evol['evolution'].values)
Out[24]: array([ 163673, 594596044, 1277], dtype=int64)
Edit:
Some timings with your test data
In [25]: %timeit f(df['data'].values, df['year_final'].values, df['year_init'].values, evol.index.values, evol['evolution'].values)
15.6 µs ± 338 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [26]: %%time
...: for index, row in df.iterrows():
...: for year in range(row['year_init'], row['year_final']):
...: factor = evol.at[year, 'evolution']
...: df.at[index, 'data'] += df.at[index, 'data'] * factor
Wall time: 3 ms
I have the following dataframe:
2012 2013 2014 2015 2016 2017 2018 Kategorie
0 5.31 5.27 5.61 4.34 4.54 5.02 7.07 Gewinn pro Aktie in EUR
1 13.39 14.70 12.45 16.29 15.67 14.17 10.08 KGV
2 -21.21 -0.75 6.45 -22.63 -7.75 9.76 47.52 Gewinnwachstum
3 -17.78 2.27 -0.55 3.39 1.48 0.34 NaN PEG
Now, I am selecting only the KGV row with:
df[df["Kategorie"] == "KGV"]
Which outputs:
2012 2013 2014 2015 2016 2017 2018 Kategorie
1 13.39 14.7 12.45 16.29 15.67 14.17 10.08 KGV
How do I calculate the mean() of the last five years (2016,15,14,13,12 in this example)?
I tried
df[df["Kategorie"] == "KGV"]["2016":"2012"].mean()
but this throws a TypeError. Why can I not slice the columns here?
loc supports that type of slicing (from left to right):
df.loc[df["Kategorie"] == "KGV", "2012":"2016"].mean(axis=1)
Out:
1 14.5
dtype: float64
Note that this does not necessarily mean 2012, 2013, 2014, 2015 and 2016. These are strings so it means all columns between df['2012'] and df['2016']. There could be a column named foo in between and it would be selected.
I used filter and iloc
row = df[df.Kategorie == 'KGV']
row.filter(regex='\d{4}').sort_index(1).iloc[:, -5:].mean(1)
1 13.732
dtype: float64
Not sure why the last five years are 2012-2016 (they seem to be the first five years). Notwithstanding, to find the mean for 2012-2016 for 'KGV', you can use
df[df['Kategorie'] == 'KGV'][[c for c in df.columns if c != 'Kategorie' and 2012 <= int(c) <= 2016]].mean(axis=1)