I am trying to use the groupby, nlargest, and sum functions in pandas together, but I'm having trouble making it work.
State County Population
Alabama a 100
Alabama b 50
Alabama c 40
Alabama d 5
Alabama e 1
...
Wyoming a.51 180
Wyoming b.51 150
Wyoming c.51 56
Wyoming d.51 5
I want to use groupby to select by state, then get the top 2 counties by population, and then sum only those top 2 county populations to get a total for that state.
In the end, I'll have a list with each state and the population of its top 2 counties.
I can get the groupby and nlargest to work, but getting the sum of the nlargest(2) is a challenge.
The line I have right now is simply: df.groupby('State')['Population'].nlargest(2)
You can use apply after performing the groupby:
df.groupby('State')['Population'].apply(lambda grp: grp.nlargest(2).sum())
I think the issue you're having is that df.groupby('State')['Population'].nlargest(2) returns a Series indexed by a (State, original row) MultiIndex, so you can no longer do group-level operations on it directly. In general, if you want to perform multiple operations within a group, you'll need to use apply/agg.
The resulting output:
State
Alabama 150
Wyoming 330
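To see concretely why the intermediate nlargest result can't simply be summed with a plain .sum(), here is a quick sketch (assuming the sample frame above with its default integer index): the intermediate object is a Series keyed by a (State, original row index) MultiIndex, so an unqualified sum collapses everything into one number.
top2 = df.groupby('State')['Population'].nlargest(2)
print(top2.index.names)  # ['State', None] -- a two-level MultiIndex
print(top2.sum())        # one grand total, not a per-state sum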
EDIT
A slightly cleaner approach, as suggested by @cs95:
df.groupby('State')['Population'].nlargest(2).sum(level=0)
This is slightly slower than using apply on larger DataFrames though.
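If you are on a newer pandas release where sum no longer accepts a level argument, the same idea can be spelled with a second groupby on the first index level (a sketch, not included in the timings below):
df.groupby('State')['Population'].nlargest(2).groupby(level=0).sum()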
Using the following setup:
import numpy as np
import pandas as pd
from string import ascii_letters
n = 10**6
df = pd.DataFrame({'A': np.random.choice(list(ascii_letters), size=n),
'B': np.random.randint(10**7, size=n)})
I get the following timings:
In [3]: %timeit df.groupby('A')['B'].apply(lambda grp: grp.nlargest(2).sum())
103 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit df.groupby('A')['B'].nlargest(2).sum(level=0)
147 ms ± 3.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The slower performance is potentially caused by the level kwarg in sum performing a second groupby under the hood.
Using agg, the grouping logic looks like:
df.groupby('State').agg({'Population': lambda x: x.nlargest(2).sum()})
This returns another DataFrame object, which you could query to find the most populous states, etc.
Population
State
Alabama 150
Wyoming 330
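On newer pandas, where the nested-dict form of agg is no longer accepted, named aggregation is a tidier spelling and gives a flat column you can query directly; a small sketch (the column name top2_pop is just an illustrative choice):
out = df.groupby('State').agg(top2_pop=('Population', lambda s: s.nlargest(2).sum()))
out.nlargest(2, 'top2_pop')  # e.g. the states with the largest top-2 totals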
Related
I have a dataframe that has similar ids with spatiotemporal data like below:
car_id lat long
xxx 32 150
xxx 33 160
yyy 20 140
yyy 22 140
zzz 33 70
zzz 33 80
. . .
I want to replace car_id with car_1, car_2, car_3, and so on. However, my dataframe is large, so it's not possible to do this manually by name. First I made a list of all unique values in the car_id column and a list of the names they should be replaced with:
u_values = [i for i in df['car_id'].unique()]
r = ['car'+str(i) for i in range(len(u_values))]
Now I'm not sure how to replace all unique numbers in car_id column with list values so the result is like this:
car_id lat long
car_1 32 150
car_1 33 160
car_2 20 140
car_2 22 140
car_3 33 70
car_3 33 80
. . .
The answers so far seem a little complicated to me, so here's another suggestion. This creates a dictionary with the old names as keys and the new names as values, which can then be used to map the old values to the new ones.
r={k:'car_{}'.format(i) for i,k in enumerate(df['car_id'].unique())}
df['car_id'] = df['car_id'].map(r)
Edit: the answer using factorize is probably better, even though I think this one is a bit easier to read.
Create a mapping from u_values to r and map it onto the car_id column. You can also simplify the definitions of u_values and r by using the tolist() method and f-strings, respectively.
u_values = df['car_id'].unique().tolist()
r = [f'car_{i}' for i in range(len(u_values))]
mapping = pd.Series(r, index=u_values)
df['car_id'] = df['car_id'].map(mapping)
That said, it seems vectorized string concatenation is enough for this task; the factorize() method encodes the strings as integer codes.
df['car_id'] = 'car_' + pd.Series(df['car_id'].factorize()[0], dtype='string')
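For reference, factorize() returns a pair of (integer codes, unique values), and the codes start at 0, so the labels come out as car_0, car_1, .... If you want them to start at car_1 as in the desired output, one option is to offset the codes; a small sketch:
codes, uniques = df['car_id'].factorize()
df['car_id'] = 'car_' + pd.Series(codes + 1, index=df.index).astype('string')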
When I timed some of these methods (I omitted Juan Manuel Rivera's solution because replace is very slow and the code takes forever on larger data), the map() implementation built on the OP's code turned out to be the fastest.
The factorize() implementation, while concise, is not fast after all. Also, I agree with pasnik that their solution is the easiest to read.
# a dataframe with 500k rows and 100k unique car_ids
import numpy as np
import pandas as pd
df = pd.DataFrame({'car_id': np.random.default_rng().choice(100000, size=500000)})
%timeit u_values = df['car_id'].unique().tolist(); r = [f'car_{i}' for i in range(len(u_values))]; mapping = pd.Series(r, index=u_values); df.assign(car_id=df['car_id'].map(mapping))
# 136 ms ± 2.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.assign(car_id = 'car_' + pd.Series(df['car_id'].factorize()[0], dtype='string'))
# 602 ms ± 19.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit r={k:'car_{}'.format(i) for i,k in enumerate(df['car_id'].unique())}; df.assign(car_id=df['car_id'].map(r))
# 196 ms ± 3.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
It may be easier if you use a dictionary to maintain the relation between each unique value (xxx, yyy, ...) and the new id you want (car_1, car_2, ...):
newIdDict={}
idCounter=1
for i in df['car_id'].unique():
    if i not in newIdDict:
        newIdDict[i] = 'car_' + str(idCounter)
        idCounter += 1
Then you can use the pandas replace function to change the values in the car_id column:
df['car_id'].replace(newIdDict, inplace=True)
Note that replace here only touches the car_id column; if you instead called replace on the whole DataFrame, any matching values in the lat or long columns would also be modified.
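A minimal sketch of the two forms, to make that caveat concrete (newIdDict as built above):
df['car_id'] = df['car_id'].replace(newIdDict)  # touches only the car_id column
# df = df.replace(newIdDict)                    # would scan every column for the same keys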
I have a dataframe like this
df = pd.DataFrame({'id': [205, 205, 205, 211, 211, 211],
                   'date': pd.to_datetime(['2019-12-01', '2020-01-01', '2020-02-01',
                                           '2019-12-01', '2020-01-01', '2020-03-01'])})
df
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
3 211 2019-12-01
4 211 2020-01-01
5 211 2020-03-01
where the date column consists of consecutive months for id 205 but not for id 211.
I want to keep only the observations (ids) for which I have monthly data without gaps. In this example I want:
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
Here I am collecting the id to keep:
keep_id = []
for num in df['id'].unique():
    temp = ((df.loc[df['id'] == num, 'date'].dt.year - df.loc[df['id'] == num, 'date'].shift(1).dt.year) * 12
            + df.loc[df['id'] == num, 'date'].dt.month - df.loc[df['id'] == num, 'date'].shift(1).dt.month)
    temp.values[0] = 1.0  # correct the first entry, whose difference is NaN
    if (temp == 1.).all():
        keep_id.append(num)
where the (year difference) * 12 + (month difference) expression computes the gap in months from the previous date for every id.
This seems to work when tested on a small portion of df, but I'm sure there is a better way of doing this, maybe using the .groupby() method.
Since df contains millions of observations, my code takes too much time (and I'd like to learn a more efficient and Pythonic way of doing this).
What you want to do is use groupby-filter rather than a groupby apply.
df.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())
provides exactly:
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
And indeed, I would keep the index unique; it has too many useful properties to give up.
Both this response and Michael's are correct in terms of output. In terms of performance, they are very similar as well:
%timeit df.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())
1.48 ms ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
and
%timeit df[df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))]
1.7 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For most operations, this difference is negligible.
You can use the following approach, which was only ~3x faster than the original loop in my tests.
df[df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))]
Out:
date
id
205 2019-12-01
205 2020-01-01
205 2020-02-01
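Both approaches above treat "no gap larger than 32 days" as a proxy for consecutive calendar months. If you need an exact month-over-month check, one way (a sketch, assuming the frame from the question) is to compare absolute month numbers:
month_no = df['date'].dt.year * 12 + df['date'].dt.month   # absolute month number
gap = month_no.groupby(df['id']).diff()                    # NaN on each id's first row
ok = gap.isna() | gap.eq(1)                                # first row, or exactly one month later
result = df[ok.groupby(df['id']).transform('all')]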
Here's the thing: I have the dataset below, where date is the index:
date value
2020-01-01 100
2020-02-01 140
2020-03-01 156
2020-04-01 161
2020-05-01 170
.
.
.
And I want to transform it in this other dataset:
value_t0 value_t1 value_t2 value_t3 value_t4 ...
100 NaN NaN NaN NaN ...
140 100 NaN NaN NaN ...
156 140 100 NaN NaN ...
161 156 140 100 NaN ...
170 161 156 140 100 ...
First I thought about using pandas.pivot_table to do something, but that would just provide a different layout grouped by some column, which is not exactly what I want. Later, I thought about using pandasql and apply 'case when', but that wouldn't work because I would have to type dozens of lines of code. So I'm stuck here.
try this:
new_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})
The Series .shift(n) method gives you a single column of the desired output by shifting everything down and filling in NaNs at the top. So we build the new DataFrame by feeding it a dictionary of the form {column name: column data, ...}, using a dictionary comprehension to iterate over the original DataFrame.
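If you only need a fixed number of lags rather than one column per row (which gets very wide for long frames), the same comprehension with a smaller range works; n_lags below is just an illustrative choice:
n_lags = 5  # hypothetical number of lag columns to keep
lagged = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(n_lags)})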
I think the best approach is to use NumPy:
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0], 1)] = np.nan
new_df = pd.DataFrame(new_values).add_prefix('value_t')
Times for 5000 rows
%%timeit
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0],1)] = np.nan
new_df = pd.DataFrame(new_values).add_prefix('value_t')
556 ms ± 35.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
new_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})
1.31 s ± 36.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Time without add_prefix
%%timeit
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0],1)] = np.nan
new_df = pd.DataFrame(new_values)
357 ms ± 8.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I want to generate a summary table from a tidy pandas DataFrame. I currently use groupby and two for loops, which does not seem efficient. It seems stacking and unstacking would get me there, but I have failed.
Sample data
import pandas as pd
import numpy as np
import copy
import random
df_tidy = pd.DataFrame(columns = ['Stage', 'Exc', 'Cat', 'Score'])
for _ in range(10):
    df_tidy = df_tidy.append(
        {
            'Stage': random.choice(['OP', 'FUEL', 'EOL']),
            'Exc': str(np.random.randint(low=0, high=1000)),
            'Cat': random.choice(['CC', 'HT', 'PM']),
            'Score': np.random.random(),
        }, ignore_index=True
    )
df_tidy
returns
Stage Exc Cat Score
0 OP 929 HT 0.946234
1 OP 813 CC 0.829522
2 FUEL 114 PM 0.868605
3 OP 896 CC 0.382077
4 FUEL 10 CC 0.832246
5 FUEL 515 HT 0.632220
6 EOL 970 PM 0.532310
7 FUEL 198 CC 0.209856
8 FUEL 848 CC 0.479470
9 OP 968 HT 0.348093
I would like a new DataFrame with Stages as columns, Cats as rows and sum of Scores as values. I achieve it this way:
Working but probably inefficient approach
new_df = pd.DataFrame(columns=list(df_tidy['Stage'].unique()))
for cat, small_df in df_tidy.groupby('Cat'):
    for lcs, smaller_df in small_df.groupby('Stage'):
        new_df.loc[cat, lcs] = smaller_df['Score'].sum()
new_df['Total'] = new_df.sum(axis=1)
new_df
Which returns what I want:
OP FUEL EOL Total
CC 1.2116 1.52157 NaN 2.733170
HT 1.29433 0.63222 NaN 1.926548
PM NaN 0.868605 0.53231 1.400915
But I cannot believe this is the simplest or most efficient path.
Question
What pandas magic am I missing out on?
Update - Timing the proposed solutions
To understand the differences between the pivot_table and crosstab solutions proposed below, I timed the three approaches on a 100,000-row dataframe built exactly as above:
groupby solution, which I thought was inefficient:
%%timeit
new_df = pd.DataFrame(columns=list(df_tidy['Stage'].unique()))
for cat, small_df in df_tidy.groupby('Cat'):
    for lcs, smaller_df in small_df.groupby('Stage'):
        new_df.loc[cat, lcs] = smaller_df['Score'].sum()
new_df['Total'] = new_df.sum(axis=1)
41.2 ms ± 3.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
crosstab solution, which builds a DataFrame in the background even though the passed data is already a DataFrame:
%%timeit
pd.crosstab(index=df_tidy.Cat,columns=df_tidy.Stage, values=df_tidy.Score, aggfunc='sum', margins = True, margins_name = 'Total').iloc[:-1,:]
67.8 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
pivot_table solution:
%%timeit
pd.pivot_table(df_tidy, index=['Cat'], columns=["Stage"], margins=True, margins_name='Total', aggfunc=np.sum).iloc[:-1,:]
713 ms ± 20.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, it would appear that the clunky groupby solution is the quickest.
A simple solution using crosstab:
pd.crosstab(index=df.Cat,columns=df.Stage,values=df.Score,aggfunc='sum', margins = True, margins_name = 'Total').iloc[:-1,:]
Out[342]:
Stage EOL FUEL OP Total
Cat
CC NaN 1.521572 1.211599 2.733171
HT NaN 0.632220 1.294327 1.926547
PM 0.53231 0.868605 NaN 1.400915
I was wondering whether a simpler solution than pd.crosstab might be pd.pivot_table:
pd.pivot_table(df_tidy, index=['Cat'], columns=["Stage"], margins=True, margins_name='Total', aggfunc=np.sum).iloc[:-1,:]
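For completeness, the stack/unstack route the question alludes to can also be written as a groupby followed by unstack; a minimal sketch using the column names above (not included in the timings):
out = df_tidy.groupby(['Cat', 'Stage'])['Score'].sum().unstack('Stage')
out['Total'] = out.sum(axis=1)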
I am working with a dataset of around 400k rows of preprocessed strings.
[In]:
raw preprocessed
helpersstreet 46, second floor helpersstreet 46
489 john doe route john doe route
at main street 49 main street
All strings in the 'preprocessed' column are either the same length as or shorter than those in the 'raw' column. Is there a fast way to compare these strings and return all differences in a new column:
[Out]:
raw preprocessed difference
helpersstreet 46, second floor helpersstreet 46 ,second floor
489 john doe route john doe route 489
at main street 49 main street at 49
I am not really sure how to do this, but I am also wondering whether this is the way to go. I have access to the functions that perform the preprocessing, so is it faster to modify them to return these values, or is there a scalable way to create the differences later? I would prefer the latter.
Option 1
Seems like an iterative replacement is in order. You can do this best using a list comprehension:
df['difference'] = [i.replace(j, '') for i, j in zip(df.raw, df.preprocessed)]
df
raw preprocessed difference
0 helpersstreet 46, second floor helpersstreet 46 , second floor
1 489 john doe route john doe route 489
2 at main street 49 main street at 49
Given the limitations of this problem (the difficulty involved with vectorizing the replacement operation), I'd say this is your fastest option.
Option 2
Alternatively, np.vectorize a lambda,
f = np.vectorize(lambda i, j: i.replace(j, ''))
df['difference'] = f(df.raw, df.preprocessed)
df
raw preprocessed difference
0 helpersstreet 46, second floor helpersstreet 46 , second floor
1 489 john doe route john doe route 489
2 at main street 49 main street at 49
Note that this only hides the loop; it is just as fast/slow as Option 1, if not worse.
Option 3
Using apply, which I don't recommend:
df['difference'] = df.apply(lambda x: x.raw.replace(x.preprocessed, ''), 1)
df
raw preprocessed difference
0 helpersstreet 46, second floor helpersstreet 46 , second floor
1 489 john doe route john doe route 489
2 at main street 49 main street at 49
This also hides the loop, but does so at the cost of more overhead than Option 2.
Timings
On request of my friend, Mr. jezrael:
df = pd.concat([df] * 10000, ignore_index=True) # setup
# Option 1
%timeit df['difference'] = [i.replace(j, '') for i, j in zip(df.raw, df.preprocessed)]
186 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Option 2
%timeit df['difference'] = f(df.raw, df.preprocessed)
326 ms ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Option 3
%timeit df['difference'] = df.apply(lambda x: x.raw.replace(x.preprocessed, ''), 1)
20.8 s ± 237 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
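One caveat worth checking on real data, regardless of the option you pick: Python's str.replace removes every occurrence of the preprocessed string, not just the first. If the preprocessed text can occur more than once in raw, you can limit the replacement count and tidy the whitespace; a sketch:
df['difference'] = [i.replace(j, '', 1).strip() for i, j in zip(df.raw, df.preprocessed)]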