Replace column values in a large dataframe

Replace column values in a large dataframe - python

I have a dataframe that has similar ids with spatiotemporal data like below:
car_id lat long
xxx 32 150
xxx 33 160
yyy 20 140
yyy 22 140
zzz 33 70
zzz 33 80
. . .
I want to replace car_id with car_1, car_2, car_3, ... However, my dataframe is large and it's not possible to do it manually by name so first I made a list of all unique values in the car_id column and made a list of names that should be replaced with:
u_values = [i for i in df['car_id'].unique()]
r = ['car'+str(i) for i in range(len(u_values))]
Now I'm not sure how to replace all unique numbers in car_id column with list values so the result is like this:
car_id lat long
car_1 32 150
car_1 33 160
car_2 20 140
car_2 22 140
car_3 33 70
car_3 33 80
. . .

The answers so far seem a little complicated to me, so here's another suggestion. This creates a dictionary that has the old name as the keys and the new name as the values. That can be used to map the old values to new values.
r={k:'car_{}'.format(i) for i,k in enumerate(df['car_id'].unique())}
df['car_id'] = df['car_id'].map(r)
edit: the answer using factorize is probably better even though I think this is a bit easier to read

Create a mapping from u_values to r and map it to car_id column. Also simplify the definition of u_values and r by using tolist() method and f-strings, respectively.
u_values = df['car_id'].unique().tolist()
r = [f'car_{i}' for i in range(len(u_values))]
mapping = pd.Series(r, index=u_values)
df['car_id'] = df['car_id'].map(mapping)
That said, it seems vectorized string concatenation is enough for this task. factorize() method encodes the strings.
df['car_id'] = 'car_' + pd.Series(df['car_id'].factorize()[0], dtype='string')
When I timed some these methods (I omitted Juan Manuel Rivera's solution because replace is very slow and the code takes forever on larger data), the map() implementation that built on OP's code turned out to be the fastest.
The factorize() implementation, while concise, is not fast after all. Also I agree with pasnik that their solution is the easiest to read.
# a dataframe with 500k rows and 100k unique car_ids
df = pd.DataFrame({'car_id': np.random.default_rng().choice(100000, size=500000)})
%timeit u_values = df['car_id'].unique().tolist(); r = [f'car_{i}' for i in range(len(u_values))]; mapping = pd.Series(r, index=u_values); df.assign(car_id=df['car_id'].map(mapping))
# 136 ms ± 2.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.assign(car_id = 'car_' + pd.Series(df['car_id'].factorize()[0], dtype='string'))
# 602 ms ± 19.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit r={k:'car_{}'.format(i) for i,k in enumerate(df['car_id'].unique())}; df.assign(car_id=df['car_id'].map(r))
# 196 ms ± 3.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

It may be easier if you use a dictionary to maintain the relation between each unique value (xxxx,yyyy...) and the new id you want (1, 2, 3...)
newIdDict={}
idCounter=1
for i in df['Car id'].unique():
if i not in newIdDict:
newIdDict[i] = 'car_'+str(idCounter)
idCounter += 1
Then, you can use Pandas replace function to change the values in car_id column:
df['Car id'].replace(newIdDict, inplace=True)
Take into account that this will change ALL the xxxx, yyyy in your dataframe, so if you have any xxxx in lat or long columns it will also be modified

Related

pandas operation by group

I have a dataframe like this
df = pd.DataFrame({'id': [205,205,205, 211, 211, 211]
, 'date': pd.to_datetime(['2019-12-01','2020-01-01', '2020-02-01'
,'2019-12-01' ,'2020-01-01', '2020-03-01'])})
df
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
3 211 2019-12-01
4 211 2020-01-01
5 211 2020-03-01
where the column date is made by consecutive months for id 205 but not for id 211.
I want to keep only the observations (id) for which I have monthly data without jumps. In this example I want:
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
Here I am collecting the id to keep:
keep_id = []
for num in pd.unique(df.index):
temp = (df.loc[df['id']==num,'date'].dt.year - df.loc[df['id']==num,'date'].shift(1).dt.year) * 12 + df.loc[df['id']==num,'date'].dt.month - df.loc[df['id']==num,'date'].shift(1).dt.month
temp.values[0] = 1.0 # here I correct the first entry
if (temp==1.).all():
keep_id.append(num)
where I am using (df.loc[num,'date'].dt.year - df.loc[num,'date'].shift(1).dt.year) * 12 + df.loc[num,'date'].dt.month - df.loc[num,'date'].shift(1).dt.month to compute the difference in months from the previous date for every id.
This seems to work when tested on a small portion of df, but I'm sure there is a better way of doing this, maybe using the .groupby() method.
Since df is made of millions of observations my code takes too much time (and I'd like to learn a more efficient and pythonic way of doing this)

What you want to do is use groupby-filter rather than a groupby apply.
df.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())
provides exactly:
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
And indeed, I would keep the index unique, there are too many useful characteristics to retain.
Both this response and Michael's above are correct in terms of output. In terms of performance, they are very similar as well:
%timeit df.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())
1.48 ms ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
and
%timeit df[df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))]
1.7 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For most operations, this difference is negligible.

You can use the following approach. Only ~3x faster in my tests.
df[df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))]
Out:
date
id
205 2019-12-01
205 2020-01-01
205 2020-02-01

What's the most efficient way to convert a time-series data into a cross-sectional one?

Here's the thing, I have the dataset below where date is the index:
date value
2020-01-01 100
2020-02-01 140
2020-03-01 156
2020-04-01 161
2020-05-01 170
.
.
.
And I want to transform it in this other dataset:
value_t0 value_t1 value_t2 value_t3 value_t4 ...
100 NaN NaN NaN NaN ...
140 100 NaN NaN NaN ...
156 140 100 NaN NaN ...
161 156 140 100 NaN ...
170 161 156 140 100 ...
First I thought about using pandas.pivot_table to do something, but that would just provide a different layout grouped by some column, which is not exactly what I want. Later, I thought about using pandasql and apply 'case when', but that wouldn't work because I would have to type dozens of lines of code. So I'm stuck here.

try this:
new_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})
The series .shift(n) method can get you a single column of your desired output by shifting everything down and filling in NaNs above. So we're building a new dataframe by feeding it a dictionary of the form {column name: column data, ...}, by using dictionary comprehension to iterate through your original dataframe.

I think the best is use numpy
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0], 1)] = np.nan
new_df = pd.DataFrame(new_values).add_prefix('value_t')
Times for 5000 rows
%%timeit
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0],1)] = np.nan
new_df = pd.DataFrame(new_values).add_prefix('value_t')
556 ms ± 35.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
new_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})
1.31 s ± 36.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Time without add_prefix
%%timeit
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0],1)] = np.nan
new_df = pd.DataFrame(new_values)
357 ms ± 8.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Find least frequent value in whole dataframe

my dataframe is something like this
> 93 40 73 41 115 74 59 98 76 109 43 44
105 119 56 62 69 51 50 104 91 78 77 75
119 61 106 105 102 75 43 51 60 114 91 83
It has 8000 rows and 12 columns
I wanted to find the least frequent value in this whole dataframe (not only in columns).
I tried converting this dataframe into numpy array and use for loop to count the numbers and then return the least count number but it it not very optimal. I searched if there are any other methods but could not find it.
I only found scipy.stats.mode which returns the most frequent number.
is there any other way to do it?

You could stack and take the value_counts:
df.stack().value_counts().index[-1]
# 69
value_counts orders by frequency, so you can just take the last, though in this example many appear just once. 69 happens to be the last.

Another way using pandas.DataFrame.apply with pandas.Series.value_counts:
df.apply(pd.Series.value_counts).sum(1).idxmin()
# 40
# There are many values with same frequencies.
To my surprise, apply method seems to be the fastest among the methods I've tried (reason why I'm posting):
df2 = pd.DataFrame(np.random.randint(1, 1000, (500000, 100)))
%timeit df2.apply(pd.Series.value_counts).sum(1).idxmin()
# 2.36 s ± 193 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2.stack().value_counts().index[-1]
# 3.02 s ± 86.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
uniq, cnt = np.unique(df2, return_counts=True)
uniq[np.argmin(cnt)]
# 2.77 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As opposed to my understanding of apply being very slow, it even outperformed numpy.unique (perhaps my coding is wrong tho ;().

Alternative to groupby for generating a summary table from tidy pandas DataFrame

I want to generate a summary table from a tidy pandas DataFrame. I now use groupby and two for loops, which does not seem efficient. Seems stacking and unstacking would get me there, but I have failed.
Sample data
import pandas as pd
import numpy as np
import copy
import random
df_tidy = pd.DataFrame(columns = ['Stage', 'Exc', 'Cat', 'Score'])
for _ in range(10):
df_tidy = df_tidy.append(
{
'Stage': random.choice(['OP', 'FUEL', 'EOL']),
'Exc': str(np.random.randint(low=0, high=1000)),
'Cat': random.choice(['CC', 'HT', 'PM']),
'Score': np.random.random(),
}, ignore_index=True
)
df_tidy
returns
Stage Exc Cat Score
0 OP 929 HT 0.946234
1 OP 813 CC 0.829522
2 FUEL 114 PM 0.868605
3 OP 896 CC 0.382077
4 FUEL 10 CC 0.832246
5 FUEL 515 HT 0.632220
6 EOL 970 PM 0.532310
7 FUEL 198 CC 0.209856
8 FUEL 848 CC 0.479470
9 OP 968 HT 0.348093
I would like a new DataFrame with Stages as columns, Cats as rows and sum of Scores as values. I achieve it this way:
Working but probably inefficient approach
new_df = pd.DataFrame(columns=list(df_tidy['Stage'].unique()))
for cat, small_df in df_tidy.groupby('Cat'):
for lcs, smaller_df in small_df.groupby('Stage'):
new_df.loc[cat, lcs] = smaller_df['Score'].sum()
new_df['Total'] = new_df.sum(axis=1)
new_df
Which returns what I want:
OP FUEL EOL Total
CC 1.2116 1.52157 NaN 2.733170
HT 1.29433 0.63222 NaN 1.926548
PM NaN 0.868605 0.53231 1.400915
But I cannot believe this is the simplest or most efficient path.
Question
What pandas magic am I missing out on?
Update - Timing the proposed solutions
To understand the differences between pivot_table and crosstab proposed below, I timed the three solutions with a 100,000 row dataframe built exactly as above:
groupby solution, that I thought was inefficient:
%%timeit
new_df = pd.DataFrame(columns=list(df_tidy['Stage'].unique()))
for cat, small_df in df_tidy.groupby('Cat'):
for lcs, smaller_df in small_df.groupby('Stage'):
new_df.loc[cat, lcs] = smaller_df['Score'].sum()
new_df['Total'] = new_df.sum(axis=1)
41.2 ms ± 3.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
crosstab solution, that requires a creation of a DataFrame in the background, even if the passed data is already in DataFrame format:
%%timeit
pd.crosstab(index=df_tidy.Cat,columns=df_tidy.Stage, values=df_tidy.Score, aggfunc='sum', margins = True, margins_name = 'Total').iloc[:-1,:]
67.8 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
pivot_table solution:
%%timeit
pd.pivot_table(df_tidy, index=['Cat'], columns=["Stage"], margins=True, margins_name='Total', aggfunc=np.sum).iloc[:-1,:]
713 ms ± 20.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, it would appear that the clunky groupbysolution is the quickest.

A simple solution from crosstab
pd.crosstab(index=df.Cat,columns=df.Stage,values=df.Score,aggfunc='sum', margins = True, margins_name = 'Total').iloc[:-1,:]
Out[342]:
Stage EOL FUEL OP Total
Cat
CC NaN 1.521572 1.211599 2.733171
HT NaN 0.632220 1.294327 1.926547
PM 0.53231 0.868605 NaN 1.400915

I was wondering if not a simpler solution than using pd.crosstab is to use pd.pivot:
pd.pivot_table(df_tidy, index=['Cat'], columns=["Stage"], margins=True, margins_name='Total', aggfunc=np.sum).iloc[:-1,:]

pandas find sum of top two in each group [duplicate]

I am trying to use groupby, nlargest, and sum functions in Pandas together, but having trouble making it work.
State County Population
Alabama a 100
Alabama b 50
Alabama c 40
Alabama d 5
Alabama e 1
...
Wyoming a.51 180
Wyoming b.51 150
Wyoming c.51 56
Wyoming d.51 5
I want to use groupby to select by state, then get the top 2 counties by population. Then use only those top 2 county population numbers to get a sum for that state.
In the end, I'll have a list that will have the state and the population (of it's top 2 counties).
I can get the groupby and nlargest to work, but getting the sum of the nlargest(2) is a challenge.
The line I have right now is simply: df.groupby('State')['Population'].nlargest(2)

You can use apply after performing the groupby:
df.groupby('State')['Population'].apply(lambda grp: grp.nlargest(2).sum())
I think this issue you're having is that df.groupby('State')['Population'].nlargest(2) will return a DataFrame, so you can no longer do group level operations. In general, if you want to perform multiple operations in a group, you'll need to use apply/agg.
The resulting output:
State
Alabama 150
Wyoming 330
EDIT
A slightly cleaner approach, as suggested by #cs95:
df.groupby('State')['Population'].nlargest(2).sum(level=0)
This is slightly slower than using apply on larger DataFrames though.
Using the following setup:
import numpy as np
import pandas as pd
from string import ascii_letters
n = 10**6
df = pd.DataFrame({'A': np.random.choice(list(ascii_letters), size=n),
'B': np.random.randint(10**7, size=n)})
I get the following timings:
In [3]: %timeit df.groupby('A')['B'].apply(lambda grp: grp.nlargest(2).sum())
103 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit df.groupby('A')['B'].nlargest(2).sum(level=0)
147 ms ± 3.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The slower performance is potentially caused by the level kwarg in sum performing a second groupby under the hood.

Using agg, the grouping logic looks like:
df.groupby('State').agg({'Population': {lambda x: x.nlargest(2).sum() }})
This results in another dataframe object; which you could query to find the most populous states, etc.
Population
State
Alabama 150
Wyoming 330

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Replace column values in a large dataframe - python

Related

pandas operation by group

What's the most efficient way to convert a time-series data into a cross-sectional one?

Find least frequent value in whole dataframe

Alternative to groupby for generating a summary table from tidy pandas DataFrame

pandas find sum of top two in each group [duplicate]

Categories

Resources