I am working with a dataset of around 400k rows of preprocessed strings.
[In]:
raw preprocessed
helpersstreet 46, second floor helpersstreet 46
489 john doe route john doe route
at main street 49 main street
All strings in the 'preprocessed' column are the same length as or shorter than the corresponding value in 'raw'. Is there a fast way to compare these strings and return all the differences in a new column:
[Out]:
raw preprocessed difference
helpersstreet 46, second floor helpersstreet 46 ,second floor
489 john doe route john doe route 489
at main street 49 main street at 49
I am not really sure how to do this, but I am also wondering whether this is the right way to go. I have access to the functions that perform the preprocessing, so is it faster to modify them to return these values, or is there a scalable way to create the differences later? I would prefer the latter.
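For reference, if you did decide to modify the preprocessing step instead, a minimal sketch of what that could look like is below; the cleaning rule here is a made-up placeholder, not the actual preprocessing logic:
import re
import pandas as pd

def preprocess(raw):
    # hypothetical rule: drop everything after the first comma
    cleaned = re.split(r',', raw, maxsplit=1)[0].strip()
    # report whatever was stripped out alongside the cleaned value
    removed = raw.replace(cleaned, '', 1).strip(' ,')
    return cleaned, removed

df = pd.DataFrame({'raw': ['helpersstreet 46, second floor']})
df[['preprocessed', 'difference']] = df['raw'].apply(lambda s: pd.Series(preprocess(s)))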
Option 1
Seems like an iterative replacement is in order. You can do this best using a list comprehension:
df['difference'] = [i.replace(j, '') for i, j in zip(df.raw, df.preprocessed)]
df
raw preprocessed difference
0 helpersstreet 46, second floor helpersstreet 46 , second floor
1 489 john doe route john doe route 489
2 at main street 49 main street at 49
Given the limitations of this problem (the difficulty involved with vectorizing the replacement operation), I'd say this is your fastest option.
Option 2
Alternatively, np.vectorize a lambda,
f = np.vectorize(lambda i, j: i.replace(j, ''))
df['difference'] = f(df.raw, df.preprocessed)
df
raw preprocessed difference
0 helpersstreet 46, second floor helpersstreet 46 , second floor
1 489 john doe route john doe route 489
2 at main street 49 main street at 49
Note that this only hides the loop; it is just as fast (or slow) as Option 1, if not worse.
Option 3
Using apply, which I don't recommend:
df['difference'] = df.apply(lambda x: x.raw.replace(x.preprocessed, ''), 1)
df
raw preprocessed difference
0 helpersstreet 46, second floor helpersstreet 46 , second floor
1 489 john doe route john doe route 489
2 at main street 49 main street at 49
This also hides the loop, but does so at the cost of more overhead than Option 2.
Timings
On request of my friend, Mr. jezrael:
df = pd.concat([df] * 10000, ignore_index=True) # setup
# Option 1
%timeit df['difference'] = [i.replace(j, '') for i, j in zip(df.raw, df.preprocessed)]
186 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Option 2
%timeit df['difference'] = f(df.raw, df.preprocessed)
326 ms ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Option 3
%timeit df['difference'] = df.apply(lambda x: x.raw.replace(x.preprocessed, ''), 1)
20.8 s ± 237 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Related
I have a dataframe that has similar ids with spatiotemporal data like below:
car_id lat long
xxx 32 150
xxx 33 160
yyy 20 140
yyy 22 140
zzz 33 70
zzz 33 80
. . .
I want to replace car_id with car_1, car_2, car_3, ... However, my dataframe is large and it's not possible to do it manually by name, so first I made a list of all unique values in the car_id column and a list of names to replace them with:
u_values = [i for i in df['car_id'].unique()]
r = ['car'+str(i) for i in range(len(u_values))]
Now I'm not sure how to replace all unique numbers in car_id column with list values so the result is like this:
car_id lat long
car_1 32 150
car_1 33 160
car_2 20 140
car_2 22 140
car_3 33 70
car_3 33 80
. . .
The answers so far seem a little complicated to me, so here's another suggestion. This creates a dictionary that has the old names as keys and the new names as values, which can then be used to map the old values to the new ones.
r = {k: 'car_{}'.format(i) for i, k in enumerate(df['car_id'].unique())}
df['car_id'] = df['car_id'].map(r)
Edit: the answer using factorize is probably better, even though I think this is a bit easier to read.
Create a mapping from u_values to r and map it to car_id column. Also simplify the definition of u_values and r by using tolist() method and f-strings, respectively.
u_values = df['car_id'].unique().tolist()
r = [f'car_{i}' for i in range(len(u_values))]
mapping = pd.Series(r, index=u_values)
df['car_id'] = df['car_id'].map(mapping)
That said, vectorized string concatenation seems to be enough for this task. The factorize() method encodes the strings as integer codes.
df['car_id'] = 'car_' + pd.Series(df['car_id'].factorize()[0], dtype='string')
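For reference, pd.factorize returns a tuple of integer codes plus the unique values, which is what the line above relies on; a tiny illustration on made-up ids:
codes, uniques = pd.factorize(pd.Series(['xxx', 'xxx', 'yyy', 'zzz']))
# codes   -> array([0, 0, 1, 2])
# uniques -> Index(['xxx', 'yyy', 'zzz'], dtype='object')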
When I timed some of these methods (I omitted Juan Manuel Rivera's solution because replace is very slow and the code takes forever on larger data), the map() implementation built on the OP's code turned out to be the fastest.
The factorize() implementation, while concise, is not fast after all. I also agree with pasnik that their solution is the easiest to read.
# a dataframe with 500k rows and 100k unique car_ids
df = pd.DataFrame({'car_id': np.random.default_rng().choice(100000, size=500000)})
%timeit u_values = df['car_id'].unique().tolist(); r = [f'car_{i}' for i in range(len(u_values))]; mapping = pd.Series(r, index=u_values); df.assign(car_id=df['car_id'].map(mapping))
# 136 ms ± 2.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.assign(car_id = 'car_' + pd.Series(df['car_id'].factorize()[0], dtype='string'))
# 602 ms ± 19.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit r={k:'car_{}'.format(i) for i,k in enumerate(df['car_id'].unique())}; df.assign(car_id=df['car_id'].map(r))
# 196 ms ± 3.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
It may be easier if you use a dictionary to maintain the relation between each unique value (xxx, yyy, ...) and the new id you want (1, 2, 3, ...):
newIdDict = {}
idCounter = 1
for i in df['car_id'].unique():
    if i not in newIdDict:
        newIdDict[i] = 'car_' + str(idCounter)
        idCounter += 1
Then, you can use Pandas replace function to change the values in car_id column:
df['car_id'].replace(newIdDict, inplace=True)
Take into account that if you apply replace to the whole dataframe (df.replace(newIdDict)) it will change ALL the xxx, yyy values it finds, so any that also appear in the lat or long columns would be modified too; restricting replace to the car_id column, as above, avoids that.
I have a dataframe like this
df = pd.DataFrame({'id': [205, 205, 205, 211, 211, 211],
                   'date': pd.to_datetime(['2019-12-01', '2020-01-01', '2020-02-01',
                                           '2019-12-01', '2020-01-01', '2020-03-01'])})
df
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
3 211 2019-12-01
4 211 2020-01-01
5 211 2020-03-01
where the date column consists of consecutive months for id 205 but not for id 211.
I want to keep only the observations (ids) for which I have monthly data without jumps. In this example I want:
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
Here I am collecting the id to keep:
keep_id = []
for num in pd.unique(df['id']):
    temp = (df.loc[df['id']==num,'date'].dt.year - df.loc[df['id']==num,'date'].shift(1).dt.year) * 12 + df.loc[df['id']==num,'date'].dt.month - df.loc[df['id']==num,'date'].shift(1).dt.month
    temp.values[0] = 1.0  # here I correct the first entry
    if (temp==1.).all():
        keep_id.append(num)
where the expression (year difference) * 12 + (month difference) computes the difference in months from the previous date for every id.
This seems to work when tested on a small portion of df, but I'm sure there is a better way of doing this, maybe using the .groupby() method.
Since df is made of millions of observations, my code takes too much time (and I'd like to learn a more efficient and pythonic way of doing this).
What you want to do is use groupby-filter rather than a groupby apply.
df.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())
provides exactly:
id date
0 205 2019-12-01
1 205 2020-01-01
2 205 2020-02-01
And indeed, I would keep the index unique; there are too many useful characteristics to retain.
Both this response and Michael's above are correct in terms of output. In terms of performance, they are very similar as well:
%timeit df.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())
1.48 ms ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
and
%timeit df[df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))]
1.7 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For most operations, this difference is negligible.
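If you want to check calendar months exactly rather than relying on a day-count threshold, the month arithmetic from the question also vectorizes with a groupby transform; a sketch along those lines (my variation, not part of either answer):
# year*12 + month gives a running month number, as in the question's formula
mask = df.groupby('id')['date'].transform(
    lambda s: (s.dt.year * 12 + s.dt.month).diff().dropna().eq(1).all()
)
df[mask]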
You can use the following approach. It is only about 3x faster in my tests.
df[df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))]
Out:
date
id
205 2019-12-01
205 2020-01-01
205 2020-02-01
my dataframe is something like this
93 40 73 41 115 74 59 98 76 109 43 44
105 119 56 62 69 51 50 104 91 78 77 75
119 61 106 105 102 75 43 51 60 114 91 83
It has 8000 rows and 12 columns
I want to find the least frequent value in this whole dataframe (not just per column).
I tried converting the dataframe into a numpy array and using a for loop to count the numbers and then return the least frequent one, but it is not very efficient. I searched for other methods but could not find any.
I only found scipy.stats.mode, which returns the most frequent number.
Is there any other way to do it?
You could stack and take the value_counts:
df.stack().value_counts().index[-1]
# 69
value_counts orders by frequency, so you can just take the last entry, though in this example many values appear just once; 69 happens to be the last.
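If several values are tied for the lowest count and you want all of them rather than whichever happens to sort last, a small extension of the same idea:
vc = df.stack().value_counts()
least_frequent = vc[vc == vc.min()].index.tolist()  # every value occurring the minimum number of times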
Another way using pandas.DataFrame.apply with pandas.Series.value_counts:
df.apply(pd.Series.value_counts).sum(1).idxmin()
# 40
# There are many values with same frequencies.
To my surprise, the apply method seems to be the fastest among the methods I've tried (which is why I'm posting):
df2 = pd.DataFrame(np.random.randint(1, 1000, (500000, 100)))
%timeit df2.apply(pd.Series.value_counts).sum(1).idxmin()
# 2.36 s ± 193 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2.stack().value_counts().index[-1]
# 3.02 s ± 86.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
uniq, cnt = np.unique(df2, return_counts=True)
uniq[np.argmin(cnt)]
# 2.77 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Contrary to my understanding of apply being very slow, it even outperformed numpy.unique (perhaps my coding is wrong, though).
My data frame contains name, age, Task1, Task2, Task3.
Now I need to get all the rows that contain a given string in any of the Task1, Task2, or Task3 columns. Say I want to check for the keyword 'Drafting': if 'Drafting' is present as part of any of these column values, then that entire row has to be added to the resultant frame.
I tried isin(), but it only gives me True or False. I need to extract the N rows that contain a particular keyword.
I tried,
df.columns[df.Task1.str.contains("Drafting")], but this compares and gives only a single column.
Does anyone know how to use str.contains, or any other method, to compare the string values of several columns and get all rows that satisfy the condition?
Name Age Task1 Task2 Task3
0 Ann 43 Drafting a Letter sending paking
1 Juh 29 sending paking Letter Drafting
2 Jeo 42 Pasting sending paking
3 Sam 59 sending pasting Letter Drafting
I need to check if the keyword 'Drafting' is present in any of the columns (each column contains 3 to 4 words, and I need to check whether 'Drafting' appears in those words); the result should be:
Name Age Task1 Task2 Task3
0 Ann 43 Drafting a Letter sending paking
1 Juh 29 sending paking Letter Drafting
3 Sam 59 sending pasting Letter Drafting
Or just (note this will check the entire df, not specific columns):
df[df.astype(str).apply(lambda x: x.str.contains('Drafting')).any(axis=1)]
#for case insensitive use below
#df[df.astype(str).apply(lambda x: x.str.contains('Drafting',case=False)).any(axis=1)]
Name Age Task1 Task2 Task3
0 Ann 43 Drafting a Letter sending paking
1 Juh 29 sending paking Letter Drafting
3 Sam 59 sending pasting Letter Drafting
A quick comparison of the given answers on 20k rows of data:
#Alollz (in comments)
%timeit df.loc[df.filter(like='Task').applymap(lambda x: 'Drafting' in x).any(1)]
25.2 ms ± 2.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#Sergey Bushmanov
%timeit df[df.Task1.str.contains("Drafting") | df.Task2.str.contains("Drafting") | df.Task3.str.contains("Drafting")]
58.7 ms ± 9.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#anky_91
%timeit df[df.filter(like='Task').apply(lambda x: x.str.contains('Drafting')).any(axis=1)]
88.6 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df[df.astype(str).apply(lambda x: x.str.contains('Drafting')).any(axis=1)]
128 ms ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#ALollz
%timeit df.loc[df.filter(like='Task').stack().str.split(expand=True).eq('Drafting').any(1).any(level=0)]
290 ms ± 29.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You may try:
new_df = df[df.Task1.str.contains("Drafting") | df.Task2.str.contains("Drafting") | df.Task3.str.contains("Drafting")]
This will return new_df with the rows containing "Drafting" in any of the Task1, Task2, or Task3 columns.
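One caveat (my addition, not from the original answer): if any of the Task columns can contain NaN, str.contains propagates the NaN and boolean indexing will fail; passing na=False guards against that:
mask = (df.Task1.str.contains("Drafting", na=False)
        | df.Task2.str.contains("Drafting", na=False)
        | df.Task3.str.contains("Drafting", na=False))
new_df = df[mask]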
This can be achieved using np.where:
df = pd.DataFrame({
    'Name': ['Ann', 'Juh', 'Jeo', 'Sam'],
    'Age': [43, 29, 42, 59],
    'Task1': ['Drafting a letter', 'Sending', 'Pasting', 'Sending'],
    'Task2': ['Sending', 'Paking', 'Sending', 'Pasting'],
    'Task3': ['Packing', 'Letter Drafting', 'Paking', 'Letter Drafting']
})
df_new = df.iloc[df.index[np.concatenate(
    np.where(df['Task1'].str.contains('Drafting')) +
    np.where(df['Task2'].str.contains('Drafting')) +
    np.where(df['Task3'].str.contains('Drafting'))).astype(int)
].values.tolist()]
print(df_new)
Name Age Task1 Task2 Task3
0 Ann 43 Drafting a letter Sending Packing
1 Juh 29 Sending Paking Letter Drafting
3 Sam 59 Sending Pasting Letter Drafting
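Note that because the positions from the three np.where calls are simply concatenated, a row where 'Drafting' appears in more than one Task column would be selected twice. A small guard (my addition, not part of the original answer) is to de-duplicate the positions first:
positions = np.unique(np.concatenate(
    np.where(df['Task1'].str.contains('Drafting')) +
    np.where(df['Task2'].str.contains('Drafting')) +
    np.where(df['Task3'].str.contains('Drafting'))))
df_new = df.iloc[positions]  # positional selection, one row per match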
You can try something like this,
new_df = df[(df['Task1'] == 'Drafting') | (df['Task2'] == 'Drafting') | (df['Task3'] == 'Drafting')]
This will select all the rows where the Task1, Task2, or Task3 column is exactly 'Drafting' (note that == is an exact match, not a substring check like str.contains).
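If an exact match really is what you want, the same check can be written more compactly across the three columns (a minor variation, not from the original answer):
task_cols = ['Task1', 'Task2', 'Task3']
new_df = df[df[task_cols].eq('Drafting').any(axis=1)]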
I am trying to use groupby, nlargest, and sum functions in Pandas together, but having trouble making it work.
State County Population
Alabama a 100
Alabama b 50
Alabama c 40
Alabama d 5
Alabama e 1
...
Wyoming a.51 180
Wyoming b.51 150
Wyoming c.51 56
Wyoming d.51 5
I want to use groupby to select by state, then get the top 2 counties by population. Then use only those top 2 county population numbers to get a sum for that state.
In the end, I'll have a list that will have the state and the population of its top 2 counties.
I can get the groupby and nlargest to work, but getting the sum of the nlargest(2) is a challenge.
The line I have right now is simply: df.groupby('State')['Population'].nlargest(2)
You can use apply after performing the groupby:
df.groupby('State')['Population'].apply(lambda grp: grp.nlargest(2).sum())
I think the issue you're having is that df.groupby('State')['Population'].nlargest(2) will return a Series with a MultiIndex (State plus the original row index), so you can no longer do group-level operations. In general, if you want to perform multiple operations within a group, you'll need to use apply/agg.
The resulting output:
State
Alabama 150
Wyoming 330
EDIT
A slightly cleaner approach, as suggested by #cs95:
df.groupby('State')['Population'].nlargest(2).sum(level=0)
This is slightly slower than using apply on larger DataFrames though.
Using the following setup:
import numpy as np
import pandas as pd
from string import ascii_letters
n = 10**6
df = pd.DataFrame({'A': np.random.choice(list(ascii_letters), size=n),
'B': np.random.randint(10**7, size=n)})
I get the following timings:
In [3]: %timeit df.groupby('A')['B'].apply(lambda grp: grp.nlargest(2).sum())
103 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit df.groupby('A')['B'].nlargest(2).sum(level=0)
147 ms ± 3.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The slower performance is potentially caused by the level kwarg in sum performing a second groupby under the hood.
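To make that concrete: the nlargest result is a Series with a (State, original index) MultiIndex, so sum(level=0) is effectively a second groupby on the State level. The same thing can be written explicitly, which also sidesteps the level keyword that newer pandas versions deprecate:
df.groupby('State')['Population'].nlargest(2).groupby(level=0).sum()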
Using agg, the grouping logic looks like:
df.groupby('State').agg({'Population': lambda x: x.nlargest(2).sum()})
This results in another dataframe object, which you could query to find the most populous states, etc.
Population
State
Alabama 150
Wyoming 330
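As a usage note, the aggregated frame can then be queried like any other DataFrame; for example, to pull out the states with the largest top-2 totals (a hypothetical follow-up):
result = df.groupby('State').agg({'Population': lambda x: x.nlargest(2).sum()})
result.sort_values('Population', ascending=False).head(5)  # top 5 states by their two biggest counties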