I have a pandas dataframe df from which I need to create a list Row_list.
import pandas as pd
df = pd.DataFrame([[1, 572548.283, 166424.411, -11.849, -11.512],
[2, 572558.153, 166442.134, -11.768, -11.983],
[3, 572124.999, 166423.478, -11.861, -11.512],
[4, 572534.264, 166414.417, -11.123, -11.993]],
columns=['PointNo','easting', 'northing', 't_20080729','t_20090808'])
I am able to create the list in the required format with the code below, but my dataframe has up to 8 million rows and the list creation is very slow.
def test_get_value_iterrows(df):
    Row_list = []
    for index, rows in df.iterrows():
        entirerow = df.values[index,].tolist()
        entirerow.append((df.iloc[index, 1], df.iloc[index, 2]))
        Row_list.append(entirerow)
    return Row_list
%timeit test_get_value_iterrows(df)
436 µs ± 6.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Not using df.iterrows() is a little bit faster (though it still relies on df.iloc):
def test_get_value(df):
    Row_list = []
    for i in df.index:
        entirerow = df.values[i,].tolist()
        entirerow.append((df.iloc[i, 1], df.iloc[i, 2]))
        Row_list.append(entirerow)
    return Row_list
%timeit test_get_value(df)
270 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I am wondering if there is a faster solution to this?
Use a list comprehension. For the timings below, the 4-row sample DataFrame is first repeated to 40,000 rows:
df = pd.concat([df] * 10000, ignore_index=True)
In [123]: %timeit [[*x, (x[1], x[2])] for x in df.values.tolist()]
27.8 ms ± 404 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [124]: %timeit [x + [(x[1], x[2])] for x in df.values.tolist()]
26.6 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [125]: %timeit (test_get_value(df))
41.2 s ± 1.97 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
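Applied to the original problem, the list can be built in one assignment (the same comprehension as above):
Row_list = [x + [(x[1], x[2])] for x in df.values.tolist()]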
I have made a script where I check if a column's values from dataframe A exist in a column of dataframe B. Here, dataframe A is named whole_data and dataframe B is named referrals:
users=set(whole_data['user_id'])
referees=set(referrals['referee_id'])
non_referees=set([x for x in users if x not in referees])
As you can see, I want a collection of users (named non_referees) containing the users that are not referees, which is why I check, for every user_id in whole_data, whether it exists in the set of referees.
Nonetheless, this is taking a massive amount of time: there are around 100K users and 4K referees. Is there a way to make this faster?
First, pandas can already give you the unique values of a series, which might be faster than building the set from the whole column.
Second, to build the set of non-referees, you can then use set operations:
non_referees = users - referees
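Putting both suggestions together (a minimal sketch using the column names from the question):
users = set(whole_data['user_id'].unique())
referees = set(referrals['referee_id'].unique())
non_referees = users - referees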
EDIT: As an additional note, if you build a set using the generator expression style, you don't need to build an intermediate list:
# slow because it first builds a list and then turns that into a set:
some_set = set([x for x in something])
# faster because it goes right into building the set:
some_other_set = set(x for x in something_else)
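A set comprehension is more direct still, skipping the explicit set() call (the names here are just placeholders):
yet_another_set = {x for x in something_else}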
You may want to consider:
non_referees = set(whole_data['user_id'].unique()).difference(
referrals['referee_id'].unique()
)
If you think there are few repeats in referrals['referee_id'], then you'll gain a smidgen of speed by avoiding .unique() for them.
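That variant looks like this; set.difference accepts any iterable, so the raw column can be passed directly:
non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'])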
Speed
Here are some experiments with a few closely related forms:
Case with lots of duplicated referee_id
import numpy as np
import pandas as pd

n = 100_000
whole_data = pd.DataFrame({
    'user_id': np.random.randint(0, n, n),
})
referrals = pd.DataFrame({
    'referee_id': np.random.randint(0, n, n),
})
Measurements:
%timeit non_referees = set(pd.unique(whole_data['user_id'])) - set(pd.unique(referrals['referee_id']))
# 23.4 ms ± 53.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()) - set(referrals['referee_id'].unique())
# 23.3 ms ± 36 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()).difference(set(referrals['referee_id'].unique()))
# 23.3 ms ± 73.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# *** fastest
%timeit non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'].unique())
# 21.4 ms ± 74.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'])
# 29.6 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Case with few duplicates in referee_id:
n = 100_000
whole_data = pd.DataFrame({
    'user_id': np.random.randint(0, n, n),
})
referrals = pd.DataFrame({
    'referee_id': np.random.randint(0, 100*n, n),
})
Measurements:
%timeit non_referees = set(pd.unique(whole_data['user_id'])) - set(pd.unique(referrals['referee_id']))
# 30.7 ms ± 61.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()) - set(referrals['referee_id'].unique())
# 30.7 ms ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()).difference(set(referrals['referee_id'].unique()))
# 30.6 ms ± 57.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'].unique())
# 23.7 ms ± 37.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# *** fastest
%timeit non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'])
# 20.9 ms ± 54 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Say I have mixed ts/other data:
ser = pd.Series(pd.date_range('2017/01/05', '2018/01/05'))
ser.loc[3] = 4
type(ser.loc[0])
> pandas._libs.tslibs.timestamps.Timestamp
I would like to filter for all timestamps. For instance, this gives me what I want:
ser.apply(lambda x: isinstance(x, pd.Timestamp))
0 True
1 True
2 True
3 False
4 True
...
But I assume it would be faster to use a vectorized solution and avoid apply. I thought I should be able to use where:
ser.where(isinstance(ser, pd.Timestamp))
But I get
ValueError: Array conditional must be same shape as self
Is there a way to do this? Also, am I correct in my assumption that it would be faster/more 'Pandasic'?
It depends on the length of the data; here, for a small Series (366 rows), the list comprehension is faster:
In [108]: %timeit (ser.apply(lambda x: isinstance(x, pd.Timestamp)))
434 µs ± 57.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [109]: %timeit ([isinstance(x, pd.Timestamp) for x in ser])
140 µs ± 5.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [110]: %timeit (pd.to_datetime(ser, errors='coerce').notna())
1.01 ms ± 25.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But for a larger Series, to_datetime combined with a check for non-missing values via Series.notna is faster:
ser = pd.Series(pd.date_range('1980/01/05', '2020/01/05'))
ser.loc[3] = 4
print (len(ser))
14611
In [116]: %timeit (ser.apply(lambda x: isinstance(x, pd.Timestamp)))
6.42 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [117]: %timeit ([isinstance(x, pd.Timestamp) for x in ser])
4.9 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [118]: %timeit (pd.to_datetime(ser, errors='coerce').notna())
4.22 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
To address your question of filtering, you can convert to datetime and drop the resulting NaT values.
ser[pd.to_datetime(ser, errors='coerce').notna()]
Or, if you don't mind the result being datetime,
pd.to_datetime(ser, errors='coerce').dropna()
I want to combine two int columns to create a new dot-separated str column. I've got one way that works but if there is a faster way, it would help. I've also tried a suggestion I found in another answer on SO that produces an error.
This works:
df3 = pd.DataFrame({'job_number': [3913291, 3887250, 3913041],
'task_number': [38544, 0, 1]})
df3['filename'] = df3['job_number'].astype(str) + '.' + df3['task_number'].astype(str)
0 3913291.38544
1 3887250.0
2 3913041.1
This answer to a similar question suggests a "numpy" way, using .values.astype(str), but I haven't gotten it to work yet. Here I run it without including the dot separator:
df3['job_number'].values.astype(int).astype(str) + df3['task_number'].astype(int).astype(str)
0 391329138544
1 38872500
2 39130411
But when I include the dot separator I get an error:
df3['job_number'].values.astype(int).astype(str) + '.' + df3['task_number'].astype(int).astype(str)
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U11') dtype('<U11') dtype('<U11')
The result I want is:
0 3913291.38544
1 3887250.0
2 3913041.1
For a comparison of these methods with the other available ones, refer to @jezrael's answer.
Method 1
Add a dummy column containing '.', use it in the concatenation, and drop it afterwards:
%%timeit
df3['dummy'] ='.'
res = df3['job_number'].values.astype(str) + df3['dummy'] + df3['task_number'].values.astype(str)
df3.drop(columns=['dummy'], inplace=True)
1.31 ms ± 41.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
As an extension of method 1: if you exclude the time spent creating and dropping the dummy column, this is the best you get:
%%timeit
df3['job_number'].values.astype(str) + df3['dummy'] + df3['task_number'].values.astype(str)
286 µs ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
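An alternative sketch that avoids the dummy column entirely, using NumPy's np.char.add (element-wise string concatenation, which broadcasts the '.' scalar); this assumes numpy is imported as np, and whether it beats the dummy-column trick depends on your data:
a = df3['job_number'].values.astype(str)
b = df3['task_number'].values.astype(str)
df3['filename'] = np.char.add(np.char.add(a, '.'), b)  # '.' is broadcast across the array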
Method 2
Use apply
%timeit df3.T.apply(lambda x: str(x[0]) + '.' + str(x[1]))
883 µs ± 22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
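The same row-wise apply can be written without transposing by passing axis=1; it is equivalent, and just as slow relative to the vectorized options (a sketch):
df3.apply(lambda x: str(x['job_number']) + '.' + str(x['task_number']), axis=1)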
You can use list comprehension:
df3["filename"] = ['.'.join(i) for i in
zip(df3["job_number"].map(str),df3["task_number"].map(str))]
If you use Python 3.6+, the fastest solution uses f-strings:
df3["filename2"] = [f'{i}.{j}' for i,j in zip(df3["job_number"],df3["task_number"])]
Performance on 30k rows:
df3 = pd.DataFrame({'job_number': [3913291, 3887250, 3913041],
'task_number': [38544, 0, 1]})
df3 = pd.concat([df3] * 10000, ignore_index=True)
In [64]: %%timeit
...: df3["filename2"] = [f'{i}.{j}' for i,j in zip(df3["job_number"],df3["task_number"])]
...:
20.5 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [65]: %%timeit
...: df3["filename3"] = ['.'.join(i) for i in zip(df3["job_number"].map(str),df3["task_number"].map(str))]
...:
30.9 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [66]: %%timeit
...: df3["filename4"] = df3.T.apply(lambda x: str(x[0]) + '.' + str(x[1]))
...:
1.7 s ± 31.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [67]: %%timeit
...: df3['dummy'] ='.'
...: res = df3['job_number'].values.astype(str) + df3['dummy'] + df3['task_number'].values.astype(str)
...: df3.drop(columns=['dummy'], inplace=True)
...:
73.6 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
But the original solution is also very fast:
In [73]: %%timeit
...: df3['filename'] = df3['job_number'].astype(str) + '.' + df3['task_number'].astype(str)
48.3 ms ± 872 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
With a small modification, using map instead of astype:
In [76]: %%timeit
...: df3['filename'] = df3['job_number'].map(str) + '.' + df3['task_number'].map(str)
...:
26 ms ± 676 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Methods in order of %%timeit results
I timed all the suggested methods and a few more on two DataFrames. Here are the timed results for the suggested methods (thank you @meW and @jezrael). If I missed any or you have another, let me know and I'll add it.
Two timings are shown for each method: first for processing the 3 rows in the example df and then for processing 57K rows in another df. Timings may vary on another system. Solutions that include TEST['dot'] in the concatenation string require this column in the df: add it with TEST['dot'] = '.'.
Original method (still the fastest):
.astype(str), +, '.'
%%timeit
TEST['filename'] = TEST['job_number'].astype(str) + '.' + TEST['task_number'].astype(str)
# 553 µs ± 6.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 69.6 ms ± 876 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows
Proposed methods and a few permutations on them:
.astype(int).astype(str), +, '.'
%%timeit
TEST['filename'] = TEST['job_number'].astype(int).astype(str) + '.' + TEST['task_number'].astype(int).astype(str)
# 553 µs ± 6.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 70.2 ms ± 739 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows
.values.astype(int).astype(str), +, TEST['dot']
%%timeit
TEST['filename'] = TEST['job_number'].values.astype(int).astype(str) + TEST['dot'] + TEST['task_number'].values.astype(int).astype(str)
# 221 µs ± 5.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 82.3 ms ± 743 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows
.values.astype(str), +, TEST['dot']
%%timeit
TEST["filename"] = TEST['job_number'].values.astype(str) + TEST['dot'] + TEST['task_number'].values.astype(str)
# 221 µs ± 5.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 92.8 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows
'.'.join(), list comprehension, .values.astype(str)
%%timeit
TEST["filename"] = ['.'.join(i) for i in TEST[["job_number",'task_number']].values.astype(str)]
# 743 µs ± 19.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 147 ms ± 532 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows
f-string, list comprehension, .values.astype(str)
%%timeit
TEST["filename2"] = [f'{i}.{j}' for i,j in TEST[["job_number",'task_number']].values.astype(str)]
# 642 µs ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 167 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) on 57K rows
'.'.join(), zip, list comprehension, .map(str)
%%timeit
TEST["filename"] = ['.'.join(i) for i in
zip(TEST["job_number"].map(str), TEST["task_number"].map(str))]
# 512 µs ± 5.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 181 ms ± 4.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) on 57K rows
apply(lambda, str(x[2]), +, '.')
%%timeit
TEST['filename'] = TEST.T.apply(lambda x: str(x[2]) + '.' + str(x[10]))
# 735 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 2.69 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) on 57K rows
If you see a way to improve on any of these, please let me know and I'll add to the list!
Is there a pandorable way to get only the index in dataframe slicing?
In other words, is there a better way to write the following code:
df.loc[df['A'] >5].index
Thanks!
Yes, it is better to filter only the index values rather than filtering the whole DataFrame and then selecting its index:
#filter index
df.index[df['A'] >5]
#filter DataFrame
df[df['A'] >5].index
The difference shows up in performance too:
np.random.seed(1245)
df = pd.DataFrame({'A':np.random.randint(10, size=1000)})
print (df)
In [40]: %timeit df.index[df['A'] >5]
208 µs ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [41]: %timeit df[df['A'] >5].index
428 µs ± 6.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [42]: %timeit df.loc[df['A'] >5].index
466 µs ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If performance is important, use numpy: convert the index and the column to numpy arrays via .values:
In [43]: %timeit df.index.values[df['A'] >5]
157 µs ± 8.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [44]: %timeit df.index.values[df['A'].values >5]
8.91 µs ± 196 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I have a Python pandas dataframe with several columns. Now I want to copy all values into one single column to get a value_counts result with all values included. At the end I need the total count of string1, string2, and so on. What is the best way to do it?
index  row 1    row 2    ...
0      string1  string3
1      string1  string1
2      string2  string2
...
If performance is an issue try:
from collections import Counter
Counter(df.values.ravel())
#Counter({'string1': 3, 'string2': 2, 'string3': 1})
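If the result is needed as a pandas Series rather than a Counter, the mapping converts directly (a small usage sketch):
pd.Series(Counter(df.values.ravel()))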
Or stack it into one Series and then use value_counts:
df.stack().value_counts()
#string1 3
#string2 2
#string3 1
#dtype: int64
For larger (long) DataFrames with a small number of columns, looping may be faster than stacking:
s = pd.Series(dtype=float)
for col in df.columns:
    s = s.add(df[col].value_counts(), fill_value=0)
#string1 3.0
#string2 2.0
#string3 1.0
#dtype: float64
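Note the counts come back as floats (a side effect of aligning with fill_value); if integer counts are preferred, cast the result back with s.astype(int).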
Also, there's a numpy solution:
import numpy as np
np.unique(df.to_numpy(), return_counts=True)
#(array(['string1', 'string2', 'string3'], dtype=object),
# array([3, 2, 1], dtype=int64))
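To get the same Series-style output from the numpy result (a small usage sketch):
vals, counts = np.unique(df.to_numpy(), return_counts=True)
pd.Series(counts, index=vals)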
df = pd.DataFrame({'row1': ['string1', 'string1', 'string2'],
'row2': ['string3', 'string1', 'string2']})
def vc_from_loop(df):
    s = pd.Series(dtype=float)
    for col in df.columns:
        s = s.add(df[col].value_counts(), fill_value=0)
    return s
Small DataFrame
%timeit Counter(df.values.ravel())
#11.1 µs ± 56.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit df.stack().value_counts()
#835 µs ± 5.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit vc_from_loop(df)
#2.15 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.unique(df.to_numpy(), return_counts=True)
#23.8 µs ± 241 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Long DataFrame
df = pd.concat([df]*300000, ignore_index=True)
%timeit Counter(df.values.ravel())
#124 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.stack().value_counts()
#337 ms ± 3.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vc_from_loop(df)
#182 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.unique(df.to_numpy(), return_counts=True)
#1.16 s ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)