How to generate unique uuid in pandas without loop - python

Currently I am generating a unique uuid for each row using a loop, like this:
df['uuid'] = df.apply(lambda x: uuid.uuid4(), axis=1)
Is there a way to do this without a loop?

It's still a loop, but it's a little faster than the current approach:
df['uuid'] = [uuid.uuid4() for x in range(df.shape[0])]
24.4 µs ± 2.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Using apply
df['uuid'] = df.apply(lambda x: uuid.uuid4(), axis=1)
1.25 ms ± 39.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
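For reference, a minimal self-contained sketch of the list-comprehension approach (the example DataFrame here is made up for illustration):
import uuid
import pandas as pd

# Hypothetical example frame; any DataFrame works the same way.
df = pd.DataFrame({'value': range(5)})

# One uuid4 per row; still a Python-level loop, but cheaper than df.apply
# because it avoids building a row object for every call.
df['uuid'] = [uuid.uuid4() for _ in range(len(df))]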

Related

Python speed up element in list

I have made a script where I check whether a column's values from dataframe A exist in a column of dataframe B. Here, dataframe A is named whole_data, and dataframe B is named referrals:
users=set(whole_data['user_id'])
referees=set(referrals['referee_id'])
non_referees=set([x for x in users if x not in referees])
As you can see, I want a collection of users (named non_referees) containing the users that are not referees; that's why I check, for every user_id from whole_data, whether it exists in the set of referees.
Nonetheless, this is taking a massive amount of time; there are about 100K users and 4K referees. Is there a way to make this faster?
First, pandas can already give you the unique values of a series, which might be faster than building the set from the whole column.
Second, to build the set of non-referees, you can then use set operations:
non_referees = users - referees
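A tiny illustration of the set difference (values invented):
users = {1, 2, 3, 4, 5}
referees = {2, 4}
non_referees = users - referees   # {1, 3, 5}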
EDIT: As an additional note, if you build a set using the generator expression style, you don't need to build an intermediate list:
# slow because it first builds a list and then turns that into a set:
some_set = set([x for x in something])
# faster because it goes right into building the set:
some_other_set = set(x for x in something_else)
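A set comprehension is another idiomatic spelling of the same idea (equivalent result; the example data is made up):
something_else = [1, 2, 2, 3]
# Builds the set directly, with no intermediate list.
some_other_set = {x for x in something_else}   # {1, 2, 3}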
You may want to consider:
non_referees = set(whole_data['user_id'].unique()).difference(
    referrals['referee_id'].unique()
)
If you think there are few repeats in referrals['referee_id'], then you'll gain a smidgen of speed by avoiding .unique() for them.
Speed
Here are some experiments with a few closely related forms:
Case with lots of duplicated referee_id:
n = 100_000
whole_data = pd.DataFrame({
    'user_id': np.random.randint(0, n, n),
})
referrals = pd.DataFrame({
    'referee_id': np.random.randint(0, n, n),
})
Measurements:
%timeit non_referees = set(pd.unique(whole_data['user_id'])) - set(pd.unique(referrals['referee_id']))
# 23.4 ms ± 53.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()) - set(referrals['referee_id'].unique())
# 23.3 ms ± 36 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()).difference(set(referrals['referee_id'].unique()))
# 23.3 ms ± 73.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# *** fastest
%timeit non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'].unique())
# 21.4 ms ± 74.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'])
# 29.6 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Case with few duplicates in referee_id:
n = 100_000
whole_data = pd.DataFrame({
    'user_id': np.random.randint(0, n, n),
})
referrals = pd.DataFrame({
    'referee_id': np.random.randint(0, 100*n, n),
})
Measurements:
%timeit non_referees = set(pd.unique(whole_data['user_id'])) - set(pd.unique(referrals['referee_id']))
# 30.7 ms ± 61.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()) - set(referrals['referee_id'].unique())
# 30.7 ms ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()).difference(set(referrals['referee_id'].unique()))
# 30.6 ms ± 57.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'].unique())
# 23.7 ms ± 37.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# *** fastest
%timeit non_referees = set(whole_data['user_id'].unique()).difference(referrals['referee_id'])
# 20.9 ms ± 54 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
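Not measured above, but numpy's setdiff1d is another form one might benchmark for the same task; it returns a sorted array of unique IDs rather than a set:
import numpy as np

# Unique user_ids that never appear as referee_ids, as a sorted array.
non_referee_ids = np.setdiff1d(whole_data['user_id'].to_numpy(),
                               referrals['referee_id'].to_numpy())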

Performance of pandas GroupBy to the Keys

I have to group a DataFrame and do some further calculations using the keys as input parameters. While doing so, I noticed some strange performance behavior.
The time for grouping is fine, and so is the time to get the keys. But if I execute both steps together, it takes 24x the time.
Am I using it wrong, or is there another way to get the unique parameter pairs with all their indices?
Here is a simple example:
import numpy as np
import pandas as pd

def test_1(df):
    grouped = df.groupby(['up','down'])
    return grouped

def test_2(grouped):
    keys = grouped.groups.keys()
    return keys

def test_3(df):
    keys = df.groupby(['up','down']).groups.keys()
    return keys

def test_4(df):
    grouped = df.groupby(['up','down'])
    keys = grouped.groups.keys()
    return keys

n = np.arange(1,10,1)
df = pd.DataFrame([],columns=['up','down'])
df['up'] = n
df['down'] = n[::-1]
grouped = df.groupby(['up','down'])
%timeit test_1(df)
169 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit test_2(grouped)
1.01 µs ± 70.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit test_3(df)
4.36 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit test_4(df)
4.2 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Thanks in advance for comments or ideas.
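Not part of the original post, but for illustration, two related ways to get the unique pairs (and their row positions) that one might time against the versions above:
# Just the unique (up, down) pairs, without the grouping machinery.
unique_pairs = df[['up', 'down']].drop_duplicates()

# Mapping from each (up, down) key to the positional row indices of its group.
key_to_positions = df.groupby(['up', 'down']).indices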

Speed up list creation from pandas dataframe

I have a pandas dataframe df from which I need to create a list Row_list.
import pandas as pd
df = pd.DataFrame([[1, 572548.283, 166424.411, -11.849, -11.512],
                   [2, 572558.153, 166442.134, -11.768, -11.983],
                   [3, 572124.999, 166423.478, -11.861, -11.512],
                   [4, 572534.264, 166414.417, -11.123, -11.993]],
                  columns=['PointNo','easting', 'northing', 't_20080729','t_20090808'])
I am able to create the list in the required format with the code below, but my dataframe has up to 8 million rows and the list creation is very slow.
def test_get_value_iterrows(df):
    Row_list = []
    for index, rows in df.iterrows():
        entirerow = df.values[index,].tolist()
        entirerow.append((df.iloc[index,1], df.iloc[index,2]))
        Row_list.append(entirerow)
    return Row_list
%timeit test_get_value_iterrows(df)
436 µs ± 6.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Not using df.iterrows() is a little bit faster:
def test_get_value(df):
    Row_list = []
    for i in df.index:
        entirerow = df.values[i,].tolist()
        entirerow.append((df.iloc[i,1], df.iloc[i,2]))
        Row_list.append(entirerow)
    return Row_list
%timeit test_get_value(df)
270 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I am wondering if there is a faster solution to this?
Use a list comprehension:
df = pd.concat([df] * 10000, ignore_index=True)
In [123]: %timeit [[*x, (x[1], x[2])] for x in df.values.tolist()]
27.8 ms ± 404 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [124]: %timeit [x + [(x[1], x[2])] for x in df.values.tolist()]
26.6 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [125]: %timeit (test_get_value(df))
41.2 s ± 1.97 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
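For completeness, a sketch of the list-comprehension version wrapped in the same function shape as the originals (assuming the column order shown above, so x[1] and x[2] are easting and northing):
def test_get_value_listcomp(df):
    # One pass over a plain list of lists; appends the (easting, northing) tuple to each row.
    return [x + [(x[1], x[2])] for x in df.values.tolist()]

Row_list = test_get_value_listcomp(df)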

Efficient way to check dtype of each row in a series

Say I have mixed ts/other data:
ser = pd.Series(pd.date_range('2017/01/05', '2018/01/05'))
ser.loc[3] = 4
type(ser.loc[0])
> pandas._libs.tslibs.timestamps.Timestamp
I would like to filter for all timestamps. For instance, this gives me what I want:
ser.apply(lambda x: isinstance(x, pd.Timestamp))
0 True
1 True
2 True
3 False
4 True
...
But I assume it would be faster to use a vectorized solution and avoid apply. I thought I should be able to use where:
ser.where(isinstance(ser, pd.Timestamp))
But I get
ValueError: Array conditional must be same shape as self
Is there a way to do this? Also, am I correct in my assumption that it would be faster/more 'Pandasic'?
It depends on the length of the data, but here, for small data (365 rows), a list comprehension is faster:
In [108]: %timeit (ser.apply(lambda x: isinstance(x, pd.Timestamp)))
434 µs ± 57.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [109]: %timeit ([isinstance(x, pd.Timestamp) for x in ser])
140 µs ± 5.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [110]: %timeit (pd.to_datetime(ser, errors='coerce').notna())
1.01 ms ± 25.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But if you test a larger Series, to_datetime with a check for non-missing values via Series.notna is faster:
ser = pd.Series(pd.date_range('1980/01/05', '2020/01/05'))
ser.loc[3] = 4
print (len(ser))
14611
In [116]: %timeit (ser.apply(lambda x: isinstance(x, pd.Timestamp)))
6.42 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [117]: %timeit ([isinstance(x, pd.Timestamp) for x in ser])
4.9 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [118]: %timeit (pd.to_datetime(ser, errors='coerce').notna())
4.22 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
To address your question of filtering, you can convert to datetime and drop NaNs.
ser[pd.to_datetime(ser, errors='coerce').notna()]
Or, if you don't mind the result being datetime,
pd.to_datetime(ser, errors='coerce').dropna()
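Putting the question's pieces together, a minimal end-to-end sketch of the isinstance-based mask (the series is built as object dtype here so the integer assignment behaves the same across pandas versions):
import pandas as pd

# Mixed series: mostly Timestamps, one plain integer.
ser = pd.Series(pd.date_range('2017/01/05', '2018/01/05'), dtype=object)
ser.loc[3] = 4

# Boolean mask aligned to ser's index; True only for real Timestamp objects.
mask = pd.Series([isinstance(x, pd.Timestamp) for x in ser], index=ser.index)
only_timestamps = ser[mask]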

"Pandorable" way to return index in dataframe slicing

Is there a pandorable way to get only the index in dataframe slicing?
In other words, is there a better way to write the following code:
df.loc[df['A'] >5].index
Thanks!
Yes, it is better to filter only the index values rather than the whole DataFrame, and then select the index:
#filter index
df.index[df['A'] >5]
#filter DataFrame
df[df['A'] >5].index
The difference shows in performance too:
np.random.seed(1245)
df = pd.DataFrame({'A':np.random.randint(10, size=1000)})
print (df)
In [40]: %timeit df.index[df['A'] >5]
208 µs ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [41]: %timeit df[df['A'] >5].index
428 µs ± 6.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [42]: %timeit df.loc[df['A'] >5].index
466 µs ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If performance is important, use numpy - convert the index and the column to numpy arrays with .values:
In [43]: %timeit df.index.values[df['A'] >5]
157 µs ± 8.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [44]: %timeit df.index.values[df['A'].values >5]
8.91 µs ± 196 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
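A small self-contained sketch of the fastest variant above (numpy arrays on both sides of the comparison):
import numpy as np
import pandas as pd

np.random.seed(1245)
df = pd.DataFrame({'A': np.random.randint(10, size=1000)})

# Boolean mask computed on the raw numpy values, then applied to the raw index values.
idx = df.index.values[df['A'].values > 5]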
