Get total value_counts from a dataframe with Python Pandas - python

I have a Python pandas dataframe with several columns. I want to copy all values into one single column and get a value_counts result that includes all values. In the end I need the total count of string1, string2, ..., n. What is the best way to do it?
index  row 1    row 2    ...
0      string1  string3
1      string1  string1
2      string2  string2
...

If performance is an issue try:
from collections import Counter
Counter(df.values.ravel())
#Counter({'string1': 3, 'string2': 2, 'string3': 1})
Or stack it into one Series then use value_counts
df.stack().value_counts()
#string1 3
#string2 2
#string3 1
#dtype: int64
For larger (long) DataFrames with a small number of columns, looping may be faster than stacking:
s = pd.Series(dtype=float)
for col in df.columns:
    s = s.add(df[col].value_counts(), fill_value=0)
#string1 3.0
#string2 2.0
#string3 1.0
#dtype: float64
Also, there's a numpy solution:
import numpy as np
np.unique(df.to_numpy(), return_counts=True)
#(array(['string1', 'string2', 'string3'], dtype=object),
# array([3, 2, 1], dtype=int64))
import pandas as pd
df = pd.DataFrame({'row1': ['string1', 'string1', 'string2'],
                   'row2': ['string3', 'string1', 'string2']})
def vc_from_loop(df):
    s = pd.Series(dtype=float)
    for col in df.columns:
        s = s.add(df[col].value_counts(), fill_value=0)
    return s
Small DataFrame
%timeit Counter(df.values.ravel())
#11.1 µs ± 56.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit df.stack().value_counts()
#835 µs ± 5.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit vc_from_loop(df)
#2.15 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.unique(df.to_numpy(), return_counts=True)
#23.8 µs ± 241 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Long DataFrame
df = pd.concat([df]*300000, ignore_index=True)
%timeit Counter(df.values.ravel())
#124 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.stack().value_counts()
#337 ms ± 3.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vc_from_loop(df)
#182 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.unique(df.to_numpy(), return_counts=True)
#1.16 s ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
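As a quick sanity check, the four approaches above can be run side by side on the sample frame to confirm that they agree. This is a minimal sketch: the variable names (counts_counter, counts_stack, counts_loop, counts_unique) are made up for illustration, and the cast back to int in the loop version is an addition, not part of the original answer.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({'row1': ['string1', 'string1', 'string2'],
                   'row2': ['string3', 'string1', 'string2']})

# 1) Counter over the flattened values.
counts_counter = dict(Counter(df.to_numpy().ravel()))

# 2) Stack into one Series, then value_counts.
counts_stack = df.stack().value_counts().to_dict()

# 3) Per-column value_counts summed in a loop.
#    add(..., fill_value=0) produces floats, so cast back to int (an addition here).
s = pd.Series(dtype=float)
for col in df.columns:
    s = s.add(df[col].value_counts(), fill_value=0)
counts_loop = s.astype(int).to_dict()

# 4) np.unique with return_counts.
values, counts = np.unique(df.to_numpy(), return_counts=True)
counts_unique = dict(zip(values.tolist(), counts.tolist()))

print(counts_counter)
```

All four dictionaries should hold the same counts; which one is fastest depends on the shape of the data, as the timings above show.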

Related

If nth character of a string in a dataframe is a certain character delete that row

I have a dataframe:
df1 = {'seq': ["(((...))).(.)", "...((.))", "..(.)..(.)"],
       'a': [1, 3, 5],
       'b': [9, 4, 7],
       'val': [0.01, 0.02, 0.03],
       }
df1 = pd.DataFrame(df1, columns=['seq', 'a', 'b', 'val'])
And I want to specify that if the nth character of 'seq' is a "." then delete that row, where n is specified by the 'a' column.
You can use apply and boolean indexing:
df1 = df1[df1.apply(lambda r: r['seq'][r['a']-1]!='.', axis=1)]
Output:
             seq  a  b   val
0  (((...))).(.)  1  9  0.01
2     ..(.)..(.)  5  7  0.03
You can use str.split with '' to split each character into its own column, then use indexing to pick the character at the position given by column a. Finally, check where it is != '.'.
print(df1[df1['seq'].str.split('', expand=True)
          .to_numpy()[np.arange(len(df1)), df1['a']]  # no need of -1 here
          != '.'])
             seq  a  b   val
0  (((...))).(.)  1  9  0.01
2     ..(.)..(.)  5  7  0.03
For a small dataframe of 3 rows, apply is faster, but as the size increases this method becomes worthwhile. Longer strings in seq might also affect the timing, though.
%timeit df1[df1.apply(lambda r: r['seq'][r['a']-1]!='.', axis=1)]
# 1.15 ms ± 23.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df1[df1['seq'].str.split('', expand=True).to_numpy()[np.arange(len(df1)), df1['a']]!='.']
#1.7 ms ± 95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#with 300 rows
df2 = pd.concat([df1]*100)
%timeit df2[df2.apply(lambda r: r['seq'][r['a']-1]!='.', axis=1)]
# 7.45 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df2[df2['seq'].str.split('', expand=True).to_numpy()[np.arange(len(df2)), df2['a']]!='.']
#2.58 ms ± 52.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#with 3000 rows
df3 = pd.concat([df1]*1000)
%timeit df3[df3.apply(lambda r: r['seq'][r['a']-1]!='.', axis=1)]
#67.9 ms ± 3.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df3[df3['seq'].str.split('', expand=True).to_numpy()[np.arange(len(df3)), df3['a']]!='.']
#11.4 ms ± 432 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
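A third option, not part of the answers above and not timed here, is a plain list comprehension over the two columns. The sketch below assumes, as in the question, that column a holds 1-based positions that are always within the length of seq; the names keep and filtered are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'seq': ["(((...))).(.)", "...((.))", "..(.)..(.)"],
                    'a': [1, 3, 5],
                    'b': [9, 4, 7],
                    'val': [0.01, 0.02, 0.03]})

# 'a' is 1-based, so the character to inspect is seq[a - 1].
keep = [seq[n - 1] != '.' for seq, n in zip(df1['seq'], df1['a'])]

# Boolean list indexing keeps only the rows where the nth character is not '.'.
filtered = df1[keep]
print(filtered)
```

This keeps rows 0 and 2, matching the output shown for the other two methods.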

Speed up list creation from pandas dataframe

I have a pandas dataframe df from which I need to create a list Row_list.
import pandas as pd
df = pd.DataFrame([[1, 572548.283, 166424.411, -11.849, -11.512],
                   [2, 572558.153, 166442.134, -11.768, -11.983],
                   [3, 572124.999, 166423.478, -11.861, -11.512],
                   [4, 572534.264, 166414.417, -11.123, -11.993]],
                  columns=['PointNo', 'easting', 'northing', 't_20080729', 't_20090808'])
I am able to create the list in the required format with the code below, but my dataframe has up to 8 million rows and the list creation is very slow.
def test_get_value_iterrows(df):
    Row_list = []
    for index, rows in df.iterrows():
        entirerow = df.values[index,].tolist()
        entirerow.append((df.iloc[index,1], df.iloc[index,2]))
        Row_list.append(entirerow)
    return Row_list
%timeit test_get_value_iterrows(df)
436 µs ± 6.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Looping over df.index instead of using df.iterrows() is a little bit faster:
def test_get_value(df):
    Row_list = []
    for i in df.index:
        entirerow = df.values[i,].tolist()
        entirerow.append((df.iloc[i,1], df.iloc[i,2]))
        Row_list.append(entirerow)
    return Row_list
%timeit test_get_value(df)
270 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I am wondering if there is a faster solution to this?
Use list comprehension:
df = pd.concat([df] * 10000, ignore_index=True)
In [123]: %timeit [[*x, (x[1], x[2])] for x in df.values.tolist()]
27.8 ms ± 404 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [124]: %timeit [x + [(x[1], x[2])] for x in df.values.tolist()]
26.6 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [125]: %timeit (test_get_value(df))
41.2 s ± 1.97 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
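To make the resulting format concrete, here is the list comprehension applied to the four-row frame from the question. It is a minimal sketch; the name row_list is illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 572548.283, 166424.411, -11.849, -11.512],
                   [2, 572558.153, 166442.134, -11.768, -11.983],
                   [3, 572124.999, 166423.478, -11.861, -11.512],
                   [4, 572534.264, 166414.417, -11.123, -11.993]],
                  columns=['PointNo', 'easting', 'northing', 't_20080729', 't_20090808'])

# Convert the frame to a list of lists once, then append the
# (easting, northing) tuple to each row.
# Note: df.values upcasts the integer PointNo column to float because
# the frame mixes int and float columns.
row_list = [x + [(x[1], x[2])] for x in df.values.tolist()]
print(row_list[0])
```

The speedup comes from calling df.values.tolist() a single time instead of indexing df.values and df.iloc inside the loop for every row.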

Efficient way to check dtype of each row in a series

Say I have mixed ts/other data:
ser = pd.Series(pd.date_range('2017/01/05', '2018/01/05'))
ser.loc[3] = 4
type(ser.loc[0])
> pandas._libs.tslibs.timestamps.Timestamp
I would like to filter for all timestamps. For instance, this gives me what I want:
ser.apply(lambda x: isinstance(x, pd.Timestamp))
0 True
1 True
2 True
3 False
4 True
...
But I assume it would be faster to use a vectorized solution and avoid apply. I thought I should be able to use where:
ser.where(isinstance(ser, pd.Timestamp))
But I get
ValueError: Array conditional must be same shape as self
Is there a way to do this? Also, am I correct in my assumption that it would be faster/more 'Pandasic'?
It depends on the length of the data, but for small data like this (366 rows) a list comprehension is faster:
In [108]: %timeit (ser.apply(lambda x: isinstance(x, pd.Timestamp)))
434 µs ± 57.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [109]: %timeit ([isinstance(x, pd.Timestamp) for x in ser])
140 µs ± 5.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [110]: %timeit (pd.to_datetime(ser, errors='coerce').notna())
1.01 ms ± 25.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But for a larger Series, to_datetime combined with a test for non-missing values via Series.notna is faster:
ser = pd.Series(pd.date_range('1980/01/05', '2020/01/05'))
ser.loc[3] = 4
print (len(ser))
14611
In [116]: %timeit (ser.apply(lambda x: isinstance(x, pd.Timestamp)))
6.42 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [117]: %timeit ([isinstance(x, pd.Timestamp) for x in ser])
4.9 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [118]: %timeit (pd.to_datetime(ser, errors='coerce').notna())
4.22 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
To address your question of filtering, you can convert to datetime and drop NaNs.
ser[pd.to_datetime(ser, errors='coerce').notna()]
Or, if you don't mind the result being datetime,
pd.to_datetime(ser, errors='coerce').dropna()
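The isinstance-based list comprehension from the timings can also be used directly as a filter mask. The sketch below makes one assumption that differs from the question: the Series is created with object dtype up front, so that assigning the integer does not rely on how a given pandas version upcasts a datetime Series. The names mask and only_timestamps are illustrative.

```python
import numpy as np
import pandas as pd

# Build the mixed Series as object dtype (an assumption of this sketch),
# then overwrite one entry with a non-timestamp value.
ser = pd.Series(list(pd.date_range('2017/01/05', '2018/01/05')), dtype=object)
ser.loc[3] = 4

# Boolean mask: True where the element is a pandas Timestamp.
mask = np.array([isinstance(x, pd.Timestamp) for x in ser], dtype=bool)

# Keep only the timestamp entries; the original values and index are preserved.
only_timestamps = ser[mask]
print(len(ser), len(only_timestamps))
```

Unlike the to_datetime route, this checks the Python type of each element rather than whether the value can be parsed as a date, so the two can disagree on values such as date-like strings.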

Is there a faster (numpy?) way to combine pandas df int columns into dot-separated str col without TypeError

I want to combine two int columns to create a new dot-separated str column. I've got one way that works but if there is a faster way, it would help. I've also tried a suggestion I found in another answer on SO that produces an error.
This works:
df3 = pd.DataFrame({'job_number': [3913291, 3887250, 3913041],
'task_number': [38544, 0, 1]})
df3['filename'] = df3['job_number'].astype(str) + '.' + df3['task_number'].astype(str)
0 3913291.38544
1 3887250.0
2 3913041.1
This answer to a similar question suggests a "numpy" way, using .values.astype(str), but I haven't gotten it to work yet. Here I run it without including the dot separator:
df3['job_number'].values.astype(int).astype(str) + df3['task_number'].astype(int).astype(str)
0 391329138544
1 38872500
2 39130411
But when I include the dot separator I get an error:
df3['job_number'].values.astype(int).astype(str) + '.' + df3['task_number'].astype(int).astype(str)
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U11') dtype('<U11') dtype('<U11')
The result I want is:
0 3913291.38544
1 3887250.0
2 3913041.1
For a comparison of the given methods with other available methods, refer to @jezrael's answer.
Method 1
Add a dummy column containing '.', use it in the concatenation, and drop it afterwards:
%%timeit
df3['dummy'] ='.'
res = df3['job_number'].values.astype(str) + df3['dummy'] + df3['task_number'].values.astype(str)
df3.drop(columns=['dummy'], inplace=True)
1.31 ms ± 41.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
As an extension of Method 1, if you exclude the time spent creating and dropping the dummy column, this is the best you get:
%%timeit
df3['job_number'].values.astype(str) + df3['dummy'] + df3['task_number'].values.astype(str)
286 µs ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 2
Use apply
%timeit df3.T.apply(lambda x: str(x[0]) + '.' + str(x[1]))
883 µs ± 22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
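If the goal is to stay in numpy without a dummy column, one option that is not part of the answers above is np.char.add, which concatenates string arrays element-wise. This is a minimal sketch with illustrative names (job, task); it was not timed against the other methods, so no speed claim is made.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({'job_number': [3913291, 3887250, 3913041],
                    'task_number': [38544, 0, 1]})

# Convert each integer column to a numpy array of strings.
job = df3['job_number'].to_numpy().astype(str)
task = df3['task_number'].to_numpy().astype(str)

# np.char.add concatenates element-wise; the scalar '.' is broadcast.
df3['filename'] = np.char.add(np.char.add(job, '.'), task)
print(df3['filename'].tolist())
```

This sidesteps the TypeError from the question, which comes from applying the + operator to numpy string arrays rather than to pandas Series.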
You can use list comprehension:
df3["filename"] = ['.'.join(i) for i in
                   zip(df3["job_number"].map(str), df3["task_number"].map(str))]
With Python 3.6+, the fastest solution uses f-strings:
df3["filename2"] = [f'{i}.{j}' for i,j in zip(df3["job_number"],df3["task_number"])]
Performance in 30k rows:
df3 = pd.DataFrame({'job_number': [3913291, 3887250, 3913041],
'task_number': [38544, 0, 1]})
df3 = pd.concat([df3] * 10000, ignore_index=True)
In [64]: %%timeit
...: df3["filename2"] = [f'{i}.{j}' for i,j in zip(df3["job_number"],df3["task_number"])]
...:
20.5 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [65]: %%timeit
...: df3["filename3"] = ['.'.join(i) for i in zip(df3["job_number"].map(str),df3["task_number"].map(str))]
...:
30.9 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [66]: %%timeit
...: df3["filename4"] = df3.T.apply(lambda x: str(x[0]) + '.' + str(x[1]))
...:
1.7 s ± 31.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [67]: %%timeit
...: df3['dummy'] ='.'
...: res = df3['job_number'].values.astype(str) + df3['dummy'] + df3['task_number'].values.astype(str)
...: df3.drop(columns=['dummy'], inplace=True)
...:
73.6 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
But the original solution is also very fast:
In [73]: %%timeit
...: df3['filename'] = df3['job_number'].astype(str) + '.' + df3['task_number'].astype(str)
48.3 ms ± 872 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
With a small modification, using map instead of astype:
In [76]: %%timeit
...: df3['filename'] = df3['job_number'].map(str) + '.' + df3['task_number'].map(str)
...:
26 ms ± 676 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
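Before choosing on speed alone, it may be worth confirming that the variants produce identical strings. A small sketch on the three-row example, with illustrative names (via_fstring, via_map, via_astype):

```python
import pandas as pd

df3 = pd.DataFrame({'job_number': [3913291, 3887250, 3913041],
                    'task_number': [38544, 0, 1]})

# f-string list comprehension over the two integer columns.
via_fstring = [f'{i}.{j}' for i, j in zip(df3['job_number'], df3['task_number'])]

# map(str) on each column, then Series concatenation.
via_map = (df3['job_number'].map(str) + '.' + df3['task_number'].map(str)).tolist()

# The original astype(str) approach from the question.
via_astype = (df3['job_number'].astype(str) + '.' + df3['task_number'].astype(str)).tolist()

print(via_fstring)
```

All three lists should be equal for integer columns; with float columns or missing values the string conversions may differ, so that case would need its own check.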
Methods in order of %%timeit results
I timed all the suggested methods and a few more on two DataFrames. Here are the timed results for the suggested methods (thank you @meW and @jezrael). If I missed any or you have another, let me know and I'll add it.
Two timings are shown for each method: first for processing the 3 rows in the example df and then for processing 57K rows in another df. Timings may vary on another system. Solutions that include TEST['dot'] in the concatenation string require this column in the df: add it with TEST['dot'] = '.'.
Original method (still the fastest):
.astype(str), +, '.'
%%timeit
TEST['filename'] = TEST['job_number'].astype(str) + '.' + TEST['task_number'].astype(str)
# 553 µs ± 6.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 69.6 ms ± 876 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows
Proposed methods and a few permutations on them:
.astype(int).astype(str), +, '.'
%%timeit
TEST['filename'] = TEST['job_number'].astype(int).astype(str) + '.' + TEST['task_number'].astype(int).astype(str)
# 553 µs ± 6.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 70.2 ms ± 739 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows
.values.astype(int).astype(str), +, TEST['dot']
%%timeit
TEST['filename'] = TEST['job_number'].values.astype(int).astype(str) + TEST['dot'] + TEST['task_number'].values.astype(int).astype(str)
# 221 µs ± 5.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 82.3 ms ± 743 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows
.values.astype(str), +, TEST['dot']
%%timeit
TEST["filename"] = TEST['job_number'].values.astype(str) + TEST['dot'] + TEST['task_number'].values.astype(str)
# 221 µs ± 5.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 92.8 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows
'.'.join(), list comprehension, .values.astype(str)
%%timeit
TEST["filename"] = ['.'.join(i) for i in TEST[["job_number",'task_number']].values.astype(str)]
# 743 µs ± 19.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 147 ms ± 532 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows
f-string, list comprehension, .values.astype(str)
%%timeit
TEST["filename2"] = [f'{i}.{j}' for i,j in TEST[["job_number",'task_number']].values.astype(str)]
# 642 µs ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 167 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) on 57K rows
'.'.join(), zip, list comprehension, .map(str)
%%timeit
TEST["filename"] = ['.'.join(i) for i in
                    zip(TEST["job_number"].map(str), TEST["task_number"].map(str))]
# 512 µs ± 5.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 181 ms ± 4.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) on 57K rows
apply(lambda, str(x[2]), +, '.')
%%timeit
TEST['filename'] = TEST.T.apply(lambda x: str(x[2]) + '.' + str(x[10]))
# 735 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 2.69 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) on 57K rows
If you see a way to improve on any of these, please let me know and I'll add to the list!

"Pandorable" way to return index in dataframe slicing

Is there a pandorable way to get only the index in dataframe slicing?
In other words, is there a better way to write the following code:
df.loc[df['A'] >5].index
Thanks!
Yes, it is better to filter only the index values rather than filtering the whole DataFrame and then selecting the index:
#filter index
df.index[df['A'] >5]
#filter DataFrame
df[df['A'] >5].index
There is a difference in performance too:
np.random.seed(1245)
df = pd.DataFrame({'A':np.random.randint(10, size=1000)})
print (df)
In [40]: %timeit df.index[df['A'] >5]
208 µs ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [41]: %timeit df[df['A'] >5].index
428 µs ± 6.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [42]: %timeit df.loc[df['A'] >5].index
466 µs ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If performance is important, use numpy: convert both the index and the column to numpy arrays with .values:
In [43]: %timeit df.index.values[df['A'] >5]
157 µs ± 8.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [44]: %timeit df.index.values[df['A'].values >5]
8.91 µs ± 196 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
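The variants return the same labels but not the same type, which matters if the result is used further. A minimal sketch using the same seeded frame as above, with illustrative names (idx_direct, idx_via_frame, idx_numpy):

```python
import numpy as np
import pandas as pd

np.random.seed(1245)
df = pd.DataFrame({'A': np.random.randint(10, size=1000)})

# Filter the index directly with the boolean mask.
idx_direct = df.index[df['A'] > 5]

# Filter the whole DataFrame first, then take its index.
idx_via_frame = df[df['A'] > 5].index

# Pure numpy: the result is a plain ndarray, not a pandas Index.
idx_numpy = df.index.values[df['A'].values > 5]

print(type(idx_direct).__name__, type(idx_numpy).__name__)
```

If an Index is needed afterwards, the numpy result can be wrapped again with pd.Index(idx_numpy), at the cost of some of the time saved.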
