Pandas diff SeriesGroupBy is relatively slow - python
Total time: 1.01876 s
Function: prepare at line 91
Line #  Hits        Time    Per Hit  % Time  Line Contents
==============================================================
    91                                       #profile
    92                                       def prepare():
   ...
    98     1       536.0      536.0     0.1      tss = df.groupby('user_id').timestamp
    99     1    949643.0   949643.0    93.2      delta = tss.diff()
   ...
(remaining lines of prepare() omitted: together they account for roughly 7 % of the total time, and no single line exceeds 1.7 %)
I have a dataframe which I group by a key; I then select a column from each group and perform diff on that column (per group). As the profiling results show, the diff operation is far slower than everything else and is the bottleneck. Is this expected? Are there faster alternatives that achieve the same result?
Edit: some more explanation
In my use case the timestamps represent the times of a user's actions, and I want to calculate the deltas between consecutive actions (they are already sorted). Each user's actions are completely independent of other users'.
Edit: Sample code
import pandas as pd
import numpy as np
df = pd.DataFrame(
{'ts':[1,2,3,4,60,61,62,63,64,150,155,156,
1,2,3,4,60,61,62,63,64,150,155,163,
1,2,3,4,60,61,62,63,64,150,155,183],
'id': [1,2,3,4,60,61,62,63,64,150,155,156,
71,72,73,74,80,81,82,83,64,160,165,166,
21,22,23,24,90,91,92,93,94,180,185,186],
'other':['x','x','x','','x','x','','x','x','','x','',
'y','y','y','','y','y','','y','y','','y','',
'z','z','z','','z','z','','z','z','','z',''],
'user':['x','x','x','x','x','x','x','x','z','x','x','y',
'y','y','y','y','y','y','y','y','x','y','y','x',
'z','z','z','z','z','z','z','z','y','z','z','z']
})
df.set_index('id',inplace=True)
deltas=df.groupby('user').ts.transform(pd.Series.diff)
If you do not wish to sort your data or drop down to numpy, then a significant performance improvement may be possible by changing your user series to Categorical. Categorical data is effectively stored as integer codes.
In the example below, I see an improvement from 86 ms to 59 ms (35.7 ms for the groupby itself plus a one-off 23.4 ms for the conversion). This may improve further for larger datasets and where more users are repeated.
df = pd.concat([df]*10000)
%timeit df.groupby('user').ts.transform(pd.Series.diff) # 86.1 ms per loop
%timeit df['user'].astype('category') # 23.4 ms per loop
df['user'] = df['user'].astype('category')
%timeit df.groupby('user').ts.transform(pd.Series.diff) # 35.7 ms per loop
If you are performing multiple operations, then the one-off cost of converting to categorical can be discounted.
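For completeness, here is a rough sketch of the "drop down to numpy" route mentioned above. It assumes, as the question states, that timestamps are already sorted within each user; the helper name groupwise_diff_numpy is made up for illustration and is not part of any library.

import numpy as np
import pandas as pd

def groupwise_diff_numpy(df, group_col='user', value_col='ts'):
    # Stable sort by group so rows of the same user become contiguous
    # while keeping their original (time-sorted) order within each group.
    order = df[group_col].to_numpy().argsort(kind='mergesort')
    groups = df[group_col].to_numpy()[order]
    values = df[value_col].to_numpy()[order].astype('float64')

    # Plain element-wise difference, then blank out the first row of
    # each group (it has no predecessor within its own group).
    out = np.empty_like(values)
    out[0] = np.nan
    out[1:] = values[1:] - values[:-1]
    out[1:][groups[1:] != groups[:-1]] = np.nan

    # Scatter the results back into the original row order.
    result = np.empty_like(out)
    result[order] = out
    return pd.Series(result, index=df.index)

deltas_np = groupwise_diff_numpy(df)   # same alignment as the transform() result

Whether this beats the Categorical approach depends on the data; it is worth benchmarking both on your real dataset.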
Related
How to modify the number of the rows in .csv file and plot them
I read a .csv file using this command:

df = pd.read_csv('filename.csv', nrows=200)

I set the number of rows to 200, so it only gets the data for 200 rows (200 rows x 1 column):

        data
1       4.33
2       6.98
.
.
200  100.896

I want to plot these data, but I would like to divide the row numbers by 50 (there will still be 200 elements, but the row labels will be divided by 50):

         data
0.02     4.33
0.04     6.98
.
.
4     100.896

I'm not sure how I would do that. Is there a way of doing this?
Just divide the index by 50. Here is an example:

import pandas as pd
import random

data = pd.DataFrame({'col1': random.sample(range(300), 200)}, index=range(1, 201))
data.index = data.index / 50

data
      col1
0.02   196
0.04   198
0.06   278
0.08   209
0.10    36
...    ...
3.92   175
3.94    69
3.96   145
3.98    15
4.00    18
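If the goal is the plot mentioned in the question, the rescaled index can be used directly as the x-axis. A minimal sketch, assuming matplotlib is installed:

import matplotlib.pyplot as plt

data.plot()   # the divided index (0.02 ... 4.00) becomes the x-axis
plt.show()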
How to find average temperatures over previous 2 weeks on daily basis in csv file?
I am a new coder and have recently been introduced to the pandas framework. I have a large csv file with daily average temperatures over a few years that looks like this:

Station  Date      Tmax  Tmin  Tavg
1        5/1/2007  83    50    67
1        5/2/2007  59    42    51
2        5/2/2007  60    43    52
1        5/3/2007  66    46    56
2        5/3/2007  67    48    58
1        5/4/2007  66    49    58
2        5/4/2007  78    51    M
1        5/5/2007  66    53    60
2        5/5/2007  66    54    60
1        5/6/2007  68    49    59
2        5/6/2007  68    52    60

Based on the average temperatures in column Tavg, I need to create another column that shows the average temperature over the previous two weeks on a daily basis. I hope I made myself clear. Any help will be greatly appreciated.
As @MarkSetchell suggested, there is a related question that calculates a rolling mean on a specific pandas column. Briefly, Andrew L posted the best answer:

%timeit weather['ma'] = weather['Tavg'].rolling(5).mean()
%timeit weather['ma_2'] = weather.rolling(5).mean()['Tavg']

1000 loops, best of 3: 497 µs per loop
100 loops, best of 3: 2.6 ms per loop

The second method is not recommended unless you also need to store the computed rolling means of all the other columns.
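For the two-week window the question actually asks about, here is a minimal sketch. It assumes one row per station per day, a 14-day trailing window, and that 'M' marks a missing reading; the file name temperatures.csv is made up for illustration.

import pandas as pd

weather = pd.read_csv('temperatures.csv', parse_dates=['Date'])

# 'M' marks missing readings; coerce them to NaN so the mean skips them.
weather['Tavg'] = pd.to_numeric(weather['Tavg'], errors='coerce')

# 14-day trailing mean of Tavg, computed separately for each station.
weather = weather.sort_values(['Station', 'Date'])
weather['Tavg_2wk'] = (weather.groupby('Station')['Tavg']
                              .transform(lambda s: s.rolling(14, min_periods=1).mean()))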
Pandas group by cumsum length of values does not match length of index
As most of these help questions begin, I'm new to Python and Pandas. I've been learning by doing, especially when I have a particular task to complete. I have searched the help pages and could not find an answer that addressed my specific problem, and I could not devise a solution based on answers to similar problems. I have a data set with 50K+ entries. The general format is:

    code  value
0    101    0.0
1    102    0.0
2    103   23.2
3    104   10.3
4    105    0.2
5    106    0.0
6    107   22.6
7    108    0.0
8    109    0.0
9    110    2.2
10   111    3.8
11   112    0.0

My first task was to segregate consecutive non-zero values. Through trial and error, I managed to condense my script to one line that accomplishes this:

df[df['value'] != 0].groupby((df['value'] == 0).cumsum())

for grp, val in df[df['value'] != 0].groupby((df['value'] == 0).cumsum()):
    print(f'[group {grp}]')
    print(val)

The output is:

[group 2]
   code  value
2   103   23.2
3   104   10.3
4   105    0.2
[group 3]
   code  value
6   107   22.6
[group 5]
    code  value
9    110    2.2
10   111    3.8

I have other manipulations and calculations to do on this data set, and I think the easiest way to access these data would be to transform the groupby object into a column (if that is even the correct terminology?), like so:

   code  value  group
0   103   23.2      2
1   104   10.3      2
2   105    0.2      2
3   107   22.6      3
4   110    2.2      5
5   111    3.8      5

Obviously, I get a "Length of values does not match length of index" error. I searched the help pages and it seemed that I needed some kind of reset_index call. I tried various syntax structures and many other coding solutions suggested in other threads over the past day and a half without success. I finally decided to give up and ask for help when I returned from a short break and found my cat rolling on the keyboard, adding and deleting gobs of gibberish to the script snippets I had been testing. If someone would be kind enough to help me with this script, that is, to get the groupby object into a column, I would greatly appreciate it. Thanks.
This will give you the groups, then drop the zero rows.

df = pd.DataFrame({'code': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112],
                   'value': [0.0, 0.0, 23.2, 10.3, 0.2, 0.0, 22.6, 0.0, 0.0, 2.2, 3.8, 0.0]})

df['group'] = df.value.eq(0).cumsum()
df = df.loc[df.value.ne(0)]

Output:

    code  value  group
2    103   23.2      2
3    104   10.3      2
4    105    0.2      2
6    107   22.6      3
9    110    2.2      5
10   111    3.8      5
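If you would rather have consecutive group numbers (0, 1, 2, ...) instead of the raw cumsum labels (2, 3, 5), one optional follow-up, not part of the answer above, is groupby().ngroup():

# Relabel the surviving groups as 0, 1, 2, ... in order of appearance.
df['group'] = df.groupby('group', sort=False).ngroup()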
Join dataframe with matrix output using pandas
I am trying to translate the input dataframe (inp_df) into the output dataframe (out_df) using data from a cell-based intermediate dataframe (matrix_df), as shown below. There are several cell-number-based files with distance values, shown as matrix_df. The program iterates by cell and fetches data from the appropriate file, so each time matrix_df holds the data for all rows of the current cell # being iterated over in inp_df.

inp_df

  A    B  cell
100  200     1
115  270     1
145  255     2
115  266     1

matrix_df (cell_1.csv)

  B  100   115   199  avg_distance
200  7.5  80.7  67.8            52
270  6.8  53    92              50
266  58   84    31              57

matrix_df (cell_2.csv)

  B   145    121  166  avg_distance
255  74.9  77.53    8         53.47

out_df dataframe

  A    B  cell  distance  avg_distance
100  200     1       7.5            52
115  270     1        53            50
145  255     2      74.9         53.47
115  266     1        84            57

My current thought process, for each cell's data, is to use an apply function to go row by row and then join inp_df with matrix_df on column B, where matrix_df is somehow translated into a tuple of column name, distance and average distance. But I am looking for a pandonic way of doing this, since my approach will slow down when there are millions of rows in the input. I am specifically looking for the core logic inside an iteration to fetch the matches, since in each cell the number of columns in matrix_df varies.

If it's any help, the matrix files are the distance-based outputs from sklearn.metrics.pairwise.pairwise_distances.

NB: In inp_df the value of column B is unique, and the values of column A may or may not be unique.

Also, the first column of the matrix_dfs was empty, and I renamed it with the following code for ease of understanding, since it was a header-less matrix output file:

dist_df = pd.read_csv(mypath, index_col=False)
dist_df.rename(columns={'Unnamed: 0': 'B'}, inplace=True)
Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge:

In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)

Step 2: Create the distance column with df.apply by using A's values to index into the correct column:

In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
     ...:       [['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
     A    B  cell  distance  avg_distance
0  100  200     1       7.5         52.00
1  115  270     1      53.0         50.00
2  115  266     1      84.0         57.00
3  145  255     2      74.9         53.47
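To make those two steps reproducible, the inputs from the question might be constructed as below. The distance column headers are kept as strings (as they would be after read_csv), which is what the str(int(x['A'])) lookup in step 2 relies on; the exact construction is an assumption for illustration.

import pandas as pd

inp_df = pd.DataFrame({'A': [100, 115, 145, 115],
                       'B': [200, 270, 255, 266],
                       'cell': [1, 1, 2, 1]})

# cell_1.csv after read_csv and the rename shown in the question
matrix_df1 = pd.DataFrame({'B': [200, 270, 266],
                           '100': [7.5, 6.8, 58.0],
                           '115': [80.7, 53.0, 84.0],
                           '199': [67.8, 92.0, 31.0],
                           'avg_distance': [52.0, 50.0, 57.0]})

# cell_2.csv after the same treatment
matrix_df2 = pd.DataFrame({'B': [255],
                           '145': [74.9],
                           '121': [77.53],
                           '166': [8.0],
                           'avg_distance': [53.47]})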
python pandas add a lower level column to multi_index dataframe
Could someone help me achieve this task? I have data in a multi-level data frame produced by the unstack() operation.

Original df:

Density  Length  Range  Count
15k      0.60    small    555
15k      0.60    big       17
15k      1.80    small    141
15k      1.80    big       21
15k      3.60    small    150
15k      3.60    big       26
20k      0.60    small   5543
20k      0.60    big       22
20k      1.80    small    553
20k      1.80    big       25
20k      3.60    small    422
20k      3.60    big       35

df = df.set_index(['Density','Length','Range']).unstack('Range')

After unstack:

                Count
Range             big  small
Density Length
15k     0.60       17    555
        1.80       21    141
        3.60       26    150
20k     0.60       22   5543
        1.80       25    553
        3.60       35    422

Now I try to add an extra column on level 1: the ratio of small/big. I have tried the following syntax; no errors, but with different outcomes:

#df[:]['ratio']=df['Count']['small']/df['Count']['big']        ## case 1. no error, no ratio
#df['Count']['ratio']=df['Count']['small']/df['Count']['big']  ## case 2. no error, no ratio
#df['ratio']=df['Count']['small']/df['Count']['big']           ## case 3. no error, ratio on column level 0
df['ratio']=df.ix[:,1]/df.ix[:,0]                              ## case 4. no error, ratio on column level 0

After executing the code above, df is:

                Count         ratio
Range             big  small
Density Length
15k     0.60       17    555   32.65
        1.80       21    141    6.71
        3.60       26    150    5.77
20k     0.60       22   5543  251.95
        1.80       25    553   22.12
        3.60       35    422   12.06

I don't understand why cases 1 and 2 show no error yet add no ratio column, and why in cases 3 and 4 the ratio column lands on level 0 rather than the expected level 1. I would also like to know if there is a better/more concise way to achieve this. Case 4 is the best I can do, but I don't like referring to a column by implicit positional indexing instead of by name. Thanks
Case 1:

df[:]['ratio'] = df['Count']['small']/df['Count']['big']

df[:] is a copy of df. They are different objects, each with its own copy of the underlying data:

In [69]: df[:] is df
Out[69]: False

So modifying the copy has no effect on the original df. Since no reference is maintained for df[:], the object is garbage collected after the assignment, making the assignment useless.

Case 2:

df['Count']['ratio'] = df['Count']['small']/df['Count']['big']

uses chained indexing. Avoid chained indexing when making assignments. The documentation linked below explains why assignments using chained indexing on the left-hand side may not affect df. If you set pd.options.mode.chained_assignment = 'warn' (or 'raise'), Pandas will warn you (or raise an error) when chained indexing is used in an assignment:

SettingWithCopyError:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Case 3:

df['ratio'] = df['Count']['small']/df['Count']['big']

and Case 4:

df['ratio'] = df.ix[:,1]/df.ix[:,0]

both work, but it can be done more efficiently using

df['ratio'] = df['Count','small']/df['Count','big']

Here is a microbenchmark showing that indexing with a tuple is faster than chained indexing:

In [99]: %timeit df['Count']['small']
1000 loops, best of 3: 501 µs per loop

In [128]: %timeit df['Count','small']
100000 loops, best of 3: 8.91 µs per loop

If you want ratio to be the level 1 label, then you must tell Pandas that the level 0 label is Count. You can do that by assigning to df['Count','ratio']:

In [96]: df['Count','ratio'] = df['Count','small']/df['Count','big']

In [97]: df
Out[97]:
                Count
Range             big  small       ratio
Density Length
15k     0.6        17    555   32.647059
        1.8        21    141    6.714286
        3.6        26    150    5.769231
20k     0.6        22   5543  251.954545
        1.8        25    553   22.120000
        3.6        35    422   12.057143
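For anyone who wants to reproduce that result, here is a self-contained version of the question's setup plus the recommended tuple assignment; the construction below is an assumption based on the data shown in the question.

import pandas as pd

df = pd.DataFrame({
    'Density': ['15k'] * 6 + ['20k'] * 6,
    'Length':  [0.60, 0.60, 1.80, 1.80, 3.60, 3.60] * 2,
    'Range':   ['small', 'big'] * 6,
    'Count':   [555, 17, 141, 21, 150, 26, 5543, 22, 553, 25, 422, 35],
})
df = df.set_index(['Density', 'Length', 'Range']).unstack('Range')

# The tuple names both column levels at once, so no chained indexing is needed.
df['Count', 'ratio'] = df['Count', 'small'] / df['Count', 'big']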