I have this huge dataset (100M rows) of consumer transactions that looks as follows:
df = pd.DataFrame({'id':[1, 1, 2, 2, 3],'brand':['a','b','a','a','c'], 'date': ['01-01-2020', '01-02-2020', '01-05-2019', '01-06-2019', '01-12-2018']})
For each row (each transaction), I would like to check if the same person (same "id") bought something in the past for a different brand. The resulting dataset should look like this:
id brand date check
0 1 a 01-01-2020 0
1 1 b 01-02-2020 1
2 2 a 01-05-2019 0
3 2 a 01-06-2019 0
4 3 c 01-12-2018 0
Now, my solution was:
def past_transaction(row):
    x = df[(df['id'] == row['id']) & (df['brand'] != row['brand']) & (df['date'] < row['date'])]
    if x.shape[0] > 0:
        return 1
    else:
        return 0
df['check'] = df.apply(past_transaction, axis=1)
This works well, but the performance is abysmal. Is there a more efficient way to do this (with or without Pandas)? Thanks!
I would personally use two boolean masks:
First, check whether the id is duplicated (i.e. the row is not the first one for that id).
Second, check that the id & brand combination is not duplicated.
import numpy as np
s = df.duplicated(subset=['id'],keep='first')
s1 = ~df.duplicated(subset=['id','brand'],keep=False)
df['check'] = np.where(s & s1,1,0)
id brand date check
0 1 a 01-01-2020 0
1 1 b 01-02-2020 1
2 2 a 01-05-2019 0
3 2 a 01-06-2019 0
4 3 c 01-12-2018 0
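One thing to watch with the two-mask approach: a brand that occurs more than once for an id is never flagged, so a purchase history like a, b, a leaves the third row at 0 even though b was bought earlier. Below is a minimal sketch of a vectorized variant that handles that case. It assumes the dates are day-month-year strings, that an id has at most one transaction per date, and that the index is unique; the helper name past_other_brand and its date_format parameter are made up for illustration.

```python
import pandas as pd

def past_other_brand(df, date_format="%d-%m-%Y"):
    """Return 1 where the same id bought a different brand earlier, else 0."""
    work = df[["id", "brand"]].copy()
    # Assumption: dates are day-month-year strings; adjust date_format if not.
    work["date"] = pd.to_datetime(df["date"], format=date_format)
    work = work.sort_values(["id", "date"])
    # The earliest row of every id carries first_brand, so any row with another
    # brand has a different brand before it. A row with first_brand is flagged
    # once some other brand has already appeared for that id.
    first_brand = work.groupby("id")["brand"].transform("first")
    differs = (work["brand"] != first_brand).astype(int)
    seen_other = differs.groupby(work["id"]).cumsum() > 0
    # Restore the original row order (requires a unique index).
    return seen_other.astype(int).reindex(df.index)

df = pd.DataFrame({'id': [1, 1, 2, 2, 3],
                   'brand': ['a', 'b', 'a', 'a', 'c'],
                   'date': ['01-01-2020', '01-02-2020', '01-05-2019',
                            '01-06-2019', '01-12-2018']})
df['check'] = past_other_brand(df)
```

Because it only uses a sort, a groupby transform and a cumulative sum, it avoids the per-row filtering of the original apply.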
A) Use Pandas builtin functions
First step would be to utilize pandas instead of making your own function:
df['check'] = np.logical_and(df.id.duplicated(), ~df[['id','brand']].duplicated())
It will make your code faster already!
B) Take advantage of hardware
Opt in to using all the cores in your machine, if your RAM permits. You can use modin.pandas or a similar alternative for that. I recommend this because it requires minimal code changes and can give a substantial speed-up, depending on your machine's configuration.
C) Big Data Frameworks
If this is a big-data problem, you should probably already be using dask or spark dataframes, which are designed for Big Data; pandas isn't meant to handle such large volumes of data.
Some things I found effective while dealing with a similar problem.
I have the dataset shown below.
What I would like to do is three things.
Step 1: AA to CC are index columns; however, I'm happy to keep them in the dataset for future use.
Step 2: Count the 0 values in each row.
Step 3: If more than 20% of the values in a row are 0 (more than 2 in this case, because DD to MM is 10 columns), remove the row.
So I took a rather stupid approach to achieve the three steps above.
df = pd.read_csv("dataset.csv", header=None)
df_bool = (df == "0")
print(df_bool.sum(axis=1))
Then I got the expected result shown below.
0 0
1 0
2 1
3 0
4 1
5 8
6 1
7 0
So I removed row #5, as shown below.
df2 = df.drop([5], axis=0)
print(df2)
This works, even though it is not an elegant way to go about it.
However, if I import my dataset with header=0, this approach does not work at all.
df = pd.read_csv("dataset.csv", header=0)
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
Why does this happen?
Also, if I wanted to write this with a loop plus count and drop functions, what would the code look like?
You can just continue using boolean indexing:
First we calculate number of columns and number of zeroes per row:
n_columns = len(df.columns) # or df.shape[1]
zeroes = (df == "0").sum(axis=1)
We then select only rows that have less than 20 % zeroes.
proportion_zeroes = zeroes / n_columns
max_20 = proportion_zeroes < 0.20
df[max_20] # This will contain only rows that have less than 20 % zeroes
One liner:
df[((df == "0").sum(axis=1) / len(df.columns)) < 0.2]
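A small sketch of the same filter on made-up data, assuming the values were parsed as numbers (hence == 0 rather than == "0") and that the key column should not count towards the share of zeroes. It uses <= so that a row with exactly 20 % zeroes is kept, which matches "more than 20 %" as the removal rule in the question. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: one key column (AA) and five value columns.
df = pd.DataFrame({
    "AA": ["k1", "k2", "k3"],
    "DD": [0, 1, 0],
    "EE": [1, 0, 0],
    "FF": [1, 1, 0],
    "GG": [1, 1, 1],
    "HH": [1, 1, 1],
})

# Only the value columns take part in the count.
value_cols = df.columns.drop("AA")

# Share of zeroes per row: the mean of a boolean row is the fraction of True.
zero_share = (df[value_cols] == 0).mean(axis=1)

# Keep rows where at most 20 % of the values are zero.
kept = df[zero_share <= 0.20]
```

Here k1 and k2 each have one zero out of five values and are kept, while k3 has three zeroes and is dropped.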
It would have been great if you could have posted how the dataframe looks in pandas rather than a picture of an Excel file. Anyway, constructing a dummy df:
df = pd.DataFrame({'index1':['a','b','c'],'index2':['b','g','f'],'index3':['w','q','z']
,'Col1':[0,1,0],'Col2':[1,1,0],'Col3':[1,1,1],'Col4':[2,2,0]})
Step 1: assigning the index can be done using the .set_index() method, as shown below:
df.set_index(['index1','index2','index3'],inplace=True)
Instead of doing everything manually when it comes to filtering, you can use the result of df_bool.sum(axis=1) directly to filter the dataframe, as shown below:
df.loc[(df==0).sum(axis=1) / (df.shape[1])>0.6]
index1 index2 index3 Col1 Col2 Col3 Col4
c f z 0 0 1 0
Using that you can drop those rows. Assuming a 20% threshold, you would use:
df = df.loc[(df==0).sum(axis=1) / (df.shape[1])<0.2]
When it comes to the header issue, it's a bit difficult to answer without seeing what the file or dataframe looks like.
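On the header question, one likely cause (an assumption, since the file itself is not shown): with header=None the row of column names is read as data, so every column holds strings and == "0" matches. With header=0 the value columns are parsed as integers, and an integer never equals the string "0". A small sketch with a made-up CSV:

```python
import io
import pandas as pd

# Hypothetical file contents: a header row, one key column, two value columns.
csv_text = "AA,DD,EE\nx,0,1\ny,2,0\n"

# header=None: the names row becomes data, so all columns are read as strings.
raw = pd.read_csv(io.StringIO(csv_text), header=None)
# header=0: the first row supplies the names and DD/EE are parsed as integers.
typed = pd.read_csv(io.StringIO(csv_text), header=0)

str_hits_raw = int((raw == "0").sum().sum())      # string zeroes found
str_hits_typed = int((typed == "0").sum().sum())  # none: the zeroes are ints
# Compare against the number 0, restricted to the numeric columns.
int_hits_typed = int((typed.select_dtypes("number") == 0).sum().sum())
```

If that is what happens with the real file, comparing against 0 instead of "0" after reading with header=0 should give the expected counts.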
The following is a subset of a data frame:
drug_id WD
lexapro.1 flu-like symptoms
lexapro.1 dizziness
lexapro.1 headache
lexapro.14 Dizziness
lexapro.14 headaches
lexapro.23 extremely difficult
lexapro.32 cry at anything
lexapro.32 Anxiety
I need to generate a column id based on the values in drug_id as follows:
id drug_id WD
1 lexapro.1 flu-like symptoms
1 lexapro.1 dizziness
1 lexapro.1 headache
2 lexapro.14 Dizziness
2 lexapro.14 headaches
3 lexapro.23 extremely difficult
4 lexapro.32 cry at anything
4 lexapro.32 Anxiety
I think I need to group them based on drug_id and then generate id based on the size of each group, but I do not know how to do it.
The shift+cumsum pattern mentioned by Boud is good, just make sure to sort by drug_id first. So something like,
df = df.sort_values('drug_id')
df['id'] = (df['drug_id'] != df['drug_id'].shift()).cumsum()
A different approach that does not involve sorting your dataframe would be to map a number to each unique drug_id.
uid = df['drug_id'].unique()
id_map = dict((x, y) for x, y in zip(uid, range(1, len(uid)+1)))
df['id'] = df['drug_id'].map(id_map)
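The same mapping can also be sketched with pd.factorize, which numbers the distinct values in order of first appearance starting at 0, so adding 1 gives the ids asked for. The sample frame below is rebuilt from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "drug_id": ["lexapro.1", "lexapro.1", "lexapro.1", "lexapro.14",
                "lexapro.14", "lexapro.23", "lexapro.32", "lexapro.32"],
})

# codes holds one integer per row; uniques holds the distinct drug_id values
# in order of first appearance.
codes, uniques = pd.factorize(df["drug_id"])
df["id"] = codes + 1
```

Like the dict-based version, this does not require the frame to be sorted.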
Use the shift+cumsum pattern:
(df.drug_id!=df.drug_id.shift()).cumsum()
Out[5]:
0 1
1 1
2 1
3 2
4 2
5 3
6 4
7 4
Name: drug_id, dtype: int32
I have a DF with 10K columns and 70 million rows. I want to calculate the mean and corr of the 10K columns. I wrote the code below, but it won't work due to the 64KB code size issue (https://issues.apache.org/jira/browse/SPARK-16845)
Data:
region  dept  week  sal  val1  val2  val3  ...  val10000
US      CS    1     1    2     1     1     ...  2
US      CS    2     1.5  2     3     1     ...  2
US      CS    3     1    2     2     2.1   ...  2
US      ELE   1     1.1  2     2     2.1   ...  2
US      ELE   2     2.1  2     2     2.1   ...  2
US      ELE   3     1    2     1     2     ...  2
UE      CS    1     2    2     1     2     ...  2
Code:
aggList = [func.mean(col) for col in df.columns] #exclude keys
df2= df.groupBy('region', 'dept').agg(*aggList)
Code 2:
aggList = [func.corr('sal', col).alias(col) for col in df.columns] #exclude keys
df2 = df.groupBy('region', 'dept', 'week').agg(*aggList)
This fails. Is there an alternative way to work around this bug? Has anyone tried a DF with 10K columns? Are there any suggestions for improving performance?
We also ran into the 64KB issue, but in a where clause, which is filed under another bug report. As a workaround, we simply do the operations/transformations in several steps.
In your case, this would mean that you don't do all the aggregations in one step. Instead, loop over the relevant columns in an outer operation:
Use select to create a temporary dataframe that contains only the columns you need for the operation.
Use groupBy and agg as you did, but for just one aggregation rather than a list (or two; you could combine the mean and corr).
Once you have references to all the temporary dataframes, use withColumn to append the aggregated columns from the temporary dataframes to a result df.
Due to the lazy evaluation of a Spark DAG, this is of course slower than doing it in one operation. But it should evaluate the whole analysis in one run.
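The steps above could be sketched roughly as follows. This is a variation, not the exact recipe: it aggregates a chunk of columns per step instead of a single column, and it stitches the partial results together with a join on the grouping keys rather than withColumn. The helper names chunked and mean_in_chunks and the chunk size of 500 are assumptions for illustration, and the right chunk size would have to be found by experiment.

```python
from functools import reduce

def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def mean_in_chunks(df, keys, value_cols, chunk_size=500):
    """Compute grouped means of a wide Spark DataFrame one chunk at a time."""
    # Imported here so that chunked() can be used without pyspark installed.
    from pyspark.sql import functions as func

    parts = []
    for cols in chunked(value_cols, chunk_size):
        # Temporary dataframe holding only the keys and this chunk's columns.
        narrow = df.select(*keys, *cols)
        aggs = [func.mean(c).alias(c) for c in cols]
        parts.append(narrow.groupBy(*keys).agg(*aggs))
    # Join the partial results back together on the grouping keys.
    return reduce(lambda left, right: left.join(right, on=keys), parts)
```

A call might look like mean_in_chunks(df, ['region', 'dept'], value_cols), where value_cols is the list of column names without the keys.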
Having a DF with columns A and B, I would like to add additional column C which will include the combination of A and B values per row. I.e., if I have a DF:
A B
0 1 1
1 1 2
2 2 1
3 2 2
I would like to create:
A B C
0 1 1 1_1
1 1 2 1_2
2 2 1 2_1
3 2 2 2_2
Obviously, I can go over all rows of the DF and just merge the values. Which is very SLOW for large tables. I can also use .unique() for columns A and B and iterate over all combinations, creating vectors col1_un and col2_un respectively, and then updating the relevant indexes in the table using something like
cols_2_merge = ['A','B']
col1_un = DF[cols_2_merge[0]].unique()
col2_un = DF[cols_2_merge[1]].unique()
for i in range(len(col1_un)):
    try:
        ind1 = np.where(DF[cols_2_merge[0]].str.contains(col1_un[i], na=False))[0]
    except:
        ind1 = np.where(DF[cols_2_merge[0]] == col1_un[i])[0]
    for j in range(len(col2_un)):
        try:
            ind2 = np.where(DF[cols_2_merge[1]].str.contains(col2_un[j], na=False))[0]
        except:
            ind2 = np.where(DF[cols_2_merge[1]] == col2_un[j])[0]
        new_ind = col1_un[i] + '-' + col2_un[j]
        tmp_ind = np.in1d(ind1, ind2)
        ind = ind1[tmp_ind]
        if len(ind) > 0:
            DF[new_col_name][ind] = new_ind
This is still SLOW. I can play with it a bit more not searching over the entire DF but reducing the field of search to indexes that weren't changed thus far. Still SLOW.
There is the option of group by that does exactly what I want, finding all unique pairs of combinations of the two columns and it's relatively fast, but I haven't figured out how to access the index of the original DF for each group.
Help please?
You can do this without using groupby; just use the fact that on strings + means concatenation, and that pandas does this elementwise on Series:
df['C'] = df['A'].astype(str) + '_' + df['B'].astype(str)
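If more than two columns need to be combined, the same elementwise + can be folded over a list of columns. A minimal sketch; the helper name join_columns and its sep parameter are made up for illustration.

```python
from functools import reduce

import pandas as pd

def join_columns(df, cols, sep="_"):
    """Concatenate the given columns row by row, as strings joined by sep."""
    parts = [df[c].astype(str) for c in cols]
    # Each + is a vectorized string concatenation on whole Series.
    return reduce(lambda left, right: left + sep + right, parts)

df = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 2], "D": ["x", "y", "x", "y"]})
df["C"] = join_columns(df, ["A", "B"])
df["ABD"] = join_columns(df, ["A", "B", "D"])
```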
@joris - thank you very much.
It did work, of course! And FAST, I should add :-)
For more complicated group-based combinations one can use
GB = DF[cols_2_merge].groupby(cols_2_merge)
for i in GB.groups:
    pass  # DO WHATEVER YOU WANT...
Thanks again!
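On the earlier question of how to get back to the original index for each group: GroupBy.groups is a dict that maps each group key to the index labels of its rows, and ngroup() numbers the groups per row. A short sketch on made-up data:

```python
import pandas as pd

DF = pd.DataFrame({"A": [1, 1, 2, 2, 1], "B": [1, 2, 1, 2, 1]})
cols_2_merge = ["A", "B"]
GB = DF.groupby(cols_2_merge)

# Map every (A, B) combination to the labels of the rows that carry it.
group_rows = {key: list(labels) for key, labels in GB.groups.items()}

# Alternatively, give every row the integer number of its group.
DF["group_id"] = GB.ngroup()
```

The labels in group_rows can be passed to DF.loc to select or update the rows of one group.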
I have a large DataFrame (a million+ records) that I use to store the core of my data (like a database), and a smaller DataFrame (1 to 2,000 records) from which I copy a few columns at each time step in my program, which can run for several thousand time steps. Both DataFrames are indexed the same way, by an id column.
The code I'm using is:
df_large.loc[new_ids, core_cols] = df_small.loc[new_ids, core_cols]
Here core_cols is a list of about 10 fields that I'm copying over and new_ids are the ids from the small DataFrame. This code works fine, but it is the slowest part of my code by a magnitude of three. I just wanted to know if there is a faster way to merge the data of the two DataFrames together.
I tried merging the data each time with the merge function, but the process took way too long, which is why I have switched to creating a larger DataFrame that I update to improve the speed.
There is nothing inherently slow about using .loc to set with an alignable frame, though it does go through a bit of code to cover a lot of cases, so it's probably not ideal to have in a tight loop. FYI, this example is slightly different than the 2nd example.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from pandas import DataFrame
In [4]: df = DataFrame(1.,index=list('abcdefghij'),columns=[0,1,2])
In [5]: df
Out[5]:
0 1 2
a 1 1 1
b 1 1 1
c 1 1 1
d 1 1 1
e 1 1 1
f 1 1 1
g 1 1 1
h 1 1 1
i 1 1 1
j 1 1 1
[10 rows x 3 columns]
In [6]: df2 = DataFrame(0,index=list('afg'),columns=[1,2])
In [7]: df2
Out[7]:
1 2
a 0 0
f 0 0
g 0 0
[3 rows x 2 columns]
In [8]: df.loc[df2.index,df2.columns] = df2
In [9]: df
Out[9]:
0 1 2
a 1 0 0
b 1 1 1
c 1 1 1
d 1 1 1
e 1 1 1
f 1 0 0
g 1 0 0
h 1 1 1
i 1 1 1
j 1 1 1
[10 rows x 3 columns]
Here's an alternative. It may or may not fit your data pattern. If the updates (your small frame) are pretty much independent this would work (IOW you are not updating the big frame, then picking out a new sub-frame, then updating, etc. - if this is your pattern, then using .loc is about right).
Instead of updating the big frame, update the small frame with the columns from the big frame, e.g.:
In [10]: df = DataFrame(1.,index=list('abcdefghij'),columns=[0,1,2])
In [11]: df2 = DataFrame(0,index=list('afg'),columns=[1,2])
In [12]: needed_columns = df.columns.difference(df2.columns)
In [13]: df2[needed_columns] = df.reindex(index=df2.index,columns=needed_columns)
In [14]: df2
Out[14]:
1 2 0
a 0 0 1
f 0 0 1
g 0 0 1
[3 rows x 3 columns]
In [15]: df3 = DataFrame(0,index=list('cji'),columns=[1,2])
In [16]: needed_columns = df.columns.difference(df3.columns)
In [17]: df3[needed_columns] = df.reindex(index=df3.index,columns=needed_columns)
In [18]: df3
Out[18]:
1 2 0
c 0 0 1
j 0 0 1
i 0 0 1
[3 rows x 3 columns]
And concat everything together when you want (they are kept in a list in the mean time, or see my comments below, these sub-frames could be moved to external storage when created, then read back before this concatenating step).
In [19]: pd.concat([ df.reindex(index=df.index.difference(df2.index).difference(df3.index)), df2, df3]).reindex_like(df)
Out[19]:
0 1 2
a 1 0 0
b 1 1 1
c 1 0 0
d 1 1 1
e 1 1 1
f 1 0 0
g 1 0 0
h 1 1 1
i 1 0 0
j 1 0 0
[10 rows x 3 columns]
The beauty of this pattern is that it is easily extended to using an actual db (or much better an HDFStore), to actually store the 'database', then creating/updating sub-frames as needed, then writing out to a new store when finished.
I use this pattern all of the time, though with Panels actually.
Perform a computation on a subset of the data and write each result to a separate file.
Then at the end read them all in and concat (in memory), and write out a gigantic new file. The concat step could be done all at once in memory or, if it is truly a large task, iteratively.
I am able to use multiple processes to perform my computations AND write each individual Panel to a separate file, as they are all completely independent. The only dependent part is the concat.
This is essentially a map-reduce pattern.
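A compact sketch of that pattern on the toy frames from the session above, written against current pandas (Index.difference instead of subtracting indexes). The helper name complete is made up for illustration, and the sub-frames are kept in a list here rather than written to files.

```python
import pandas as pd

big = pd.DataFrame(1.0, index=list("abcdefghij"), columns=[0, 1, 2])
updates = [
    pd.DataFrame(0.0, index=list("afg"), columns=[1, 2]),
    pd.DataFrame(0.0, index=list("cji"), columns=[1, 2]),
]

def complete(sub, big):
    """Add the columns that sub lacks, taken from the matching rows of big."""
    missing = big.columns.difference(sub.columns)
    return sub.join(big.loc[sub.index, missing])

# "Map" step: each sub-frame is completed independently of the others.
completed = [complete(sub, big) for sub in updates]

# "Reduce" step: rows nobody touched plus all completed sub-frames,
# put back into the row and column order of the big frame.
touched = pd.concat(completed).index
untouched = big.drop(index=touched)
result = pd.concat([untouched] + completed).reindex_like(big)
```

This only works as shown when the sub-frames do not overlap, which is the independence assumption stated in the answer.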
Quickly: Copy columns a and b from the old df into a new df.
df1 = df[['a', 'b']].copy()
I've had to copy between large dataframes a fair bit. I'm using dataframes with realtime market data, which may not be what pandas is designed for, but this is my experience.
On my pc, copying a single datapoint with .at takes 15µs with the df size making negligible difference. .loc takes a minimum of 550µs and increases as the df gets larger: 3100µs to copy a single point from one 100000x2 df to another. .ix seems to be just barely faster than .loc.
For a single datapoint .at is very fast and is not impacted by the size of the dataframe, but it cannot handle ranges so loops are required, and as such the time scaling is linear. .loc and .ix on the other hand are (relatively) very slow for single datapoints, but they can handle ranges and scale up better than linearly. However, unlike .at they slow down significantly wrt dataframe size.
Therefore when I'm frequently copying small ranges between large dataframes, I tend to use .at with a for loop, and otherwise I use .ix with a range.
for new_id in new_ids:
    for core_col in core_cols:
        df_large.at[new_id, core_col] = df_small.at[new_id, core_col]
Of course, to do it properly I'd go with Jeff's solution above, but it's nice to have options.
Caveats of .at: it doesn't work with ranges, and it doesn't work if the dtype is datetime (and maybe others).
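Timings like these depend heavily on the pandas version and the machine, so it is worth measuring on your own data. A minimal sketch of such a comparison, using made-up frame sizes and column names; it only checks that both approaches write the same values and leaves the interpretation of the timings to you.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
core_cols = ["x", "y"]
df_large = pd.DataFrame(rng.random((100_000, 2)), columns=core_cols)
df_small = pd.DataFrame(rng.random((5, 2)), columns=core_cols,
                        index=[10, 20, 30, 40, 50])
new_ids = df_small.index

def copy_with_at(target):
    # One scalar assignment per cell.
    for new_id in new_ids:
        for core_col in core_cols:
            target.at[new_id, core_col] = df_small.at[new_id, core_col]
    return target

def copy_with_loc(target):
    # One aligned block assignment.
    target.loc[new_ids, core_cols] = df_small.loc[new_ids, core_cols]
    return target

via_at = copy_with_at(df_large.copy())
via_loc = copy_with_loc(df_large.copy())

# Time both on a scratch copy so df_large itself stays unchanged.
scratch = df_large.copy()
t_at = timeit.timeit(lambda: copy_with_at(scratch), number=20)
t_loc = timeit.timeit(lambda: copy_with_loc(scratch), number=20)
print(f".at loop: {t_at:.4f}s   .loc block: {t_loc:.4f}s")
```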