Optimization of date subtraction on large dataframe - Pandas - python

I'm a beginner learning Python. I have a very large dataset - I'm having trouble optimizing my code to make this run faster.
My goal is to optimize all of the following (my current code works, but it is slow):
Subtract two date columns
Create new column with the result of that subtraction
Remove original two columns
Do all of this in a fast manner
Random finds:
Thinking about changing the initial file read method...
https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file
I have parse_dates=True when reading the CSV file - so could this be a slowdown? I have 50+ columns but only 1 timestamp column and 1 year column.
This column:
saledate
1 3/26/2004 0:00
2 2/26/2004 0:00
3 5/19/2011 0:00
4 7/23/2009 0:00
5 12/18/2008 0:00
Subtracted by (Should this be converted to a format like 1/1/1996?):
YearMade
1 1996
2 2001
3 2001
4 2007
5 2004
Current code:
mean_YearMade = dfx[dfx['YearMade'] > 1000]['YearMade'].mean()

def age_at_sale(df, mean_YearMade):
    '''
    INPUT: DataFrame
    OUTPUT: DataFrame
    Adds a column called Age_at_Sale
    '''
    # The column has tons of erroneous years equal to 1000
    df.loc[:, 'YearMade'][df['YearMade'] == 1000] = mean_YearMade
    df['saledate'] = pd.to_datetime(df['saledate'])
    df['saleyear'] = df['saledate'].dt.year
    df['Age_at_Sale'] = df['saleyear'] - df['YearMade']
    df = df.drop('saledate', axis=1)
    df = df.drop('YearMade', axis=1)
    df = df.drop('saleyear', axis=1)
    return df
Any optimization tricks would be much appreciated...

You can try sub for the subtraction, and for selecting by condition use loc with a mask like dfx['YearMade'] > 1000. Also, creating the saleyear column is not necessary.
dfx['saledate'] = pd.to_datetime(dfx['saledate'])
mean_YearMade = dfx.loc[dfx['YearMade'] > 1000, 'YearMade'].mean()

def age_at_sale(df, mean_YearMade):
    '''
    INPUT: DataFrame
    OUTPUT: DataFrame
    Adds a column called Age_at_Sale
    '''
    df.loc[df['YearMade'] == 1000, 'YearMade'] = mean_YearMade
    df['Age_at_Sale'] = df['saledate'].dt.year.sub(df['YearMade'])
    df = df.drop(['saledate', 'YearMade'], axis=1)
    return df
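Put together, the approach above can be run end to end. A minimal sketch on a few made-up rows (the column names match the question, but the values here are invented):

```python
import pandas as pd

# Toy data in the question's shape; YearMade kept as float so the
# mean can be written into the sentinel rows cleanly.
dfx = pd.DataFrame({
    'saledate': ['3/26/2004 0:00', '2/26/2004 0:00', '5/19/2011 0:00'],
    'YearMade': [1996.0, 1000.0, 2001.0],
})

dfx['saledate'] = pd.to_datetime(dfx['saledate'])
mean_YearMade = dfx.loc[dfx['YearMade'] > 1000, 'YearMade'].mean()

# Replace the erroneous 1000 sentinel, then subtract in one vectorized step.
dfx.loc[dfx['YearMade'] == 1000, 'YearMade'] = mean_YearMade
dfx['Age_at_Sale'] = dfx['saledate'].dt.year.sub(dfx['YearMade'])
dfx = dfx.drop(['saledate', 'YearMade'], axis=1)
print(dfx['Age_at_Sale'].tolist())  # [8.0, 5.5, 10.0]
```

No Python-level loop runs over the rows, which is where the speedup on a large frame comes from.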

Related

Cannot set a DataFrame with multiple columns to the single column total_servings

I am a beginner getting familiar with pandas.
It throws an error when I try to create a new column this way:
drinks['total_servings'] = drinks.loc[:, 'beer_servings':'wine_servings'].apply(calculate, axis=1)
Below is my code, and I get the following error for line number 9:
"Cannot set a DataFrame with multiple columns to the single column total_servings"
Any help or suggestion would be appreciated :)
import pandas as pd

drinks = pd.read_csv('drinks.csv')

def calculate(drink):
    return drinks['beer_servings'] + drinks['spirit_servings'] + drinks['wine_servings']

print(drinks)
drinks['total_servings'] = drinks.loc[:, 'beer_servings':'wine_servings'].apply(calculate, axis=1)
drinks['beer_sales'] = drinks['beer_servings'].apply(lambda x: x*2)
drinks['spirit_sales'] = drinks['spirit_servings'].apply(lambda x: x*4)
drinks['wine_sales'] = drinks['wine_servings'].apply(lambda x: x*6)
drinks
In your code, when the function calculate is called with axis=1, each row of the DataFrame is passed as the argument. Here, the function calculate is returning a DataFrame with multiple columns, but you are trying to assign it to a single column, which is not possible. You can try updating your code to this:
def calculate(each_row):
    return each_row['beer_servings'] + each_row['spirit_servings'] + each_row['wine_servings']

drinks['total_servings'] = drinks.apply(calculate, axis=1)
drinks['beer_sales'] = drinks['beer_servings'].apply(lambda x: x*2)
drinks['spirit_sales'] = drinks['spirit_servings'].apply(lambda x: x*4)
drinks['wine_sales'] = drinks['wine_servings'].apply(lambda x: x*6)
print(drinks)
I suppose the reason is the wrong argument name inside the calculate method. The declared argument is drink, but drinks is used to calculate the sum of the columns.
The difference matters: drink is a Series object that represents a row, and the sum of its elements is a scalar. Meanwhile drinks is the whole DataFrame, so the sum of its columns is a Series object.
Sample code shows that this method works.
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 1, 1],
    "B": [2, 2, 2, 2, 2],
    "C": [3, 3, 3, 3, 3]
})

def calculate(to_calc_df):
    return to_calc_df["A"] + to_calc_df["B"] + to_calc_df["C"]

df["total"] = df.loc[:, "A":"C"].apply(calculate, axis=1)
print(df)
Result
A B C total
0 1 2 3 6
1 1 2 3 6
2 1 2 3 6
3 1 2 3 6
4 1 2 3 6
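For a plain sum like this, the row-wise apply can also be replaced with the vectorized sum(axis=1) over the same column slice, which gives identical results and is much faster on large frames. A sketch on the same toy columns as above:

```python
import pandas as pd

# Same toy frame as in the sample code above.
df = pd.DataFrame({
    "A": [1, 1, 1, 1, 1],
    "B": [2, 2, 2, 2, 2],
    "C": [3, 3, 3, 3, 3]
})

# Add across the selected columns for every row at once, no apply needed.
df["total"] = df.loc[:, "A":"C"].sum(axis=1)
print(df["total"].tolist())  # [6, 6, 6, 6, 6]
```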

pandas dataframe and external list interaction

I have a pandas dataframe df which looks like this
betasub0 betasub1 betasub2 betasub3 betasub4 betasub5 betasub6 betasub7 betasub8 betasub9 betasub10
0 0.009396 0.056667 0.104636 0.067066 0.009678 0.019402 0.029316 0.187884 0.202597 0.230275 0.083083
1 0.009829 0.058956 0.108205 0.068956 0.009888 0.019737 0.029628 0.187611 0.197627 0.225660 0.083903
2 0.009801 0.058849 0.108092 0.068927 0.009886 0.019756 0.029690 0.188627 0.200235 0.224703 0.081434
3 0.009938 0.059595 0.109310 0.069609 0.009970 0.019896 0.029854 0.189187 0.199424 0.221968 0.081249
4 0.009899 0.059373 0.108936 0.069395 0.009943 0.019852 0.029801 0.188979 0.199893 0.222922 0.081009
Then I have a vector dk that looks like this:
[0.18,0.35,0.71,1.41,2.83,5.66,11.31,22.63,45.25,90.51,181.02]
What I need to do is:
calculate a new vector which is
psik = [np.log2(dki/1e3) for dki in dk]
calculate the sum of each row multiplied with the psik vector (just like Excel's SUMPRODUCT function) - that gives psig
raise 2 to each psig value to get dg
expected output should be:
betasub0 betasub1 betasub2 betasub3 betasub4 betasub5 betasub6 betasub7 betasub8 betasub9 betasub10 psig dg
0 0.009396 0.056667 0.104636 0.067066 0.009678 0.019402 0.029316 0.187884 0.202597 0.230275 0.083083 -5.848002631 0.017361042
1 0.009829 0.058956 0.108205 0.068956 0.009888 0.019737 0.029628 0.187611 0.197627 0.22566 0.083903 -5.903532822 0.016705502
2 0.009801 0.058849 0.108092 0.068927 0.009886 0.019756 0.02969 0.188627 0.200235 0.224703 0.081434 -5.908820802 0.016644383
3 0.009938 0.059595 0.10931 0.069609 0.00997 0.019896 0.029854 0.189187 0.199424 0.221968 0.081249 -5.930608559 0.016394906
4 0.009899 0.059373 0.108936 0.069395 0.009943 0.019852 0.029801 0.188979 0.199893 0.222922 0.081009 -5.924408689 0.016465513
I would do that with a for loop cycling over the rows like this
psig, dg = [], []
for r in rows:
    psig_i = sum([psik[i] * ri for i, ri in enumerate(r)])
    psig.append(psig_i)
    dg.append(np.exp2(psig_i))
df['psig'] = psig
df['dg'] = dg
Is there any other way to update the df without iterating through its rows?
EDIT: I found the solution and I am ashamed of how simple it is
df['psig'] = df.mul(psik).sum(axis=1)
df['dg'] = df['psig'].apply(lambda x: np.exp2(x))
EDIT2: now my df has more entries, so I have to filter it with a regex to find only the columns with a name starting with "betasub".
I have my array psik and a new column `psig` in the df. For each row (i.e. each value of `psig`) I would like to calculate:
sum(((psik - psig)**2) * betasub[0...n])
I did it like this, but maybe there's a better way?
PsimPsig2 = [[(psik_i - psig_i)**2 for psik_i in psik] for psig_i in list(df['psig'])]
psikmpsigname = ['psikmpsig' + str(i) for i in range(len(psik))]
dfPsimPsig2 = pd.DataFrame(data=PsimPsig2, columns=psikmpsigname)
betas = df.filter(regex=r'^betasub[0-9]', axis=1)
siggAL = np.power(2, np.power(pd.DataFrame(betas.values * dfPsimPsig2.values).sum(axis=1), 0.5))
df['siggAL'] = siggAL
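One possible better way: the sum(((psik - psig)**2) * betasub[0...n]) step can be done with NumPy broadcasting, with no intermediate list of lists or helper DataFrame. A sketch with made-up betasub/psig values (two rows, two betasub columns, not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame: two betasub columns plus the per-row psig values.
df = pd.DataFrame({'betasub0': [0.4, 0.5],
                   'betasub1': [0.6, 0.5],
                   'psig': [-5.8, -5.9]})
psik = np.array([-2.0, -1.0])

beta = df.filter(regex=r'^betasub[0-9]', axis=1).to_numpy()    # shape (rows, n)
diff2 = (psik[None, :] - df['psig'].to_numpy()[:, None]) ** 2  # shape (rows, n)
weighted = (beta * diff2).sum(axis=1)                          # one value per row
print(weighted)  # approximately [19.6, 19.61]
```

The `[None, :]` / `[:, None]` indexing lines psik up against each row's psig so the subtraction happens for every (row, i) pair at once.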

create pandas dataframe, then update only last day/row without calling (ie omitting) the first part of the function - all in one statement

As the title suggests, I'm not even sure how to word the question. :D
But here it is, "simply put":
I) I would like to create a df on day x, and II) from the next day onwards (x+1 ... x+n) I would like to update just day x+n without touching the creation part (I) - and all that by calling only one function. So basically "just" append the row for the day the function is called (there is no need to recreate the df since it is already there). Is there a possibility to do all that in one statement?
It would look something like this:
import pandas as pd
def pull_data():
    data = {'DATE': ['2020-05-01', '2020-05-02', '2020-05-03', '2020-05-04'],
            'X': [400, 300, 200, 100],
            'Y': [100, 200, 300, 400]}
    df = pd.DataFrame(data, columns=['DATE', 'X', 'Y'])
    return df
data_ = pull_data()
Let's say I call this function on 2020-05-04 --> but now on the next day I want it to automatically ONLY attach 2020-05-05 without creating the whole data frame again.
Does my whole question make any sense/is it comprehensible? I'd be happy about every input! :)
Based on the dataframe and its integer index, you can append a row using the shape of the dataframe with loc:
from datetime import datetime
data_ = pull_data()
value_X = 0
value_Y = 1
data_.loc[data_.shape[0]] = [datetime.now().date(), value_X, value_Y]
data_
# DATE X Y
# 0 2020-05-01 400 100
# 1 2020-05-02 300 200
# 2 2020-05-03 200 300
# 3 2020-05-04 100 400
# 4 2020-05-06 0 1
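To get the "create once, then only append today" behaviour from a single function, one option is a function that seeds the frame on its first call and otherwise only appends the current day if it is missing. A sketch under assumptions: the function name and the placeholder 0/1 values are invented, and in practice the new X/Y values would come from wherever pull_data gets its data:

```python
import pandas as pd
from datetime import date

def update_data(df=None):
    if df is None:                        # first call: build the frame
        df = pd.DataFrame(columns=['DATE', 'X', 'Y'])
    today = str(date.today())
    if today not in df['DATE'].values:    # later calls: append today's row only
        df.loc[df.shape[0]] = [today, 0, 1]   # placeholder X/Y values
    return df

data_ = update_data()        # creates the frame and adds today's row
data_ = update_data(data_)   # called again the same day: nothing is appended
print(len(data_))  # 1
```

Called once per day, each call adds exactly one new row, and the creation step never runs again.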

Pandas Advanced: How to get results for customer who has bought at least twice within 5 days of period?

I have been attempting to solve a problem for hours and I'm stuck on it. Here is the problem outline:
import numpy as np
import pandas as pd

df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
                   'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
                   'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13', '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})
df
orderid customerid orderdate
0 10315 ISLAT 1996-09-26
1 10318 ISLAT 1996-10-01
2 10321 ISLAT 1996-10-03
3 10473 ISLAT 1997-03-13
4 10621 ISLAT 1997-08-05
5 10253 HANAR 1996-07-10
6 10541 HANAR 1997-05-19
7 10645 HANAR 1997-08-26
I would like to select all the customers who have ordered items more than once WITHIN 5 DAYS.
For example, here only one customer ordered within a 5-day period, and he has done it twice.
I would like to get the output in the following format:
Required Output
customerid initial_order_id initial_order_date nextorderid nextorderdate daysbetween
ISLAT 10315 1996-09-26 10318 1996-10-01 5
ISLAT 10318 1996-10-01 10321 1996-10-03 2
First, to be able to count the difference in days, convert the orderdate column to datetime:
df.orderdate = pd.to_datetime(df.orderdate)
Then define the following function:
def fn(grp):
    return grp[(grp.orderdate.shift(-1) - grp.orderdate) / np.timedelta64(1, 'D') <= 5]
And finally apply it:
df.sort_values(['customerid', 'orderdate']).groupby('customerid').apply(fn)
It is a bit tricky because there can be any number of purchase pairs within the 5-day windows. It is a good use case for merge_asof, which lets you do approximate-but-not-exact matching of a dataframe with itself.
Input data
import pandas as pd

df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
                   'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
                   'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13', '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})
Define a function that computes the pairs of purchases, given data for a customer.
def compute_purchase_pairs(df):
    # Approximate self-join on the date, but not exact.
    df_combined = pd.merge_asof(df, df, left_index=True, right_index=True,
                                suffixes=('_first', '_second'),
                                allow_exact_matches=False)
    # Compute the difference
    df_combined['timedelta'] = df_combined['orderdate_first'] - df_combined['orderdate_second']
    return df_combined
Do the preprocessing and compute the pairs
# Convert to datetime
df['orderdate'] = pd.to_datetime(df['orderdate'])
# Sort dataframe from oldest to newest buy (groupby will not change this
# order, and merge_asof needs its keys sorted ascending)
df2 = df.sort_values(by='orderdate')
# Create an index for joining
df2 = df2.set_index('orderdate', drop=False)
# Compute purchase pairs for each customer
df_differences = df2.groupby('customerid').apply(compute_purchase_pairs)
# Show only the ones we care about
result = df_differences[df_differences['timedelta'].dt.days<=5]
result.reset_index(drop=True)
Result
orderid_first customerid_first orderdate_first orderid_second \
0 10318 ISLAT 1996-10-01 10315.0
1 10321 ISLAT 1996-10-03 10318.0
customerid_second orderdate_second timedelta
0 ISLAT 1996-09-26 5 days
1 ISLAT 1996-10-01 2 days
You can create the column 'daysbetween' with sort_values and diff. Then, to put each order next to the customer's following order, join df with a groupby-per-customerid copy of df shifted by one row. Finally, query the rows where the condition on 'daysbetween_next' is met:
df['daysbetween'] = df.sort_values(['customerid', 'orderdate'])['orderdate'].diff().dt.days

df_final = (df.join(df.groupby('customerid').shift(-1),
                    lsuffix='_initial', rsuffix='_next')
              .drop('daysbetween_initial', axis=1)
              .query('daysbetween_next <= 5 and daysbetween_next >= 0'))
It's quite simple. Let's write down the requirements one at a time and try to build upon them.
First, I guess that the customer has a unique id, since it's not specified. We'll use that id to identify customers.
Second, I assume it does not matter whether the customer bought 5 days before or after.
My solution is to use a simple filter. Note that this solution could also be implemented in a SQL database.
As a condition, we require the user to be the same. We can achieve this as follows:
new_df = df[df["ID"] == df["ID"].shift(1)]
We create a new DataFrame, namely new_df, with all rows such that the xth row has the same user id as the (x-1)th row (i.e. the previous row).
Now, let's search for purchases within the 5 days by adding that condition to the previous piece of code (the parentheses matter because & binds more tightly than the comparisons, and .dt.days turns the timedelta into a number):
new_df = df[(df["ID"] == df["ID"].shift(1)) & ((df["Date"] - df["Date"].shift(1)).dt.days <= 5)]
This should do the work. I cannot test it right now, so some fixes may still be needed. I'll try to test it as soon as I can.
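The sort/groupby/diff idea running through these answers can be checked end to end on the question's own data. A compact sketch (it returns the second order of each close pair, matching the nextorderid column of the required output):

```python
import pandas as pd

# The question's data, with orderdate already parsed to datetime.
df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621,
                               10253, 10541, 10645],
                   'customerid': ['ISLAT'] * 5 + ['HANAR'] * 3,
                   'orderdate': pd.to_datetime(
                       ['1996-09-26', '1996-10-01', '1996-10-03',
                        '1997-03-13', '1997-08-05', '1996-07-10',
                        '1997-05-19', '1997-08-26'])})

# Sort per customer, then diff consecutive order dates within each customer.
df = df.sort_values(['customerid', 'orderdate'])
gap = df.groupby('customerid')['orderdate'].diff().dt.days

# Keep follow-up orders that arrive within 5 days of the previous one.
repeat_orders = df[gap <= 5]
print(repeat_orders['orderid'].tolist())  # [10318, 10321]
```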

Spliting DataFrame into Multiple Frames by Dates Python

I fully understand there are a few versions of this questions out there, but none seem to get at the core of my problem. I have a pandas Dataframe with roughly 72,000 rows from 2015 to now. I am using a calculation that finds the most impactful words for a given set of text (tf_idf). This calculation does not account for time, so I need to break my main Dataframe down into time-based segments, ideally every 15 and 30 days (or n days really, not week/month), then run the calculation on each time-segmented Dataframe in order to see and plot what words come up more and less over time.
I have been able to build part of this out semi-manually with the following:
def dateRange():
    start = input("Enter a start date (MM-DD-YYYY) or '30' for last 30 days: ")
    if start != '30':
        datetime.strptime(start, '%m-%d-%Y')
        end = input("Enter an end date (MM-DD-YYYY): ")
        datetime.strptime(end, '%m-%d-%Y')
        dataTime = data[(data['STATUSDATE'] > start) & (data['STATUSDATE'] <= end)]
    else:
        dataTime = data[data.STATUSDATE > datetime.now() - pd.to_timedelta('30day')]
    return dataTime

dataTime = dateRange()
dataTime2 = dateRange()

def calcForDateRange(dateRangeFrame):
    ##### LONG FUNCTION ####
    return word and number

calcForDateRange(dataTime)
calcForDateRange(dataTime2)
This works - however, I have to manually create the 2 dates which is expected as I created this as a test. How can I split the Dataframe by increments and run the calculation for each dataframe?
dicts are allegedly the way to do this. I tried:
dict_of_dfs = {}
for n, g in data.groupby(data['STATUSDATE']):
    dict_of_dfs[n] = g

for frame in dict_of_dfs:
    calcForDateRange(frame)
The dict result was 2015-01-02: Dataframe with no frame. How can I break this down into a 100 or so Dataframes to run my function on?
Also, I do not fully understand how to break down ['STATUSDATE'] by a specific number of days.
I would like to avoid iterating as much as possible, but I know I probably will have to somewhere.
Thank you
Let us assume you have a data frame like this:
import numpy as np
import pandas as pd

date = pd.date_range(start='1/1/2018', end='31/12/2018', normalize=True)
x = np.random.randint(0, 1000, size=365)
df = pd.DataFrame(x, columns=["X"])
df['Date'] = date
df.head()
Output:
X Date
0 328 2018-01-01
1 188 2018-01-02
2 709 2018-01-03
3 259 2018-01-04
4 131 2018-01-05
So this data frame has 365 rows, one for each day of the year.
Now if you want to group this data into intervals of 20 days and assign each group to a dict, you can do the following
df_dict = {}
for k, v in df.groupby(pd.Grouper(key="Date", freq='20D')):
    df_dict[k.strftime("%Y-%m-%d")] = pd.DataFrame(v)

print(df_dict)
How about something like this? It creates a dictionary of non-empty dataframes keyed on the starting date of each period.
import datetime as dt
start = '12-31-2017'
interval_days = 30
start_date = pd.Timestamp(start)
end_date = pd.Timestamp(dt.date.today() + dt.timedelta(days=1))
dates = pd.date_range(start=start_date, end=end_date, freq=f'{interval_days}d')
sub_dfs = {d1.strftime('%Y%m%d'): df.loc[df.dates.ge(d1) & df.dates.lt(d2)]
           for d1, d2 in zip(dates, dates[1:])}
# Remove empty dataframes.
sub_dfs = {k: v for k, v in sub_dfs.items() if not v.empty}
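On a small, checkable scale, the Grouper approach from the first answer splits a frame into n-day chunks like this (45 synthetic daily rows and 15-day chunks here, standing in for the real 72,000-row frame; the Date/X column names are made up):

```python
import numpy as np
import pandas as pd

# 45 consecutive daily rows of synthetic data.
dates = pd.date_range('2018-01-01', periods=45, freq='D')
df = pd.DataFrame({'Date': dates, 'X': np.arange(45)})

# Group into 15-day bins; each key is the starting date of its chunk.
chunks = {k.strftime('%Y-%m-%d'): g
          for k, g in df.groupby(pd.Grouper(key='Date', freq='15D'))
          if not g.empty}
print(sorted(chunks))  # ['2018-01-01', '2018-01-16', '2018-01-31']
```

Each value in `chunks` is a 15-row sub-DataFrame that can be passed straight to a function like calcForDateRange; changing `freq` to '30D' (or any 'nD') changes the segment length.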
