faster way of creating pandas dataframe from another dataframe - python

I have a dataframe with over 41500 records and 3 fields: ID, start_date and end_date.
I want to create a separate dataframe from it with just 2 fields, ID and active_years, containing one record per identifier for every year in the start_year to end_year range (inclusive of the end year).
This is what I'm doing right now, but for 41500 rows it takes more than 2 hours to finish.
df = pd.DataFrame(columns=['id', 'active_years'])
ix = 0
for _, row in raw_dataset.iterrows():
    st_yr = int(row['start_date'].split('-')[0])  # because dates are in the format yyyy-mm-dd
    end_yr = int(row['end_date'].split('-')[0])
    for year in range(st_yr, end_yr + 1):
        df.loc[ix, 'id'] = row['ID']
        df.loc[ix, 'active_years'] = year
        ix = ix + 1
So is there any faster way to achieve this?
[EDIT] Here is some example data to work with:
raw_dataset = pd.DataFrame({'ID':['a121','b142','cd3'],'start_date':['2019-10-09','2017-02-06','2012-12-05'],'end_date':['2020-01-30','2019-08-23','2016-06-18']})
print(raw_dataset)
ID start_date end_date
0 a121 2019-10-09 2020-01-30
1 b142 2017-02-06 2019-08-23
2 cd3 2012-12-05 2016-06-18
# the desired dataframe should look like this
print(desired_df)
id active_years
0 a121 2019
1 a121 2020
2 b142 2017
3 b142 2018
4 b142 2019
5 cd3 2012
6 cd3 2013
7 cd3 2014
8 cd3 2015
9 cd3 2016

Dynamically growing python lists is much faster than dynamically growing numpy arrays (which are the underlying data structure of pandas dataframes). See here for a brief explanation. With that in mind:
import pandas as pd

# Initialize input dataframe
raw_dataset = pd.DataFrame({
    'ID': ['a121', 'b142', 'cd3'],
    'start_date': ['2019-10-09', '2017-02-06', '2012-12-05'],
    'end_date': ['2020-01-30', '2019-08-23', '2016-06-18'],
})

# Create integer columns for start year and end year
raw_dataset['start_year'] = pd.to_datetime(raw_dataset['start_date']).dt.year
raw_dataset['end_year'] = pd.to_datetime(raw_dataset['end_date']).dt.year

# Iterate over input dataframe rows and individual years
id_list = []
active_years_list = []
for row in raw_dataset.itertuples():
    for year in range(row.start_year, row.end_year + 1):
        id_list.append(row.ID)
        active_years_list.append(year)

# Create result dataframe from lists
desired_df = pd.DataFrame({
    'id': id_list,
    'active_years': active_years_list,
})
print(desired_df)
# Output:
# id active_years
# 0 a121 2019
# 1 a121 2020
# 2 b142 2017
# 3 b142 2018
# 4 b142 2019
# 5 cd3 2012
# 6 cd3 2013
# 7 cd3 2014
# 8 cd3 2015
# 9 cd3 2016
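If even the list-building loop is too slow, a fully vectorized variant is possible with numpy.repeat. This is a sketch I'm adding as an alternative, not part of the original answer; it assumes the raw_dataset defined above:
import numpy as np
import pandas as pd

# Vectorized sketch: repeat each ID once per active year,
# then enumerate the year ranges themselves.
start = pd.to_datetime(raw_dataset['start_date']).dt.year.to_numpy()
end = pd.to_datetime(raw_dataset['end_date']).dt.year.to_numpy()
counts = end - start + 1  # number of active years per input row

desired_df = pd.DataFrame({
    'id': np.repeat(raw_dataset['ID'].to_numpy(), counts),
    'active_years': np.concatenate(
        [np.arange(s, e + 1) for s, e in zip(start, end)]
    ),
})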

Related

how to calculate number of working days with python

I have a dataframe (df):
year month ETP
0 2021 1 49.21
1 2021 2 34.20
2 2021 3 31.27
3 2021 4 29.18
4 2021 5 33.25
5 2021 6 24.70
I would like to add a column that gives the number of working days for each row, excluding holidays and weekends (for a specific country, e.g. France or the US), so the output will be:
year month ETP work_day
0 2021 1 49.21 20
1 2021 2 34.20 20
2 2021 3 31.27 21
3 2021 4 29.18 19
4 2021 5 33.25 20
5 2021 6 24.70 19
Code:
import numpy as np
import pandas as pd
days = np.busday_count( '2021-01', '2021-06' )
df.insert(3, "work_day", [days])
and I got this error :
ValueError: Length of values does not match length of index
Any suggestions?
Thank you for your help
Assuming you are the one who will input the workdays, I suppose you can do it like this:
data = {'year': [2020, 2020, 2021, 2023, 2022],
        'month': [1, 2, 3, 4, 6]}
df = pd.DataFrame(data)
df.insert(2, "work_day", [20, 20, 23, 21, 22])
Here 2 is the position of the new column (rather than just appending it at the end), "work_day" is its name, and the list supplies a value for every row.
EDIT: With NumPy
import numpy as np
import pandas as pd

days = np.busday_count('2021-02', '2021-03')
data = {'year': [2021],
        'month': ['february']}
df = pd.DataFrame(data)
df.insert(2, "work_day", [days])
With busday_count you specify the start and end dates between which you want to count the workdays.
the result :
year month work_day
0 2021 february 20
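If you instead want to compute the working days per row of the original dataframe, here is a hedged sketch built on np.busday_count; it assumes the third-party holidays package for the country-specific holidays (France here), which is my choice, not something from the question:
import numpy as np
import pandas as pd
import holidays  # third-party package, assumed for country-specific holidays

df = pd.DataFrame({'year': [2021] * 6, 'month': [1, 2, 3, 4, 5, 6]})
fr_holidays = holidays.France(years=df['year'].unique().tolist())

# First day of each month, and first day of the following month
start = pd.to_datetime(df[['year', 'month']].assign(day=1))
end = start + pd.offsets.MonthBegin(1)

# Count business days in [start, end), excluding the French holidays
df['work_day'] = [
    np.busday_count(s.date(), e.date(), holidays=list(fr_holidays))
    for s, e in zip(start, end)
]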

Create and append rows based on average of previous rows and condition columns

I'm working on a dataframe named df that contains a year of daily information for a float variable (balance) for many account values (used as main key). I'm trying to create a new column expected_balance by matching the date of previous months, calculating an average and using it as expected future value. I'll explain in detail now:
The dataset is generated after appending and parsing multiple json values, once I finish working on it, I get this:
date balance account day month year fdate
0 2018-04-13 470.57 SP014 13 4 2018 201804
1 2018-04-14 375.54 SP014 14 4 2018 201804
2 2018-04-15 375.54 SP014 15 4 2018 201804
3 2018-04-16 229.04 SP014 16 4 2018 201804
4 2018-04-17 216.62 SP014 17 4 2018 201804
... ... ... ... ... ... ... ...
414857 2019-02-24 381.26 KO012 24 2 2019 201902
414858 2019-02-25 181.26 KO012 25 2 2019 201902
414859 2019-02-26 160.82 KO012 26 2 2019 201902
414860 2019-02-27 0.82 KO012 27 2 2019 201902
414861 2019-02-28 109.50 KO012 28 2 2019 201902
Each account value has 365 values (a starting date when the information was obtained and a year of info), resampled by day. After that, I'm splitting this dataframe into train and test. Train consists of all values except for the last 2 months of information, and test is those last 2 months (the last month is not necessarily full: if the last/max date value is 20-04-2019, then train will be from 20-04-2018 to 28-02-2019 and test from 01-03-2019 to 20-04-2019). This is how I manage it:
df_test_1 = df[df.fdate==df.groupby('account').fdate.transform('max')].copy()
dft = df.drop(df_test_1.index)
df_test_2 = dft[dft.fdate==dft.groupby('account').fdate.transform('max')].copy()
df_train = dft.drop(df_test_2.index)
df_test = pd.concat([df_test_2,df_test_1])
#print("Shape df: ",df.shape) #for validation purposes
#print("Shape test: ",df_test.shape) #for validation purposes
#print("Shape train: ",df_train.shape) #for validation purposes
What I need to do now is create a new column exp_bal (expected balance) for each date in the df_test that's calculated by averaging all train values for the particular day (this is the method requested so I must follow the instructions).
Here is an example of an expected output/result; I'm only printing account AA000's values for a specific day for the last 2 train months (suppose these values always repeat for the other 8 months):
date balance account day month year fdate
... ... ... ... ... ... ... ...
0 2019-03-20 200.00 AA000 20 3 2019 201903
1 2019-04-20 100.00 AA000 20 4 2019 201904
I should be able to use this information to append a new column for each day that is the average of the same-day value across all months of df_train:
date balance account day month year fdate exp_bal
0 2018-05-20 470.57 AA000 20 5 2018 201805 150.00
30 2019-06-20 381.26 AA000 20 6 2019 201906 150.00
So then I can calculate an MSE for that prediction for that account.
First of all I'm using this to iterate over each account:
ids = list(df['account'].unique())
for i in range(0, len(ids)):
    dft_train = df_train[df_train['account'] == ids[i]]
    dft_test = df_test[df_test['account'] == ids[i]]
    first_date = min(dft_test['date'])
    last_date = max(dft_test['date'])
    dft_train = dft_train.set_index('date')
    dft_test = dft_test.set_index('date')
And after this I'm lost on how to use the dft_train values to create this average for a given day that will be appended in a new column in dft_test.
I appreciate any help or suggestion, also feel free to ask for clarification/ more info, I'll gladly edit this. Thanks in advance!
Not sure if it's the only question you have with the above, but this is how to calculate the expected balance of the train data:
import numpy as np
import pandas as pd

# make test data
n = 60
df = pd.DataFrame({
    'Date': np.tile(pd.date_range('2018-01-01', periods=n).values, 2),
    'Account': np.repeat(['A', 'B'], n),
    'Balance': range(2 * n),
})
df['Day'] = df.Date.dt.day

# calculate expected balance
df['exp_bal'] = df.groupby(['Account', 'Day']).Balance.transform('mean')

# example output for day 5
print(df[df.Day == 5])
Output:
Date Account Balance Day exp_bal
4 2018-01-05 A 4 5 19.5
35 2018-02-05 A 35 5 19.5
64 2018-01-05 B 64 5 79.5
95 2018-02-05 B 95 5 79.5
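From there, attaching those train averages to the test rows (which is what the question ultimately needs) could look like the following hedged sketch; the lowercase column names follow the question's dataframe, and the left join is my assumption:
# Per-(account, day) mean of the train balances
exp = (df_train.groupby(['account', 'day'])['balance']
               .mean()
               .rename('exp_bal')
               .reset_index())

# Left-join onto the test rows so each test date gets its expected balance
df_test = df_test.merge(exp, on=['account', 'day'], how='left')

# MSE per account, computed by hand to avoid extra dependencies
mse = (df_test.assign(sq_err=(df_test['balance'] - df_test['exp_bal']) ** 2)
              .groupby('account')['sq_err'].mean())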

Drop rows after particular year Pandas

I have a column in my dataframe that has years in the following format:
2018-19
2017-18
The years are object data type. I want to change the type of this column to datetime, then drop all rows before 1979-80. However, I tried to do that and I got formatting errors. What is the correct, or better way, of doing this?
BOS['Season'] = pd.to_datetime(BOS['Season'], format = '%Y%y')
I am quite new to Python, so I could appreciate it if you can tell me what I am doing wrong. Thanks!
I think the simplest approach here is to compare the years separately, e.g. keep rows whose starting year is before 2017:
print (BOS)
Season
0 1979-80
1 2018-19
2 2017-18
df = BOS[BOS['Season'].str.split('-').str[0].astype(int) < 2017]
print (df)
Season
0 1979-80
Details:
First, the values are split into lists by Series.str.split, and then the first element of each list is selected:
print (BOS['Season'].str.split('-'))
0 [1979, 80]
1 [2018, 19]
2 [2017, 18]
Name: Season, dtype: object
print (BOS['Season'].str.split('-').str[0])
0 1979
1 2018
2 2017
Name: Season, dtype: object
Or convert both years to separate columns:
BOS['start'] = pd.to_datetime(BOS['Season'].str.split('-').str[0], format='%Y').dt.year
BOS['end'] = BOS['start'] + 1
print (BOS)
Season start end
0 1979-80 1979 1980
1 2018-19 2018 2019
2 2017-18 2017 2018
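The filter can then use the new integer column directly; this usage line is mine, not part of the answer:
df = BOS[BOS['start'] < 2017]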
I would use the .str.slice accessor of Series to select the part of the date I wish to keep and pass it into pd.to_datetime(). Then, selecting with .loc[] and a boolean mask becomes easy.
import pandas as pd
data = {
    'date': ['2016-17', '2017-18', '2018-19', '2019-20']
}
df = pd.DataFrame(data)
print(df)
# date
# 0 2016-17
# 1 2017-18
# 2 2018-19
# 3 2019-20
df['date'] = pd.to_datetime(df['date'].str.slice(0, 4), format='%Y')
print(df)
# date
# 0 2016-01-01
# 1 2017-01-01
# 2 2018-01-01
# 3 2019-01-01
df = df.loc[ df['date'].dt.year < 2018 ]
print(df)
# date
# 0 2016-01-01
# 1 2017-01-01

Pivot and rename Pandas dataframe

I have a dataframe in the format
Date Datediff Cumulative_sum
01 January 2019 1 5
02 January 2019 1 7
02 January 2019 2 15
01 January 2019 2 8
01 January 2019 3 13
and I want to pivot the column Datediff from the dataframe such that the end result looks like
Index Day-1 Day-2 Day-3
01 January 2019 5 8 13
02 January 2019 7 15
I have used the pivot command such that
pt = pd.pivot_table(df, index="Date",
                    columns="Datediff",
                    values="Cumulative_sum") \
       .reset_index() \
       .set_index("Date")
which returns the pivoted table
1 2 3
01 January 2019 5 8 13
02 January 2019 7 15
And I can then rename the columns using the loop
for column in pt:
    pt.rename(columns={column: "Day-" + str(column)}, inplace=True)
which returns exactly what I want. However, I was wondering if there is a faster way to rename the columns when pivoting and get rid of the loop altogether.
Use DataFrame.add_prefix:
df.add_prefix('Day-')
In your solution:
pt = (pd.pivot_table(df, index="Date",
                     columns="Datediff",
                     values="Cumulative_sum")
        .add_prefix('Day-'))
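If you prefer not to touch the pivot call itself, a rename with a callable is an equivalent loop-free alternative (my addition, not from the answer above):
pt = pt.rename(columns=lambda c: "Day-" + str(c))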

convert year to a date by adding some number of days in pandas

I have a dataframe that looks like this:
Year vl
2017 20
2017 21
2017 22
2017 23
2017 24
2017 25
2017 26
...
I need to convert the year into the format dd.mm.yyyy, starting each time from the first day of the year; for example, 2017 becomes 01.01.2017. Then I need to multiply each value in the "vl" column by 7 and add it, row by row, to that date as a number of days, with the resulting dates in the new format (as in the example 01.01.2017).
The result should be something like this:
Year vl new_date
2017 20 21.05.2017
2017 21 28.05.2017
2017 22 04.06.2017
2017 23 11.06.2017
2017 24 18.06.2017
2017 25 25.06.2017
2017 26 02.07.2017
...
Here is one option: paste the Year (%Y) and day of the year (%j) together, then parse and reformat it:
from datetime import datetime
df.apply(lambda r: datetime.strptime("{}{}".format(r.Year, r.vl*7+1), "%Y%j").strftime("%d.%m.%Y"), axis=1)
#0 21.05.2017
#1 28.05.2017
#2 04.06.2017
#3 11.06.2017
#4 18.06.2017
#5 25.06.2017
#6 02.07.2017
#dtype: object
Assign the column back to the original data frame:
df['new_date'] = df.apply(lambda r: datetime.strptime("{}{}".format(r.Year, r.vl*7+1), "%Y%j").strftime("%d.%m.%Y"), axis=1)
Unfortunately %U and %W aren't implemented in Pandas
But we can use the following vectorized approach:
In [160]: pd.to_datetime(df.Year.astype(str), format='%Y') + \
          pd.to_timedelta(df.vl.mul(7).astype(str) + ' days')
Out[160]:
0 2017-05-21
1 2017-05-28
2 2017-06-04
3 2017-06-11
4 2017-06-18
5 2017-06-25
6 2017-07-02
dtype: datetime64[ns]
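Since the vectorized result is datetime64 rather than the requested dd.mm.yyyy strings, one extra formatting step is needed; assigning it to new_date follows the question's column name:
df['new_date'] = (pd.to_datetime(df.Year.astype(str), format='%Y')
                  + pd.to_timedelta(df.vl.mul(7).astype(str) + ' days')
                  ).dt.strftime('%d.%m.%Y')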
