How to calculate the number of working days with Python

I have a dataframe (df):
   year  month    ETP
0  2021      1  49.21
1  2021      2  34.20
2  2021      3  31.27
3  2021      4  29.18
4  2021      5  33.25
5  2021      6  24.70
I would like to add a column that gives the number of working days for each row, excluding holidays and weekends, for a specific country (e.g., France or the US).
so the output will be:
   year  month    ETP  work_day
0  2021      1  49.21        20
1  2021      2  34.20        20
2  2021      3  31.27        21
3  2021      4  29.18        19
4  2021      5  33.25        20
5  2021      6  24.70        19
Code:
import numpy as np
import pandas as pd

days = np.busday_count('2021-01', '2021-06')
df.insert(3, "work_day", [days])
and I got this error:
ValueError: Length of values does not match length of index
Any suggestions? Thank you for your help.

Assuming you are the one who will input the workdays, you can do it like this:
import pandas as pd

data = {'year': [2020, 2020, 2021, 2023, 2022],
        'month': [1, 2, 3, 4, 6]}
df = pd.DataFrame(data)
df.insert(2, "work_day", [20, 20, 23, 21, 22])
Here 2 is the position of the new column (rather than appending it at the end), work_day is its name, and the list holds a value for every row.
EDIT: with NumPy
import numpy as np
import pandas as pd

days = np.busday_count('2021-02', '2021-03')
data = {'year': [2021],
        'month': ['february']}
df = pd.DataFrame(data)
df.insert(2, "work_day", [days])
With busday_count you specify the start and end dates you want to count the workdays between; note that the end date itself is excluded from the count.
The result:
   year     month  work_day
0  2021  february        20
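To answer the original question directly, the per-row counts can be computed in one vectorized call. The following is a minimal sketch, assuming the third-party holidays package (pip install holidays) for country-specific holiday dates; any list of dates works for the holidays argument of np.busday_count:
import numpy as np
import pandas as pd
import holidays  # third-party package, assumed here for French holidays

df = pd.DataFrame({'year': [2021] * 6, 'month': [1, 2, 3, 4, 5, 6],
                   'ETP': [49.21, 34.20, 31.27, 29.18, 33.25, 24.70]})

# First day of each month and first day of the following month;
# np.busday_count excludes the end date, so each pair covers exactly one month
starts = pd.to_datetime(df['year'].astype(str) + '-'
                        + df['month'].astype(str).str.zfill(2) + '-01')
ends = starts + pd.offsets.MonthBegin(1)

fr_holidays = list(holidays.France(years=df['year'].unique()))

df['work_day'] = np.busday_count(starts.values.astype('datetime64[D]'),
                                 ends.values.astype('datetime64[D]'),
                                 holidays=fr_holidays)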

Related

Pandas - plot Bureau of Labor statistics with years on y-axis and months on x-axis

I'm working with BLS inflation statistics, which are presented in a wide format, with one row per year and one column per month:
I want to make a very simple line chart (probably will use altair but that's not entirely relevant to the question).
In pandas, what is the most efficient/idiomatic way to restructure the DataFrame to prepare for time-series visualization in this case?
NOTE: this is essentially the inverse of this question: https://stackoverflow.com/questions/48211424/how-to-make-a-years-on-y-axis-and-months-on-x-axis-plot-with-pandas
There might be a more elegant way to do this, but you can loop through the month columns, combine each month with the values in the year column to get a year-month, and store the values and year-months together in a pd.Series.
For example, we can create a dataframe similar to yours:
from datetime import datetime
import numpy as np
import pandas as pd

# Recreate a dataframe with a similar structure
np.random.seed(42)
data = np.random.randint(low=1, high=10, size=(4, 13))
month_cols = [datetime.strptime(str(i), "%m").strftime("%b") for i in range(1, 13)]
years = [1960.0, 1961.0, 1962.0, 1963.0]
df = pd.DataFrame(data, columns=['Date'] + month_cols)
df['Date'] = years
>>> df
Date Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 1960.0 4 8 5 7 3 7 8 5 4 8 8 3
1 1961.0 5 2 8 6 2 5 1 6 9 1 3 7
2 1962.0 9 3 5 3 7 5 9 7 2 4 9 2
3 1963.0 5 2 4 7 8 3 1 4 2 8 4 2
And transform it into a time series:
ts_index, ts_values = [], []
for col in month_cols:
    ts_values.extend(df[col].tolist())
    ts_index.extend(pd.to_datetime(
        [f"{int(year)}-{col}" for year in years]
    ))
timeseries = pd.Series(index=ts_index, data=ts_values).sort_index()
1960-01-01 4
1960-02-01 8
1960-03-01 5
1960-04-01 7
...
1963-09-01 2
1963-10-01 8
1963-11-01 4
1963-12-01 2
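From here a line chart is straightforward. As a minimal sketch using pandas' built-in plotting (the question mentions altair; matplotlib is only an assumption here):
import matplotlib.pyplot as plt

ax = timeseries.plot()  # line plot over the datetime index
ax.set_xlabel("date")
ax.set_ylabel("value")
plt.show()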
With a table like this, which I call bls_df:
Year Jan Feb Mar ... Nov Dec HALF1 HALF2
0 2013 230.280 232.166 232.773 ... 233.069 233.049 232.366 233.548
1 2014 233.916 234.781 236.293 ... 236.151 234.812 236.384 237.088
2 2015 233.707 234.722 236.119 ... 237.336 236.525 236.265 237.769
3 2016 236.916 237.111 238.132 ... 241.353 241.432 238.778 241.237
4 2017 242.839 243.603 243.801 ... 246.669 246.524 244.076 246.163
5 2018 247.867 248.991 249.554 ... 252.038 251.233 250.089 252.125
6 2019 251.712 252.776 254.202 ... 257.208 256.974 254.412 256.903
7 2020 257.971 258.678 258.115 ... 260.229 260.474 257.557 260.065
8 2021 261.582 263.014 264.877 ... 277.948 278.802 266.236 275.703
9 2022 281.148 283.716 287.504 ... 297.711 296.797 288.347 296.963
10 2023 299.170 NaN NaN ... NaN NaN NaN NaN
[11 rows x 15 columns]
Reshape by setting the index and then unstacking:
import calendar

nbls = bls_df.set_index('Year').unstack().reset_index().rename(columns={'level_0': 'month'})
nbls = nbls[nbls['month'].isin(list(calendar.month_abbr))]  # subset to real months (drops HALF1/HALF2)
Generate date by concatenating the month and year columns in a parseable format:
>>> nbls['date'] = pd.to_datetime(nbls['month'] + '-' + nbls['Year'].astype(str))
>>> nbls.sort_values('date') # nb does not act in-place
month Year 0 date
0 Jan 2013 230.280 2013-01-01
11 Feb 2013 232.166 2013-02-01
22 Mar 2013 232.773 2013-03-01
33 Apr 2013 232.531 2013-04-01
44 May 2013 232.945 2013-05-01
.. ... ... ... ...
87 Aug 2023 NaN 2023-08-01
98 Sep 2023 NaN 2023-09-01
109 Oct 2023 NaN 2023-10-01
120 Nov 2023 NaN 2023-11-01
131 Dec 2023 NaN 2023-12-01
[132 rows x 4 columns]
Note that the BLS already publishes this data in a long format. You can read it in directly using pd.read_csv(THE_URL, sep=r'\s+'), where THE_URL is this link: https://download.bls.gov/pub/time.series/cu/cu.data.0.Current. You will still need to do some work to generate a datetime column and to filter on the series codes that the BLS assigns, but it isn't too difficult.
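An equivalent reshape, as a sketch, uses DataFrame.melt instead of set_index/unstack (assuming the same bls_df as above):
import calendar
import pandas as pd

# One row per (Year, month) pair; the HALF1/HALF2 summary columns are dropped
long_df = bls_df.melt(id_vars='Year', var_name='month', value_name='value')
long_df = long_df[long_df['month'].isin(list(calendar.month_abbr))]
long_df['date'] = pd.to_datetime(long_df['month'] + '-' + long_df['Year'].astype(str))
long_df = long_df.sort_values('date')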

Converting different columns to a datetime

The time in my csv file is divided into 4 columns (year, julian day, hour/minute (UTC) and second), and I want to convert them into a single column that looks like this: 14/11/2017 00:16:00.
Is there an easy way to do this?
A sample of the code is
cols = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
D14 = pd.read_csv(r'C:\Users\William Jacondino\Desktop\DadosTimeSeries\PIRATA-PROFILE\Dados FLUXO\Dados_brutos_copy-20220804T151840Z-002\Dados_brutos_copy\tm_data_2017_11_14_0016.dat', header=None, usecols=cols, names=["Year","Julian day", "Hour/minut (UTC)", "Second", "Bateria (V)", "PTemp (°C)", "Latitude", "Longitude", "Magnectic_Variation (arb)", "Altitude (m)", "Course (º)", "WS", "Nmbr_of_Satellites (arb)", "RAD", "Tar", "UR", "slp",], sep=',')
D14 = D14.loc[:, ["Year","Julian day", "Hour/minut (UTC)", "Second", "Latitude", "Longitude","WS", "RAD", "Tar", "UR", "slp"]]
My array looks like this: [screenshot of a csv file sample]
The "Hour/minut (UTC)" column has the first two digits referring to the Local Time and the last two digits referring to the minute.
The beginning of the time in the "Hour/minut (UTC)" column starts at 016 which refers to 0 hour UTC and minute 16.
and goes up to hour 12 UTC and minute 03.
I wanted to unify everything into a single datetime column, covering the array from beginning to end:
1 - 2017
1412 - 14/11/2017 12:03:30
but the "Hour/minut (UTC)" column has only one digit for hours 0 through 9, like
9
instead of
09
How do I create the array with the correct datetime?
You can create a new column that combines the data from the other columns.
For example, if you have a dataframe like so:
df = pd.DataFrame({
    'year': [2010, 2010, 2020],
    'month': ['jan', 'feb', 'mar'],
    'day': [1, 2, 3],
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'c': [7, 8, 9],
})
# Print df:
   year month  day  a  b  c
0  2010   jan    1  1  4  7
1  2010   feb    2  2  5  8
2  2020   mar    3  3  6  9
You can add a new column to the DataFrame, with values built from the year, month and day columns:
df['newColumn'] = df.year.astype(str) + '-' + df.month + '-' + df.day.astype(str)
Edit: in your situation, use df['Julian day'] instead of df.month, since the column name is different; attribute access like df.month only works when the column name is a valid Python identifier, so names containing spaces need bracket indexing.
The data in the new column will be a string in the format you choose. You can substitute the dash '-' with a slash '/' or whatever formatting you need; you just have to convert the integers into strings with .astype(str).
Output:
year month day a b c newColumn
0 2010 jan 1 1 4 7 2010-jan-1
1 2010 feb 2 2 5 8 2010-feb-2
2 2020 mar 3 3 6 9 2020-mar-3
After that you can do anything you would normally do with a dataframe object.
If you only need it for data analysis, you can use .groupby(), which groups the data and lets you run aggregations on the groups.
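If you then want the combined string to be a real datetime, a minimal sketch (assuming the 'year-month-day' strings built above; %b matches abbreviated month names such as 'jan'):
import pandas as pd

# Parse the 'YYYY-mon-D' strings produced above into datetime64 values
df['newColumn'] = pd.to_datetime(df['newColumn'], format='%Y-%b-%d')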
If your dataframe looks like
import pandas as pd

df = pd.DataFrame({
    "year": [2017, 2017],
    "julian day": [318, 318],
    "hour/minut(utc)": [16, 16],
    "second": [0, 30],
})

   year  julian day  hour/minut(utc)  second
0  2017         318               16       0
1  2017         318               16      30
then you could use pd.to_datetime() and pd.to_timedelta() to do
df["datetime"] = (
pd.to_datetime(df["year"].astype("str"), format="%Y")
+ pd.to_timedelta(df["julian day"] - 1, unit="days")
+ pd.to_timedelta(df["hour/minut(utc)"], unit="minutes")
+ pd.to_timedelta(df["second"], unit="seconds")
).dt.strftime("%d/%m/%Y %H:%M:%S")
and get
   year  julian day  hour/minut(utc)  second             datetime
0  2017         318               16       0  14/11/2017 00:16:00
1  2017         318               16      30  14/11/2017 00:16:30
The column datetime now contains strings. Remove the .dt.strftime("%d/%m/%Y %H:%M:%S") part at the end, if you want datetimes instead.
Regarding your comment: If I understand correctly, you could try the following:
df["hours_min"] = df["hour/minut(utc)"].astype("str").str.zfill(4)
df["hour"] = df["hours_min"].str[:2].astype("int")
df["minute"] = df["hours_min"].str[2:].astype("int")
df = df.drop(columns=["hours_min", "hour/minut(utc)"])
df["datetime"] = (
pd.to_datetime(df["year"].astype("str"), format="%Y")
+ pd.to_timedelta(df["julian day"] - 1, unit="days")
+ pd.to_timedelta(df["hour"], unit="hours")
+ pd.to_timedelta(df["minute"], unit="minutes")
+ pd.to_timedelta(df["second"], unit="seconds")
).dt.strftime("%d/%m/%Y %H:%M:%S")
Result for the sample dataframe df
df = pd.DataFrame({
    "year": [2017, 2017, 2018, 2019],
    "julian day": [318, 318, 10, 50],
    "hour/minut(utc)": [16, 16, 234, 1201],
    "second": [0, 30, 1, 2],
})
   year  julian day  hour/minut(utc)  second
0  2017         318               16       0
1  2017         318               16      30
2  2018          10              234       1
3  2019          50             1201       2
would be
   year  julian day  second  hour  minute             datetime
0  2017         318       0     0      16  14/11/2017 00:16:00
1  2017         318      30     0      16  14/11/2017 00:16:30
2  2018          10       1     2      34  10/01/2018 02:34:01
3  2019          50       2    12       1  19/02/2019 12:01:02
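As an alternative sketch, you can zero-pad each component and parse everything in a single pass, using %j (day of the year) in the format string:
import pandas as pd

df = pd.DataFrame({
    "year": [2017, 2017], "julian day": [318, 318],
    "hour/minut(utc)": [16, 16], "second": [0, 30],
})

# Build strings like "2017318001600" and parse as year + day-of-year + HHMMSS
stamp = (df["year"].astype(str)
         + df["julian day"].astype(str).str.zfill(3)
         + df["hour/minut(utc)"].astype(str).str.zfill(4)
         + df["second"].astype(str).str.zfill(2))
df["datetime"] = pd.to_datetime(stamp, format="%Y%j%H%M%S")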

Return a Cell from a Pandas Dataframe based on other values

I have the following dataset in a Pandas Dataframe:
Id  Year  Month     Total
 0  2020      9  11788.33
 1  2020     10  18373.99
 2  2020     11  31018.59
 3  2020     12  29279.30
 4  2021      1   1875.10
 5  2021      2   9550.06
 6  2021      3  33844.39
 7  2021      4  33126.53
 8  2021      5  12910.05
 9  2021      6  44628.63
10  2021      7  25830.03
11  2021      8  54463.08
12  2021      9  49723.93
13  2021     10  23753.81
14  2021     11  52532.49
15  2021     12   7467.32
16  2022      1  24333.54
17  2022      2  12394.11
18  2022      3  76575.46
19  2022      4  95119.82
20  2022      5  63048.05
I am trying to dynamically return the value from the Total column for the first month (Month 1) of last year (Year 2021). The expected result is 1875.10.
I am using Python in PyCharm to complete this.
Note: the "Id" column is the one that pandas generates automatically for a DataFrame; I believe it is called the index.
Any help would be greatly appreciated.
Thank you.
You can use .loc[]:
df.loc[(df['Year'] == 2021) & (df['Month'] == 1), 'Total']
Which will give you:
0 1875.1
Name: Total, dtype: float64
To get the actual number you can add .iloc[] on the end:
df.loc[(df['Year'] == 2021) & (df['Month'] == 1), 'Total'].iloc[0]
Output:
1875.1
Another method is this:
df[df['Year']==2021].iloc[0]['Total']
Here df[df['Year']==2021] creates a new dataframe with only the rows from 2021, .iloc[0] fetches the first of those rows, and ['Total'] selects the value from the Total column. (Note that this relies on the first 2021 row being Month 1.)
Would a simple filter suffice?
df[(df.Year == 2021) & (df.Month == 1)].Total
(This returns a one-element Series; chain .iloc[0] onto it if you need the scalar.)
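Yet another option, as a sketch, is DataFrame.query, which reads a bit like SQL:
# Filter on year and month, then pull out the scalar from the Total column
value = df.query("Year == 2021 and Month == 1")["Total"].iloc[0]
print(value)  # 1875.1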

faster way of creating pandas dataframe from another dataframe

I have a dataframe with over 41500 records and 3 fields: ID, start_date and end_date.
I want to create a separate dataframe from it with just 2 fields, ID and active_years, containing one record per identifier for every year between start_year and end_year (inclusive of the end year).
This is what I'm doing right now, but for 41500 rows it takes more than 2 hours to finish.
df = pd.DataFrame(columns=['id', 'active_years'])
ix = 0
for _, row in raw_dataset.iterrows():
    st_yr = int(row['start_date'].split('-')[0])  # because dates are in the format yyyy-mm-dd
    end_yr = int(row['end_date'].split('-')[0])
    for year in range(st_yr, end_yr + 1):
        df.loc[ix, 'id'] = row['ID']
        df.loc[ix, 'active_years'] = year
        ix = ix + 1
So is there any faster way to achieve this?
[EDIT] A small example to work with:
raw_dataset = pd.DataFrame({'ID': ['a121', 'b142', 'cd3'],
                            'start_date': ['2019-10-09', '2017-02-06', '2012-12-05'],
                            'end_date': ['2020-01-30', '2019-08-23', '2016-06-18']})
print(raw_dataset)
     ID  start_date    end_date
0  a121  2019-10-09  2020-01-30
1  b142  2017-02-06  2019-08-23
2   cd3  2012-12-05  2016-06-18
# the desired dataframe should look like this
print(desired_df)
id active_years
0 a121 2019
1 a121 2020
2 b142 2017
3 b142 2018
4 b142 2019
5 cd3 2012
6 cd3 2013
7 cd3 2014
8 cd3 2015
9 cd3 2016
Dynamically growing Python lists is much faster than dynamically growing NumPy arrays (which are the underlying data structure of pandas dataframes), so it pays to collect the values in lists first and build the dataframe once at the end. With that in mind:
import pandas as pd

# Initialize input dataframe
raw_dataset = pd.DataFrame({
    'ID': ['a121', 'b142', 'cd3'],
    'start_date': ['2019-10-09', '2017-02-06', '2012-12-05'],
    'end_date': ['2020-01-30', '2019-08-23', '2016-06-18'],
})

# Create integer columns for start year and end year
raw_dataset['start_year'] = pd.to_datetime(raw_dataset['start_date']).dt.year
raw_dataset['end_year'] = pd.to_datetime(raw_dataset['end_date']).dt.year

# Iterate over input dataframe rows and individual years
id_list = []
active_years_list = []
for row in raw_dataset.itertuples():
    for year in range(row.start_year, row.end_year + 1):
        id_list.append(row.ID)
        active_years_list.append(year)

# Create result dataframe from lists
desired_df = pd.DataFrame({
    'id': id_list,
    'active_years': active_years_list,
})
print(desired_df)
# Output:
# id active_years
# 0 a121 2019
# 1 a121 2020
# 2 b142 2017
# 3 b142 2018
# 4 b142 2019
# 5 cd3 2012
# 6 cd3 2013
# 7 cd3 2014
# 8 cd3 2015
# 9 cd3 2016
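A fully vectorized alternative, as a sketch: build one list of active years per row and flatten it with DataFrame.explode (available since pandas 0.25):
import pandas as pd

start = pd.to_datetime(raw_dataset['start_date']).dt.year
end = pd.to_datetime(raw_dataset['end_date']).dt.year

# One list of years per row, then one output row per year
desired_df = pd.DataFrame({
    'id': raw_dataset['ID'],
    'active_years': [list(range(s, e + 1)) for s, e in zip(start, end)],
}).explode('active_years').reset_index(drop=True)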

pandas rename: change values of index for a specific column only

I have the following pandas dataframe:
               Cost
Year Month ID
2016 1     10    40
     2     11    50
2017 4     1     60
The columns Year, Month and ID make up the index. I want to set the values within Month to be the name equivalent (e.g. 1 = Jan, 2 = Feb). I've come up with the following code:
import calendar
df.rename(index={i: calendar.month_abbr[i] for i in range(1, 13)}, inplace=True)
However, this changes matching values in every level of the index:
               Cost
Year Month ID
2016 Jan   10    40
     Feb   11    50
2017 Apr   Jan   60  # Jan here is incorrect
I obviously only want to change the values in the Month column. How can I fix this?
Use set_levels:
import calendar

m = {i: calendar.month_abbr[i] for i in range(1, 13)}  # {1: 'Jan', 2: 'Feb', ..., 12: 'Dec'}
df.index = df.index.set_levels(
    df.index.levels[1].to_series().map(m).values,
    level=1)
print(df)

               Cost
Year Month ID
2016 Jan   10    40
     Feb   11    50
2017 Apr   1     60
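A simpler fix to the original attempt, as a sketch: DataFrame.rename accepts a level argument for a MultiIndex, so the question's own dictionary works once it is restricted to the Month level:
import calendar

# Only labels in the 'Month' level are renamed; Year and ID stay untouched
df = df.rename(index={i: calendar.month_abbr[i] for i in range(1, 13)},
               level='Month')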
