I am importing some stock data with annual report information into a pandas DataFrame, but the annual report end date falls in an odd month (end of January) rather than at the end of the year.
years = ['2017-01-31', '2016-01-31', '2015-01-31']
df = pd.DataFrame(data = years, columns = ['years'])
df
Out[357]:
years
0 2017-01-31
1 2016-01-31
2 2015-01-31
When I try to add a PeriodIndex showing the period of time the report data is valid for, it defaults to a December year-end rather than inferring it from the date string:
df.index = pd.PeriodIndex(df['years'], freq ='A')
df.index
Out[367]: PeriodIndex(['2017', '2016', '2015'], dtype='period[A-DEC]',
name='years', freq='A-DEC')
Note that the frequency should be 'A-JAN'.
I assume this means that PeriodIndex can't infer the year-end month from the date string I gave it.
I can change it using the asfreq method and anchored offsets, with "A-JAN" as the frequency string. But this changes all of the individual periods in the PeriodIndex at once rather than individually, and years can have different reporting end dates for their annual report (in the case of a company that changed its reporting period).
Is there a way to interpret each date string and correctly set each period for each row in my pandas frame?
My end goal is to set a period column or index that has a frequency of 'annual' but with the period end date set to the date from the corresponding row of the years column.
Expanding this question a bit further: consider that I have many stocks with 3-4 years of annual financial data each, all with varying start and end dates for their annual (or, for that matter, quarterly) reporting periods.
Out[14]:
years tickers
0 2017-01-31 PG
1 2016-01-31 PG
2 2015-01-31 PG
3 2017-05-31 T
4 2016-05-31 T
5 2015-05-31 T
What I'm trying to get to is a column of proper Period objects, each configured with the correct end date (from the years column) and all with annual frequency. I've thought about iterating through the years and using apply, map, or a lambda with the pd.Period constructor, though it may be that a PeriodIndex can't hold Period objects with varying end dates. Something like:
s = []
for row in df.years:
    s.append(pd.Period(row, freq='A'))
df['period'] = s
@KRkirov got me thinking. It appears the Period constructor is not smart enough to set the end date of the frequency by reading the date string. I was able to get the frequency end date right by building up an anchor string from the end date of the reporting period, as follows:
# make sure the years column is datetime so the .dt accessor works
df['years'] = pd.to_datetime(df['years'])
# return the month as a three-letter abbreviation (e.g. "JAN")
df['offset'] = df['years'].dt.strftime('%b').str.upper()
# now build up an anchor offset string (e.g. "A-JAN";
# for a quarterly report ending in January it would be "Q-JAN")
df['offset_strings'] = "A" + '-' + df.offset
Anchor strings are documented in the pandas docs here.
Then iterate through the rows of the DataFrame, constructing each Period and appending it to a list, and finally assign the list of Period objects to a column.
ps = []
for i, r in df.iterrows():
    p = pd.Period(r['years'], freq=r['offset_strings'])
    ps.append(p)
df['period'] = ps
This builds the Period objects correctly, and assigning them to the index yields a proper PeriodIndex:
df['period']
Out[40]:
0 2017
1 2016
2 2015
Name: period, dtype: object
df['period'][0]
Out[41]: Period('2017', 'A-JAN')
df.index = df.period
df.index
Out[43]: PeriodIndex(['2017', '2016', '2015'], dtype='period[A-JAN]',
name='period', freq='A-JAN')
Not pretty, but I could not find another way.
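For what it's worth, the same idea can be written more compactly. A sketch, assuming df['years'] is already datetime:
df['period'] = [pd.Period(d, freq='A-' + d.strftime('%b').upper())
                for d in df['years']]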
Related
I have a pandas dataframe with date values. I need to convert them from dates to Excel's General format (serial numbers), not to date strings, in order to match primary key values in SQL, which are unfortunately stored in General format. Is it possible to do this in Python, or is the only way to convert this column to General format in Excel?
Here is what the dataframe's column looks like:
ID          Desired Output
1/1/2022    44562
7/21/2024   45494
1/1/1931    11324
Yes, it's possible. The general format in Excel starts counting the days from the date 1900-1-1.
You can calculate a time delta between the dates in ID and 1900-1-1.
Inspired by this post you could do...
import pandas as pd

data = pd.DataFrame({'ID': ['1/1/2022', '7/21/2024', '1/1/1931']})
data['General format'] = (
    pd.to_datetime(data["ID"]) - pd.Timestamp("1900-01-01")
).dt.days + 2
print(data)
ID General format
0 1/1/2022 44562
1 7/21/2024 45494
2 1/1/1931 11324
The +2 is because:
Excel starts counting from 1 instead of 0
Excel incorrectly considers 1900 as a leap year
Excel stores dates as sequential serial numbers so that they can be used in calculations. By default, January 1, 1900 is serial number 1, and January 1, 2008 is serial number 39448 because it is 39,447 days after January 1, 1900. (Microsoft's documentation)
So you can just calculate (the difference between your date and January 1, 1900) + 1, plus one more day for dates after 1900-02-28 to account for Excel's leap-year bug.
see How to calculate number of days between two given dates
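A rough standalone sketch of that calculation in plain Python (excel_serial is an illustrative name, not a library function):
import datetime as dt

def excel_serial(d):
    # days since 1900-01-01, plus 1 for Excel's 1-based counting
    serial = (d - dt.date(1900, 1, 1)).days + 1
    # plus 1 more for Excel's phantom 1900-02-29
    return serial + 1 if d > dt.date(1900, 2, 28) else serial

print(excel_serial(dt.date(2022, 1, 1)))  # 44562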
I have a pandas dataframe with 3 columns:
OrderID_new (integer)
OrderTotal (float)
OrderDate_new (string or datetime sometimes)
Sales order IDs are in the first column, order values (totals) in the second, and order dates (in mm/dd/yyyy format) in the last column.
I need to do two things:
1. Aggregate the order totals:
   a) first into total sales per day, and then
   b) into total sales per calendar month.
2. Convert values in OrderDate_new from mm/dd/yyyy format (e.g. 01/30/2015) into MM YYYY (e.g. January 2015) format.
The problem is that in some input files the third (date) column is already in datetime format, while in others it is a string, so sometimes string-to-datetime parsing is needed and in other cases only reformatting of the datetime.
I have been trying to do a two-step aggregation with groupby, but I'm getting some strange daily and monthly totals that make no sense.
What I need as the final stage is a time series with two columns: 1. monthly sales and 2. month (Month Year).
Then I will need to select and train some model for monthly sales time series forecast (out of scope for this question)...
What am I doing wrong?
How to do it effectively in Python?
dataframe example: [screenshot in the original post, not reproduced]
You did not provide usable sample data, hence I've synthesized some.
resample() lets you roll up a date column; daily and monthly rollups are shown below.
pd.to_datetime() handles the string-to-datetime conversion.
import datetime as dt
import numpy as np
import pandas as pd

def mydf(size=10):
    return pd.DataFrame({"OrderID_new": np.random.randint(100, 200, size),
                         "OrderTotal": np.random.randint(200, 10000, size),
                         "OrderDate_new": np.random.choice(pd.date_range(dt.date(2019, 8, 1), dt.date(2020, 1, 1)), size)})

# smash orderdate to be a string for some rows
df = pd.concat([mydf(5), mydf(5).assign(OrderDate_new=lambda dfa: dfa.OrderDate_new.dt.strftime("%Y/%m/%d"))])
# make sure everything is a date..
df.OrderDate_new = pd.to_datetime(df.OrderDate_new)
# totals
df.resample("1d", on="OrderDate_new")["OrderTotal"].sum()
df.resample("1M", on="OrderDate_new")["OrderTotal"].sum()  # "M" is month end
This question already has answers here: Extracting just Month and Year separately from Pandas Datetime column.
I have a dataframe with a date column (type datetime). I can easily extract the year or the month to perform groupings, but I can't find a way to extract both year and month at the same time from a date. I need to analyze the performance of a product over a one-year period and make a graph of how it performed each month. Naturally I can't just group by month, because that would combine the same month from two different years, and grouping by year doesn't produce my desired results because I need to look at performance monthly.
I've been looking at several solutions, but none of them have worked so far.
So basically, my current dates look like this
2018-07-20
2018-08-20
2018-08-21
2018-10-11
2019-07-20
2019-08-21
And I'd just like to have 2018-07, 2018-08, 2018-10, and so on.
You can use to_period:
df['month_year'] = df['date'].dt.to_period('M')
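You can then group on that column directly; a minimal sketch, assuming a hypothetical sales column:
df.groupby('month_year')['sales'].sum()  # 'sales' is illustrative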
If they are stored as datetime you should be able to create a string with just the year and month to group by using datetime.strftime (https://strftime.org/).
It would look something like:
df['ym-date'] = df['date'].dt.strftime('%Y-%m')
If you have some data that uses datetime values, like this:
import numpy as np
import pandas as pd

sale_date = [
    pd.date_range('2017', freq='W', periods=121).to_series().reset_index(drop=True).rename('Sale Date'),
    pd.Series(np.random.normal(1000, 100, 121)).rename('Quantity')
]
sales = pd.concat(sale_date, axis='columns')
You can group by year and date simultaneously like this:
d = sales['Sale Date']
sales.groupby([d.dt.year.rename('Year'), d.dt.month.rename('Month')]).sum()
You can also create a string that represents the combination of month and year and group by that:
ym_id = d.apply("{:%Y-%m}".format).rename('Sale Month')
sales.groupby(ym_id).sum()
A couple of options, one is to map to the first of each month:
Assuming your dates are in a column called 'Date', something like:
df['Date_no_day'] = df['Date'].apply(lambda x: x.replace(day=1))
If you are really keen on storing the year and month only, you could map to a (year, month) tuple, eg:
df['Date_no_day'] = df['Date'].apply(lambda x: (x.year, x.month))
From here, you can groupby/aggregate by this new column and perform your analysis
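For example, a minimal sketch of that aggregation with a hypothetical value column:
df.groupby('Date_no_day')['value'].sum()  # 'value' is illustrative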
One way could be to transform the column to the first of the month for all of these dates and then do your analysis month to month:
date_col = pd.to_datetime(['2011-09-30', '2012-02-28'])
new_col = date_col + pd.offsets.MonthBegin(1)
Note that MonthBegin(1) always rolls forward, so every date in a given month (including the 1st) maps to the first day of the following month; the labels shift by one month, but your analysis remains consistently monthly.
I am working on a dataset that has some 26 million rows and 13 columns including two datetime columns arr_date and dep_date. I am trying to create a new boolean column to check if there is any US holidays between these dates.
I am using the apply function on the entire dataframe, but the execution time is too slow. The code has been running for more than 48 hours now on Google Cloud Platform (24 GB RAM, 4 cores). Is there a faster way to do this?
The dataset looks like this: [sample data shown as a screenshot in the original post]
The code I am using is -
import pandas as pd
import numpy as np
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
df = pd.read_pickle('dataGT70.pkl')
cal = calendar()
def mark_holiday(df):
    # assign the result to a column; the original snippet discarded the apply() output
    df['includes_holiday'] = df.apply(
        lambda x: True if (len(cal.holidays(start=x['dep_date'], end=x['arr_date'])) > 0
                           and x['num_days'] < 20) else False,
        axis=1)
    return df
df = mark_holiday(df)
This took me about two minutes to run on a sample dataframe of 30m rows with two columns, start_date and end_date.
The idea is to get a sorted list of all holidays occurring on or after the minimum start date, and then to use bisect_left from the bisect module to determine the next holiday occurring on or after each start date. This holiday is then compared to the end date. If it is less than or equal to the end date, then there must be at least one holiday in the date range between the start and end dates (both inclusive).
from bisect import bisect_left
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
# Create sample dataframe of 10k rows with an interval of 1-19 days.
np.random.seed(0)
n = 10000 # Sample size, e.g. 10k rows.
years = np.random.randint(2010, 2019, n)
months = np.random.randint(1, 13, n)
days = np.random.randint(1, 29, n)
df = pd.DataFrame({'start_date': [pd.Timestamp(*x) for x in zip(years, months, days)],
                   'interval': np.random.randint(1, 20, n)})
df['end_date'] = df['start_date'] + pd.TimedeltaIndex(df['interval'], unit='d')
df = df.drop('interval', axis=1)
# Get a sorted list of holidays since the first start date.
hols = calendar().holidays(df['start_date'].min())
# Determine if there is a holiday between the start and end dates (both inclusive).
df['holiday_in_range'] = df['end_date'].ge(
    df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]))
>>> df.head(6)
start_date end_date holiday_in_range
0 2015-07-14 2015-07-31 False
1 2010-12-18 2010-12-30 True # 2010-12-24
2 2013-04-06 2013-04-16 False
3 2013-09-12 2013-09-24 False
4 2017-10-28 2017-10-31 False
5 2013-12-14 2013-12-29 True # 2013-12-25
So, for a given start_date timestamp (e.g. 2013-12-14), bisect_left(hols, '2013-12-14') would yield 39, and hols[39] is 2013-12-25, the next holiday falling on or after the 2013-12-14 start date. The next holiday is calculated as df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]). This holiday is then compared to the end_date: holiday_in_range is True if the end_date is greater than or equal to this holiday value; otherwise the holiday must fall after the end_date.
Have you already considered using pandas.merge_asof for this?
I could imagine that map and apply with lambda functions cannot be executed that efficiently.
UPDATE: Ah sorry, I just read that you only need a boolean indicating whether there are any holidays in between, which makes it much easier. If that's enough, you only need to perform steps 1-5 below, then group the DataFrame resulting from step 5 by start/end date and use count as the aggregate function to get the number of holidays in each range. Join this result back to your original dataset as in step 8, fill the remaining values with fillna(0), and do something like joined_df['includes_holiday'] = joined_df['joined_count_column'] > 0. After that, you can drop the joined count column from your DataFrame again if you like.
If you use pandas.merge_asof you could work through these steps (steps 6 and 7 are only necessary if you also need all the holidays between start and end in your result DataFrame, not just the booleans); a rough sketch of the simplified boolean variant appears after this list:
1. Load your holiday records into a DataFrame and index it on the date. The holidays should be one date per line (storing ranges, like 24th-26th for Christmas, in one row would make it much more complex).
2. Create a copy of your dataframe with just the start and end date columns. Every start/end date pair should occur only once in it, e.g. by using groupby.
3. Use merge_asof with a reasonable tolerance value (if you join on the start of the period, use direction='forward'; if you use the end date, use direction='backward') and how='inner'.
4. As a result you have a merged DataFrame with your start and end columns plus the date column from your holiday dataframe. You only get records for which a holiday was found within the given tolerance, but you can later merge this data back with your original DataFrame. You will probably now have duplicates of your original records.
5. Then check the joined holiday for each record by comparing it with the start and end columns, and remove the holidays which are not in between.
6. Sort the dataframe you obtained from step 5 (use something like df.sort_values(['start', 'end', 'holiday'], inplace=True)). Now insert a number column that numbers the holidays within each period from 1 to n (starting from 1 for each period). This is necessary in order to use unstack in the next step to get the holidays into columns.
7. Add an index on your dataframe based on period start date, period end date, and the count column you inserted in step 6. Use df.unstack(level=-1) on the DataFrame you prepared in the previous steps. What you now have is a condensed DataFrame with your original periods and the holidays arranged columnwise.
8. Now you only have to merge this DataFrame back to your original data using original_df.merge(df_from_step7, left_on=['start', 'end'], right_index=True, how='left').
The result of this is your original data with the date ranges, where for each date range the holidays that lie in between are stored in separate columns, each behind the data. Loosely speaking, the numbering in step 6 assigns the holidays to the columns from left to right (you wouldn't have a holiday in column 3 if column 1 is empty).
Step 6 is probably also a bit tricky, but you can do it, for example, by adding a series filled with a range and then fixing it so the numbering starts at 0 or 1 in each group, using shift, or by grouping by start/end with aggregate({'idcol': 'min'}) and joining the result back to subtract it from the value assigned by the range sequence.
In all, I think it sounds more complicated than it is, and it should perform quite efficiently. Especially if your periods are not that large, because then after step 5 your result set should be much smaller than your original dataframe; but even if that is not the case, it should still be quite efficient, since it can use compiled code.
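Here is a minimal sketch of the simplified boolean variant, under stated assumptions: the input frame df has datetime columns start_date and end_date (as in the other answer's sample data), the column names are illustrative, and the tolerance/count bookkeeping is skipped by checking only whether the first holiday on or after each start falls before the end:
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# one holiday per row, already sorted, as merge_asof requires
hols = pd.DataFrame({'holiday': USFederalHolidayCalendar().holidays(df['start_date'].min())})

# unique ranges, sorted on the join key
ranges = df[['start_date', 'end_date']].drop_duplicates().sort_values('start_date')

# for each range, find the first holiday on or after start_date ...
merged = pd.merge_asof(ranges, hols, left_on='start_date', right_on='holiday',
                       direction='forward')

# ... and flag ranges whose next holiday falls on or before end_date
merged['includes_holiday'] = merged['holiday'].le(merged['end_date'])

# join the flag back onto the original frame
df = df.merge(merged.drop(columns='holiday'), on=['start_date', 'end_date'], how='left')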
I'm opening a CSV file with two columns and about 10,000 rows. The first column has a unique date and time stamp (ascending in 30-minute intervals, called 'date_time') and the second column has an integer, 'intnum'. I use the date_time column as my index and then use conditions to sum only the integers that fall into specific date ranges. All of the conditions work perfectly, EXCEPT the last condition is based on matching those dates with the USFederalHolidayCalendar.
Here's the rub: the indexed date is more complex (e.g. '2015-02-16 12:30:00.00000') than the holiday list date (e.g. '2015-02-16', President's Day). So when I run an isin check against the holiday list, it doesn't find all of the integers associated with the whole day, because '2015-02-16 12:30:00.00000' is not equal to '2015-02-16', despite being the same day.
Code snippet:
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar, get_calendar
newcal = get_calendar('USFederalHolidayCalendar')
holidays = newcal.holidays(start='2010-01-01', end='2016-12-31')
filename = "/Users/Me/Desktop/test.csv"
int_array = pd.read_csv(filename, header=0, parse_dates=['date_time'], index_col='date_time')
intnum_total = int(int_array['intnum'][(int_array.index.month >= 2) &
                                       (int_array.index.month <= 3) &
                                       (int_array.index.hour >= 12) &
                                       (int_array.index.isin(holidays) == True)].sum())
print intnum_total
Now, I get no errors, so the syntax and functions work "properly", but I know for a fact the holiday match is not working.
Any thoughts?
Thanks ahead of time - this is my first post, so hopefully the formatting and question is clear.
Here are some thoughts...
Say you have a list of holidays for 2016:
cal = USFederalHolidayCalendar()
holidays = cal.holidays(start='2016-01-01', end='2016-12-31')
print holidays.size
Which yields:
10
So there are 10 holidays in 2016 based on USFederalHolidayCalendar.
You also have your DateTimeIndex, which, let's say is covering 2015 and 2016:
idx = pd.DatetimeIndex(pd.date_range(start='2015-1-1',
                                     end='2016-12-31', freq='30min'))
print idx.size
Which shows:
35041
Now, if I want to see how many holidays are in my 30-minute-based idx, I take the date part of the DatetimeIndex and compare it to the date part of the holidays:
idx[pd.DatetimeIndex(idx.date).isin(holidays.date)].size
Which would give me:
480
Which is 10 holidays * 24 hours * 2 half-hour intervals per hour.
Does that sound correct?
Note that when you do index.isin(other_index) you get back a boolean array, which is sufficient for indexing; you don't need the extra comparison index.isin(other_index) == True.
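Applied to the question's snippet, a minimal sketch (assuming int_array and holidays as defined there; DatetimeIndex.normalize() floors each timestamp to midnight so it compares cleanly against the holiday dates):
mask = ((int_array.index.month >= 2) &
        (int_array.index.month <= 3) &
        (int_array.index.hour >= 12) &
        int_array.index.normalize().isin(holidays))
intnum_total = int(int_array.loc[mask, 'intnum'].sum())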
Can't you just access the date from your timestamp and see if it is in your list of federal holidays? I don't know why you need your second integer index column; I would think a boolean value should suffice (e.g. fed_holiday).
# holidays as computed in the previous answer
df = pd.DataFrame(pd.date_range(start='2016-1-1', end='2016-12-31', freq='30min', name='ts'))
df['fed_holiday'] = [ts.date() in holidays for ts in df.ts]
>>> df.fed_holiday.sum() / (24 * 2.)
10.0