I have a pandas dataframe with date values. I need to convert them from dates to Excel's numeric General format (serial numbers), not to date strings, so that they match primary key values in SQL, which are unfortunately stored in General format. Is it possible to do this in Python, or is the only way to convert this column to General format in Excel?
Here is what the dataframe's column looks like:
ID Desired Output
1/1/2022 44562
7/21/2024 45494
1/1/1931 11324
Yes, it's possible. Excel's General format for dates counts days starting from 1900-01-01, so you can compute the time delta between each date in ID and 1900-01-01.
Inspired by this post you could do...
import pandas as pd

data = pd.DataFrame({'ID': ['1/1/2022', '7/21/2024', '1/1/1931']})

# days since 1900-01-01, shifted by 2 (see the explanation below)
data['General format'] = (
    pd.to_datetime(data["ID"]) - pd.Timestamp("1900-01-01")
).dt.days + 2
print(data)
ID General format
0 1/1/2022 44562
1 7/21/2024 45494
2 1/1/1931 11324
The +2 is because:
Excel starts counting from 1 instead of 0
Excel incorrectly considers 1900 as a leap year
Excel stores dates as sequential serial numbers so that they can be
used in calculations. By default, January 1, 1900 is serial number 1,
and January 1, 2008 is serial number 39448 because it is 39,447 days
after January 1, 1900.
-Microsoft's documentation
So you can calculate (difference between your date and January 1, 1900) + 1, and add one extra day for dates after February 28, 1900 to account for Excel's phantom leap day; that is where the +2 above comes from.
See also: How to calculate number of days between two given dates
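Putting the two rules together, here is a minimal sketch of a reusable helper (the function name to_excel_serial is my own, not from the original post) that handles dates on either side of the phantom leap day:

import pandas as pd

def to_excel_serial(dates):
    """Convert datetimes to Excel 'General' serial numbers (1900 date system)."""
    dates = pd.to_datetime(dates)
    delta = (dates - pd.Timestamp("1900-01-01")).dt.days
    # serial 1 is 1900-01-01, so add 1; add one more day after 1900-02-28
    # to reproduce Excel's non-existent 29 February 1900.
    return delta + 1 + (dates > pd.Timestamp("1900-02-28")).astype(int)

print(to_excel_serial(pd.Series(['1/1/2022', '7/21/2024', '1/1/1931'])))
# 0    44562
# 1    45494
# 2    11324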
I'm looking to clean multiple data sets in a more automated way. The current format has the year in one column, the months across the top as separate columns, and the numbers as the values.
Below is an example of the current format; the original data has multiple years/months.
Current Format:

Year   Jan   Feb
2022   300   200
Below is an example of how I would like the new format to look. It combines month and year into one column and moves the numbers into another column.
How would I go about doing this in Excel or Python? I have files with many years and multiple months.
New Format:

Date      Number
2022-01   300
2022-02   200
Check the solution below. You will need to extend month_df with the remaining months; it currently only covers the example.
import pandas as pd

df = pd.DataFrame({'Year': [2022], 'Jan': [300], 'Feb': [200]})
# lookup table mapping month abbreviations to zero-padded month numbers
month_df = pd.DataFrame({'Char_Month': ['Jan', 'Feb'], 'Int_Month': ['01', '02']})

# wide -> long: one row per (Year, Month) pair
melted_df = pd.melt(df, id_vars=['Year'], value_vars=['Jan', 'Feb'],
                    var_name='Char_Month', value_name='Number')

# attach the numeric month, then build the YYYY-MM Date column
merged = pd.merge(melted_df, month_df, on='Char_Month')
result = merged.assign(Date=merged['Year'].astype(str) + '-' + merged['Int_Month'])[['Date', 'Number']]
print(result)
Output:

      Date  Number
0  2022-01     300
1  2022-02     200
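If you would rather not maintain a month lookup table at all, one alternative sketch is to parse the month abbreviation with pd.to_datetime (format '%b') and build the YYYY-MM string from the resulting month number:

import pandas as pd

df = pd.DataFrame({'Year': [2022], 'Jan': [300], 'Feb': [200]})

# melt all month columns at once, then derive the month number from the abbreviation
melted = df.melt(id_vars='Year', var_name='Month', value_name='Number')
month_num = pd.to_datetime(melted['Month'], format='%b').dt.month
melted['Date'] = melted['Year'].astype(str) + '-' + month_num.astype(str).str.zfill(2)
print(melted[['Date', 'Number']])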
I have a pandas dataframe with 3 columns:
OrderID_new (integer)
OrderTotal (float)
OrderDate_new (string or datetime sometimes)
Sales order ID's are in the first column, order values (totals) are in the 2nd column and order date - in mm/dd/yyyy format are in the last column.
I need to do 2 things:
to aggregate the order totals:
a) first into total sales per day and then
b) into total sales per calendar month
to convert the values in OrderDate_new from mm/dd/yyyy format (e.g. 01/30/2015) into Month Year format (e.g. January 2015).
The problem is that some input files already have the 3rd column (date) in datetime format while others have it as strings, which means string-to-datetime parsing is needed in some cases and only reformatting of an existing datetime in others.
I have been trying to do 2 step aggregation with groupby but I'm getting some strange daily and monthly totals that make no sense.
What I need as the final stage is time series with 2 columns - 1. monthly sales and 2. month (Month Year)...
Then I will need to select and train some model for monthly sales time series forecast (out of scope for this question)...
What am I doing wrong?
How to do it effectively in Python?
dataframe example:
You did not provide usable sample data, hence I've synthesized some.
resample() allows you to roll up on a date column; daily and monthly rollups are shown below.
pd.to_datetime() gives you what you want for normalizing the mixed string/datetime column.
import datetime as dt
import numpy as np
import pandas as pd

def mydf(size=10):
    # synthesize sample orders with random IDs, totals and dates
    return pd.DataFrame({"OrderID_new": np.random.randint(100, 200, size),
                         "OrderTotal": np.random.randint(200, 10000, size),
                         "OrderDate_new": np.random.choice(
                             pd.date_range(dt.date(2019, 8, 1), dt.date(2020, 1, 1)), size)})

# smash OrderDate_new to a string for some rows, to mimic the mixed input files
df = pd.concat([mydf(5),
                mydf(5).assign(OrderDate_new=lambda dfa: dfa.OrderDate_new.dt.strftime("%Y/%m/%d"))])

# make sure everything is a datetime
df.OrderDate_new = pd.to_datetime(df.OrderDate_new)

# daily and monthly totals ("M" is month end; pandas >= 2.2 prefers "ME")
daily = df.resample("D", on="OrderDate_new")["OrderTotal"].sum()
monthly = df.resample("M", on="OrderDate_new")["OrderTotal"].sum()
I have an excel file with a date column. Is there a way to change the date format to MM-DD-YY and create one more column with Quarter & Year? I am very new to Python and I would really appreciate it if you could help me with this one. Thanks!
Current format
Date format: Jan 1, 2016
Desired outcome
Date format: 01/01/2016
One more additional column with something like this "Q1-2016"
Python's datetime module has got you covered. For input:
from datetime import datetime

myDate = datetime.strptime(<datestring>, "%b %d, %Y")
And for output:
print(myDate.strftime("%m/%d/%Y"))
Getting the quarter is a little harder, but you can use myDate.month to work it out from month ranges. See also the Python datetime reference.
For example, using integer division so that January-March are Q1, April-June are Q2, etc. (the month is shifted down by one before dividing):
print("Q%d-%d" % ((myDate.month - 1) // 3 + 1, myDate.year))
I am importing some stock data with annual report information into a pandas DataFrame, but the annual report end date falls in an odd month (end of January) rather than at the end of the year.
years = ['2017-01-31', '2016-01-31', '2015-01-31']
df = pd.DataFrame(data = years, columns = ['years'])
df
Out[357]:
years
0 2017-01-31
1 2016-01-31
2 2015-01-31
When I try to add in a PeriodIndex which shows the period of time the report data is valid for, it defaults to ending in December rather than inferring it from the date string
df.index = pd.PeriodIndex(df['years'], freq ='A')
df.index
Out[367]: PeriodIndex(['2017', '2016', '2015'], dtype='period[A-DEC]',
name='years', freq='A-DEC')
Note that the frequency should be 'A-JAN'.
I assume this means that the end date can't be inferred from PeriodIndex and the end date string I gave it.
I can change it using the asfreq method and anchored offsets, with "A-JAN" as the frequency string. But this changes all of the individual periods in the PeriodIndex at once rather than individually, and years can have different reporting end dates for their annual report (in the case of a company that changed its reporting period).
Is there a way to interpret each date string and correctly set each period for each row in my pandas frame?
My end goal is to set a period column or index that has a frequency of 'annual' but with the period end date set to the date from the corresponding row of the years column.
Expanding this question a bit further: consider that I have many stocks with 3-4 years of annual financial data each, all with varying start and end dates for their annual reporting frequencies (or quarterly for that matter).
Out[14]:
years tickers
0 2017-01-31 PG
1 2016-01-31 PG
2 2015-01-31 PG
3 2017-05-31 T
4 2016-05-31 T
5 2015-05-31 T
What I'm trying to get to is a column of proper Period objects that are configured with the proper end dates (from the years column), all with annual frequencies. I've thought about iterating through the years and using map/apply or a lambda with the pd.Period constructor. It may be that a PeriodIndex can't hold Period objects with varying end dates. Something like:
s = []
for row in df.years:
    s.append(pd.Period(row, freq='A'))
df['period'] = s
@KRkirov got me thinking. It appears the Period constructor is not smart enough to set the end date of the frequency by reading the date string. I was able to get the frequency end date right by building an anchored offset string from the end date of the reporting period, as follows:
# make sure the years column is datetime first
df['years'] = pd.to_datetime(df['years'])

# return the month as a 3-letter abbreviation (e.g. "JAN")
df['offset'] = df['years'].dt.strftime('%b').str.upper()

# now build up an anchored offset string (e.g. "A-JAN" for an annual
# report ending in January, or "Q-JAN" for a quarterly report)
df['offset_strings'] = "A" + '-' + df.offset
Anchor strings are documented in the pandas docs here.
And then iterate through the rows of the DataFrame to construct each Period and put it in a list, then add the list of Period objects (which is coerced to a PeriodIndex) to a column.
ps = []
for i, r in df.iterrows():
    # build each Period with its own anchored annual frequency
    p = pd.Period(r['years'], freq=r['offset_strings'])
    ps.append(p)
df['period'] = ps
This returns a proper PeriodIndex with the Period Objects set correctly:
df['period']
Out[40]:
0 2017
1 2016
2 2015
Name: period, dtype: object
df['period'][0]
Out[41]: Period('2017', 'A-JAN')
df.index = df.period
df.index
Out[43]: PeriodIndex(['2017', '2016', '2015'], dtype='period[A-JAN]',
name='period', freq='A-JAN')
Not pretty, but I could not find another way.
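For what it's worth, the loop can also be collapsed into a single list comprehension; this is just a sketch, assuming df['years'] holds (or is converted to) datetimes:

# build each Period directly with its own anchored annual frequency
df['period'] = [
    pd.Period(ts, freq='A-' + ts.strftime('%b').upper())
    for ts in pd.to_datetime(df['years'])
]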
I am trying to convert a dataframe column with a date and timestamp to a year-weeknumber format, i.e., 01-05-2017 03:44 = 2017-1. This is pretty easy; however, I am stuck on dates that fall in a new year yet whose week number is still the last week of the previous year. The same thing happens here.
I did the following:
df['WEEK_NUMBER'] = df.date.dt.year.astype(str).str.cat(df.date.dt.week.astype(str), sep='-')
Where df['date'] is a very large column with date and times, ranging over multiple years.
A date which gives a problem is for example:
Timestamp('2017-01-01 02:11:27')
The output for my code will be 2017-52, while it should be 2016-52. Since the data covers multiple years, and weeknumbers and their corresponding dates change every year, I cannot simply subtract a few days.
Does anybody have an idea of how to fix this? Thanks!
Replace df.date.dt.year with this:
df.date.dt.year - ((df.date.dt.week > 50) & (df.date.dt.month == 1))
Basically, this subtracts 1 from the year value when the week number is greater than 50 and the month is January.
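Alternatively, on pandas 1.1+ the ISO year is available directly from Series.dt.isocalendar(), which avoids the manual adjustment entirely; a short sketch:

import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2017-01-01 02:11:27', '2017-01-05 10:00:00'])})

# isocalendar() returns ISO year, week and day; the ISO year already
# rolls back to 2016 for 2017-01-01, which falls in week 52 of 2016.
iso = df['date'].dt.isocalendar()
df['WEEK_NUMBER'] = iso['year'].astype(str) + '-' + iso['week'].astype(str)
print(df)
#                  date WEEK_NUMBER
# 0 2017-01-01 02:11:27     2016-52
# 1 2017-01-05 10:00:00      2017-1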