Converting string to date-time pandas - python

I am fetching data from an API into a pandas dataframe whose index values are as follows:-
df.index=['Q1-2013',
'Q1-2014',
'Q1-2015',
'Q1-2016',
'Q1-2017',
'Q1-2018',
'Q2-2013',
'Q2-2014',
'Q2-2015',
'Q2-2016',
'Q2-2017',
'Q2-2018',
'Q3-2013',
'Q3-2014',
'Q3-2015',
'Q3-2016',
'Q3-2017',
'Q3-2018',
'Q4-2013',
'Q4-2014',
'Q4-2015',
'Q4-2016',
'Q4-2017',
'Q4-2018']
It is a list of string values. Is there a way to convert this to pandas datetime?
I explored few Q&A and they are about using pd.to_datetime which works when the index is of object type.
In this example, index values are strings.
Expected output:
new_df=magic_function(df.index)
print(new_df.index[0])
01-2013
Wondering how to build "magic_function". Thanks in advance.
Q1 is quarter1 which is January, Q2 is quarter2 which is April and Q3 is quarter3 which is July, Q4 is quarter4 which is October

With a bit of manipulation for the parsing to work, you can use pd.PeriodIndex and format as wanted (reason being that the format %Y%q is expected):
df.index = [''.join(s.split('-')[::-1]) for s in df.index]
df.index = pd.PeriodIndex(df.index, freq='Q').to_timestamp().strftime('%m-%Y')
print(df.index)
Index(['01-2013', '01-2014', '01-2015', '01-2016', '01-2017', '01-2018',
'04-2013', '04-2014', '04-2015', '04-2016', '04-2017', '04-2018',
'07-2013', '07-2014', '07-2015', '07-2016', '07-2017', '07-2018',
'10-2013', '10-2014', '10-2015', '10-2016', '10-2017', '10-2018'],
dtype='object')
We could also get the required format using str.replace:
df.index = df.index.str.replace(r'(Q\d)-(\d+)', r'\2\1')
df.index = pd.PeriodIndex(df.index, freq='Q').to_timestamp().strftime('%m-%Y')

You can map a function to index: pandas.Index.map
quarter_months = {
'Q1': 1,
'Q2': 4,
'Q3': 7,
'Q4': 10,
}
def quarter_to_month_year(quarter_year):
quarter, year = quarter_year.split('-')
month_year = '%s-%s'%(quarter_months[quarter], year)
return pd.to_datetime(month_year, format='%m-%Y')
df.index = df.index.map(quarter_to_month_year)
This would produce following result:
DatetimeIndex(['2013-01-01', '2014-01-01', '2015-01-01', '2016-01-01',
'2017-01-01', '2018-01-01', '2013-04-01', '2014-04-01',
'2015-04-01', '2016-04-01', '2017-04-01', '2018-04-01',
'2013-07-01', '2014-07-01', '2015-07-01', '2016-07-01',
'2017-07-01', '2018-07-01', '2013-10-01', '2014-10-01',
'2015-10-01', '2016-10-01', '2017-10-01', '2018-10-01'],
dtype='datetime64[ns]', name='index', freq=None)

to_datetime() function https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
It is a datetime64 object when applying to_datetime(), to_period() turns it into a period object, further modifications like to_timestamp().strftime('%m-%Y') turn the index items into strings :
import pandas as pd
df = pd.DataFrame(index=['Q1-2013',
'Q1-2014',
'Q1-2015',
'Q1-2016',
'Q1-2017',
'Q1-2018',
'Q2-2013',
'Q2-2014',
'Q2-2015',
'Q2-2016',
'Q2-2017',
'Q2-2018',
'Q3-2013',
'Q3-2014',
'Q3-2015',
'Q3-2016',
'Q3-2017',
'Q3-2018',
'Q4-2013',
'Q4-2014',
'Q4-2015',
'Q4-2016',
'Q4-2017',
'Q4-2018'])
# df_new = pd.DataFrame(index=pd.to_datetime(['-'.join(s.split('-')[::-1]) for s in df.index]))
df_new = pd.DataFrame(index=pd.to_datetime(['-'.join(s.split('-')[::-1]) for s in df.index]).to_period('M'))
# df_new = pd.DataFrame(index=pd.to_datetime(['-'.join(s.split('-')[::-1]) for s in df.index]).to_period('M').to_timestamp().strftime('m-%Y'))
print(df_new.index)
PeriodIndex(['2013-01', '2014-01', '2015-01', '2016-01', '2017-01', '2018-01',
'2013-04', '2014-04', '2015-04', '2016-04', '2017-04', '2018-04',
'2013-07', '2014-07', '2015-07', '2016-07', '2017-07', '2018-07',
'2013-10', '2014-10', '2015-10', '2016-10', '2017-10', '2018-10'],
dtype='period[M]', freq='M')

Related

Convert np datetime64 column to pandas DatetimeIndex with frequency attribute set correctly

Reproducing the data I have:
import numpy as np
import pandas as pd
dts = ['2016-01-01', '2016-02-01', '2016-03-01', '2016-04-01',
'2016-05-01', '2016-06-01', '2016-07-01', '2016-08-01',
'2016-09-01', '2016-10-01', '2016-11-01', '2016-12-01',
'2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01']
my_df = pd.DataFrame({'col1': range(len(dts)), 'month_beginning': dts})#, dtype={'month_beginning': np.datetime64})
my_df['month_beginning'] = my_df.month_beginning.astype(np.datetime64)
And what I want is to set month_beginning as a datetime index, and specifically I need it to have the frequency attribute set correctly as monthly
Here's what I've tried so far, and how they have not worked:
First attempt
my_df = my_df.set_index('month_beginning')
...however after executing the above, my_df.index shows a DatetimeIndex but with freq=None.
Second attempt
dt_idx = pd.DatetimeIndex(my_df.month_beginning, freq='M')
...but that throws the following error:
ValueError: Inferred frequency MS from passed values does not conform to passed frequency M
...This is particularly confusing to me given that, as can be checked in my data above, my dts/month-beginning series is in fact monthly and is not missing any months...
you could convert the time series to the specified frequency using asfreq:
import pandas as pd
dts = ['2016-01-01', '2016-02-01', '2016-03-01', '2016-04-01',
'2016-05-01', '2016-06-01', '2016-07-01', '2016-08-01',
'2016-09-01', '2016-10-01', '2016-11-01', '2016-12-01',
'2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01']
df = pd.DataFrame({'col1': range(len(dts)), 'month_beginning': dts})
df['month_beginning'] = pd.to_datetime(df['month_beginning'])
df.index = df['month_beginning']
df = df.asfreq("MS")
df.index
DatetimeIndex(['2016-01-01', '2016-02-01', '2016-03-01', '2016-04-01',
'2016-05-01', '2016-06-01', '2016-07-01', '2016-08-01',
'2016-09-01', '2016-10-01', '2016-11-01', '2016-12-01',
'2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01'],
dtype='datetime64[ns]', name='month_beginning', freq='MS')

Get column names with corresponding index in python pandas

I have this dataframe df where
>>> df = pd.DataFrame({'Date':['10/2/2011', '11/2/2011', '12/2/2011', '13/2/11'],
'Event':['Music', 'Poetry', 'Theatre', 'Comedy'],
'Cost':[10000, 5000, 15000, 2000],
'Name':['Roy', 'Abraham', 'Blythe', 'Sophia'],
'Age':['20', '10', '13', '17']})
I want to determine the column index with the corresponding name. I tried it with this:
>>> list(df.columns)
But the solution above only returns the column names without index numbers.
How can I code it so that it would return the column names and the corresponding index for that column? Like This:
0 Date
1 Event
2 Cost
3 Name
4 Age
Simpliest is add pd.Series constructor:
pd.Series(list(df.columns))
Or convert columns to Series and create default index:
df.columns.to_series().reset_index(drop=True)
Or:
df.columns.to_series(index=False)
You can use loop like this:
myList = list(df.columns)
index = 0
for value in myList:
print(index, value)
index += 1
A nice short way to get a dictionary:
d = dict(enumerate(df))
output: {0: 'Date', 1: 'Event', 2: 'Cost', 3: 'Name', 4: 'Age'}
For a Series, pd.Series(list(df)) is sufficient as iteration occurs directly on the column names
In addition to using enumerate, this also can get a numbers in order using zip, as follows:
import pandas as pd
df = pd.DataFrame({'Date':['10/2/2011', '11/2/2011', '12/2/2011', '13/2/11'],
'Event':['Music', 'Poetry', 'Theatre', 'Comedy'],
'Cost':[10000, 5000, 15000, 2000],
'Name':['Roy', 'Abraham', 'Blythe', 'Sophia'],
'Age':['20', '10', '13', '17']})
result = list(zip([i for i in range(len(df.columns))], df.columns.values,))
for r in result:
print(r)
#(0, 'Date')
#(1, 'Event')
#(2, 'Cost')
#(3, 'Name')
#(4, 'Age')

a KeyError when trying to forecast using ExponentialSmoothing

I'm trying to forecast some data about my city in terms of population. I have a table showing the population of my city from 1950 till 2021. Using pandas and ExpotentialSmoothing, I'm trying to forecast and see the next 10 years how much my city will have population. I'm stuck here:
train_data = df.iloc[:60]
test_data = df.iloc[59:]
fitted = ExponentialSmoothing(train_data["Population"],
trend = "add",
seasonal = "add",
seasonal_periods=12).fit()
fitted.forecast(10)
However, I get this message:
'The start argument could not be matched to a location related to the index of the data.'
Update: here are some codes from my work:
Jeddah_tb = pd.read_html("https://www.macrotrends.net/cities/22421/jiddah/population", match ="Jiddah - Historical Population Data", parse_dates=True)
df['Year'] = pd.to_datetime(df['Year'], format="%Y")
df.set_index("Year", inplace=True)
and here is the index:
DatetimeIndex(['2021-01-01', '2020-01-01', '2019-01-01', '2018-01-01',
'2017-01-01', '2016-01-01', '2015-01-01', '2014-01-01',
'2013-01-01', '2012-01-01', '2011-01-01', '2010-01-01',
'2009-01-01', '2008-01-01', '2007-01-01', '2006-01-01',
'2005-01-01', '2004-01-01', '2003-01-01', '2002-01-01',
'2001-01-01', '2000-01-01', '1999-01-01', '1998-01-01',
'1997-01-01', '1996-01-01', '1995-01-01', '1994-01-01',
'1993-01-01', '1992-01-01', '1991-01-01', '1990-01-01',
'1989-01-01', '1988-01-01', '1987-01-01', '1986-01-01',
'1985-01-01', '1984-01-01', '1983-01-01', '1982-01-01',
'1981-01-01', '1980-01-01', '1979-01-01', '1978-01-01',
'1977-01-01', '1976-01-01', '1975-01-01', '1974-01-01',
'1973-01-01', '1972-01-01', '1971-01-01', '1970-01-01',
'1969-01-01', '1968-01-01', '1967-01-01', '1966-01-01',
'1965-01-01', '1964-01-01', '1963-01-01', '1962-01-01',
'1961-01-01', '1960-01-01', '1959-01-01', '1958-01-01',
'1957-01-01', '1956-01-01', '1955-01-01', '1954-01-01',
'1953-01-01', '1952-01-01', '1951-01-01', '1950-01-01'],
dtype='datetime64[ns]', name='Year', freq='-1AS-JAN')
I didn't face any issue while trying to reproduce your code. However, before for time series forecasting make sure your data is in ascending order of dates. df = df.sort_values(by='Year',ascending = True). In your case, train_data is from 2021 to 1962 and test_data is from 1962-1950. So you are training on recent data but testing it on past. So sort your dataframe in ascending order. Also make test_data = df.iloc[60:] because 1962 is present in both train_data and test_data.

list of DataFrames as an argument of function/loop

I have multiple DataFrame and I need to perform various operations on them. I want to put them in one list to avoid listing them all the time as in the example bellow:
for df in (df1, df2,df3,df4,df5,df6,df7):
df.columns=['COUNTRY','2018','2019']
df.replace({':':''}, regex=True, inplace=True)
df.replace({' ':''}, regex=True, inplace=True)
df["2018"] = pd.to_numeric(df["2018"], downcast="float")
df["2019"] = pd.to_numeric(df["2019"], downcast="float")
I tried to make a list of them (DataFrames=[df1,df2,df3,df4,df5,df6,df7]) but it's working neither in the loop or as an argument of a function.
for df in (DataFrame):
df.columns=['COUNTRY','2018','2019']
df.replace({':':''}, regex=True, inplace=True)
df.replace({' ':''}, regex=True, inplace=True)
df["2018"] = pd.to_numeric(df["2018"], downcast="float")
df["2019"] = pd.to_numeric(df["2019"], downcast="float")
you can place the dataframes on a list and add the columns like this:
import pandas as pd
from pandas import DataFrame
data = {'COUNTRY': ['country1', 'country2', 'country3'],
'2018': [12.0, 27, 35],
'2019': [23, 39.6, 40.3],
'2020': [35, 42, 56]}
df_list = [DataFrame(data), DataFrame(data), DataFrame(data),
DataFrame(data), DataFrame(data), DataFrame(data),
DataFrame(data)]
def change_dataframes(data_frames=None):
for df in data_frames:
df = df.loc[:, ['COUNTRY', '2018', '2019']]
df.replace({':': ''}, regex=True, inplace=True)
df.replace({' ': ''}, regex=True, inplace=True)
pd.to_numeric(df['2018'], downcast="float")
pd.to_numeric(df['2019'], downcast="float")
return data_frames
Using nunvie's answer as a base, here is another option for you:
import pandas as pd
data = {
'COUNTRY': ['country1', 'country2', 'country3'],
'2018': ['12.0', '27', '35'],
'2019': ['2:3', '3:9.6', '4:0.3'],
'2020': ['35', '42', '56']
}
df_list = [pd.DataFrame(data) for i in range(5)]
def data_prep(df: pd.DataFrame):
df = df.loc[:, ['COUNTRY', '2018', '2019']]
df.replace({':': ''}, regex=True, inplace=True)
df.replace({' ': ''}, regex=True, inplace=True)
df['2018'] = pd.to_numeric(df['2018'], downcast="float")
df['2019'] = pd.to_numeric(df['2019'], downcast="float")
return df
new_df_list = map(data_prep, df_list)
The improvements (in my opinion) are as follows. First, it is more concise to use list comprehension for the test setup (that's not directly related to the answer). Second, pd.to_numeric doesn't have inplace (at least in pandas 1.2.3). It returns the series you passed if the parsing succeeded. Thus, you need to explicitly say df['my_col'] = pd.to_numeric(df['my_col']).
And third, I've used map to apply the data_prep function to each DataFrame in the list. This makes data_prep responsible for only one data frame and also saves you from writing loops. The benefit is leaner and more readable code, if you like the functional flavour of it, of course.

Changing part of a column name in pandas?

I have a list of columns that I want to rename a portion of based on a list of values.
I am looking at a file which has 12 months of data and each month is a different column (I need to keep it in this specific format unfortunately). This file is generated once per month and I keep the column names more general since I have to do a lot of calculations on them based the month number (for example, I need to compare information against the average of month 8, 9, and 10 every month).
Here are the columns I want to rename:
['month_1_Sign',
'month_2_Sign',
'month_3_Sign',
'month_4_Sign',
'month_5_Sign',
'month_6_Sign',
'month_7_Sign',
'month_8_Sign',
'month_9_Sign',
'month_10_Sign',
'month_11_Sign',
'month_12_Sign',
'month_1_Actual',
'month_2_Actual',
'month_3_Actual',
'month_4_Actual',
'month_5_Actual',
'month_6_Actual',
'month_7_Actual',
'month_8_Actual',
'month_9_Actual',
'month_10_Actual',
'month_11_Actual',
'month_12_Actual',
'month_1_Target',
'month_2_Target',
'month_3_Target',
'month_4_Target',
'month_5_Target',
'month_6_Target',
'month_7_Target',
'month_8_Target',
'month_9_Target',
'month_10_Target',
'month_11_Target',
'month_12_Target']
Here are the names I want to place:
required_date_range = sorted(list(pd.Series(pd.date_range((dt.datetime.today().date() + pd.DateOffset(months=-13)), periods=12, freq='MS')).dt.strftime('%Y-%m-%d')))
['2015-03-01',
'2015-04-01',
'2015-05-01',
'2015-06-01',
'2015-07-01',
'2015-08-01',
'2015-09-01',
'2015-10-01',
'2015-11-01',
'2015-12-01',
'2016-01-01',
'2016-02-01']
So month_12 columns (month_12 is always the latest month) would be changed to '2016-02-01_Sign', '2016-02-01_Actual', '2016-02-01_Target' in this example.
I tried doing this but it doesn't change anything (trying to change the month_# with the actual date it refers to):
final.replace('month_10', required_date_range[9], inplace=True)
final.replace('month_11', required_date_range[10], inplace=True)
final.replace('month_12', required_date_range[11], inplace=True)
final.replace('month_1', required_date_range[0], inplace=True)
final.replace('month_2', required_date_range[1], inplace=True)
final.replace('month_3', required_date_range[2], inplace=True)
final.replace('month_4', required_date_range[3], inplace=True)
final.replace('month_5', required_date_range[4], inplace=True)
final.replace('month_6', required_date_range[5], inplace=True)
final.replace('month_7', required_date_range[6], inplace=True)
final.replace('month_8', required_date_range[7], inplace=True)
final.replace('month_9', required_date_range[8], inplace=True)
You could construct a dict and then call map on the split column str:
In [27]:
d = dict(zip([str(x) for x in range(1,13)], required_date_range))
d
Out[27]:
{'1': '2015-03-01',
'10': '2015-12-01',
'11': '2016-01-01',
'12': '2016-02-01',
'2': '2015-04-01',
'3': '2015-05-01',
'4': '2015-06-01',
'5': '2015-07-01',
'6': '2015-08-01',
'7': '2015-09-01',
'8': '2015-10-01',
'9': '2015-11-01'}
In [26]:
df.columns = df.columns.to_series().str.rsplit('_').str[1].map(d) + '_' + df.columns.to_series().str.rsplit('_').str[-1]
df.columns
Out[26]:
Index(['2015-03-01_Sign', '2015-04-01_Sign', '2015-05-01_Sign',
'2015-06-01_Sign', '2015-07-01_Sign', '2015-08-01_Sign',
'2015-09-01_Sign', '2015-10-01_Sign', '2015-11-01_Sign',
'2015-12-01_Sign', '2016-01-01_Sign', '2016-02-01_Sign',
'2015-03-01_Actual', '2015-04-01_Actual', '2015-05-01_Actual',
'2015-06-01_Actual', '2015-07-01_Actual', '2015-08-01_Actual',
'2015-09-01_Actual', '2015-10-01_Actual', '2015-11-01_Actual',
'2015-12-01_Actual', '2016-01-01_Actual', '2016-02-01_Actual',
'2015-03-01_Target', '2015-04-01_Target', '2015-05-01_Target',
'2015-06-01_Target', '2015-07-01_Target', '2015-08-01_Target',
'2015-09-01_Target', '2015-10-01_Target', '2015-11-01_Target',
'2015-12-01_Target', '2016-01-01_Target', '2016-02-01_Target'],
dtype='object')
You're going to want to use the .rename method instead of the .replace! For instance this code:
import pandas as pd
d = {'a': [1, 2, 4], 'b':[2,3,4],'c':[3,4,5]}
df = pd.DataFrame(d)
df.rename(columns={'a': 'x1', 'b': 'x2'}, inplace=True)
Changes the 'a' and 'b' column title to 'x1' and 'x2' respectively.
The first line of the renaming code you have would change to:
final.rename(columns={'month_10':required_date_range[9]}, inplace=True)
In fact you could do every column in that one command by adding entries to the columns dictionary argument.
final.rename(columns={'month_10':required_date_range[9],
'month_9':required_date-range[8], ... (and so on) }, inplace=True)
from collections import product
df = pd.DataFrame(np.random.rand(3, 12 * 3), columns=['month_' + str(c[0]) + '_' + c[1] for c in product(range(1, 13), ['Sign', 'Actual', 'Target'])])
First create a mapping to the relevant months.
mapping = {'month_' + str(n): date for n, date in enumerate(required_date_range, 1)}
>>> mapping
{'month_1': '2015-03-01',
'month_10': '2015-12-01',
'month_11': '2016-01-01',
'month_12': '2016-02-01',
'month_2': '2015-04-01',
'month_3': '2015-05-01',
'month_4': '2015-06-01',
'month_5': '2015-07-01',
'month_6': '2015-08-01',
'month_7': '2015-09-01',
'month_8': '2015-10-01',
'month_9': '2015-11-01'}
Then reassign columns, joining the mapped month (e.g. '2016-02-01') to the rest of the column name. This was done using a list comprehension.
df.columns = [mapping.get(c[:c.find('_', 6)]) + c[c.find('_', 6):] for c in cols]
>>> df.columns.tolist()
['2015-03-01_Sign',
'2015-04-01_Sign',
'2015-05-01_Sign',
'2015-06-01_Sign',
'2015-07-01_Sign',
'2015-08-01_Sign',
'2015-09-01_Sign',
'2015-10-01_Sign',
'2015-11-01_Sign',
'2015-12-01_Sign',
'2016-01-01_Sign',
'2016-02-01_Sign',
'2015-03-01_Actual',
'2015-04-01_Actual',
'2015-05-01_Actual',
'2015-06-01_Actual',
'2015-07-01_Actual',
'2015-08-01_Actual',
'2015-09-01_Actual',
'2015-10-01_Actual',
'2015-11-01_Actual',
'2015-12-01_Actual',
'2016-01-01_Actual',
'2016-02-01_Actual',
'2015-03-01_Target',
'2015-04-01_Target',
'2015-05-01_Target',
'2015-06-01_Target',
'2015-07-01_Target',
'2015-08-01_Target',
'2015-09-01_Target',
'2015-10-01_Target',
'2015-11-01_Target',
'2015-12-01_Target',
'2016-01-01_Target',
'2016-02-01_Target']

Categories

Resources