I have a column with data like this:
Date
'2021-01-01'
'2021-01-10'
'2021-01-09'
'2021-01-11'
I need only the year and month as one column, stored as an integer rather than a string; for example, '2021-01-01' should be saved as 202101. (I don't need the day part.)
When I try to clean the data I can do it, but it removes the leading zeroes.
df['period'] = df['Date'].str[:4] + df['Date'].str[6:7]
This gives me:
Date
20211
202110
20219
202111
As you can see, for the months January to September it returns only 1 to 9 instead of 01 to 09, which creates a discrepancy. If I add a zero manually as part of the merge, it turns '2021-10' into 2021010. I simply want the year and month without the hyphen, keeping the leading zero for the month. Below is how I would want it to appear in the new column.
Date
202101
202110
202109
202111
I can do it using a loop, but that's not efficient. Is there a better way to do it in Python?
The leading zeros are being dropped because the slice is off by one: [6:7] grabs only the second digit of the month.
Try changing your code to:
df['period'] = df['Date'].str[:4] + df['Date'].str[5:7]
Note the change from [6:7] to [5:7], which keeps both digits of the month.
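Since the goal is an integer rather than a string, the concatenation can be cast in the same step. A minimal sketch building on the corrected slice (assuming the values carry no literal quote characters):
# year is chars 0-3, zero-padded month is chars 5-6; cast the joined string to int
df['period'] = (df['Date'].str[:4] + df['Date'].str[5:7]).astype(int)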
Strip the quotation marks, coerce the column to datetime, format it as %Y%m, and convert it to an integer. Code below:
df['Date_edited'] = pd.to_datetime(df['Date'].str.strip("'")).dt.strftime('%Y%m').astype(int)
Date Date_edited
0 '2021-01-01' 202101
1 '2021-01-10' 202101
2 '2021-01-09' 202101
3 '2021-01-11' 202101
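An alternative sketch that skips the string formatting round trip and builds the integer arithmetically (again assuming the surrounding quotes are stripped first):
# parse once, then compute YYYYMM as year*100 + month, e.g. 2021-01-09 -> 202101
dates = pd.to_datetime(df['Date'].str.strip("'"))
df['period'] = dates.dt.year * 100 + dates.dt.month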
(image: table of movie data)
As you can see, the release date column stores the month as part of a string.
I want to change all of them into numbers.
For example, Dec 18, 2009 can just be 12. I am not interested in the year.
Update: I think I got it. They still show up as objects when I run .info(), but at least I was able to get the number.
You can convert it to datetime and use Series.dt.month
df['release date'] = pd.to_datetime(df['release date']).dt.month
print(df)
release date
0 12
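If every string has the same "Dec 18, 2009" shape, passing an explicit format avoids any parsing guesswork. A small self-contained sketch (the sample values here are made up for illustration):
import pandas as pd

df = pd.DataFrame({'release date': ['Dec 18, 2009', 'Jul 16, 2010']})
# '%b %d, %Y' matches strings like "Dec 18, 2009"; .dt.month yields an integer month
df['release date'] = pd.to_datetime(df['release date'], format='%b %d, %Y').dt.month
print(df)  # 12 and 7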
I have a DataFrame that has Date as its index. The DataFrame holds stock market data, so the dates are not continuous. If I want to move, let's say, 120 rows up in the DataFrame, how do I do that? For example:
If I want to get the data starting 120 trading days before the start of the year 2018, how do I modify the selection below?
df['2018-01-01':'2019-12-31']
Thanks
Try this:
df.iloc[df.index.get_loc('2018-01-01'):df.index.get_loc('2019-12-31') + 1]
Get the position of both dates in the index with get_loc and slice between them to get the desired range (the + 1 makes the end date inclusive).
UPDATE:
Based on your requirement, here are some small modifications of the above.
Yearly Indexing
>>> df.iloc[df.index.get_loc('2018').start:df.index.get_loc('2019').stop]
Here df.index.get_loc('2018') returns a slice object covering all rows of 2018; its .start attribute gives the position of the first row of 2018, and .stop on df.index.get_loc('2019') gives the position just past the last row of 2019.
Monthly Indexing
Now suppose you want data for the first 6 months of 2018 (without knowing what the first trading day is); the same can be done using:
>>> df.iloc[df.index.get_loc('2018-01').start:df.index.get_loc('2018-06').stop]
As you can see, the first six months of 2018 are selected using the same logic.
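Since the index here is a sorted DatetimeIndex, plain partial-string slicing with .loc covers the same yearly and monthly cases even more directly; a minimal sketch:
df.loc['2018':'2019']        # every row in 2018 and 2019
df.loc['2018-01':'2018-06']  # the first six months of 2018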
Assuming you are using pandas and the dataframe is sorted by dates, a very simple way would be:
initial_date = '2018-01-01'
initial_date_index = df.loc[df['dates'] == initial_date].index[0]
offset = 120
start_index = initial_date_index - offset
new_df = df.loc[start_index:]
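When the dates sit in the DataFrame's index rather than in a 'dates' column, the same offset idea can be written with searchsorted. A hedged sketch assuming a sorted DatetimeIndex of trading days:
import pandas as pd

pos = df.index.searchsorted(pd.Timestamp('2018-01-01'))  # first row on or after 2018-01-01
start = max(pos - 120, 0)                                # step back 120 trading rows
new_df = df.iloc[start:]                                 # from there onwards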
So I have a column which contains dates as string objects; however, the dates are not all in the same format. Some are MM/YYYY or just YYYY. I would like them all to be YYYY, and then to convert them to floats. I am trying to use a regular expression to replace these strings, but I am having difficulty. The column name is 'cease_date' and the DataFrame is called 'dete_resignations'.
pattern2 = r"(?P<cease_date>[1-2][0-9]{3})?"
years = dete_resignations['cease_date'].str.extractall(pattern2)
print(years['cease_date'].value_counts())
2013 146
2012 129
2014 22
2010 2
2006 1
So, from the above, the regular expression works, but I have no idea how to get the result back into the original DataFrame. I tried a boolean index but it didn't work. Am I going about this the wrong way?
You can use this regex to extract the last four digits in your strings:
years = dete_resignations['cease_date'].str.extract(r'(\d{4})$')[0]
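To fold the result back into the original DataFrame and get the floats the question asks for, the extracted column can be assigned directly. A small sketch using the names from the question:
# keep only the trailing 4-digit year, then cast; rows with no match become NaN
dete_resignations['cease_date'] = (
    dete_resignations['cease_date']
    .str.extract(r'(\d{4})$')[0]
    .astype(float)
)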
I am importing some stock data with annual report information into a pandas DataFrame, but the annual report end date falls in an odd month (end of January) rather than at the end of the year.
years = ['2017-01-31', '2016-01-31', '2015-01-31']
df = pd.DataFrame(data = years, columns = ['years'])
df
Out[357]:
years
0 2017-01-31
1 2016-01-31
2 2015-01-31
When I try to add a PeriodIndex showing the period of time the report data is valid for, it defaults to ending in December rather than inferring the month from the date string:
df.index = pd.PeriodIndex(df['years'], freq='A')
df.index
Out[367]: PeriodIndex(['2017', '2016', '2015'], dtype='period[A-DEC]',
name='years', freq='A-DEC')
Note that the frequency should be 'A-JAN'.
I assume this means that PeriodIndex cannot infer the period end from the date string I gave it.
I can change it using the asfreq method and anchored offsets, with "A-JAN" as the frequency string. But this changes all of the periods in the PeriodIndex at once rather than individually, and years can have different end dates for their annual report (in the case of a company that changed its reporting period).
Is there a way to interpret each date string and correctly set each period for each row in my pandas frame?
My end goal is to set a period column or index that has a frequency of 'annual' but with the period end date set to the date from the corresponding row of the years column.
Expanding this question a bit further: consider that I have many stocks, each with 3-4 years of annual financial data, all with varying start and end dates for their annual (or quarterly, for that matter) reporting periods.
Out[14]:
years tickers
0 2017-01-31 PG
1 2016-01-31 PG
2 2015-01-31 PG
3 2017-05-31 T
4 2016-05-31 T
5 2015-05-31 T
What I'm trying to get to is a column of Period objects configured with the proper end dates (from the years column), all with an annual frequency. I've thought about iterating through the years with apply/map or a lambda and the pd.Period constructor, though it may be that a PeriodIndex can't hold Period objects with varying end dates. Something like:
s = []
for row in df.years:
    s.append(pd.Period(row, freq='A'))
df['period'] = s
@KRkirov got me thinking. It appears the Period constructor is not smart enough to set the anchor month of the frequency by reading the date string. I was able to get the frequency end date right by building an anchor string from the end date of the reporting period, as follows:
# make sure 'years' is datetime so the .dt accessor works
df['years'] = pd.to_datetime(df['years'])
# month as a 3-letter abbreviation (e.g. "JAN")
df['offset'] = df['years'].dt.strftime('%b').str.upper()
# now build up the anchored offset string (e.g. "A-JAN";
# a quarterly report ending in January would use "Q-JAN")
df['offset_strings'] = 'A-' + df.offset
Anchored offset strings are documented in the pandas time series documentation.
Then iterate through the rows of the DataFrame to construct each Period, collect the Period objects in a list, and assign that list to a column (which becomes a proper PeriodIndex once it is used as the index):
ps = []
for i, r in df.iterrows():
    p = pd.Period(r['years'], freq=r['offset_strings'])
    ps.append(p)
df['period'] = ps
This gives Period objects set correctly, and a proper PeriodIndex once the column is used as the index:
df['period']
Out[40]:
0 2017
1 2016
2 2015
Name: period, dtype: object
df['period'][0]
Out[41]: Period('2017', 'A-JAN')
df.index = df.period
df.index
Out[43]: PeriodIndex(['2017', '2016', '2015'], dtype='period[A-JAN]',
name='period', freq='A-JAN')
Not pretty, but I could not find another way.
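For what it's worth, the same result can be reached without iterrows by building the anchor from each timestamp inline; a minimal sketch (assumes df['years'] is, or can be parsed to, datetime):
df['period'] = [
    pd.Period(ts, freq='A-' + ts.strftime('%b').upper())  # e.g. freq='A-JAN'
    for ts in pd.to_datetime(df['years'])
]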
I have a DataFrame with a column called "Date" and want all the values in this column reduced to just the year. Example:
City Date
Paris 01/04/2004
Lisbon 01/09/2004
Madrid 2004
Pekin 31/2004
What I want is:
City Date
Paris 2004
Lisbon 2004
Madrid 2004
Pekin 2004
Here is my code:
fr61_70xls = pd.ExcelFile('AMADEUS FRANCE 1961-1970.xlsx')
# Here we import the individual sheets and clean them
years = ['1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970']
fr = {}
header = ['City', 'Country', 'NACE', 'Cons', 'Last_year', 'Op_Rev_EUR_Last_avail_yr',
          'BvD_Indep_Indic', 'GUO_Name', 'Legal_status', 'Date_of_incorporation',
          'Legal_status_date']
for year in years:
    # save every sheet in fr['1961'], fr['1962'] and so on
    fr[year] = fr61_70xls.parse(year, header=0, parse_cols=10)
    fr[year].columns = header
    # drop the entire Legal status date column
    fr[year] = fr[year].drop(['Legal_status_date'], axis=1)
    # drop every row where GUO Name is empty
    fr[year] = fr[year].dropna(axis=0, how='all', subset=['GUO_Name'])
    fr[year] = fr[year].set_index(['GUO_Name', 'Date_of_incorporation'])
It happens that in my DataFrames (for example fr['1961']) the values of Date_of_incorporation can be anything (strings, integers, and so on), so maybe it would be best to erase this column completely and then attach another column with only the year to the DataFrames?
As @DSM points out, you can do this more directly using the vectorised string methods:
df['Date'].str[-4:].astype(int)
Or using extract (assuming there is only one set of digits of length 4 somewhere in each string):
df['Date'].str.extract(r'(?P<year>\d{4})').astype(int)
An alternative, slightly more flexible way might be to use apply (or equivalently map) to do this:
df['Date'] = df['Date'].apply(lambda x: int(str(x)[-4:]))
# converts the last 4 characters of the string to an integer
The lambda function takes each Date value and converts it to a year.
You could (and perhaps should) write this more verbosely as:
def convert_to_year(date_in_some_format):
    date_as_string = str(date_in_some_format)  # cast to string
    year_as_string = date_as_string[-4:]       # last four characters
    return int(year_as_string)
df['Date'] = df['Date'].apply(convert_to_year)
Perhaps 'Year' is a better name for this column...
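A different, hedged option is to let pandas parse what it can and pull the year out of real dates; note that rows pandas cannot parse into a full date (e.g. '31/2004') end up as NaN, so the slicing/extract approaches above are a better fit for this particular data:
# errors='coerce' turns unparseable strings into NaT, hence NaN years;
# dayfirst=True is assumed for the DD/MM/YYYY rows (it does not affect the year)
df['Year'] = pd.to_datetime(df['Date'], errors='coerce', dayfirst=True).dt.year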
You can do a column transformation by using apply.
Define a clean function to remove the dollar signs and commas and convert the data to float:
def clean(x):
    x = x.replace("$", "").replace(",", "").replace(" ", "")
    return float(x)
Next, call it on your column like this.
data['Revenue'] = data['Revenue'].apply(clean)
Or, if you want to use a lambda function inside apply:
data['Revenue'] = data['Revenue'].apply(lambda x: float(x.replace("$", "").replace(",", "").replace(" ", "")))
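The same cleanup can also be done without apply, using the vectorised string methods; a sketch assuming 'Revenue' holds strings like "$1,234.56":
data['Revenue'] = (
    data['Revenue']
    .str.replace(r'[$,\s]', '', regex=True)  # drop dollar signs, commas, whitespace
    .astype(float)
)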