Problem with the date format in a column of my DataFrame - python

So I have a column which contains dates as string objects, however the dates are not all in the same format. Some are MM/YYYY or YYYY. I would like them to be all YYYY, and then convert them to floating objects. I am trying to use a regular expression to replace these strings but I am having difficulty. The column name is 'cease_date' and the DF is called 'dete_resignations'.
pattern2 = r"(?P<cease_date>[1-2][0-9]{3})?"
years = dete_resignations['cease_date'].str.extractall(pattern2)
print(years['cease_date'].value_counts())
2013 146
2012 129
2014 22
2010 2
2006 1
So from the above the regular expression works, but I have no idea how to get it back into the original dataframe. I tried doing a boolean index but it didn't work. Am I going about this the wrong way?

You can use this regex to extract the last four digits in your strings:
years = dete_resignations['cease_date'].str.extract(r'(\d{4})$')[0]
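To get the result back into the original DataFrame and convert it to floats, as the question asks, something along these lines might work. This is a minimal sketch: the sample values below are hypothetical stand-ins for the real 'cease_date' column, and `expand=False` is used so that `str.extract` returns a Series rather than a DataFrame.

```python
import pandas as pd

# Hypothetical sample standing in for dete_resignations['cease_date'].
dete_resignations = pd.DataFrame(
    {"cease_date": ["08/2012", "2012", "05/2013", "Not Stated", None]}
)

# expand=False returns a Series, so it can be assigned straight back.
years = dete_resignations["cease_date"].str.extract(r"(\d{4})$", expand=False)

# Rows without a trailing four-digit year become NaN, which float can hold.
dete_resignations["cease_date"] = years.astype(float)

print(dete_resignations["cease_date"].tolist())
```

Because the extracted Series keeps the original index, the assignment lines up row by row and no boolean indexing is needed.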

Related

how can I extract year from a column in python. the data is in this form: 'October 1, 2020 (United States)'?

I am trying to apply a different approach but nothing is working as I can't slice the text as the month fields have variable length.
I tried slicing and extracting as well, but it makes a new dataframe and makes the code longer because then I have to split the column first, extract the year, and then concatenate the values back to the dataframe.
Use str.split() to split each string into a list of words. The year is the third item, which you can select for every row with .str[2] and convert to an int.
df = pd.DataFrame({'date': ['October 1 2022 (United States)']})
df['year'] = df['date'].str.split().str[2].astype(int)
Output:
date year
October 1 2022 (United States) 2022
You can also use regex and pd.Series.str.extract:
df['year'] = df['date'].str.extract(r'(?P<Year>\d{4}(?=(?:\s+\()))')
df
date year
0 October 1 2022 (United States) 2022
The regular expression matches values that follow the same pattern as your sample date: four digits followed by whitespace and an opening parenthesis. If your data contains other patterns, a more flexible regex would be needed.
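As an illustration of what a more flexible pattern could look like, here is a sketch that does not rely on the parenthesis. It assumes the year is a standalone four-digit number starting with 19 or 20; that assumption, and the sample rows, are hypothetical and not taken from the question.

```python
import pandas as pd

# Hypothetical rows with differing layouts.
df = pd.DataFrame({"date": [
    "October 1, 2020 (United States)",
    "May 12 2021",
    "2019 (France)",
    "unknown",
]})

# \b...\b keeps the match to a standalone number; (?:19|20) limits it
# to plausible years so that day numbers are never picked up.
year = df["date"].str.extract(r"\b((?:19|20)\d{2})\b", expand=False)

# The nullable Int64 dtype keeps rows without a year as <NA>.
df["year"] = pd.to_numeric(year).astype("Int64")

print(df)
```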

Pandas : 'to_datetime' function not consistent with dates

When I read a date say '01/12/2020', which is in the format dd/mm/yyyy, with pd.to_datetime(), it detects the month as 01.
pd.to_datetime('01/12/2020').month
>> 1
But this behavior is not consistent.
When we create a dataframe with a column containing dates in this format, and convert using the same to_datetime function, it then detects 12 as the month.
tt.dt.month[0]
>> 12
What could be the reason ?
pandas automagically tries to detect the date format, which can be very nice, or annoying in your case.
Be explicit, use the dayfirst parameter:
pd.to_datetime('01/12/2020', dayfirst=False).month
# 1
pd.to_datetime('01/12/2020', dayfirst=True).month
# 12
Example of ambiguous use:
tt = pd.to_datetime(pd.Series(['30/05/2020', '01/12/2020']))
tt.dt.month
UserWarning: Parsing dates in DD/MM/YYYY format when dayfirst=False (the default) was specified. This may lead to inconsistently parsed dates! Specify a format to ensure consistent parsing.
tt = pd.to_datetime(pd.Series(['30/05/2020', '01/12/2020']))
0 5
1 1
dtype: int64
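As the warning text suggests, another way to be explicit is to pass a format string. A minimal sketch, assuming every value in the column is written as dd/mm/yyyy:

```python
import pandas as pd

dates = pd.Series(["30/05/2020", "01/12/2020"])

# An explicit format removes the guesswork: day, then month, then year.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

print(parsed.dt.month.tolist())  # [5, 12]
```

With a fixed format, a value that does not fit raises an error instead of being silently parsed with day and month swapped.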

How to keep leading zeroes from a panda column post operation?

I have a column which has data as :
Date
'2021-01-01'
'2021-01-10'
'2021-01-09'
'2021-01-11'
I need to get only the "year and month" as one column and have it as an integer instead of string like '2021-01-01' should be saved as 202101. (I don't need the day part).
When I try to clean the data I am able to do it but it removes the leading zeroes.
df['period'] = df['Date'].str[:4] + df['Date'].str[6:7]
This gives me:
Date
20211
202110
20219
202111
As you can see, for months Jan to Sept, it returns only 1 to 9 instead of 01 to 09, which creates discrepancy. If I add a zero manually as part of the merge it will make '2021-10' as 2021010. I want it simply as the Year and month without the hyphen and keeping the leading zeroes for months. See below how I would want it to come in the new column.
Date
202101
202110
202109
202111
I can do it using loop but that's not efficient. Is there a better way to do it in python?
The leading zeros are being dropped because the end index of a Python slice is exclusive: [6:7] selects only the single character at index 6, which is the second digit of the month.
Try changing your code to:
df['period'] = df['Date'].str[:4] + df['Date'].str[5:7]
Note the change from [6:7] to [5:7].
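The difference between the two slices can be checked directly. This is a sketch that assumes the values are plain 'YYYY-MM-DD' strings without surrounding quotes; the sample dates are hypothetical and chosen so that several months appear.

```python
import pandas as pd

# Hypothetical sample with different months.
df = pd.DataFrame(
    {"Date": ["2021-01-01", "2021-01-10", "2021-09-09", "2021-11-11"]}
)

one_char = df["Date"].str[6:7]  # index 6 only: second digit of the month
two_char = df["Date"].str[5:7]  # indices 5 and 6: the full two-digit month

# Concatenate as strings first, then convert, so the zero is kept.
df["period"] = (df["Date"].str[:4] + two_char).astype(int)

print(one_char.tolist())      # ['1', '1', '9', '1']
print(df["period"].tolist())  # [202101, 202101, 202109, 202111]
```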
Strip the quotation marks, parse the column as datetime, format it as year and month, and convert the result to an integer. Code below:
df['Date_edited'] = pd.to_datetime(df['Date'].str.strip("'")).dt.strftime('%Y%m').astype(int)
Date Date_edited
0 '2021-01-01' 202101
1 '2021-01-10' 202101
2 '2021-01-09' 202101
3 '2021-01-11' 202101

How to isolate part of string in pandas dataframe

I have a dataframe containing a column of strings. I want to take out a part of each string in each row, which is the year and then create a new column and assign it to that column. My problem is to isolate the last part of the string. An example could be: 'TON GFR 2018 N' For this string I would be able to execute by running one of the following (For this I want to isolate 18 and not 2018).
new_data['Year'] = pd.DataFrame([str(ele[1])[:2] for ele in list(new_data['Name'].str.split('20'))])
new_data['Year'] = new_data['Name'].str.split('20').str[1]
new_data['Year'] = new_data['Year'].str[:2]
However, I also meet names like these: 'TON RO20 2018 N' or TON 2020 N and then it does not work. I also encounter different number of spaces in different rows in the dataframe, hence it does not work to count the number of spaces in the string.
Any smart solutions to my problem?
Use .str.extract() to match a 4-digit string starting with 20 and capture its last 2 digits, as follows:
new_data['Year'] = new_data['Name'].str.extract(r'20(\d\d)')
If you want to ensure the 4-digit string is not part of a longer string/number, you can further use regex meta-character \b (word boundary) to enclose the target strings, as follows:
new_data['Year'] = new_data['Name'].str.extract(r'\b20(\d\d)\b')
Demo
Input data:
print(new_data)
Name
0 TON GFR 2018 N
1 TON RO20 2018 N
2 TON 2020 N
Result:
print(new_data)
Name Year
0 TON GFR 2018 N 18
1 TON RO20 2018 N 18
2 TON 2020 N 20
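A self-contained version of the demo, comparing the two patterns, might look like the sketch below. The fourth row is a hypothetical addition, not from the question, included only to show where the word boundary makes a difference.

```python
import pandas as pd

new_data = pd.DataFrame({"Name": [
    "TON GFR 2018 N",
    "TON RO20 2018 N",
    "TON 2020 N",
    "TON 120189 N",  # hypothetical: '2018' buried inside a longer number
]})

# Without \b the pattern also matches inside longer digit runs.
loose = new_data["Name"].str.extract(r"20(\d\d)", expand=False)

# With \b only standalone four-digit tokens match; other rows give NaN.
strict = new_data["Name"].str.extract(r"\b20(\d\d)\b", expand=False)

print(pd.DataFrame({"loose": loose, "strict": strict}))
```

For the last row the loose pattern returns '18', while the strict pattern returns NaN.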
If the year is always at the same distance from the end of the string, you could use:
new_data["Year"] = new_data["Name"].str.slice(start=-4, stop=-2)

Pandas: how to change all the values of a column?

I have a data frame with a column called "Date" and want all the values from this column to have the same value (the year only). Example:
City Date
Paris 01/04/2004
Lisbon 01/09/2004
Madrid 2004
Pekin 31/2004
What I want is:
City Date
Paris 2004
Lisbon 2004
Madrid 2004
Pekin 2004
Here is my code:
fr61_70xls = pd.ExcelFile('AMADEUS FRANCE 1961-1970.xlsx')
#Here we import the individual sheets and clean the sheets
years=(['1961','1962','1963','1964','1965','1966','1967','1968','1969','1970'])
fr={}
header=(['City','Country','NACE','Cons','Last_year','Op_Rev_EUR_Last_avail_yr','BvD_Indep_Indic','GUO_Name','Legal_status','Date_of_incorporation','Legal_status_date'])
for year in years:
    # save every sheet in variable fr['1961'], fr['1962'] and so on
    fr[year]=fr61_70xls.parse(year,header=0,parse_cols=10)
    fr[year].columns=header
    # drop the entire Legal status date column
    fr[year]=fr[year].drop(['Legal_status_date','Date_of_incorporation'],axis=1)
    # drop every row where GUO Name is empty
    fr[year]=fr[year].dropna(axis=0,how='all',subset=[['GUO_Name']])
    fr[year]=fr[year].set_index(['GUO_Name','Date_of_incorporation'])
It happens that in my DataFrames, called for example fr['1961'] the values of Date_of_incorporation can be anything (strings, integer, and so on), so maybe it would be best to completely erase this column and then attach another column with only the year to the DataFrames?
As @DSM points out, you can do this more directly using the vectorised string methods:
df['Date'].str[-4:].astype(int)
Or using extract (assuming there is only one set of digits of length 4 somewhere in each string):
df['Date'].str.extract(r'(?P<year>\d{4})').astype(int)
A slightly more flexible alternative is to use apply (or, equivalently, map):
df['Date'] = df['Date'].apply(lambda x: int(str(x)[-4:]))
# converts the last 4 characters of the string to an integer
The lambda function takes each value from the Date column and converts it to a year.
You could (and perhaps should) write this more verbosely as:
def convert_to_year(date_in_some_format):
    date_as_string = str(date_in_some_format)  # cast to string
    year_as_string = date_as_string[-4:]  # last four characters
    return int(year_as_string)
df['Date'] = df['Date'].apply(convert_to_year)
Perhaps 'Year' is a better name for this column...
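Since the question mentions that the column can hold a mix of strings and integers, a vectorised variant that casts to string first may be worth trying. This is a sketch under the assumption that the year is always the last four characters of the value; the DataFrame is rebuilt from the question's example, with 2004 stored as an integer for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Paris", "Lisbon", "Madrid", "Pekin"],
    "Date": ["01/04/2004", "01/09/2004", 2004, "31/2004"],
})

# Cast to str first: the .str accessor returns NaN for non-string values.
year = df["Date"].astype(str).str.extract(r"(\d{4})$", expand=False)

# Int64 (nullable) tolerates rows where no year was found.
df["Year"] = pd.to_numeric(year).astype("Int64")

print(df[["City", "Year"]])
```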
You can do a column transformation by using apply
Define a clean function to remove the dollar and commas and convert your data to float.
def clean(x):
    x = x.replace("$", "").replace(",", "").replace(" ", "")
    return float(x)
Next, call it on your column like this.
data['Revenue'] = data['Revenue'].apply(clean)
Or, if you prefer to use a lambda function inside apply:
data['Revenue']=data['Revenue'].apply(lambda x:float(x.replace("$","").replace(",", "").replace(" ", "")))
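The same cleaning can probably also be done without apply, using the vectorised string methods. A minimal sketch, assuming every value in the column is a string; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample values for the Revenue column.
data = pd.DataFrame({"Revenue": ["$1,200.50", "$ 3,000", "45"]})

# One regex pass removes dollar signs, commas and whitespace.
data["Revenue"] = (
    data["Revenue"]
    .str.replace(r"[$,\s]", "", regex=True)
    .astype(float)
)

print(data["Revenue"].tolist())  # [1200.5, 3000.0, 45.0]
```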
