Pandas converting date with string in - python

I'm starting out with Python, pandas and matplotlib. I'm working with data that has over a million entries, and I'm trying to change the date format. In the CSV file the date format is 23-JUN-11. I would like to use the dates later to plot the amount of donations for each candidate. How do I convert the dates to a format that pandas can read?
Here is the link to a cut-down file (149 entries)
My code:
%matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
First candidate
reader_bachmann = pd.read_csv('P00000001-ALL.csv' ,converters={'cand_id': lambda x: str(x)[1:]},parse_dates=True, squeeze=True, low_memory=False, nrows=411 )
date_frame = pd.DataFrame(reader_bachmann, columns = ['contb_receipt_dt'])
Data slice
s = date_frame.iloc[:,0]
date_slice = pd.Series([s])
date_strip = date_slice.str.replace('JUN','6')
Trying to convert to new date format
date = pd.to_datetime(s, format='%d%b%Y')
print(date_slice)
Here is the error message
ValueError: could not convert string to float: '05-JUL-11'

You need to use a different date format string:
format='%d-%b-%y'
Why?
The error message gives a clue as to what is wrong:
ValueError: could not convert string to float: '05-JUL-11'
The format string controls the conversion, and is currently:
format='%d%b%Y'
And the fields needed are:
%y - year without a century (range 00 to 99)
%b - abbreviated month name
%d - day of the month (01 to 31)
What is missing are the - characters that separate the fields in your data string, and the lowercase y for a two-digit year instead of the current Y for a four-digit year.
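As a minimal sketch of the corrected call, the snippet below applies the format string to two sample values taken from the question. The Series here is a stand-in for the contb_receipt_dt column; the variable names are only illustrative.

```python
import pandas as pd

# Stand-in for the 'contb_receipt_dt' column from the question.
raw = pd.Series(["23-JUN-11", "05-JUL-11"])

# %d = day, %b = abbreviated month name, %y = two-digit year,
# with literal '-' separators between the fields.
parsed = pd.to_datetime(raw, format="%d-%b-%y")

# Render as ISO-style strings to make the result easy to inspect.
iso = parsed.dt.strftime("%Y-%m-%d").tolist()
print(iso)
```

The parsed Series has a datetime64 dtype, so it can be used directly for grouping or plotting.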

As an alternative, you can use dateutil.parser to parse date strings directly. I have created a sample dataframe for the demo.
import pandas as pd
l = []
for i in range(100):
    l.append('23-JUN-11')
B = pd.DataFrame({'Date':l})
Now let's import dateutil.parser and apply it to our date column
import dateutil.parser
B['Date2'] = B['Date'].apply(lambda x : dateutil.parser.parse(x))
B.head()
Out[106]:
Date Date2
0 23-JUN-11 2011-06-23
1 23-JUN-11 2011-06-23
2 23-JUN-11 2011-06-23
3 23-JUN-11 2011-06-23
4 23-JUN-11 2011-06-23

Related

Pandas to datetime

I have a date that is formatted like this:
01-19-71
and 71 is 1971, but whenever to_datetime is used it converts it to 2071! How can I solve this problem? I was told that this would need regex, but I can't imagine how, since there are many cases in this data.
My current code:
re_1 = r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"
re_2 = r"(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[ \-\.,]+(?:\d{1,2}[\w]*[ \-,]+)?[1|2]\d{3}"
re_3 = r"(?:\d{1,2}/)?[1|2]\d{3}"
# Correct misspellings
df = df.str.replace("Janaury", "January")
df = df.str.replace("Decemeber", "December")
# Extract dates
regex = "((%s)|(%s)|(%s))"%(re_1, re_2, re_3)
dates = df.str.extract(regex)
# Sort the Series
dates = pd.Series(pd.to_datetime(dates.iloc[:,0]))
dates.sort_values(ascending=True, inplace=True)
Considering that one has a string as follows
date = '01-19-71'
To convert it to a datetime object where 71 becomes 1971 and not 2071, one can use datetime.strptime. Its %y directive maps the two-digit values 69-99 to 1969-1999 and 00-68 to 2000-2068:
import datetime as dt
date = dt.datetime.strptime(date, '%m-%d-%y')
[Out]:
1971-01-19 00:00:00
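If the data also contains two-digit years below 69 that belong to the 1900s, the fixed %y pivot is not enough. One possible workaround, sketched below, is to parse with strptime and then move any date that lands after a chosen cutoff back by a century. The function name parse_two_digit_year and the default cutoff are assumptions made for illustration, not part of any library.

```python
from datetime import datetime


def parse_two_digit_year(text, fmt="%m-%d-%y", latest=datetime(2021, 12, 31)):
    """Parse a date with a two-digit year, treating anything after
    `latest` as belonging to the previous century (hypothetical helper)."""
    parsed = datetime.strptime(text, fmt)
    if parsed > latest:
        # e.g. 2045 -> 1945 when the data cannot contain future dates
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed


print(parse_two_digit_year("01-19-71"))
print(parse_two_digit_year("01-19-45"))
print(parse_two_digit_year("01-19-15"))
```

The cutoff should be chosen from what is known about the data, for example the date the file was produced.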

What is the correct format code for this type of date?

I have a numpy array (called dates) of dates (as strings) which I thought were in the form %Y-%m-%d %H:%M:%S. However, I get an error showing that I have dates such as 2021-05-11T00:00:00.0000000. I'm not sure where that additional 'T' came from or why the time is so precise.
I am trying to get rid of the time and only have the date.
My code is here:
dates = dataset.iloc[:,0].to_numpy()
newDates = []
for i in range(0,len(dates)):
    newDates.append(datetime.strptime(dates[i], '%Y-%m-%dT%H:%M:%S.%f'))
    newDates[i] = newDates[i].strftime('%Y-%m-%d')
dates = newDates
I get an error saying "ValueError: unconverted data remains: 0".
If I wrote instead
newDates.append(datetime.strptime(dates[i], '%Y-%m-%dT%H:%M:%S%f'))
I get an error "ValueError: unconverted data remains: .0000000".
In which format should the date be given?
If you have datetimes in a DataFrame, you can use pd.to_datetime and Series.dt.strftime to convert them to the desired format. pandas does it all for you (there is no need to convert the DataFrame values to a numpy array).
import pandas as pd
# example df
df = pd.DataFrame({'datetime': ['2021-05-11T00:00:00.0000000',
                                '2021-05-20T00:00:00.0000000',
                                '2021-06-24T00:00:00.0000000']})
df['datetime'] = pd.to_datetime(df['datetime']).dt.strftime('%Y-%m-%d')
print(df)
datetime
0 2021-05-11
1 2021-05-20
2 2021-06-24
Does this help? https://strftime.org/
The extra T is the ISO 8601 separator between the date and the time; it comes right after the %Y-%m-%d part.
If you just want to get the date, split the string like this:
date = date.split('T')[0]
This first splits the date string into two parts,
['2021-05-11', '00:00:00.0000000']
and then keeps only the first element of the list (index 0), so you are left with
date = '2021-05-11'
Applied to the whole column:
dates = dataset.iloc[:,0].to_numpy()
newDates = []
for i in dates:
    newDates.append(i.split('T')[0])
dates = newDates
This assumes dates is a sequence of strings.
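On why strptime rejects the original strings: the %f directive accepts at most six fractional digits (microseconds), while the values here carry seven. A minimal sketch, assuming the strings always look like the example in the question, is to trim the fraction to six digits before parsing. The helper name parse_long_fraction is made up for this illustration.

```python
from datetime import datetime


def parse_long_fraction(text):
    """Parse 'YYYY-MM-DDTHH:MM:SS.fffffff' by trimming the fractional
    seconds to the six digits that %f can handle (hypothetical helper)."""
    head, _, fraction = text.partition(".")
    # Fall back to '0' if the string has no fractional part at all.
    fraction = fraction[:6] or "0"
    return datetime.strptime(f"{head}.{fraction}", "%Y-%m-%dT%H:%M:%S.%f")


stamp = parse_long_fraction("2021-05-11T00:00:00.0000000")
print(stamp.strftime("%Y-%m-%d"))
```

This keeps a real datetime object around, which is useful if the time of day is needed later; for the date alone, the split on 'T' shown above is simpler.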

Split the Date and Year and format the date into standard MM/DD/YYYY in Python

I'm working on date formatting, and a few cells contain data such as June/142017 (no slash between the day and the year). I want to split the day and the year and convert the value into the standard format MM/DD/YYYY.
At the moment I'm fixing the format with the replace function, i.e. replace("June/142017", "June/14/2017"), but that only works for this one June value. Could you please help me with code that splits and converts to the standard format without being specific to one value?
Below is the code I'm using:
import pandas as pd
import datetime as dt
File = pd.read_excel("Final_file.xlsx")
LFile = File.replace("June/142017","June/14/2017")
LFile["Date"] = pd.to_datetime(LFile["Date"]).dt.strftime("%m/%d/%Y")
LFile.to_excel("Updated_Final_File.xlsx")
*** FYI - I'm new to Python.
Thank you in Advance.
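Use the format %B/%d%Y to match June/142017, and fall back to the default parsing for the other values:
LFile = pd.read_excel("Final_file.xlsx")
# values like June/142017; anything else becomes NaT
d1 = pd.to_datetime(LFile["Date"], format='%B/%d%Y', errors='coerce')
# values pandas can parse on its own
d2 = pd.to_datetime(LFile["Date"], errors='coerce')
LFile["Date"] = d2.fillna(d1).dt.strftime("%m/%d/%Y")
LFile.to_excel("Updated_Final_File.xlsx")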
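The same fallback idea can be tried without the Excel file. The sketch below uses a small in-memory Series with made-up sample values, and gives an explicit format for both variants (an assumption: the well-formed cells look like June/14/2017) so that nothing depends on format inference.

```python
import pandas as pd

# Hypothetical sample values standing in for the "Date" column.
dates = pd.Series(["June/14/2017", "June/142017", "July/032018"])

# Each call parses only the rows that match its format; the rest become NaT.
with_slash = pd.to_datetime(dates, format="%B/%d/%Y", errors="coerce")
no_slash = pd.to_datetime(dates, format="%B/%d%Y", errors="coerce")

# Fill the gaps of one result with the other, then format as MM/DD/YYYY.
standard = with_slash.fillna(no_slash).dt.strftime("%m/%d/%Y").tolist()
print(standard)
```

Rows that match neither format stay missing, which makes them easy to find and inspect afterwards.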

Can't convert string object to time

I have a dataframe containing different recorded times as string objects, such as 1:02:45, 51:11, 54:24.
I can't convert them to time objects. This is the error I am getting:
"time data '49:49' does not match format '%H:%M:%S'"
This is the code I am using:
df_plot2 = df[['year', 'gun_time', 'chip_time']]
df_plot2['gun_time'] = pd.to_datetime(df_plot2['gun_time'], format = '%H:%M:%S')
df_plot2['chip_time'] = pd.to_datetime(df_plot2['chip_time'], format = '%H:%M:%S')
Thanks in advance for your help!
You can bring the time Series into a common format by checking the string length and prepending zero hours ('00:') where there are only minutes and seconds. Then parse to datetime. Example:
import pandas as pd
s = pd.Series(["1:02:45", "51:11", "54:24"])
m = s.str.len() <= 5
s.loc[m] = '00:' + s.loc[m]
dts = pd.to_datetime(s)
print(dts)
0 2021-12-01 01:02:45
1 2021-12-01 00:51:11
2 2021-12-01 00:54:24
dtype: datetime64[ns]
Note that strptime accepts a single-digit hour for %H, so 1:02:45 parses fine with %H:%M:%S. The error comes from values such as 49:49, which have only minutes and seconds and therefore do not match a three-field format.
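Since gun and chip times are durations rather than clock times, another option is to turn them into timedelta values, which avoids the arbitrary date seen in the output above. This is a plain-Python sketch; the helper name parse_duration is an assumption for illustration.

```python
from datetime import timedelta


def parse_duration(text):
    """Convert 'H:MM:SS' or 'MM:SS' into a timedelta (hypothetical helper)."""
    parts = [int(piece) for piece in text.split(":")]
    # Pad missing leading fields (hours) with zeros.
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


durations = [parse_duration(t) for t in ["1:02:45", "51:11", "54:24"]]
print(durations)
```

Timedeltas can be compared, sorted and converted with total_seconds(), which is usually what a plot of finishing times needs.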

Reading .dat file date string

In Python:
I have a .dat file in which one column is a date string 'yyyy-mm-dd'. The years in the column range from 2000 to 2010, and I only want to use 2005.
How can I read it successfully using np.loadtxt while keeping the same format?
I am then going to use:
time_string = yyyy-mm-dd
doy = int (time.strftime ("%j", time.strptime ( time_string, "%Y, %m, %d")))
to convert yyyy-mm-dd to day of year (1-365)
The question doesn't give any reason for using loadtxt from numpy, so presumably you don't care how the data is loaded. That said, in this case you can simply use dtype=object when loading it.
Suppose this is your .dat file, let us call it d1.dat:
1 2000-01-01 blah
2 2005-01-01 bleh
3 2006-02-03 blih
4 2008-03-04 bloh
5 2010-04-05 bluh
6 2005-03-12 blahr
Then (for example) to load it using numpy:
import numpy
data = numpy.loadtxt('d1.dat', usecols=[1,2], dtype=object)
Now you can apply your function to extract the day of the year from the first column in data:
import time
for date, _ in data:
    print(time.strftime("%j", time.strptime(date, "%Y-%m-%d")))
from datetime import datetime
time_string = '2012-12-31'
dt = datetime.strptime(time_string, '%Y-%m-%d')
print(dt.timetuple().tm_yday)
366
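Neither snippet above restricts the data to 2005, which the question asks for. A minimal sketch of that step, using the d1.dat sample rows from above held in memory instead of a file, could look like this:

```python
import io
from datetime import datetime

# The sample rows of d1.dat, kept in memory so the sketch is self-contained.
sample = io.StringIO(
    "1 2000-01-01 blah\n"
    "2 2005-01-01 bleh\n"
    "3 2006-02-03 blih\n"
    "4 2008-03-04 bloh\n"
    "5 2010-04-05 bluh\n"
    "6 2005-03-12 blahr\n"
)

days_2005 = []
for line in sample:
    fields = line.split()
    if len(fields) < 2:
        continue  # skip blank or malformed lines
    parsed = datetime.strptime(fields[1], "%Y-%m-%d")
    if parsed.year == 2005:
        # tm_yday is the day of the year, starting at 1
        days_2005.append(parsed.timetuple().tm_yday)

print(days_2005)
```

With a real file, open('d1.dat') can take the place of the StringIO object.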
