Data parsing in pandas, python - python

I have an excel file with many columns, one of them, 'Column3' is date with some text in it, basically it looks like that:
26/05/20
XXX
YYY
12/05/2020
The data is written in DD/MM/YY format but pandas, just like excel, thinks that 12/05/2020 it's 05 Dec 2020 while it is 12 May 2020. (My windows is set to american date format)
Important note: when I open stock excel file, cells with 12/05/2020 already are Date type, trying to convert it to text it gives me 44170 which will give me wrong date if I just reformat it into DD/MM/YY
I added this line of code:
iport pandas as pd
dateparse = lambda x: pd.datetime.strptime(x,'%d/%m/%y')
df = pd.read_excel("my_file.xlsx", parse_dates=['Column3'], date_parser=dateparse)
But the text in the column generates an error.
ValueError: time data 'XXX' does not match format '%d/%m/%y'
I went a step further and manually removed all text (obviously I can't do it all the time) to see whether it works or nor, but then I got following error
dateparse = lambda x: pd.datetime.strptime(x,'%d/%m/%y')
TypeError: strptime() argument 1 must be str, not datetime.datetime
I also tried this:
df['Column3'] = pd.to_datetime(df.Column3, format ='%d/%m/%y', errors="coerce")
# if I make errors="ignore" it doesn't change anything.
in that case my 26/05/20 was correctly converted to 26 May 2020 but I lost all my text data(it's ok) and other dates which didn't match with my format argument. Because previously they were recognized as American type date.
My objective is to convert the data in Column3 to the same format so I could apply filters with pandas.
I think it's couple solutions:
tell Pandas to not convert text to date at all (but it is already saved as Date type in stock file, will it work?)
somehow ignore text values and use date_parser= method co convert add dates to DD/MM/YY
with help of pd.to_datetime convert 26/05/20 to 26 May 2020 and than convert 2020-09-06 00:00:00 to 9 June 2020 (seems to be the simplest one but ignore argument doesn't work.)
Here's link to small sample file https://easyupload.io/ca5p6w

You can pass a date_parser to read_excel:
dateparser = lambda x: pd.to_datetime(x, dayfirst=True)
pd.read_excel('test.xlsx', date_parser = dateparser)

Posting this as an answer, since it's too long for a comment
The problem originates in Excel. If I open it in Excel, I see 2 strings that look like dates 26/05/20, 05/12/2020 and 06/02/2020. Note the difference between the 20 and 2020 On lines 24 and 48 I see dates in Column4. This seems to indicate the Excel is put together. Is this Excel assembled by copy-paste, or programmatically?
loading it with just pd.read_excel gives these results for the dates:
26/05/20
2020-12-05 00:00:00
2020-02-06 00:00:00
If I do df["Column3"].apply(type)
gives me
str
<class 'datetime.datetime'>
<class 'datetime.datetime'>
So in the Excel file these are marked as datetime.
Loading them with df = pd.read_excel(DATA_DIR / "sample.xlsx", dtype={"Column3": str}) changes the type of all to str, but does not change the output.
If you open the extract the file, and go look at the xml file xl\worksheets\sheet1.xml directly and look for cell C26, you see it as 44170, while C5 is 6, which is a reference to 26/05/20 in xl/sharedStrings.xml
How do you 'make' this Excel file? This can best be solved in how this file is put together.
Workaround
As a workaround, you can convert the dates piecemeal. The different format allows this:
format1 = "%d/%m/%y"
format2 = "%Y-%d-%m %H:%M:%S"
Then you can do pd.to_datetime(dates, format=format1, errors="coerce") to only get the first dates, and NaT for the ones not according to the format. Then you use combine_first to fill the voids.
dates = df["Column3"] # of the one imported with dtype={"Column3": str}
dates_parsed = (
pd.to_datetime(dates, format=format1, errors="coerce")
.combine_first(pd.to_datetime(dates, format=format2, errors="coerce"))
.astype(object)
.combine_first(dates)
)
The astype(object) is needed to fill in the empty places with the string values.

I think, first you should import the file without date parsing then convert it to date format using following:
df['column3']= pd.to_datetime(df['column3'], errors='coerce')
Hope this will work

Related

How can I add a zero to dates in a string so all months are 2 characters? [duplicate]

Using a Python script, I need to read a CVS file where dates are formated as DD/MM/YYYY, and convert them to YYYY-MM-DD before saving this into a SQLite database.
This almost works, but fails because I don't provide time:
from datetime import datetime
lastconnection = datetime.strptime("21/12/2008", "%Y-%m-%d")
#ValueError: time data did not match format: data=21/12/2008 fmt=%Y-%m-%d
print lastconnection
I assume there's a method in the datetime object to perform this conversion very easily, but I can't find an example of how to do it. Thank you.
Your example code is wrong. This works:
import datetime
datetime.datetime.strptime("21/12/2008", "%d/%m/%Y").strftime("%Y-%m-%d")
The call to strptime() parses the first argument according to the format specified in the second, so those two need to match. Then you can call strftime() to format the result into the desired final format.
you first would need to convert string into datetime tuple, and then convert that datetime tuple to string, it would go like this:
lastconnection = datetime.strptime("21/12/2008", "%d/%m/%Y").strftime('%Y-%m-%d')
I am new to programming. I wanted to convert from yyyy-mm-dd to dd/mm/yyyy to print out a date in the format that people in my part of the world use and recognise.
The accepted answer above got me on the right track.
The answer I ended up with to my problem is:
import datetime
today_date = datetime.date.today()
print(today_date)
new_today_date = today_date.strftime("%d/%m/%Y")
print (new_today_date)
The first two lines after the import statement gives today's date in the USA format (2017-01-26). The last two lines convert this to the format recognised in the UK and other countries (26/01/2017).
You can shorten this code, but I left it as is because it is helpful to me as a beginner. I hope this helps other beginner programmers starting out!
Does anyone else else think it's a waste to convert these strings to date/time objects for what is, in the end, a simple text transformation? If you're certain the incoming dates will be valid, you can just use:
>>> ddmmyyyy = "21/12/2008"
>>> yyyymmdd = ddmmyyyy[6:] + "-" + ddmmyyyy[3:5] + "-" + ddmmyyyy[:2]
>>> yyyymmdd
'2008-12-21'
This will almost certainly be faster than the conversion to and from a date.
#case_date= 03/31/2020
#Above is the value stored in case_date in format(mm/dd/yyyy )
demo=case_date.split("/")
new_case_date = demo[1]+"-"+demo[0]+"-"+demo[2]
#new format of date is (dd/mm/yyyy) test by printing it
print(new_case_date)
If you need to convert an entire column (from pandas DataFrame), first convert it (pandas Series) to the datetime format using to_datetime and then use .dt.strftime:
def conv_dates_series(df, col, old_date_format, new_date_format):
df[col] = pd.to_datetime(df[col], format=old_date_format).dt.strftime(new_date_format)
return df
Sample usage:
import pandas as pd
test_df = pd.DataFrame({"Dates": ["1900-01-01", "1999-12-31"]})
old_date_format='%Y-%m-%d'
new_date_format='%d/%m/%Y'
conv_dates_series(test_df, "Dates", old_date_format, new_date_format)
Dates
0 01/01/1900
1 31/12/1999
The most simplest way
While reading the csv file, put an argument parse_dates
df = pd.read_csv("sample.csv", parse_dates=['column_name'])
This will convert the dates of mentioned column to YYYY-MM-DD format
Convert date format DD/MM/YYYY to YYYY-MM-DD according to your question, you can use this:
from datetime import datetime
lastconnection = datetime.strptime("21/12/2008", "%d/%m/%Y").strftime("%Y-%m-%d")
print(lastconnection)
df is your data frame
Dateclm is the column that you want to change
This column should be in DateTime datatype.
df['Dateclm'] = pd.to_datetime(df['Dateclm'])
df.dtypes
#Here is the solution to change the format of the column
df["Dateclm"] = pd.to_datetime(df["Dateclm"]).dt.strftime('%Y-%m-%d')
print(df)

Converting Date Format in a Dataframe from a CSV File [duplicate]

This question already has answers here:
Convert DataFrame column type from string to datetime
(6 answers)
Convert Pandas Column to DateTime
(8 answers)
Closed 1 year ago.
I need to convert the date format of my csv file into the proper pandas format so I could sort it later on. My current format cannot be interacted reasonably in pandas so I had to convert it.
This is what my csv file looks like:
ARTIST,ALBUM,TRACK,DATE
ARTIST1,ALBUM1,TRACK1,23 Nov 2019 02:08
ARTIST1,ALBUM1,TRACK1,23 Nov 2019 02:11
ARTIST1,ALBUM1,TRACK1,23 Nov 2019 02:15
So far I've successfully converted it into pandas format by doing this:
df= pd.read_csv("mycsv.csv", delimiter=',')
convertdate= pd.to_datetime(df["DATE"])
print convertdate
####
#Original date format: 23 Nov 2019 02:08
#Output and desired date format: 2019-11-23 02:08:00
However, that only changes the values in the entire "DATE" column. Printing the dataframe of the csv file still outputs the original, non-converted date format. I need to append the converted format into the source csv file.
My desired output would then be
ARTIST,ALBUM,TRACK,DATE
ARTIST1,ALBUM1,TRACK1,2019-11-23 02:08:00
ARTIST1,ALBUM1,TRACK1,2019-11-23 02:11:00
ARTIST1,ALBUM1,TRACK1,2019-11-23 02:15:00
There are many options to the read_csv method.
Make sure to read the data in in the format you want instead of fixing it later.
df = pd.read_csv('mycsv.csv"', parse_dates=['DATE'])
Just pass in to the parse_dates argument the column names you want transformed.
There were 2 problems in the original code.
It wasn't a part of the original dataframe because you didn't save it back to the column once you transformed it.
so instead of:
convertdate= pd.to_datetime(df["DATE"])
use:
df["DATE"]= pd.to_datetime(df["DATE"])
and for goodness sake stop using python 2.
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
df = pd.read_csv('mycsv.csv', parse_dates=['DATE'], date_parser=dateparse)

dataframe datetimeindex changes

I have a dataframe with a date column. I want to turn this date column into my index. When I change the date column into pd.to_datetime(df['Date'], errors='raise', dayfirst=True) I get:
df1.head()
Out[60]:
Date Open High Low Close Volume Market Cap
0 2018-03-14 0.789569 0.799080 0.676010 0.701902 479149000 30865600000
1 2018-03-13 0.798451 0.805729 0.778471 0.789711 279679000 31213000000
2 2018-12-03 0.832127 0.838328 0.787882 0.801048 355031000 32529500000
3 2018-11-03 0.795765 0.840407 0.775737 0.831122 472972000 31108000000
4 2018-10-03 0.854872 0.860443 0.793736 0.796627 402670000 33418600000
The format of Date originally is string dd-mm-yyyy, but as you can see, the tranformation to datetime messes things up from the 2nd row on. How can I get consistent datetimes?
Edit: I think I solved it. Using the answers below about format I found out the error was in a package that I used to generate the data (\[cryptocmd\]). I changed the format to %Y-%m-%d in the utils script of the package and now it seems to work fine.
According to the docs:
dayfirst : boolean, default False
Specify a date parse order if arg is str or its list-likes. If True,
parses dates with the day first, eg 10/11/12 is parsed as 2012-11-10.
Warning: dayfirst=True is not strict, but will prefer to parse with
day first (this is a known bug, based on dateutil behavior).
Emphasis mine. Since you apparently know that your format is "dd-mm-yyyy" you should specify it explicitly:
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y', errors='raise')

skip rows with bad dates while using pd.read_csv

I'm reading in csv files from an external data source using pd.read_csv, as in the code below:
pd.read_csv(
BytesIO(raw_data),
parse_dates=['dates'],
date_parser=np.datetime64,
)
However, somewhere in the csv that's being sent, there is a misformatted date, resulting in the following error:
ValueError: Error parsing datetime string "2015-08-2" at position 8
This causes the entire application to crash. Of course, I can handle this case with a try/except, but then I will lose all the other data in that particular csv. I need pandas to keep and parse that other data.
I have no way of predicting when/where this data (which changes daily) will have badly formatted dates. Is there some way to get pd.read_csv to skip only the rows with bad dates but to still parse all the other rows in the csv?
somewhere in the csv that's being sent, there is a misformatted date
np.datetime64 needs ISO8601 formatted strings to work properly. The good news is that you can wrap np.datetime64 in your own function and use this as the date_parser:
def parse_date(v):
try:
return np.datetime64(v)
except:
# apply whatever remedies you deem appropriate
pass
return v
pd.read_csv(
...
date_parser=parse_date
)
I need pandas to keep and parse that other data.
I often find that a more flexible date parser like dateutil works better than np.datetime64 and may even work without the extra function:
import dateutil
pd.read_csv(
BytesIO(raw_data),
parse_dates=['dates'],
date_parser=dateutil.parser.parse,
)
Here's another way to do this using pd.convert_objects() method:
# make good and bad date csv files
# read in good dates file using parse_dates - no problem
df = pd.read_csv('dategood.csv', parse_dates=['dates'], date_parser=np.datetime64)
df.dtypes
dates datetime64[ns]
data float64
dtype: object
# try same code on bad dates file - throws exceptions
df = pd.read_csv('datebad.csv', parse_dates=['dates'], date_parser=np.datetime64)
ValueError: Error parsing datetime string "Q%Bte0tvk5" at position 0
# read the file first without converting dates
# then use convert objects to force conversion
df = pd.read_csv('datebad.csv')
df['cdate'] = df.dates.convert_objects(convert_dates='coerce')
# resulting new date column is a datetime64 same as good data file
df.dtype
dates object
data float64
cdate datetime64[ns]
dtype: object
# the bad date has NaT in the cdate column - can clean it later
df.head()
dates data cdate
0 2015-12-01 0.914836 2015-12-01
1 2015-12-02 0.866848 2015-12-02
2 2015-12-03 0.103718 2015-12-03
3 2015-12-04 0.514086 2015-12-04
4 Q%Bte0tvk5 0.583617 NaT
use inbuilt pd.to_datetime, which converts the non date type data to NaT
pd.read_csv(
BytesIO(raw_data),
parse_dates=['dates'],
date_parser=pd.to_datetime,
)
Now you can filter out the invalid rows with standard nan/ null check
df = df[~df["dates"].isnull()]

Converting date between DD/MM/YYYY and YYYY-MM-DD?

Using a Python script, I need to read a CVS file where dates are formated as DD/MM/YYYY, and convert them to YYYY-MM-DD before saving this into a SQLite database.
This almost works, but fails because I don't provide time:
from datetime import datetime
lastconnection = datetime.strptime("21/12/2008", "%Y-%m-%d")
#ValueError: time data did not match format: data=21/12/2008 fmt=%Y-%m-%d
print lastconnection
I assume there's a method in the datetime object to perform this conversion very easily, but I can't find an example of how to do it. Thank you.
Your example code is wrong. This works:
import datetime
datetime.datetime.strptime("21/12/2008", "%d/%m/%Y").strftime("%Y-%m-%d")
The call to strptime() parses the first argument according to the format specified in the second, so those two need to match. Then you can call strftime() to format the result into the desired final format.
you first would need to convert string into datetime tuple, and then convert that datetime tuple to string, it would go like this:
lastconnection = datetime.strptime("21/12/2008", "%d/%m/%Y").strftime('%Y-%m-%d')
I am new to programming. I wanted to convert from yyyy-mm-dd to dd/mm/yyyy to print out a date in the format that people in my part of the world use and recognise.
The accepted answer above got me on the right track.
The answer I ended up with to my problem is:
import datetime
today_date = datetime.date.today()
print(today_date)
new_today_date = today_date.strftime("%d/%m/%Y")
print (new_today_date)
The first two lines after the import statement gives today's date in the USA format (2017-01-26). The last two lines convert this to the format recognised in the UK and other countries (26/01/2017).
You can shorten this code, but I left it as is because it is helpful to me as a beginner. I hope this helps other beginner programmers starting out!
Does anyone else else think it's a waste to convert these strings to date/time objects for what is, in the end, a simple text transformation? If you're certain the incoming dates will be valid, you can just use:
>>> ddmmyyyy = "21/12/2008"
>>> yyyymmdd = ddmmyyyy[6:] + "-" + ddmmyyyy[3:5] + "-" + ddmmyyyy[:2]
>>> yyyymmdd
'2008-12-21'
This will almost certainly be faster than the conversion to and from a date.
#case_date= 03/31/2020
#Above is the value stored in case_date in format(mm/dd/yyyy )
demo=case_date.split("/")
new_case_date = demo[1]+"-"+demo[0]+"-"+demo[2]
#new format of date is (dd/mm/yyyy) test by printing it
print(new_case_date)
If you need to convert an entire column (from pandas DataFrame), first convert it (pandas Series) to the datetime format using to_datetime and then use .dt.strftime:
def conv_dates_series(df, col, old_date_format, new_date_format):
df[col] = pd.to_datetime(df[col], format=old_date_format).dt.strftime(new_date_format)
return df
Sample usage:
import pandas as pd
test_df = pd.DataFrame({"Dates": ["1900-01-01", "1999-12-31"]})
old_date_format='%Y-%m-%d'
new_date_format='%d/%m/%Y'
conv_dates_series(test_df, "Dates", old_date_format, new_date_format)
Dates
0 01/01/1900
1 31/12/1999
The most simplest way
While reading the csv file, put an argument parse_dates
df = pd.read_csv("sample.csv", parse_dates=['column_name'])
This will convert the dates of mentioned column to YYYY-MM-DD format
Convert date format DD/MM/YYYY to YYYY-MM-DD according to your question, you can use this:
from datetime import datetime
lastconnection = datetime.strptime("21/12/2008", "%d/%m/%Y").strftime("%Y-%m-%d")
print(lastconnection)
df is your data frame
Dateclm is the column that you want to change
This column should be in DateTime datatype.
df['Dateclm'] = pd.to_datetime(df['Dateclm'])
df.dtypes
#Here is the solution to change the format of the column
df["Dateclm"] = pd.to_datetime(df["Dateclm"]).dt.strftime('%Y-%m-%d')
print(df)

Categories

Resources