parse date-time while reading 'csv' file with pandas

I am trying to parse dates while reading my data from a csv file. The command I use is
df = pd.read_csv('/Users/n....', names=names, parse_dates=['date'])
It generally works on my files.
But I have a couple of data sets with mixed date formats in the same file: some lines use a format like 09/20/15 09:59, while others use 2015-09-20 10:22:01.013. The command above doesn't work on these files. It works when I remove parse_dates=['date'], but then I can't use the date column as datetime because it is read as integer. I would appreciate it if anyone could answer that!

pandas read_csv accepts a date_parser argument, which lets you supply your own date parsing function. In your case, with 2 different datetime formats, you can simply do:
import datetime
import pandas as pd

def date_parser(d):
    try:
        # first format, e.g. 09/20/15 09:59
        d = datetime.datetime.strptime(d, "%m/%d/%y %H:%M")
    except ValueError:
        try:
            # second format, e.g. 2015-09-20 10:22:01.013
            d = datetime.datetime.strptime(d, "%Y-%m-%d %H:%M:%S.%f")
        except ValueError:
            # neither format matches; mark the value as missing
            d = pd.NaT
    return d

df = pd.read_csv('/Users/n....',
                 names=names,
                 parse_dates=['date1', 'date2'],
                 date_parser=date_parser)
The columns listed in parse_dates are then parsed with whichever format matches.

Like this:
df = pd.read_csv(file, names=names)
df['date'] = pd.to_datetime(df['date'])
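A minimal sketch of how this idea could be applied to the two formats from the question, without depending on automatic format inference (which, as far as I know, pandas 2.0 and later base on the first value only). It parses the column once per format with errors="coerce" and then combines the results. The two sample strings are the ones from the question; the variable names are only illustrative.

```python
import pandas as pd

# The two sample values from the question, in one column.
raw = pd.Series(["09/20/15 09:59", "2015-09-20 10:22:01.013"])

# Parse once per format; values that do not match become NaT instead of raising.
first = pd.to_datetime(raw, format="%m/%d/%y %H:%M", errors="coerce")
second = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S.%f", errors="coerce")

# Fill the gaps left by the first pass with the results of the second pass.
parsed = first.fillna(second)
print(parsed)
```

Any value that matches neither format stays NaT, so parsed.isna() can be used to find rows that still need attention.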

Related

Why does Python Pandas read the string of an excel file as datetime

I have the following question.
I have Excel files as follows:
When I read the file using
df = pd.read_excel(file, dtype=str)
the first row is turned into 2003-02-14 00:00:00 while the rest are displayed as is.
How do I prevent pd.read_excel() from converting values into datetime or anything else?
Thanks!

As @ddejohn correctly said in the comments, the behavior you are seeing actually comes from Excel, which automatically converts the data to a date. pandas therefore has to read that data as a date, and you have to convert it back to the format you expect afterwards since, as you say, you cannot modify the input Excel file.
Here is a short script to make it work as you expect:
import pandas as pd

def rev(x: str) -> str:
    '''
    converts '2003-02-14 00:00:00' to '14.02.03'
    '''
    hours = '00:00:00'
    if hours not in x:
        return x
    y = x.split()[0]
    y = y.split('-')
    return '.'.join([i[-2:] for i in y[::-1]])

file = r'C:\your\folder\path\Classeur1.xlsx'
df = pd.read_excel(file, dtype=str)
df['column'] = df['column'].apply(rev)
Replace df['column'] by your actual column name.
You then get the desired format in your dataframe.
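As a possible variation on the same idea, here is a sketch that uses strptime and strftime instead of splitting the string by hand. The function name rev_strptime is made up for this example, and it assumes the converted cells always look like 'YYYY-MM-DD 00:00:00'. Anything else, including non-string values such as missing cells, is returned unchanged.

```python
from datetime import datetime, time

def rev_strptime(x):
    """Convert '2003-02-14 00:00:00' to '14.02.03'; return other values unchanged."""
    try:
        parsed = datetime.strptime(x, "%Y-%m-%d %H:%M:%S")
    except (ValueError, TypeError):
        # Not a timestamp string in the expected layout (or not a string at all).
        return x
    if parsed.time() != time.min:
        # Only midnight timestamps are treated as auto-converted dates.
        return x
    return parsed.strftime("%d.%m.%y")

print(rev_strptime("2003-02-14 00:00:00"))
print(rev_strptime("some other text"))
```

It can be applied in the same way, with df['column'].apply(rev_strptime).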

Pandas dataframe to_datetime() gives error

Goal:
I read measurement data from a .csv file into a dataframe. Then I prepend the date information from the filename to the time string that is already in the dataframe. The last step is to convert this string with date and time information into a datetime object.
First steps that worked:
import pandas as pd
filename = '2022_02_14_data_0.csv'
path = 'C:/Users/ma1075116/switchdrive/100_Schaltag/100_Digitales_Model/Messungen/'
measData = pd.read_csv(path+filename, sep = '\t', header = [0,1], encoding = 'ISO-8859-1')
# add the date to the timestamp string
measData['Timestamp'] = filename[:11]+measData['Timestamp']
Each entry in measData['Timestamp'] is now a string with exactly the following pattern:
'2022_02_14_00:00:06'
Now I want to convert this string to datetime:
measData['Timestamp'] = pd.to_datetime(measData['Timestamp'], format= '%Y_%m_%d_%H:%M:%S')
This raises the error:
ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing
Why do I get this error and how can I avoid it? I am pretty sure that the format is correct.
Edit:
I wrote a sample code which should do exactly the same, and it works:
filename = '2022_02_14_data_0.csv'
timestamps = {'Timestamp': ['00:00:00', '00:00:01', '00:00:04']}
testFrame = pd.DataFrame(timestamps)
testFrame['Timestamp'] = testFrame['Timestamp']#
testFrame['Timestamp'] = filename[:11]+testFrame['Timestamp']
testFrame['Timestamp'] = pd.to_datetime(testFrame['Timestamp'], format= '%Y_%m_%d_%X')
My next step is now to check if all timestamp entries in the dataframe have the same format.
Solution:
I do not understand the error, but I found a working solution. I now parse the time directly in the read_csv call and add the date information from the filename there. This works: measData['Timestamp'] now has the datatype datetime64.
filename = '2022_02_14_data_0.csv'
path = 'C:/Users/ma1075116/switchdrive/100_Schaltag/100_Digitales_Model/Messungen/'
measData = pd.read_csv(path+filename, sep = '\t', header = [0,1],
                       parse_dates=[0], # parse for the time in the first column
                       date_parser = lambda col: pd.to_datetime(filename[:11]+col, format= '%Y_%m_%d_%X'),
                       encoding = 'ISO-8859-1')

You can do it like this, using datetime.datetime.strptime and apply on the column.
Recreating your dataset:
import datetime
import pandas as pd
# use lists rather than sets: pandas cannot build a DataFrame from an unordered set
data = ['2016_03_29_08:15:27', '2017_03_29_08:18:27',
        '2018_06_30_08:15:27', '2019_07_29_08:15:27']
columns = ['time']
df = pd.DataFrame(data=data, columns=columns)
Applying the desired transformation:
df['time'] = df.apply(lambda row: datetime.datetime.strptime(row['time'],
                      '%Y_%m_%d_%H:%M:%S'), axis=1)

Your format seems to be missing an underscore after day.
This works for me:
import pandas as pd
date_str = '2022_02_14_00:00:06'
pd.to_datetime(date_str, format= '%Y_%m_%d_%H:%M:%S')
EDIT:
This works fine for me (measData["Timestamp"] is a pd.Series):
import pandas as pd
measData = pd.DataFrame({"Timestamp": ['2022_02_14_00:00:06', '2022_02_14_13:55:06', '2022_02_14_12:00:06']})
pd.to_datetime(measData["Timestamp"], format= '%Y_%m_%d_%H:%M:%S')
The only way I found to reproduce your error is this (measData is a pd.DataFrame):
import pandas as pd
measData = pd.DataFrame({"Timestamp": ['2022_02_14_00:00:06', '2022_02_14_13:55:06', '2022_02_14_12:00:06']})
pd.to_datetime(measData, format= '%Y_%m_%d_%H:%M:%S')
So make sure that what you are putting into to_datetime is a pd.Series. If this does not help, please provide a small sample of your data.
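One plausible way the question's code could end up passing a DataFrame: read_csv is called there with header=[0,1], which gives the columns two levels, and selecting only the top-level label then returns a DataFrame rather than a Series. The sketch below imitates that situation. I have not seen the real file, so the second-level labels ('s', 'V') and the 'Value' column are made up.

```python
import pandas as pd

# Hypothetical frame with two header rows, as header=[0, 1] would produce.
columns = pd.MultiIndex.from_tuples([("Timestamp", "s"), ("Value", "V")])
measData = pd.DataFrame([["00:00:06", 1.0], ["00:00:07", 2.0]], columns=columns)

# Selecting by the top-level label alone yields a DataFrame, not a Series.
sub = measData["Timestamp"]
print(type(sub))

# Selecting with the full column key yields a Series that to_datetime can parse.
series = measData[("Timestamp", "s")]
parsed = pd.to_datetime("2022_02_14_" + series, format="%Y_%m_%d_%H:%M:%S")
print(parsed)
```

If this is what happens with the real data, checking type(measData['Timestamp']) before calling to_datetime should confirm it.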

"df = pd.read_csv('xxx.csv')" resets the date format in my csv file

In my csv file, I save my dates in this format: "yyyy-mm-dd".
Every time I pull the data from csv and into a pandas dataframe, it will reset the format to "yyyy/mm/dd" in my csv file. This will cause errors if I test my code again, so I have to open the csv and reformat the date column to yyyy-mm-dd again.
Do you know why CSV does this? Is there a permanent solution to make sure my date format doesn't reset every time pandas reads my csv file?
Here is some of my code directly related to reading my csv file:
origindf = pd.read_csv('testlist.csv')
origindf = pd.DataFrame(origindf, columns=["ticker","date"])
origintickers = origindf['ticker'].values.tolist()
origintickersiterate = origindf['ticker'].values.tolist()
origindates = origindf['date'].values.tolist()
masterdf = pd.DataFrame(columns = ['ticker', 'date', 'time', 'vol', 'vwap', 'open', 'high', 'low','close','trades'])
for ticker in origintickersiterate:
    polygonapi = 'xxxxxxxxxxxxxxx'
    limit = 10000
    multiplier = 1
    timespan = 'minute'
    adjusted = 'False'
    theticker = origintickers.pop()
    thedate = origindates.pop()

Assuming that pandas recognises the original text as dates, it will store the column as datetime64[ns], which is not text, so how it is displayed on screen (e.g. with df.head()) is irrelevant. You can check the column types with df.dtypes to make sure.
pandas to_csv lets you control the format of the output dates with the date_format parameter, e.g.:
df.to_csv('testlist.csv', date_format='%Y-%m-%d')
I suggest viewing the output in a text editor, because Excel will parse the dates and may convert them.
Current documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
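A small self-contained sketch of that round trip, using in-memory text instead of a real file so that it can be run anywhere. The two rows of data are invented for the example; only the column names ticker and date come from the question.

```python
import io
import pandas as pd

# Invented sample standing in for testlist.csv.
csv_text = "ticker,date\nAAPL,2021-03-04\nMSFT,2021-03-05\n"

# parse_dates makes pandas store the column as datetime64 rather than text.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(df.dtypes)

# date_format controls how datetime values are written back out.
out = io.StringIO()
df.to_csv(out, index=False, date_format="%Y-%m-%d")
written = out.getvalue()
print(written)
```

With a real file, the same date_format argument would be passed to df.to_csv('testlist.csv', ...).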

How to correctly import and plot time index format HH:MM:SS.fs into Pandas dataframe

I'm new to python and I'm trying to correctly parse a .txt data file into pandas using the time column in the format HH:MM:SS.fs as the index for the dataframe. An example line of the .txt input file looks like this:
00:07:01.250 10.7
I've tried the following code using the datetime function; however, this adds today's date to the imported timestamp, which I don't want displayed. I've also read about the timestamp and timedelta functions but can't see how they would work for this use case.
df = pd.read_csv(f, engine='python', delimiter='\t+', skiprows=23, header=None, usecols=[0,3], index_col=0, names=['Time (HH:MM:SS.fs)', 'NL (%)'], decimal=',')
df.index = pd.to_datetime(df.index)
An example line of the output looks like this:
2019-09-26 00:07:01.250 10.7
But what I want is this (without the date):
00:07:01.250 10.7
Any ideas on what I'm doing wrong?

You could read the column as strings, using:
df.some_column = df.some_column.astype('str')
and then use the format argument of to_datetime. It uses Python's strptime method to convert a string to a datetime, and that method lets you specify the exact format of the input string, as described at the following link:
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
I hope this helps. I wish I had more time to test this, but unfortunately I don't.
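An untested-on-your-data sketch of two ways this could look for HH:MM:SS.fs values. The first sample value is the one from the question and the second is made up. Option 1 follows the format suggestion above and then keeps only the time of day; option 2 uses pd.to_timedelta, which the question mentions, and treats the values as elapsed time.

```python
import datetime
import pandas as pd

idx = pd.Index(["00:07:01.250", "00:07:02.500"])

# Option 1: parse with an explicit format, then drop the (placeholder) date part.
as_datetime = pd.to_datetime(idx, format="%H:%M:%S.%f")
as_time = as_datetime.time  # array of datetime.time objects

# Option 2: interpret the values as durations instead of clock times.
as_timedelta = pd.to_timedelta(idx)

print(as_time)
print(as_timedelta)
```

Which of the two is more convenient probably depends on how the index is used afterwards, for example whether the plotting code expects numeric durations or time-of-day labels.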

Leave dates as strings using read_excel function from pandas in python

Python 2.7.10
Tried pandas 0.17.1 -- function read_excel
Tried pyexcel 0.1.7 + pyexcel-xlsx 0.0.7 -- function get_records()
When using pandas in Python is it possible to read excel files (formats: xls, xlsx) and leave columns containing date or date + time values as strings rather than auto-converting to datetime.datetime or timestamp types?
If this is not possible using pandas can someone suggest an alternate method/library to read xls, xlsx files and leave date column values as strings?
For the pandas solution attempts the df.info() and resulting date column types are shown below:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 117 entries, 0 to 116
Columns: 176 entries, Mine to Index
dtypes: datetime64[ns](2), float64(145), int64(26), object(3)
memory usage: 161.8+ KB
>>> type(df['Start Date'][0])
Out[6]: pandas.tslib.Timestamp
>>> type(df['End Date'][0])
Out[7]: pandas.tslib.Timestamp
Attempt/Approach 1:
def read_as_dataframe(filename, ext):
    import pandas as pd
    if ext in ('xls', 'xlsx'):
        # problem: date columns auto converted to datetime.datetime or timestamp!
        df = pd.read_excel(filename) # unwanted - date columns converted!
    return df, name, ext
Attempt/Approach 2:
import pandas as pd
# import datetime as datetime
# parse_date = lambda x: datetime.strptime(x, '%Y%m%d %H')
parse_date = lambda x: x
elif ext in ('xls', 'xlsx', ):
    df = pd.read_excel(filename, parse_dates=False)
    date_cols = [df.columns.get_loc(c) for c in df.columns if c in ('Start Date', 'End Date')]
    # problem: date columns auto converted to datetime.datetime or timestamp!
    df = pd.read_excel(filename, parse_dates=date_cols, date_parser=parse_date)
I have also tried the pyexcel library, but it shows the same automatic conversion behavior:
Attempt/Approach 3:
import pyexcel as pe
import pyexcel.ext.xls
import pyexcel.ext.xlsx
t0 = time.time()
if ext == 'xlsx':
    records = pe.get_records(file_name=filename)
    for record in records:
        print("start date = %s (type=%s), end date = %s (type=%s)" %
              (record['Start Date'],
               str(type(record['Start Date'])),
               record['End Date'],
               str(type(record['End Date'])))
              )

I ran into an identical problem, except pandas was oddly converting only some cells into datetimes. I ended up manually converting each cell into a string like so:
def undate(x):
    if pd.isnull(x):
        return x
    try:
        return x.strftime('%d/%m/%Y')
    except AttributeError:
        return x
    except Exception:
        raise

for i in list_of_possible_date_columns:
    df[i] = df[i].apply(undate)

I tried saving the file manually in CSV UTF-8 format and used pd.read_csv(), and that worked fine.
I tried a bunch of things to achieve the same with read_excel, but nothing worked for me. So I am guessing read_excel converts your strings to datetime objects in a way you cannot control.

Using the converters={'Date': str} option of pandas.read_excel helps:
pandas.read_excel(xlsx, sheet, converters={'Date': str})
You can also try converting your timestamp back to the original format:
df['Date'][0].strftime('%Y/%m/%d')

I had the same issue when extracting dates from Excel. My columns had the MM/DD/YYYY format, but after reading the file in Python and converting it to CSV, the values came out as MM/DD/YYYY 00:00:00.
Fortunately, I figured out a solution. Using
excel = pd.read_excel(file, dtype=object)
instead of
excel = pd.read_excel(file, dtype=str)
works.
I have no idea why, and would appreciate it if an experienced Python programmer could explain.
