pd.read_csv only parsing one out of two column dates - python

I'm trying to read data from a CSV and parse the dates, but only one of the two date columns ends up with a datetime dtype; the other one remains an object.
dtype = {'col1':'category','col2':'category','Start Date':'str','End Date':'str'}
dates = ['col3','col4']
df = pd.read_csv(filepath,dtype=dtype,parse_dates=dates,dayfirst=False)
Both date columns have the same format.
When I call df.info(), only one of the two columns is listed with a datetime dtype; the other is still object.
I tried the dayfirst argument and a date formatter, but neither helped.
I expect both columns in the list to be parsed as datetime, but for some reason they aren't.
Update: I tried to build a minimal reproducible data set with the code block below, but it behaves as expected and produces both the Start Date and End Date columns as datetime.
import pandas as pd
df = pd.DataFrame({'col1':['ABC','ABC','DCF','DCF'],
'Start Date':['12-31-2022','12-31-2022','12-31-2022','12-31-2022'],
'End Date':['12-31-2023','12-31-2023','12-31-2023','12-31-2023']
})
df.to_csv('test.csv',index=False)
df2 = pd.read_csv('test.csv',
dtype={'col1':'category','Start Date':'str','End Date':'str'},
parse_dates = ['Start Date','End Date'],
dayfirst=False
)
df2.info()
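One thing worth checking, as a guess rather than a confirmed diagnosis: read_csv leaves a column as object when it cannot parse every value in it, so a single stray value in the real file could explain the difference from the clean test data. The sketch below uses made-up data (the "not available" value and the in-memory CSV are assumptions) to show how errors='coerce' can surface the offending rows.

```python
import io
import pandas as pd

# Hypothetical data: one 'End Date' value is not a valid date.
csv_text = """col1,Start Date,End Date
ABC,12-31-2022,12-31-2023
DCF,12-31-2022,not available
"""

# Read everything as strings first so nothing is parsed implicitly.
raw = pd.read_csv(io.StringIO(csv_text), dtype=str)

# errors='coerce' turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(raw["End Date"], format="%m-%d-%Y", errors="coerce")

# Rows where a value was present but could not be parsed as a date.
bad_rows = raw.loc[parsed.isna() & raw["End Date"].notna(), "End Date"]
print(bad_rows)
```

If bad_rows is empty for both columns in the real file, the cause lies elsewhere.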

Related

Removing time from a date column in pandas

I have a pandas DataFrame with a Date column (string), which I could convert and set as the index using the set_index and to_datetime functions:
usd2inr_df.set_index(pd.to_datetime(usd2inr_df['Date']), inplace=True)
but the resulting DataFrame has a time portion, which I want to remove:
2023-02-14 00:00:00
I want it to be 2023-02-14.
How do I set up the call so that the DataFrame index holds the date without the time portion?
usd2inr_df['Date'] = pd.to_datetime(usd2inr_df['Date']).dt.normalize()
usd2inr_df = usd2inr_df.set_index('Date')
pd.to_datetime() converts a Series to pandas datetime values.
Series.dt.date returns the date part of each value, in 'yyyy-mm-dd' form.
Assigning to DataFrame.index sets the index of the DataFrame.
import pandas as pd
# create a dataFrame as an example
df = pd.DataFrame({'Name': ['Example'],'Date': ['2023-02-14 10:01:11']})
print(df)
# convert 'yyyy-mm-dd hh:mm:ss' to 'yyyy-mm-dd'.
df['Date'] = pd.to_datetime(df['Date']).dt.date
# set 'Date' as index
df.index = df['Date']
print(df)
Output
      Name                 Date
0  Example  2023-02-14 10:01:11
-------------------------------------------------------
               Name        Date
Date
2023-02-14  Example  2023-02-14
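The two approaches shown above (.dt.normalize() and .dt.date) do not produce the same kind of values, which may matter depending on what the index is used for. A small sketch of the difference, reusing the example DataFrame; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Example'], 'Date': ['2023-02-14 10:01:11']})
stamps = pd.to_datetime(df['Date'])

# .dt.date yields Python datetime.date objects, so the dtype becomes object.
as_dates = stamps.dt.date
# .dt.normalize() keeps a datetime dtype and sets the time to midnight.
as_midnight = stamps.dt.normalize()

print(as_dates.dtype)
print(as_midnight.dtype)

# Using the normalized values as the index gives a DatetimeIndex.
by_day = df.set_index(as_midnight)
print(type(by_day.index))
```

With .dt.normalize() the index stays a DatetimeIndex, so datetime features such as resampling remain available; with .dt.date the index holds plain date objects.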

Convert index to DateTime

I have an Excel file with data, which I loaded into a pandas DataFrame of shape (5000, 12). I set the date as the index like this:
Data_Final=Data.set_index(['Date Time']) # Data_Final is Dataframe
For example, the first index value is 01/01/2016 00:00. Now I want this index to be of datetime type. How is this conversion done?
Use pd.to_datetime() on the column before setting it as the index:
Data_Final = Data.copy()  # copy so the original Data is not modified
Data_Final['Date Time'] = pd.to_datetime(Data_Final['Date Time'])
Data_Final.set_index('Date Time', inplace=True)
How to convert string to datetime format in pandas python?
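If the index has already been set, as in the question, it can also be converted directly. The sketch below uses made-up stand-in data, and the format string assumes the values are day-first (DD/MM/YYYY HH:MM); the question does not say which order is intended, so adjust it if the month comes first.

```python
import pandas as pd

# Hypothetical stand-in for the Excel data; values and the 'Value' column are made up.
Data = pd.DataFrame({
    'Date Time': ['01/01/2016 00:00', '01/01/2016 01:00'],
    'Value': [1.0, 2.0],
})
Data_Final = Data.set_index('Date Time')

# Assumption: day comes first in the strings (DD/MM/YYYY HH:MM).
Data_Final.index = pd.to_datetime(Data_Final.index, format='%d/%m/%Y %H:%M')
print(Data_Final.index)
```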

In pandas, how to infer date typed columns with a custom format

I am trying to parse a CSV file with pandas read_csv, and I am running into an issue where dates are not parsed properly because they are in the format "%d.%m.%Y" (for example, 22.01.2022).
I understand a custom date parser is needed, so I passed one as an argument:
data = pd.read_csv(p, skiprows=[0,1,2,4], keep_default_na=False,
date_parser=lambda x: datetime.strptime(x, "%d.%m.%Y").date(),
sep="\t"
)
This call doesn't parse the dates as expected.
If I pass the list of columns that I expect to have dates in it, then those columns are properly parsed as dates, so I assume my custom date parser works:
data = pd.read_csv(p, skiprows=[0,1,2,4], keep_default_na=False,
date_parser=lambda x: datetime.strptime(x, "%d.%m.%Y").date(),
parse_dates=['date1', 'date2'],
sep="\t"
)
But I would like to avoid having to manually specify which columns pandas should be trying to parse as date columns, since the data source could evolve. I would like to have pandas guess which columns contain dates, like it does when the dates match a more standard format.
Since the pandas behaviour I was looking for turned out not to exist, here is the solution I went with: build the DataFrame first, then convert the appropriate columns to dates.
First I find the columns where the string data matches my format, and then apply the type:
date_cols = [
    col for col in df.columns
    if df[col].astype(str).str.contains(r'^\d{2}\.\d{2}\.\d{4}$', case=True, regex=True).any()
]
for col in date_cols:
    df[col] = pd.to_datetime(df[col], format='%d.%m.%Y')
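For reference, here is a self-contained variant of the same idea on made-up data. It is a sketch, not the code from the answer: it swaps in str.fullmatch with .all(), so a column is only converted when every value matches the pattern, which is stricter than the .any() check above. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data: two date-like columns in DD.MM.YYYY form and one plain text column.
df = pd.DataFrame({
    'date1': ['22.01.2022', '23.01.2022'],
    'label': ['a', 'b'],
    'date2': ['01.02.2022', '15.03.2022'],
})

pattern = r'\d{2}\.\d{2}\.\d{4}'

# Keep only the columns where every value matches the pattern in full.
date_cols = [
    col for col in df.columns
    if df[col].astype(str).str.fullmatch(pattern).all()
]

for col in date_cols:
    df[col] = pd.to_datetime(df[col], format='%d.%m.%Y')

print(date_cols)
print(df.dtypes)
```

Whether .any() or .all() is the better test depends on the data: .all() avoids converting a mixed column, but it will skip a date column that contains empty values.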

How to keep date format the same in pandas? [duplicate]

import pandas as pd
import sys
df = pd.read_csv(sys.stdin, sep='\t', parse_dates=['Date'], index_col=0)
df.to_csv(sys.stdout, sep='\t')
Date Open
2020/06/15 182.809924
2021/06/14 257.899994
I got the following output with the input shown above.
Date Open
2020-06-15 182.809924
2021-06-14 257.899994
The date format is changed. Is there a way to maintain the date format automatically? (For example, if the input is in YYYY/MM/DD format, the output should be in YYYY/MM/DD. If the input is in YYYY-MM-DD, the output should be in YYYY-MM-DD, etc.)
I would prefer not to have to test the date format manually. Ideally there is an automatic way to maintain the date format, whatever that format is.
You can specify the date_format argument in to_csv:
df.to_csv(sys.stdout, sep='\t', date_format="%Y/%m/%d")
Alternatively, keep the dates as strings and parse them into an extra column if you need to operate on them as dates. Because index_col=0 makes Date the index, parse the index rather than a column:
df = pd.read_csv(sys.stdin, sep='\t', index_col=0)
df['DateParsed'] = pd.to_datetime(df.index)
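A minimal round-trip sketch of the keep-as-strings idea, using an in-memory copy of the sample input instead of stdin. The explicit format string is an assumption that matches this sample only; the helper column is dropped again before writing so the output has the same columns as the input.

```python
import io
import pandas as pd

# The sample input from the question, held in memory instead of read from stdin.
text = "Date\tOpen\n2020/06/15\t182.809924\n2021/06/14\t257.899994\n"

# Without parse_dates, the Date index stays as the original strings.
df = pd.read_csv(io.StringIO(text), sep="\t", index_col=0)

# Parsed copy for date arithmetic; the format is an assumption for this sample.
df["DateParsed"] = pd.to_datetime(df.index, format="%Y/%m/%d")

# Write back without the helper column; the original date strings are untouched.
out = io.StringIO()
df.drop(columns="DateParsed").to_csv(out, sep="\t")
print(out.getvalue())
```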

Using Pandas in python while converting date to datetime datatype from obj , I am not getting my dates properly converted. Why?

Please have a look at both of these images, especially the dates from Sno 32. The month and day are not converted properly. How can I correct this? I have already looked at questions about time series but haven't found an answer to this kind of issue.
The problem is that pandas parses the month first by default whenever the value allows it.
You can specify the format explicitly as DD/MM/YY:
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%y')
Or try the dayfirst=True parameter:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
Or, if you create the DataFrame from a file, use the parse_dates and dayfirst=True parameters:
df = pd.read_csv(file, parse_dates=['date'], dayfirst=True)
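To make the month/day swap concrete, here is a small sketch on made-up values that are valid under either reading. It uses explicit formats for both interpretations so the result does not depend on how a given pandas version infers the format.

```python
import pandas as pd

# Hypothetical values that are valid both as MM/DD/YY and as DD/MM/YY.
s = pd.Series(['01/02/20', '03/04/20'])

# Month-first reading: 2 January and 4 March 2020.
month_first = pd.to_datetime(s, format='%m/%d/%y')
# Day-first reading: 1 February and 3 April 2020.
day_first = pd.to_datetime(s, format='%d/%m/%y')

print(month_first.dt.month.tolist())
print(day_first.dt.month.tolist())
```

Values such as 13/01/20 can only be read day-first, which is why a column parsed without an explicit format or dayfirst=True can end up with mixed results.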
