Hi, I am working on the dataset given below:
Month,Travellers('000)
Jan-91,1724
Feb-91,1638
Mar-91,1987
Apr-91,1825
May-91,
Jun-91,1879
I am using the below code to format the date
data = pd.read_csv('Metrail+dataset.csv', header = None)
data.columns = ['Month','Travellers']
data['Month'] = pd.to_datetime(data['Month'], format='%m-%Y')
data = data.set_index('Month')
data.head(12)
However, I am getting the error below:
ValueError: time data 'Month' does not match format '%m-%Y' (match)
Could someone tell me what the mistake is, and point me to any useful links for learning more about date formats?
%Y is the four-digit year, whereas %y is the two-digit year.
%m is the month as a number, whereas %b is the abbreviated month name.
Also remove header=None: with it, pandas treats the header row as data, so the literal string 'Month' is passed to to_datetime. That is exactly the value named in the error.
data = pd.read_csv('data.csv')
data.columns = ['Month', 'Travellers']
data['Month'] = pd.to_datetime(data['Month'], format='%b-%y')
Use %b and (as mentioned) %y:
data['Month'] = pd.to_datetime(data['Month'], format='%b-%y')
From the docs:
%b: Month as locale’s abbreviated name, e.g. Sep
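Putting both fixes together, here is a minimal sketch using the sample rows from the question. The inline CSV text and io.StringIO stand in for the real file, which is an assumption made only so the example runs on its own.

```python
import io

import pandas as pd

# Inline copy of the sample rows from the question (May-91 is blank there too).
csv_text = """Month,Travellers('000)
Jan-91,1724
Feb-91,1638
Mar-91,1987
Apr-91,1825
May-91,
Jun-91,1879
"""

# With the default header handling, the first row becomes the column names,
# so the literal string 'Month' never reaches to_datetime.
data = pd.read_csv(io.StringIO(csv_text))
data.columns = ["Month", "Travellers"]

# %b = abbreviated month name, %y = two-digit year.
data["Month"] = pd.to_datetime(data["Month"], format="%b-%y")
data = data.set_index("Month")
print(data.head(12))
```

The blank May-91 value is read as NaN and does not affect the date parsing.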
Related
I am trying to convert a dataframe column "date" from string to datetime. I have this format: "January 1, 2001 Monday".
I tried to use the following:
from dateutil import parser
for index, v in df['date'].items():
    df['date'][index] = parser.parse(df['date'][index])
But it gives me the following error:
ValueError: Cannot set non-string value '2001-01-01 00:00:00' into a StringArray.
I checked the datatype of the column "date" and it tells me string type.
This is the snippet of the dataframe:
Any help would be most appreciated!
Why not use pandas instead of dateutil? It offers simpler tools, such as the pd.to_datetime function:
df['date'] = pd.to_datetime(df['date'], format='%B %d, %Y %A')
You need to specify the format so that the string is parsed correctly. The documentation helps with this:
%A is for Weekday as locale’s full name, e.g., Monday
%B is for Month as locale’s full name, e.g., January
%d is for Day of the month as a zero-padded decimal number.
%Y is for Year with century as a decimal number, e.g., 2021.
Combining all of them we have the following function:
from datetime import datetime
def mdy_to_ymd(d):
    return datetime.strptime(d, '%B %d, %Y %A').strftime('%Y-%m-%d')
print(mdy_to_ymd('January 1, 2021 Monday'))
> 2021-01-01
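One more thing: in your case, .apply() will usually be faster than the element-wise loop. Pass the function itself rather than a lambda that never calls it. Note that mdy_to_ymd returns strings, not datetime objects:
df['date'] = df['date'].apply(mdy_to_ymd)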
Feel free to add Hour-Minute-Second if needed.
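For the original question, a minimal sketch of the whole-column conversion might look like this. The two sample rows are assumptions based on the format quoted in the question, and %A and %B assume English month and weekday names. Assigning the whole column at once avoids writing datetime values one by one into a string-typed column, which is what raised the StringArray error.

```python
import pandas as pd

# Hypothetical sample data in the "January 1, 2001 Monday" format,
# stored with the string dtype as in the question.
df = pd.DataFrame({
    "date": pd.Series(
        ["January 1, 2001 Monday", "January 2, 2001 Tuesday"],
        dtype="string",
    )
})

# Replace the whole column in one step; the result is a datetime column.
df["date"] = pd.to_datetime(df["date"], format="%B %d, %Y %A")
print(df["date"])
```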
I have a date column in Excel in year/month/day format. I want to extract only the year from each date and group by year, but I get an error:
df.index = pd.to_datetime(df[18], format='%y/%m/%d %I:%M%p')
df.groupby(by=[df.index.year])
18 is the index of my date column.
The error is:
ValueError: time data '2022/04/23' does not match format '%y/%m/%d %I:%M%p' (match)
I don't know how to fix it.
The error message indicates that your format string, %y/%m/%d %I:%M%p, doesn't match the dates in your column.
Your dates look like YYYY/MM/DD with no time part, but the format string expects a two-digit year followed by a 12-hour time with AM/PM.
Change the format string to %Y/%m/%d:
df.index = pd.to_datetime(df[18], format='%Y/%m/%d')
Then you can extract the year using the year attribute of the datetime object, and group by the year as you are doing.
Make sure your date column is formatted consistently. Here is a minimal example of converting dates in that format:
import pandas as pd
df = pd.DataFrame({'date': ['2022/04/23', '2022/04/24', '2022/04/25']})
df['date'] = pd.to_datetime(df['date'], format='%Y/%m/%d')
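To illustrate the grouping step as well, here is a small sketch. The "amount" column and the extra 2021 row are made up for the example; only the date format comes from the question.

```python
import pandas as pd

# Hypothetical data: a date column in YYYY/MM/DD format plus a value to aggregate.
df = pd.DataFrame({
    "date": ["2022/04/23", "2022/04/24", "2021/12/31"],
    "amount": [10, 20, 5],
})

# Parse with a four-digit year (%Y) and use the result as the index.
df.index = pd.to_datetime(df["date"], format="%Y/%m/%d")

# Group by the year of the datetime index and aggregate.
yearly = df.groupby(df.index.year)["amount"].sum()
print(yearly)
```

Note that groupby on its own only builds the groups; an aggregation such as sum() is needed to get a result.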
Simplified huge df with date column of inconsistent string formatting containing errors:
import numpy as np
import pandas as pd

df_length = 10000
df = pd.DataFrame({
    "to_ignore": np.random.randint(1, 500, df_length),
    "date": np.random.choice(["11 Nov 2018", "Feb 2019", "2021-11-02", "asdf"], df_length),
})
We need to convert the date column to datetime, but can't find a solution that neither drops data nor takes unusably long. We tried applying formats successively with errors='ignore':
df['date'] = pd.to_datetime(df['date'], format='%b %Y', errors='ignore')
df['date'] = pd.to_datetime(df['date'], format='%d %b %Y', errors='ignore')
But with erroneous strings ("asdf") present, the column seems unaffected. Trying formats successively with errors='coerce' obviously drops data.
We tried dateparser, df['date'] = df['date'].apply(lambda x: dateparser.parse(x)), which mostly worked, except that it sometimes got the day of the month wrong (2019-02-02 should be 2019-02-01):
to_ignore date
0 115 2019-02-02
1 285 NaT
...
This is also prohibitively slow (play with df_length).
What's a good way to do this?
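Figured it out. df['date'] = pd.to_datetime(df['date'], errors='coerce') is performant and captures common formats. (On pandas 2.0 and later, which infers a single format from the first value, you may need to add format='mixed' to get per-element parsing.) My question assumed this wasn't the case because of a formatting mistake, which I've since corrected to help others avoid confusion.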
If you need to capture dates in more complex strings, you can write a function that handles the known formats with regular expressions and falls back to dateparser.parse() for everything else:
import datetime as dt
import re

import dateparser

def date_process(x):
    # Fast paths for the known formats; raw strings keep the regex escapes intact.
    if re.search(r"^\D\D\D \d\d\d\d$", x):
        return dt.datetime.strptime(x, "%b %Y")
    elif re.search(r"^\d\d \D\D\D \d\d\d\d$", x):
        return dt.datetime.strptime(x, "%d %b %Y")
    elif re.search(r"^\d\d\d\d-\d\d-\d\d$", x):
        return dt.datetime.strptime(x, "%Y-%m-%d")
    else:
        # Slow fallback for anything else.
        return dateparser.parse(x)

df['date'] = df['date'].apply(date_process)
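As an alternative sketch that stays vectorised, each format can be parsed separately with errors='coerce' and the results merged, so a value only ends up as NaT when no format matched. The sample values come from the question; the list of formats is an assumption about what the column contains.

```python
import pandas as pd

# Sample values from the question, one per format plus one bad string.
raw = pd.Series(["11 Nov 2018", "Feb 2019", "2021-11-02", "asdf"])

# Assumed list of formats present in the column.
formats = ["%d %b %Y", "%b %Y", "%Y-%m-%d"]

# Parse with the first format, then fill the remaining gaps format by format.
parsed = pd.to_datetime(raw, format=formats[0], errors="coerce")
for fmt in formats[1:]:
    parsed = parsed.fillna(pd.to_datetime(raw, format=fmt, errors="coerce"))

print(parsed)
```

Each pass is a single vectorised call, so the cost grows with the number of formats rather than with a Python-level loop over rows.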
I am getting the error:
time data '1/1/03 0:00' does not match format '%m/%d/%Y %H:%M' (match)
Does this format not match? I am new to working with date-time formats, so this is rather confusing to me.
This is the line of code I am using:
date_time = pd.to_datetime(df['time'], format='%m/%d/%Y %H:%M')
Note that df, which is loaded from the CSV file, has one column named "time", so df['time'] gives me all of its values.
Another entry is:
12/31/16 23:00
so I know the order is month/day/year. Is it because the year is only two digits?
The issue is the year. %Y matches a year with the century, so the input would need to contain 2003 for it to match. Use %y instead:
date_time = pd.to_datetime(df['time'], format='%m/%d/%y %H:%M')
The problem in your case is the year:
date_time = pd.to_datetime(df['time'], format='%m/%d/%y %H:%M') # the lowercase y fixes it
It's basically the same as in the datetime module:
from datetime import datetime as dt
dt.strptime('1/1/03 0:00', '%m/%d/%y %H:%M')
>> datetime.datetime(2003, 1, 1, 0, 0)
dt.strptime('1/1/03 0:00', '%m/%d/%Y %H:%M')
>> ## ERROR
For reference, all the format codes are listed in the datetime documentation.
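The same fix applied to both entries quoted in the question, as a small self-contained sketch (the Series here stands in for df['time']):

```python
import pandas as pd

# The two entries quoted in the question.
times = pd.Series(["1/1/03 0:00", "12/31/16 23:00"])

# %y reads a two-digit year; months, days and hours need not be zero-padded.
parsed = pd.to_datetime(times, format="%m/%d/%y %H:%M")
print(parsed)
```

Keep in mind that two-digit years are mapped to a century by convention (values 69 to 99 become 1969 to 1999, and 0 to 68 become 2000 to 2068), so check that this matches your data.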
I am scraping https://finansial.bisnis.com/read/20210506/90/1391096/laba-bank-mega-tumbuh-dua-digit-kuartal-i-2021-ini-penopangnya and trying to extract the publication date of the article. Here's my code:
news['tanggal'] = newsScrape['date']
dates = []
for x in news['tanggal']:
    x = listToString(x)
    x = x.strip()
    x = x.replace('\r', '').replace('\n', '').replace(' \xa0|\xa0', ',').replace('|', ', ')
    dates.append(x)
dates = listToString(dates)
dates = dates[0:20]
if len(dates) == 0:
    continue
news['tanggal'] = dt.datetime.strptime(dates, '%d %B %Y, %H:%M')
but I got this error:
ValueError: time data '06 Mei 2021, 11:32 ' does not match format '%d %B %Y, %H:%M'
My assumption is that this happens because Mei is Indonesian, whereas the format expects the English May. How can I change Mei to May? I have tried dates = dates.replace('Mei', 'May'), but it doesn't work for me; I then get ValueError: unconverted data remains. The type of dates is string. Thanks
You can try the following:
import datetime as dt
import requests
from bs4 import BeautifulSoup
import urllib.request
url="https://finansial.bisnis.com/read/20210506/90/1391096/laba-bank-mega-tumbuh-dua-digit-kuartal-i-2021-ini-penopangnya"
r = requests.get(url, verify=False)
soup = BeautifulSoup(r.content, 'html.parser')
info_soup= soup.find(class_="new-description")
x=info_soup.find('span').get_text(strip=True)
x = x.strip()
x = x.replace('\r', '').replace('\n', '').replace(' \xa0|\xa0', ',').replace('|', ', ')
x = x[0:20]
x = x.rstrip()
date= dt.datetime.strptime(x.replace('Mei', 'May'), '%d %B %Y, %H:%M')
print(date)
result:
2021-05-06 11:45:00
Your assumption about the Mei -> May change is correct. The reason you are likely still facing a problem after the replacement is the trailing spaces in your string, which the format does not account for. You can use str.rstrip() to remove them.
import datetime as dt
dates = "06 Mei 2021, 11:32 "
dates = dates.replace("Mei", "May") # The replacement will have to be handled for all months, this is only an example
dates = dates.rstrip()
date = dt.datetime.strptime(dates, "%d %B %Y, %H:%M")
print(date) # 2021-05-06 11:32:00
While this does fix the problem here, it's messy to have to cut the string with dates = dates[0:20] and then strip it. Consider using a regex to extract the date in the right format in one step.
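A rough sketch of the regex idea: match only the date and time and ignore whatever surrounds them. The pattern and the sample string (including the text after the separator) are assumptions based on the format shown in the question, and %B assumes English month names.

```python
import re
from datetime import datetime

# Assumed shape: "DD Month YYYY, HH:MM", possibly followed by other text.
DATE_PATTERN = re.compile(r"\d{1,2} \w+ \d{4}, \d{1,2}:\d{2}")

# Hypothetical scraped string with trailing spaces and separator characters.
raw = "06 Mei 2021, 11:32 \xa0|\xa0 some other text"

match = DATE_PATTERN.search(raw)
if match is None:
    raise ValueError(f"no date found in {raw!r}")

# Only the matched part is parsed, so no slicing or stripping is needed.
stamp = match.group(0).replace("Mei", "May")
parsed = datetime.strptime(stamp, "%d %B %Y, %H:%M")
print(parsed)
```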
The problem seems to be just the trailing white space you have, which explains the error ValueError: unconverted data remains: . It is complaining that it is unable to convert the remaining data (whitespace).
from datetime import datetime
s = '06 Mei 2021, 11:32 '.replace('Mei', 'May').strip()
datetime.strptime(s, '%d %B %Y, %H:%M')
# Returns datetime.datetime(2021, 5, 6, 11, 32)
Also, to convert all the Indonesian months to English, you can use a dictionary:
id_en_dict = {
...,
'Mei': 'May',
...
}
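One possible way to apply such a dictionary, as a sketch: the month spellings below are my own assumption of the standard Indonesian names (only those that differ from English are listed), and the function name parse_indonesian_date is hypothetical. %B assumes an English locale.

```python
import re
from datetime import datetime

# Assumed Indonesian month names that differ from their English equivalents.
ID_TO_EN_MONTHS = {
    "Januari": "January", "Februari": "February", "Maret": "March",
    "Mei": "May", "Juni": "June", "Juli": "July",
    "Agustus": "August", "Oktober": "October", "Desember": "December",
}

# One pattern that matches any of the month names as a whole word.
MONTH_PATTERN = re.compile(r"\b(" + "|".join(ID_TO_EN_MONTHS) + r")\b")


def parse_indonesian_date(text, fmt="%d %B %Y, %H:%M"):
    """Translate the month name to English, strip whitespace, then parse."""
    english = MONTH_PATTERN.sub(lambda m: ID_TO_EN_MONTHS[m.group(1)], text)
    return datetime.strptime(english.strip(), fmt)


print(parse_indonesian_date("06 Mei 2021, 11:32 "))
```

Months spelled the same in both languages (April, September, November) pass through unchanged.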