Scraping Date of News - python

I am trying to scrape the publication date of a news article from https://finansial.bisnis.com/read/20210506/90/1391096/laba-bank-mega-tumbuh-dua-digit-kuartal-i-2021-ini-penopangnya. Here's my code:
news['tanggal'] = newsScrape['date']
dates = []
for x in news['tanggal']:
    x = listToString(x)
    x = x.strip()
    x = x.replace('\r', '').replace('\n', '').replace(' \xa0|\xa0', ',').replace('|', ', ')
    dates.append(x)
dates = listToString(dates)
dates = dates[0:20]
if len(dates) == 0:
    continue
news['tanggal'] = dt.datetime.strptime(dates, '%d %B %Y, %H:%M')
but I got this error:
ValueError: time data '06 Mei 2021, 11:32 ' does not match format '%d %B %Y, %H:%M'
My assumption is that this happens because Mei is Indonesian, while the format expects the English May. How can I change Mei to May? I have tried dates = dates.replace('Mei', 'May'), but it doesn't work for me: I then get the error ValueError: unconverted data remains: . The type of dates is string. Thanks

You can try with the following
import datetime as dt
import requests
from bs4 import BeautifulSoup
url = "https://finansial.bisnis.com/read/20210506/90/1391096/laba-bank-mega-tumbuh-dua-digit-kuartal-i-2021-ini-penopangnya"
r = requests.get(url, verify=False)
soup = BeautifulSoup(r.content, 'html.parser')
info_soup = soup.find(class_="new-description")
x = info_soup.find('span').get_text(strip=True)
x = x.strip()
x = x.replace('\r', '').replace('\n', '').replace(' \xa0|\xa0', ',').replace('|', ', ')
x = x[0:20]
x = x.rstrip()
date = dt.datetime.strptime(x.replace('Mei', 'May'), '%d %B %Y, %H:%M')
print(date)
result:
2021-05-06 11:45:00

Your assumption regarding the Mei -> May change is correct. The likely reason you still hit a problem after the replacement is the trailing space in your string, which is not accounted for in your format. You can use str.rstrip() to remove it.
import datetime as dt
dates = "06 Mei 2021, 11:32 "
dates = dates.replace("Mei", "May") # The replacement will have to be handled for all months, this is only an example
dates = dates.rstrip()
date = dt.datetime.strptime(dates, "%d %B %Y, %H:%M")
print(date) # 2021-05-06 11:32:00
While this does fix the problem here, it's messy to have to trim the string like this after dates = dates[0:20]. Consider using a regex to extract the date in the expected format in one step.
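A minimal sketch of the regex approach mentioned above: the pattern pulls the `DD Month YYYY, HH:MM` portion out of the cleaned text in one step, so no slicing or stripping is needed. The helper name `extract_date_text` and the sample string (including the `WIB` suffix) are assumptions for illustration; the real scraped text may be laid out differently.

```python
import re

# Day, month name, 4-digit year, comma, HH:MM
DATE_PATTERN = re.compile(r"\d{1,2} [A-Za-z]+ \d{4}, \d{2}:\d{2}")

def extract_date_text(text):
    """Return the first 'DD Month YYYY, HH:MM' substring, or None if absent."""
    match = DATE_PATTERN.search(text)
    return match.group(0) if match else None

cleaned = "06 Mei 2021, 11:32 WIB "  # hypothetical cleaned text
print(extract_date_text(cleaned))  # 06 Mei 2021, 11:32
```

The month name still has to be translated before strptime can parse it with %B.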

The problem seems to be just the trailing white space you have, which explains the error ValueError: unconverted data remains: . It is complaining that it is unable to convert the remaining data (whitespace).
from datetime import datetime
s = '06 Mei 2021, 11:32 '.replace('Mei', 'May').strip()
datetime.strptime(s, '%d %B %Y, %H:%M')
# Returns datetime.datetime(2021, 5, 6, 11, 32)
Also, to convert all the Indonesian months to English, you can use a dictionary:
id_en_dict = {
...,
'Mei': 'May',
...
}
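One possible variation on the dictionary idea, sketched here rather than taken from the answer: map the Indonesian month names straight to month numbers and build the datetime by hand. This sidesteps %B entirely, which matters because %B follows the current locale. The names `ID_MONTHS` and `parse_id_date` are made up for this example, and the input layout is assumed to be `DD Month YYYY, HH:MM`.

```python
from datetime import datetime

# Indonesian month names mapped to month numbers
ID_MONTHS = {
    "Januari": 1, "Februari": 2, "Maret": 3, "April": 4,
    "Mei": 5, "Juni": 6, "Juli": 7, "Agustus": 8,
    "September": 9, "Oktober": 10, "November": 11, "Desember": 12,
}

def parse_id_date(text):
    """Parse a string like '06 Mei 2021, 11:32' into a datetime."""
    date_part, time_part = text.strip().split(", ")
    day, month_name, year = date_part.split()
    hour, minute = time_part.split(":")
    return datetime(int(year), ID_MONTHS[month_name], int(day), int(hour), int(minute))

print(parse_id_date("06 Mei 2021, 11:32 "))  # 2021-05-06 11:32:00
```

An unknown month name raises KeyError, which makes unexpected input easy to spot.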

Related

time data 'Month' does not match format '%Y-%b' (match)

Hi I am working on a data set given below
Month,Travellers('000)
Jan-91,1724
Feb-91,1638
Mar-91,1987
Apr-91,1825
May-91,
Jun-91,1879
I am using the below code to format the date
data = pd.read_csv('Metrail+dataset.csv', header = None)
data.columns = ['Month','Travellers']
data['Month'] = pd.to_datetime(data['Month'], format='%m-%Y')
data = data.set_index('Month')
data.head(12)
However, getting the below error
ValueError: time data 'Month' does not match format '%m-%Y' (match)
Could someone tell me what the mistake is, and share any useful links for learning more about date formats?
%Y is a 4-digit year, whereas %y is a 2-digit year.
%m is the month as a number, whereas %b is the abbreviated month name.
Also remove header=None: it makes pandas treat the header row as data, which is why the string 'Month' ends up being parsed as a date.
data = pd.read_csv('data.csv')
data.columns = ['Month', 'Travellers']
data['Month'] = pd.to_datetime(data['Month'], format='%b-%y')
Use %b and (as mentioned) %y:
data['Month'] = pd.to_datetime(data['Month'], format='%b-%y')
From the docs:
%b is the month as the locale's abbreviated name, e.g. Sep
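The same directives can be checked quickly with the standard library's datetime.strptime, which uses the same format codes. This is only a small sketch using sample values from the question; it shows that 'Jan-91' parses with '%b-%y' and that the header text 'Month' cannot be parsed with any date format.

```python
from datetime import datetime

# %b matches the abbreviated month name, %y the 2-digit year
parsed = datetime.strptime("Jan-91", "%b-%y")
print(parsed)  # 1991-01-01 00:00:00

# The header text is not a date, so parsing it fails regardless of format
try:
    datetime.strptime("Month", "%b-%y")
    header_parsed = True
except ValueError as exc:
    header_parsed = False
    print(exc)
```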

Converting String to datetime with a specific initial format

I scraped some dates with this format:
review_date = ['9 August 2018 ', '7 August 2018 ']
and I wanted to convert each string to a datetime format like this '%d-%m-%Y'
for d in review_date:
    d = datetime.datetime.strptime(d, '%d-%m-%Y')
but the following error appears because the string isn't in the initial format that datetime expects:
ValueError: time data '9 August 2018 ' does not match format '%d-%m-%Y'
Is there an easy way to convert this or do I need to replace my string?
strptime converts the string to a datetime object. The format string you pass to it describes the format the string is already in. Your strings also end with a space, so strip them first:
d = datetime.datetime.strptime(d.strip(), '%d %B %Y') # %B is the full month name
You can generate any string you want from that object later with strftime(), passing the format you want:
s = d.strftime('%d-%m-%Y')
You can chain the calls in a single line:
result = datetime.datetime.strptime(d.strip(), '%d %B %Y').strftime('%d-%m-%Y')
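Applied to the whole list from the question, the conversion might look like the sketch below. The trailing space in each scraped string is stripped first, since strptime rejects leftover characters; the name `converted` is just an illustrative choice.

```python
from datetime import datetime

review_date = ['9 August 2018 ', '7 August 2018 ']

# Parse with the format the strings are in, then render the format you want
converted = [
    datetime.strptime(d.strip(), '%d %B %Y').strftime('%d-%m-%Y')
    for d in review_date
]
print(converted)  # ['09-08-2018', '07-08-2018']
```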

Parsing a string and converting a date using Python

I am trying to parse "For The Year Ending December 31, 2015" and convert it to 2015-12-31 using the datetime lib. How would I go about partitioning and then converting the date? My program looks through an Excel file with multiple sheets and combines them into one; however, I now need to add a date column, and I can only get it to write the full value to the cell. So my data column currently has "For The Year Ending December 31, 2015" in all the rows.
Thanks in advance!
Here is the code block that is working now. Thanks all! edited to account for text that could vary.
if rx > (options.startrow-1):
    ws.write(rowcount, 0, sheet.name)
    date_value = sheet.cell_value(4,0)
    s = date_value.split(" ")
    del s[-1]
    del s[-1]
    del s[-1]
    string = ' '.join(s)
    d = datetime.strptime(date_value, string + " %B %d, %Y")
    result = datetime.strftime(d, '%Y-%m-%d')
    ws.write(rowcount, 9, result)
    for cx in range(sheet.ncols):
Simply include the hard-coded portion and then use the proper identifiers:
>>> import datetime
>>> s = "For The Year Ending December 31, 2015"
>>> d = datetime.datetime.strptime(s, 'For The Year Ending %B %d, %Y')
>>> result = datetime.datetime.strftime(d, '%Y-%m-%d')
>>> print(result)
2015-12-31
from datetime import datetime
date_string = 'For The Year Ending December 31, 2015'
date_string_format = 'For The Year Ending %B %d, %Y'
date_print_class = datetime.strptime(date_string, date_string_format)
wanted_date = datetime.strftime(date_print_class, '%Y-%m-%d')
print(wanted_date)
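When the leading text can vary, as the asker's edit suggests, another option is to pick the date out with a regular expression and parse only that part. This is a rough sketch under the assumption that the date always appears as `Month D, YYYY`; the function name `extract_year_end` is hypothetical.

```python
import re
from datetime import datetime

def extract_year_end(text):
    """Find a 'Month D, YYYY' date in text and return it as YYYY-MM-DD."""
    match = re.search(r"[A-Za-z]+ \d{1,2}, \d{4}", text)
    if match is None:
        raise ValueError(f"no date found in {text!r}")
    return datetime.strptime(match.group(0), "%B %d, %Y").strftime("%Y-%m-%d")

print(extract_year_end("For The Year Ending December 31, 2015"))  # 2015-12-31
```

This avoids building the format string from the cell text, at the cost of depending on the date layout staying the same.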

Convert date from mm/dd/yyyy to another format in Python

I am trying to write a program that asks for the user to input the date in the format mm/dd/yyyy and convert it. So, if the user input 01/01/2009, the program should display January 01, 2009. This is my program so far. I managed to convert the month, but the other elements have a bracket around them so it displays January [01] [2009].
date=input('Enter a date(mm/dd/yyyy)')
replace=date.replace('/',' ')
convert=replace.split()
day=convert[1:2]
year=convert[2:4]
for ch in convert:
    if ch[:2]=='01':
        print('January ',day,year )
Thank you in advance!
Don't reinvent the wheel: use a combination of strptime() and strftime() from the datetime module, which is part of the Python standard library (docs):
>>> from datetime import datetime
>>> date_input = input('Enter a date(mm/dd/yyyy): ')
Enter a date(mm/dd/yyyy): 11/01/2013
>>> date_object = datetime.strptime(date_input, '%m/%d/%Y')
>>> print(date_object.strftime('%B %d, %Y'))
November 01, 2013
You might want to look into Python's datetime library, which will take care of interpreting dates for you. https://docs.python.org/2/library/datetime.html#module-datetime
from datetime import datetime
d = input('Enter a date(mm/dd/yyyy)')
# now convert the string into a datetime object given the pattern
d = datetime.strptime(d, "%m/%d/%Y")
# print the datetime in any format you wish
print(d.strftime("%B %d, %Y"))
You can check what %m, %d and other identifiers stand for here: https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
As a suggestion use dateutil, which infers the format by itself:
>>> from dateutil.parser import parse
>>> parse('01/05/2009').strftime('%B %d, %Y')
'January 05, 2009'
>>> parse('2009-JAN-5').strftime('%B %d, %Y')
'January 05, 2009'
>>> parse('2009.01.05').strftime('%B %d, %Y')
'January 05, 2009'
Split the original input by the slashes:
convert = date.split('/')
and then create a dictionary of the months:
months = {1:"January",etc...}
and then to display it, index the list so you get strings rather than lists:
print(months[int(convert[0])], convert[1] + ',', convert[2])
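The brackets in the question's output come from slicing: convert[1:2] returns a one-element list, while convert[1] returns the string itself. A small sketch of the manual approach follows, using the standard library's calendar.month_name in place of a hand-written dictionary (an assumption of this example, not part of the answer above); the function name `format_date` is made up.

```python
import calendar

def format_date(date_text):
    """Convert 'mm/dd/yyyy' to 'Month dd, yyyy' without strptime."""
    month, day, year = date_text.split('/')
    # calendar.month_name[1] is 'January'; index 0 is an empty string
    return f"{calendar.month_name[int(month)]} {day}, {year}"

print(format_date('01/01/2009'))  # January 01, 2009
```

Note that calendar.month_name follows the current locale, so the English names shown here assume the default locale.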

ValueError: unconverted data remains: 02:05

I have some dates in a JSON file, and I am searching for the ones that correspond to today's date:
import os
import time
from datetime import datetime
from pytz import timezone
input_file = file(FILE, "r")
j = json.loads(input_file.read().decode("utf-8-sig"))
os.environ['TZ'] = 'CET'
for item in j:
    lt = time.strftime('%A %d %B')
    st = item['start']
    st = datetime.strptime(st, '%A %d %B')
    if st == lt :
        item['start'] = datetime.strptime(st,'%H:%M')
I had an error like this :
File "/home/--/--/--/app/route.py", line 35, in file.py
st = datetime.strptime(st, '%A %d %B')
File "/usr/lib/python2.7/_strptime.py", line 328, in _strptime
data_string[found.end():])
ValueError: unconverted data remains: 02:05
Do you have any suggestions ?
The value of st at the line st = datetime.strptime(st, '%A %d %B') is something like 01 01 2013 02:05, which strptime can't parse with that format: there is a time in addition to the date. You need to add %H:%M to the format string.
The best answer is to use dateutil's parser.
Usage:
from dateutil import parser
datetime_obj = parser.parse('2018-02-06T13:12:18.1278015Z')
print(repr(datetime_obj))
# output: datetime.datetime(2018, 2, 6, 13, 12, 18, 127801, tzinfo=tzutc())
You have to parse all of the input string, you cannot just ignore parts.
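from datetime import date, datetime
today = date.today()
for item in j:
    st = datetime.strptime(item['start'], '%A %d %B %H:%M')
    # The input has no year, so strptime defaults to 1900; compare month and day only
    if (st.month, st.day) == (today.month, today.day):
        item['start'] = st.time()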
Here, we compare the date to today's date by using more datetime objects instead of trying to use strings.
The alternative is to only pass in part of the item['start'] string (splitting out just the time), but there really is no point here, not when you could just parse everything in one step first.
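For completeness, the split-based alternative could look roughly like this. The sample value is an assumption about the `Weekday DD Month HH:MM` layout of item['start']; the variable names are illustrative.

```python
from datetime import datetime

start = "Friday 06 December 02:05"  # assumed layout of item['start']

# Split once from the right: date text on the left, time text on the right
date_text, time_text = start.rsplit(" ", 1)
start_time = datetime.strptime(time_text, "%H:%M").time()

print(date_text)   # Friday 06 December
print(start_time)  # 02:05:00
```

Here date_text can be compared directly with a string such as time.strftime('%A %d %B'), as the question's code attempts to do.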
Well, it was very simple. I had the wrong date format for the JSON file, so I should have written:
st = datetime.strptime(st, '%A %d %B %H:%M')
because in the JSON file the date looked like this:
"start": "Friday 06 December 02:05",
timeobj = datetime.datetime.strptime(my_time, '%Y-%m-%d %I:%M:%S')
File "/usr/lib/python2.7/_strptime.py", line 335, in _strptime
data_string[found.end():])
ValueError: unconverted data remains:
In my case, the problem was an extra space in the input date string. So I used strip() and it started to work.
Just cut the string down to the part that matches the format, for example:
st = datetime.strptime(st[:-6], '%A %d %B')
ValueError: unconverted data remains: 02:05 means that part of your date string, here the time, is not covered by the pattern passed to datetime.strptime. A simple trick is to check whether the string includes a time, e.g. len(date_string) > 10:
from datetime import datetime
date_strings = ['2022-12-31 02:05:00', '2022-12-31', '2023-01-01 05:30:00', '2023-01-01']
dates = []
for date_string in date_strings:
    if len(date_string) > 10:
        # String has time information
        date = datetime.strptime(date_string, "%Y-%m-%d %H:%M:%S")
    else:
        # String has no time information
        date = datetime.strptime(date_string, "%Y-%m-%d")
    dates.append(date)
print(dates)
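A variation on the length check, offered as a sketch rather than as the answer's method: try each known format in turn and keep the first one that parses. The names `FORMATS` and `parse_flexible` are hypothetical, and the list of formats is an assumption about the data.

```python
from datetime import datetime

# Candidate formats, most specific first
FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")

def parse_flexible(date_string):
    """Try each known format and return the first successful parse."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(date_string, fmt)
        except ValueError:
            continue
    raise ValueError(f"no known format matches {date_string!r}")

print(parse_flexible("2022-12-31 02:05:00"))  # 2022-12-31 02:05:00
print(parse_flexible("2022-12-31"))           # 2022-12-31 00:00:00
```

This does not rely on string length, so adding another layout only means adding another entry to FORMATS.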
