I have this list that represents FedEx tracking history:
history = ['Tuesday, March 16, 2021', '3:03 PM Hollywood, FL\nDelivered\nLeft at front door. Signature Service not requested.', '5:52 AM MIAMI, FL\nOn FedEx vehicle for delivery', '5:40 AM MIAMI, FL\nAt local FedEx facility', 'Monday, March 15, 2021', '11:42 PM OCALA, FL\nDeparted FedEx location', '10:01 PM OCALA, FL\nArrived at FedEx location', '8:28 PM OCALA, FL\nIn transit', '12:42 AM OCALA, FL\nIn transit']
How do I transform this list into a three-column DataFrame?
history = [
"Tuesday, March 16, 2021",
"3:03 PM Hollywood, FL\nDelivered\nLeft at front door. Signature Service not requested.",
"5:52 AM MIAMI, FL\nOn FedEx vehicle for delivery",
"5:40 AM MIAMI, FL\nAt local FedEx facility",
"Monday, March 15, 2021",
"11:42 PM OCALA, FL\nDeparted FedEx location",
"10:01 PM OCALA, FL\nArrived at FedEx location",
"8:28 PM OCALA, FL\nIn transit",
"12:42 AM OCALA, FL\nIn transit",
]
import re
import pandas as pd

r = re.compile(r"^(?:Sunday|Monday|Tuesday|Wednesday|Thursday|Friday|Saturday)")

data, cur_group = [], ""
for line in history:
    if r.match(line):
        # Day header, e.g. "Tuesday, March 16, 2021"
        cur_group = line
    else:
        # Event line: split the "time + location" prefix off the event text
        data.append([cur_group, *line.split("\n", maxsplit=1)])

df = pd.DataFrame(data)
print(df)
Prints:
0 1 2
0 Tuesday, March 16, 2021 3:03 PM Hollywood, FL Delivered\nLeft at front door. Signature Servi...
1 Tuesday, March 16, 2021 5:52 AM MIAMI, FL On FedEx vehicle for delivery
2 Tuesday, March 16, 2021 5:40 AM MIAMI, FL At local FedEx facility
3 Monday, March 15, 2021 11:42 PM OCALA, FL Departed FedEx location
4 Monday, March 15, 2021 10:01 PM OCALA, FL Arrived at FedEx location
5 Monday, March 15, 2021 8:28 PM OCALA, FL In transit
6 Monday, March 15, 2021 12:42 AM OCALA, FL In transit
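If you also want a single sortable timestamp per event, here is an optional follow-up sketch; the column names and the assumption that the time always leads the second column in "H:MM AM/PM" form are mine, not part of the question:
df.columns = ["date", "location", "event"]
# Pull the leading "3:03 PM"-style time out of the location column
time_part = df["location"].str.extract(r"^(\d{1,2}:\d{2} [AP]M)")[0]
# dateutil can parse "Tuesday, March 16, 2021 3:03 PM"-style strings
df["timestamp"] = pd.to_datetime(df["date"] + " " + time_part)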
You can use dateutil.parser.parse to check if an element is a valid datetime.
This should be safer than just checking if an element contains a day string (Monday, Tuesday, etc.) in case an event also contains a day string somewhere (e.g., Delivery failed\nWill reattempt on Monday).
import dateutil.parser
import pandas as pd

history = ['Tuesday, March 16, 2021', '3:03 PM Hollywood, FL\nDelivered\nLeft at front door. Signature Service not requested.', '5:52 AM MIAMI, FL\nOn FedEx vehicle for delivery', '5:40 AM MIAMI, FL\nAt local FedEx facility', 'Monday, March 15, 2021', '11:42 PM OCALA, FL\nDeparted FedEx location', '10:01 PM OCALA, FL\nArrived at FedEx location', '8:28 PM OCALA, FL\nIn transit', '12:42 AM OCALA, FL\nIn transit']

data = []
day = None
for string in history:
    try:
        # Day headers like "Tuesday, March 16, 2021" parse cleanly
        day = dateutil.parser.parse(string)
    except ValueError:
        # Event lines fail to parse, so attach them to the current day
        data.append([day, *string.split('\n', maxsplit=1)])
df = pd.DataFrame(data)
# 0 1 2
# 0 2021-03-16 3:03 PM Hollywood, FL Delivered\nLeft at front door. Signature Servi...
# 1 2021-03-16 5:52 AM MIAMI, FL On FedEx vehicle for delivery
# 2 2021-03-16 5:40 AM MIAMI, FL At local FedEx facility
# 3 2021-03-15 11:42 PM OCALA, FL Departed FedEx location
# 4 2021-03-15 10:01 PM OCALA, FL Arrived at FedEx location
# 5 2021-03-15 8:28 PM OCALA, FL In transit
# 6 2021-03-15 12:42 AM OCALA, FL In transit
OK, this is a bit hacky, but it might get the job done if the format is consistent; long term, a regex might be a better approach:
col1 = []
col2 = []
col3 = []
date = None  # assumes the list starts with a day header, so this is set before use
for h in history:
    if 'FL' in h:
        # Event line, e.g. "3:03 PM Hollywood, FL\nDelivered\n..."
        col1.append(date)
        new_list = h.split(',')
        item2 = new_list[0]      # time + city, e.g. "3:03 PM Hollywood"
        item3 = new_list[1][4:]  # drop " FL\n", leaving just the event text
        col2.append(item2.replace('\n', '. '))
        col3.append(item3.replace('\n', '. '))
    else:
        # Day header line
        date = h

pd.DataFrame({'col1': col1,
              'col2': col2,
              'col3': col3})
I'm a newbie seeking help.
I've tried the following without success.
from bs4 import BeautifulSoup
import pandas as pd
import requests

url = "https://www.canada.ca/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds.html"
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')
data = []

# Verifying tables and their classes
print('Classes of each table:')
for table in soup.find_all('table'):
    print(table.get('class'))
Result:
['table']
None
Can anyone help me with how to get this data?
Thank you so much.
The data you see on the page is loaded from an external URL. To load it, you can use the following example:
import requests
import pandas as pd
url = "https://www.canada.ca/content/dam/ircc/documents/json/ee_rounds_123_en.json"
data = requests.get(url).json()
df = pd.DataFrame(data["rounds"])
df = df.drop(columns=["drawNumberURL", "DrawText1", "mitext"])
print(df.head(10).to_markdown(index=False))
Prints:
| drawNumber | drawDate | drawDateFull | drawName | drawSize | drawCRS | drawText2 | drawDateTime | drawCutOff | drawDistributionAsOn | dd1 | dd2 | dd3 | dd4 | dd5 | dd6 | dd7 | dd8 | dd9 | dd10 | dd11 | dd12 | dd13 | dd14 | dd15 | dd16 | dd17 | dd18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 231 | 2022-09-14 | September 14, 2022 | No Program Specified | 3,250 | 510 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | September 14, 2022 at 13:29:26 UTC | January 08, 2022 at 10:24:52 UTC | September 12, 2022 | 408 | 6,228 | 63,860 | 5,845 | 9,505 | 19,156 | 16,541 | 12,813 | 58,019 | 12,245 | 12,635 | 9,767 | 11,186 | 12,186 | 68,857 | 35,833 | 5,068 | 238,273 |
| 230 | 2022-08-31 | August 31, 2022 | No Program Specified | 2,750 | 516 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 31, 2022 at 13:55:23 UTC | April 16, 2022 at 18:24:41 UTC | August 29, 2022 | 466 | 7,224 | 63,270 | 5,554 | 9,242 | 19,033 | 16,476 | 12,965 | 58,141 | 12,287 | 12,758 | 9,796 | 11,105 | 12,195 | 68,974 | 36,001 | 5,120 | 239,196 |
| 229 | 2022-08-17 | August 17, 2022 | No Program Specified | 2,250 | 525 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 17, 2022 at 13:43:47 UTC | December 28, 2021 at 11:03:15 UTC | August 15, 2022 | 538 | 8,221 | 62,753 | 5,435 | 9,129 | 18,831 | 16,465 | 12,893 | 58,113 | 12,200 | 12,721 | 9,801 | 11,138 | 12,253 | 68,440 | 35,745 | 5,137 | 238,947 |
| 228 | 2022-08-03 | August 3, 2022 | No Program Specified | 2,000 | 533 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 03, 2022 at 15:16:24 UTC | January 06, 2022 at 14:29:50 UTC | August 2, 2022 | 640 | 8,975 | 62,330 | 5,343 | 9,044 | 18,747 | 16,413 | 12,783 | 57,987 | 12,101 | 12,705 | 9,747 | 11,117 | 12,317 | 68,325 | 35,522 | 5,145 | 238,924 |
| 227 | 2022-07-20 | July 20, 2022 | No Program Specified | 1,750 | 542 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | July 20, 2022 at 16:32:49 UTC | December 30, 2021 at 15:29:35 UTC | July 18, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 226 | 2022-07-06 | July 6, 2022 | No Program Specified | 1,500 | 557 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | July 6, 2022 at 14:34:34 UTC | November 13, 2021 at 02:20:46 UTC | July 11, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 225 | 2022-06-22 | June 22, 2022 | Provincial Nominee Program | 636 | 752 | Provincial Nominee Program | June 22, 2022 at 14:13:57 UTC | April 19, 2022 at 13:45:45 UTC | June 20, 2022 | 664 | 8,017 | 55,917 | 4,246 | 7,845 | 16,969 | 15,123 | 11,734 | 53,094 | 10,951 | 11,621 | 8,800 | 10,325 | 11,397 | 64,478 | 33,585 | 4,919 | 220,674 |
| 224 | 2022-06-08 | June 8, 2022 | Provincial Nominee Program | 932 | 796 | Provincial Nominee Program | June 08, 2022 at 14:03:28 UTC | October 18, 2021 at 17:13:17 UTC | June 6, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 223 | 2022-05-25 | May 25, 2022 | Provincial Nominee Program | 590 | 741 | Provincial Nominee Program | May 25, 2022 at 13:21:23 UTC | February 02, 2022 at 12:29:53 UTC | May 23, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 222 | 2022-05-11 | May 11, 2022 | Provincial Nominee Program | 545 | 753 | Provincial Nominee Program | May 11, 2022 at 14:08:07 UTC | December 15, 2021 at 20:32:57 UTC | May 9, 2022 | 635 | 7,193 | 52,684 | 3,749 | 7,237 | 16,027 | 14,466 | 11,205 | 50,811 | 10,484 | 11,030 | 8,393 | 9,945 | 10,959 | 62,341 | 32,590 | 4,839 | 211,093 |
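If you want to work with the counts numerically, note that they arrive as strings with thousands separators. A small cleanup sketch (the column list is an assumption based on the output above):
num_cols = ["drawSize", "drawCRS"] + [f"dd{i}" for i in range(1, 19)]
# Strip the thousands separators, then convert each column to a numeric dtype
df[num_cols] = df[num_cols].replace(",", "", regex=True).apply(pd.to_numeric)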
The current code scrapes the individual fields, but I would like to map the times and titles together.
Since the webpage does not keep the time and title in the same class, how would this mapping occur?
Piggy-backing off this question - Link (my question uses an example where the times and titles are not of equal length).
Website for reference:
https://ash.confex.com/ash/2021/webprogram/WALKS.html
Sample Expected Output:
5:00 PM-6:00 PM, ASH Poster Walk on Geriatric Hematology: Selecting the Right Treatment for the Patient, Not Just the Disease
5:00 PM-6:00 PM, ASH Poster Walk on Healthcare Quality Improvement
etc
import requests
from bs4 import BeautifulSoup
url = 'https://ash.confex.com/ash/2021/webprogram/WALKS.html'
res = requests.get(url)
soup = BeautifulSoup(res.content,'html.parser')
productlist = soup.select('div.itemtitle > a')
times = soup.select('.time')
This could be an alternative:
import requests
from bs4 import BeautifulSoup

url = 'https://ash.confex.com/ash/2021/webprogram/WALKS.html'
# This gets the URL part before the last "/"
base_url = url.rsplit("/", 1)[0]
res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
productlist = soup.select('div.itemtitle > a')

for a in productlist:
    title = a.text.strip()
    # The time (<h3>) and date (<h4>) headings precede each item title
    time = a.find_previous('h3').text.strip()
    date = a.find_previous('h4').text.strip()
    page = a['href'].strip()
    # sep="\n" puts each field on its own line;
    # end="\n\n" leaves a blank line after each record
    print(title, date, time, base_url + "/" + page, sep="\n", end="\n\n")
OUTPUT
ASH Poster Walk on What's Hot in Sickle Cell Disease
Wednesday, December 15, 2021
10:00 AM-11:00 AM
https://ash.confex.com/ash/2021/webprogram/Session20816.html
ASH Poster Walk on Geriatric Hematology: Selecting the Right Treatment for the Patient, Not Just the Disease
Wednesday, December 15, 2021
5:00 PM-6:00 PM
https://ash.confex.com/ash/2021/webprogram/Session20695.html
ASH Poster Walk on Healthcare Quality Improvement
Wednesday, December 15, 2021
5:00 PM-6:00 PM
https://ash.confex.com/ash/2021/webprogram/Session21143.html
ASH Poster Walk on Natural Killer Cell-Based Immunotherapy
Wednesday, December 15, 2021
5:00 PM-6:00 PM
https://ash.confex.com/ash/2021/webprogram/Session20655.html
ASH Poster Walk on Pediatric Non-malignant Hematology Highlights
Wednesday, December 15, 2021
5:00 PM-6:00 PM
https://ash.confex.com/ash/2021/webprogram/Session20721.html
ASH Poster Walk on Clinical Trials In Progress
Thursday, December 16, 2021
10:00 AM-11:00 AM
https://ash.confex.com/ash/2021/webprogram/Session20589.html
ASH Poster Walk on Financial Toxicity in Hematologic Malignancies
Thursday, December 16, 2021
10:00 AM-11:00 AM
https://ash.confex.com/ash/2021/webprogram/Session20663.html
ASH Poster Walk on Diversity, Equity, and Inclusion in Hematologic Malignancies and Cell Therapy
Thursday, December 16, 2021
5:00 PM-6:00 PM
https://ash.confex.com/ash/2021/webprogram/Session20809.html
ASH Poster Walk on Emerging Research in Immunotherapies
Thursday, December 16, 2021
5:00 PM-6:00 PM
https://ash.confex.com/ash/2021/webprogram/Session20805.html
ASH Poster Walk on the Spectrum of Hemostasis and Thrombosis Research
Thursday, December 16, 2021
5:00 PM-6:00 PM
https://ash.confex.com/ash/2021/webprogram/Session20821.html
Try this:
# soup is the parsed WALKS.html page from the question
content = soup.find('div', {"class": "content"})
times = content.find_all("h3")

output = []
for i, h3 in enumerate(times):
    # Walk the siblings after each <h3> time heading until the next one starts
    for j in h3.next_siblings:
        if j.name:
            if j.name == "h3":
                break
            text = j.text.replace('\n', '')
            output.append(f"{times[i].text}, {text}")
print(output)
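If you would rather have the pairs in a DataFrame than a list of strings, a minimal sketch (assuming pandas is acceptable and that the time part never itself contains ", "):
import pandas as pd

df = pd.DataFrame(
    [s.split(", ", 1) for s in output],
    columns=["time", "title"],
)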
I want to extract only the time from text with many different date and time formats, such as 'thursday, august 6, 2020 4:32:54 pm', '25 september 2020 04:05 pm', and '29 april 2020 07:42'. So I want to extract, for example, 4:32:54, 04:05, and 07:42. Can you help me with that?
I would try something like this:
times = [
'thursday, august 6, 2020 4:32:54 pm',
'25 september 2020 04:05 pm',
'29 april 2020 07:42',
]
print("\n".join("".join(i for i in t.split() if ":" in i) for t in times))
Output:
4:32:54
04:05
07:42
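An alternative sketch with a regex, in case a time is ever not cleanly whitespace-delimited (assuming times look like H:MM, HH:MM, or H:MM:SS):
import re

# Matches an hour:minute pair with an optional :second part
pattern = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")
for t in times:
    match = pattern.search(t)
    if match:
        print(match.group())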
I am scraping the list of US presidents using Beautiful Soup and requests. I want to scrape both dates, the start of the presidency and the end of the presidency, but for some reason it's showing a list index out of range error. I'll provide the link so you can understand better.
Website link: https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq

my_url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = BeautifulSoup(page_html, 'html.parser')
containers = page_soup.find_all('table', class_='wikitable')
#print(containers[0])
#print(len(containers))
#print(soup.prettify(containers[0]))

container = containers[0]
date = container.find_all('span', attrs={'class': 'date'})
#print(len(date))
#print(date[0].text)

for container in containers:
    date_container = container.find_all('span', attrs={'class': 'date'})
    print(date_container[0].text)
The find_all function can return an empty list, and indexing into an empty list is exactly what raises your error. You can simply guard against it:
all_dates = []
for container in containers:
    date_container = container.find_all('span', attrs={'class': 'date'})
    all_dates.extend([date.text for date in date_container])
Since your last lines of code already store all the date spans of the first "wikitable" table, you can use a list comprehension:
date = [x.text for x in container.find_all('span' , attrs = {'class': 'date'})]
print(date)
Which will print:
['April 30, 1789', 'March 4, 1797', 'March 4, 1797', 'March 4, 1801', 'March 4, 1801'...
Since it has <table> tags, have you considered using pandas' .read_html()? It uses BeautifulSoup under the hood, takes a lot of the work out, and puts the data straight into a dataframe for you. The only work then needed is any manipulation or cleanup/filtering:
import pandas as pd
import re
my_url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States'
# Returns a list of dataframes
dfs = pd.read_html(my_url)
# Get the specific dataframe with the desired columns
df = dfs[1].iloc[:,[1,3]]
# Rename the columns
df.columns = ['Date','Name']
# Split the date column into start and end dates and drop the date column
df[['Start','End']] = df.Date.str.split('–', expand=True)
df = df.drop('Date',axis=1)
# Clean up the name column using regex to pull out the name
df['Name'] = [re.match(r'.+?(?=\d)', x)[0].strip().split('Born')[0] for x in df['Name']]
# Drop duplicate rows
df.drop_duplicates(inplace = True)
print (df)
Output:
print (df.to_string())
Name Start End
0 George Washington April 30, 1789[d] March 4, 1797
1 John Adams March 4, 1797 March 4, 1801
2 Thomas Jefferson March 4, 1801 March 4, 1809
3 James Madison March 4, 1809 March 4, 1817
4 James Monroe March 4, 1817 March 4, 1825
5 John Quincy Adams March 4, 1825 March 4, 1829
6 Andrew Jackson March 4, 1829 March 4, 1837
7 Martin Van Buren March 4, 1837 March 4, 1841
8 William Henry Harrison March 4, 1841 April 4, 1841(Died in office)
9 John Tyler April 4, 1841[i] March 4, 1845
10 James K. Polk March 4, 1845 March 4, 1849
11 Zachary Taylor March 4, 1849 July 9, 1850(Died in office)
12 Millard Fillmore July 9, 1850[k] March 4, 1853
13 Franklin Pierce March 4, 1853 March 4, 1857
14 James Buchanan March 4, 1857 March 4, 1861
15 Abraham Lincoln March 4, 1861 April 15, 1865(Assassinated)
16 Andrew Johnson April 15, 1865 March 4, 1869
17 Ulysses S. Grant March 4, 1869 March 4, 1877
18 Rutherford B. Hayes March 4, 1877 March 4, 1881
19 James A. Garfield March 4, 1881 September 19, 1881(Assassinated)
20 Chester A. Arthur September 19, 1881[n] March 4, 1885
21 Grover Cleveland March 4, 1885 March 4, 1889
22 Benjamin Harrison March 4, 1889 March 4, 1893
23 Grover Cleveland March 4, 1893 March 4, 1897
24 William McKinley March 4, 1897 September 14, 1901(Assassinated)
25 Theodore Roosevelt September 14, 1901 March 4, 1909
26 William Howard Taft March 4, 1909 March 4, 1913
27 Woodrow Wilson March 4, 1913 March 4, 1921
28 Warren G. Harding March 4, 1921 August 2, 1923(Died in office)
29 Calvin Coolidge August 2, 1923[o] March 4, 1929
30 Herbert Hoover March 4, 1929 March 4, 1933
31 Franklin D. Roosevelt March 4, 1933 April 12, 1945(Died in office)
32 Harry S. Truman April 12, 1945 January 20, 1953
33 Dwight D. Eisenhower January 20, 1953 January 20, 1961
34 John F. Kennedy January 20, 1961 November 22, 1963(Assassinated)
35 Lyndon B. Johnson November 22, 1963 January 20, 1969
36 Richard Nixon January 20, 1969 August 9, 1974(Resigned)
37 Gerald Ford August 9, 1974 January 20, 1977
38 Jimmy Carter January 20, 1977 January 20, 1981
39 Ronald Reagan January 20, 1981 January 20, 1989
40 George H. W. Bush January 20, 1989 January 20, 1993
41 Bill Clinton January 20, 1993 January 20, 2001
42 George W. Bush January 20, 2001 January 20, 2009
43 Barack Obama January 20, 2009 January 20, 2017
44 Donald Trump January 20, 2017 Incumbent
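If you want real datetimes out of Start and End, an optional cleanup sketch: strip footnote markers like "[d]" and notes like "(Died in office)" first (the regexes are mine, based on the output above):
df["Start"] = pd.to_datetime(df["Start"].str.replace(r"\[.*?\]", "", regex=True).str.strip())
# errors="coerce" turns the non-date "Incumbent" into NaT
df["End"] = pd.to_datetime(
    df["End"].str.replace(r"\[.*?\]|\(.*?\)", "", regex=True).str.strip(),
    errors="coerce",
)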
First, obligatory advance apologies - almost newbie here, and this is my first question; please be kind...
I'm struggling to scrape javascript generated pages; in particular those of the Metropolitan Opera schedule. For any given month, I would like to create a calendar with just the name of the production, and the date and time of performance. I threw beautifulsoup and selenium at it, and I can get tons of info about the composer's love life, etc. - but not these 3 elements. Any help would be greatly appreciated.
Link to a random month in their schedule
One thing that you should look for (in the future) on websites is calls to an API. I opened up Chrome Dev Tools (F12) and reloaded the page while in the Network tab.
I found two api calls, one for "productions" and one for "events". The "events" response has much more information. This code below makes a call to the "events" endpoint and then returns a subset of that data (specifically, title, date and time according to your description).
I wasn't sure what you wanted to do with that data so I just printed it out. Let me know if the code needs to be updated/modified and I will do my best to help!
I wrote this code using Python 3.6.4
from datetime import datetime
import requests

BASE_URL = 'http://www.metopera.org/api/v1/calendar'
EVENT = """\
Title: {title}
Date: {date}
Time: {time}
---------------\
"""

def get_events(*, month, year):
    params = {
        'month': month,
        'year': year
    }
    r = requests.get('{}/events'.format(BASE_URL), params=params)
    r.raise_for_status()
    return r.json()

def get_name_date_time(*, events):
    result = []
    for event in events:
        d = datetime.strptime(event['eventDateTime'], '%Y-%m-%dT%H:%M:%S')
        result.append({
            'title': event['title'],
            'date': d.strftime('%A, %B %d, %Y'),
            'time': d.strftime('%I:%M %p')
        })
    return result

if __name__ == '__main__':
    events = get_events(month=11, year=2018)
    names_dates_times = get_name_date_time(events=events)
    for event in names_dates_times:
        print(EVENT.format(**event))
Console:
Title: Tosca
Date: Friday, November 02, 2018
Time: 08:00 PM
---------------
Title: Carmen
Date: Saturday, November 03, 2018
Time: 01:00 PM
---------------
Title: Marnie
Date: Saturday, November 03, 2018
Time: 08:00 PM
---------------
Title: Tosca
Date: Monday, November 05, 2018
Time: 08:00 PM
---------------
Title: Carmen
Date: Tuesday, November 06, 2018
Time: 07:30 PM
---------------
Title: Marnie
Date: Wednesday, November 07, 2018
Time: 07:30 PM
---------------
Title: Mefistofele
Date: Thursday, November 08, 2018
Time: 07:30 PM
---------------
Title: Tosca
Date: Friday, November 09, 2018
Time: 08:00 PM
---------------
Title: Marnie
Date: Saturday, November 10, 2018
Time: 01:00 PM
---------------
Title: Carmen
Date: Saturday, November 10, 2018
Time: 08:00 PM
---------------
Title: Mefistofele
Date: Monday, November 12, 2018
Time: 07:30 PM
---------------
Title: Tosca
Date: Tuesday, November 13, 2018
Time: 07:30 PM
---------------
Title: Les Pêcheurs de Perles (The Pearl Fishers)
Date: Wednesday, November 14, 2018
Time: 07:30 PM
---------------
Title: Carmen
Date: Thursday, November 15, 2018
Time: 07:30 PM
---------------
Title: Mefistofele
Date: Friday, November 16, 2018
Time: 07:30 PM
---------------
Title: Tosca
Date: Saturday, November 17, 2018
Time: 01:00 PM
---------------
Title: Les Pêcheurs de Perles (The Pearl Fishers)
Date: Saturday, November 17, 2018
Time: 08:00 PM
---------------
Title: Mefistofele
Date: Monday, November 19, 2018
Time: 07:30 PM
---------------
Title: Les Pêcheurs de Perles (The Pearl Fishers)
Date: Tuesday, November 20, 2018
Time: 08:00 PM
---------------
Title: Il Trittico
Date: Friday, November 23, 2018
Time: 07:30 PM
---------------
Title: Les Pêcheurs de Perles (The Pearl Fishers)
Date: Saturday, November 24, 2018
Time: 01:00 PM
---------------
Title: Mefistofele
Date: Saturday, November 24, 2018
Time: 08:00 PM
---------------
Title: Il Trittico
Date: Monday, November 26, 2018
Time: 07:30 PM
---------------
Title: Mefistofele
Date: Tuesday, November 27, 2018
Time: 07:30 PM
---------------
Title: Les Pêcheurs de Perles (The Pearl Fishers)
Date: Wednesday, November 28, 2018
Time: 07:30 PM
---------------
Title: La Bohème
Date: Thursday, November 29, 2018
Time: 07:30 PM
---------------
Title: Il Trittico
Date: Friday, November 30, 2018
Time: 07:30 PM
---------------
For reference, here is a link to the full JSON response from the events endpoint. There is a bunch more potentially interesting information you may want but I just grabbed the subset of what you asked for in the description.
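If you want to explore what else the endpoint returns, a quick sketch reusing the get_events helper from above (assuming the response is a non-empty list of event objects, as in the code above):
events = get_events(month=11, year=2018)
print(sorted(events[0].keys()))  # every field the API returns for a single event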