I want to extract only the time from text containing many different date and time formats, such as 'thursday, august 6, 2020 4:32:54 pm', '25 september 2020 04:05 pm' and '29 april 2020 07:42'. So I want to extract, for example, 4:32:54, 04:05 and 07:42. Can you help me with that?
I would try something like this:
times = [
    'thursday, august 6, 2020 4:32:54 pm',
    '25 september 2020 04:05 pm',
    '29 april 2020 07:42',
]
# Keep only the whitespace-separated tokens that contain a colon.
print("\n".join("".join(i for i in t.split() if ":" in i) for t in times))
Output:
4:32:54
04:05
07:42
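If your real text is messier than these examples (times buried mid-sentence, or other colon-separated tokens around), a regex is a bit more defensive. A minimal sketch:
import re

times = [
    'thursday, august 6, 2020 4:32:54 pm',
    '25 september 2020 04:05 pm',
    '29 april 2020 07:42',
]

# \d{1,2}:\d{2} matches the hour:minute part; (?::\d{2})? optionally
# picks up a trailing :seconds group.
for t in times:
    m = re.search(r'\b\d{1,2}:\d{2}(?::\d{2})?\b', t)
    if m:
        print(m.group())
This prints the same 4:32:54, 04:05 and 07:42 as above.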
I am scraping the list of US presidents using Beautiful Soup and requests. I want to scrape both dates, i.e. the start and the end of each presidency, but for some reason it's throwing a "list index out of range" error. I'll provide the link so you can understand better.
Website link: https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq

my_url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = BeautifulSoup(page_html, 'html.parser')
containers = page_soup.find_all('table', class_='wikitable')
#print(containers[0])
#print(len(containers))
#print(soup.prettify(containers[0]))

container = containers[0]
date = container.find_all('span', attrs={'class': 'date'})
#print(len(date))
#print(date[0].text)

for container in containers:
    date_container = container.find_all('span', attrs={'class': 'date'})
    print(date_container[0].text)
The find_all function can return an empty list, which is what leads to your "list index out of range" error when you then index date_container[0]. You can simply avoid indexing altogether and collect the dates from every table:
all_dates = []
for container in containers:
    date_container = container.find_all('span', attrs={'class': 'date'})
    all_dates.extend([date.text for date in date_container])
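If you'd rather keep the print loop from your question, a minimal guarded version (same variables as your code) skips tables with no date spans instead of raising:
for container in containers:
    date_container = container.find_all('span', attrs={'class': 'date'})
    if date_container:  # empty list -> skip this table instead of raising IndexError
        print(date_container[0].text)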
Since your last lines of code already store all the date spans from the first "wikitable" table, you can use a list comprehension:
date = [x.text for x in container.find_all('span', attrs={'class': 'date'})]
print(date)
Which will print:
['April 30, 1789', 'March 4, 1797', 'March 4, 1797', 'March 4, 1801', 'March 4, 1801'...
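Since the question asks for the start and end of each presidency, and the spans appear to come out in (start, end) pairs per row, a hedged pairing sketch (assuming that pairing holds for every row of the table):
# Pair consecutive spans as (start, end); assumes each presidency
# contributes exactly two date spans, in order.
terms = list(zip(date[::2], date[1::2]))
print(terms[:2])  # e.g. [('April 30, 1789', 'March 4, 1797'), ('March 4, 1797', 'March 4, 1801')]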
Since it has <table> tags, have you considered using pandas' .read_html()? It uses BeautifulSoup under the hood, takes a lot of the work out, and puts the data straight into a dataframe for you. The only work then needed is any manipulation or cleanup/filtering:
import pandas as pd
import re
my_url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States'
# Returns a list of dataframes
dfs = pd.read_html(my_url)
# Get the specific dataframe with the desired columns
df = dfs[1].iloc[:,[1,3]]
# Rename the columns
df.columns = ['Date','Name']
# Split the date column into start and end dates and drop the date column
df[['Start','End']] = df.Date.str.split('–', expand=True)
df = df.drop('Date',axis=1)
# Clean up the name column: the lazy match grabs everything before the
# first digit (the start of the birth/death years), then any trailing
# 'Born' remnant is dropped
df['Name'] = [re.match(r'.+?(?=\d)', x)[0].strip().split('Born')[0] for x in df['Name']]
# Drop duplicate rows
df.drop_duplicates(inplace = True)
print (df)
Output (printed with df.to_string() to show every row):
Name Start End
0 George Washington April 30, 1789[d] March 4, 1797
1 John Adams March 4, 1797 March 4, 1801
2 Thomas Jefferson March 4, 1801 March 4, 1809
3 James Madison March 4, 1809 March 4, 1817
4 James Monroe March 4, 1817 March 4, 1825
5 John Quincy Adams March 4, 1825 March 4, 1829
6 Andrew Jackson March 4, 1829 March 4, 1837
7 Martin Van Buren March 4, 1837 March 4, 1841
8 William Henry Harrison March 4, 1841 April 4, 1841(Died in office)
9 John Tyler April 4, 1841[i] March 4, 1845
10 James K. Polk March 4, 1845 March 4, 1849
11 Zachary Taylor March 4, 1849 July 9, 1850(Died in office)
12 Millard Fillmore July 9, 1850[k] March 4, 1853
13 Franklin Pierce March 4, 1853 March 4, 1857
14 James Buchanan March 4, 1857 March 4, 1861
15 Abraham Lincoln March 4, 1861 April 15, 1865(Assassinated)
16 Andrew Johnson April 15, 1865 March 4, 1869
17 Ulysses S. Grant March 4, 1869 March 4, 1877
18 Rutherford B. Hayes March 4, 1877 March 4, 1881
19 James A. Garfield March 4, 1881 September 19, 1881(Assassinated)
20 Chester A. Arthur September 19, 1881[n] March 4, 1885
21 Grover Cleveland March 4, 1885 March 4, 1889
22 Benjamin Harrison March 4, 1889 March 4, 1893
23 Grover Cleveland March 4, 1893 March 4, 1897
24 William McKinley March 4, 1897 September 14, 1901(Assassinated)
25 Theodore Roosevelt September 14, 1901 March 4, 1909
26 William Howard Taft March 4, 1909 March 4, 1913
27 Woodrow Wilson March 4, 1913 March 4, 1921
28 Warren G. Harding March 4, 1921 August 2, 1923(Died in office)
29 Calvin Coolidge August 2, 1923[o] March 4, 1929
30 Herbert Hoover March 4, 1929 March 4, 1933
31 Franklin D. Roosevelt March 4, 1933 April 12, 1945(Died in office)
32 Harry S. Truman April 12, 1945 January 20, 1953
33 Dwight D. Eisenhower January 20, 1953 January 20, 1961
34 John F. Kennedy January 20, 1961 November 22, 1963(Assassinated)
35 Lyndon B. Johnson November 22, 1963 January 20, 1969
36 Richard Nixon January 20, 1969 August 9, 1974(Resigned)
37 Gerald Ford August 9, 1974 January 20, 1977
38 Jimmy Carter January 20, 1977 January 20, 1981
39 Ronald Reagan January 20, 1981 January 20, 1989
40 George H. W. Bush January 20, 1989 January 20, 1993
41 Bill Clinton January 20, 1993 January 20, 2001
42 George W. Bush January 20, 2001 January 20, 2009
43 Barack Obama January 20, 2009 January 20, 2017
44 Donald Trump January 20, 2017 Incumbent
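Note the bracketed footnote markers ([d], [i], [k], ...) and parentheticals like "(Died in office)" that survive in the Start/End columns. If you want real datetime columns, a follow-up sketch (assuming the df built above; "Incumbent" and anything else unparseable becomes NaT):
# Strip bracketed footnote markers like '[d]' and parenthetical notes
# like '(Died in office)', then parse what's left into datetimes.
for col in ['Start', 'End']:
    df[col] = (df[col].str.replace(r'\[.*?\]', '', regex=True)
                      .str.replace(r'\(.*?\)', '', regex=True)
                      .str.strip())
    # errors='coerce' turns unparseable values such as 'Incumbent' into NaT
    df[col] = pd.to_datetime(df[col], errors='coerce')
print(df.head())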
First, obligatory advance apologies - almost a newbie here, and this is my first question; please be kind...
I'm struggling to scrape JavaScript-generated pages; in particular, those of the Metropolitan Opera schedule. For any given month, I would like to create a calendar with just the name of the production and the date and time of each performance. I threw BeautifulSoup and Selenium at it, and I can get tons of info about the composer's love life, etc. - but not these 3 elements. Any help would be greatly appreciated.
Link to a random month in their schedule
One thing that you should look for (in the future) on websites like this is calls to an API. I opened up Chrome DevTools (F12) and reloaded the page while on the Network tab.
I found two API calls, one for "productions" and one for "events". The "events" response has much more information. The code below makes a call to the "events" endpoint and then returns a subset of that data (specifically title, date and time, per your description).
I wasn't sure what you wanted to do with that data so I just printed it out. Let me know if the code needs to be updated/modified and I will do my best to help!
I wrote this code using Python 3.6.4
from datetime import datetime

import requests

BASE_URL = 'http://www.metopera.org/api/v1/calendar'

EVENT = """\
Title: {title}
Date: {date}
Time: {time}
---------------\
"""


def get_events(*, month, year):
    params = {
        'month': month,
        'year': year
    }
    r = requests.get('{}/events'.format(BASE_URL), params=params)
    r.raise_for_status()
    return r.json()


def get_name_date_time(*, events):
    result = []
    for event in events:
        d = datetime.strptime(event['eventDateTime'], '%Y-%m-%dT%H:%M:%S')
        result.append({
            'title': event['title'],
            'date': d.strftime('%A, %B %d, %Y'),
            'time': d.strftime('%I:%M %p')
        })
    return result


if __name__ == '__main__':
    events = get_events(month=11, year=2018)
    names_dates_times = get_name_date_time(events=events)
    for event in names_dates_times:
        print(EVENT.format(**event))
Console:
Title: Tosca
Date: Friday, November 02, 2018
Time: 08:00 PM
---------------
Title: Carmen
Date: Saturday, November 03, 2018
Time: 01:00 PM
---------------
Title: Marnie
Date: Saturday, November 03, 2018
Time: 08:00 PM
---------------
Title: Tosca
Date: Monday, November 05, 2018
Time: 08:00 PM
---------------
Title: Carmen
Date: Tuesday, November 06, 2018
Time: 07:30 PM
---------------
Title: Marnie
Date: Wednesday, November 07, 2018
Time: 07:30 PM
---------------
Title: Mefistofele
Date: Thursday, November 08, 2018
Time: 07:30 PM
---------------
Title: Tosca
Date: Friday, November 09, 2018
Time: 08:00 PM
---------------
Title: Marnie
Date: Saturday, November 10, 2018
Time: 01:00 PM
---------------
Title: Carmen
Date: Saturday, November 10, 2018
Time: 08:00 PM
---------------
Title: Mefistofele
Date: Monday, November 12, 2018
Time: 07:30 PM
---------------
Title: Tosca
Date: Tuesday, November 13, 2018
Time: 07:30 PM
---------------
Title: Les Pêcheurs de Perles (The Pearl Fishers)
Date: Wednesday, November 14, 2018
Time: 07:30 PM
---------------
Title: Carmen
Date: Thursday, November 15, 2018
Time: 07:30 PM
---------------
Title: Mefistofele
Date: Friday, November 16, 2018
Time: 07:30 PM
---------------
Title: Tosca
Date: Saturday, November 17, 2018
Time: 01:00 PM
---------------
Title: Les Pêcheurs de Perles (The Pearl Fishers)
Date: Saturday, November 17, 2018
Time: 08:00 PM
---------------
Title: Mefistofele
Date: Monday, November 19, 2018
Time: 07:30 PM
---------------
Title: Les Pêcheurs de Perles (The Pearl Fishers)
Date: Tuesday, November 20, 2018
Time: 08:00 PM
---------------
Title: Il Trittico
Date: Friday, November 23, 2018
Time: 07:30 PM
---------------
Title: Les Pêcheurs de Perles (The Pearl Fishers)
Date: Saturday, November 24, 2018
Time: 01:00 PM
---------------
Title: Mefistofele
Date: Saturday, November 24, 2018
Time: 08:00 PM
---------------
Title: Il Trittico
Date: Monday, November 26, 2018
Time: 07:30 PM
---------------
Title: Mefistofele
Date: Tuesday, November 27, 2018
Time: 07:30 PM
---------------
Title: Les Pêcheurs de Perles (The Pearl Fishers)
Date: Wednesday, November 28, 2018
Time: 07:30 PM
---------------
Title: La Bohème
Date: Thursday, November 29, 2018
Time: 07:30 PM
---------------
Title: Il Trittico
Date: Friday, November 30, 2018
Time: 07:30 PM
---------------
For reference, here is a link to the full JSON response from the events endpoint. There is a bunch more potentially interesting information you may want but I just grabbed the subset of what you asked for in the description.
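If you want to see which other fields are available before extending get_name_date_time, a quick sketch (reusing get_events from above, and assuming the response is a non-empty list of dicts, as the code above already relies on):
events = get_events(month=11, year=2018)
# Each event is a dict; its keys are the fields the API exposes.
print(sorted(events[0].keys()))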
I have a text that will change weekly:
text = "Weekly Comparison, Week 50 October 28 - November 3, 2016 October 30 - November 5, 2015"
I'm looking for regex patterns for year 1 and year 2.
(Both will change weekly, so I need the formula to capture all months, days and years.)
My output should be the following:
2015 = November 5, 2015
2016 = November 3, 2016
The framework I'm using does not allow for regex capture groups or splits, so I need the formula to be specialized for this type of string.
Thanks!
Code
As per my original comments
See regex in use here
(\w+\s+\d+,\s*(\d+))
Note: The above regex and the regex on regex101 do not match. This is done purposely. Regex101 can only demonstrate the output of substitutions, thus I've prepended .*? to the regex in order to properly display the expected output.
Results
Input
Weekly Comparison, Week 50 October 28 - November 3, 2016 October 30 - November 5, 2015
Output
2016 = November 3, 2016
2015 = November 5, 2015
Usage
import re

regex = r"(\w+\s+\d+,\s*(\d+))"
text = "Weekly Comparison, Week 50 October 28 - November 3, 2016 October 30 - November 5, 2015"

# re.findall returns (full date, year) tuples because of the two groups
for (date, year) in re.findall(regex, text):
    print(year + ' = ' + date)
You can try this:
text = "Weekly Comparison, Week 50 October 28 - November 3, 2016 October 30 - November 5, 2015"
import re
final_data = sorted(["{} = {}".format(re.findall("\d+$", i)[0], i) for i in re.findall("[a-zA-Z]+\s\d+,\s\d+", text)], key=lambda x:int(re.findall("^\d+", x)[0]))
Output:
['2015 = November 5, 2015', '2016 = November 3, 2016']
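The same logic, unrolled for readability (an equivalent sketch of the comprehension above):
import re

text = "Weekly Comparison, Week 50 October 28 - November 3, 2016 October 30 - November 5, 2015"

final_data = []
for date in re.findall(r"[a-zA-Z]+\s\d+,\s\d+", text):  # e.g. 'November 3, 2016'
    year = re.findall(r"\d+$", date)[0]                  # trailing 4-digit year
    final_data.append("{} = {}".format(year, date))

final_data.sort(key=lambda s: int(s.split(" = ")[0]))    # sort by year
print(final_data)  # ['2015 = November 5, 2015', '2016 = November 3, 2016']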
Using @ctwheels' regex:
text = "Weekly Comparison, Week 50 October 28 - November 3, 2016 October 30 - November 5, 2015"
import re
result = [(date.split(",")[1].strip(), date) for date in re.findall(r'\w+\s+\d+,\s*\d+', text)]
print(result)
# [('2016', 'November 3, 2016'), ('2015', 'November 5, 2015')]
I need to extract the info from this page - http://www.investing.com/currencies/usd-brl-historical-data. I need Date, Price, Open, High, Low, Change %.
I'm new to Python, so I got stuck at this step:
import requests
from bs4 import BeautifulSoup
from datetime import datetime

url = 'http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
g_data = soup.find_all('table', {'class': 'genTbl closedTbl historicalTbl'})
d = []
for item in g_data:
    Table_Values = item.find_all('tr')
    N = len(Table_Values) - 1
    for n in range(N):
        k = (item.find_all('td', {'class': 'first left bold noWrap'})[n].text)
        print(item.find_all('td', {'class': 'first left bold noWrap'})[n].text)
Here I have several problems:
The column for Price can be tagged with class 'redFont' or 'greenFont'. How can I specify that I want items tagged with class 'redFont' and/or 'greenFont'? (Change % can also have class redFont or greenFont; the other columns are plain td tags.) How can I extract them?
Is there a way to extract whole columns from the table?
Ideally I would like to have a dataframe with the columns Date, Price, Open, High, Low, Change %.
Thanks
I have already answered how to parse the table from that site here, but since you want a DataFrame, just use pandas.read_html:
import requests
import pandas as pd

url = 'http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
# read_html returns a list of dataframes; attrs selects the table by its id
df = pd.read_html(r.content, attrs={'id': 'curr_table'})[0]
Which will give you:
Date Price Open High Low Change %
0 Jun 08, 2016 3.3609 3.4411 3.4465 3.3584 -2.36%
1 Jun 07, 2016 3.4421 3.4885 3.5141 3.4401 -1.36%
2 Jun 06, 2016 3.4896 3.5265 3.5295 3.4840 -1.09%
3 Jun 05, 2016 3.5280 3.5280 3.5280 3.5280 0.11%
4 Jun 03, 2016 3.5240 3.5910 3.5947 3.5212 -1.91%
5 Jun 02, 2016 3.5926 3.6005 3.6157 3.5765 -0.22%
6 Jun 01, 2016 3.6007 3.6080 3.6363 3.5755 -0.29%
7 May 31, 2016 3.6111 3.5700 3.6383 3.5534 1.11%
8 May 30, 2016 3.5713 3.6110 3.6167 3.5675 -1.11%
9 May 27, 2016 3.6115 3.5824 3.6303 3.5792 0.81%
10 May 26, 2016 3.5825 3.5826 3.5857 3.5757 -0.03%
11 May 25, 2016 3.5836 3.5702 3.6218 3.5511 0.34%
12 May 24, 2016 3.5713 3.5717 3.5903 3.5417 -0.04%
13 May 23, 2016 3.5728 3.5195 3.5894 3.5121 1.49%
14 May 20, 2016 3.5202 3.5633 3.5663 3.5154 -1.24%
15 May 19, 2016 3.5644 3.5668 3.6197 3.5503 -0.11%
16 May 18, 2016 3.5683 3.4877 3.5703 3.4854 2.28%
17 May 17, 2016 3.4888 3.4990 3.5300 3.4812 -0.32%
18 May 16, 2016 3.5001 3.5309 3.5366 3.4944 -0.96%
19 May 13, 2016 3.5340 3.4845 3.5345 3.4630 1.39%
20 May 12, 2016 3.4855 3.4514 3.5068 3.4346 0.95%
21 May 11, 2016 3.4528 3.4755 3.4835 3.4389 -0.66%
22 May 10, 2016 3.4758 3.5155 3.5173 3.4623 -1.15%
23 May 09, 2016 3.5164 3.5010 3.6766 3.4906 0.40%
You can generally pass the URL directly to read_html, but we get a 403 error for this particular site using urllib2 (the lib used by read_html), so we need requests to fetch the html first.
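If requests itself ever starts getting a 403 too (some sites also reject its default python-requests User-Agent), a common workaround is to send a browser-like header; a minimal sketch:
import requests

url = 'http://www.investing.com/currencies/usd-brl-historical-data'
# A minimal browser-like User-Agent; adjust as needed for the site.
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers)
r.raise_for_status()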
Here's a way to convert the html table into a nested list
The solution is to find the specific table, then loop through each tr in the table, creating a sublist of the text of all the items inside that tr. The code to do this is a nested list comprehension.
import requests
from bs4 import BeautifulSoup
from pprint import pprint
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
table = soup.find("table", {"id" : "curr_table"})
#first row is empty
tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]
pprint(tableRows)
This gets all the data from the table
[['Jun 08, 2016', '3.3614', '3.4411', '3.4465', '3.3584', '-2.34%'],
['Jun 07, 2016', '3.4421', '3.4885', '3.5141', '3.4401', '-1.36%'],
['Jun 06, 2016', '3.4896', '3.5265', '3.5295', '3.4840', '-1.09%'],
['Jun 05, 2016', '3.5280', '3.5280', '3.5280', '3.5280', '0.11%'],
['Jun 03, 2016', '3.5240', '3.5910', '3.5947', '3.5212', '-1.91%'],
['Jun 02, 2016', '3.5926', '3.6005', '3.6157', '3.5765', '-0.22%'],
['Jun 01, 2016', '3.6007', '3.6080', '3.6363', '3.5755', '-0.29%'],
['May 31, 2016', '3.6111', '3.5700', '3.6383', '3.5534', '1.11%'],
['May 30, 2016', '3.5713', '3.6110', '3.6167', '3.5675', '-1.11%'],
['May 27, 2016', '3.6115', '3.5824', '3.6303', '3.5792', '0.81%'],
['May 26, 2016', '3.5825', '3.5826', '3.5857', '3.5757', '-0.03%'],
['May 25, 2016', '3.5836', '3.5702', '3.6218', '3.5511', '0.34%'],
['May 24, 2016', '3.5713', '3.5717', '3.5903', '3.5417', '-0.04%'],
['May 23, 2016', '3.5728', '3.5195', '3.5894', '3.5121', '1.49%'],
['May 20, 2016', '3.5202', '3.5633', '3.5663', '3.5154', '-1.24%'],
['May 19, 2016', '3.5644', '3.5668', '3.6197', '3.5503', '-0.11%'],
['May 18, 2016', '3.5683', '3.4877', '3.5703', '3.4854', '2.28%'],
['May 17, 2016', '3.4888', '3.4990', '3.5300', '3.4812', '-0.32%'],
['May 16, 2016', '3.5001', '3.5309', '3.5366', '3.4944', '-0.96%'],
['May 13, 2016', '3.5340', '3.4845', '3.5345', '3.4630', '1.39%'],
['May 12, 2016', '3.4855', '3.4514', '3.5068', '3.4346', '0.95%'],
['May 11, 2016', '3.4528', '3.4755', '3.4835', '3.4389', '-0.66%'],
['May 10, 2016', '3.4758', '3.5155', '3.5173', '3.4623', '-1.15%'],
['May 09, 2016', '3.5164', '3.5010', '3.6766', '3.4906', '0.40%']]
If you want to convert it to a pandas dataframe you just need to also grab the table headings and add them
import requests
from bs4 import BeautifulSoup
import pandas
from pprint import pprint
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
table = soup.find("table", {"id" : "curr_table"})
tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]
#get headers for dataframe
tableHeaders = [th.text for th in table.find_all("th")]
#build df from tableRows and headers
df = pandas.DataFrame(tableRows, columns=tableHeaders)
print(df)
Then you'll get a dataframe that looks like this:
Date Price Open High Low Change %
0 Jun 08, 2016 3.3596 3.4411 3.4465 3.3584 -2.40%
1 Jun 07, 2016 3.4421 3.4885 3.5141 3.4401 -1.36%
2 Jun 06, 2016 3.4896 3.5265 3.5295 3.4840 -1.09%
3 Jun 05, 2016 3.5280 3.5280 3.5280 3.5280 0.11%
4 Jun 03, 2016 3.5240 3.5910 3.5947 3.5212 -1.91%
5 Jun 02, 2016 3.5926 3.6005 3.6157 3.5765 -0.22%
6 Jun 01, 2016 3.6007 3.6080 3.6363 3.5755 -0.29%
7 May 31, 2016 3.6111 3.5700 3.6383 3.5534 1.11%
8 May 30, 2016 3.5713 3.6110 3.6167 3.5675 -1.11%
9 May 27, 2016 3.6115 3.5824 3.6303 3.5792 0.81%
10 May 26, 2016 3.5825 3.5826 3.5857 3.5757 -0.03%
11 May 25, 2016 3.5836 3.5702 3.6218 3.5511 0.34%
12 May 24, 2016 3.5713 3.5717 3.5903 3.5417 -0.04%
13 May 23, 2016 3.5728 3.5195 3.5894 3.5121 1.49%
14 May 20, 2016 3.5202 3.5633 3.5663 3.5154 -1.24%
15 May 19, 2016 3.5644 3.5668 3.6197 3.5503 -0.11%
16 May 18, 2016 3.5683 3.4877 3.5703 3.4854 2.28%
17 May 17, 2016 3.4888 3.4990 3.5300 3.4812 -0.32%
18 May 16, 2016 3.5001 3.5309 3.5366 3.4944 -0.96%
19 May 13, 2016 3.5340 3.4845 3.5345 3.4630 1.39%
20 May 12, 2016 3.4855 3.4514 3.5068 3.4346 0.95%
21 May 11, 2016 3.4528 3.4755 3.4835 3.4389 -0.66%
22 May 10, 2016 3.4758 3.5155 3.5173 3.4623 -1.15%
23 May 09, 2016 3.5164 3.5010 3.6766 3.4906 0.40%
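One caveat with a frame built this way: every column holds strings. If you plan to do any arithmetic, a cleanup sketch (assuming the df and column names produced above) might look like this:
# All scraped values are strings; convert before doing any math.
df['Date'] = pandas.to_datetime(df['Date'], format='%b %d, %Y')
for col in ['Price', 'Open', 'High', 'Low']:
    df[col] = df[col].astype(float)
# Strip the '%' sign and convert the change column to a fraction.
df['Change %'] = df['Change %'].str.rstrip('%').astype(float) / 100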