Web scraping just printing "[]"? - python

I am attempting to get the names and prices of the listings on a cruise website.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.ncl.com/vacations?cruise-destination=transatlantic'
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
names = soup.find_all('h2', class_='headline -medium -small-xs -variant-1')
prices = soup.find_all('span', class_='headline-3 -variant-1')
print(names)
print(prices)
This just ends up printing empty brackets ([]).

BeautifulSoup can only see HTML elements which exist in the HTML document at the time the document is served to you from the server. It cannot see elements in the DOM which normally would be populated/created asynchronously using JavaScript (by a browser).
The page you're trying to scrape is of the second kind: The HTML document the server served to you at the time you requested it only contains the "barebones" scaffolding of the page, which, if you're viewing the page in a browser, will be populated at a later point in time via JavaScript. This is typically achieved by the browser by making additional requests to other resources/APIs, whose response contains the information with which to populate the page.
BeautifulSoup is not a browser; it's just an HTML/XML parser. You made a single request and got back a (mostly empty) HTML template. You can expect BeautifulSoup NOT to work for any "fancy" page - if you see a spinning "loading" graphic, you should immediately think "this page is populated asynchronously using JavaScript, and BeautifulSoup won't work for this".
There are cases where the information you're trying to scrape is actually embedded somewhere in the HTML at the time the server serves it to you - possibly in a <script> tag - and the browser is then expected to use JavaScript to make this data presentable. In such a case, BeautifulSoup would be able to see the data - that's a separate matter, though.
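That's not the case here, but as an illustration of the embedded-data scenario, here is a minimal sketch (the tag id and JSON shape are invented for the example; they are not from the cruise page):
import json
from bs4 import BeautifulSoup

# Hypothetical HTML in which the server embeds the data in a <script> tag.
html = """
<html><body>
<script id="initial-state" type="application/json">
  {"listings": [{"name": "Example Cruise", "price": 999}]}
</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
data = json.loads(soup.find("script", id="initial-state").string)
print(data["listings"][0]["name"])  # Example Cruise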
In your case, one solution would be to view the page in a browser, and log your network traffic. Doing this reveals that, once the page loads, an XHR HTTP GET request is made to a REST API endpoint, the response of which is JSON and contains all the information you're trying to scrape. The trick then is to imitate that request: copy the endpoint URL (including query-string parameters) and any necessary request headers (and payload, if it's a POST request. In this case, it isn't).
Inspecting the response gives us further clues on how to write our script: The JSON response contains ALL itineraries, even ones we aren't interested in (such as non-transatlantic trips). This means that, normally, the browser must run some JavaScript to filter the itineraries - this happens client-side, not server-side. Therefore, our script will have to perform the same kind of filtering.
def get_itineraries():
    import requests
    url = "https://www.ncl.com/fr/en/api/vacations/v1/itineraries"
    params = {
        "guests": "2",
        "v": "1414764913-1626184979267"
    }
    headers = {
        "accept": "application/json",
        "accept-encoding": "gzip, deflate",
        "user-agent": "Mozilla/5.0"
    }
    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()

    def predicate(itinerary):
        return any(dest["code"] == "TRANSATLANTIC" for dest in itinerary["destination"])

    yield from filter(predicate, response.json()["itineraries"])

def main():
    from itertools import islice

    def get_cheapest_price(itinerary):
        def get_valid_option(sailing):
            def predicate(option):
                return "combinedPrice" in option
            return next(filter(predicate, sailing["pricing"]))
        return min(get_valid_option(sailing)["combinedPrice"] for sailing in itinerary["sailings"])

    itineraries = list(islice(get_itineraries(), 50))
    prices = map(get_cheapest_price, itineraries)

    for itinerary, price in sorted(zip(itineraries, prices), key=lambda tpl: tpl[1]):
        print("[{}{}] - {}".format(itinerary["currency"]["symbol"], price, itinerary["title"]["fullTitle"]))
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
[€983] - 12-Day Transatlantic From London To New York: Spain & Bermuda
[€984] - 11-Day Transatlantic from Miami to Barcelona: Ponta Delgada, Azores
[€1024] - 15-Day Transatlantic from Rio de Janeiro to Barcelona: Spain & Brazil
[€1177] - 15-Day Transatlantic from Rome to New York: Italy, France & Spain
[€1190] - 14-Day Transatlantic from Barcelona to New York: Spain & Bermuda
[€1234] - 14-Day Transatlantic from Lisbon to Rio de Janeiro: Spain & Brazil
[€1254] - 11-Day Europe from Rome to London: Italy, France, Spain & Portugal
[€1271] - 15-Day Transatlantic From New York to Rome: Italy, France & Spain
[€1274] - 15-Day Transatlantic from New York to Barcelona: Spain & Bermuda
[€1296] - 13-Day Transatlantic From New York to London: France & Ireland
[€1411] - 17-Day Transatlantic from Rome to Miami: Italy, France & Spain
[€1420] - 15-Day Transatlantic From New York to Barcelona: France & Spain
[€1438] - 16-Day Transatlantic from Rome to New York: Italy, France & Spain
[€1459] - 15-Day Transatlantic from Barcelona to Tampa: Bahamas, Spain & Bermuda
[€1473] - 11-Day Transatlantic from New York to Reykjavik: Halifax & Akureyri
[€1486] - 16-Day Transatlantic from Rome to New York: Italy, France & Spain
[€1527] - 15-Day Transatlantic from New York to Rome: Italy, France & Spain
[€1529] - 14-Day Transatlantic From New York to London: France & Ireland
[€1580] - 16-day Transatlantic From Barcelona to New York: Spain & Bermuda
[€1595] - 16-Day Transatlantic From New York to Rome: Italy, France & Spain
[€1675] - 16-Day Transatlantic from New York to Rome: Italy, France & Spain
[€1776] - 14-Day Transatlantic from New York to London: England & Ireland
[€1862] - 12-Day Transatlantic From London to New York: Scotland & Iceland
[€2012] - 15-Day Transatlantic from New York to Barcelona: Spain & Bermuda
[€2552] - 14-Day Transatlantic from New York to London: England & Ireland
[€2684] - 16-Day Transatlantic from New York to London: France & Ireland
[€3460] - 16-Day Transatlantic from New York to London: France & Ireland
For more information on logging your browser's network traffic, finding REST API endpoints (if they exist), and imitating requests, take a look at this other answer I posted to a similar question.

Related

Scraping an HTML site using BeautifulSoup and finding the value of "total_pages" in it

I'm writing Python code that scrapes the following website and looks for the value of "total_pages" in it.
The website is https://www.usnews.com/best-colleges/fl
When I open the website in a browser and investigate the source code, the value of "total_pages" is 8. I want my python code to be able to get the same value.
I have written the following code:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
main_site = requests.get("https://www.usnews.com/best-colleges/fl", headers=headers)
main_site_content = main_site.content
main_site_content_soup = BeautifulSoup(main_site_content, "html.parser")
But then I get stuck on how to look for "total_pages" in the parsed data. I have tried the find_all() method, but no luck. I think I'm not using the method correctly.
One note: the solution does not have to use BeautifulSoup. I just used BeautifulSoup since I was a bit familiar with it.
No need for BeautifulSoup. Here I make a request to their API to get the list of universities.
from rich import print is used to pretty-print the JSON. It should make it easier to read.
If you need more help or advice, leave a comment below.
import requests
import pandas as pd
from rich import print

LINK = "https://www.usnews.com/best-colleges/api/search?format=json&location=Florida&_sort=rank&_sortDirection=asc&_page=1"

def get_data(url):
    print("Making request to:", url)
    response = requests.get(url, timeout=5, headers={"User-Agent": "Mozilla/5.0"})
    if response.status_code == 200:
        print("Request Successful!")
        data = response.json()["data"]
        return data["items"], data["next_link"]
    print("Request failed!")
    return None, None

def main():
    print("Starting Scraping...")
    items, next_link = get_data(LINK)
    # if there's a `next_link`, scrape it too.
    while next_link is not None:
        print("Getting data from:", next_link)
        new_items, next_link = get_data(next_link)
        items += new_items
    # clean the data for the pandas dataframe.
    items = [
        {
            "name": item["institution"]["displayName"],
            "state": item["institution"]["state"],
            "city": item["institution"]["city"],
        }
        for item in items
    ]
    df = pd.DataFrame(items)
    print(df.to_markdown())

if __name__ == "__main__":
    main()
The output looks like this:
|     | name                              | state | city           |
|-----|-----------------------------------|-------|----------------|
| 0   | University of Florida             | FL    | Gainesville    |
| 1   | Florida State University          | FL    | Tallahassee    |
| 2   | University of Miami               | FL    | Coral Gables   |
| 3   | University of South Florida       | FL    | Tampa          |
| 4   | University of Central Florida     | FL    | Orlando        |
| 5   | Florida International University  | FL    | Miami          |
| 6   | Florida A&M University            | FL    | Tallahassee    |
| 7   | Florida Institute of Technology   | FL    | Melbourne      |
| 8   | Nova Southeastern University      | FL    | Ft. Lauderdale |
| ... | ...                               | ...   | ...            |
| 74  | St. John Vianney College Seminary | FL    | Miami          |
| 75  | St. Petersburg College            | FL    | St. Petersburg |
| 76  | Tallahassee Community College     | FL    | Tallahassee    |
| 77  | Valencia College                  | FL    | Orlando        |
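If you specifically need the "total_pages" value seen in the page source, one option is a plain regex over the raw HTML. This is only a sketch, and it assumes the value appears verbatim as "total_pages" followed by a number somewhere in the served document:
import re
import requests

headers = {"User-Agent": "Mozilla/5.0"}
html = requests.get("https://www.usnews.com/best-colleges/fl", headers=headers).text

# Assumes the page embeds something like "total_pages":8 in a script tag.
match = re.search(r'"total_pages"\s*:\s*(\d+)', html)
if match:
    print(int(match.group(1)))  # e.g. 8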

How to scrape hidden class data using selenium and beautiful soup

I'm trying to scrape content from a JavaScript-enabled web page. I need to extract the data in the table on that website. However, each row of the table has a button (an arrow) that reveals additional information for that row.
I need to extract that additional description for each row. By inspecting, I observed that the contents behind each row's arrow belong to the same class. However, that class is hidden in the page source; it can only be seen while inspecting. The data I'm trying to parse is from the webpage.
I have used Selenium and Beautiful Soup. I'm able to scrape the table data, but not the content behind those arrows. Python returns an empty list for the class of the arrow content, but it works for the class of the normal table data.
from bs4 import BeautifulSoup
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://projects.sfchronicle.com/2020/layoff-tracker/')
html_source = browser.page_source
soup = BeautifulSoup(html_source,'html.parser')
data = soup.find_all('div',class_="sc-fzoLsD jxXBhc rdt_ExpanderRow")
print(data.text)
To print hidden data, you can use this example:
import re
import json
import requests
from bs4 import BeautifulSoup
url = 'https://projects.sfchronicle.com/2020/layoff-tracker/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
data_url = 'https://projects.sfchronicle.com' + soup.select_one('link[href*="commons-"]')['href']
data = re.findall(r'n\.exports=JSON\.parse\(\'(.*?)\'\)', requests.get(data_url).text)[1]
data = json.loads(data.replace(r"\'", "'"))
# uncomment this to see all data:
# print(json.dumps(data, indent=4))
for d in data[4:]:
    print('{:<50}{:<10}{:<30}{:<30}{:<30}{:<30}{:<30}'.format(*d.values()))
Prints:
Company Layoffs City County Month Industry Company description
Tesla (Temporary layoffs. Factory reopened) 11083 Fremont Alameda County April Industrial Car maker
Bon Appetit Management Co. 3015 San Francisco San Francisco County April Food Food supplier
GSW Arena LLC-Chase Center 1720 San Francisco San Francisco County May Sports Arena vendors
YMCA of Silicon Valley 1657 Santa Clara Santa Clara County May Sports Gym
Nutanix Inc. (Temporary furlough of 2 weeks) 1434 San Jose Santa Clara County April Tech Cloud computing
TeamSanJose 1304 San Jose Santa Clara County April Travel Tourism bureau
San Francisco Giants 1200 San Francisco San Francisco County April Sports Stadium vendors
Lyft 982 San Francisco San Francisco County April Tech Ride hailing
YMCA of San Francisco 959 San Francisco San Francisco County May Sports Gym
Hilton San Francisco Union Square 923 San Francisco San Francisco County April Travel Hotel
Six Flags Discovery Kingdom 911 Vallejo Solano County June Entertainment Amusement park
San Francisco Marriott Marquis 808 San Francisco San Francisco County April Travel Hotel
Aramark 777 Oakland Alameda County April Food Food supplier
The Palace Hotel 774 San Francisco San Francisco County April Travel Hotel
Back of the House Inc 743 San Francisco San Francisco County April Food Restaurant
DPR Construction 715 Redwood City San Mateo County April Real estate Construction
...and so on.
The content you are interested in is generated when you click a button, so you'll want to locate the button. There are a million ways you could do this, but I would suggest something like:
elements = driver.find_elements(By.XPATH, '//button')
For your specific case you could also use:
elements = driver.find_elements(By.CSS_SELECTOR, 'button[class|="sc"]')
Once you have the button elements, you can click each one in turn:
for element in elements:
    element.click()
Parsing the page after this should get you the JavaScript-generated content you are looking for.
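Putting the two steps together, here is a rough sketch (it assumes Selenium 4, reuses the class names from the question, and uses a crude time.sleep instead of proper waits):
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('https://projects.sfchronicle.com/2020/layoff-tracker/')
time.sleep(10)  # crude wait for the table to render

# Click every expander button so the hidden rows are added to the DOM.
for button in driver.find_elements(By.CSS_SELECTOR, 'button[class|="sc"]'):
    try:
        button.click()
    except Exception:
        pass  # some buttons may be off-screen or not clickable

soup = BeautifulSoup(driver.page_source, 'html.parser')
for row in soup.find_all('div', class_='rdt_ExpanderRow'):
    print(row.get_text(strip=True))

driver.quit()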

Extracting JavaScript Variables into Python Dictionaries

Understanding that I have to use PyQt5 in conjunction with BeautifulSoup to run JavaScript on my client after extracting the HTML, I am trying to convert the variable _Flourish_data into a Python dictionary.
Is there an easy way to extract the Javascript variable, _Flourish_data, into a Python dictionary? Here is my current Python to extract the Javascript using PyQt5 and BeautifulSoup:
import bs4 as bs
import sys
import urllib.request
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
class Page(QWebEnginePage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def _on_load_finished(self):
        self.html = self.toHtml(self.Callable)

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()

page = Page('https://flo.uri.sh/visualisation/2451841/embed?auto=1')
soup = bs.BeautifulSoup(page.html, 'html.parser')
js_test = soup.find_all('script')
js_test[5]
The output of the existing code is
<script>
function _Flourish_unflattenInto(dest, src) {
dest = dest || {};
for (var k in src) {
var t = dest;
for (var i = k.indexOf("."), p = 0; i >= 0; i = k.indexOf(".", p = i+1)) {
var s = k.substring(p, i);
if (!(s in t)) t[s] = {};
t = t[s];
}
t[k.substring(p)] = src[k];
}
return dest;
}
var _Flourish_settings = {"cell_fill_1":"#ffffff","cell_fill_2":"#ebebeb","cell_fill_direction":"horizontal","cell_font_size":"1","cell_height":20,"cell_horizontal_alignment":"center","cell_link_color":"#2886b2","cell_padding_horizontal":16,"cell_padding_vertical":11,"column_width_mode":"auto","column_widths":"10%, 10%, 10%, 10%, 50%, 10%","header_fill":"#181f6c","header_font_color":"#ffffff","header_font_default":false,"header_font_size":1.1,"header_horizontal_alignment":"center","header_style_default":true,"layout.body_font":{"name":"Source Sans Pro","url":"https://fonts.googleapis.com/css?family=Source+Sans+Pro:400,700"},"layout.layout_order":"stack-default","layout.space_between_sections":"0.5","mobile.view":true,"no_results_text":"Use the search bar to find your state","pagination_amount":41,"pagination_amount_search":"5","search_enabled":false,"search_hide_table":false,"search_placeholder":"Search to find your state","search_resize":true,"search_width":15};
_Flourish_unflattenInto(window.template.state, _Flourish_settings);
var _Flourish_data_column_names = {"rows":{"columns":["State ","Earliest/Planned Start Date for 20/21 Academic Year ","","","",""]}},
_Flourish_data = {"rows":[{"columns":["Alabama","Varies by district","","","",""]},{"columns":["Alaska","Varies by district","","","",""]},{"columns":["American Samoa","Unknown","","","",""]},{"columns":["Arizona","Varies by district","","","",""]},{"columns":["Arkansas","Varies by district","","","",""]},{"columns":["Bureau of Indian Education","Varies by district","","","",""]},{"columns":["California","Varies by district","","","",""]},{"columns":["Colorado","Varies by district","","","",""]},{"columns":["Connecticut","Not yet determined","","","",""]},{"columns":["Delaware","Varies by district","","","",""]},{"columns":["Department of Defense Education Activity\n ","Varies by district","","","",""]},{"columns":["District of Columbia","8/31/2020","","","",""]},{"columns":["Florida","Unknown","","","",""]},{"columns":["Georgia","Unknown","","","",""]},{"columns":["Guam","Unknown","","","",""]},{"columns":["Hawaii","Not yet determined","","","",""]},{"columns":["Idaho","Varies by District","","","",""]},{"columns":["Illinois","Varies by district","","","",""]},{"columns":["Indiana","Not yet determined","","","",""]},{"columns":["Iowa","Varies by district","","","",""]},{"columns":["Kansas","Not yet determined","","","",""]},{"columns":["Kentucky","Unknown","","","",""]},{"columns":["Louisiana","Varies by district","","","",""]},{"columns":["Maine","Varies by district","","","",""]},{"columns":["Maryland","Not yet determined","","","",""]},{"columns":["Massachusetts","Not yet determined","","","",""]},{"columns":["Michigan","Not yet determined","","","",""]},{"columns":["Minnesota","Not yet determined","","","",""]},{"columns":["Mississippi ","Varies by district","","","",""]},{"columns":["Missouri","Varies by district","","","",""]},{"columns":["Montana","Varies by district","","","",""]},{"columns":["Nebraska","Varies by district","","","",""]},{"columns":["Nevada","Varies by district","","","",""]},{"columns":["New Hampshire","Not yet determined","","","",""]},{"columns":["New Jersey","Varies by district","","","",""]},{"columns":["New Mexico","Unknown","","","",""]},{"columns":["New York","Not yet determined","","","",""]},{"columns":["North Carolina","8/17/2020","","","",""]},{"columns":["North Dakota","Varies by district","","","",""]},{"columns":["Northern Marianas","Unknown","","","",""]},{"columns":["Ohio","Not yet determined","","","",""]},{"columns":["Oklahoma","Varies by district","","","",""]},{"columns":["Oregon","Not yet determined","","","",""]},{"columns":["Pennsylvania","Varies by district","","","",""]},{"columns":["Puerto Rico","Unknown","","","",""]},{"columns":["Rhode Island","Not yet determined","","","",""]},{"columns":["South Carolina","Not yet determined","","","",""]},{"columns":["South Dakota","Varies by district","","","",""]},{"columns":["Tennessee","Varies by district","","","",""]},{"columns":["Texas","Varies by district","","","",""]},{"columns":["U.S. Virgin Islands\n ","Not yet determined","","","",""]},{"columns":["Utah","Varies by district","","","",""]},{"columns":["Vermont","Not yet determined","","","",""]},{"columns":["Virginia","Not yet determined","","","",""]},{"columns":["Washington","Varies by District","","","",""]},{"columns":["West Virginia","Not yet determined","","","",""]},{"columns":["Wisconsin","Varies by district","","","",""]},{"columns":["Wyoming","Not yet determined","","","",""]}]};
for (var _Flourish_dataset in _Flourish_data) {
window.template.data[_Flourish_dataset] = _Flourish_data[_Flourish_dataset];
window.template.data[_Flourish_dataset].column_names = _Flourish_data_column_names[_Flourish_dataset];
}
window.template.draw();
</script>
I just want the _Flourish_data variable from that script tag, as shown below:
_Flourish_data = {"rows":[{"columns":["Alabama","Varies by district","","","",""]},{"columns":["Alaska","Varies by district","","","",""]},{"columns":["American Samoa","Unknown","","","",""]},{"columns":["Arizona","Varies by district","","","",""]},{"columns":["Arkansas","Varies by district","","","",""]},{"columns":["Bureau of Indian Education","Varies by district","","","",""]},{"columns":["California","Varies by district","","","",""]},{"columns":["Colorado","Varies by district","","","",""]},{"columns":["Connecticut","Not yet determined","","","",""]},{"columns":["Delaware","Varies by district","","","",""]},{"columns":["Department of Defense Education Activity\n ","Varies by district","","","",""]},{"columns":["District of Columbia","8/31/2020","","","",""]},{"columns":["Florida","Unknown","","","",""]},{"columns":["Georgia","Unknown","","","",""]},{"columns":["Guam","Unknown","","","",""]},{"columns":["Hawaii","Not yet determined","","","",""]},{"columns":["Idaho","Varies by District","","","",""]},{"columns":["Illinois","Varies by district","","","",""]},{"columns":["Indiana","Not yet determined","","","",""]},{"columns":["Iowa","Varies by district","","","",""]},{"columns":["Kansas","Not yet determined","","","",""]},{"columns":["Kentucky","Unknown","","","",""]},{"columns":["Louisiana","Varies by district","","","",""]},{"columns":["Maine","Varies by district","","","",""]},{"columns":["Maryland","Not yet determined","","","",""]},{"columns":["Massachusetts","Not yet determined","","","",""]},{"columns":["Michigan","Not yet determined","","","",""]},{"columns":["Minnesota","Not yet determined","","","",""]},{"columns":["Mississippi ","Varies by district","","","",""]},{"columns":["Missouri","Varies by district","","","",""]},{"columns":["Montana","Varies by district","","","",""]},{"columns":["Nebraska","Varies by district","","","",""]},{"columns":["Nevada","Varies by district","","","",""]},{"columns":["New Hampshire","Not yet determined","","","",""]},{"columns":["New Jersey","Varies by district","","","",""]},{"columns":["New Mexico","Unknown","","","",""]},{"columns":["New York","Not yet determined","","","",""]},{"columns":["North Carolina","8/17/2020","","","",""]},{"columns":["North Dakota","Varies by district","","","",""]},{"columns":["Northern Marianas","Unknown","","","",""]},{"columns":["Ohio","Not yet determined","","","",""]},{"columns":["Oklahoma","Varies by district","","","",""]},{"columns":["Oregon","Not yet determined","","","",""]},{"columns":["Pennsylvania","Varies by district","","","",""]},{"columns":["Puerto Rico","Unknown","","","",""]},{"columns":["Rhode Island","Not yet determined","","","",""]},{"columns":["South Carolina","Not yet determined","","","",""]},{"columns":["South Dakota","Varies by district","","","",""]},{"columns":["Tennessee","Varies by district","","","",""]},{"columns":["Texas","Varies by district","","","",""]},{"columns":["U.S. Virgin Islands\n ","Not yet determined","","","",""]},{"columns":["Utah","Varies by district","","","",""]},{"columns":["Vermont","Not yet determined","","","",""]},{"columns":["Virginia","Not yet determined","","","",""]},{"columns":["Washington","Varies by District","","","",""]},{"columns":["West Virginia","Not yet determined","","","",""]},{"columns":["Wisconsin","Varies by district","","","",""]},{"columns":["Wyoming","Not yet determined","","","",""]}]};
Any help would be greatly appreciated!
You don't need to execute JavaScript. It can be done with the json and re modules.
For example:
import re
import json
import requests
url = 'https://flo.uri.sh/visualisation/2451841/embed?auto=1'
html_data = requests.get(url).text
data = re.search(r'_Flourish_data = (\{.*?\});', html_data).group(1)
data = json.loads(data)
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for row in data['rows']:
    print('{:<55}{}'.format(*map(str.strip, row['columns'][:2])))
Prints:
Alabama Varies by district
Alaska Varies by district
American Samoa Unknown
Arizona Varies by district
Arkansas Varies by district
Bureau of Indian Education Varies by district
California Varies by district
Colorado Varies by district
Connecticut Not yet determined
Delaware Varies by district
Department of Defense Education Activity Varies by district
District of Columbia 8/31/2020
Florida Unknown
Georgia Unknown
Guam Unknown
Hawaii Not yet determined
Idaho Varies by District
Illinois Varies by district
Indiana Not yet determined
Iowa Varies by district
Kansas Not yet determined
Kentucky Unknown
Louisiana Varies by district
Maine Varies by district
Maryland Not yet determined
Massachusetts Not yet determined
Michigan Not yet determined
Minnesota Not yet determined
Mississippi Varies by district
Missouri Varies by district
Montana Varies by district
Nebraska Varies by district
Nevada Varies by district
New Hampshire Not yet determined
New Jersey Varies by district
New Mexico Unknown
New York Not yet determined
North Carolina 8/17/2020
North Dakota Varies by district
Northern Marianas Unknown
Ohio Not yet determined
Oklahoma Varies by district
Oregon Not yet determined
Pennsylvania Varies by district
Puerto Rico Unknown
Rhode Island Not yet determined
South Carolina Not yet determined
South Dakota Varies by district
Tennessee Varies by district
Texas Varies by district
U.S. Virgin Islands Not yet determined
Utah Varies by district
Vermont Not yet determined
Virginia Not yet determined
Washington Varies by District
West Virginia Not yet determined
Wisconsin Varies by district
Wyoming Not yet determined
import requests
import re
import json
def main(url):
    r = requests.get(url)
    match = json.loads(re.search(r'_Flourish_data = ({.*})', r.text).group(1))
    print(match.keys())

main("https://flo.uri.sh/visualisation/2451841/embed?auto=1")

Scraping content with python and selenium

I would like to extract all the league names (e.g. England Premier League, Scotland Premiership, etc.) from this website https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1
Taking the inspector tools from Chrome/Firefox I can see that they are located here:
<span>England Premier League</span>
So I tried this
from lxml import html
from selenium import webdriver
session = webdriver.Firefox()
url = 'https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1'
session.get(url)
tree = html.fromstring(session.page_source)
leagues = tree.xpath('//span/text()')
print(leagues)
Unfortunately this doesn't return the desired results :-(
To me it looks like the website has different frames and I'm extracting the content from the wrong frame.
Could anyone please help me out here or point me in the right direction? As an alternative if someone knows how to extract the information through their api then this would obviously be the superior solution.
Any help is much appreciated. Thank you!
Hope you are looking for something like this:
from selenium import webdriver
import bs4, time
driver = webdriver.Chrome()
url = 'https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1'
driver.get(url)
driver.maximize_window()
# sleep is given so that JS populate data in this time
time.sleep(10)
pSource = driver.page_source
soup = bs4.BeautifulSoup(pSource, "html.parser")
for data in soup.findAll('div', {'class': 'eventWrapper'}):
    for res in data.find_all('span'):
        print(res.text)
It will print the below data:
Wednesday's Matches
International List
Elite Euro List
UK List
Australia List
Club Friendly List
England Premier League
England EFL Cup
England Championship
England League 1
England League 2
England National League
England National League North
England National League South
Scotland Premiership
Scotland League Cup
Scotland Championship
Scotland League One
Scotland League Two
Northern Ireland Reserve League
Scotland Development League East
Wales Premier League
Wales Cymru Alliance
Asia - World Cup Qualifying
UEFA Champions League
UEFA Europa League
Wednesday's Matches
International List
Elite Euro List
UK List
Australia List
Club Friendly List
England Premier League
England EFL Cup
England Championship
England League 1
England League 2
England National League
England National League North
England National League South
Scotland Premiership
Scotland League Cup
Scotland Championship
Scotland League One
Scotland League Two
Northern Ireland Reserve League
Scotland Development League East
Wales Premier League
Wales Cymru Alliance
Asia - World Cup Qualifying
UEFA Champions League
UEFA Europa League
The only problem is that it prints the result set twice.
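One way to avoid the doubled output is to collect the text first and then drop duplicates while preserving order; a small sketch based on the loop above:
leagues = [res.text
           for data in soup.findAll('div', {'class': 'eventWrapper'})
           for res in data.find_all('span')]
for league in dict.fromkeys(leagues):  # keeps first occurrences, preserves order
    print(league)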
The required content is absent from the initial page source. It comes dynamically from https://mobile.bet365.com/V6/sport/splash/splash.aspx?zone=0&isocode=RO&tzi=4&key=1&gn=0&cid=1&lng=1&ctg=1&ct=156&clt=8881&ot=2
To get this content you can use an explicit wait (WebDriverWait), as below:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
session = webdriver.Firefox()
url = 'https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1'
session.get(url)
WebDriverWait(session, 10).until(EC.presence_of_element_located((By.ID, 'Splash')))
for collapsed in session.find_elements_by_xpath('//h3[contains(@class, "collapsed")]'):
    collapsed.location_once_scrolled_into_view
    collapsed.click()
for event in session.find_elements_by_xpath('//div[contains(@class, "eventWrapper")]//span'):
    print(event.text)
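Alternatively, since the league list comes from that splash URL, it may be possible to skip the browser entirely and request it directly. This is only a sketch, assuming the endpoint returns plain HTML containing the same eventWrapper/span structure (the query-string parameters and headers may need adjusting):
import requests
from bs4 import BeautifulSoup

splash_url = ('https://mobile.bet365.com/V6/sport/splash/splash.aspx'
              '?zone=0&isocode=RO&tzi=4&key=1&gn=0&cid=1&lng=1&ctg=1&ct=156&clt=8881&ot=2')
headers = {'User-Agent': 'Mozilla/5.0'}

html = requests.get(splash_url, headers=headers).text
soup = BeautifulSoup(html, 'html.parser')
for span in soup.select('div.eventWrapper span'):
    print(span.get_text(strip=True))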

Want to store variable names in list, not said variable's contents

Sorry if the title is confusing; let me explain.
So, I've written a program that categorizes emails by topic using nltk and tools from sklearn.
Here is that code:
#Extract Emails
tech = extract_message("C:\\Users\\Cody\\Documents\\Emails\\tech.html")
gary = extract_message("C:\\Users\\Cody\\Documents\\Emails\\gary.html")
gary2 = extract_message("C:\\Users\\Cody\\Documents\\Emails\\gary2.html")
jesus = extract_message("C:\\Users\\Cody\\Documents\\Emails\\Jesus.html")
jesus2 = extract_message("C:\\Users\\Cody\\Documents\\Emails\\jesus2.html")
hockey = extract_message("C:\\Users\\Cody\\Documents\\Emails\\hockey.html")
hockey2 = extract_message("C:\\Users\\Cody\\Documents\\Emails\\hockey2.html")
shop = extract_message("C:\\Users\\Cody\\Documents\\Emails\\shop.html")
#Build dictionary of features
count_vect = CountVectorizer()
x_train_counts = count_vect.fit_transform(news.data)
#Downscaling
tfidf_transformer = TfidfTransformer()
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)
tf_transformer = TfidfTransformer(use_idf=False).fit(x_train_counts)
x_train_tf = tf_transformer.transform(x_train_counts)
#Train classifier
clf = MultinomialNB().fit(x_train_tfidf, news.target)
#List of the extracted emails
docs_new = [gary, gary2, jesus, jesus2, shop, tech, hockey, hockey2]
#Extract features from emails
x_new_counts = count_vect.transform(docs_new)
x_new_tfidf = tfidf_transformer.transform(x_new_counts)
#Predict the categories for each email
predicted = clf.predict(x_new_tfidf)
Now I'm looking to store each variable in an appropriate list, based off of the predicted label. I figured I could do that doing this:
#Store Files in a category
hockey_emails = []
computer_emails = []
politics_emails = []
tech_emails = []
religion_emails = []
forsale_emails = []
#Print out results and store each email in the appropriate category list
for doc, category in zip(docs_new, predicted):
    print('%r ---> %s' % (doc, news.target_names[category]))
    if(news.target_names[category] == 'comp.sys.ibm.pc.hardware'):
        computer_emails.append(doc)
    if(news.target_names[category] == 'rec.sport.hockey'):
        hockey_emails.append(doc)
    if(news.target_names[category] == 'talk.politics.misc'):
        politics_emails.append(doc)
    if(news.target_names[category] == 'soc.religion.christian'):
        religion_emails.append(doc)
    if(news.target_names[category] == 'misc.forsale'):
        forsale_emails.append(doc)
My output if I were to print out one of these lists, let's say hockey for instance, displays the contents stored in the variable rather than the variable itself.
I want this:
print(hockey_emails)
output: ['hockey', 'hockey2']
but instead I'm getting this:
output: ['View View online click here Hi Thanks for signing up as a EA SPORTS NHL insider You ll now receive all of the latest and greatest news and info at this e mail address as you ve requested EA com If you need technical assistance please contact EA Help Privacy Policy Our Certified Online Privacy Policy gives you confidence whenever you play EA games To view our complete Privacy and Cookie Policy go to privacy ea com or write to Privacy Policy Administrator Electronic Arts Inc Redwood Shores Parkway Redwood City CA Electronic Arts Inc All Rights Reserved Privacy Policy User Agreement Legal ActionsMark as UnreadMark as ReadMark as SpamStarClear StarArchive Previous Next ', 'View News From The Hockey Writers The Editor s Choice stories from The Hockey Writers View this email in your browser edition Recap Stars Steamroll Predators By Matt Pryor on Dec am As the old Mary Chapin Carpenter song goes Sometimes you re the windshield Sometimes you re the bug It hasn t happened very often this season but the Dallas Stars had a windshield Continue Reading A Review of Years in Blue and White Damien Cox One on One By Anthony Fusco on Dec pm The Toronto Maple Leafs are one of the most storied and iconic franchises in the entire National Hockey League They have a century of history that spans all the way back to the early s When you have an Continue Reading Bruins Will Not Miss Beleskey By Kyle Benson on Dec am On Monday it was announced that Matt Beleskey will miss the next six weeks due to a knee injury he sustained over the weekend in a game against the Buffalo Sabres Six weeks is a long stint to be without a potential top Continue Reading Recent Articles Galchenyuk Injury Costly for CanadiensFacing Off Picking Team Canada for World JuniorsAre Johnson s Nomadic Days Over Share Tweet Forward Latest News Prospects Anaheim Ducks Arizona Coyotes Boston Bruins Buffalo Sabres Calgary Flames Carolina Hurricanes Chicago Blackhawks Colorado Avalanche Columbus Blue Jackets Dallas Stars Detroit Red Wings Edmonton Oilers Florida Panthers Los Angeles Kings Minnesota Wild Montreal Canadiens Nashville Predators New Jersey Devils New York Islanders New York Rangers Philadelphia Flyers Pittsburgh Penguins Ottawa Senators San Jose Sharks St Louis Blues Tampa Bay Lightning Toronto Maple Leafs Vancouver Canucks Washington Capitals Winnipeg Jets Copyright The Hockey Writers All rights reserved You are receiving this email because you opted in at The Hockey Writers or one of our Network Sites Our mailing address is The Hockey Writers Victoria Ave St Lambert QC J R R CanadaAdd us to your address book unsubscribe from this list update subscription preferences ActionsMark as UnreadMark as ReadMark as SpamStarClear StarArchive Previous Next ']
I figured this would be simple, but I'm sitting here scratching my head. Is this even possible? Should I use something else instead of a list? This is probably simple; I'm just blanking.
You have to keep track of the names yourself; Python won't do it for you.
names = 'gary gary2 Jesus jesus2 shop tech hockey hockey2'.split()
docs_new = [extract_message("C:\\Users\\Cody\\Documents\\Emails\\%s.html" % name)
            for name in names]

for name, category in zip(names, predicted):
    print('%r ---> %s' % (name, news.target_names[category]))
    if (news.target_names[category] == 'comp.sys.ibm.pc.hardware'):
        computer_emails.append(name)
Don't do this. Use a dictionary to hold your collection of emails, and you can print the dictionary keys when you want to know what is what.
docs_new = dict()
docs_new["tech"] = extract_message("C:\\Users\\Cody\\Documents\\Emails\\tech.html")
docs_new["gary"] = extract_message("C:\\Users\\Cody\\Documents\\Emails\\gary.html")
etc.
When you iterate over the dictionary, you'll see the keys.
for doc, category in zip(docs_new, predicted):
    print('%s ---> %s' % (doc, news.target_names[category]))
(More dictionary basics: To iterate over dict values, replace docs_new above with docs_new.values(); or use docs_new.items() for both keys and values.)
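Putting the dictionary suggestion together with the classifier code from the question, a minimal sketch (assuming Python 3.7+ so the dict preserves insertion order, and reusing count_vect, tfidf_transformer, clf and news from above) could look like this:
from collections import defaultdict

# docs_new is the dict built above: {"tech": "...", "gary": "...", ...}
x_new_counts = count_vect.transform(list(docs_new.values()))
x_new_tfidf = tfidf_transformer.transform(x_new_counts)
predicted = clf.predict(x_new_tfidf)

# Group the names (dict keys) by predicted category.
emails_by_category = defaultdict(list)
for name, category in zip(docs_new, predicted):
    emails_by_category[news.target_names[category]].append(name)

print(emails_by_category['rec.sport.hockey'])  # e.g. ['hockey', 'hockey2']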
