Web scraping just printing "[]"? - python
I am attempting to get the names and prices of the listings on a cruise website.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.ncl.com/vacations?cruise-destination=transatlantic'
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
names = soup.find_all('h2', class_='headline -medium -small-xs -variant-1')
prices = soup.find_all('span', class_='headline-3 -variant-1')
print(names)
print(prices)
This just ends up printing empty brackets ([]) for both.
BeautifulSoup can only see HTML elements which exist in the HTML document at the time the document is served to you from the server. It cannot see elements in the DOM which normally would be populated/created asynchronously using JavaScript (by a browser).
The page you're trying to scrape is of the second kind: The HTML document the server served to you at the time you requested it only contains the "barebones" scaffolding of the page, which, if you're viewing the page in a browser, will be populated at a later point in time via JavaScript. This is typically achieved by the browser by making additional requests to other resources/APIs, whose response contains the information with which to populate the page.
BeautifulSoup is not a browser. It's just an HTML/XML parser. You made a single request to a (mostly empty) template HTML. You can expect BeautifulSoup NOT to work for any "fancy" pages - if you see a spinning "loading" graphic, you should immediately think "this page is populated asynchronously using JavaScript and BeautifulSoup won't work for this".
There are cases where the information you're trying to scrape is actually embedded somewhere in the HTML at the time the server serves it to you - possibly in a <script> tag - and the browser is then expected to use JavaScript to make that data presentable. In such a case, BeautifulSoup would be able to see the data - that's a separate matter, though.
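For illustration only (this is not the situation with the NCL page, but it comes up often), extracting data that is embedded in a <script> tag usually boils down to a regex plus json.loads. This is a sketch assuming a hypothetical page whose HTML contains var initialData = {...}; - both the URL and the variable name are made up:

import re
import json
import requests

# Hypothetical page that embeds its data in a <script> tag as: var initialData = {...};
html = requests.get("https://example.com/some-page").text

match = re.search(r"var initialData = (\{.*?\});", html, flags=re.DOTALL)
if match:
    data = json.loads(match.group(1))  # now a regular Python dict
    print(data.keys())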
In your case, one solution would be to view the page in a browser and log your network traffic. Doing this reveals that, once the page loads, an XHR HTTP GET request is made to a REST API endpoint, whose JSON response contains all the information you're trying to scrape. The trick then is to imitate that request: copy the endpoint URL (including query-string parameters) and any necessary request headers, plus the payload if it were a POST request (in this case, it isn't).
Inspecting the response gives us further clues on how to write our script: The JSON response contains ALL itineraries, even ones we aren't interested in (such as non-transatlantic trips). This means that, normally, the browser must run some JavaScript to filter the itineraries - this happens client-side, not server-side. Therefore, our script will have to perform the same kind of filtering.
def get_itineraries():
    import requests

    url = "https://www.ncl.com/fr/en/api/vacations/v1/itineraries"
    params = {
        "guests": "2",
        "v": "1414764913-1626184979267"
    }
    headers = {
        "accept": "application/json",
        "accept-encoding": "gzip, deflate",
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()

    def predicate(itinerary):
        return any(dest["code"] == "TRANSATLANTIC" for dest in itinerary["destination"])

    yield from filter(predicate, response.json()["itineraries"])

def main():
    from itertools import islice

    def get_cheapest_price(itinerary):
        def get_valid_option(sailing):
            def predicate(option):
                return "combinedPrice" in option
            return next(filter(predicate, sailing["pricing"]))
        return min(get_valid_option(sailing)["combinedPrice"] for sailing in itinerary["sailings"])

    itineraries = list(islice(get_itineraries(), 50))
    prices = map(get_cheapest_price, itineraries)

    for itinerary, price in sorted(zip(itineraries, prices), key=lambda tpl: tpl[1]):
        print("[{}{}] - {}".format(itinerary["currency"]["symbol"], price, itinerary["title"]["fullTitle"]))

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
[€983] - 12-Day Transatlantic From London To New York: Spain & Bermuda
[€984] - 11-Day Transatlantic from Miami to Barcelona: Ponta Delgada, Azores
[€1024] - 15-Day Transatlantic from Rio de Janeiro to Barcelona: Spain & Brazil
[€1177] - 15-Day Transatlantic from Rome to New York: Italy, France & Spain
[€1190] - 14-Day Transatlantic from Barcelona to New York: Spain & Bermuda
[€1234] - 14-Day Transatlantic from Lisbon to Rio de Janeiro: Spain & Brazil
[€1254] - 11-Day Europe from Rome to London: Italy, France, Spain & Portugal
[€1271] - 15-Day Transatlantic From New York to Rome: Italy, France & Spain
[€1274] - 15-Day Transatlantic from New York to Barcelona: Spain & Bermuda
[€1296] - 13-Day Transatlantic From New York to London: France & Ireland
[€1411] - 17-Day Transatlantic from Rome to Miami: Italy, France & Spain
[€1420] - 15-Day Transatlantic From New York to Barcelona: France & Spain
[€1438] - 16-Day Transatlantic from Rome to New York: Italy, France & Spain
[€1459] - 15-Day Transatlantic from Barcelona to Tampa: Bahamas, Spain & Bermuda
[€1473] - 11-Day Transatlantic from New York to Reykjavik: Halifax & Akureyri
[€1486] - 16-Day Transatlantic from Rome to New York: Italy, France & Spain
[€1527] - 15-Day Transatlantic from New York to Rome: Italy, France & Spain
[€1529] - 14-Day Transatlantic From New York to London: France & Ireland
[€1580] - 16-day Transatlantic From Barcelona to New York: Spain & Bermuda
[€1595] - 16-Day Transatlantic From New York to Rome: Italy, France & Spain
[€1675] - 16-Day Transatlantic from New York to Rome: Italy, France & Spain
[€1776] - 14-Day Transatlantic from New York to London: England & Ireland
[€1862] - 12-Day Transatlantic From London to New York: Scotland & Iceland
[€2012] - 15-Day Transatlantic from New York to Barcelona: Spain & Bermuda
[€2552] - 14-Day Transatlantic from New York to London: England & Ireland
[€2684] - 16-Day Transatlantic from New York to London: France & Ireland
[€3460] - 16-Day Transatlantic from New York to London: France & Ireland
For more information on logging your browser's network traffic, finding REST API endpoints (if they exist), and imitating requests, take a look at this other answer I posted to a similar question.
Related
Scraping a HTML site using BeautifulSoup and finding the value of "total_pages" in it
I'm writing Python code that scrapes the following website and looks for the value of "total_pages" in it: https://www.usnews.com/best-colleges/fl

When I open the website in a browser and inspect the source code, the value of "total_pages" is 8. I want my Python code to be able to get the same value. I have written the following code:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate',
           'Accept': '*/*', 'Connection': 'keep-alive'}
main_site = requests.get("https://www.usnews.com/best-colleges/fl", headers=headers)
main_site_content = main_site.content
main_site_content_soup = BeautifulSoup(main_site_content, "html.parser")

But then I get stuck on how to look for "total_pages" in the parsed data. I have tried the find_all() method, but with no luck; I think I'm not using it correctly. One note: the solution does not have to use BeautifulSoup. I just used BeautifulSoup because I was a bit familiar with it.
No need for BeautifulSoup. Here I make a request to their API to get the list of universities. from rich import print is used to pretty-print the JSON; it should make it easier to read. Need more help or advice? Leave a comment below.

import requests
import pandas as pd
from rich import print

LINK = "https://www.usnews.com/best-colleges/api/search?format=json&location=Florida&_sort=rank&_sortDirection=asc&_page=1"

def get_data(url):
    print("Making request to:", url)
    response = requests.get(url, timeout=5, headers={"User-Agent": "Mozilla/5.0"})
    if response.status_code == 200:
        print("Request Successful!")
        data = response.json()["data"]
        return data["items"], data["next_link"]
    print("Request failed!")
    return None, None

def main():
    print("Starting Scraping...")
    items, next_link = get_data(LINK)
    # if there's a `next_link`, scrape it too
    while next_link is not None:
        print("Getting data from:", next_link)
        new_items, next_link = get_data(next_link)
        items += new_items
    # clean the data for the pandas DataFrame
    items = [
        {
            "name": item["institution"]["displayName"],
            "state": item["institution"]["state"],
            "city": item["institution"]["city"],
        }
        for item in items
    ]
    df = pd.DataFrame(items)
    print(df.to_markdown())

if __name__ == "__main__":
    main()

The output looks like this:

    name                                 state  city
 0  University of Florida               FL     Gainesville
 1  Florida State University            FL     Tallahassee
 2  University of Miami                 FL     Coral Gables
 3  University of South Florida         FL     Tampa
 4  University of Central Florida       FL     Orlando
 5  Florida International University    FL     Miami
 6  Florida A&M University              FL     Tallahassee
 7  Florida Institute of Technology     FL     Melbourne
 8  Nova Southeastern University        FL     Ft. Lauderdale
 ...                                    ...    ...
 74 St. John Vianney College Seminary   FL     Miami
 75 St. Petersburg College              FL     St. Petersburg
 76 Tallahassee Community College       FL     Tallahassee
 77 Valencia College                    FL     Orlando
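The question itself only asked for the value of "total_pages". If that is all you need, one option (a sketch, assuming the value really is embedded verbatim in the served HTML, as the question suggests) is a regex over the raw response instead of walking the parse tree:

import re
import requests

headers = {"User-Agent": "Mozilla/5.0"}
html = requests.get("https://www.usnews.com/best-colleges/fl", headers=headers, timeout=5).text

# Look for something like "total_pages": 8 anywhere in the page source.
match = re.search(r'"total_pages"\s*:\s*(\d+)', html)
if match:
    print(int(match.group(1)))  # expected to print 8 if the value is embedded server-side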
How to scrape hidden class data using selenium and beautiful soup
I'm trying to scrape content from a JavaScript-enabled web page. I need to extract the data in the table on that website. However, each row of the table has a button (an arrow) that reveals additional information about that row. I need to extract that additional description for each row. By inspecting, I can see that the contents behind the arrows of each row belong to the same class. However, that class is hidden in the source code; it can only be observed while inspecting. The data I'm trying to parse is from the webpage. I have used Selenium and Beautiful Soup. I'm able to scrape the data of the table, but not the content behind those arrows. Python returns an empty list for the class of the arrow content, but it works for the class of the normal table data.

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://projects.sfchronicle.com/2020/layoff-tracker/')

html_source = browser.page_source
soup = BeautifulSoup(html_source, 'html.parser')

data = soup.find_all('div', class_="sc-fzoLsD jxXBhc rdt_ExpanderRow")
print(data.text)
To print the hidden data, you can use this example:

import re
import json
import requests
from bs4 import BeautifulSoup

url = 'https://projects.sfchronicle.com/2020/layoff-tracker/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

data_url = 'https://projects.sfchronicle.com' + soup.select_one('link[href*="commons-"]')['href']
data = re.findall(r'n\.exports=JSON\.parse\(\'(.*?)\'\)', requests.get(data_url).text)[1]
data = json.loads(data.replace(r"\'", "'"))

# uncomment this to see all data:
# print(json.dumps(data, indent=4))

for d in data[4:]:
    print('{:<50}{:<10}{:<30}{:<30}{:<30}{:<30}{:<30}'.format(*d.values()))

Prints:

Company                                       Layoffs  City           County                 Month  Industry       Company description
Tesla (Temporary layoffs. Factory reopened)   11083    Fremont        Alameda County         April  Industrial     Car maker
Bon Appetit Management Co.                    3015     San Francisco  San Francisco County   April  Food           Food supplier
GSW Arena LLC-Chase Center                    1720     San Francisco  San Francisco County   May    Sports         Arena vendors
YMCA of Silicon Valley                        1657     Santa Clara    Santa Clara County     May    Sports         Gym
Nutanix Inc. (Temporary furlough of 2 weeks)  1434     San Jose       Santa Clara County     April  Tech           Cloud computing
TeamSanJose                                   1304     San Jose       Santa Clara County     April  Travel         Tourism bureau
San Francisco Giants                          1200     San Francisco  San Francisco County   April  Sports         Stadium vendors
Lyft                                          982      San Francisco  San Francisco County   April  Tech           Ride hailing
YMCA of San Francisco                         959      San Francisco  San Francisco County   May    Sports         Gym
Hilton San Francisco Union Square             923      San Francisco  San Francisco County   April  Travel         Hotel
Six Flags Discovery Kingdom                   911      Vallejo        Solano County          June   Entertainment  Amusement park
San Francisco Marriott Marquis                808      San Francisco  San Francisco County   April  Travel         Hotel
Aramark                                       777      Oakland        Alameda County         April  Food           Food supplier
The Palace Hotel                              774      San Francisco  San Francisco County   April  Travel         Hotel
Back of the House Inc                         743      San Francisco  San Francisco County   April  Food           Restaurant
DPR Construction                              715      Redwood City   San Mateo County       April  Real estate    Construction

...and so on.
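As a small follow-up, the parsed data is just a list of dicts, so you can filter it in Python instead of only printing it. This is a sketch; it reuses the data variable from the snippet above and only relies on the row values, since the exact key names in the bundle aren't shown here:

# Keep only the San Francisco County rows (data[4:] skips the same leading
# entries the snippet above skips).
sf_rows = [d for d in data[4:] if "San Francisco County" in d.values()]
print(len(sf_rows))
for d in sf_rows[:5]:
    print(*d.values())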
The content you are interested in is generated when you click a button, so you want to locate the button first. There are a million ways you could do this, but I would suggest something like:

element = driver.find_element(By.XPATH, '//button')

For your specific case you could also use:

element = driver.find_element(By.CSS_SELECTOR, 'button[class|="sc"]')

Once you have the button element, you can then do:

element.click()

Parsing the page after this should get you the JavaScript-generated content you are looking for.
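Putting those pieces together with the original BeautifulSoup approach might look roughly like this (a sketch: the button selector and the fixed sleep are assumptions and may need tuning for this page):

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('https://projects.sfchronicle.com/2020/layoff-tracker/')
time.sleep(5)  # crude wait for the table to render; an explicit wait would be more robust

# Click every expander arrow so the hidden rows get created in the DOM.
for button in driver.find_elements(By.CSS_SELECTOR, 'button[class|="sc"]'):
    try:
        button.click()
    except Exception:
        pass  # some matched buttons may not be clickable

# Only now can BeautifulSoup see the expanded rows in the page source.
soup = BeautifulSoup(driver.page_source, 'html.parser')
for row in soup.find_all('div', class_="sc-fzoLsD jxXBhc rdt_ExpanderRow"):
    print(row.get_text(strip=True))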
Extracting JavaScript Variables into Python Dictionaries
Understanding that I have to use PyQt5 in conjunction with BeautifulSoup to run the JavaScript on my client after extracting the HTML using BeautifulSoup, I am trying to convert the variable _Flourish_data into a Python dictionary. Is there an easy way to extract the JavaScript variable _Flourish_data into a Python dictionary? Here is my current Python code to extract the JavaScript using PyQt5 and BeautifulSoup:

import bs4 as bs
import sys
import urllib.request
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl

class Page(QWebEnginePage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def _on_load_finished(self):
        self.html = self.toHtml(self.Callable)

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()

page = Page('https://flo.uri.sh/visualisation/2451841/embed?auto=1')
soup = bs.BeautifulSoup(page.html, 'html.parser')
js_test = soup.find_all('script')
js_test[5]

The output of the existing code is the following <script> tag (abridged here, since the full data literal lists every US state and territory):

<script>
function _Flourish_unflattenInto(dest, src) { ... }

var _Flourish_settings = {"cell_fill_1":"#ffffff","cell_fill_2":"#ebebeb","cell_fill_direction":"horizontal", ... ,"search_resize":true,"search_width":15};
_Flourish_unflattenInto(window.template.state, _Flourish_settings);

var _Flourish_data_column_names = {"rows":{"columns":["State ","Earliest/Planned Start Date for 20/21 Academic Year ","","","",""]}},
_Flourish_data = {"rows":[{"columns":["Alabama","Varies by district","","","",""]},{"columns":["Alaska","Varies by district","","","",""]},{"columns":["American Samoa","Unknown","","","",""]}, ... ,{"columns":["Wisconsin","Varies by district","","","",""]},{"columns":["Wyoming","Not yet determined","","","",""]}]};

for (var _Flourish_dataset in _Flourish_data) {
    window.template.data[_Flourish_dataset] = _Flourish_data[_Flourish_dataset];
    window.template.data[_Flourish_dataset].column_names = _Flourish_data_column_names[_Flourish_dataset];
}
window.template.draw();
</script>

I just want var _Flourish_data from the <script> tag, as shown below (again abridged):

_Flourish_data = {"rows":[{"columns":["Alabama","Varies by district","","","",""]},{"columns":["Alaska","Varies by district","","","",""]}, ... ,{"columns":["Wyoming","Not yet determined","","","",""]}]};

Any help would be greatly appreciated!
You don't need to execute JavaScript. It can be done with the json and re modules. For example:

import re
import json
import requests

url = 'https://flo.uri.sh/visualisation/2451841/embed?auto=1'
html_data = requests.get(url).text

data = re.search(r'_Flourish_data = (\{.*?\});', html_data).group(1)
data = json.loads(data)

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for row in data['rows']:
    print('{:<55}{}'.format(*map(str.strip, row['columns'][:2])))

Prints:

Alabama                                     Varies by district
Alaska                                      Varies by district
American Samoa                              Unknown
Arizona                                     Varies by district
Arkansas                                    Varies by district
Bureau of Indian Education                  Varies by district
California                                  Varies by district
Colorado                                    Varies by district
Connecticut                                 Not yet determined
Delaware                                    Varies by district
Department of Defense Education Activity    Varies by district
District of Columbia                        8/31/2020
Florida                                     Unknown
Georgia                                     Unknown
Guam                                        Unknown
Hawaii                                      Not yet determined
Idaho                                       Varies by District
Illinois                                    Varies by district
Indiana                                     Not yet determined
Iowa                                        Varies by district
Kansas                                      Not yet determined
Kentucky                                    Unknown
Louisiana                                   Varies by district
Maine                                       Varies by district
Maryland                                    Not yet determined
Massachusetts                               Not yet determined
Michigan                                    Not yet determined
Minnesota                                   Not yet determined
Mississippi                                 Varies by district
Missouri                                    Varies by district
Montana                                     Varies by district
Nebraska                                    Varies by district
Nevada                                      Varies by district
New Hampshire                               Not yet determined
New Jersey                                  Varies by district
New Mexico                                  Unknown
New York                                    Not yet determined
North Carolina                              8/17/2020
North Dakota                                Varies by district
Northern Marianas                           Unknown
Ohio                                        Not yet determined
Oklahoma                                    Varies by district
Oregon                                      Not yet determined
Pennsylvania                                Varies by district
Puerto Rico                                 Unknown
Rhode Island                                Not yet determined
South Carolina                              Not yet determined
South Dakota                                Varies by district
Tennessee                                   Varies by district
Texas                                       Varies by district
U.S. Virgin Islands                         Not yet determined
Utah                                        Varies by district
Vermont                                     Not yet determined
Virginia                                    Not yet determined
Washington                                  Varies by District
West Virginia                               Not yet determined
Wisconsin                                   Varies by district
Wyoming                                     Not yet determined
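As a follow-up, the parsed rows convert naturally into a plain dictionary keyed by state (this reuses the data variable from the snippet above):

# Build {state: planned start date} from the parsed rows.
start_dates = {row['columns'][0].strip(): row['columns'][1].strip() for row in data['rows']}
print(start_dates['North Carolina'])  # 8/17/2020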
import requests
import re
import json

def main(url):
    r = requests.get(url)
    match = json.loads(re.search(r'_Flourish_data = ({.*})', r.text).group(1))
    print(match.keys())

main("https://flo.uri.sh/visualisation/2451841/embed?auto=1")
Scraping content with python and selenium
I would like to extract all the league names (e.g. England Premier League, Scotland Premiership, etc.) from this website: https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1

Using the inspector tools in Chrome/Firefox I can see that they are located here:

<span>England Premier League</span>

So I tried this:

from lxml import html
from selenium import webdriver

session = webdriver.Firefox()
url = 'https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1'
session.get(url)

tree = html.fromstring(session.page_source)
leagues = tree.xpath('//span/text()')
print(leagues)

Unfortunately this doesn't return the desired results :-( To me it looks like the website has different frames and I'm extracting the content from the wrong frame. Could anyone please help me out here or point me in the right direction? As an alternative, if someone knows how to extract the information through their API, that would obviously be the superior solution. Any help is much appreciated. Thank you!
Hope you are looking for something like this:

from selenium import webdriver
import bs4, time

driver = webdriver.Chrome()
url = 'https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1'
driver.get(url)
driver.maximize_window()

# sleep is given so that JS can populate the data in this time
time.sleep(10)

pSource = driver.page_source
soup = bs4.BeautifulSoup(pSource, "html.parser")

for data in soup.findAll('div', {'class': 'eventWrapper'}):
    for res in data.find_all('span'):
        print(res.text)

It will print the below data:

Wednesday's Matches
International List
Elite Euro List
UK List
Australia List
Club Friendly List
England Premier League
England EFL Cup
England Championship
England League 1
England League 2
England National League
England National League North
England National League South
Scotland Premiership
Scotland League Cup
Scotland Championship
Scotland League One
Scotland League Two
Northern Ireland Reserve League
Scotland Development League East
Wales Premier League
Wales Cymru Alliance
Asia - World Cup Qualifying
UEFA Champions League
UEFA Europa League
Wednesday's Matches
International List
...
UEFA Champions League
UEFA Europa League

The only problem is that it prints the result set twice.
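The duplication presumably just means the same list of leagues appears twice in the page markup. If you want each league only once while keeping the original order, a dict works as an ordered set (this reuses the soup object from the snippet above):

# Collect all league names, then drop duplicates while preserving order.
leagues = [res.text for data in soup.findAll('div', {'class': 'eventWrapper'})
           for res in data.find_all('span')]
for league in dict.fromkeys(leagues):
    print(league)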
The required content is absent from the initial page source. It comes dynamically from https://mobile.bet365.com/V6/sport/splash/splash.aspx?zone=0&isocode=RO&tzi=4&key=1&gn=0&cid=1&lng=1&ctg=1&ct=156&clt=8881&ot=2

To be able to get this content you can use an explicit wait (WebDriverWait), as below:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver

session = webdriver.Firefox()
url = 'https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1'
session.get(url)

WebDriverWait(session, 10).until(EC.presence_of_element_located((By.ID, 'Splash')))

for collapsed in session.find_elements_by_xpath('//h3[contains(@class, "collapsed")]'):
    collapsed.location_once_scrolled_into_view
    collapsed.click()

for event in session.find_elements_by_xpath('//div[contains(@class, "eventWrapper")]//span'):
    print(event.text)
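One note if you are running this on a current Selenium release: the find_elements_by_xpath helpers were deprecated and later removed in Selenium 4, so the equivalent calls now look like this:

from selenium.webdriver.common.by import By

for collapsed in session.find_elements(By.XPATH, '//h3[contains(@class, "collapsed")]'):
    collapsed.location_once_scrolled_into_view
    collapsed.click()

for event in session.find_elements(By.XPATH, '//div[contains(@class, "eventWrapper")]//span'):
    print(event.text)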
Want to store variable names in list, not said variable's contents
Sorry if the title is confusing; let me explain. I've written a program that categorizes emails by topic using nltk and tools from sklearn. Here is that code:

# Extract emails
tech = extract_message("C:\\Users\\Cody\\Documents\\Emails\\tech.html")
gary = extract_message("C:\\Users\\Cody\\Documents\\Emails\\gary.html")
gary2 = extract_message("C:\\Users\\Cody\\Documents\\Emails\\gary2.html")
jesus = extract_message("C:\\Users\\Cody\\Documents\\Emails\\Jesus.html")
jesus2 = extract_message("C:\\Users\\Cody\\Documents\\Emails\\jesus2.html")
hockey = extract_message("C:\\Users\\Cody\\Documents\\Emails\\hockey.html")
hockey2 = extract_message("C:\\Users\\Cody\\Documents\\Emails\\hockey2.html")
shop = extract_message("C:\\Users\\Cody\\Documents\\Emails\\shop.html")

# Build dictionary of features
count_vect = CountVectorizer()
x_train_counts = count_vect.fit_transform(news.data)

# Downscaling
tfidf_transformer = TfidfTransformer()
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)
tf_transformer = TfidfTransformer(use_idf=False).fit(x_train_counts)
x_train_tf = tf_transformer.transform(x_train_counts)

# Train classifier
clf = MultinomialNB().fit(x_train_tfidf, news.target)

# List of the extracted emails
docs_new = [gary, gary2, jesus, jesus2, shop, tech, hockey, hockey2]

# Extract features from emails
x_new_counts = count_vect.transform(docs_new)
x_new_tfidf = tfidf_transformer.transform(x_new_counts)

# Predict the categories for each email
predicted = clf.predict(x_new_tfidf)

Now I'm looking to store each variable in an appropriate list, based on the predicted label. I figured I could do that like this:

# Store files in a category
hockey_emails = []
computer_emails = []
politics_emails = []
tech_emails = []
religion_emails = []
forsale_emails = []

# Print out results and store each email in the appropriate category list
for doc, category in zip(docs_new, predicted):
    print('%r ---> %s' % (doc, news.target_names[category]))
    if news.target_names[category] == 'comp.sys.ibm.pc.hardware':
        computer_emails.append(doc)
    if news.target_names[category] == 'rec.sport.hockey':
        hockey_emails.append(doc)
    if news.target_names[category] == 'talk.politics.misc':
        politics_emails.append(doc)
    if news.target_names[category] == 'soc.religion.christian':
        religion_emails.append(doc)
    if news.target_names[category] == 'misc.forsale':
        forsale_emails.append(doc)

If I print out one of these lists, let's say hockey, the output displays the contents stored in each variable rather than the variable names themselves.

I want this:

print(hockey_emails)
output: ['hockey', 'hockey2']

but instead I'm getting this:

output: ['View View online click here Hi Thanks for signing up as a EA SPORTS NHL insider You ll now receive all of the latest and greatest news and info at this e mail address as you ve requested EA com ... Previous Next ', 'View News From The Hockey Writers The Editor s Choice stories from The Hockey Writers View this email in your browser ... Previous Next ']

I figured this would be simple, but I'm sitting here scratching my head. Is this even possible? Should I use something else instead of a list? This is probably simple; I'm just blanking.
You have to keep track of the names yourself; Python won't do it for you.

names = 'gary gary2 Jesus jesus2 shop tech hockey hockey2'.split()
docs_new = [extract_message("C:\\Users\\Cody\\Documents\\Emails\\%s.html" % name) for name in names]

for name, category in zip(names, predicted):
    print('%r ---> %s' % (name, news.target_names[category]))
    if news.target_names[category] == 'comp.sys.ibm.pc.hardware':
        computer_emails.append(name)
Don't do this. Use a dictionary to hold your collection of emails, and you can print the dictionary keys when you want to know what is what.

docs_new = dict()
docs_new["tech"] = extract_message("C:\\Users\\Cody\\Documents\\Emails\\tech.html")
docs_new["gary"] = extract_message("C:\\Users\\Cody\\Documents\\Emails\\gary.html")
# etc.

When you iterate over the dictionary, you'll see the keys.

for doc, category in zip(docs_new, predicted):
    print('%s ---> %s' % (doc, news.target_names[category]))

(More dictionary basics: to iterate over dict values, replace docs_new above with docs_new.values(); or use docs_new.items() for both keys and values.)
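To complete the picture, here is a sketch of grouping the predicted labels by the dictionary keys; it assumes predicted was computed from the email bodies in the same order, e.g. via count_vect.transform(list(docs_new.values())):

from collections import defaultdict

# Map each predicted category name to the email *names* (the dict keys),
# which gives output like ['hockey', 'hockey2'] rather than the email bodies.
emails_by_category = defaultdict(list)
for name, category in zip(docs_new, predicted):  # iterating a dict yields its keys
    emails_by_category[news.target_names[category]].append(name)

print(emails_by_category['rec.sport.hockey'])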