I am new to scraping. I have been asked to get a list of store number, city, and state from the website https://www.lowes.com/Lowes-Stores
Below is what I have tried so far. Since the elements I need don't have a distinguishing attribute, I am not sure how to continue my code. Please guide me!
import requests
from bs4 import BeautifulSoup
import json
from pandas import DataFrame as df

url = "https://www.lowes.com/Lowes-Stores"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

page = requests.get(url, headers=headers)
page.encoding = 'ISO-8859-1'

soup = BeautifulSoup(page.text, 'html.parser')
lowes_list = soup.find_all(class_="list unstyled")

for i in lowes_list[:2]:
    print(i)

example = lowes_list[0]
example_content = example.contents
example_content
You've found the list elements that contain the links that you need for state store lookups in your for loop. You will need to get the href attribute from the "a" tag inside each "li" element.
This is only the first step since you'll need to follow those links to get the store results for each state.
Since you know the structure of this state link result, you can simply do:
for i in lowes_list:
    list_items = i.find_all('li')
    for x in list_items:
        for link in x.find_all('a'):
            print(link['href'])
There are definitely more efficient ways of doing this, but the list is very small and this works.
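For instance, here is a more compact sketch of the same idea using a single CSS selector (a hypothetical shorthand, reusing the soup object from the question; it should yield the same hrefs the loop above prints):

# A minimal sketch: grab every store-directory href in one pass.
# ".list.unstyled" matches the same elements as class_="list unstyled" above.
state_stores_links = [a["href"] for a in soup.select(".list.unstyled li a[href]")]
print(len(state_stores_links))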
Once you have the links for each state, you can create another request for each one to visit its store results page, then grab the href attribute from the store links on that page. The anchor text of each link is the store name (e.g. "Anchorage Lowe's"), and its href contains the city and the store number.
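To illustrate with the example href used later in the full script ("/store/AK-Wasilla/2512"), the city and store number can be split out of the path:

href = "/store/AK-Wasilla/2512"        # example store href from a state's results page
parts = href.split("/")                # ['', 'store', 'AK-Wasilla', '2512']
store_number = parts[3]                # '2512'
store_city = parts[2].split("-")[1]    # 'Wasilla'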
Here is a full example. I included lots of comments to illustrate the points.
You pretty much had everything up to the point of collecting the state links, but you needed to follow those links for each state. A good technique for approaching these is to test the path out in your web browser first with the dev tools open, watching the HTML so you have a good idea of where to start with the code.
This script will obtain the data you need, but doesn't provide any data presentation.
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.lowes.com/Lowes-Stores"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}

page = requests.get(url, headers=headers, timeout=5)
page.encoding = "ISO-8859-1"
soup = bs(page.text, "html.parser")

lowes_state_lists = soup.find_all(class_="list unstyled")

# we will store the links for each state in this list
state_stores_links = []

# now we populate the state_stores_links list by finding the href in each li tag
for ul in lowes_state_lists:
    list_items = ul.find_all("li")
    # now we have all the list items from the page, we have to extract the href
    for li in list_items:
        for link in li.find_all("a"):
            state_stores_links.append(link["href"])

# This next part is what the original question was missing: following the state links to their respective search result pages.
# At this point we have to request a new page for each state and store the results.
# You can use pandas, but a dict works too.
states_stores = {}

for link in state_stores_links:
    # Splitting the link on "/" gives us the parts of the URL.
    # By inspecting with Chrome DevTools, we can see that each state follows the same pattern (state name and state abbreviation).
    link_components = link.split("/")
    state_name = link_components[2]
    state_abbreviation = link_components[3]

    # Let's use the state_abbreviation as the dict's key, and we will have a stores list that we can do reporting on.
    # The type and shape of this dict is irrelevant at this point; this example illustrates how to obtain the info you're after.
    # In the end the states_stores[state_abbreviation]['stores'] list will hold dicts, each with a store_number and a city key.
    states_stores[state_abbreviation] = {"state_name": state_name, "stores": []}

    try:
        # Simple error catching in case something goes wrong, since we are sending many requests.
        # Our link is just the second half of the URL, so we have to craft the new one.
        new_link = "https://www.lowes.com" + link
        state_search_results = requests.get(new_link, headers=headers, timeout=5)
        stores = []
        if state_search_results.status_code == 200:
            store_directory = bs(state_search_results.content, "html.parser")
            store_directory_div = store_directory.find("div", class_="storedirectory")
            # now we get the links inside the storedirectory div
            individual_store_links = store_directory_div.find_all("a")
            # We now have all the stores for this state! Let's parse and save them into our store dict.
            # The store's city comes after the state's abbreviation followed by a dash; the store number is the last thing in the link.
            # example: "/store/AK-Wasilla/2512"
            for store in individual_store_links:
                href = store["href"]
                try:
                    # By splitting the href, which looks to be consistent throughout the site, we can get the info we need.
                    split_href = href.split("/")
                    store_number = split_href[3]
                    # The store city comes after the "-", so we split that element into its two parts and take the second part.
                    store_city = split_href[2].split("-")[1]
                    # creating our store dict
                    store_object = {"city": store_city, "store_number": store_number}
                    # adding the dict to our state's dict
                    states_stores[state_abbreviation]["stores"].append(store_object)
                except Exception as e:
                    print(
                        "Error getting store info from {0}. Exception: {1}".format(
                            split_href, e
                        )
                    )
            # let's print something so we can confirm our script is working
            print(
                "State store count for {0} is: {1}".format(
                    states_stores[state_abbreviation]["state_name"],
                    len(states_stores[state_abbreviation]["stores"]),
                )
            )
        else:
            print(
                "Error fetching: {0}, error code: {1}".format(
                    link, state_search_results.status_code
                )
            )
    except Exception as e:
        print("Error fetching: {0}. Exception: {1}".format(state_abbreviation, e))
I am scraping multiple Google Scholar pages, and I have already written code using Beautiful Soup to extract the title, author, and journal information.
This is a sample page.
https://scholar.google.com/citations?user=cp-8uaAAAAAJ&hl=en
I now want to extract the h-index, i10-index, and citation counts. When I inspected the page, I saw that all of these have the same class name (gsc_rsb_std). Given this, my questions are:
How do I extract this information using Beautiful Soup? The following code extracts the citation count; how do I do it for the other two when the class name is the same?
columns['Citations'] = soup.findAll('td',{'class':'gsc_rsb_std'}).text
There is only one value each for name, citations, h-index, and i10-index. However, there are multiple rows of papers. Ideally, I want my output in the following form:
Name h-index paper1
Name h-index paper2
Name h-index paper3
I tried the following, and I get output in the form above, but only the last paper is repeated. Not sure what is happening here.
soup = BeautifulSoup(driver.page_source, 'html.parser')
columns = {}
columns['Name'] = soup.find('div', {'id': 'gsc_prf_in'}).text
papers = soup.find_all('tr', {'class': 'gsc_a_tr'})
for paper in papers:
    columns['title'] = paper.find('a', {'class': 'gsc_a_at'}).text
    File.append(columns)
My output is like this. Looks like there is something wrong with the loop.
Name h-index paper3
Name h-index paper3
Name h-index paper3
Appreciate any help. Thanks in advance!
I would consider using the :has and :contains CSS pseudo-classes and targeting each row by its search string:
import requests
from bs4 import BeautifulSoup

searches = ['Citations', 'h-index', 'i10-index']

r = requests.get('https://scholar.google.com/citations?user=cp-8uaAAAAAJ&hl=en')
soup = BeautifulSoup(r.text, 'html.parser')

for search in searches:
    all_value = soup.select_one(f'td:has(a:contains("{search}")) + td')
    print(f'{search} All:', all_value.text)
    since_2016 = all_value.find_next('td')
    print(f'{search} since 2016:', since_2016.text)
You could also have used pandas read_html to grab that table by index.
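As a sketch of that pandas alternative (not part of the original answer; the [0] index assumes the "Cited by" table is the first table on the page, which you should verify):

import pandas as pd
import requests

r = requests.get('https://scholar.google.com/citations?user=cp-8uaAAAAAJ&hl=en')
# read_html parses every <table> in the HTML into a DataFrame
tables = pd.read_html(r.text)
cited_by = tables[0]  # assumed index of the "Cited by" table
print(cited_by)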
Selenium question:
The element has an id, which is faster to match on using a CSS selector or find_element_by_id, e.g.
driver.find_element_by_id("gsc_prf_in").text
I see no need, however, for selenium when scraping this page.
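For example, a minimal sketch of grabbing the name without Selenium (not part of the original answer, reusing the gsc_prf_in id from the question):

import requests
from bs4 import BeautifulSoup

r = requests.get('https://scholar.google.com/citations?user=cp-8uaAAAAAJ&hl=en')
soup = BeautifulSoup(r.text, 'html.parser')
# the same element the question located with driver.page_source and soup.find
print(soup.find('div', {'id': 'gsc_prf_in'}).text)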
You can use the SelectorGadget Chrome extension to visually grab CSS selectors.
Element(s) highlighted in:
red are excluded from the search.
green are included in the search.
yellow are what SelectorGadget guesses the user is looking for and need additional clarification.
(Screenshots of grabbing the h-index and i10-index selectors with SelectorGadget are omitted.)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml, os

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

proxies = {
    'http': os.getenv('HTTP_PROXY')
}

html = requests.get('https://scholar.google.com/citations?user=cp-8uaAAAAAJ&hl=en', headers=headers, proxies=proxies).text
soup = BeautifulSoup(html, 'lxml')

for cited_by_public_access in soup.select('.gsc_rsb'):
    citations_all = cited_by_public_access.select_one('tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std').text
    citations_since2016 = cited_by_public_access.select_one('tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std').text
    h_index_all = cited_by_public_access.select_one('tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std').text
    h_index_2016 = cited_by_public_access.select_one('tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std').text
    i10_index_all = cited_by_public_access.select_one('tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std').text
    i10_index_2016 = cited_by_public_access.select_one('tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std').text
    articles_num = cited_by_public_access.select_one('.gsc_rsb_m_a:nth-child(1) span').text.split(' ')[0]
    articles_link = cited_by_public_access.select_one('#gsc_lwp_mndt_lnk')['href']

    print('Citation info:')
    print(f'{citations_all}\n{citations_since2016}\n{h_index_all}\n{h_index_2016}\n{i10_index_all}\n{i10_index_2016}\n{articles_num}\nhttps://scholar.google.com{articles_link}\n')
Output:
Citation info:
55399
34899
69
59
148
101
23
https://scholar.google.com/citations?view_op=list_mandates&hl=en&user=cp-8uaAAAAAJ
Alternatively, you can do the same thing using Google Scholar Author API from SerpApi. It's a paid API with a free plan.
The main difference in this particular example is that you don't have to guess and tinker with how to grab certain elements of the HTML page.
Another thing is that you don't have to think about solving CAPTCHAs (they could appear at some point) or finding good proxies if you need to make many requests.
Code to integrate:
from serpapi import GoogleSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_scholar_author",
    "author_id": "cp-8uaAAAAAJ",
    "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

citations_all = results['cited_by']['table'][0]['citations']['all']
citations_2016 = results['cited_by']['table'][0]['citations']['since_2016']
h_index_all = results['cited_by']['table'][1]['h_index']['all']
h_index_2016 = results['cited_by']['table'][1]['h_index']['since_2016']
i10_index_all = results['cited_by']['table'][2]['i10_index']['all']
i10_index_2016 = results['cited_by']['table'][2]['i10_index']['since_2016']

print(f'{citations_all}\n{citations_2016}\n{h_index_all}\n{h_index_2016}\n{i10_index_all}\n{i10_index_2016}\n')

public_access_link = results['public_access']['link']
public_access_available_articles = results['public_access']['available']

print(f'{public_access_link}\n{public_access_available_articles}')
Output:
55399
34899
69
59
148
101
https://scholar.google.com/citations?view_op=list_mandates&hl=en&user=cp-8uaAAAAAJ
23
Disclaimer: I work for SerpApi.
I've created a script to parse the links of the cases revealed upon selecting an option from a dropdown on a webpage (the URL is in the script below). The option Probate should be chosen from the dropdown titled Case Type, located at the top right, before hitting the search button. All the other options should be left as they are.
The script can parse the links of different cases from the first page flawlessly. However, I can't make the script go on to the next pages to collect links from there as well.
The next pages are shown in a pager at the bottom of the results, and the dropdown shows Probate once the option is chosen (screenshots omitted).
This is what I've tried so far:
import requests
from bs4 import BeautifulSoup

link = "http://surrogateweb.co.ocean.nj.us/BluestoneWeb/Default.aspx"

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name],select')}
    for k, v in payload.items():
        if k.endswith('ComboBox_case_type'):
            payload[k] = "Probate"
        elif k.endswith('ComboBox_case_type_VI'):
            payload[k] = "WILL"
        elif k.endswith('ComboBox_case_type$DDD$L'):
            payload[k] = "WILL"
        elif k.endswith('ComboBox_town$DDD$L'):
            payload[k] = "%"
    r = s.post(link, data=payload)
    soup = BeautifulSoup(r.text, "lxml")
    for pk_id in soup.select("a.dxeHyperlink_Youthful[href*='Q_PK_ID']"):
        print(pk_id.get("href"))
How can I collect the links of different cases from next pages using requests?
P.S. I'm not after any Selenium-related solution.
First, examine the network requests in DevTools (press F12 in Chrome) and monitor the payload. There are bits of data missing from your request.
The reason the form data is missing is that it is added by JavaScript when the user clicks on a page number. Once the form data has been set, JavaScript executes the following:
xmlRequest.open("POST", action, true);
xmlRequest.setRequestHeader("Content-Type", "application/x-www-form-urlencoded; charset=utf-8");
xmlRequest.send(postData);
So all you need to do is emulate that in your Python script, although it looks like the paging functionality only requires two additional values: __CALLBACKID and __CALLBACKPARAM.
In the following example, I've scraped the first 4 pages (note: the first request just fetches the landing page):
import requests
from bs4 import BeautifulSoup

link = "http://surrogateweb.co.ocean.nj.us/BluestoneWeb/Default.aspx"

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
    r = s.get(link)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "lxml")
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name],select')}
    for k, v in payload.items():
        if k.endswith('ComboBox_case_type'):
            payload[k] = "Probate"
        elif k.endswith('ComboBox_case_type_VI'):
            payload[k] = "WILL"
        elif k.endswith('ComboBox_case_type$DDD$L'):
            payload[k] = "WILL"
        elif k.endswith('ComboBox_town$DDD$L'):
            payload[k] = "%"

    page_id_list = ['PN0', 'PN1', 'PN2', 'PN3']  # TODO: This is a proof of concept. You need to refactor the code; perhaps scrape the page ids from the paging html.
    for page_id in page_id_list:
        # Add 2 post items. This is required for the ASP.NET GridView AJAX postback event.
        payload['__CALLBACKID'] = 'ctl00$ContentPlaceHolder1$ASPxGridView_search'
        # TODO: you might want to examine "__CALLBACKPARAM" across multiple pages. However, it looks like it works by swapping the page id (e.g. PN1, PN2).
        payload['__CALLBACKPARAM'] = 'c0:KV|151;["5798534","5798533","5798532","5798531","5798529","5798519","5798518","5798517","5798515","5798514","5798512","5798503","5798501","5798496","5798495"];CR|2;{};GB|20;12|PAGERONCLICK3|' + page_id + ';'
        r = s.post(link, data=payload)
        r.raise_for_status()
        soup = BeautifulSoup(r.text, "lxml")
        for pk_id in soup.select("a.dxeHyperlink_Youthful[href*='Q_PK_ID']"):
            print(pk_id.get("href"))
Output:
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798668
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798588
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798584
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798573
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798572
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798570
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798569
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798568
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798566
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798564
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798560
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798552
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798542
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798541
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798535
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798534
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798533
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798532
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798531
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798529
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798519
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798518
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798517
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798515
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798514
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798512
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798503
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798501
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798496
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798495
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798494
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798492
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798485
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798480
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798479
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798476
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798475
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798474
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798472
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798471
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798470
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798469
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798466
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798463
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798462
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798460
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798459
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798458
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798457
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798455
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798454
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798453
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798452
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798449
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798448
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798447
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798446
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798445
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798444
WebPages/web_case_detail_ocean.aspx?Q_PK_ID=5798443
Whilst the solution can be achieved using requests, it can be temperamental; Selenium is usually a better approach.
This code works but uses Selenium instead of requests.
You need to install the Selenium Python library and download geckodriver. If you do not want geckodriver in c:/program, change executable_path= to the path where you keep it. You may also want to make the sleep times shorter, but the site loads so slowly (for me) that I had to set long sleep times so the page loads fully before the script tries to read from it.
from selenium import webdriver
from bs4 import BeautifulSoup
import time

link = "http://surrogateweb.co.ocean.nj.us/BluestoneWeb/Default.aspx"

driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
driver.get(link)

dropdown = driver.find_element_by_css_selector('#ContentPlaceHolder1_ASPxSplitter1_ASPxComboBox_case_type_B-1')
dropdown.click()
time.sleep(0.5)

cases = driver.find_elements_by_css_selector('.dxeListBoxItem_Youthful')
for case in cases:
    if case.text == 'Probate':
        time.sleep(5)
        case.click()
        time.sleep(5)

search = driver.find_element_by_css_selector('#ContentPlaceHolder1_ASPxSplitter1_ASPxButton_search')
search.click()

while True:
    time.sleep(15)
    soup = BeautifulSoup(driver.page_source, "lxml")
    for pk_id in soup.select("a.dxeHyperlink_Youthful[href*='Q_PK_ID']"):
        print(pk_id.get("href"))
    next = driver.find_elements_by_css_selector('.dxWeb_pNext_Youthful')
    if len(next) > 0:
        next[0].click()
    else:
        break
Here's how you can use the PBN (next page) callback to paginate through all the results. The key thing you need to do is pass the callback state along.
import html
import re

import requests
import lxml.html
import demjson


def paginate(url, callback_id):
    response = requests.get(url)
    tree = lxml.html.fromstring(response.text)

    # The first page of results is embedded in the full html
    # page. Subsequent pages of results will be extracted from
    # partial html returned from an endpoint intended for AJAX.
    yield tree

    # Set up the pagination payload with its constant values
    payload = {}
    payload['__EVENTARGUMENT'] = None
    payload['__EVENTTARGET'] = None
    payload['__VIEWSTATE'], = tree.xpath(
        "//input[@name='__VIEWSTATE']/@value")
    payload['__VIEWSTATEGENERATOR'], = tree.xpath(
        "//input[@name='__VIEWSTATEGENERATOR']/@value")
    payload['__EVENTVALIDATION'], = tree.xpath(
        "//input[@name='__EVENTVALIDATION']/@value")
    payload['__CALLBACKID'] = callback_id

    # To get the next page of results from the AJAX endpoint,
    # it's basically a post request with a 'PBN' argument. But,
    # we also have to pass around the callback state that
    # the endpoint expects.
    event_callback_source, = tree.xpath('''//script[contains(text(), "var dxo = new ASPxClientGridView('{}');")]/text()'''.format(callback_id.replace('$', '_')))
    callback_state = demjson.decode(re.search(r'^dxo\.stateObject = \((?P<body>.*)\);$', event_callback_source, re.MULTILINE).group('body'))

    # You may wonder why we are encoding the callback_state back to a string
    # right after we decoded it from a string.
    #
    # The reason is that the original string uses single quotes and is
    # not html-escaped, and we need to use double quotes and html escape.
    payload[callback_id] = html.escape(demjson.encode(callback_state))

    item_keys = callback_state['keys']
    payload['__CALLBACKPARAM'] = 'c0:KV|61;{};GB|20;12|PAGERONCLICK3|PBN;'.format(demjson.encode(item_keys))

    # We'll break when we attempt to paginate to a next
    # page but we get the same keys.
    previous_item_keys = None
    while item_keys != previous_item_keys:
        response = requests.post(url, payload)
        previous_item_keys = item_keys
        data_str = re.match(r'.*?/\*DX\*/\((?P<body>.*)\)', response.text)\
            .group('body')
        data = demjson.decode(data_str)

        table_tree = lxml.html.fromstring(data['result']['html'])
        yield table_tree

        callback_state = data['result']['stateObject']
        payload[callback_id] = html.escape(demjson.encode(callback_state))
        item_keys = callback_state['keys']
        payload['__CALLBACKPARAM'] = 'c0:KV|61;{};GB|20;12|PAGERONCLICK3|PBN;'.format(demjson.encode(item_keys))


if __name__ == '__main__':
    url = "http://surrogateweb.co.ocean.nj.us/BluestoneWeb/Default.aspx"
    callback_id = 'ctl00$ContentPlaceHolder1$ASPxGridView_search'
    results = paginate(url, callback_id)
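Note that paginate is a generator, so nothing is requested until it is iterated. Here is a minimal consumption sketch (not part of the original answer, continuing the __main__ block above and reusing the Q_PK_ID link pattern from the other answers):

for page_number, page_tree in enumerate(paginate(url, callback_id), start=1):
    # pull the case links out of each page of results
    case_links = page_tree.xpath("//a[contains(@href, 'Q_PK_ID')]/@href")
    print('Page {}: {} case links'.format(page_number, len(case_links)))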
I have a query regarding for loops: adding one to an already working web scraper so it runs through a list of webpages. What I'm looking for is probably two or three simple lines of code.
I appreciate this has probably been asked many times before and answered but I've been struggling to get some code to work for me for quite some time now. I'm relatively new to Python and looking to improve.
Background info:
I've written a web scraper using Python and Beautifulsoup which is successfully able to take a webpage from TransferMarkt.com and scrape all the required web links. The script is made up of two parts:
In the first part, I take the webpage for a football league, e.g. the Premier League, extract the webpage links for all the individual teams in the league table, and put them in a list.
In the second part of my script, I then take this list of individual teams and further extract information of each of the individual players for each team and then join this together to form one big pandas DataFrame of player information.
My query is regarding how to add a for loop to the first part of this web scraper to not just extract the team links from one league webpage, but to extract links from a list of league webpages.
Below I've included an example of a football league webpage, my web scraper code, and the output.
Example:
Example webpage to scrape (Premier League - code GB1): https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/gb1/plus/?saison_id=2019
Code (part 1 of 2) - scrape individual team links from league webpage:
# Python libraries

## Data Preprocessing
import pandas as pd

## Data scraping libraries
from bs4 import BeautifulSoup
import requests

# Assign league by code, e.g. Premier League = 'GB1', to the list_league_selected variable
list_league_selected = 'GB1'

# Assign season by year to season variable e.g. 2014/15 season = 2014
season = '2019'

# Create an empty list to assign these values to
team_links = []

# Web scraper script

## Process League Table
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = 'https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/' + list_league_selected + '/plus/?saison_id=' + season
tree = requests.get(page, headers=headers)
soup = BeautifulSoup(tree.content, 'html.parser')

## Create an empty list to assign these values to - team_links
team_links = []

## Extract all links with the correct CSS selector
links = soup.select("a.vereinprofil_tooltip")

## We need the location that the link is pointing to, so for each link, take the link location.
## Additionally, we only need the links in locations 1, 3, 5, etc. of our list, so loop through those only
for i in range(1, 59, 3):
    team_links.append(links[i].get("href"))

## For each location that we have taken, add the website before it - this allows us to call it later
for i in range(len(team_links)):
    team_links[i] = "https://www.transfermarkt.co.uk" + team_links[i]

# View list of team weblinks assigned to variable - team_links
team_links
Output:
Extracted links from example webpage (20 links in total for example webpage, just showing 4):
team_links = ['https://www.transfermarkt.co.uk/manchester-city/startseite/verein/281/saison_id/2019',
'https://www.transfermarkt.co.uk/fc-liverpool/startseite/verein/31/saison_id/2019',
'https://www.transfermarkt.co.uk/tottenham-hotspur/startseite/verein/148/saison_id/2019',
'https://www.transfermarkt.co.uk/fc-chelsea/startseite/verein/631/saison_id/2019',
...,
'https://www.transfermarkt.co.uk/sheffield-united/startseite/verein/350/saison_id/2019']
Using this list of teams - team_links, I am then able to further extract information for all the players of each team with the following code. From this output I'm then able to create a pandas DataFrame of all players info:
Code (part 2 of 2) - scrape individual player information using the team_links list:
# Create an empty DataFrame for the data, df
df = pd.DataFrame()

# Run the scraper through each of the links in the team_links list
for i in range(len(team_links)):
    # Download and process the team page
    page = team_links[i]
    df_headers = ['position_number', 'position_description', 'name', 'dob', 'nationality', 'value']
    pageTree = requests.get(page, headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'lxml')

    # Extract all data
    position_number = [item.text for item in pageSoup.select('.items .rn_nummer')]
    position_description = [item.text for item in pageSoup.select('.items td:not([class])')]
    name = [item.text for item in pageSoup.select('.hide-for-small .spielprofil_tooltip')]
    dob = [item.text for item in pageSoup.select('.zentriert:nth-of-type(4):not([id])')]
    nationality = ['/'.join([i['title'] for i in item.select('[title]')]) for item in pageSoup.select('.zentriert:nth-of-type(5):not([id])')]
    value = [item.text for item in pageSoup.select('.rechts.hauptlink')]

    df_temp = pd.DataFrame(list(zip(position_number, position_description, name, dob, nationality, value)), columns=df_headers)
    df = df.append(df_temp)  # This last line of code is mine. It appends the temporary data to the master DataFrame, df

# View the pandas DataFrame
df
My question to you - adding a for loop to go through all the leagues:
What I need to do is replace the list_league_selected variable assigned to an individual league code in the first part of my code, and instead use a for loop to go through the full list of league codes - list_all_leagues. This list of league codes is as follows:
list_all_leagues = ['L1', 'GB1', 'IT1', 'FR1', 'ES1'] # codes for the top 5 European leagues
I've read through several solutions but I'm struggling to get the loop to work and append the full list of team webpages at the correct part. I believe I'm now really close to completing my scraper and any advice on how to create this for loop would be much appreciated!
Thanks in advance for your help!
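For reference, here is a minimal sketch of the loop itself (a hypothetical variation, separate from the answer below, which restructures the scraper more thoroughly); it reuses the headers and season variables from part 1 and drops duplicate team links instead of relying on the hard-coded range(1, 59, 3):

team_links = []
for league in list_all_leagues:
    page = 'https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/' + league + '/plus/?saison_id=' + season
    tree = requests.get(page, headers=headers)
    soup = BeautifulSoup(tree.content, 'html.parser')
    # Each team row repeats the same href several times; dict.fromkeys keeps
    # the order while dropping those duplicates.
    hrefs = dict.fromkeys(a.get("href") for a in soup.select("a.vereinprofil_tooltip"))
    team_links.extend("https://www.transfermarkt.co.uk" + href for href in hrefs)

print(len(team_links))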
I've taken the time to clean up quite a few mistakes in your code and to shorten it considerably. The code below achieves your target.
I wrapped the whole loop in requests.Session() to maintain the session across requests, which makes it less likely that requests get blocked, refused, or dropped while scraping.
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'
}

leagues = ['L1', 'GB1', 'IT1', 'FR1', 'ES1']


def main(url):
    with requests.Session() as req:
        links = []
        for lea in leagues:
            print(f"Fetching Links from {lea}")
            r = req.get(url.format(lea), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            link = [f"{url[:31]}{item.next_element.get('href')}" for item in soup.findAll(
                "td", class_="hauptlink no-border-links hide-for-small hide-for-pad")]
            links.extend(link)

        print(f"Collected {len(links)} Links")
        goals = []
        for num, link in enumerate(links):
            print(f"Extracting Page# {num + 1}")
            r = req.get(link, headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            target = soup.find("table", class_="items")
            pn = [pn.text for pn in target.select("div.rn_nummer")]
            pos = [pos.text for pos in target.findAll("td", class_=False)]
            name = [name.text for name in target.select("td.hide")]
            dob = [date.find_next(
                "td").text for date in target.select("td.hide")]
            nat = [" / ".join([a.get("alt") for a in nat.find_all_next("td")[1] if a.get("alt")]) for nat in target.findAll(
                "td", itemprop="athlete")]
            val = [val.get_text(strip=True)
                   for val in target.select('td.rechts.hauptlink')]
            goal = zip(pn, pos, name, dob, nat, val)
            df = pd.DataFrame(goal, columns=[
                'position_number', 'position_description', 'name', 'dob', 'nationality', 'value'])
            goals.append(df)

        new = pd.concat(goals)
        new.to_csv("data.csv", index=False)


main("https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id=2019")
"Hello, i am quite new to web-scraping. I recently retrieved a list of web-links and there are URLs within these links containing data from tables. I am planning to scrape the data but can't seem to even get the URLs. Any form of help is much appreciated"
"The list of weblinks are
https://aviation-safety.net/database/dblist.php?Year=1919
https://aviation-safety.net/database/dblist.php?Year=1920
https://aviation-safety.net/database/dblist.php?Year=1921
https://aviation-safety.net/database/dblist.php?Year=1922
https://aviation-safety.net/database/dblist.php?Year=2019
"From the list of links, i am planning to
a. get the URLs within these links
https://aviation-safety.net/database/record.php?id=19190802-0
https://aviation-safety.net/database/record.php?id=19190811-0
https://aviation-safety.net/database/record.php?id=19200223-0
"b. get data from tables within each URL
(e.g., Incident date, incident time, type, operator, registration, msn, first flight, classification)"
# Get the list of weblinks
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests

headers = {'insert user agent'}

# start of code
mainurl = "https://aviation-safety.net/database/"

def getAndParseURL(mainurl):
    result = requests.get(mainurl)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.find_all('a', href=True)
    return datatable

datatable = getAndParseURL(mainurl)

# go through the content and grab the URLs
links = []
for link in datatable:
    if 'Year' in link['href']:
        url = link['href']
        links.append(mainurl + url)

# check if links are in dataframe
df = pd.DataFrame(links, columns=['url'])
df.head(10)

# save the links to a csv
df.to_csv('aviationsafetyyearlinks.csv')

# from the csv, read each web-link and get the URLs within each link
import csv
from urllib.request import urlopen

contents = []
df = pd.read_csv('aviationsafetyyearlinks.csv')
urls = df['url']
for url in urls:
    contents.append(url)
for url in contents:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    addtable = soup.find_all('a', href=True)
"I am only able to get the list of web-links and am unable to get the URLs nor the data within these web-links. The code continually shows arrays
not really sure where my code is wrong, appreciate any help and many thanks in advance."
When requesting the page, add a User-Agent header.
headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
mainurl = "https://aviation-safety.net/database/dblist.php?Year=1919"

def getAndParseURL(mainurl):
    result = requests.get(mainurl, headers=headers)
    soup = BeautifulSoup(result.content, 'html.parser')
    datatable = soup.select('a[href*="database/record"]')
    return datatable

print(getAndParseURL(mainurl))
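Building on that selector, here is a minimal sketch (not part of the original answer) that walks the year pages collected earlier and turns the matching record anchors into absolute URLs; it assumes the aviationsafetyyearlinks.csv produced by the question's code:

import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin

headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

year_urls = pd.read_csv('aviationsafetyyearlinks.csv')['url']

record_urls = []
for year_url in year_urls:
    page = requests.get(year_url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    # same selector as above; urljoin resolves relative hrefs against the year page
    for a in soup.select('a[href*="database/record"]'):
        record_urls.append(urljoin(year_url, a['href']))

print(len(record_urls))
print(record_urls[:3])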
I tried to scrape a Wikipedia page a week ago, but I could not figure out why Beautiful Soup shows the string for some table columns and None for others.
NOTE: the table columns all contain data.
My program extracts all table cells with the class "description"; I am trying to extract all the descriptions from the table.
The website I am scraping is: http://en.wikipedia.org/wiki/Supernatural_(season_6)
This is my code:
from BeautifulSoup import BeautifulSoup
import urllib
import sys
from urllib import FancyURLopener

class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.65 Safari/534.24'

def printList(rowList):
    for row in rowList:
        print row
        print '\n'
    return

url = "http://en.wikipedia.org/wiki/Supernatural_(season_6)"

#f = urllib.urlopen(url)
#content = f.read()
#f.close

myopener = MyOpener()
page = myopener.open(url)
content = page.read()
page.close()

soup = BeautifulSoup(''.join(content))
soup.prettify()

movieList = []
rowListTitle = soup.findAll('tr', 'vevent')
print len(rowListTitle)
#printList(rowListTitle)

for row in rowListTitle:
    col = row.next  # explain this?
    if col != 'None':
        col = col.findNext("b")
        movieTitle = col.string
        movieTuple = (movieTitle, '')
        movieList.append(movieTuple)

#printList(movieList)
for row in movieList:
    print row[0]

rowListDescription = soup.findAll('td', 'description')
print len(rowListDescription)

index = 1
while index < len(rowListDescription):
    description = rowListDescription[index]
    print description
    print description.string
    str = description
    print '####################################'
    movieList[index - 1] = (movieList[index - 1][0], description)
    index = index + 1
I did not paste the output as it is really long, but it is really weird: the code does manage to capture the information in the <td>, yet when I call .string it gives me empty content.
Do all the description strings come up empty? From the documentation:
For your convenience, if a tag has only one child node, and that child node is a string, the child node is made available as tag.string, as well as tag.contents[0].
In this case, the description cells often contain child tags, e.g. an <a> link to another Wikipedia article. That counts as a non-string child node, in which case .string for the description node is set to None.
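A minimal sketch of one way around this (not part of the original answer): join all of the text nodes inside the description cell instead of relying on .string.

for description in rowListDescription:
    # findAll(text=True) returns every text node under the tag, even when the cell
    # contains <a> links or other child tags that make .string return None.
    description_text = ''.join(description.findAll(text=True)).strip()
    print(description_text)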