I can't get the id while scraping (Python) - python

This is my first post; I hope I'm being clear.
I'm scraping a website, and here is the HTML I'm interested in:
<div id="live-table">
<div class="event mobile event--summary">
<div elementtiming="SpeedCurveFRP" class="leagues--static event--leagues summary-results">
<div class="sportName tennis">
<div id="g_2_ldRHDOEp" title="Clicca per i dettagli dell'incontro!" class="event__matchevent__match--static event__match--twoLine">
...
What I would like to obtain is the last id (g_2_ldRHDOEp). Here is the code I produced using the BeautifulSoup library:
import urllib.request, urllib.error, urllib.parse
from bs4 import BeautifulSoup
url = '...'
response = urllib.request.urlopen(url)
webContent = response.read()
soup = BeautifulSoup(webContent, 'html.parser')
list = []
list = soup.find_all("div")
total_id = " "
for i in list:
    id = i.get('id')
    total_id = total_id + "\n" + str(id)
print(total_id)
But what I get is only
live-table
None
None
None
None
I'm quite new to both Python and BeautifulSoup, and I'm not a serious programmer; I do this just for fun.
Can anyone tell me why I can't get what I want, and perhaps how I could do this in a better, more successful way?
Thank you in advance.

First of all, id and list are names of Python built-ins (id is a function, list is a type), so don't shadow them with your own variable names.
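For example, here is the same loop without the shadowing (a minimal sketch; the renaming alone won't fix the scraping problem, see below):
import urllib.request
from bs4 import BeautifulSoup

url = '...'  # same placeholder URL as in the question
soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html.parser')
divs = soup.find_all("div")  # 'divs' instead of the built-in name 'list'
total_id = ""
for div in divs:
    div_id = div.get("id")  # 'div_id' instead of the built-in name 'id'
    total_id += "\n" + str(div_id)
print(total_id)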
The website is loaded dynamically, so urllib/requests alone won't see that content. We can use Selenium as an alternative to scrape the page.
Install it with: pip install selenium.
Download the ChromeDriver build that matches your Chrome version.
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
URL = "https://www.flashscore.it/giocatore/djokovic-novak/AZg49Et9/"
driver = webdriver.Chrome(r"C:\path\to\chromedriver.exe")
driver.get(URL)
sleep(5)
soup = BeautifulSoup(driver.page_source, "html.parser")
for tag in soup.find_all("div", id="g_2_ldRHDOEp"):
    print(tag.get_text(separator=" "))
driver.quit()
Output:
30.10. 12:05 Djokovic N. (Srb) Sonego L. (Ita) 0 2 2 6 1 6 P
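As a side note, an explicit wait is usually more reliable than a fixed sleep(5); here is a hedged sketch that waits for that specific match row (assuming its id stays g_2_ldRHDOEp):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the match row to appear in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "g_2_ldRHDOEp"))
)
soup = BeautifulSoup(driver.page_source, "html.parser")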

Related

Python Web Scraping | How to scrape data from multiple urls by choosing page number as a range with Beautiful Soup and selenium?

from selenium import webdriver
import time
from bs4 import BeautifulSoup as Soup
driver = webdriver.Firefox(executable_path='C://Downloads//webdrivers//geckodriver.exe')
a = 'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page='
for c in range(8):
    #a = f'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page={c}'
    cd = driver.get(a+str(c))
    page_source = driver.page_source
    bs = Soup(page_source, 'html.parser')
    fetch_data = bs.find_all('div', {'class': 's-expand-height.s-include-content-margin.s-latency-cf-section.s-border-bottom'})
    for f_data in fetch_data:
        product_name = f_data.find('span', {'class': 'a-size-medium.a-color-base.a-text-normal'})
        print(product_name + '\n')
The problem here is that the webdriver successfully visits all 7 pages but doesn't produce any output or an error.
I don't know where I'm going wrong.
Any suggestions, or a reference to an article that addresses this problem, will always be welcome.
You are not selecting the right div tag to fetch the products with BeautifulSoup, which is why you get no output.
Try the following snippet:
# range of pages
for i in range(1, 20):
    driver.get(f'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page={i}')
    page_source = driver.page_source
    bs = Soup(page_source, 'html.parser')
    # get the search results
    products = bs.find_all('div', {'data-component-type': "s-search-result"})
    # print each product name in the search results
    for product in products:
        name = product.find('span', class_="a-size-medium a-color-base a-text-normal")
        if name:
            print(name.text)
You can print bs or fetch_data to debug.
As an aside, in my opinion you could use requests or urllib to get the page source for this site instead of Selenium.
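A hedged sketch of that requests-based alternative (the browser-like User-Agent header is an assumption; Amazon may still block automated clients):
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed to be enough to get the normal page
r = requests.get('https://www.amazon.com/s?k=Mobile&i=amazon-devices&page=1', headers=headers)
bs = BeautifulSoup(r.text, 'html.parser')
for product in bs.find_all('div', {'data-component-type': 's-search-result'}):
    name = product.find('span', class_='a-size-medium a-color-base a-text-normal')
    if name:
        print(name.text)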

Extract a value from html using BeautifulSoup

I'm trying to retrieve a value from this HTML using bs4. I'm really new to data scraping and I have tried to figure out some ways to get this value but to no avail. The closest solution I saw is this one.
Extracting a value from html table using BeautifulSoup
Here is the HTML of which I am looking at:
<div class="dataItem_hld clearfix">
<div class="smalltxt">ROE</div>
<div name="tixStockRoe" class="value">121.362</div>
</div>
I've tried this so far:
from bs4 import BeautifulSoup as BS
import requests
url = "https://www.bursamarketplace.com/mkt/themarket/stock/SUPM"
html_content = requests.get(url).text
soup = BS(html_content, 'lxml')
val = soup.find_all('div', {'name': "tixStockRoe", 'class':"value"})
Even before I try to use strip() to get the value, my val variable is empty.
In [96]: val
Out[96]: []
I've been searching posts for a few hours, but I haven't managed to write the correct code to get the value yet.
Also, please let me know if there are any good resources for learning about extracting data. Thanks
Update
I have edited the code thanks to the responses to the post. Now I've hit another problem: the number 121.362 does not appear in the variable. Any ideas?
val = soup.find_all(attrs={'name': "tixStockRoe"})
and the output is this:
Out[14]: [<div class="value" name="tixStockRoe"><div class="loader loaderSmall"><div class="loader_hld"><img alt="" src="/img/loading.gif"/></div></div></div>]
The data on that page is loaded by JavaScript, which is why you aren't finding the value you're looking for (121.362) with BeautifulSoup.
BeautifulSoup only parses the static HTML it is given; it does not execute JavaScript.
You need to use Selenium to load the page and get the data. You can read more about web scraping with Selenium in its documentation.
Here is how you scrape using Selenium.
import time
from bs4 import BeautifulSoup, Tag, NavigableString
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome("chromedriver.exe", options=options)
url = 'https://www.bursamarketplace.com/mkt/themarket/stock/SUPM'
driver.get(url)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, 'lxml')
d = soup.find('div', attrs= {'name': 'tixStockRoe'})
print(d.text.strip())
121.362
Check the docs: you can't use a keyword argument to search for HTML's name attribute, because Beautiful Soup uses the name argument for the tag name itself. Pass it through attrs instead:
soup.find_all(attrs={"name": "tixStockRoe"})
You can try this:
# lxml may work too; html.parser is the parser that works in my case
soup = BS(html_content, 'html.parser')
val = soup.find_all('div', attrs={'name': "tixStockRoe", 'class': "value"})

How to extract value from span tag

I am writing a simple web scraper to extract the game times for the ncaa basketball games. The code doesn't need to be pretty, just work. I have extracted the value from other span tags on the same page but for some reason I cannot get this one working.
from bs4 import BeautifulSoup as soup
import requests
url = 'http://www.espn.com/mens-college-basketball/game/_/id/401123420'
response = requests.get(url)
soupy = soup(response.content, 'html.parser')
containers = soupy.findAll("div",{"class" : "team-container"})
for container in containers:
    spans = container.findAll("span")
    divs = container.find("div", {"class": "record"})
    ranks = spans[0].text
    team_name = spans[1].text
    team_mascot = spans[2].text
    team_abbr = spans[3].text
    team_record = divs.text
time_container = soupy.find("span", {"class":"time game-time"})
game_times = time_container.text
refs_container = soupy.find("div", {"class" : "game-info-note__container"})
refs = refs_container.text
print(ranks)
print(team_name)
print(team_mascot)
print(team_abbr)
print(team_record)
print(game_times)
print(refs)
The specific code I am concerned about is this:
time_container = soupy.find("span", {"class":"time game-time"})
game_times = time_container.text
I provided the rest of the code to show that .text works on the other span tags. The time is the only data I actually want, but with my current code I just get an empty string.
This is the output I get when I call time_container:
<span class="time game-time" data-dateformat="time1" data-showtimezone="true"></span>
or just '' when I do game_times.
Here is the line of the HTML from the website:
<span class="time game-time" data-dateformat="time1" data-showtimezone="true">6:10 PM CT</span>
I don't understand why the 6:10 PM is gone when I run the script.
The site is dynamic, so you need to use Selenium:
from bs4 import BeautifulSoup as soup
from selenium import webdriver
d = webdriver.Chrome('/path/to/chromedriver')
d.get('http://www.espn.com/mens-college-basketball/game/_/id/401123420')
game_time = soup(d.page_source, 'html.parser').find('span', {'class': 'time game-time'}).text
Output:
'7:10 PM ET'
See the full Selenium documentation for more details.
An alternative would be to use some of ESPN's endpoints. These endpoints return JSON responses: https://site.api.espn.com/apis/site/v2/sports/basketball/mens-college-basketball/scoreboard
You can see other endpoints at this GitHub link: https://gist.github.com/akeaswaran/b48b02f1c94f873c6655e7129910fc3b
This will make your application pretty lightweight compared to running Selenium.
I recommend opening up the inspector and going to the Network tab; you can see all sorts of interesting things happening there, including every request the site makes.
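A hedged sketch of hitting that scoreboard endpoint (the 'events' list and its 'date'/'name' fields are assumptions about the JSON shape, not a documented schema):
import requests

url = ('https://site.api.espn.com/apis/site/v2/sports/'
       'basketball/mens-college-basketball/scoreboard')
data = requests.get(url).json()
# 'events' with 'date'/'name' keys is an assumed shape; inspect the JSON first
for event in data.get('events', []):
    print(event.get('date'), event.get('name'))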
You can easily grab the time from an attribute on the page with requests:
import requests
from bs4 import BeautifulSoup as bs
from dateutil.parser import parse
r = requests.get('http://www.espn.com/mens-college-basketball/game/_/id/401123420')
soup = bs(r.content, 'lxml')
timing = soup.select_one('[data-date]')['data-date']
print(timing)
match_time = parse(timing).time()
print(match_time)
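If you want the time in a specific timezone (the question's HTML showed 6:10 PM CT), here is a hedged follow-up, assuming the data-date value parsed above is an ISO timestamp with a UTC marker:
from dateutil import tz
from dateutil.parser import parse

# 'timing' is the data-date string from the snippet above, assumed to end in 'Z' (UTC)
central = parse(timing).astimezone(tz.gettz('America/Chicago'))
print(central.time())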

Python Beautiful Soup find_all()

I am trying to use find_all() on the HTML at the link below:
http://www.simon.com/mall
Based on advice in other threads, I ran the link through the validator below; it found errors, but I am not sure how those errors might affect what I am trying to do in Beautiful Soup.
https://validator.w3.org/
Here is my code:
from requests import get
url = 'http://www.simon.com/mall'
response = get(url)
from bs4 import BeautifulSoup
html = BeautifulSoup(response.text, 'html5lib')
mall_list = html.find_all('div', class_ = 'col-xl-4 col-md-6 ')
print(type(mall_list))
print(len(mall_list))
The result is:
"C:\Program Files\Anaconda3\python.exe" C:/Users/Chris/PycharmProjects/IT485/src/GetMalls.py
<class 'bs4.element.ResultSet'>
0
Process finished with exit code 0
I know there are hundreds of these divs in the HTML. Why am I not getting any matches?
Your code looks fine; however, when I visit the simon.com/mall link and check Chrome DevTools, there don't seem to be any instances of the class 'col-xl-4 col-md-6 '.
Try testing your code with 'col-xl-2' and you should see some results.
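For example, a minimal tweak of the question's code ('col-xl-2' is an assumption based on inspecting the live page at the time of writing):
mall_list = html.find_all('div', class_='col-xl-2')
print(len(mall_list))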
Assuming that you are trying to parse the names and locations of the different malls on that page (mentioned in your script): the content of that page is generated dynamically, so you can't capture it with requests; instead, you need a browser simulator like Selenium, which is what I did in the code below. Give this a try:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
driver = webdriver.Chrome()
driver.get('http://www.simon.com/mall')
time.sleep(3)
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
for item in soup.find_all(class_="mall-list-item-text"):
    name = item.find_all(class_='mall-list-item-name')[0].text
    location = item.find_all(class_='mall-list-item-location')[0].text
    print(name, location)
Results:
ABQ Uptown Albuquerque, NM
Albertville Premium Outlets® Albertville, MN
Allen Premium Outlets® Allen, TX
Anchorage 5th Avenue Mall Anchorage, AK
Apple Blossom Mall Winchester, VA
I sometimes use BeautifulSoup too. The problem lies in the way you get the attributes. The full working code can be seen below:
import requests
from bs4 import BeautifulSoup
url = 'http://www.simon.com/mall'
response = requests.get(url)
html = BeautifulSoup(response.text, 'html.parser')
mall_list = html.find_all('div', attrs={'class': 'col-lg-4 col-md-6'})[1].find_all('option')
malls = []
for mall in mall_list:
    if mall.get('value') == '':
        continue
    malls.append(mall.text)
print(malls)
print(type(malls))
print(len(malls))

Extract follower numbers from Vkontakte using Python & BeautifulSoup

I am trying to extract the follower count from a page on Vkontakte, a Russian social network. As I'm a complete beginner with Python, I have tried using code I found on StackOverflow that was originally written to extract a follower count from Twitter. Here's the original code:
from bs4 import BeautifulSoup
import requests
username='realDonaldTrump'
url = 'https://www.twitter.com/'+username
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
f = soup.find('li', class_="ProfileNav-item--followers")
print(f)
I'm using this webpage as an example : https://vk.com/msk_my. Here is my code :
from bs4 import BeautifulSoup
import requests
url = 'https://vk.com/msk_my'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
f = soup.find('span', class_="header_count fl_l")
print(f)
This, and many other variations I've tried (for example, searching for "div" instead of "span"), only prints "None". It seems BeautifulSoup can't find the follower count, and I'm struggling to understand why. The only way I've managed to print the follower count is with this:
text = soup.div.get_text()
print(text)
But this prints much more than I want, and I don't know how to extract only the follower count.
Try this. It will fetch only the follower count. All you have to do is use Selenium so you can grab the same page source you see when you inspect the element.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://vk.com/msk_my')
soup = BeautifulSoup(driver.page_source,"lxml")
driver.quit()
item = soup.select(".header_count")[0].text
print("Followers: {}".format(item))
Result:
Followers: 59,343
