Web Scraping Python (BeautifulSoup,Requests) - python

I am learning web scraping using python but I can't get the desired result. Below is my code and the output
code
import bs4,requests
url = "https://twitter.com/24x7chess"
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text,"html.parser")
soup.find_all("span",{"class":"account-group-inner"})
[]
Here is what I was trying to scrape
https://i.stack.imgur.com/tHo5S.png
I keep on getting an empty array. Please Help.

Sites like Twitter load the content dynamically, which sometimes depends upon the browser you are using etc. And due to dynamic loading there could be some elements in the webpage which are lazily loaded, which means that the DOM is inflated dynamically, depending upon the user actions, The tag you are inspecting in your browser Inspect element, is inspected the fully dynamically inflated HTML, But the response you are getting using requests, is inflated HTML, or a simple DOM waiting to load the elements dynamically on the user actions which in your case while fetching from requests module is None.
I would suggest you to use selenium webdriver for scraping dynamic javascript web pages.

Try this. It will give you the items you probably look for. Selenium with BeautifulSoup is easy to handle. I've written it that way. Here it is.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://twitter.com/24x7chess")
soup = BeautifulSoup(driver.page_source,"lxml")
driver.quit()
for title in soup.select("#page-container"):
name = title.select(".ProfileHeaderCard-nameLink")[0].text.strip()
location = title.select(".ProfileHeaderCard-locationText")[0].text.strip()
tweets = title.select(".ProfileNav-value")[0].text.strip()
following = title.select(".ProfileNav-value")[1].text.strip()
followers = title.select(".ProfileNav-value")[2].text.strip()
likes = title.select(".ProfileNav-value")[3].text.strip()
print(name,location,tweets,following,followers,likes)
Output:
akul chhillar New Delhi, India 214 44 17 5

You could have done the whole thing with requests rather than selenium
import requests
from bs4 import BeautifulSoup as bs
import re
r = requests.get('https://twitter.com/24x7chess')
soup = bs(r.content, 'lxml')
bio = re.sub(r'\n+',' ', soup.select_one('[name=description]')['content'])
stats_headers = ['Tweets', 'Following', 'Followers', 'Likes']
stats = [item['data-count'] for item in soup.select('[data-count]')]
data = dict(zip(stats_headers, stats))
print(bio, data)

Related

Extracting content from webpage using BeautifulSoup

I am working on scrapping numbers from the Powerball website with the code below.
However, numbers keeps coming back empty. Why is this?
import requests
from bs4 import BeautifulSoup
url = 'https://www.powerball.com/games/home'
page = requests.get(url).text
bsPage = BeautifulSoup(page)
numbers = bsPage.find_all("div", class_="field_numbers")
numbers
Can confirm #Teprr is absolutely correct. You'll need to download chrome and add chromedriver.exe to your system path for this to work but the following code gets what you are looking for. You can use other browsers too you just need their respective driver.
from bs4 import BeautifulSoup
from selenium import webdriver
import time
url = 'https://www.powerball.com/games/home'
options = webdriver.ChromeOptions()
options.add_argument('headless')
browser = webdriver.Chrome(options=options)
browser.get(url)
time.sleep(3) # wait three seconds for all the js to happen
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
draws = soup.findAll("div", {"class":"number-card"})
print(draws)
for d in draws:
info = d.find("div",{"class":"field_draw_date"}).getText()
balls = d.find("div",{"class":"field_numbers"}).findAll("div",{"class":"numbers-ball"})
numbers = [ball.getText() for ball in balls]
print(info)
print(numbers)
If you download that file and inspect it locally, you can see that there is no <div> with that class. That means that it is likely generated dynamically using javascript by your browser, so you would need to use something like selenium to get the full, generated HTML content.
Anyway, in this specific case, this piece of HTML seems to be the container for the data you are looking for:
<div data-url="/api/v1/numbers/powerball/recent?_format=json" class="recent-winning-numbers"
data-numbers-powerball="Power Play" data-numbers="All Star Bonus">
Now, if you check that custom data-url, you can find the information you want in JSON format.

Don't get data from soup

I created bs4 web-scraping app with python. My program return empty list for review. For soup program runs normally.
from bs4 import BeautifulSoup
import requests
import pandas as pd
data = []
usernames = []
titles = []
comments = []
result = requests.get('https://www.kupujemprodajem.com/review.php?action=list')
soup = BeautifulSoup(result.text, 'html.parser')
review = soup.findAll('div', class_="single-review")
print(review)
for i in review:
header = i.find('div', class_="single-review__header")
footer = i.find('div', class_="comment-holder")
username = header.find('a', class_="single-review__username").text
title = header.find('div', class_="single-review__related-to").text
comment = footer.find('div', class_="single-review__comment").text
usernames.append(username)
titles.append(title)
comments.append(comment)
data.append(usernames)
data.append(titles)
data.append(comments)
print(data)
It isn't problem with class.
It looks like the reason this doesn't work is because the website needs a login in order to access that page. If in a private tab in a browser you where to visit https://www.kupujemprodajem.com/review.php?action=list, it would just take you to a login page.
There's 2 paths I can think of that you could take here:
Reverse engineer how the login process works and use the requests library to make a request to login and get (most likely) the session cookie from that in order to be able to request pages that require sign in.
(much simpler) use selenium instead. Selenium is a library that allows you to control a full browser instance, so you would be able to easily input credentials using this method. Beautiful soup on the other hand simply just parses html, so doing things like authenticating often take much more work in Beautiful Soup then they do in Selenium. I'd definitely suggest looking into it if you haven't already.

BeautifulSoup doesn't return the expected tag as Chrome

I'm trying to parse a page to learn beautifulSoup, here is the code
import requests as req
from bs4 import BeautifulSoup
page = 'https://www.pathofexile.com/trade/search/Delirium/w0brcb'
resp = req.get(page)
soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.find_all('results')
print(len(res))
Result: 0
The goal is to get the first price.
I tried to look for the tag in Chrome and it's there, but probably the browser does another requests to get the results.
Can someone explain what am I missing here?
website's source code
Problems with the code
Your code is looking for a "results"-element. What you really have to look for (based on your screenshot) is a div-element with the class "results".
So try this:
soup.find_all("div", attrs={"class":"results"})
But if you want the price you have to dig deeper for the element which contains the price:
price = soup.find("span", attrs={"data-field":"price"}).text
Problems with the site
It seems the site is loading data by Ajax. With Requests you get the page before/without Ajax data call.
In this case you should change from Requests to Selenium module. This will navigate through a "real Browser" and you can wait until data is finally loaded before you start scraping.
Documentation: Selenium

Getting output 0 even if there are 25 same class

Image of the HTML
Link to the page
I am trying to see how many of class are there on this page but the output is 0. And I have been using BeautifulSoup for a while but never saw such error.
from bs4 import BeautifulSoup
import requests
result = requests.get("https://www.holonis.com/motivationquotes")
c = result.content
soup = BeautifulSoup(c)
samples = soup.findAll("div", {"class": "ng-scope"})
print(len(samples))
Output
0
and I want the correct output at least more than 25
This is a "dynamic" Angular-based page which needs a Javascript engine or a browser to be fully loaded. To put it differently - the HTML source code you see in the browser developer tools is not the same as you would see in the result.content - the latter is a non-rendered initial HTML of the page which does not contain the desired data.
You can use things like selenium to have the page rendered and loaded and then HTML-parse it, but, why don't make a direct request to the site API:
import requests
result = requests.get("https://www.holonis.com/api/v2/activities/motivationquotes/all?limit=15&page=0")
data = result.json()
for post in data["items"]:
print(post["body"]["description"])
Post descriptions are retrieved and printed for example-purposes only - the post dictionaries contain all the other relevant post data that is displayed on the web-page.
Basically, result.content does not contain any divs with ng-scope class. As stated in one of the comments the html you are trying to get is added there due to the javascript running on the browser.
I recommend you this package requests-html created by the author of very popular requests.
You can try to use the code below to build on that.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.holonis.com/motivationquotes')
r.html.render()
To see how many ng-scope classes are there just do this:
>>> len(r.html.find('.ng-scope'))
302
I assume you want to scrape all the hrefs from the a tags that are children of the divs you gave the image to. You can obtain them this way:
divs = r.html.find('[ng-if="!isVideo"]')
link_sets = (div.absolute_links for div in divs)
>>> list(set(chain.from_iterable(link_sets)))
['https://www.holonis.com/motivationquotes/o/byiqe-ydm',
'https://www.holonis.com/motivationquotes/o/rkhv0uq9f',
'https://www.holonis.com/motivationquotes/o/ry7ra2ycg',
...
'https://www.holonis.com/motivationquotes/o/sydzfwgcz',
'https://www.holonis.com/motivationquotes/o/s1eidcdqf']
There's nothing wrong with BeautifulSoup, in fact, the result of your GET request, does not contain any ng-scope text.
You can see the output here:
>>> from bs4 import BeautifulSoup
>>> import requests
>>>
>>> result = requests.get("https://www.holonis.com/motivationquotes")
>>> c = result.content
>>>
>>> print(c)
**Verify the output yourself**
You only have ng-cloak class as you can see from the regex result:
import re
regex = re.compile('ng.*')
samples = soup.findAll("div", {"class": regex})
samples
#[<div class="ng-cloak slideRoute" data-ui-view="main" fixed="400" main-content="" ng-class="'{{$state.$current.animateClass}}'"></div>]
To get the content of that webpage, it is either wise to use their api or choose any browser simulator like selenium. That webpage loads it's content using lazyload. When you scroll down you will see more content. The webpage expands its content through pagination like https://www.holonis.com/excellenceaddiction/1. However, you can give this a go. I've created this script to parse content displayed within 4 pages. You can always change that page number to satisfy your requirement.
from selenium import webdriver
URL = "https://www.holonis.com/excellenceaddiction/{}"
driver = webdriver.Chrome() #If necessary, define the path
for link in [URL.format(page) for page in range(1,4)]:
driver.get(link)
for items in driver.find_elements_by_css_selector('.tile-content .tile-content-text'):
print(items.text)
driver.quit()
Btw, the above script parses the description of each post.

Trouble Scraping site with BS4

usually I'm able to write a script that works for scraping, but I've been having some difficulty scraping this site for the table enlisted for this research project I'm working on. I'm planning to verify the script working on one State before entering the URL of my targeted states.
import requests
import bs4 as bs
url = ("http://programs.dsireusa.org/system/program/detail/284")
dsire_get = requests.get(url)
soup = bs.BeautifulSoup(dsire_get.text,'lxml')
table = soup.findAll('div', {'data-ng-controller': 'DetailsPageCtrl'})
print(table)
#I'm printing "Table" just to ensure that the table information I'm looking for is within this sections
I'm not sure if the site is attempting to block people from scraping, but all the info that I'm looking to grab is within "&quot"if you look what Table outputs.
The text is rendered with JavaScript.
First render the page with dryscrape
(If you don't want to use dryscrape see Web-scraping JavaScript page with Python )
Then the text can be extracted, after it has been rendered, from a different position on the page i.e the place it has been rendered to.
As an example this code will extract HTML from the summary.
import bs4 as bs
import dryscrape
url = ("http://programs.dsireusa.org/system/program/detail/284")
session = dryscrape.Session()
session.visit(url)
dsire_get = session.body()
soup = bs.BeautifulSoup(dsire_get,'html.parser')
table = soup.findAll('div', {'class': 'programSummary ng-binding'})
print(table[0])
Outputs:
<div class="programSummary ng-binding" data-ng-bind-html="program.summary"><p>
<strong>Eligibility and Availability</strong></p>
<p>
Net metering is available to all "qualifying facilities" (QFs), as defined by the federal <i>Public Utility Regulatory Policies Act of 1978</i> (PURPA), which pertains to renewable energy systems and combined heat and power systems up to 80 megawatts (MW) in capacity. There is no statewide cap on the aggregate capacity of net-metered systems.</p>
<p>
All utilities subject to Public ...
So I finally managed to solve the issue, and successfuly grab the data from the Javascript page the code as follows worked for me if anyone encounters a same issue when trying to use python to scrape a javascript webpage using windows (dryscrape incompatible).
import bs4 as bs
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome()
url = ("http://programs.dsireusa.org/system/program/detail/284")
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(html_source, "html.parser")
table = soup.find('div', {'class': 'programOverview'})
data = []
for n in table.findAll("div", {"class": "ng-binding"}):
trip = str(n.text)
data.append(trip)

Categories

Resources