Text missing after scraping a website using BeautifulSoup - python

I'm writing a Python script to get the number of pull requests generated by a particular user during the ongoing Hacktoberfest event.
Here's a link to the official website of Hacktoberfest.
Here's my code:
url = 'https://hacktoberfest.digitalocean.com/stats/user'
import urllib.request
from bs4 import BeautifulSoup
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')
name_box = soup.find('div', attrs={'class': 'userstats--progress'})
print(name_box)
Where 'user' in the first line of the code should be replaced by the user's GitHub handle (e.g. BAJUKA).
Below is the HTML tag I'm aiming to scrape:
<div class="userstats--progress">
<p>
Progress (<span data-js="userPRCount">5</span>/5)
</p>
<div class="ProgressBar ProgressBar--three u-mb--regular ProgressBar--full" data-js="progressBar"></div>
</div>
This is what I get after I run my code:
<div class="userstats--progress">
<p>
Progress (<span data-js="userPRCount"></span>/5)
</p>
<div class="ProgressBar ProgressBar--three u-mb--regular" data-js="progressBar"></div>
</div>
The difference is on the third line, where the number of pull requests is missing (i.e. the 5 inside the span tag is missing).
These are the questions that I want to ask:
1. Why is the number of pull requests (i.e. 5 in this case) missing from the scraped lines?
2. How can I solve this issue, i.e. successfully get the number of pull requests?

The data you're looking for is not in the original HTML that the Hacktoberfest server sends, which urllib downloads and Beautiful Soup parses; it's inserted into the HTML by the Javascript code that runs on that page in your browser after that original data is loaded.
If you use this shell command to download the data that's actually served as the page, you'll see that the span tag you're looking at starts off empty:
curl -s 'https://hacktoberfest.digitalocean.com/stats/BAJUKA' | grep -3 Progress
What's the Javascript that fills that tag? Well, it's minified, so it's very hard to unpick what's going on. You can find it included at the very bottom of the original data, here:
curl -s 'https://hacktoberfest.digitalocean.com/stats/BAJUKA' | grep -3 "script src=" | tail -n5
which when I run it, outputs this:
<script src="https://go.digitalocean.com/js/forms2/js/forms2.min.js"></script>
<script src="/assets/application-134859a20456d7d32be9ea1bc32779e87cad0963355b5372df99a0cff784b7f0.js"></script>
That crazy-looking source URL is a minified piece of Javascript, which means it has been automatically shrunk, which also means it's almost unreadable. But if you go to that page and scroll right down to the bottom, you can see some garbled Javascript which you can try to decode.
I noticed this bit:
var d="2018-09-30T10%3A00%3A00%2B00%3A00",f="2018-11-01T12%3A00%3A00%2B00%3A00";$.getJSON("https://api.github.com/search/issues?q=-label:invalid+created:"+d+".."+f+"+type:pr+is:public+author:"+t+"&per_page=300"
Which I think is where it gets the data to fill that DIV. If you load up and parse that URL, I think you'll find the data you need. You'll need to fill in the dates for that search, and the author. Good luck!
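For example, here is a rough sketch of querying that same GitHub search API yourself with urllib; the dates are the URL-encoded ones from the snippet above, and total_count in the JSON response should be the pull request count:
import json
import urllib.request

user = 'BAJUKA'  # the GitHub handle you are interested in
# Dates copied from the minified Javascript above (URL-encoded timestamps)
d = "2018-09-30T10%3A00%3A00%2B00%3A00"
f = "2018-11-01T12%3A00%3A00%2B00%3A00"
url = ("https://api.github.com/search/issues?q=-label:invalid+created:"
       + d + ".." + f + "+type:pr+is:public+author:" + user + "&per_page=300")
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(urllib.request.urlopen(req).read())
print(data["total_count"])  # should be the number of matching pull requests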

Related

How to extract all children & sub children HTML code from a parent container? Python webscraping

I want to scrape a chunk of HTML code and load it locally in a newly created HTML file.
First I have to find the right container in the HTML code. I'm currently using the BeautifulSoup module in Python to find the parent container (div):
import requests
from bs4 import BeautifulSoup

url = 'https://darksky.net/details/52.3673,4.8998/2021-8-8/ca24/en'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
raw_weather_forecast = soup.find('div', class_="timeline_container")
print(raw_weather_forecast)
This however only returns the HTML code of the container plus the children containers, but not the HTML contents of these children containers (which I want to scrape as well):
<div class="timeline_container" id="timeline">
<div class="timeline">
<div class="stripes"></div>
<div class="hour_ticks"></div>
<div class="hours"></div>
<div class="temps"></div>
</div>
</div>
Example of a part of the HTML code I would like to collect (only one, to give a better idea of the problem; it's a picture because it is quite a lot of code): HTML code
How would I tackle this problem? Is there an efficient way to do this in Python?
Thanks in advance!
Luc
PS.
To give you a better idea of why I want to achieve this: when I wake up I want my TV to display the weather forecast in my area. I'm using a Raspberry Pi with HDMI-CEC to activate my TV when it's time to get up. I then want it to load and show certain things (like my agenda and the weather forecast of that day) that will aid me at the start of the day.
If you want to find the children inside a parent from a given output, you can just parse the output again with the BeautifulSoup functions.
Example:
import requests
from bs4 import BeautifulSoup

url = 'https://darksky.net/details/52.3673,4.8998/2021-8-8/ca24/en'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
raw_weather_forecast = soup.find('div', class_="timeline_container")

# If you want to iterate through all the divs, you can do it like this:
for node in raw_weather_forecast.find_all("div"):
    print(node.text)

print(raw_weather_forecast.find("div", class_="hour_ticks").text)
The last print will give an empty output because there is no text in that div.
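And since your end goal is to load the chunk into a local HTML file: converting the tag to a string keeps all of the nested children, so a minimal sketch (the file name is just an example) is:
# Write the scraped chunk, including all nested children, to a local file
with open("forecast.html", "w", encoding="utf-8") as f:
    f.write(str(raw_weather_forecast))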

Beautiful Soup page table parsing problem

I want to get the data (numbers) from this page. With those numbers I want to do some math.
My current code:
import requests
from bs4 import BeautifulSoup
result = requests.get("http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=45050389997905274")
c = result.content
soup = BeautifulSoup(c, features='lxml')
cld = soup.select("#d03")
print(cld)
Output:
[]
From the page-request I get this result:
<td id="d04" class="">2,105</td>
<td id="d03" class=""><span style="font-size:15px;font-weight:bold">2,147</span> <span style="font-size:11px;color:green">305 (16.56%)</span></td>
<td id="d05" class="">1,842</td>
From this result I only want the <td> IDs in the output.
The problem with that page is that its content is generated dynamically. By the time you fetch the HTML of the page, the actual elements haven't been generated yet (I suppose they are filled in by the Javascript on the page). There are two ways you can approach this.
Try using selenium, which simulates a browser. You can in fact wait for the response to be generated and then fetch the HTML element you want.
The other way would be to look at the network requests the page makes to fetch the data. If the data was not in the initial HTML, surely there must be another API call made to their servers to fetch it.
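For the first approach, a rough selenium sketch might look like this (assuming chromedriver is installed; the fixed sleep is a crude stand-in for a proper wait):
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=45050389997905274")
time.sleep(5)  # crude wait for the page's Javascript to fill in the table
soup = BeautifulSoup(driver.page_source, features='lxml')
print(soup.select("#d03"))
driver.quit()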
On an initial look, I can see that the data you need is being fetched from this URL: http://www.tsetmc.com/tsev2/data/instinfodata.aspx?i=45050389997905274&c=57+. The response looks like this:
12:29:48,A ,2150,2147,2105,1842,2210,2105,2700,53654226,115204065144,1,20190814,122948;98/5/23 16:30:51,F,261391.50,<div class='pn'>4294.29</div>,9596315531133973,3376955600,11101143554708,345522,F,2046434489,11459858578563,282945,F,12927,3823488480,235,;8#240000#2148#2159#500#1,1#600#2145#2160#198067#2,10#1000000#2141#2161#2000#1,;61157,377398,660897;;;;0;
You can figure out the parsing logic in detail by going through their code, I suppose. But it looks like you only need the field containing 2147.
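For the second approach, a rough sketch of pulling that field out of the response (assuming the comma/semicolon layout shown above stays stable):
import requests

url = ("http://www.tsetmc.com/tsev2/data/instinfodata.aspx"
       "?i=45050389997905274&c=57+")
raw = requests.get(url).text
first_record = raw.split(";")[0]  # "12:29:48,A ,2150,2147,2105,..."
fields = first_record.split(",")
print(fields[3])                  # 2147 in the sample response above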
Perhaps this might work:
import requests
from bs4 import BeautifulSoup

result = requests.get("http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=45050389997905274")
c = result.content
soup = BeautifulSoup(c, features='lxml')
for tag in soup.find_all('td')[0:2]:
    print(tag.get('id'))

How to take table data from a website using bs4

I'm trying to scrape a website that has a table in it using bs4, but the content I get back is not as complete as what I see when I inspect the page in the browser. I cannot find the <tr> and <td> tags in it. How can I get the full content of that site, especially the tags for the table?
Here's my code:
from bs4 import BeautifulSoup
import requests
link = requests.get("https://pemilu2019.kpu.go.id/#/ppwp/hitung-suara/", verify = False)
src = link.content
soup = BeautifulSoup(src, "html.parser")
print(soup)
I expect the content to have the <tr> and <td> tags in it because they do exist when I inspect the page, but I found none in the output.
Here's the image of the page where there is the tag <tr> and <td>
You should dump the contents of the text you're trying to parse to a file and look at it. This will tell you for sure what is and isn't there. Like this:
from bs4 import BeautifulSoup
import requests

link = requests.get("https://pemilu2019.kpu.go.id/#/ppwp/hitung-suara/", verify=False)
src = link.content

# link.content is bytes, so the file must be opened in binary mode
with open("/tmp/content.html", "wb") as f:
    f.write(src)

soup = BeautifulSoup(src, "html.parser")
print(soup)
Run this code and then look at the file /tmp/content.html (use a different path, obviously, if you're on Windows) to see what is actually in the file. You could probably do this with your browser, but this is the way to be most sure you know what you are getting. You could, of course, also just add print(src), but if it were me, I'd dump it to a file.
If the HTML you're looking for is not in the initial HTML that you're getting back, then that HTML is coming from somewhere else. The table could be being built dynamically by JavaScript, or coming from another URL reference, possibly one that calls an HTTP API to grab the table's HTML via parameters passed to the API endpoint.
You will have to reverse engineer the site's design to find where that HTML comes from. If it comes from JavaScript, you may be stuck short of scripting the execution of a browser so you can gain access programmatically to the DOM in the browser's memory.
I would recommend running a debugging proxy that will show you each HTTP request being made by your browser. You'll be able to see the contents of each request and response. If you can do this, you can find the URL that actually returns the content you're looking for, if such a URL exists. You'll have to deal with SSL certificates and such because this is an HTTPS endpoint; debugging proxies usually make that pretty easy. We use Charles. The standard browser toolboxes might do this too, i.e. allow you to see each request and response that is generated by a particular page load.
If you can discover the URL that actually returns the table HTML, then you can use that URL to grab it and parse it with BS.
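If you do find such a URL, fetching and parsing it is the easy part. Here's a sketch with a deliberately hypothetical endpoint (substitute whatever URL shows up in the proxy or in the browser's network tab):
import requests
from bs4 import BeautifulSoup

# Hypothetical endpoint -- replace it with the URL you discover
api_url = "https://pemilu2019.kpu.go.id/some/table/endpoint"
resp = requests.get(api_url, verify=False)

# If it returns HTML, parse it with BS as usual; if it returns
# JSON, use resp.json() and skip the parsing below.
soup = BeautifulSoup(resp.text, "html.parser")
for row in soup.find_all("tr"):
    print([td.get_text(strip=True) for td in row.find_all("td")])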

A website that I am trying to scrape is changing tags/IDs based on if it detects a crawler. Is there a way to avoid this?

I am trying to write a basic web scraper that looks through a forum, goes into each post, then checks to see if the post has any github links, storing those links. I am doing this as a part of my research to see how people use and implement Smart Device routines.
I'm fairly new to web scraping, and have been using BeautifulSoup, but I've run into a strange issue. First, my program:
from bs4 import BeautifulSoup
import requests
from user_agent import generate_user_agent

url = 'https://community.smartthings.com/c/projects-stories'
headers = {'User-Agent': generate_user_agent(device_type="desktop", os=('linux'))}
page_response = requests.get(url, timeout=5, headers=headers)
page = requests.get(url, timeout=5)
#print(page.content)
if page.status_code == 200:
    print('URL: ', url, '\nRequest Successful!')
content = BeautifulSoup(page.content, 'html.parser')
print(content.prettify())

project_url = []
for i in content:
    project_url += content.find_all("/div", class_="a href")
print(project_url)
What I'm trying to do right now is simply collect all the URL links to each individual post on the website. When I try to do this, it returns an empty list. After some experimentation in trying to pick out a specific URL based on its ID, I found that while the ID of each post does not seem to change every time the page is reloaded, it DOES change if the website detects that a scraper is being used. I believe this because when the contents of the webpage are printed to the console, at the end of the HTML data there is a section that reads:
<!-- include_crawler_content? -->
</div>
<footer class="container">
<nav class="crawler-nav" itemscope="" itemtype="http://schema.org/SiteNavigationElement">
<a href="/">
Home
</a>
<a href="/categories">
Categories
</a>
<a href="/guidelines">
FAQ/Guidelines
</a>
<a href="/tos">
Terms of Service
</a>
<a href="/privacy">
Privacy Policy
</a>
</nav>
The website seems to detect the crawler and change the navigation based on that. I've tried generating a new user_agent to trick it, but I've had no luck.
Any ideas?
You could potentially start by using
content.findChildren('a')
and then go from there, sorting through the results for the links you want.
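For example, something along these lines (the '/t/' filter is an assumption based on Discourse-style topic URLs; adjust it to match what the real links look like):
# Collect the href of every anchor, keeping only forum-topic links
topic_links = []
for a in content.findChildren('a'):
    href = a.get('href')
    if href and '/t/' in href:  # assumed Discourse topic-URL pattern
        topic_links.append(href)
print(topic_links)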

Download Multiple PDF files from a webpage

So I am trying to download a few eBooks that I have purchased through Humble Bundle. I am using BeautifulSoup and requests to try to parse the HTML and get the href links for the PDFs.
Python
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.humblebundle.com/downloads?key=fkuzzq6R8MA8ydEw")
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("div", {"class": "js-all-downloads-holder"})
print(links)
I am going to put an Imgur link to the site and HTML layout because I don't believe you can access the HTML page without being prompted for a login (which might be one of the reasons I am having this issue to start with): https://imgur.com/24x2X0m
HTML
<div class="flexbtn active noicon js-start-download">
<div class="right"></div>
<span class="label">PDF</span>
<a class="a" download="" href="https://dl.humble.com/makea2drpginaweekend.pdf?gamekey=fkuzzq6R8MA8ydEw&ttl=1521117317&t=b714bb732413a1f0532ec6aa72b282f9">
PDF
</a>
</div>
So the print statement should output the contents of the div, but that is not the case.
Output
python3 pdf_downloader.py
[]
Sorry for the long post; I have been up all night working on this, and at this point it would have been easier to just hit the download button 20+ times, but that is not how you learn.
