How to get table data from a website using bs4 - Python

I'm trying to scrape a website that has a table in it using bs4, but the content I get back is not as complete as what I see in the browser's inspector. I cannot find the <tr> and <td> tags in it. How can I get the full content of that site, especially the tags for the table?
Here's my code:
from bs4 import BeautifulSoup
import requests
link = requests.get("https://pemilu2019.kpu.go.id/#/ppwp/hitung-suara/", verify = False)
src = link.content
soup = BeautifulSoup(src, "html.parser")
print(soup)
I expect the content to have the <tr> and <td> tags in it because they do exist when I inspect the page, but I found none in the output.
Here's an image of the page showing where the <tr> and <td> tags appear.

You should dump the contents of the text you're trying to parse to a file and look at it. This will tell you for sure what is and isn't there. Like this:
from bs4 import BeautifulSoup
import requests
link = requests.get("https://pemilu2019.kpu.go.id/#/ppwp/hitung-suara/", verify = False)
src = link.content
# src is bytes (link.content), so open the file in binary mode
with open("/tmp/content.html", "wb") as f:
    f.write(src)
soup = BeautifulSoup(src, "html.parser")
print(soup)
Run this code, and then look at the file "/tmp/content.html" (use a different path, obviously, if you're on Windows) to see what is actually in the file. You could probably do this with your browser, but this is the way to be most sure you know what you are getting. You could, of course, also just add print(src), but if it were me, I'd dump it to a file.
If the HTML you're looking for is not in the initial HTML that you're getting back, then that HTML is coming from somewhere else. The table may be built dynamically by JavaScript, or it may come from another URL, possibly one that calls an HTTP API to grab the table's HTML via parameters passed to the endpoint.
You will have to reverse engineer the site's design to find where that HTML comes from. If it comes from JavaScript, you may be stuck short of scripting the execution of a browser so you can gain access programmatically to the DOM in the browser's memory.
I would recommend running a debugging proxy that will show you each HTTP request being made by your browser. You'll be able to see the contents of each request and response. If you can do this, you can find the URL that actually returns the content you're looking for, if such a URL exists. You'll have to deal with SSL certificates and such because this is a https endpoint. Debugging proxies usually make that pretty easy. We use Charles. The standard browser toolboxes might do this too...allow you to see each request and response that is generated by a particular page load.
If you can discover the URL that actually returns the table HTML, then you can use that URL to grab it and parse it with BS.
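Once you have the HTML that actually contains the table, extracting the cells is straightforward. A minimal sketch (the table fragment below is made up for illustration; the real markup from whatever URL you discover will differ):

```python
from bs4 import BeautifulSoup

# Hypothetical table fragment standing in for the HTML the real URL returns
html = """
<table>
  <tr><td>Candidate A</td><td>55,120</td></tr>
  <tr><td>Candidate B</td><td>44,880</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
# One list per <tr>, one string per <td>
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr")]
print(rows)  # [['Candidate A', '55,120'], ['Candidate B', '44,880']]
```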

Related

How do I fix getting "None" as a response when web scraping?

So I am trying to create a small program that gets the view count from a YouTube video and prints it. However, using this code, when printing the text variable I just get the response "None". Is there a way to get the actual view count using these libraries?
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
soup = BeautifulSoup(url.text, 'html.parser')
text = soup.find('span', {'class': "view-count style-scope ytd-video-view-count-renderer"})
print(text)
To see why, you should use wget or curl to fetch a copy of that page and look at it, or use "view source" from your browser. That's what requests sees. None of those classes appear in the HTML you get back. That's why you get None -- because there ARE none.
YouTube builds all of its pages dynamically, through Javascript. requests doesn't interpret Javascript. If you need to do this, you'll need to use something like Selenium to run a real browser with a Javascript interpreter built in.
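Note also that bs4 matches class strings token by token, so a single exact token is enough even when a tag carries several classes. A small offline check (the snippet below is a made-up stand-in for the markup that only exists after JavaScript runs, which is the point of the answer above):

```python
from bs4 import BeautifulSoup

# Toy stand-in for the rendered markup
html = '<span class="view-count style-scope ytd-video-view-count-renderer">1,234 views</span>'
soup = BeautifulSoup(html, "html.parser")
# A single class token matches even though the tag has three classes
tag = soup.find("span", class_="view-count")
print(tag.get_text())  # 1,234 views
```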

Printing contents from class using BeautifulSoup

I want to print the text inside the class.
This is the HTML snippet (it is nested inside many classes, but visually it is next to "Prestige"):
<div class="sc-ikPAkQ ceimHt">
9882
</div>
This is my code:
from bs4 import BeautifulSoup
import requests
URL = "https://auntm.ai/champions/abomination/tier/6"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
for data in soup.findAll('div', attrs={"class": "sc-ikPAkQ ceimHt"}):
    print(data)
I want to print the integer 9882 from the class
I tried but I failed.
How do I do so?
Unlike a typical static webpage, the main content of this webpage is loaded dynamically with JavaScript.
That is, the response body (page.content) won't contain all the content you finally see. Instead, when you access the webpage in a browser, the browser executes that JavaScript, which then updates the HTML with data from other sources (typically another API call, or data hardcoded in the script itself). In other words, the final HTML shown in the browser's DOM inspector differs from what you get with requests.get. (You can verify this by printing page.content, or by clicking "View Page Source" in the right-click menu on the page.)
General ways to handle this case are either:
Turn to selenium for help. Selenium is essentially a programmatically controlled Web browser (optionally without a visible window) that lets the JS code execute and render the webpage as normal.
Inspect JS codes and/or additional network requests on that page to extract the data source. It requires some experience and knowledge with Web dev or JS.
You can get the text by calling .get_text(), i.e.
for data in soup.findAll('div', attrs={"class": "sc-ikPAkQ ceimHt"}):
    print(data.get_text())
Check getText() vs text() vs get_text() for different ways of getting the text (and Get text of children in a div with beautifulsoup for answer to your question)
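For reference, the difference is easy to check offline against the exact snippet from the question:

```python
from bs4 import BeautifulSoup

html = '<div class="sc-ikPAkQ ceimHt">\n9882\n</div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="ceimHt")
print(div.get_text(strip=True))  # 9882 (surrounding whitespace stripped)
```

Of course, this only works once the div is actually present in the HTML, which is the dynamic-loading problem described above.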

Python Requests Library - Scraping separate JSON and HTML responses from POST request

I'm new to web scraping, programming, and StackOverflow, so I'll try to phrase things as clearly as I can.
I'm using the Python requests library to try to scrape some info from a local movie theatre chain. When I look at the Chrome developer tools response/preview tabs in the network section, I can see what appears to be very clean and useful JSON:
However, when I try to use requests to obtain this same info, I instead get the entire page content (pages upon pages of HTML). Upon further inspection of the cascade in the Chrome developer tools, I can see there are two events called GetNowPlayingByCity: one contains the JSON info while the other seems to be the HTML.
JSON Response
HTML Response
How can I separate the two and only obtain the JSON response using the Python requests library?
I have already tried modifying the headers within requests.post (the Chrome developer tools indicate this is a post method) to include "accept: application/json, text/plain, */*" but didn't see a difference in the response I was getting with requests.post. As it stands I can't parse any JSON from the response I get with requests.post and get the following error:
"json.decoder.JSONDecodeError: Expecting value: line 4 column 1 (char 3)"
I can always try to parse the full HTML, but it's so long and complex I would much rather work with friendly JSON info. Any help would be much appreciated!
This is probably because the JavaScript the page sends to your browser is making a request to an API to get the JSON info about the movies.
You could either try sending the request directly to their API (see Edit 2), parse the HTML with a library like Beautiful Soup, or use a dedicated scraping library in Python. I've had great experiences with Scrapy; it is much faster than requests.
Edit:
If the page uses dynamically loaded content, which I think is the case, you'd have to use selenium with the PhantomJS browser instead of requests. (Note that PhantomJS has since been deprecated; headless Chrome or Firefox via Selenium works the same way.) Here is an example:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "your url"
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
# Then parse the html code here
Or you could load the dynamic content with scrapy
I recommend the latter if you want to get into scraping. It would take a bit more time to learn but it is a better solution.
Edit 2:
To make a request directly to their API you can just reproduce the request you see. Using Google Chrome, you can see the request if you click on it and go to 'Headers':
After that, you simply reproduce the request using the requests library:
import requests
import json
url = 'http://paste.the.url/?here='
response = requests.get(url)
content = response.content
# in my case content was a byte string
# (it looks like b'data' instead of 'data' when you print it)
# if this is your case, convert it to a string, like so
content_string = content.decode()
content_json = json.loads(content_string)
# do whatever you like with the data
You can modify the URL as you see fit; for example, if it is something like http://api.movies.com/?page=1&movietype=3, you could change movietype=3 to movietype=2 to see a different type of movie, etc.
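Rather than editing the query string by hand, the standard library can rewrite a parameter safely (the URL here is the made-up example from above, not a real endpoint):

```python
from urllib.parse import urlsplit, parse_qs, urlencode, urlunsplit

url = "http://api.movies.com/?page=1&movietype=3"
parts = urlsplit(url)
query = parse_qs(parts.query)
query["movietype"] = ["2"]  # swap in a different movie type
# Rebuild the URL with the modified query string
new_url = urlunsplit(parts._replace(query=urlencode(query, doseq=True)))
print(new_url)  # http://api.movies.com/?page=1&movietype=2
```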

Nothing return in prompt when Scraping Product data using BS4 and Request Python3

hope you're all doing good.
I am trying to scrape a specific product from https://footlocker.fr in order to get the product's data, such as the sizes available. The thing is, each time I try to run my script, nothing is returned.
Thank you in advance!
import requests
from bs4 import BeautifulSoup
url = 'https://www.footlocker.fr/fr/p/jordan-1-mid-bebe-chaussures-69677?v=316161155904'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
name_box = soup.find_all('div', attrs={'class':'fl-product-details--headline'})
size = soup.find_all('div', attrs={'class':'fl-size-316161155904-UE-21'})
for product in zip(name_box, size):
    name, size_tag = product
    name_proper = name.text.strip()
    size_proper = size_tag.text.strip()
    print(name_proper, '-', size_proper)
Okay, so I found a solution, but it is far from ideal. It is for the following link: https://www.footlocker.fr/fr/p/jordan-1-mid-bebe-chaussures-69677?v=316160178204. If you look at the resulting HTML in page.content, you will notice that the size details are not there. If you read through it a bit, you will see a bunch of references to AJAX, leading me to believe the page makes an AJAX call and pulls the information in after the initial load. (This is expected behaviour, as the stock of items can change over time.)
There are two ways to get your data.
You know the URL you are trying to fetch data from. The value after v= is the SKU of the product. For example, if the SKU is 316160178204 you can directly make a request to https://www.footlocker.fr/INTERSHOP/web/FLE/Footlocker-Footlocker_FR-Site/fr_FR/-/EUR/ViewProduct-ProductVariationSelect?BaseSKU=316160178204&InventoryServerity=ProductDetail
For each URL you request, you have to locate the <div> with class fl-load-animation, then get its data-ajaxcontent-url attribute, which in this case is https://www.footlocker.fr/INTERSHOP/web/FLE/Footlocker-Footlocker_FR-Site/fr_FR/-/EUR/ViewProduct-ProductVariationSelect?BaseSKU=316160178204&InventoryServerity=ProductDetail
Now you make a request to this new URL you have, and somewhere in that JSON, you will see values such as
<button class=\"fl-product-size--item fl-product-size--item__not-available\" type=\"button\"\n\n>\n<span>20</span>\n</button>
<button class=\"fl-product-size--item\" type=\"button\"\n\ndata-form-field-target=\"SKU\"\ndata-form-field-base-css-name=\"fl-product-size--item\"\ndata-form-field-value=\"316160178204050\"\ndata-form-field-unselect-group\n\ndata-testid=\"fl-size-316160178204-UE-21\"\ndata-product-size-select-item=\"316160178204050\"\n\n>\n<span>21</span>\n</button>
You will have to parse this snippet of data (you can use BeautifulSoup for it). You can see that the button has the class fl-product-size--item__not-available if that size is not available, and that the size value is in the <span>.
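A sketch of that parsing step, using the two button fragments above (unescaped here for readability):

```python
from bs4 import BeautifulSoup

# The two <button> fragments from the JSON response, unescaped for illustration
html = """
<button class="fl-product-size--item fl-product-size--item__not-available" type="button">
<span>20</span>
</button>
<button class="fl-product-size--item" type="button" data-testid="fl-size-316160178204-UE-21">
<span>21</span>
</button>
"""
soup = BeautifulSoup(html, "html.parser")
sizes = {}
for button in soup.find_all("button"):
    size = button.span.get_text(strip=True)
    # tag["class"] is a list of class tokens in bs4
    sizes[size] = "fl-product-size--item__not-available" not in button["class"]
print(sizes)  # {'20': False, '21': True}
```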
name_box is empty because you search for a <div>, but the element that contains the class fl-product-details--headline is an <h1>
size is empty because, as @Sri pointed out, there are some AJAX requests that insert that information into the page after the first request

Beautiful Soup page table parsing problem

I want to get the data (numbers) from this page. With those numbers I want to do some math.
My current code:
import requests
from bs4 import BeautifulSoup
result = requests.get("http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=45050389997905274")
c = result.content
soup = BeautifulSoup(c , features='lxml')
cld=soup.select("#d03")
print(cld)
Output: []
From the page-request I get this result:
<td id="d04" class="">2,105</td>
<td id="d03" class=""><span style="font-size:15px;font-weight:bold">2,147</span> <span style="font-size:11px;color:green">305 (16.56%)</span></td>
<td id="d05" class="">1,842</td>
From this result I only want the contents of those <td> IDs output.
The problem with that page is that its content is generated dynamically. At the time you fetch the HTML of the page, the actual elements haven't been generated yet (I suppose they are filled in by the JavaScript on the page). There are two ways you can approach this.
Try using selenium which simulates a browser. You can in fact wait for the response to be generated and then fetch the html element you want.
The other way would be just to see any network requests being done by the page to fetch the data. If it was not loaded in the html, surely there must be another API call made to their servers to fetch the data.
On an initial look, I can see that the data you need is being fetched from this URL: http://www.tsetmc.com/tsev2/data/instinfodata.aspx?i=45050389997905274&c=57+. The response looks like this:
12:29:48,A ,2150,2147,2105,1842,2210,2105,2700,53654226,115204065144,1,20190814,122948;98/5/23 16:30:51,F,261391.50,<div class='pn'>4294.29</div>,9596315531133973,3376955600,11101143554708,345522,F,2046434489,11459858578563,282945,F,12927,3823488480,235,;8#240000#2148#2159#500#1,1#600#2145#2160#198067#2,10#1000000#2141#2161#2000#1,;61157,377398,660897;;;;0;
You can figure out the parsing logic in detail by going through their code, I suppose. But it looks like you only need the value 2147.
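As a rough sketch of that parsing logic (the field positions are guesses from this single sample response, so verify them against live data before relying on them):

```python
# Opening fields of the sample response shown above; sections are separated
# by semicolons and fields within a section by commas
raw = ("12:29:48,A ,2150,2147,2105,1842,2210,2105,2700,"
       "53654226,115204065144,1,20190814,122948;98/5/23 16:30:51,F,...")
first_section = raw.split(";")[0]  # everything before the first semicolon
fields = first_section.split(",")
price = fields[3]                  # position of 2147 in this sample
print(price)  # 2147
```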
Perhaps this might work:
import requests
from bs4 import BeautifulSoup

result = requests.get("http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=45050389997905274")
c = result.content
soup = BeautifulSoup(c, features='lxml')
for tag in soup.find_all('td')[0:2]:
    print(tag.get('id'))
