I am trying to scrape the tbody under the following element of https://steamdb.info/app/730/graphs/ (I have permission to scrape the site):
<div id="chart-month-breakdown" class="table-responsive">
However, when trying to scrape the content or access it through Selenium, I can't because it appears as such:
<div id="chart-month-breakdown" class="table-responsive" hidden="">
The hidden attribute only disappears when I browse the page manually, so I can't scrape the content through requests.get.
Is there a way to get the content?
If you are good with stats, here are all the stats that table uses. The site fetches the table data from an API, so you can make the API call directly:
import requests

headers = {
    'referer': 'https://steamdb.info/app/730/graphs/',
    'x-requested-with': 'XMLHttpRequest',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
}
a = requests.get('https://steamdb.info/api/GetGraph/?type=concurrent_week&appid=730', headers=headers).text
print(a)
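The endpoint returns JSON, so you can also parse it directly instead of printing raw text. A minimal sketch, assuming the response body is valid JSON (the exact keys depend on the graph type):
data = requests.get('https://steamdb.info/api/GetGraph/?type=concurrent_week&appid=730', headers=headers).json()
print(data.keys())  # inspect the top-level structure before drilling down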
If you are using Selenium: I tried removing the hidden attribute and that got me all the data. You can execute JavaScript through Selenium to make that div visible. Try something like this:
javaScript = "document.getElementById('chart-month-breakdown').removeAttribute('hidden');"
driver.execute_script(javaScript)
time.sleep(3)  # To let the data populate
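For completeness, here is a minimal end-to-end sketch, assuming a local Chrome driver and that the table is rendered into the tbody once the div is unhidden:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://steamdb.info/app/730/graphs/')
driver.execute_script("document.getElementById('chart-month-breakdown').removeAttribute('hidden');")
time.sleep(3)  # give the table time to populate

soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.select_one('#chart-month-breakdown tbody')
print(table)
driver.quit()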
Related
There's this site called https://coolors.co and I want to grab the color palettes they generate programmatically. In the browser, I just click the button "Start the generator!". The link the button is attached to is https://coolors.co/generate. If I go to that url in the browser, the color palette is generated. Notice that the url is changed to https://coolors.co/092327-0b5351-00a9a5-4e8098-90c2e7 (that's an example - the last part of the url is just the hex codes). There is obviously a redirect.
But when I do this in Python with a get request, I am not redirected but stay on this intermediate site. When I look at r.text, it tells me "This domain doesn't exist and is for sale".
How do I fix this? How do I enable the redirect?
Here's the code:
import requests

url = 'https://coolors.co/generate'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
r = requests.get(url, headers=headers)
Thanks!
This website does not use an HTTP redirect.
It probably uses a JavaScript form of redirection, such as changing window.location.href. requests is not a browser, so it does not execute the JavaScript in the page you requested; hence there is no redirect.
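Since the redirect happens in JavaScript, you need something that executes it. A minimal sketch using Selenium, assuming a local Chrome driver (the generated hex codes show up in the final URL):
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://coolors.co/generate')
time.sleep(2)  # give the page's JavaScript a moment to rewrite the URL
print(driver.current_url)  # e.g. https://coolors.co/092327-0b5351-00a9a5-4e8098-90c2e7
driver.quit()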
So I am trying to scrape this website: https://www.auto24.ee
I was able to scrape data from it without any problems, but today it gives me "Response 403". I tried using proxies and passing more information in the headers, but unfortunately nothing seems to work. I could not find any solution on the internet, though I tried several methods.
The code that worked before without any problems:
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36',
}
page = requests.get("https://www.auto24.ee/", headers=headers)
print(page)
Meanwhile, the code here
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'}
page = requests.get("https://www.auto24.ee/", headers=headers)
print(page.text)
will now always return something like the following:
<div class="cf-section cf-wrapper">
<div class="cf-columns two">
<div class="cf-column">
<h2 data-translate="why_captcha_headline">Why do I have to complete a CAPTCHA?</h2>
<p data-translate="why_captcha_detail">Completing the CAPTCHA proves you are a human and gives you temporary access to the web property.</p>
</div>
<div class="cf-column">
<h2 data-translate="resolve_captcha_headline">What can I do to prevent this in the future?</h2>
<p data-translate="resolve_captcha_antivirus">If you are on a personal connection, like at home, you can
run an anti-virus scan on your device to make sure it is not infected with malware.</p>
The website is protected by Cloudflare. By standard means, there is minimal chance of being able to access it through automation such as requests or Selenium. You are seeing a 403 because your client is detected as a robot. There may be arbitrary methods to bypass Cloudflare floating around elsewhere, but the website is working as intended: a real browser submits a ton of data through headers and cookies showing the request is valid, and since you are submitting only a user agent, Cloudflare is triggered. Simply spoofing another user-agent is not even close to enough to avoid the captcha; Cloudflare checks for MANY things.
I suggest you look at Selenium here, since it simulates a real browser, or research guides on (possibly?) bypassing Cloudflare with requests.
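For illustration, a minimal Selenium sketch, assuming a local Chrome driver; note that stock Selenium is itself frequently fingerprinted by Cloudflare, so this may still land on the captcha page:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.auto24.ee/')
html = driver.page_source  # inspect this to see whether you got content or the captcha
driver.quit()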
Update
Found two Python libraries, cloudscraper and cfscrape. Neither works for this site, since it uses Cloudflare v2, unless you pay for a premium version.
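For reference, basic cloudscraper usage looks like the sketch below; it can solve Cloudflare v1 challenges, but per the update above it will not get past this site's Cloudflare v2:
import cloudscraper

scraper = cloudscraper.create_scraper()  # behaves like a requests.Session
page = scraper.get('https://www.auto24.ee/')
print(page.status_code)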
I'm trying to scrape the value of Balance from a webpage using the requests module. I've looked for the name Balance in dev tools and in the page source but found it nowhere. I hope there is a way to grab the value of Balance from that webpage without using any browser simulator.
Website address: the tronscan.org link used in the code below.
Output I'm after: the Balance value shown on the contract page.
I've tried with:
import requests
from bs4 import BeautifulSoup
link = 'https://tronscan.org/?fbclid=IwAR2WiSKZoTDPWX1ufaAIEg9vaA5oLj9Yd_RUfpjE6MWEQKRGBaK-L_JdtwQ#/contract/TCSPn1Lbdv62QfSCczbLdwupNoCFYAfUVL'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"}
res = requests.get(link, headers=headers)
soup = BeautifulSoup(res.text,'lxml')
balance = soup.select_one("li:has(> p:contains('Balance'))").get_text(strip=True)
print(balance)
The reason the page's HTML doesn't contain the balance is that the page makes AJAX requests which send back the information you want after the page has loaded. You can look at these requests by opening the developer tools (press F12 in Chrome; it may differ in other browsers) and going to the Network tab.
There you can see that the request you want is account?address= followed by the code in the page's URL; mousing over it shows the complete URL for the AJAX request, and the response preview shows the part that holds the data you want.
You can inspect the response by opening that URL directly and finding tokenBalances.
In order to get the balance in Python you can run the following:
import requests

url = 'https://apilist.tronscan.org/api/account?address=TCSPn1Lbdv62QfSCczbLdwupNoCFYAfUVL'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"}
response = requests.get(url, headers=headers).json()  # parse the JSON body directly
balance = response['tokenBalances'][0]['balance']
print(balance)
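If the account holds more than one token, the first entry may not be the one you want. A small follow-up sketch that prints every entry (any field names beyond balance are whatever the API happens to return):
for token in response['tokenBalances']:
    print(token)  # inspect each entry to find the token you care about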
I'm currently making a project which will get all the orders I've placed on Amazon, categorize them, and then write them to an Excel file. The problem is, when I try to scrape the page using bs4, I get None as the result.
I've made a similar project before, which searches Amazon for the product you want and then saves all the data about that product, like name, price, and reviews, in a JSON file.
That worked perfectly.
But this doesn't seem to work.
Here is the code:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
link = 'https://www.amazon.in/gp/your-account/order-history?opt=ab&digitalOrders=1&unifiedOrders=1&returnTo=&orderFilter=year-2020'
data = requests.get(link, headers=headers)
soup = BeautifulSoup(data.text, 'lxml')
product = soup.find('div', class_="a-box-group a-spacing-base order")
print(product)
I'm a beginner, but I think it's because I need to log in to get my details, even though my password is already saved in my browser.
Any help is appreciated.
Thanks
See this GitHub project for reference
Amazon, like most prominent companies, doesn't allow simple scraping and needs some form of auth.
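Your browser being logged in does not help requests, which starts with no cookies. A common workaround is to copy the session cookies from your logged-in browser into a requests session. A minimal sketch, assuming you grab the cookie values from your browser's dev tools (the cookie names below are placeholders, not necessarily Amazon's real ones):
import requests

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
# Paste name/value pairs copied from your logged-in browser (placeholder names)
session.cookies.update({'session-id': '...', 'session-token': '...'})

resp = session.get('https://www.amazon.in/gp/your-account/order-history')
print(resp.status_code)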
I'm new to web scraping and I am trying to use basic skills on Amazon. I want to write code that finds the top 10 'Today's Greatest Deals' with prices, ratings, and other information.
Every time I try to find a specific tag using find() with a class specified, it returns None. However, the actual HTML has that tag.
On manually scanning the output I found that half of the HTML isn't being displayed in the terminal. The body and html tags do close, but a huge chunk of code inside the body tag is missing.
The last line of code displayed is:
<!--[endif]---->
then body tag closes.
Here is the code that i'm trying:
from bs4 import BeautifulSoup as bs
import requests
source = requests.get('https://www.amazon.in/gp/goldbox?ref_=nav_topnav_deals')
soup = bs(source.text, 'html.parser')
print(soup.prettify())
# On printing this it misses some portion of the HTML
article = soup.find('div', class_ = 'a-row dealContainer dealTile')
print(article)
# On printing this it shows 'None'
Ideally, this should give me the code within the div tag so that I can continue further to get the name of the product. However, the output just shows None, and printing the whole soup reveals that a huge chunk of the HTML is missing.
And of course, the information I need is in the missing HTML.
Is Amazon blocking my request? Please help.
The User-Agent request header contains a characteristic string that allows the network protocol peers to identify the application type, operating system, software vendor or software version of the requesting software user agent. Validating User-Agent header on server side is a common operation so be sure to use valid browser’s User-Agent string to avoid getting blocked.
(Source: http://go-colly.org/articles/scraping_related_http_headers/)
The only thing you need to do is to set a legitimate user-agent, so add headers to emulate a browser:
# This is a standard user-agent of Chrome browser running on Windows 10
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
Example:
from bs4 import BeautifulSoup
import requests
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get('https://www.amazon.com', headers=headers).text
soup = BeautifulSoup(resp, 'html.parser')
...
<your code here>
Additionally, you can add another set of headers to look even more like a legitimate browser. Add some more headers like this:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language' : 'en-US,en;q=0.5',
'Accept-Encoding' : 'gzip',
'DNT' : '1', # Do Not Track Request Header
'Connection' : 'close'
}
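A minimal sketch applying this fuller header set to the deals page from the question; whether Amazon serves the complete page still depends on its bot detection:
from bs4 import BeautifulSoup
import requests

resp = requests.get('https://www.amazon.in/gp/goldbox?ref_=nav_topnav_deals', headers=headers)
soup = BeautifulSoup(resp.text, 'html.parser')
article = soup.find('div', class_='a-row dealContainer dealTile')
print(article)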