Getting error that USER_AGENT is not defined (Python 3)

I'm trying to scrape the information inside an 'iframe' tag. When I execute this code, it says that 'USER_AGENT' is not defined. How can I fix this?
import requests
from bs4 import BeautifulSoup
page = requests.get("https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances" + "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000", headers=USER_AGENT, timeout=5)
soup = BeautifulSoup(page.content, "html.parser")
test = soup.find_all('iframe')

The error is telling you exactly what is wrong: you are passing USER_AGENT as the headers argument, but you never define it anywhere in your code. Take a look at this post on how to use headers with the method.
The documentation states you must pass in a dictionary of HTTP headers for the request, whereas you have passed in an undefined variable USER_AGENT.
From the Requests Library API:
headers = None
Case-insensitive Dictionary of Response Headers.
For example, headers['content-encoding'] will return the value of a 'Content-Encoding' response header.
EDIT:
For a better explanation of Content-Type headers, see this SO post. See also this WebMasters post which explains the difference between Accept and Content-Type HTTP headers.
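In other words, the headers you pass to requests.get must be a dictionary you build yourself; the case-insensitive dictionary quoted above is what you read back from the response. A minimal sketch of the difference (httpbin.org is used here only as a throwaway test URL):
import requests

# the headers you *send* are a plain dict you define yourself,
# while response.headers is the case-insensitive dictionary quoted above
USER_AGENT = {"User-Agent": "Mozilla/5.0"}

page = requests.get("https://httpbin.org/get", headers=USER_AGENT, timeout=5)
print(page.headers["Content-Type"])  # a response header, read back from the server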
Since you only seem to be interested in scraping the iframe tags, you may simply omit the headers argument entirely and you should see the results if you print out the test object in your code.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances" + "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000", timeout=10)
soup = BeautifulSoup(page.content, "lxml")
test = soup.find_all('iframe')
for tag in test:
    print(tag)

We have to provide a User-Agent; here's a link to some fake user agents you can use.
import requests
from bs4 import BeautifulSoup
USER_AGENT = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = "https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances"
token = "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000"
page = requests.get(url + token, headers=USER_AGENT, timeout=5)
soup = BeautifulSoup(page.content, "html.parser")
test = soup.find_all('iframe')
Alternatively, you can simply not send a User-Agent at all:
import requests
from bs4 import BeautifulSoup
url = "https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances"
token = "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000"
page = requests.get(url + token, timeout=5)
soup = BeautifulSoup(page.content, "html.parser")
test = soup.find_all('iframe')
I've split your URL into url and token for readability; that's why there are two variables.

Related

How to Read JS generated Page in Python

Please note: this problem can be solved easily with the selenium library, but I don't want to use selenium since the host doesn't have a browser installed and I'm not willing to install one.
Important: I know that render() will download chromium at first time and I'm ok with that.
Q: How can I get the page source when it's generated by JS code? For example this HP printer:
220.116.57.59
Someone posted online and suggested using:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://220.116.57.59', timeout=3, verify=False)
base_url = r.url
r.html.render()
But printing r.text doesn't print the full page source; it indicates that JS is disabled:
<div id="pgm-no-js-text">
<p>JavaScript is required to access this website.</p>
<p>Please enable JavaScript or use a browser that supports JavaScript.</p>
</div>
Original Answer: https://stackoverflow.com/a/50612469/19278887 (last part)
Grab the config endpoints and then parse the XML for the data you want.
For example:
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0"
}

with requests.Session() as s:
    soup = (
        BeautifulSoup(
            s.get(
                "http://220.116.57.59/IoMgmt/Adapters",
                headers=headers,
            ).text,
            features="xml",
        ).find_all("io:HardwareConfig")
    )
    print("\n".join(c.find("MacAddress").getText() for c in soup if c.find("MacAddress") is not None))
Output:
E4E749735068
E4E74973506B
E6E74973D06B

Requests returns a status code of 429 for URL https://www.instagram.com/google

I'm trying to code an Instagram-webscraper in Python to return values like a person's followers, the number of posts etc.
Let's just take Google's Instagram-account for this example.
Here is my code:
import requests
from bs4 import BeautifulSoup
link = requests.get("https://www.instagram.com/google")
soup = BeautifulSoup(link.text, "html.parser")
print(soup)
print(link.status_code)
Pretty straightforward.
However, when I run the code, link.status_code is 429. It should be 200; for any other website it prints 200.
Also, when it prints soup, it doesn't show what I actually want: not the HTML for the account, but the HTML for the Instagram error page.
Why does requests open the instagram error page, not the account from the link provided?
To get the correct response from the server, set the User-Agent HTTP header:
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"
}
link = requests.get("https://www.instagram.com/google", headers=headers)
soup = BeautifulSoup(link.text, "lxml")
print(link.status_code)
print(soup.select_one('meta[name="description"]')["content"])
Prints:
200
12.5m Followers, 33 Following, 1,642 Posts - See Instagram photos and videos from Google (#google)
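Note also that 429 means "Too Many Requests": even with the right header, repeated scraping can still be rate-limited. A rough sketch of backing off when that happens (get_with_backoff is a hypothetical helper, not part of requests):
import time
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"
}

def get_with_backoff(url, retries=3):
    # retry a few times, honouring Retry-After if the server sends it
    for attempt in range(retries):
        response = requests.get(url, headers=headers)
        if response.status_code != 429:
            return response
        wait = int(response.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    return response

link = get_with_backoff("https://www.instagram.com/google")
print(link.status_code)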

python requests & beautifulsoup bot detection

I'm trying to scrape all the HTML elements of a page using requests & beautifulsoup. I'm using ASIN (Amazon Standard Identification Number) to get the product details of a page. My code is as follows:
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
response = urlopen(url)
soup = BeautifulSoup(response, "html.parser")
print(soup)
But the output doesn't show the entire HTML of the page, so I can't continue working with the product details.
Any help on this?
EDIT 1:
The given answer shows that I'm getting the markup of the bot-detection page. I researched a bit and found two ways to get around it:
I might need to add a header to the request, but I couldn't work out what the value of the header should be.
Use Selenium.
Now my question is: do both of these approaches work equally well?
It is easier to use fake_useragent here. It picks a random user agent based on real-world browser usage statistics. If you don't need dynamic content, you're almost always better off just requesting the page content over HTTP and parsing it programmatically.
import requests
from fake_useragent import UserAgent

ua = UserAgent()
hdr = {'User-Agent': ua.random,
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
response = requests.get(url, headers=hdr)
print(response.content)
Selenium is used for browser automation and for high-level web scraping of dynamic content.
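For completeness, a minimal Selenium sketch for this page might look like the following (assuming Chrome is installed locally; Selenium 4 resolves a matching driver itself, and real Amazon scraping would still need waits and tuned selectors):
from selenium import webdriver
from bs4 import BeautifulSoup

url = "http://www.amazon.com/dp/" + 'B004CNH98C'

driver = webdriver.Chrome()       # needs a local Chrome install
driver.get(url)
html = driver.page_source         # HTML after JavaScript has run
driver.quit()

soup = BeautifulSoup(html, "html.parser")
print(soup.title)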
As some of the comments already suggested, if you need to somehow interact with Javascript on a page, it is better to use selenium. However, regarding your first approach using a header:
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text,"html.parser")
These headers are a bit old, but should still work. By using them you are pretending that your request is coming from a normal web browser. If you use requests without such a header, your code is basically telling the server that the request is coming from Python, which most servers reject right away.
Another alternative is fake-useragent; you could also give that a try.
try this:
import requests
from bs4 import BeautifulSoup

url = "http://www.amazon.com/dp/" + 'B004CNH98C'
r = requests.get(url)

# option 1: print the raw HTML
# print(r.text)

# option 2: parse it with BeautifulSoup
soup = BeautifulSoup(r.text, "html.parser")
print(soup)

Unable to log in to the website with Requests

I'm trying to log in to this website: https://archiwum.polityka.pl/sso/loginform to scrape some articles.
Here is my code:
import requests
from bs4 import BeautifulSoup
login_url = 'https://archiwum.polityka.pl/sso/loginform'
base_url = 'http://archiwum.polityka.pl'
payload = {"username" : XXXXX, "password" : XXXXX}
headers = {"User-Agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:61.0) Gecko/20100101 Firefox/61.0"}
with requests.Session() as session:
    # Login...
    request = session.get(login_url, headers=headers)
    post = session.post(login_url, data=payload)
    # Now I want to go to the page with a specific article
    article_url = 'https://archiwum.polityka.pl/art/na-kanapie-siedzi-len,393566.html'
    request_article = session.get(article_url, headers=headers)
    # Scrape its content
    soup = BeautifulSoup(request_article.content, 'html.parser')
    content = soup.find('p', {'class' : 'box_text'}).find_next_sibling().text.strip()
    # And print it.
    print(content)
But my output is like this:
... [pełna treść dostępna dla abonentów Polityki Cyfrowej]
Which in my native language means:
... [full content available for subscribers of the Polityka Cyfrowa]
My credentials are correct because I have full access to the content from the browser but not with Requests.
I will be grateful for any suggestions as to how I can do this with Requests. Or do I have to use Selenium for this?
I can help you with the login procedure. The rest, I suppose, you can manage yourself. Your payload doesn't contain all the information necessary to fetch a valid response. Fill in the username and password fields in the script below and run it. You should see your name, just as you see it when you are logged in to that webpage in a browser.
import requests
from bs4 import BeautifulSoup
payload = {
    'username': 'username here',
    'password': 'your password here',
    'login_success': 'http://archiwum.polityka.pl',
    'login_error': 'http://archiwum.polityka.pl/sso/loginform?return=http%3A%2F%2Farchiwum.polityka.pl'
}

with requests.Session() as session:
    session.headers = {"User-Agent": "Mozilla/5.0"}
    page = session.post('https://www.polityka.pl/sso/login', data=payload)
    soup = BeautifulSoup(page.text, "lxml")
    profilename = soup.select_one("#container p span.border").text
    print(profilename)
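After that, the logged-in session should be able to fetch the article from your question; a rough continuation, still inside the with-block, reusing your own parsing code:
    # still inside the with-block: reuse the logged-in session for the article
    article_url = 'https://archiwum.polityka.pl/art/na-kanapie-siedzi-len,393566.html'
    article = session.get(article_url)
    soup = BeautifulSoup(article.content, 'html.parser')
    content = soup.find('p', {'class': 'box_text'}).find_next_sibling().text.strip()
    print(content)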

How to use urllib and re to retrieve live price data with Python

I am attempting to request the price data from dukascopy.com, but I am running into a similar problem to this user, where the price data itself is not part of the HTML. Therefore, when I run my basic urllib code to extract the data:
import urllib.request
url = 'https://www.dukascopy.com'
headers = {'User-Agent':'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
req = urllib.request.Request(url, headers = headers)
resp = urllib.request.urlopen(req)
respData = resp.read()
print(str(respData))
the price data cannot be found. Referring back to this post, the user Mark found another URL that the data was loaded from. Can the same approach be applied to collect the data here as well?
Try dryscrape. You can scrape JavaScript-rendered pages with it. Don't parse web pages with the re module; it's not a good idea. Read this on why you should not parse HTML pages with regex: HTML with regex. Use BeautifulSoup for parsing.
import dryscrape
from bs4 import BeautifulSoup

url = 'https://www.dukascopy.com'

session = dryscrape.Session()
session.visit(url)
response = session.body()

soup = BeautifulSoup(response, "html.parser")
print(soup)
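If dryscrape turns out to be hard to install, the same idea can be tried with requests-html (the library used in the printer question above); a rough sketch, keeping in mind that render() downloads Chromium on first use:
from requests_html import HTMLSession
from bs4 import BeautifulSoup

url = 'https://www.dukascopy.com'

session = HTMLSession()
r = session.get(url)
r.html.render()                    # runs the page's JavaScript (downloads Chromium on first use)

soup = BeautifulSoup(r.html.html, "html.parser")
print(soup.title)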
