Hi, I need to scrape a web page and extract the data-id values using a regular expression.
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("https://clarity-project.info/tenders/?entity=38163425&offset=100")
bsObj = BeautifulSoup(html, "html.parser")
DataId = bsObg.findAll("data-id", {"skr": re.compile("data-id=[0-9,a-f]")})
for DataId in DataId:
    print(DataId["skr"])
When I run my program in Jupyter, I get:
HTTPError: HTTP Error 403: Forbidden
It looks like the web server is asking you to authenticate before serving content to Python's urllib. However, they serve everything neatly to wget and curl, and https://clarity-project.info/robots.txt doesn't seem to exist, so I reckon scraping as such is fine with them. Still, it might be a good idea to ask them first.
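(As an aside, you can check robots.txt programmatically with the standard library's urllib.robotparser; note that a missing robots.txt is treated as allow-all:)

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://clarity-project.info/robots.txt")
rp.read()  # fetches the file; a 404 here means everything is allowed
print(rp.can_fetch("*", "https://clarity-project.info/tenders/"))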
As for the code, simply changing the User Agent string to something they like better seems to work:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib.request import urlopen, Request

request = Request(
    'https://clarity-project.info/tenders/?entity=38163425&offset=100',
    headers={
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0'})
html = urlopen(request).read().decode()
(Unrelated, but you have another mistake in your code: bsObj ≠ bsObg.)
EDIT: added code below to answer an additional question from the comments:
What you seem to need is to find the value of the data-id attribute, no matter to which tag it belongs. The code below does just that:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

url = 'https://clarity-project.info/tenders/?entity=38163425&offset=100'
agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 \
(KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
request = Request(url, headers={'User-Agent': agent})
html = urlopen(request).read().decode()
soup = BeautifulSoup(html, 'html.parser')
tags = soup.findAll(lambda tag: tag.get('data-id', None) is not None)
for tag in tags:
    print(tag['data-id'])
The key is simply to use a lambda expression as the parameter to BeautifulSoup's findAll function.
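(Equivalently, assuming your BeautifulSoup install has the usual CSS-selector support, an attribute selector on the soup object from the code above does the same thing:)

for tag in soup.select('[data-id]'):  # matches any tag that has a data-id attribute
    print(tag['data-id'])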
The server is likely blocking your requests because of the default user agent. You can change this so that you appear to the server to be a web browser. For example, a Chrome User-Agent is:
Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36
To add a User-Agent, create a Request object with the URL as a parameter and the User-Agent passed in a dictionary as the keyword argument headers.
See:
import urllib.request
r = urllib.request.Request(url, headers= {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
html = urllib.request.urlopen(r)
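(From there, if you want the data-id values as in the original question, a sketch continuing the snippet above, where html is the response object returned by urlopen:)

from bs4 import BeautifulSoup

soup = BeautifulSoup(html.read().decode(), 'html.parser')
for tag in soup.find_all(lambda t: t.has_attr('data-id')):  # any tag carrying data-id
    print(tag['data-id'])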
You could try this:

#!/usr/bin/env python
from bs4 import BeautifulSoup
import requests

url = 'your url here'
soup = BeautifulSoup(requests.get(url).text, "html.parser")
for i in soup.find_all('tr', attrs={'class': 'table-row'}):
    print('[Data id] => {}'.format(i.get('data-id')))

This should work!
Related
I have the following Python program, and when I execute it I get an HTTP 403: Forbidden error.
Here's the code:
import os
import re
from bs4 import BeautifulSoup
from urllib import request
import pandas as pd
import numpy as np
import datetime
import time
import openpyxl

symbols = ["A", "BBY", "WMT"]  # example tickers, taken from the test URLs below

for a in range(0, len(symbols), 1):
    """
    Attempt to get past the Forbidden error message.
    """
    url = "https://fintel.io/ss/us/" + symbols[a]
    """
    test urls:
    https://fintel.io/ss/us/A
    https://fintel.io/ss/us/BBY
    https://fintel.io/ss/us/WMT
    """
    print("Extracting Values for " + symbols[a] + ".")
    header = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    }
    try:
        page_request = request.Request(url, headers=header)
        page = request.urlopen(page_request)
        soup = BeautifulSoup(page, "html.parser", from_encoding="iso-8859-1")
    except Exception as e:
        print(e)  # this is what prints "HTTP Error 403: Forbidden"
The result I'm getting:
Extracting Values for A.
HTTP Error 403: Forbidden
Extracting Values for BBY.
HTTP Error 403: Forbidden
Extracting Values for WMT.
HTTP Error 403: Forbidden
Any suggestions on how to handle this?
That's not a BeautifulSoup-specific error; the website you're trying to scrape is probably protected by Cloudflare's anti-bot page. You can try using cfscrape to bypass this.
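(A minimal sketch, assuming cfscrape is installed via pip and that the site uses the classic Cloudflare JavaScript challenge, which cfscrape needs Node.js to solve:)

import cfscrape

scraper = cfscrape.create_scraper()  # behaves like a requests.Session
page = scraper.get("https://fintel.io/ss/us/BBY")
print(page.status_code)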
Your complaint is with Request rather than with BeautifulSoup: the website refuses to serve your requests.
When I use that UA it works fine, I receive a 200 document:
$ wget -U 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36' -S https://fintel.io/ss/us/BBY
But then again, even using UA of 'Wget/1.20.3 (darwin19.0.0)' works just fine:
$ wget -S https://fintel.io/ss/us/BBY
Looks like you've been doing excessive crawling, and your IP is now being blocked by the webserver.
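(If rate limiting is indeed the cause, one simple mitigation is to slow the loop down; a sketch using the standard time module, with a delay you would tune yourself:)

import time

symbols = ["A", "BBY", "WMT"]  # example tickers from the question
for symbol in symbols:
    # ... fetch and parse one page here ...
    time.sleep(5)  # pause between requests to stay under the server's tolerance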
I'm trying to automate a login using Python's requests module, but whenever I use a POST or GET request the server sends a 403 status code. The weird part is that I can access the same URL with any browser, but it just won't work with curl or requests.
Here is the code:
import requests
import lxml
from bs4 import BeautifulSoup
import os
url = "https://ais.usvisa-info.com/en-am/niv/users/sign_in"
req = requests.get(url).text
soup = BeautifulSoup(req, 'lxml')
ready = soup.prettify()
FILE = open("usvisa.html", "w")
FILE.write(ready)
FILE.close()
I'd appreciate any help or idea!
It's probably the /robots.txt that's blocking you.
Try overriding the user-agent with a custom one:
import requests
import lxml
from bs4 import BeautifulSoup
import os
headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"}
url = "https://ais.usvisa-info.com/en-am/niv/users/sign_in"
req = requests.get(url, headers=headers).text
soup = BeautifulSoup(req, 'lxml')
ready = soup.prettify()
FILE = open("usvisa.html", "w", encoding="utf-8")
FILE.write(ready)
FILE.close()
You also didn't specify the file encoding when opening the file; I added that above.
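(For what it's worth, a with block also closes the file for you, so the last three lines could be written as:)

with open("usvisa.html", "w", encoding="utf-8") as f:
    f.write(ready)  # the file is closed automatically when the block exits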
I am trying to get a value from a website using Beautiful Soup, but it keeps returning None. This is what I have for my code so far:
import bs4
import requests

def getmarketchange():
    source = requests.get("https://www.coinbase.com/price").text
    soup = bs4.BeautifulSoup(source, "lxml")
    marketchange = soup.get("MarketHealth__Percent-sc-1a64a42-2.bEMszd")
    print(marketchange)

getmarketchange()
Attached is a screenshot of the HTML code I was trying to grab.
Thank you in advance for your help!
Have a look at the HTML source returned from your get() request - it's a CAPTCHA challenge. You won't be able to get to the Coinbase pricing without passing this challenge.
Excerpt:
<h2 class="cf-subheadline"><span data-translate="complete_sec_check">
Please complete the security check to access</span> www.coinbase.com
</h2>
<div class="cf-section cf-highlight cf-captcha-container">
Coinbase is recognizing that the HTTP request isn't coming from a standard browser-based user, and it is challenging the requester. BeautifulSoup doesn't have a way to pass this check on its own.
Passing in User-Agent headers (to mimic a browser request) also doesn't resolve this issue.
For example:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
source = requests.get("https://www.coinbase.com/price", headers=headers).text
You might find better luck with Selenium, although I'm not sure about that.
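(A bare-bones Selenium sketch, assuming Chrome and a matching chromedriver on your PATH; whether it passes Cloudflare's check is not guaranteed:)

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.coinbase.com/price")
html = driver.page_source  # the rendered HTML after the browser runs the page
driver.quit()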
To avoid the CAPTCHA page, try specifying a User-Agent header:
import requests
from bs4 import BeautifulSoup
url = 'https://www.coinbase.com/price'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
print(soup.select_one('[class^="MarketHealth__Percent"]').text)
Prints:
0.65%
from bs4 import BeautifulSoup
import requests
web_url = r'https://www.mlb.com/scores/2019-05-12'
get_web = requests.get(web_url).text
soup = BeautifulSoup(get_web,"html.parser")
score = soup.find_all('div',class_='container')
print(score)
I want to find the scores shown on the page in my browser, but the result I get is different.
Send headers with your request to tell the server "hey, I'm a desktop browser", so it returns the same HTML it serves to a browser:
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
request = requests.get(url, headers={'User-Agent': user_agent})
Useful links:
How to use Python requests to fake a browser visit?
Sending "User-agent" using Requests library in Python
The code below gets stuck after printing hi in the output. Can you please check what is wrong with it? Is the site secured so that I need some special authentication?
from bs4 import BeautifulSoup
import requests

print('hi')
rooturl = 'http://www.hoovers.com/company-information/company-search.html'
r = requests.get(rooturl)
print('hi1')
soup = BeautifulSoup(r.content, "html.parser")
print('hi2')
print(soup)
You get this problem because the website thinks you are a robot, so it won't send you anything; it may even hang the connection and let you wait forever. If you imitate a browser's request, the server will assume you are not a robot.
Adding headers is the simplest way to deal with this problem, but sometimes you should not pass the User-Agent only (as in this case). Copy your browser's actual request and remove the useless elements through testing. If you're feeling lazy, use your browser's headers directly, but you must not copy all of them when uploading files:
from bs4 import BeautifulSoup
import requests

rooturl = 'http://www.hoovers.com/company-information/company-search.html'
with requests.Session() as se:
    se.headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
        "Accept-Encoding": "gzip, deflate",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en"
    }
    resp = se.get(rooturl)

print(resp.content)
soup = BeautifulSoup(resp.content, "html.parser")
I was having the same issue as you: the request just sat there. I tried adding a user-agent, and it pulled the page relatively quickly. I don't know why that is, though.
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
print('hi')
rooturl = 'http://www.hoovers.com/company-information/company-search.html'
r = requests.get(rooturl, headers=headers)
print('hi1')
soup = BeautifulSoup(r.content, "html.parser")
print('hi2')
print(soup)
EDIT: So odd. Now it's not working for me again. It first didn't work, then it did, and now it doesn't. But there is another potential option: Selenium.
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('http://www.hoovers.com/company-information/company-search.html')
r = browser.page_source
print('hi1')
soup = BeautifulSoup(r, "html.parser")
print('hi2')
print(soup)
browser.close()
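(If you don't want a visible browser window, a headless variant is possible; a sketch assuming a reasonably recent Chrome and chromedriver:)

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window
browser = webdriver.Chrome(options=options)
browser.get('http://www.hoovers.com/company-information/company-search.html')
print(browser.page_source)
browser.quit()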