I'm trying to log in to this website: https://archiwum.polityka.pl/sso/loginform to scrape some articles.
Here is my code:
import requests
from bs4 import BeautifulSoup

login_url = 'https://archiwum.polityka.pl/sso/loginform'
base_url = 'http://archiwum.polityka.pl'
payload = {"username": "XXXXX", "password": "XXXXX"}
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:61.0) Gecko/20100101 Firefox/61.0"}

with requests.Session() as session:
    # Login...
    request = session.get(login_url, headers=headers)
    post = session.post(login_url, data=payload)
    # Now I want to go to the page with a specific article
    article_url = 'https://archiwum.polityka.pl/art/na-kanapie-siedzi-len,393566.html'
    request_article = session.get(article_url, headers=headers)
    # Scrape its content
    soup = BeautifulSoup(request_article.content, 'html.parser')
    content = soup.find('p', {'class': 'box_text'}).find_next_sibling().text.strip()
    # And print it.
    print(content)
But my output is like this:
... [pełna treść dostępna dla abonentów Polityki Cyfrowej]
Which in my native language means:
... [full content available for subscribers of the Polityka Cyfrowa]
My credentials are correct: from the browser I have full access to the content, but not with Requests.
I will be grateful for any suggestions as to how I can do this with Requests. Or do I have to use Selenium for this?
I can help you with the login procedure. The rest, I suppose, you can manage yourself. Your payload doesn't contain all the information needed to get a valid response. Fill in the two fields username and password in the script below and run it. You should see your profile name, the same one you see in the browser when you are logged in to that webpage.
import requests
from bs4 import BeautifulSoup

payload = {
    'username': 'username here',
    'password': 'your password here',
    'login_success': 'http://archiwum.polityka.pl',
    'login_error': 'http://archiwum.polityka.pl/sso/loginform?return=http%3A%2F%2Farchiwum.polityka.pl'
}

with requests.Session() as session:
    session.headers = {"User-Agent": "Mozilla/5.0"}
    page = session.post('https://www.polityka.pl/sso/login', data=payload)
    soup = BeautifulSoup(page.text, "lxml")
    profilename = soup.select_one("#container p span.border").text
    print(profilename)
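Once the login above succeeds, the same session can be reused to fetch and parse the article. A minimal sketch, to be appended inside the with block, assuming the article URL and the box_text selector from your own question still match the page:

    # Reuse the logged-in session from the script above.
    article_url = 'https://archiwum.polityka.pl/art/na-kanapie-siedzi-len,393566.html'
    r = session.get(article_url)
    soup = BeautifulSoup(r.text, "lxml")
    # Selector taken from the question; adjust it if the markup differs.
    content = soup.find('p', {'class': 'box_text'}).find_next_sibling().text.strip()
    print(content)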
Related
I've created a script to log in to LinkedIn using requests. The script is doing fine.
After logging in, I used this URL https://www.linkedin.com/groups/137920/ to scrape the name Marketing Intelligence Professionals from there, which you can see in this image.
The script can parse the name flawlessly. However, what I wish to do now is scrape the link connected to the See all button located at the bottom of that very page, shown in this image.
Group link (you have to log in to access the content)
This is what I've created so far (it can scrape the name shown in the first image):
import json
import requests
from bs4 import BeautifulSoup

link = 'https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin'
post_url = 'https://www.linkedin.com/checkpoint/lg/login-submit'
target_url = 'https://www.linkedin.com/groups/137920/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['session_key'] = 'your email'          # put your username here
    payload['session_password'] = 'your password'  # put your password here
    r = s.post(post_url, data=payload)
    r = s.get(target_url)
    soup = BeautifulSoup(r.text, "lxml")
    items = soup.select_one("code:contains('viewerGroupMembership')").get_text(strip=True)
    print(json.loads(items)['data']['name']['text'])
How can I scrape the link connected to See all button from there?
There is an internal REST API which is called when you click on "See All":
GET https://www.linkedin.com/voyager/api/search/blended
The keywords query parameter contains the title of the group you requested initially (the group title on the initial page).
In order to get the group name, you could scrape the HTML of the initial page, but there is an API which returns the group information when you give it the group ID:
GET https://www.linkedin.com/voyager/api/groups/groups/urn:li:group:GROUP_ID
The group ID in your case is 137920, which can be extracted from the URL directly.
An example:
import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urlencode

username = 'your username'
password = 'your password'

link = 'https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin'
post_url = 'https://www.linkedin.com/checkpoint/lg/login-submit'
target_url = 'https://www.linkedin.com/groups/137920/'

group_res = re.search('.*/(.*)/$', target_url)
group_id = group_res.group(1)

with requests.Session() as s:
    # login
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['session_key'] = username
    payload['session_password'] = password
    r = s.post(post_url, data=payload)

    # API
    csrf_token = s.cookies.get_dict()["JSESSIONID"].replace("\"", "")
    r = s.get(f"https://www.linkedin.com/voyager/api/groups/groups/urn:li:group:{group_id}",
              headers={
                  "csrf-token": csrf_token
              })
    group_name = r.json()["name"]["text"]
    print(f"searching data for group {group_name}")

    params = {
        "count": 10,
        "keywords": group_name,
        "origin": "SWITCH_SEARCH_VERTICAL",
        "q": "all",
        "start": 0
    }
    r = s.get(f"https://www.linkedin.com/voyager/api/search/blended?{urlencode(params)}&filters=List(resultType-%3EGROUPS)&queryContext=List(spellCorrectionEnabled-%3Etrue)",
              headers={
                  "csrf-token": csrf_token,
                  "Accept": "application/vnd.linkedin.normalized+json+2.1",
                  "x-restli-protocol-version": "2.0.0"
              })
    result = r.json()["included"]
    print(result)

    print("list of groupName/link")
    print([
        (t["groupName"], f'https://www.linkedin.com/groups/{t["objectUrn"].split(":")[3]}')
        for t in result
    ])
A few notes:
these API calls require the session cookies
these API calls require a specific header carrying an XSRF token, whose value is the same as the JSESSIONID cookie value
a special media type application/vnd.linkedin.normalized+json+2.1 is necessary for the search call
the parentheses inside the queryContext and filters fields shouldn't be URL-encoded, otherwise these params will not be taken into account
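To make that last note concrete, here is a minimal sketch of how such a URL can be assembled: the ordinary params go through urlencode, while the List(...) fields are appended verbatim so their parentheses stay literal (the keywords value is just an example):

from urllib.parse import urlencode

base = "https://www.linkedin.com/voyager/api/search/blended"
params = {"count": 10, "keywords": "Marketing Intelligence Professionals", "q": "all", "start": 0}
# urlencode() would escape the parentheses, so the List(...) filter is appended by hand.
url = f"{base}?{urlencode(params)}&filters=List(resultType-%3EGROUPS)"
print(url)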
You can try Selenium: click the See all button, then scrape the linked content:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()  # configure as needed, e.g. options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://www.linkedin.com/xxxx')
driver.find_element(By.NAME, 's_image').click()
selenium docs: https://selenium-python.readthedocs.io/
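Since the goal is the link behind the See all button, here is a hedged alternative sketch: once logged in via Selenium, you could read the button's href directly instead of clicking it. The By.PARTIAL_LINK_TEXT locator and the 'See all' text are assumptions about the page's markup:

from selenium.webdriver.common.by import By

# Assumption: the "See all" control is an anchor whose visible text contains "See all".
see_all = driver.find_element(By.PARTIAL_LINK_TEXT, 'See all')
print(see_all.get_attribute('href'))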
I am trying to web scrape a certain part of the etherscan site with Python, since there is no API for this functionality. Basically, one goes to this link and presses Verify; after doing so, a popup comes up, which you can see here. What I need to scrape is this part, 0x0882477e7895bdc5cea7cb1552ed914ab157fe56, in case the message starts with the message seen in the picture.
I've written the Python script below that starts this off, but I don't know how to interact further with the site in order to bring that popup to the foreground and scrape the information. Is this possible to do?
from bs4 import BeautifulSoup
from requests import get

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0', 'X-Requested-With': 'XMLHttpRequest'}
url = "https://etherscan.io/proxyContractChecker?a=0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48"
response = get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
Thank You
import requests
from bs4 import BeautifulSoup


def Main(url):
    with requests.Session() as req:
        r = req.get(url, headers={'User-Agent': 'Ahmed American :)'})
        soup = BeautifulSoup(r.content, 'html.parser')
        vs = soup.find("input", id="__VIEWSTATE").get("value")
        vsg = soup.find("input", id="__VIEWSTATEGENERATOR").get("value")
        ev = soup.find("input", id="__EVENTVALIDATION").get("value")
        data = {
            '__VIEWSTATE': vs,
            '__VIEWSTATEGENERATOR': vsg,
            '__EVENTVALIDATION': ev,
            'ctl00$ContentPlaceHolder1$txtContractAddress': '0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48',
            'ctl00$ContentPlaceHolder1$btnSubmit': "Verify"
        }
        r = req.post("https://etherscan.io/proxyContractChecker?a=0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48",
                     data=data, headers={'User-Agent': 'Ahmed American :)'})
        soup = BeautifulSoup(r.content, 'html.parser')
        token = soup.find("div", class_="alert alert-success").text.split(" ")[-1]
        print(token)


Main("https://etherscan.io/proxyContractChecker")
Output:
0x0882477e7895bdc5cea7cb1552ed914ab157fe56
I disagree with #InfinityTM. Usually the workflow followed for this kind of problem is to make a POST request to the website.
Look, if you click on Verify, a POST request is made to the website, as shown in this image:
This POST request was made with these headers:
and with these params:
You need to figure out how to send this POST request with the correct URL, headers, params, and cookies. Once you have managed to make the request, you will receive the response:
which contains the information you want to scrape under the div with class "alert alert-success":
Summary
So the steps you need to follow are:
Navigate to your website, and gather all the information (request URL, Cookies, headers, and params) that you will need for your POST request.
Make the request with the requests library.
Once you get a <200> response, scrape the data you are interested in with BS.
Please let me know if this points you in the right direction! :D
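For completeness, here is a compact, hedged sketch of that workflow against the proxyContractChecker form; the hidden field names (__VIEWSTATE and friends) are taken from the other answer on this page and may change:

import requests
from bs4 import BeautifulSoup

check_url = "https://etherscan.io/proxyContractChecker"
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0'
    # Step 1: GET the page to collect cookies plus the hidden ASP.NET form fields.
    soup = BeautifulSoup(s.get(check_url).content, 'html.parser')
    data = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    # Step 2: fill in the visible fields and POST the form back.
    data['ctl00$ContentPlaceHolder1$txtContractAddress'] = '0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48'
    data['ctl00$ContentPlaceHolder1$btnSubmit'] = 'Verify'
    r = s.post(check_url, data=data)
    # Step 3: on a <200> response, scrape the data inside the success alert.
    alert = BeautifulSoup(r.content, 'html.parser').find('div', class_='alert alert-success')
    print(alert.text if alert else 'no success alert found')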
I'm trying to scrape the information inside an 'iframe' tag. When I execute this code, it says that 'USER_AGENT' is not defined. How can I fix this?
import requests
from bs4 import BeautifulSoup
page = requests.get("https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances" + "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000", headers=USER_AGENT, timeout=5)
soup = BeautifulSoup(page.content, "html.parser")
test = soup.find_all('iframe')
The error is telling you clearly what is wrong. You are passing USER_AGENT in as headers, but you have not defined it earlier in your code. Take a look at this post on how to use headers with the get method.
The documentation states you must pass in a dictionary of HTTP headers for the request, whereas you have passed in an undefined variable USER_AGENT.
From the Requests Library API:
headers = None
Case-insensitive Dictionary of Response Headers.
For example, headers['content-encoding'] will return the value of a 'Content-Encoding' response header.
EDIT:
For a better explanation of Content-Type headers, see this SO post. See also this WebMasters post which explains the difference between Accept and Content-Type HTTP headers.
Since you only seem to be interested in scraping the iframe tags, you may simply omit the headers argument entirely and you should see the results if you print out the test object in your code.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances" + "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000", timeout=10)
soup = BeautifulSoup(page.content, "lxml")
test = soup.find_all('iframe')
for tag in test:
    print(tag)
We have to provide a User-Agent. HERE's a link to the fake user agents.
import requests
from bs4 import BeautifulSoup

USER_AGENT = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/53'}

url = "https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances"
token = "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000"

page = requests.get(url + token, headers=USER_AGENT, timeout=5)
soup = BeautifulSoup(page.content, "html.parser")
test = soup.find_all('iframe')
You can simply not use a User-Agent at all. Code:
import requests
from bs4 import BeautifulSoup
url = "https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances"
token = "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000"
page = requests.get(url + token, timeout=5)
soup = BeautifulSoup(page.content, "html.parser")
test = soup.find_all('iframe')
I've separated your URL into two parts for readability purposes. That's why there are two variables, url and token.
I need a little help with my little project for learning Python web scraping.
from bs4 import BeautifulSoup
import urllib.parse
import urllib.request
import http.cookiejar

base_url = "https://login.yahoo.com/config/login?.src=flickrsignin&.pc=8190&.scrumb=0&.pd=c%3DH6T9XcS72e4mRnW3NpTAiU8ZkA--&.intl=in&.lang=en&mg=1&.done=https%3A%2F%2Flogin.yahoo.com%2Fconfig%2Fvalidate%3F.src%3Dflickrsignin%26.pc%3D8190%26.scrumb%3D0%26.pd%3Dc%253DJvVF95K62e6PzdPu7MBv2V8-%26.intl%3Din%26.done%3Dhttps%253A%252F%252Fwww.flickr.com%252Fsignin%252Fyahoo%252F%253Fredir%253Dhttps%25253A%25252F%25252Fwww.flickr.com%25252F"
login_action = "/config/login?.src=flickrsignin&.pc=8190&.scrumb=0&.pd=c%3DH6T9XcS72e4mRnW3NpTAiU8ZkA--&.intl=in&.lang=en&mg=1&.done=https%3A%2F%2Flogin.yahoo.com%2Fconfig%2Fvalidate%3F.src%3Dflickrsignin%26.pc%3D8190%26.scrumb%3D0%26.pd%3Dc%253DJvVF95K62e6PzdPu7MBv2V8-%26.intl%3Din%26.done%3Dhttps%253A%252F%252Fwww.flickr.com%252Fsignin%252Fyahoo%252F%253Fredir%253Dhttps%25253A%25252F%25252Fwww.flickr.com%25252F"

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent',
                      ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_7) '
                       'AppleWebKit/535.1 (KHTML, like Gecko) '
                       'Chrome/13.0.782.13 Safari/535.1'))
                     ]

login_data = urllib.parse.urlencode({
    'login-username': 'username',
    'login-passwd': 'password',
    'remember_me': True
})
login_data = login_data.encode('ascii')
login_url = base_url + login_action
response = opener.open(login_url, login_data)
print(response.read())
I have tried logging in, but the output returned is the login page HTML. Could anyone help me log in to this site?
Try reading more on requests with BeautifulSoup. User[email] is just the name of the username input, and User[password] that of the password input. Note that the code below can only log in to a site without CSRF token protection.
import requests
from requests.packages.urllib3 import add_stderr_logger
from bs4 import BeautifulSoup

add_stderr_logger()  # log the underlying urllib3 requests to stderr

url = 'http://example.com/login'  # placeholder: put the site's login URL here

session = requests.Session()
per_session = session.post(url,
                           data={'User[email]': 'your_email', 'User[password]': 'your_password'})

# You can now combine requests with BeautifulSoup.
try:
    # It is assumed that by now you are logged in, so we can .get() and fetch any page of your choice.
    bsObj = BeautifulSoup(session.get(url).content, 'lxml')
except requests.exceptions.RequestException as e:
    print(e)
You are not storing the session token received on login. Instead of doing that manually, you can use mechanize to handle the sign-in session.
Here is a nice article on how to do that.
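For reference, a minimal mechanize sketch, assuming the login form is the first form on the page and uses the login-username / login-passwd field names from the question:

import mechanize

login_url = 'https://login.yahoo.com/config/login?...'  # the full login URL from the question

br = mechanize.Browser()
br.set_handle_robots(False)  # skip robots.txt checks
br.addheaders = [('User-agent', 'Mozilla/5.0')]

br.open(login_url)
br.select_form(nr=0)  # assumption: the login form is the first form on the page
br['login-username'] = 'username'
br['login-passwd'] = 'password'
response = br.submit()  # mechanize keeps the session cookies for you
print(response.read())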
I'm trying to log in to Wikipedia using a Python script, but despite following the instructions here, I just can't get it to work.
import urllib
import urllib2
import cookielib
username = 'myname'
password = 'mypassword'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6")]
login_data = urllib.urlencode({'wpName' : username, 'wpPassword' : password})
opener.open('http://en.wikipedia.org/w/index.php?title=Special:UserLogin', login_data)
resp = opener.open('http://en.wikipedia.org/wiki/Special:Watchlist')
All I get is the "You're not logged in" page. I tried logging in to another site with the script with the same negative result. I suspect it's either got something to do with cookies, or I'm missing something incredibly simple here. But I just cannot find it.
If you inspect the raw request sent to the login URL (with the help of a tool such as Charles Proxy), you will see that it actually sends 4 parameters: wpName, wpPassword, wpLoginAttempt and wpLoginToken. The first 3 are static and you can fill them in anytime; the 4th one, however, needs to be parsed from the HTML of the login page. You will need to post this parsed value, in addition to the other 3, to the login URL to be able to log in.
Here is the working code using Requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup as bs


def get_login_token(raw_resp):
    soup = bs(raw_resp.text, 'lxml')
    token = [n.get('value', '') for n in soup.find_all('input')
             if n.get('name', '') == 'wpLoginToken']
    return token[0]


payload = {
    'wpName': 'my_username',
    'wpPassword': 'my_password',
    'wpLoginAttempt': 'Log in',
    #'wpLoginToken': '',
}

with requests.session() as s:
    resp = s.get('http://en.wikipedia.org/w/index.php?title=Special:UserLogin')
    payload['wpLoginToken'] = get_login_token(resp)
    response_post = s.post('http://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login',
                           data=payload)
    response = s.get('http://en.wikipedia.org/wiki/Special:Watchlist')
Adding these two lines:
r = bs(response.content, 'lxml')
print(r.get_text())
I should be able to understand if I'm logged in or not, right? I keep seeing "Please log in to view or edit items on your watchlist." but I'm using the clean code given above, with my login and password.
Where is the mistake?
Wikipedia now forces HTTPS and requires other parameters, and wpLoginAttempt became wploginattempt; here is an updated version of K Z's initial answer:
import requests
from bs4 import BeautifulSoup as bs


def get_login_token(raw_resp):
    soup = bs(raw_resp.text, 'lxml')
    token = [n.get('value', '') for n in soup.find_all('input')
             if n.get('name', '') == 'wpLoginToken']
    return token[0]


payload = {
    'wpName': 'my_username',
    'wpPassword': 'my_password',
    'wploginattempt': 'Log in',
    'wpEditToken': "+\\",
    'title': "Special:UserLogin",
    'authAction': "login",
    'force': "",
    'wpForceHttps': "1",
    'wpFromhttp': "1",
    #'wpLoginToken': '',
}

with requests.session() as s:
    resp = s.get('https://en.wikipedia.org/w/index.php?title=Special:UserLogin')
    payload['wpLoginToken'] = get_login_token(resp)
    response_post = s.post('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login',
                           data=payload)
    response = s.get('https://en.wikipedia.org/wiki/Special:Watchlist')
You need to add the header Content-Type: application/x-www-form-urlencoded to your POST request.
I also added the following lines, and I see myself as not logged in. (Note: str.find returns -1 when the text is absent, which is truthy, so a membership test is the safer check.)
page = response.text
if 'Not logged in' in page:
    print('You are not logged in. :(')
else:
    print('YOU ARE LOGGED IN! :)')
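If you do want to set that header explicitly, a minimal sketch, replacing the response_post line inside the with block of the updated answer above (note that requests already sends application/x-www-form-urlencoded on its own when you pass a dict to data=, so this is usually redundant):

    response_post = s.post(
        'https://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login',
        data=payload,
        # Explicit Content-Type as suggested; requests sets this automatically for form data.
        headers={'Content-Type': 'application/x-www-form-urlencoded'})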