I've created a script in Python to fetch a name that is populated after filling in an input on a webpage. Here is how you can get that name: after opening the webpage (the site link is given below), type 16803 next to CP Number and hit the search button.
I know how to grab it using Selenium, but I'm not interested in going that route. I'm trying to collect the name using the requests module. I tried to mimic within my script the steps I can see in the Chrome dev tools as the request is sent to the site. The only thing I can't supply automatically within the payload parameter is ScrollTop.
Website Link
This is my attempt:
import requests
from bs4 import BeautifulSoup
URL = "https://www.icsi.in/student/Members/MemberSearch.aspx"
with requests.Session() as s:
    r = s.get(URL)
    cookie_item = "; ".join([str(x)+"="+str(y) for x,y in r.cookies.items()])
    soup = BeautifulSoup(r.text,"lxml")

    payload = {
        'StylesheetManager_TSSM':soup.select_one("#StylesheetManager_TSSM")['value'],
        'ScriptManager_TSM':soup.select_one("#ScriptManager_TSM")['value'],
        '__VIEWSTATE':soup.select_one("#__VIEWSTATE")['value'],
        '__VIEWSTATEGENERATOR':soup.select_one("#__VIEWSTATEGENERATOR")['value'],
        '__EVENTVALIDATION':soup.select_one("#__EVENTVALIDATION")['value'],
        'dnn$ctlHeader$dnnSearch$Search':soup.select_one("#dnn_ctlHeader_dnnSearch_SiteRadioButton")['value'],
        'dnn$ctr410$MemberSearch$ddlMemberType':0,
        'dnn$ctr410$MemberSearch$txtCpNumber': 16803,
        'ScrollTop': 474,
        '__dnnVariable': soup.select_one("#__dnnVariable")['value'],
    }

    headers = {
        'Content-Type':'multipart/form-data; boundary=----WebKitFormBoundaryBhsR9ScAvNQ1o5ks',
        'Referer': 'https://www.icsi.in/student/Members/MemberSearch.aspx',
        'Cookie':cookie_item,
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
    }

    res = s.post(URL,data=payload,headers=headers)
    soup_obj = BeautifulSoup(res.text,"lxml")
    name = soup_obj.select_one(".name_head > span").text
    print(name)
When I execute the above script I get the following error:
AttributeError: 'NoneType' object has no attribute 'text'
How can I use requests to grab a name that is populated after filling in an input on a webpage?
The main issue with your code is the data encoding. You've set the Content-Type header to "multipart/form-data", but that alone is not enough to create multipart-encoded data. In fact, it is a problem, because the actual encoding is different: the data parameter URL-encodes the POST data. In order to create multipart-encoded data, you should use the files parameter.
You could do that either by passing an extra dummy parameter to files,
res = s.post(URL, data=payload, files={'file':''})
(that would change the encoding for all POST data, not just the 'file' field)
Or you could convert the values in your payload dictionary to tuples, which is the expected structure when posting files with requests.
payload = {k:(None, str(v)) for k,v in payload.items()}
The first value is for the file name; it is not needed in this case so I've set it to None.
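As a side illustration (a small sketch against the generic echo service httpbin.org, not part of the site above): a (None, value) tuple is encoded as a plain multipart form field, whereas a tuple that starts with a filename is encoded as a file upload.
import requests

# plain multipart form field: no filename, the server just sees the value
requests.post('https://httpbin.org/post', files={'cp_number': (None, '16803')})

# genuine file upload: filename, content and an optional content type are included
requests.post('https://httpbin.org/post',
              files={'report': ('report.csv', 'col1,col2\n1,2', 'text/csv')})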
Next, your POST data should contain an __EVENTTARGET value, which is required in order to get a valid response. (When creating the POST data dictionary, it is important to submit all the data that the server expects. We can get that data from a browser, either by inspecting the HTML form or by inspecting the network traffic.) The complete code:
import requests
from bs4 import BeautifulSoup
URL = "https://www.icsi.in/student/Members/MemberSearch.aspx"
with requests.Session() as s:
    r = s.get(URL)
    soup = BeautifulSoup(r.text,"lxml")

    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['dnn$ctr410$MemberSearch$txtCpNumber'] = 16803
    payload["__EVENTTARGET"] = 'dnn$ctr410$MemberSearch$btnSearch'
    payload = {k:(None, str(v)) for k,v in payload.items()}

    r = s.post(URL, files=payload)
    soup_obj = BeautifulSoup(r.text,"lxml")
    name = soup_obj.select_one(".name_head > span").text
    print(name)
After some more tests, I discovered that the server also accepts URL-encoded data (probably because there are no files posted). So you can get a valid response either with data or with files, provided that you don't change the default Content-Type header.
It is not necessary to add any extra headers. When using a Session object, cookies are stored and submitted by default. The Content-Type header is created automatically - "application/x-www-form-urlencoded" when using the data parameter, "multipart/form-data" using files. Changing the default User-Agent or adding a Referer is not required.
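To illustrate (a sketch reusing the session s and the plain payload dictionary from the code above, before the tuple-conversion step): either of these calls returns a valid response, and requests sets the Content-Type header automatically in both cases.
# URL-encoded body; requests sends Content-Type: application/x-www-form-urlencoded
res = s.post(URL, data=payload)

# multipart body; wrapping each value as (None, value) makes requests send
# Content-Type: multipart/form-data with a generated boundary
res = s.post(URL, files={k: (None, str(v)) for k, v in payload.items()})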
Trying to get some values from Duolingo using Python, but urllib is giving me something different than when I navigate to the url via my browser.
Navigating to a url (https://www.duolingo.com/2017-06-30/users/215344344?fields=xpGoalMetToday) via browser gives: {"xpGoalMetToday": false}.
However, trying via the below script:
import urllib.request
url = 'http://www.duolingo.com/2017-06-30/users/215344344?fields=xpGoalMetToday'
user_agent = '[insert my local user agent copied from browser attempt]'
# header variable
headers = { 'User-Agent' : user_agent, "Cache-Control": "no-cache, max-age=0" }
# creating request
req = urllib.request.Request(url, None, headers)
print(urllib.request.urlopen(req).read())
returns just a blank {}.
As you can tell from the above, I've tried a couple of things: adding a user agent and cache control. I've even tried using the requests module and adding authentication (that didn't work).
Any ideas? Am I missing something?
Actually, when I open the link in the browser it shows me {}
Maybe you have some kind of cookie set in your browser?
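If that is the cause, one way to test it is to copy the request's Cookie header from the browser's Network tab and attach it to the urllib request; the cookie name and value below are placeholders, not real Duolingo cookies.
import urllib.request

url = 'https://www.duolingo.com/2017-06-30/users/215344344?fields=xpGoalMetToday'
headers = {
    'User-Agent': '[insert my local user agent copied from browser attempt]',
    # placeholder: paste the full Cookie header exactly as the browser sent it
    'Cookie': '<name>=<value copied from the browser request>',
}

req = urllib.request.Request(url, None, headers)
print(urllib.request.urlopen(req).read())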
I am scraping player names from the NBA website. The player index page is designed as a single-page application, and the players are distributed across several pages in alphabetical order. I am unable to extract the names of all the players.
Here is the link: https://in.global.nba.com/playerindex/
from selenium import webdriver
from bs4 import BeautifulSoup
class make():
    def __init__(self):
        self.first = ""
        self.last = ""

driver = webdriver.PhantomJS(executable_path=r'E:\Downloads\Compressed\phantomjs-2.1.1-windows\bin\phantomjs.exe')
driver.get('https://in.global.nba.com/playerindex/')
html_doc = driver.page_source
soup = BeautifulSoup(html_doc,'lxml')

names = []
layer = soup.find_all("a",class_="player-name ng-isolate-scope")
for a in layer:
    span = a.find("span",class_="ng-binding")
    thing = make()
    thing.first = span.text
    spans = a.find("span",class_="ng-binding").find_next_sibling()
    thing.last = spans.text
    names.append(thing)
When dealing with SPAs, you shouldn't try to extract info from the DOM, because the DOM is incomplete without a JS-capable browser running to populate it with data. Open up the page source, and you'll see the page HTML doesn't contain the data you need.
But most SPAs load their data using XHR requests. You can monitor network requests in Developer Console (F12) to see the requests being made during page load.
Here, https://in.global.nba.com/playerindex/ loads the player list from https://in.global.nba.com/stats2/league/playerlist.json?locale=en
Simulate that request yourself, then pick whatever you need. Inspect the request headers to figure out what you need to send with the request.
import requests
if __name__ == '__main__':
    page_url = 'https://in.global.nba.com/playerindex/'

    s = requests.Session()
    s.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'}

    # visit the homepage to populate the session with the necessary cookies
    res = s.get(page_url)
    res.raise_for_status()

    json_url = 'https://in.global.nba.com/stats2/league/playerlist.json?locale=en'
    res = s.get(json_url)
    res.raise_for_status()

    data = res.json()
    player_names = [p['playerProfile']['displayName'] for p in data['payload']['players']]
    print(player_names)
output:
['Steven Adams', 'Bam Adebayo', 'Deng Adel', 'LaMarcus Aldridge', 'Kyle Alexander', 'Nickeil Alexander-Walker', ...
Dealing with auth
One thing to watch out for is that some websites require an authentication token to be sent with requests. You can see it in the API requests if it's present.
If you're building a scraper that needs to be functional in the long(er) term, you might want to make the script more robust by extracting the token from the page and including it in requests.
This token (usually a JWT, starting with ey...) typically sits somewhere in the HTML, encoded as JSON, or it is sent to the client as a cookie that the browser attaches to the request, or in a header. In short, it could be anywhere. Scan the requests and responses to figure out where the token comes from and how you can retrieve it yourself.
...
<script>
const state = {"token": "ey......", ...};
</script>
import json
import re

import requests

res = requests.get('url/to/page')

# extract the token from the page. Here `state` is an object serialized as JSON;
# we take everything after the `=` sign up to the semicolon and deserialize it
state = json.loads(re.search(r'const state = (.*);', res.text).group(1))
token = state['token']

res = requests.get('url/to/api/with/auth', headers={'authorization': f'Bearer {token}'})
I am trying to figure out how to scrape a real estate website, https://www.brickz.my/, for my research project. I have been deciding between Selenium and Beautiful Soup, and chose Beautiful Soup as the best way for me, since the structure of the URL for each property lets my code navigate the website easily and quickly.
I am trying to build a database of transactions for each property. Without logging in, only the 10 latest transactions are displayed for a particular property. By logging in, I can access all the transactions for a particular type of property. Here is the example:
Without logging in, I can only access 10 transactions for each property.
After logging in, I can access more than 10 transactions plus the previously obscured property addresses.
I tried to log in using requests in Python, yet it keeps bringing me to the logged-out page, and I end up only managing to scrape the 10 latest transactions instead of the whole set. Here is an example of my login code in Python:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.brickz.my/login/", auth=('email', 'password'))

headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 5.1.1; SM-G928X Build/LMY47X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.83 Mobile Safari/537.36'}
soup = BeautifulSoup(page.content, 'html.parser')

# I put one of the property URLs to be scraped inside response
response = requests.get(
    "https://www.brickz.my/transactions/residential/kuala-lumpur/titiwangsa/titiwangsa-sentral-condo/non-landed/?range=2012+Oct-",
    headers=headers)
Here is what I used to scrape the table
table = BeautifulSoup(response.text, 'html.parser')
table_rows = table.find_all('tr')

names = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    names.append(row)
How can I successfully log in and get access to the whole transaction history? I have heard about the Mechanize library, but it is not available for Python 3.
I am sorry if my question is not clear; this is my first time posting, and I only started learning Python a couple of months ago.
Try the code below. What do you see when you run it (after changing the email and password)? Doesn't it print Logout as the result?
import requests
from bs4 import BeautifulSoup
URL = "https://www.brickz.my/login/"
payload = {
    'email': 'your_email',
    'pw': 'your_password',
    'submit': 'Submit'
}

with requests.Session() as s:
    s.headers = {"User-Agent":"Mozilla/5.0"}
    s.post(URL,data=payload)
    res = s.get("https://www.brickz.my/")
    soup = BeautifulSoup(res.text,"lxml")
    for items in soup.select("select#menu_select .nav2"):
        data = [' '.join(item.text.split()) for item in items.select("option")[-1:]]
        print(data)
A simple HTTP trace will show that a POST is made to https://www.brickz.my/login/ with email and pw as form parameters.
Which translates into this requests command:
import requests

session = requests.Session()
resp = session.post('https://www.brickz.my/login/', data={'email': '<youremail>', 'pw': '<yourpassword>'})
if resp.ok:
    print("You should now be logged in")
    # then use the session to request the site, like
    # resp = session.get("https://www.brickz.my/whatever")
WARNING: Untested since I don't have an account there.
I tried to get the HTML code from a site named dcinside in Korea. I am using requests but cannot get the HTML code,
and this is my code:
import requests
url = "http://gall.dcinside.com/board/lists/?id=bitcoins&page=1"
req = requests.get(url)
print (req)
print (req.content)
but the result was
Why can't I get the HTML code even when using requests?
Most likely they are detecting that you are trying to crawl data dynamically, and not giving any content as a response. Try pretending to be a browser and passing some User-Agent headers.
headers = {
    'User-Agent': 'My User Agent 1.0',
    'From': 'youremail@domain.com'
}

response = requests.get(url, headers=headers)
# use an authentic Mozilla or Chrome user-agent string if this doesn't work
Take a look at this:
Python Web Crawlers and "getting" html source code
Like the guy said in the aforementioned post, you should use urllib2, which will allow you to easily obtain web resources.
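For what it's worth, a minimal sketch of that approach; urllib2 exists only in Python 2, so this uses its Python 3 counterpart urllib.request, and it still sends a browser-like User-Agent since the default one appears to be blocked by the site:
import urllib.request

url = "http://gall.dcinside.com/board/lists/?id=bitcoins&page=1"
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})

with urllib.request.urlopen(req) as resp:
    html = resp.read().decode('utf-8', errors='replace')

print(html[:500])  # show the start of the page to confirm real HTML came back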
I'm trying to log in and scrape a job site and have it send me a notification whenever certain keywords are found. I think I have correctly traced the XPath for the value of the field "login[iovation]", but I cannot extract the value. Here is what I have done so far to log in:
import requests
from lxml import html

header = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"}

login_url = 'https://www.upwork.com/ab/account-security/login'
session_requests = requests.session()

# get the csrf token and iovation value from the login page
result = session_requests.get(login_url)
tree = html.fromstring(result.text)
auth_token = list(set(tree.xpath('//*[@name="login[_token]"]/@value')))
auth_iovation = list(set(tree.xpath('//*[@name="login[iovation]"]/@value')))

# create payload
payload = {
    "login[username]": "myemail@gmail.com",
    "login[password]": "pa$$w0rD",
    "login[_token]": auth_token,
    "login[iovation]": auth_iovation,
    "login[redir]": "/home"
}

# perform login
scrapeurl = 'https://www.upwork.com/ab/find-work/'
result = session_requests.post(login_url, data=payload, headers=dict(referer=login_url))

# test the result
print(result.text)
This is a screenshot of the form data when I log in successfully.
This is because Upwork uses something called iOvation (https://www.iovation.com/) to reduce fraud. iOvation builds a digital fingerprint of your device/browser, which is sent via the login[iovation] parameter.
If you look at the JavaScript loaded on the site, you will find two scripts being loaded from the iesnare.com domain. This domain, and many others, are owned by iOvation to drop third-party JavaScript that identifies your device/browser.
I think if you copy the string from a successful login and send it over along with all the HTTP headers as-is, including the browser user agent, in your Python code, you should be okay.
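A rough sketch of that idea (an illustration, not tested against Upwork: the fingerprint string, headers, and credentials below are placeholders you would copy from your own successful browser login via the dev tools' Network tab):
import requests
from lxml import html

login_url = 'https://www.upwork.com/ab/account-security/login'

# placeholders: copy the real values from the browser's successful login request
browser_headers = {
    "User-Agent": "<the exact User-Agent string your browser sent>",
    "Referer": login_url,
}
iovation_blackbox = "<the login[iovation] string captured from the browser's POST>"

session = requests.Session()
session.headers.update(browser_headers)

# fetch a fresh CSRF token, but reuse the browser-generated iovation fingerprint
result = session.get(login_url)
tree = html.fromstring(result.text)
token = tree.xpath('//*[@name="login[_token]"]/@value')[0]

payload = {
    "login[username]": "myemail@gmail.com",
    "login[password]": "pa$$w0rD",
    "login[_token]": token,
    "login[iovation]": iovation_blackbox,
    "login[redir]": "/home",
}
result = session.post(login_url, data=payload)
print(result.status_code)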
Are you sure that result is returning a 2XX status code?
When I run this code, result = session_requests.get(login_url), it returns a 403 status code, which means I am not even getting to login_url itself.
They have an official API now, no need for scraping, just register for API keys.