I am trying to use Python (2.7) to automatically log into Amazon Mechanical Turk and grab information about some of the HITs available. If you attempt to go past page 20, a login is required, and that is where I am having difficulty. I have tried several Python packages, including mechanize and urllib2, and most recently I found a closely related solution on Stack Overflow using requests. I made the slight modifications necessary for my context (see below), but the code is not working: the response page is again the login page, with the error "Your password is incorrect." displayed. The code from the original post no longer works in its original context either; the same error is displayed. So I assume Amazon has changed something, and I cannot figure out what it is or how to fix it. Any help along this line would be very appreciated.
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Chrome'
}

url = "https://www.mturk.com/mturk/viewhits?searchWords=&pageNumber=21" \
      "&searchSpec=HITGroupSearch%23T%232%23100%23-1%23T%23%21%23%21" \
      "LastUpdatedTime%211%21%23%21&sortType=LastUpdatedTime%3A1" \
      "&selectedSearchType=hitgroups"

with requests.Session() as s:
    s.headers = headers
    r = s.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    # Collect every pre-filled input of the sign-in form
    signin_data = {inp["name"]: inp["value"]
                   for inp in soup.select("form[name=signIn]")[0].select("input[name]")
                   if inp.has_attr("value")}
    signin_data[u'email'] = ''
    signin_data[u'password'] = ''
    for k, v in signin_data.iteritems():
        print k + ": " + v
    action = soup.find('form', id='ap_signin_form').get('action')
    response = s.post(action, data=signin_data)
    soup = BeautifulSoup(response.text, "html.parser")
    warning = soup.find('div', {'id': 'message_error'})
    if warning:
        print('Failed to login: {0}'.format(warning.text))
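One debugging step that may help (a sketch, not a fix): right after the first s.get(url), dump every input of the sign-in form, including the ones the dict comprehension above skips because they carry no value attribute, and compare the result against the fields a real browser POSTs (visible in the browser's developer tools). Any field the browser sends that is empty or missing here is presumably populated by JavaScript, which requests alone cannot execute:

form = soup.select("form[name=signIn]")[0]
for inp in form.select("input"):
    # Fields a browser fills in via JavaScript will show None or '' here
    print inp.get("name"), "=", inp.get("value")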
I'm trying to scrape word definitions, but I can't get Python to redirect to the correct page. For example, I'm trying to get the definition of the word 'agenesia'. When you load https://www.lexico.com/definition/agenesia in a browser, the page that loads is https://www.lexico.com/definition/agenesis; in Python, however, the page doesn't redirect and gives a 200 status code.
URL = 'https://www.lexico.com/definition/agenesia'
page = requests.head(URL, allow_redirects=True)
This is how I'm currently retrieving the page content. I've also tried requests.get, but that doesn't work either.
EDIT: Because it isn't clear: I'm aware that I could change the word to 'agenesis' in the URL to get the correct page, but I am scraping a list of words and would rather follow the redirect automatically than look each one up in a browser by hand first.
EDIT 2: I realised it might be easier to check solutions against the rest of my code; so far this works with 'agenesis' but not 'agenesia':
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.find("span", {"class": "ind"}).get_text(), '\n')
print(soup.find("span", {"class": "pos"}).get_text())
The other answers mentioned before don't make your request redirect. The cause is that you didn't use the correct request headers. Try the code below:
import requests
from bs4 import BeautifulSoup
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
}
page = requests.get('https://www.lexico.com/definition/agenesia', headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
print(page.url)
print(soup.find("span", {"class": "ind"}).get_text(), '\n')
print(soup.find("span", {"class": "pos"}).get_text())
And it prints:
https://www.lexico.com/definition/agenesis?s=t
Failure of development, or incomplete development, of a part of the body.
noun
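Presumably the site (or the CDN in front of it) only issues the redirect when the request advertises browser-like content negotiation; with requests' default Accept header (*/*) it returns a plain 200 page instead, which matches the behaviour described in the question.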
You are doing a HEAD request.
The HTTP HEAD method requests the headers that would be returned if the HEAD request's URL was instead requested with the HTTP GET method.
You want to do
URL = 'https://www.lexico.com/definition/agenesia'
page = requests.get(URL, allow_redirects=True)
If you don't mind a pop-up window, Selenium is really good for scraping at a more user-friendly level. If you know the selector of the page element, you can scrape it with driver.find_element_by_css_selector('the selector').text, where driver = webdriver.Chrome('path to chromedriver').
This is a pretty radical circumvention of the problem, so I understand if it's not applicable to your specific situation, but hopefully you find this answer useful. :)
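For the page in question, a minimal sketch might look like this (the chromedriver path is hypothetical, and the span.ind selector is taken from the question's code):

from selenium import webdriver

driver = webdriver.Chrome('/path/to/chromedriver')  # hypothetical path
driver.get('https://www.lexico.com/definition/agenesia')
print(driver.current_url)  # the browser itself follows the redirect to .../agenesis
print(driver.find_element_by_css_selector('span.ind').text)
driver.quit()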
This works as expected:
>>> import requests
>>> url = 'https://www.lexico.com/definition/agenesia'
>>> requests.get(url, allow_redirects=True).url
'https://www.lexico.com/search?filter=en_dictionary&query=agenesia'
This is the URL that other commands find. For example, with curl:
$ curl -v https://www.lexico.com/definition/agenesia 2>&1 | grep location:
< location: https://www.lexico.com/search?filter=en_dictionary&query=agenesia
I'm trying to extract the coronavirus numbers from a website (https://www.trackcorona.live), but I got an error.
This is my code:
response = requests.get('https://www.trackcorona.live')
data = BeautifulSoup(response.text,'html.parser')
li = data.find_all(class_='numbers')
confirmed = int(li[0].get_text())
print('Confirmed Cases:', confirmed)
It gives the following error (though it was working a few days back) because it is returning an empty list (li):
IndexError                                Traceback (most recent call last)
<ipython-input-15-7a09f39edc9d> in <module>
      2 data = BeautifulSoup(response.text, 'html.parser')
      3 li = data.find_all(class_='numbers')
----> 4 confirmed = int(li[0].get_text())
      5 countries = li[1].get_text()
      6 dead = int(li[3].get_text())

IndexError: list index out of range
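A quick first diagnostic (a sketch): print the status code and the start of the body before indexing into li; when Cloudflare sits in front of a site, the response is its challenge page rather than the real content:

response = requests.get('https://www.trackcorona.live')
print(response.status_code)  # a 503 here points at the Cloudflare challenge
data = BeautifulSoup(response.text, 'html.parser')
li = data.find_all(class_='numbers')
if not li:
    print(response.text[:500])  # inspect what the server actually returned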
Well, actually the site issues a redirect behind Cloudflare, and the content is then loaded dynamically via JavaScript once the page loads. We could therefore use several approaches, such as selenium or requests_html, but I will mention the quickest solution, as we will render the JS on the fly :)
import cloudscraper
from bs4 import BeautifulSoup
scraper = cloudscraper.create_scraper()
html = scraper.get("https://www.trackcorona.live/").text
soup = BeautifulSoup(html, 'html.parser')
confirmed = soup.find("a", id="valueTot").text
print(confirmed)
Output:
110981
A tip for the 503 response code:
Basically, that code means service unavailable.
More technically, the GET request you sent couldn't be served. The reason is that the request got stuck between the receiver of the request, https://www.trackcorona.live/, and another resource on the same host that it hands off to, https://www.trackcorona.live/?__cf_chl_jschl_tk__=
where __cf_chl_jschl_tk__= holds a token to be authenticated.
So your code should follow up and serve the host the required data.
Something like the following, showing the end URL:
import requests
from bs4 import BeautifulSoup
def Main():
    with requests.Session() as req:
        url = "https://www.trackcorona.live"
        r = req.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        redirect = f"{url}{soup.find('form', id='challenge-form').get('action')}"
        print(redirect)

Main()
Output:
https://www.trackcorona.live/?__cf_chl_jschl_tk__=575fd56c234f0804bd8c87699cb666f0e7a1a114-1583762269-0-AYhCh90kwsOry_PAJXNLA0j6lDm0RazZpssum94DJw013Z4EvguHAyhBvcbhRvNFWERtJ6uDUC5gOG6r64TOrAcqEIni_-z1fjzj2uhEL5DvkbKwBaqMeIZkB7Ax1V8kV_EgIzBAeD2t6j7jBZ9-bsgBBX9SyQRSALSHT7eXjz8r1RjQT0SCzuSBo1xpAqktNFf-qME8HZ7fEOHAnBIhv8a0eod8mDmIBDCU2-r6NSOw49BAxDTDL57YAnmCibqdwjv8y3Yf8rYzm2bPh74SxVc
Now, to be able to call the end URL, you need to pass the required form data. Something like this:
def Main():
    with requests.Session() as req:
        url = "https://www.trackcorona.live"
        r = req.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        redirect = f"{url}{soup.find('form', id='challenge-form').get('action')}"
        data = {
            'r': 'none',
            'jschl_vc': 'none',
            'pass': 'none',
            'jschl_answer': 'none'
        }
        r = req.post(redirect, data=data)
        print(r.text)

Main()
Here you will still end up with text that lacks your desired values, because those values are rendered via JS.
That site is behind Cloudflare DDoS protection, so the HTML returned is a Cloudflare page stating this, not the content you want. You will need to get past that first, presumably by getting and setting some cookies, etc.
As an alternative, I recommend taking a look at Selenium. It drives a browser, executes any JS on the page, and should get you past this much more easily if you are just starting out.
Hope that helps!
The website is now protected with Cloudflare DDoS protection, so it cannot be accessed directly with Python requests.
You can try https://github.com/Anorov/cloudflare-scrape, which bypasses this page. The pip package is named cfscrape.
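A minimal usage sketch (note that cfscrape needs a JavaScript runtime such as Node.js installed in order to solve the challenge):

import cfscrape

scraper = cfscrape.create_scraper()  # returns a requests.Session subclass
print(scraper.get('https://www.trackcorona.live').text)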
I am using Python requests.get to scrape a website. It worked fine yesterday, but it started failing today.
I can't get requests.text as before. Now requests returns "This process is automatic. Your browser will redirect to your requested content shortly. Please allow up to 5 seconds."
To solve this, I tried putting a time.sleep() call inside requests.get, but it didn't work and I still get the "allow up to 5 seconds" response as before.
Here's my code below:
def web_scraping_corporationwiki(df):
    # Scrape website
    df_scrape = df.reset_index()
    for i in df_scrape.index:
        name = df_scrape.loc[i, 'Name']
        state = df_scrape.loc[i, 'State']
        origin_row = df_scrape.loc[i]
        url = 'https://www.corporationwiki.com/search/withfacets?term=' + name + '&stateFacet=' + state
        for page_num in range(8):
            time.sleep(50)  # pausing between requests; this does not clear the block
            req = requests.get(url, params=dict(query="wiki", page=page_num), timeout=100)
            soup = BeautifulSoup(req.text, "html.parser")
            page = soup.find_all('div', class_='list-group-item')
Can someone help me with this problem?
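The "allow up to 5 seconds" page is the same Cloudflare JavaScript challenge discussed in the answers above; sleeping doesn't help, because the challenge has to be solved rather than waited out. A sketch of the request portion using cloudscraper, as in the earlier answer (url and page_num as defined in the question's function; the surrounding loop and parsing stay the same):

import cloudscraper
from bs4 import BeautifulSoup

scraper = cloudscraper.create_scraper()  # drop-in replacement for a requests session
req = scraper.get(url, params=dict(query="wiki", page=page_num))
soup = BeautifulSoup(req.text, "html.parser")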
I'm trying to extract the sector of a stock for a ML classification project. If I go to the following page:
https://www.six-swiss-exchange.com/shares/security_info_en.html?id=CH0012221716CHF4
I get (on the screen) some information about this stock (it changes with the id code; I just picked the first one in the list). However, none of the information is available via a regular request. (The HTML page contains mostly JavaScript functions.)
What I need is on the "Shares Details" tab (ICB Supersector, at the bottom of the page). Once again, nothing is available with a regular request. I looked into what happens when I click this tab, and the desired request is this URL:
http://www.six-swiss-exchange.com/shares/info_details_en.html?id=CH0210483332CHF4&portalSegment=EQ&dojo.preventCache=1520360103852 HTTP/1.1
However, if I use this URL directly, I get a 403 error from requests, though it works from a browser. I usually don't have any problems with this sort of thing, but in this case, do I have to submit cookies or other information to access the page? No login is required, and it can easily be accessed from any browser.
I am thinking: 1) make a first request to the URL that works, 2) store the cookie they send you (I don't really know how to do that), and 3) make a second request to the desired URL. Would this work?
I tried using requests.Session(), but I'm not sure if this is the solution or if I implemented it properly.
If anyone has dealt with this sort of problem, I would love some pointers on solving it. Thanks.
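On step 2: requests.Session stores cookies automatically, so the plan reduces to making both requests from the same session object. A minimal sketch, assuming cookies are all the server checks (it may also inspect headers such as User-Agent or Referer):

import requests

with requests.Session() as s:
    # First request: any Set-Cookie headers land in s.cookies automatically
    s.get('https://www.six-swiss-exchange.com/shares/security_info_en.html?id=CH0012221716CHF4')
    # Second request: the stored cookies are sent back without extra work
    r = s.get('https://www.six-swiss-exchange.com/shares/info_details_en.html',
              params={'id': 'CH0012221716CHF4', 'portalSegment': 'EQ'})
    print(r.status_code)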
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.six-swiss-exchange.com'


def get_page_html(isin):
    params = {
        'id': isin,
        'portalSegment': 'EQ'
    }
    r = requests.get(
        '{}/shares/info_details_en.html'.format(BASE_URL),
        params=params
    )
    r.raise_for_status()
    return r.text


def get_supersector_info(soup):
    supersector = soup.find('td', text='ICB Supersector').next_sibling.a
    return {
        'link': urljoin(BASE_URL, supersector['href']),
        'text': supersector.text
    }


if __name__ == '__main__':
    page_html = get_page_html('CH0012221716CHF4')
    soup = BeautifulSoup(page_html, 'lxml')
    supersector_info = get_supersector_info(soup)
    print(supersector_info['link'])
    print(supersector_info['text'])
Console:
https://www.six-swiss-exchange.com/search/quotes_en.html?security=C2700T
Industrial Goods & Services
I'm trying to log into my college website using Python. I want the source code of the welcome page, i.e. my dashboard, but when I run this I'm getting the same source code as the login page. Is this because I'm not able to post my info to the login form? Here is the code:
import requests
from bs4 import BeautifulSoup
from lxml import html
import collections
url = 'http://erp.college_name.edu/'
opening = requests.get(url)
r = requests.session()
stuff = collections.OrderedDict()
stuff = {
    'tbUserName': 'my_username',
    'tbPassword': 'my_password',
}
opens = r.post(url=url, data=stuff)
soup = BeautifulSoup(opens.text, 'lxml')
print(soup)
Any help?
You're probably not logging in correctly. Ideally, the site will give you a non-200 status code, which you can check with opens.status_code. A successful response status code starts with a 2 (such as 200). Note that some sites return a 200 even when the request didn't do what you wanted.
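A minimal check, reusing the variable names from the question (with the caveat above that a failed login can still come back as 200):

opens = r.post(url=url, data=stuff)
print(opens.status_code)   # anything outside the 2xx range means the POST itself failed
opens.raise_for_status()   # raises an HTTPError for 4xx/5xx responses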
UPDATE
So, after getting the tokens:
url = 'http://erp.name_of_college.edu/'
opening = requests.get(url)
tree = html.fromstring(opening.text)
# Grab the hidden token from the login form
token = list(set(tree.xpath("//input[@name='name_of_token']/@value")))[0]
r = requests.session()
datas = {
    'tbUserName': 'my_username',
    'tbPassword': 'my_password',
    'name_of_token': token,
}
opens = r.post(url=url, data=datas)
soup = BeautifulSoup(opens.text, 'lxml')
print(soup)
The problem is solved: you need to include the tokens in the POSTed data. They are generally hidden inputs in the form, and if the problem persists, include more data from the form ;)
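If you don't know the hidden fields' names up front, a generic sketch is to collect every hidden input from the login page and merge your credentials in (url as defined in the question; the field names below are the question's placeholders):

from bs4 import BeautifulSoup
import requests

login_page = BeautifulSoup(requests.get(url).text, 'lxml')
hidden = {inp.get('name'): inp.get('value', '')
          for inp in login_page.select('input[type=hidden]')
          if inp.get('name')}
datas = dict(hidden, tbUserName='my_username', tbPassword='my_password')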