Web Scraping Header issue - python

I am playing about with scraping data from websites as an educational exercise. I am using Python and Beautiful Soup.
I am basically looking at products on a page e.g.
http://www.asos.com/Women/Dresses/Cat/pgecategory.aspx?cid=8799#parentID=-1&pge=0&pgeSize=5&sort=-1
I noticed the page has pge and pgeSize parameters which I can change in the browser to get the results I would expect, but when I fetch the page with Python requests it always returns the same 36 products (36 being the default).
I thought this was a header issue, so I copied the request headers from Chrome's developer tools and tried to reproduce the request with curl, but I can't get past the following response:
curl -c ~/cookie -H "Accept: application/xml" -H "Accept-Language: en-GB,en-US;q=0.8,en;q=0.6" -H "Content-Type: application/xml" -H "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36" -X GET 'http://www.asos.com/Women/Dresses/Cat/pgecategory.aspx?cid=8799#parentID=-1&pge=0&pgeSize=5&sort=-1'
Object moved
Object moved to here.
What is the correct way to debug this and work out what the request actually needs?

The default dresses are always returned for the URL /Women/Dresses/Cat/pgecategory.aspx?cid=8799&r=2.
Notice that parentID=-1&pge=7&pgeSize=5&sort=-1 come after the # sign: they are part of the URL fragment, which the browser never sends to the server. (The "Object moved" page curl showed you is just a redirect response; add -L to follow it.)
In the browser, an additional request then fetches the right dresses and swaps them into the page for you.

You need to provide an asos cookie, e.g. using this curl flag:
curl --cookie "asos=currencyid=19" 'http://www.asos.com/Women/Dresses/Cat/pgecategory.aspx?cid=8799#parentID=-1&pge=0&pgeSize=5&sort=-1'
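For reference, a rough Python requests equivalent of that curl command might look like the sketch below. Moving pge and pgeSize out of the fragment and into the real query string is an assumption on my part: everything after the # never reaches the server, so those values have to travel some other way.

import requests

# Sketch: requests equivalent of the curl command above.
url = "http://www.asos.com/Women/Dresses/Cat/pgecategory.aspx"
# Assumption: the fragment parameters are also accepted as normal query
# parameters; the '#...' part of the original URL is never sent to the server.
params = {"cid": "8799", "pge": "0", "pgeSize": "5", "sort": "-1"}
cookies = {"asos": "currencyid=19"}

# A GET with requests follows the "Object moved" redirect automatically.
response = requests.get(url, params=params, cookies=cookies)
print(response.status_code, len(response.text))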

Related

How to scrape data with expandable tabs and subsequent POST requests (CORS problem)

There is a website which I need to scrape. It has a long list of available job positions, which are folded by default:
Each one unfolds when a user clicks on it:
When a user unfolds it, the page sends a POST request to the website with a position id.
I tried to imitate this request (see the code below); it doesn't fail (status == 200) but doesn't return anything. I suspect that is because of CORS. Is there any way to still collect the data?
import requests

url = "https://econjobmarket.org/positions/recordClick"
payload = 'posid=7026'
headers = {
    'Accept': '*/*',
    'X-CSRF-TOKEN': HERE_GOES_THE_TOKEN,  # placeholder from the question
    'X-Requested-With': 'XMLHttpRequest',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': HERE_GOES_THE_COOKIE  # placeholder from the question
}
response = requests.post(url, headers=headers, data=payload)
print(response.text.encode('utf8'))
I don't see any additional request being sent to fetch the expanded data. All the data (both the folded and the expanded states) is already in the page source:
content = requests.get('https://econjobmarket.org/positions').text
print("Post-Doc, Computational Marketing" in content)
True
The recordClick URL you are seeing is simply there to record the click for web analytics. As Parolla said, what you are looking for is already in the page source. Your best bet is to do an HTTP GET on the website and parse the HTML with BeautifulSoup.
You can also reduce the site's ability to track you (and potentially block your scraping) by dropping the token and cookies from the request headers.
A quick test in curl shows the response is still complete without them.
curl -i -s -k -X $'GET' \
-H $'Host: econjobmarket.org' -H $'Connection: close' -H $'Cache-Control: max-age=0' -H $'DNT: 1' -H $'Upgrade-Insecure-Requests: 1' -H $'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36' -H $'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' -H $'Sec-GPC: 1' -H $'Sec-Fetch-Site: cross-site' -H $'Sec-Fetch-Mode: navigate' -H $'Sec-Fetch-User: ?1' -H $'Sec-Fetch-Dest: document' -H $'Accept-Encoding: gzip, deflate' -H $'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8' \
$'https://econjobmarket.org/positions'
J and Parolla are correct that the POST is just recording your actions on the website.
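Putting those answers together, here is a minimal sketch of the GET-and-parse approach; the plain-text membership test is just the check from above, and pulling out individual listings would need a selector chosen by inspecting the real page source:

import requests
from bs4 import BeautifulSoup

# Fetch the positions page once and parse it with BeautifulSoup.
html = requests.get('https://econjobmarket.org/positions').text
soup = BeautifulSoup(html, 'html.parser')

# Both the folded and the expanded text are already in the HTML,
# so a simple text search is enough to confirm a listing is present.
print('Post-Doc, Computational Marketing' in soup.get_text())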

How to consume this API in Python? I just can't

So I'm trying to consume this API. I got this URL: http://www.ventamovil.com.mx:9092/service.asmx?op=Check_Balance
There you can enter {"User":"6144135400","Password":"Prueba$$"} into the input field and you get a response.
(Screenshot of the response: https://i.stack.imgur.com/RTEii.png)
But when I try to consume this API in Python I just can't; I don't know exactly how to call it correctly.
(My code was posted as a screenshot.)
As you can see, I got a different response with my code; I should be getting the same response as in the "Response" screenshot.
To save yourself some time, you can use their own request to build the Python code automatically. All you have to do is:
1. Just as you did at first, enter the JSON in the input field and invoke the method.
2. Open the network tab and copy the POST request the page made as curl:
curl 'http://www.ventamovil.com.mx:9092/service.asmx/Check_Balance' -H 'Connection: keep-alive' -H 'Cache-Control: max-age=0' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36' -H 'Origin: http://www.ventamovil.com.mx:9092' -H 'Content-Type: application/x-www-form-urlencoded' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' -H 'Referer: http://www.ventamovil.com.mx:9092/service.asmx?op=Check_Balance' -H 'Accept-Language: en-US,en;q=0.9,ar;q=0.8,pt;q=0.7' --data 'jrquest=%7B%22User%22%3A6144135400%2C+%22Password%22%3A+%22Prueba%24%24%22%7D' --compressed --insecure
3. Go to Postman and import the curl command, then click Code and select Python; there you go, you have all the right headers:
import requests

url = "http://www.ventamovil.com.mx:9092/service.asmx/Check_Balance"
payload = 'jrquest=%7B%22User%22%3A6144135400%2C+%22Password%22%3A+%22Prueba%24%24%22%7D'
headers = {
    'Upgrade-Insecure-Requests': '1',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
}
response = requests.post(url, headers=headers, data=payload)
print(response.text.encode('utf8'))
As you can see, they accept their input as a form-encoded payload.
You need to parameterize this request with the user/password you want each time you use it.
By the way, the output of this Python code is:
b'<?xml version="1.0" encoding="utf-8"?>\r\n<string xmlns="http://www.ventamovil.com.mx/ws/">{"Confirmation":"00","Saldo_Inicial":"10000","Compras":"9360","Ventas":"8416","Comision":"469","Balance":"10345.92"}</string>'
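If it helps, one hedged way to parameterize the request is to build the jrquest form field from a JSON string instead of hard-coding the percent-encoded body. Whether the service wants User as a number or a string is an assumption; the captured request sent it unquoted, while the web form example quoted it.

import json
import requests

def check_balance(user, password):
    # The service expects a single form field 'jrquest' whose value is a
    # JSON string; requests form-encodes the payload for us.
    url = 'http://www.ventamovil.com.mx:9092/service.asmx/Check_Balance'
    payload = {'jrquest': json.dumps({'User': user, 'Password': password})}
    return requests.post(url, data=payload)

print(check_balance('6144135400', 'Prueba$$').text)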

Python; Submitting Web Form, How to use a search bar?

I have been trying to search a website and collect the URLs of the articles it yields, but I have run into problems I don't understand. I have read up on web forms, and I believe my problem arises because I am trying to submit a prebuilt link to the search results rather than generating them with the web form.
There are several similar questions on here already, but none I've found deal with how to deconstruct the HTML and what steps are needed to find and submit a web form.
When researching how to submit web forms, I found out about Selenium, but I couldn't find working Python 3 examples of it, nor good documentation. Another SO question pointed me to a CodeProject article which has gotten me the most progress so far (code featured below): https://www.codeproject.com/Articles/873060/Python-Search-Youtube-for-Video
That said, I don't yet understand why it works or which variables I will need to change in order to harvest results from another website. Namely, the website I'm looking to search is: https://globenewswire.com/Search
So my question is this: whether by web form submission or by proper URL formation, how can I obtain the search results HTML?
Here is the code I had been using to formulate the search-results URL:
name = input()
name = name.replace(' ', '%20')
url = 'https://globenewswire.com/Search/NewsSearch?keyword=' + name + '#'
Here is the code featured in the CodeProject article:
import urllib.request
import urllib.parse
import re

# Build the query string and fetch the YouTube search results page.
query_string = urllib.parse.urlencode({"search_query": input()})
html_content = urllib.request.urlopen("http://www.youtube.com/results?" + query_string)

# Pull the 11-character video IDs out of the /watch?v=... links.
search_results = re.findall(r'href=\"\/watch\?v=(.{11})', html_content.read().decode())
print("http://www.youtube.com/watch?v=" + search_results[0])
Edit:
After capturing my request with Chrome's dev tools, I now have the response headers and the following curl command:
curl "https://globenewswire.com/Search" -H "cookie:somecookie" -H "Origin: https://globenewswire.com" -H "Accept-Encoding: gzip, deflate, br" -H "Accept-Language: en-US,en;q=0.9" -H "Upgrade-Insecure-Requests: 1" -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36" -H "Content-Type: application/x-www-form-urlencoded" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" -H "Cache-Control: max-age=0" -H "Referer: https://globenewswire.com/Search" -H "Connection: keep-alive" -H "DNT: 1" --data "__RequestVerificationToken=xY^%^2BkRoeEL8DswUTDlUVEWUCSxnRzX5Ax2Z^%^2FNCTa0lBNfqOFaU2eb^%^2FTD8XqENnf8d2Ghtm1taW8Cu0BvWrC1dh^%^2BdKZVgHyC6HM0EEm7mupQe1UZ7pHrF9GhnpwwcXR0dyJ^%^2B91Ng^%^3D^%^3D^&quicksearch-textbox=Abeona+Therapeutics" --compressed
As well as the request headers:
POST /Search HTTP/1.1
Host: globenewswire.com
Connection: keep-alive
Content-Length: 217
Cache-Control: max-age=0
Origin: https://globenewswire.com
Upgrade-Insecure-Requests: 1
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
DNT: 1
Referer: https://globenewswire.com/Search
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
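Given that capture, one possible approach (a sketch, assuming the __RequestVerificationToken sits in a hidden input on the search page, as ASP.NET forms usually do) is to fetch the page once with a requests.Session to pick up the cookies and the token, then replay the form POST:

import requests
from bs4 import BeautifulSoup

with requests.Session() as session:  # keeps cookies across both requests
    page = session.get('https://globenewswire.com/Search')
    soup = BeautifulSoup(page.text, 'html.parser')
    # Assumption: the anti-forgery token is a hidden form input.
    token = soup.find('input', {'name': '__RequestVerificationToken'})['value']

    response = session.post(
        'https://globenewswire.com/Search',
        data={
            '__RequestVerificationToken': token,
            'quicksearch-textbox': 'Abeona Therapeutics',
        },
    )
    print(response.status_code)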

How to convert curl request to requests module in python

How can I convert this curl request into a POST request compatible with the Python requests module?
curl 'http://sss.com' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36' --data 'aq=%40syssource%3DProFind%20AND%20NOT%20%40title%3DCoveo%20AND%20NOT%20%40title%3Derror&searchHub=ProFind&xxx=yyy&xxx=yyy&xxx=yyy=10&xxx=yyy' --compressed
I am looking at the requests module documentation here:
http://docs.python-requests.org/en/master/user/quickstart/
But it only shows data as key/value pairs:
r = requests.post('http://httpbin.org/post', data = {'key':'value'})
So how can I convert the above curl POST request to the Python requests module?
The documentation you linked says:
There are many times that you want to send data that is not form-encoded. If you pass in a string instead of a dict, that data will be posted directly.
So just use
r = requests.post('http://sss.com', data = 'aq=%40syssource%3DProFind%20AND%20NOT%20%40title%3DCoveo%20AND%20NOT%20%40title%3Derror&searchHub=ProFind&xxx=yyy&xxx=yyy&xxx=yyy=10&xxx=yyy')
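If you prefer readable key/value pairs, the same body can also be passed as a list of tuples, which requests form-encodes for you. A list (rather than a dict) is needed because the body repeats the xxx key; the xxx=yyy pairs look like placeholders from the question, decoded here exactly as written:

import requests

# Decoded form fields from the original --data string; requests
# re-encodes them (e.g. '@' back to '%40') when posting.
data = [
    ('aq', '@syssource=ProFind AND NOT @title=Coveo AND NOT @title=error'),
    ('searchHub', 'ProFind'),
    ('xxx', 'yyy'),
    ('xxx', 'yyy'),
    ('xxx', 'yyy=10'),  # kept exactly as it appears in the original body
    ('xxx', 'yyy'),
]
r = requests.post('http://sss.com', data=data)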

Incorrect input parameters when using the Python Requests library

I have a Python script using the Requests library that is of this form:
import requests

uhash = '1234567abcdefg'
cookies = {
    'uhash': uhash
}
payload = {
    'action': 'trade.bump',
    'hash': uhash,
    'tradeid': '12345678'
}
r = requests.post(
    'http://www.target_url.com/api/core',
    cookies=cookies,
    params=payload
)
Above is my Python attempt at creating the following cURL request (written with bash):
HASH="1234567abcdefg"
TRADEID="12345678"
curl 'http://www.target_url.com/api/core' -H "Cookie: uhash=$HASH" --data "action=trade.bump&hash=$HASH&tradeid=$TRADEID"
In summary, both scripts contain:
The cookie - uhash
Three data parameters called action, hash, and tradeid
My current issue is this: the bash script works. The server response when I use the bash script is:
{"meta":{"code":200},"data":{"bumped":true,"count":15}}
However, if I use the Python script, with the SAME cookie and parameter values as the bash script, I get:
{"meta":{"code":301},"data":{"message":"You can't bump a trade that doesn't exist ;_;"}}
The above error tells me the trade doesn't exist, despite that tradeid existing and being exactly the same as in my bash script's parameters.
I tried to debug using Firefox's convenient copy-as-cURL tool, which is how I made the bash script in the first place. However, once I translated it to the Python script, I got the aforementioned error. Maybe I am using the Requests library incorrectly and am missing something.
Attached is the full cURL request taken from Firefox (don't worry, the parameters were sanitized, meaning, they're not the real values):
curl 'http://www.tf2outpost.com/api/core' -H 'Host: www.tf2outpost.com' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:35.0) Gecko/20100101 Firefox/35.0' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'X-Requested-With: XMLHttpRequest' -H 'Referer: http://www.tf2outpost.com/trades' -H 'Cookie: __qca=P0-6517545-1420724809746; __utma=5135382.11011755.14224810.14331180.14180489.7; __utmz=51353782.1420724810.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); uhash=abcdefg12345678; mb_uid2=3211475230616776; CTag61=14338638870; __utmb=513532.9.10.14180489; __utmc=513782; __utmt=1; __utmt_b=1; __utmt_c=1; OX_plg=sl|qt|pm; HIRO_COOKIE=data=&newSession=true&id=2237524293&timestamp=1433506185; HIRO_CLIENT_ID=67751187' -H 'Connection: keep-alive' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache' --data 'action=trade.bump&hash=abcdefg12345678&tradeid=12345678'
Not quite sure why that is happening.
Try using the data or json keyword instead of params. params= encodes the payload into the URL query string, while your curl command's --data sends it as a form-encoded POST body, so the server never sees your fields where it expects them. Use json.dumps(payload) with data= only if the endpoint expects a raw JSON body.
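In other words, a sketch of the fix, with the same payload sent as the form body the way curl's --data does:

import requests

uhash = '1234567abcdefg'
r = requests.post(
    'http://www.target_url.com/api/core',
    cookies={'uhash': uhash},
    # data= sends these form-encoded in the POST body, matching curl's
    # --data; params= would tack them onto the URL query string instead.
    data={'action': 'trade.bump', 'hash': uhash, 'tradeid': '12345678'},
)
print(r.text)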
