I want to scrape https://sparrow.eoffice.gov.in/IPRSTATUS/IPRFiledSearch and download the entire set of PDF files that show up in the search results as on a given date (say 01-01-2016). The employee fields are optional; on clicking Search, the site returns a list of all the employees. I am unable to get the POST request to work using Python requests and keep getting a 405 error. My code is below:
from bs4 import BeautifulSoup
import requests

url = "https://sparrow.eoffice.gov.in/IPRSTATUS/IPRFiledSearch"
data = {
    'assessmentYearId': 'vH4pgBbZ8y8rhOFBoM0g7w',
    'empName': '',
    'allotmentYear': '',
    'cadreId': '',
    'iprReportType': 'cqZvyXc--mpmnRNfPp2k7w',
    'userType': 'JgPOADxEXU1jGi53Xa2vGQ',
    '_csrf': '7819ec72-eedf-4290-ba70-6f2b14cc4b79'
}
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Content-Length': '184',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
response = requests.post(url, data=data, headers=headers)
I'm not familiar with the website, but I strongly suggest reading their policy before trying to scrape the content.
In similar scenarios, when a plain POST doesn't give the expected results, using requests.Session usually helps, since it keeps cookies across requests.
The problem lay in my reusing the same CSRF token; it needs to be fetched afresh with every request.
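For reference, a minimal sketch of that fix, assuming the fresh token sits in a hidden input named _csrf on the search page (the other field values are the ones from the question):

from bs4 import BeautifulSoup
import requests

url = "https://sparrow.eoffice.gov.in/IPRSTATUS/IPRFiledSearch"

with requests.Session() as session:
    # GET the page first so the session picks up cookies and a fresh CSRF token
    page = session.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    token_input = soup.find("input", {"name": "_csrf"})  # assumed hidden field name
    csrf_token = token_input["value"] if token_input else ""

    data = {
        'assessmentYearId': 'vH4pgBbZ8y8rhOFBoM0g7w',
        'empName': '',
        'allotmentYear': '',
        'cadreId': '',
        'iprReportType': 'cqZvyXc--mpmnRNfPp2k7w',
        'userType': 'JgPOADxEXU1jGi53Xa2vGQ',
        '_csrf': csrf_token,  # fresh token instead of the hard-coded one
    }
    response = session.post(url, data=data)
    print(response.status_code)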
Hi, I'm trying to build a manga downloader app, and for that reason I'm scraping several sites. However, I have a problem once I get the image URL.
I can see the image using my browser (Chrome), and I can also download it, but I can't do the same using any popular scripting library.
Here is what I've tried:
String imgSrc = "https://cdn.mangaeden.com/mangasimg/aa/aa75d306397d1d11d07d66746dae78a36dc78672ae9e97a08cb7abb4.jpg";
Connection.Response resultImageResponse = Jsoup.connect(imgSrc)
        .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
        .referrer("none")
        .execute();

// output here
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(new java.io.File(String.valueOf(imgPath))));
out.write(resultImageResponse.body()); // resultImageResponse.body() is where the image's contents are.
out.close();
I've also tried this:
URL imgUrl = new URL(imgSrc);
Files.copy(imgUrl.openStream(), imgPath);
Lastly, since I was sure the link works, I tried to download the image using Python, but in this case too I get a 403 error:
import requests

base_url = "https://cdn.mangaeden.com/mangasimg/d0/d08f07d762acda8a1f004677ab2414b9766a616e20bd92de4e2e44f1.jpg"
res = requests.get(base_url)
Googling, I found Unable to get image url in Mangaeden API Angular 6, which seems really close to my problem; however, I don't understand whether I'm setting the referrer wrong or it simply doesn't work at all...
Do you have any tips?
Thank you!
How to fix it?
Add some headers to your request so that you look like a browser; this will give you a 200 response and you can save the file.
Note: this also works in Postman. Just overwrite the hidden User-Agent and you will get the image as the response.
Example (Python):
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
url = "https://cdn.mangaeden.com/mangasimg/d0/d08f07d762acda8a1f004677ab2414b9766a616e20bd92de4e2e44f1.jpg"
res = requests.get(url, headers=headers)
with open("image.jpg", 'wb') as f:
    f.write(res.content)
Someone wrote this answer, but later deleted it, so I will copy the answer in case it can be useful.
AFAIK, you can't download anything else apart from HTML Documents
using jsoup.
If you open up Developer Tools on your browser, you can get the exact
request the browser has made. With Chrome, it's something like this.
The minimal cURL request would in your case be:
curl 'https://cdn.mangaeden.com/mangasimg/aa/aa75d306397d1d11d07d66746dae78a36dc78672ae9e97a08cb7abb4.jpg' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21' \
  --output image.jpg
You can refer to HedgeHog's answer for a sample Python solution; here's how to achieve the same in Java using the new HTTP Client:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse.BodyHandlers;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ImageDownload {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://cdn.mangaeden.com/mangasimg/aa/aa75d306397d1d11d07d66746dae78a36dc78672ae9e97a08cb7abb4.jpg"))
                .header("user-agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .build();

        client.send(request, BodyHandlers.ofFile(Paths.get("image.jpg")));
    }
}
I adopted this solution in my Java code.
Also, one last bit: if the image is downloaded but you can't open it, it is probably due to a 503 error code on the request; in this case you just have to perform the request again. You can recognize broken images because the image reader will say something like
Not a JPEG file: starts with 0x3c 0x68
which is "<h", meaning an HTML error page was saved instead of the image.
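A minimal Python sketch of that retry idea (the same check can be done in Java): re-request until the response actually starts with the JPEG magic bytes (0xff 0xd8) instead of an HTML error page. The retry count and delay are arbitrary choices, not from the answer above.

import time
import requests

def download_jpeg(url, path, headers, attempts=3):
    for _ in range(attempts):
        res = requests.get(url, headers=headers)
        # JPEG files start with 0xff 0xd8; an HTML error page starts with "<h" (0x3c 0x68)
        if res.status_code == 200 and res.content[:2] == b'\xff\xd8':
            with open(path, 'wb') as f:
                f.write(res.content)
            return True
        time.sleep(1)  # likely a 503/HTML page, wait and try again
    return False

# usage, with the user-agent header from the answer above
download_jpeg(
    "https://cdn.mangaeden.com/mangasimg/aa/aa75d306397d1d11d07d66746dae78a36dc78672ae9e97a08cb7abb4.jpg",
    "image.jpg",
    {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21"},
)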
I tried running this Python script, which uses the BeautifulSoup and requests modules:
from bs4 import BeautifulSoup as bs
import requests

url = 'https://udemyfreecourses.org/'
headers = {'UserAgent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'}
soup = bs(requests.get(url, headers=headers).text, 'lxml')
But when I run this line:
print(soup.get_text())
it doesn't scrape the text data; instead, it returns this output:
Not Acceptable!Not Acceptable!An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.
I even used headers when requesting the webpage so it would look like a normal browser, but I'm still getting this message, which prevents me from accessing the real webpage.
Note: the webpage works perfectly in the browser directly, but it doesn't show much info when I try to scrape it.
Is there any way, other than the headers I used, to send a valid request to the website and get past this security called Mod_Security?
Any help would be much appreciated, thanks.
EDIT: the dash in "User-Agent" is essential.
Following this answer: https://stackoverflow.com/a/61968635/8106583
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}
Your User-Agent is the problem; this User-Agent works for me.
Also: your IP might be blocked by now :D
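For reference, a minimal sketch of the question's script with only the header key corrected (the User-Agent string is the one from this answer):

from bs4 import BeautifulSoup as bs
import requests

url = 'https://udemyfreecourses.org/'
headers = {
    # 'User-Agent' with the dash, not 'UserAgent'
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}
soup = bs(requests.get(url, headers=headers).text, 'lxml')
print(soup.get_text())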
I'm trying to download some pages as PDF files. However, the pages require me to log in, so I simply sent some cookies along with my request (using the requests module). This worked. However, I'm not sure how to send cookies with PDFKit to achieve the same thing.
Here is the code I tried. I also tried to incorporate headers (to prevent a 403 error), but it didn't work. I can't seem to find this in the documentation either. Does anyone know how I can send cookies to download the pages?
import pdfkit
url = r'www.someurl.com'
cookies = {
    "cookie1": "cookie"
}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
config = pdfkit.configuration(wkhtmltopdf="C:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe")
pdfkit.from_url(url, 'out.pdf', configuration=config, options=cookies)
According to the pdfkit project description, you can set cookies using this approach:
options = {'cookie': [('cookie-name1', 'cookie-value1'),
('cookie-name2', 'cookie-value2')]}
pdfkit.from_url('http://google.com', 'out.pdf', options=options)
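Putting that together with the questioner's setup, a minimal sketch might look like this. The cookie names and values are placeholders, and the custom-header option is wkhtmltopdf's general mechanism for extra request headers, used here on the assumption that it helps with the 403:

import pdfkit

url = 'https://www.someurl.com'  # placeholder URL from the question
config = pdfkit.configuration(wkhtmltopdf="C:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe")
options = {
    # each (name, value) pair becomes a --cookie argument to wkhtmltopdf
    'cookie': [
        ('cookie1', 'cookie'),             # placeholder from the question
        ('sessionid', 'your-session-id'),  # hypothetical login cookie
    ],
    # extra request headers go through --custom-header
    'custom-header': [
        ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'),
    ],
}
pdfkit.from_url(url, 'out.pdf', configuration=config, options=options)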
I am a rookie just learning Python; however, for our Bachelor's thesis we need the data from the following website (it's just municipal financial data from the Latvian government):
https://e2.kase.gov.lv/pub5.5_pasv/code/pub.php?module=pub
So far I have done the following:
1) Got frustrated that this is not a simple HTML page and that it has this 'interactive' header (sorry, my knowledge of how to interact with it is very limited).
2) By using Chrome dev tools and the Network tab, I found out that I can run the following URL to 'request' the period, municipality, financial statement, etc. that I need: https://e2.kase.gov.lv/pub5.5_pasv/code/ajax.php?module=pub&job=getDoc&period_id=1626&org_id=2542&blank_id=200079&currency_id=2&editable=0&type=HTML
3) Created basic Python code to get the HTML from that URL (see below).
4) Found out that it returns empty data. I thought this was a bug; however, the response code is 200, which as I understand means the request was successful.
5) Tested this URL in different browsers, and lo and behold: it works in Chrome, but in Microsoft Edge it returns an empty blank page.
6) Read somewhere that I have to 'introduce' myself to the server, and tried to set headers and a User-Agent both manually and with the fake_useragent library using a Chrome User-Agent. Yet it still doesn't work.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get("https://e2.kase.gov.lv/pub5.5_pasv/code/ajax.php?module=pub&job=getDoc&period_id=1626&org_id=2542&blank_id=200079&currency_id=2&editable=1&type=HTML", headers=headers)
print(r.text)
So I'm stuck at point 6. The URL works well in Chrome but not in Edge, and it seems that my Python code gets the same blank page the Edge browser gets, with no data whatsoever.
I would appreciate it a lot if anyone could at least point me in the right direction or give me some reading material, because right now I have no idea how to configure my Python code to reproduce the HTML output from Chrome, or whether this is even a legitimate (or good) way to approach this problem and obtain the data.
EDIT: Sorry guys, I found out that it is not possible to access this website from outside Latvia; however, I have found a solution (see below).
Solved the problem.
Previously, when imitating a browser, I only used the following header:
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Mobile Safari/537.36'
}
Turns out I had to include all of the request headers the browser sends to the server (found through Chrome dev tools), like so:
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Connection': 'keep-alive',
    'Cookie': 'Cookie; Cookie',
    'DNT': '1',
    'Host': 'e2.kase.gov.lv',
    'Referer': 'https://e2.kase.gov.lv/pub5.5_pasv/code/pub.php?module=pub',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Mobile Safari/537.36'
}
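For completeness, a minimal sketch of how that header set is then used; the ajax.php URL is the one from point 2 of the question, and the 'Cookie; Cookie' value above is a placeholder you would replace with your own browser session's cookie:

import requests

ajax_url = ("https://e2.kase.gov.lv/pub5.5_pasv/code/ajax.php"
            "?module=pub&job=getDoc&period_id=1626&org_id=2542"
            "&blank_id=200079&currency_id=2&editable=0&type=HTML")

# 'headers' is the full dict shown above, including the copied Cookie value
r = requests.get(ajax_url, headers=headers)
print(r.text)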
I am trying to scrape a website where I have to get to the right page using a POST request.
Below are the different screens showing how I found the headers and payload that I needed to use in my request:
1) Here is the page: it is a list of economic indicators.
2) It is possible to select which countries' indicators are displayed using the filter on the right-hand side of the screen.
3) Clicking the "Apply" button sends a POST request to the site that refreshes the page to show only the information for the ticked boxes. Here is a screen capture showing the elements of the form sent in the POST request.
But if I try to make this POST request with Python requests using the following code (see below), it seems that the form is not processed and the page returned is simply the default one.
import lxml.html
import requests

payload = {
    'country[]': 5,
    'limit_from': '0',
    'submitFilters': '1',
    'timeFilter': 'timeRemain',
    'currentTab': 'today',
    'timeZone': '55'
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'www.investing.com',
    'Origin': 'https://www.investing.com',
    'Referer': 'https://www.investing.com/economic-calendar/',
    'Content-Length': '94',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Cookie': 'adBlockerNewUserDomains=1505902229; __qca=P0-734073995-1505902265195; __gads=ID=d69b337b0f60d8f0:T=1505902254:S=ALNI_MYlYKXUUbs8WtYTEO2fN9O_q9oykA; cookieConsent=was-set; travelDistance=4; editionPostpone=1507424197769; PHPSESSID=v9q2deffu2n0b9q07t3jkgk4a4; StickySession=id.71595783179.419www.investing.com; geoC=GB; gtmFired=OK; optimizelySegments=%7B%224225444387%22%3A%22gc%22%2C%224226973206%22%3A%22direct%22%2C%224232593061%22%3A%22false%22%2C%225010352657%22%3A%22none%22%7D; optimizelyEndUserId=oeu1505902244597r0.8410692836488942; optimizelyBuckets=%7B%228744291438%22%3A%228731763165%22%2C%228785438042%22%3A%228807365450%22%7D; nyxDorf=OT5hY2M1P2E%2FY24xZTE3YTNoMG9hYmZjPDdlYWFnNz0wNjNvYW5kYWU6PmFvbDM6Y2Y0MDAwYTk1MzdpYGRhPDk2YTNjYT82P2E%3D; billboardCounter_1=1; _ga=GA1.2.1460679521.1505902261; _gid=GA1.2.655434067.1508542678'
}

g = requests.post("https://www.investing.com/economic-calendar/", data=payload, headers=headers)
html = lxml.html.fromstring(g.text)
tr = html.xpath("//table[@id='economicCalendarData']//tr")
for t in tr[4:]:
    print(t.find(".//td[@class='left flagCur noWrap']/span").attrib["title"])
This is visible because if, for instance, I select only country "5" (the USA), post the request, and look at the countries present in the result page, I see other countries as well.
Does anyone know what I am doing wrong with that POST request?
As your own screenshot shows, the site actually posts to the URL
https://www.investing.com/economic-calendar/Service/getCalendarFilteredData
whereas you're posting directly to
https://www.investing.com/economic-calendar/
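A minimal sketch of posting the same payload to that endpoint instead; the assumption (not confirmed here) is that it answers the XHR with JSON whose 'data' field holds the table rows as an HTML fragment, so check the actual response shape in the Network tab before relying on it:

import lxml.html
import requests

service_url = "https://www.investing.com/economic-calendar/Service/getCalendarFilteredData"
# 'payload' and 'headers' are the dicts from the question above
g = requests.post(service_url, data=payload, headers=headers)

# If the endpoint returns JSON, the row markup is assumed to sit under 'data';
# otherwise fall back to the raw body.
try:
    body = g.json().get('data', '')
except ValueError:
    body = g.text

# Wrap the fragment in a table so the HTML parser keeps the <tr> elements.
html = lxml.html.fromstring("<table>%s</table>" % body)
for t in html.xpath("//tr"):
    span = t.find(".//td[@class='left flagCur noWrap']/span")
    if span is not None:
        print(span.attrib.get("title"))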