Python Requests unable to get the site even after passing headers

I am trying to scrape the e-commerce site Myntra, but the request keeps loading and never returns a result. I have tried passing headers with different user-agents, but it still doesn't work. If a timeout parameter is added, the request simply times out with no success.
Here is the sample code I'm trying to execute:
import requests

url = 'https://www.myntra.com'
s = requests.Session()
headers = {
    'authority': 'www.myntra.com',
    'method': 'GET',
    'path': '/',
    'scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    # 'dnt': '1',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36',
}
response = s.get(url, headers=headers, timeout=10).content
print(response)
If I try to curl the same site, I get a 403 status code with the following output.
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access "http://www.myntra.com/" on this server.<P>
Reference #18.24092e17.1601830542.453a61c2
</BODY>
</HTML>
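
A note on the headers above: authority, method, path, and scheme are HTTP/2 pseudo-headers copied from Chrome DevTools; requests speaks HTTP/1.1 and should not send them as ordinary header fields. Below is a minimal sketch with them stripped, though the 403 page shown is typical of CDN-level bot protection, so a plain requests client may be blocked regardless:

import requests

# Minimal sketch: send only real headers; the authority/method/path/scheme
# entries shown in DevTools are HTTP/2 pseudo-headers and must be dropped.
# The site may still reject non-browser clients at the CDN level.
url = 'https://www.myntra.com'
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.9',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36',
}
with requests.Session() as s:
    response = s.get(url, headers=headers, timeout=10)
    print(response.status_code, len(response.content))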

Related

403 HTML Error Code when sending request to lexica.art using Python

I need to get all image prompts from the https://lexica.art/?q=history request, but the website returns a 403 error code when I try to send a request.
I have already tried setting the User-Agent property and copying all the request properties, but it still isn't working.
Here is my code:
import requests

url = "https://lexica.art/api/trpc/prompts.infinitePrompts?batch=1&input={%220%22%3A{%22json%22%3A{%22text%22%3A%22history%22%2C%22searchMode%22%3A%22images%22%2C%22source%22%3A%22search%22%2C%22cursor%22%3A250}}}"
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.5',
    'Alt-Used': 'lexica.art',
    'cache-control': 'max-age=0',
    'Connection': 'keep-alive',
    'Host': 'lexica.art',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'cross-site',
    'TE': 'trailers',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}
r = requests.get(url, headers=headers)
print(r.status_code)
I used the Selenium library instead, and everything works fine.
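
For reference, a minimal Selenium sketch of that approach (assuming Chrome is installed; Selenium 4 can manage the driver binary itself). A real browser passes the JavaScript checks that block a bare requests call:

from selenium import webdriver

# A real browser session; Selenium 4 downloads a matching chromedriver itself.
driver = webdriver.Chrome()
driver.get('https://lexica.art/?q=history')
html = driver.page_source  # page HTML after JavaScript has run
driver.quit()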

Python application calling Nse India URL getting stuck in Heroku

I have a Python program that calls nseindia.com and tries to fetch the indices data from the URL https://www1.nseindia.com/live_market/dynaContent/live_watch/stock_watch/liveIndexWatchData.json
This code works fine on my system, but when I deploy it to Heroku it gets stuck at the URL call.
import requests

url = "https://www1.nseindia.com/live_market/dynaContent/live_watch/stock_watch/liveIndexWatchData.json"
headers = {
    'authority': 'beta.nseindia.com',
    'cache-control': 'max-age=0',
    'dnt': '1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36',
    'sec-fetch-user': '?1',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,hi;q=0.8',
}
response = requests.get(url=url, headers=headers)
print(response.json())
Any suggestions?
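
A common pattern with NSE endpoints, offered here as an assumption rather than a verified fix for Heroku, is that the JSON URLs only answer sessions that already carry cookies from the homepage; many cloud-provider IP ranges are also blocked outright. A sketch that warms up the session first:

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36',
    'accept-language': 'en-US,en;q=0.9,hi;q=0.8',
}
url = 'https://www1.nseindia.com/live_market/dynaContent/live_watch/stock_watch/liveIndexWatchData.json'

with requests.Session() as s:
    # Visit the homepage first so any anti-bot cookies land on the session.
    s.get('https://www.nseindia.com', headers=headers, timeout=10)
    response = s.get(url, headers=headers, timeout=10)
    print(response.json())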

Need assistance with Post/Get Request

I have an issue with the website https://damas.terna.it/DMSPCAS08.
I am trying either to scrape the data or to fetch the Excel file that is included.
I tried to fetch the Excel file with a POST request.
import requests
from bs4 import BeautifulSoup
import json
import datetime

url = 'https://damas.terna.it/api/Ntc/GetNtc'
headers = {
    'Host': 'damas.terna.it',
    'Connection': 'keep-alive',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="90", "Google Chrome";v="90"',
    'sec-ch-ua-mobile': '?0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Referer': 'https://damas.terna.it/DMSPCAS08',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cookie': '__RequestVerificationToken=5mfiSM2dKte532I8fd3MRdn6nnHbSezkQX29r3fyF2tMegXsvnOInpxy8JvFuDyRVS6pZs03y-NL3CsNsItN1yboc128Kk51yEiuUU0mel41; pers_damas_2019=378972352.20480.0000; rxVisitor=1619766836714T8SRPFLUAH62F1R9K6KG3EKK104BSDFE; dtCookie=7$EC34ED7BFB4A503D379D8D8C69242174|846cd19ce7947380|1; rxvt=1619774462088|1619771198468; dtPC=7$369396404_351h-vIDMPHTCURIGKVKBWVVEWOAMRMKDNWCUH-0e1; DamasNetCookie=F-evmYb7kIS_YTMr2mwuClMB1zazemmhl9vzSXynWeuCII_keDb_jQr4VLSYif9t3juDS6LkOuIXKFfe8pydxSzHPfZzGveNB6xryj2Czp9J1qeWFFT9dYFlRXFWAHuaEIyUQQDJmzWfDBrFCWr309mZoE6hkCKzDtoJgIoor9bed1kQgcdeymAH9lrtrKxwsheaQm2qA-vWWqKjCiruO1VkJ6II_bcrAXU2A_ZPQPznE1_8AEC_AwXmBXETubMQwFAnDXsOLDfEYeQ61TGAupF3d-wz3aZfRs5eCA3kw-en-kpEbS0trwFBQzB-098610GIbIPki9ThVitZ2LN2zza6nn1A8qchqfQC_CZEgu6Jt1glfhHceWS6tvWCuyOEqo2jJpxAajMYXPB6mzlHzX13TiV-jgeFSPehugMAgms_exqocw9w27e4lI5laYZu0rkKkznpZ1mJLOhORwny8-bKa3nRUt7erFv7ul3nLLrgd3FP907tHpTh-qXt1Bmr6OqknDZr_EBN8GY_B2YHV-8hC0AjdqQqpS0xOpp7z_CzzgByTOHSNdeKjVgQfZLQ7amnp71lhxgPeJZvOIl_mIWOr_gWRy_iK6UuzrA3udCTV7bAnUXKB8gX89d9ShQf5tZDxPFchrAQBtdmDChQOA; dtLatC=2; dtSa=true%7CC%7C-1%7CExport%20to%20.xls%7C-%7C1619772685174%7C369396404_351%7Chttps%3A%2F%2Fdamas.terna.it%2FDMSPCAS08%7CTerna%20%5Ep%20NTC%7C1619772662568%7C%7C'
}
parameters = {
    'busDay': "2021-05-01",
    'busDayTill': "2021-05-01",
    'borderDirId': '1',
    'borderDirName': "TERNA-APG"
}
response = requests.post(url, data=parameters, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify())
I am receiving this error:
The parameters dictionary contains an invalid entry for parameter 'parameters' for method 'System.Web.Mvc.ActionResult GetNtc(Damas.Core.Data.DataSource.Data.ParametersModel)' in 'Terna.Web.Controllers.CapacityManagement.NtcController'. The dictionary contains a value of type 'System.Collections.Generic.Dictionary`2[System.String,System.Object]', but the parameter requires a value of type 'Damas.Core.Data.DataSource.Data.ParametersModel'.
Parameter name: parameters
Update: response = requests.post(url, data=json.dumps(parameters), headers=headers) seems to solve the issue.
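
The error says the controller expects a ParametersModel object, i.e. a JSON body rather than form-encoded fields, which is why serializing the dict helps. requests can also do this itself via the json= parameter, which additionally sets the Content-Type: application/json header that data=json.dumps(...) leaves unset:

# Equivalent fix: requests serializes the dict and sets
# Content-Type: application/json automatically.
response = requests.post(url, json=parameters, headers=headers)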

Python requests gives 403 error when requesting from papara.com

I'm trying to get into papara.com using Python. When I make a request it always gives 403 as a response. I got cookies from my browser. Here is my code:
import requests

headers = {
    'authority': 'www.papara.com',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': '__cfruid=64370d0d06d80a1e1a701ae8bee5a4b85c1de1af-1610296629',
}
response = requests.get('https://www.papara.com/', headers=headers)
I tried different user agents and tried removing the cookie from the headers, but it didn't work.
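
The __cfruid cookie indicates the site is behind Cloudflare, so the 403 is most likely its bot challenge rather than a header problem. One option, offered as a sketch and not guaranteed against every Cloudflare configuration, is the third-party cloudscraper package, which wraps requests and attempts to solve the JavaScript challenge:

# pip install cloudscraper
import cloudscraper

# Drop-in replacement for a requests session that tries to satisfy
# Cloudflare's challenge before returning the real response.
scraper = cloudscraper.create_scraper()
response = scraper.get('https://www.papara.com/')
print(response.status_code)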

Python Requests post on a site not working

I am trying to scrape property information from https://www.ura.gov.sg/realEstateIIWeb/resiRental/search.action using Python Requests. Using Chrome, I inspected the POST request and emulated it with requests, using a session to maintain cookies. When I try my code, the site returns "missing parameters in search query", so something is wrong with my request, though it is not obvious what.
Doing some digging, there was one cookie that I did not get when doing requests.get on the search page, so I added that manually. Still no go. I also tried emulating the request headers exactly; it still does not return the correct results.
The only time I have gotten it to work is when I manually copy the cookies from my browser to the Python request object.
import requests

url = 'https://www.ura.gov.sg/realEstateIIWeb/resiRental/submitSearch.action;jsessionid={}'
values = {
    'submissionType': 'pn',
    'from_Date_Prj': 'JAN-2014',
    'to_Date_Prj': 'JAN-2016',
    '__multiselect_projectNameList': '',
    'selectedProjects': '10 SHELFORD',
    '__multiselect_selectedProjects': '',
    'propertyType': 'lp',
    'from_Date': 'JAN-2016',
    'to_Date': 'JAN-2016',
    '__multiselect_postalDistrictList': '',
    '__multiselect_selectedPostalDistricts': ''
}
header1 = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8,nb;q=0.6,no;q=0.4',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Host': 'www.ura.gov.sg',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
}
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8,nb;q=0.6,no;q=0.4',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Host': 'www.ura.gov.sg',
    'Origin': 'https://www.ura.gov.sg',
    'Referer': 'https://www.ura.gov.sg/realEstateIIWeb/resiRental/search.action',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
}
with requests.Session() as r:
    page1 = r.get('https://www.ura.gov.sg/realEstateIIWeb/resiRental/search.action', headers=header1)
    requests.utils.add_dict_to_cookiejar(r.cookies, {'BIGipServerpl-prod_iis_web_v4': '3334383808.20480.0000'})
    page2 = r.post(url.format(r.cookies.get_dict()['JSESSIONID']), data=values, headers=headers)
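
When a form POST comes back as incomplete, one way to narrow the gap is to print the request that was actually prepared and diff it against the browser's POST in DevTools; the response object keeps the sent PreparedRequest on its request attribute:

# Compare what was actually sent with the browser's POST in DevTools.
print(page2.request.url)      # includes the ;jsessionid= path parameter
print(page2.request.headers)  # headers after the session merged in cookies
print(page2.request.body)     # the urlencoded form body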
