How do I change the Referer header when using the requests library to make a GET request to a web page? I went through the entire manual but couldn't find it.
According to http://docs.python-requests.org/en/latest/user/advanced/#session-objects , you should be able to do:
import requests

s = requests.Session()
s.headers.update({'referer': my_referer})
s.get(url)
Or just:
requests.get(url, headers={'referer': my_referer})
Your headers dict will be merged with the default/session headers. From the docs:
Any dictionaries that you pass to a request method will be merged with
the session-level values that are set. The method-level parameters
override session parameters.
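A quick way to see that merging in action, using httpbin.org (which just echoes the request headers back); the URLs here are placeholders of mine:
import requests

s = requests.Session()
s.headers.update({'referer': 'https://example.com/'})   # session-level value

# The per-request dict is merged in; matching keys override the session value.
r = s.get('https://httpbin.org/headers',
          headers={'referer': 'https://override.example/'})
print(r.json()['headers'].get('Referer'))  # -> https://override.example/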
Here we rotate the User-Agent together with the Referer:
import random

user_agent_list = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
    "Mozilla/5.0 (iPad; CPU OS 15_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/104.0.5112.99 Mobile/15E148 Safari/604.1"
]
referer_list = [
    'https://stackoverflow.com/',
    'https://twitter.com/',
    'https://www.google.co.in/',
    'https://gem.gov.in/'
]
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': random.choice(user_agent_list),
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9,fr;q=0.8',
    'referer': random.choice(referer_list)
}
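And a usage sketch (my addition) that re-randomizes both values on every request; url_list is a placeholder for whatever you are crawling:
import random
import requests

url_list = ['https://httpbin.org/headers']  # placeholder target(s)

for url in url_list:
    # Pick a fresh User-Agent/Referer pair for each request.
    headers['User-Agent'] = random.choice(user_agent_list)
    headers['referer'] = random.choice(referer_list)
    response = requests.get(url, headers=headers)
    print(response.status_code)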
This is a continuation of my post here.
I have figured out that I require a JSESSIONID in my headers to get the data with Python. But the problem is that the JSESSIONID changes every time. How do I deal with this issue? Below is my code:
import requests

sym_1 = 'NIFTY'
exp_date = '26MAY2022'
headers_1 = {
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36',
    'cookie': 'JSESSIONID=8EB69BB64441BB6906DD7A241B1AAC82;'
}
url_1 = "https://opstra.definedge.com/api/openinterest/optionchain/free/" + sym_1 + "&" + exp_date
text_data = requests.get(url_1, headers=headers_1)
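One common way to deal with a JSESSIONID that changes on every visit is to stop hard-coding it and let a requests.Session collect it for you: hit a page that sets the cookie first, then call the API with the same session. A hedged sketch, reusing headers_1 from above; the assumption that the Opstra landing page issues the cookie is mine and unverified:
import requests

sym_1 = 'NIFTY'
exp_date = '26MAY2022'

s = requests.Session()
s.headers.update(headers_1)      # reuse the headers above
s.headers.pop('cookie', None)    # drop the stale cookie; the session manages its own

# Assumption: loading the landing page sets a fresh JSESSIONID cookie.
s.get('https://opstra.definedge.com/')

url_1 = ('https://opstra.definedge.com/api/openinterest/optionchain/free/'
         + sym_1 + '&' + exp_date)
text_data = s.get(url_1)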
This code posts form data:
headers = {
    'authority': 'ec.ef.com.cn',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
    'accept': '*/*',
    'accept-language': 'en-US,en;q=0.9,ru-RU;q=0.8,ru;q=0.7,uk;q=0.6,en-GB;q=0.5',
}

s = requests.Session()
response = s.post(url, headers=headers)
This seems different from what Chrome sends.
I understand :authority is an HTTP/2 pseudo-header. How do I send it with Python requests?
You could use hyper.contrib.HTTP20Adapter and mount it on the session, like:
from hyper.contrib import HTTP20Adapter
import requests

def getHeaders():
    headers = {
        ":authority": "xxx",
        ":method": "POST",
        ":path": "/login/secure.ashx",
        ":scheme": "https",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"
    }
    return headers

sessions = requests.Session()
# Mount the HTTP/2 adapter for this host so the pseudo-headers are accepted.
sessions.mount('https://xxxx.com', HTTP20Adapter())
r = sessions.post(url_search, data=payload, headers=getHeaders())
This approach is adapted from a Chinese blog post.
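As a side note, hyper has not been maintained for some time. An alternative (not what the answer above uses) is the httpx library, which speaks HTTP/2 natively and derives the pseudo-headers from the URL, so you never set :authority by hand. A minimal sketch with a placeholder URL and payload:
# Requires: pip install "httpx[http2]"
import httpx

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

# http2=True negotiates HTTP/2 over ALPN; :authority/:method/:path/:scheme
# are generated from the URL, so only regular headers are supplied here.
with httpx.Client(http2=True) as client:
    r = client.post('https://example.com/login/secure.ashx',  # placeholder URL
                    data={'user': 'x'},                       # placeholder payload
                    headers=headers)
    print(r.http_version)  # 'HTTP/2' when negotiation succeeds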
This shouldn't be too hard, although I can't figure it out; I'm betting I'm making a dumb mistake.
Here's the code that works on an individual link and returns the Zestimate (the req_headers variable prevents a captcha from being thrown):
import requests
from bs4 import BeautifulSoup

req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

link = 'https://www.zillow.com/homedetails/1404-Clearwing-Cir-Georgetown-TX-78626/121721750_zpid/'
test_soup = BeautifulSoup(requests.get(link, headers=req_headers).content, 'html.parser')
results = test_soup.select_one('h4:contains("Home value")').find_next('p').get_text(strip=True)
print(results)
Here's the code I'm trying to get to work; it should return the Zestimate for each link and add it to a new dataframe column, but I get AttributeError: 'NoneType' object has no attribute 'find_next'. (Also, imagine I have a dataframe column of different Zillow house links.)
req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

for link in df['links']:
    test_soup = BeautifulSoup(requests.get(link, headers=req_headers).content, 'html.parser')
    results = test_soup.select_one('h4:contains("Home value")').find_next('p').get_text(strip=True)
    df['zestimate'] = results
Any help is appreciated.
I had a space before and after the links in my dataframe column. That was it; the code works fine. Just an oversight on my part. Thanks, all.
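For anyone landing here later, a minimal sketch of the fixed loop under that diagnosis (df and req_headers as above; the None guard and the one-value-per-row accumulation are my additions):
import requests
from bs4 import BeautifulSoup

zestimates = []
for link in df['links']:
    link = link.strip()  # the stray spaces around the links were the culprit
    soup = BeautifulSoup(requests.get(link, headers=req_headers).content,
                         'html.parser')
    tag = soup.select_one('h4:contains("Home value")')
    # Guard against pages where the "Home value" section is missing.
    zestimates.append(tag.find_next('p').get_text(strip=True) if tag else None)
df['zestimate'] = zestimates  # assigned once, one value per row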
I got Fiddler to capture a GET request, and I want to re-send the exact same request with Python.
This is the request I captured:
GET https://example.com/api/content/v1/products/search?page=20&page_size=25&q=&type=image HTTP/1.1
Host: example.com
Connection: keep-alive
Search-Version: v3
Accept: application/json
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36
Referer: https://example.com/search/?q=&type=image&page=20
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
You can use the requests module.
The requests module automatically supplies most of the headers for you, so you most likely do not need to include all of them manually.
Since you are sending a GET request, you can use the params parameter to neatly form the query string.
Example:
import requests

BASE_URL = "https://example.com/api/content/v1/products/search"

headers = {
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
}

params = {
    "page": 20,
    "page_size": 25,
    "q": "",  # the capture sends an empty q parameter
    "type": "image"
}

response = requests.get(BASE_URL, headers=headers, params=params)
import requests

headers = {
    'authority': 'stackoverflow.com',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'referer': 'https://stackoverflow.com/questions/tagged/python?sort=newest&page=2&pagesize=15',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,tr-TR;q=0.8,tr;q=0.7',
    'cookie': 'prov=6bb44cc9-dfe4-1b95-a65d-5250b3b4c9fb; _ga=GA1.2.1363624981.1550767314; __qca=P0-1074700243-1550767314392; notice-ctt=4%3B1550784035760; _gid=GA1.2.1415061800.1552935051; acct=t=4CnQ70qSwPMzOe6jigQlAR28TSW%2fMxzx&s=32zlYt1%2b3TBwWVaCHxH%2bl5aDhLjmq4Xr',
}

response = requests.get('https://stackoverflow.com/questions/55239787/how-to-send-a-get-request-with-headers-via-python', headers=headers)
This is an example of how to send a GET request to this page with headers.
You may open an SSL socket (https://docs.python.org/3/library/ssl.html) to example.com:443, write your captured request into the socket as raw bytes, and then read the HTTP response from the socket.
You may also try to use the http.client.HTTPResponse class to read and parse the HTTP response from your socket, but this class is not supposed to be instantiated directly, so some unexpected obstacles could emerge.
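A minimal sketch of that raw-socket approach, using a trimmed version of the capture above (the Connection: close header is my addition so the read loop terminates):
import socket
import ssl

HOST = 'example.com'

# The captured request, byte for byte; the blank line terminates the headers.
request = (
    'GET /api/content/v1/products/search?page=20&page_size=25&q=&type=image HTTP/1.1\r\n'
    'Host: example.com\r\n'
    'Accept: application/json\r\n'
    'Connection: close\r\n'
    '\r\n'
)

context = ssl.create_default_context()
with socket.create_connection((HOST, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=HOST) as tls:
        tls.sendall(request.encode('ascii'))
        response = b''
        while chunk := tls.recv(4096):
            response += chunk

print(response.decode('utf-8', errors='replace')[:500])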
Even though I'm passing headers like the ones below, I'm getting a 416 error: HTTP status code is not handled or not allowed.
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'AlexaToolbar-ALX_NS_PH': 'AlexaToolbar/alx-4.0.1',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Host': 'www.links.com',
    'Referer': 'https://www.links.com/',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
}
Try limiting the headers to the set below and see if it helps:
headers = {
    'Accept-Language': 'en-US,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Host': 'www.mdlinx.com',
    'Referer': 'https://www.mdlinx.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
}
Basically, your site is throwing a 416 and of course Scrapy won't handle that, so you need to work out which headers are causing the issue. The best approach is to use Chrome DevTools: copy the request as cURL and see if that works. This may also be related to there being no cookies.
You need to figure out what works and what doesn't, and then work from there, as in the sketch below.
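For example, a hedged sketch that drops one header at a time and reports the resulting status code (headers is whichever dict you are testing; the URL is a placeholder):
import requests

url = 'https://www.mdlinx.com/'  # replace with the page that returns 416

for name in list(headers):
    trimmed = {k: v for k, v in headers.items() if k != name}
    try:
        status = requests.get(url, headers=trimmed, timeout=10).status_code
    except requests.RequestException as exc:
        status = exc
    print(f'without {name!r}: {status}')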