Provide network data from Firebug to Python

Is there a way to copy the network data from Firebug (for example POST headers) and put them into Python code so I don't need to write each header by myself?
There is an option Copy Request Headers, but it is not in the right format for Python.
In other words, I don't want to obtain this:
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:37.0) Gecko/20100101 Firefox/37.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
because then I have to convert the format into a dictionary or something similar, but rather this:
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:37.0) Gecko/20100101 Firefox/37.0"
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
It is not necessary to get it in Python's dictionary format. The only thing I want is to automatically use this data in Python.

Post-process the headers you've copied from Firefox: split each line of the input string on ": " and build a dictionary. Example:
In [1]: headers = """
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:37.0) Gecko/20100101 Firefox/37.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
"""
In [2]: dict(item.split(": ", 1) for item in headers.splitlines() if item)
Out[2]:
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:37.0) Gecko/20100101 Firefox/37.0'}
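Once you have the dictionary, you can pass it directly to requests. A minimal sketch (the httpbin.org URL is only an illustration, not part of the original question):
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:37.0) Gecko/20100101 Firefox/37.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

# httpbin.org echoes the request headers back, so it is easy to verify
# that the parsed headers were actually sent.
response = requests.get("https://httpbin.org/headers", headers=headers)
print(response.json())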

Related

Python and CloudFlare issue

I hit a wall trying to make a request to https://1stkissmanga.io/ due to CloudFlare protection. I prepared the header and cookie (which I read from Firefox), but still without success. What is weird is that I can fetch this site properly with wget. This is the part I don't understand: wget doesn't have any CloudFlare bypass mechanisms, so if it works from wget, shouldn't it also work from Python requests?
Of course, with wget I still need to supply the cookie value; otherwise wget hits CloudFlare as well.
With wget (successful result):
wget "https://1stkissmanga.io/" -U "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0" --header="Cookie: __cf_bm=<some long string with dots and other special characters>"
With Python:
import requests

headers = {"user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0"}
cookies = {"__cf_bm": "<some long string with dots and other special characters>"}
url = "https://1stkissmanga.io/"
res = requests.get(url, headers=headers, cookies=cookies)
I also tried to put the cookie into the header like
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0",
    "cookie": "__cf_bm=<some long string with dots and other special characters>",
}
and do res = requests.get(url, headers=headers), but the result is the same. Whatever I do, the request always stops at the CloudFlare protection.
Not sure what to do next; a CloudFlare proxy is out of the question for now.
You should use a string as the value of the "Cookie" key, not a dict. It should look like this: {"user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0", "Cookie": "cf_clearance=<some hash here>; cf_chl_2=<some hash here>; cf_chl_prog=x11; XSRF-TOKEN=<some hash here>; laravel_session=<some hash here>; __cf_bm=<some hash here>;"}
The complete code looks like this, but remember that the clearance only works for 10-15 minutes; after that you will need to take a fresh cookie from the browser.
import requests

url = "https://1stkissmanga.io/"
h = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0",
    "Cookie": "cf_clearance=<some hash here>; cf_chl_2=<some hash here>; cf_chl_prog=x11; XSRF-TOKEN=<some hash here>; laravel_session=<some hash here>; __cf_bm=<some hash here>;",
}
res = requests.get(url, headers=h)

Can't download file over HTTP - session is hanging

A friend told me that when he entered the following website:
http://tatoochange.com/watch/OGgmlnav-joe-gould-s-secret/vip.html
he noticed that when he played the video, Fiddler showed the path of the file (http://85.217.223.24/vids/joe_goulds_secret_2000.mp4).
He then tried to download it from the browser, but he received an error.
I checked the GET request with Burp when playing the video:
GET /vids/joe_goulds_secret_2000.mp4 HTTP/1.1
Host: 85.217.223.24
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0
Accept: video/webm,video/ogg,video/*;q=0.9,application/ogg;q=0.7,audio/*;q=0.6,*/*;q=0.5
Accept-Language: en-US,en;q=0.5
Referer: http://entervideo.net/watch/3accec760b23ad4
Range: bytes=0-
Connection: close
I converted it to a Python script:
import requests
session = requests.Session()
headers = {"Accept":"video/webm,video/ogg,video/*;q=0.9,application/ogg;q=0.7,audio/*;q=0.6,*/*;q=0.5","User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0","Referer":"http://entervideo.net/watch/3accec760b23ad4","Connection":"close","Accept-Language":"en-US,en;q=0.5","Range":"bytes=0-"}
response = session.get("http://85.217.223.24/vids/joe_goulds_secret_2000.mp4", headers=headers)
print("Status code: %i" % response.status_code)
print("Response body: %s" % response.content)
When I run it, it just hangs.
I have no idea whether it is downloading anything or not.
My question is, why can't I download the file from the browser just by accessing its URL?
Second, even when I use the script, which doesn't raise any error, it hangs...
Using session.get this way is not advisable for downloading a large file; it is primarily meant for a web call that receives a JSON or XML response. To download large files you should follow the method shown in this thread:
Download large file in python with requests
I managed to do it.
Note that the response status was 206 (Partial Content), which is expected because the request sends a Range: bytes=0- header.
The solution:
import requests

def download():
    headers = {
        "Accept": "video/webm,video/ogg,video/*;q=0.9,application/ogg;q=0.7,audio/*;q=0.6,*/*;q=0.5",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
        "Referer": "http://entervideo.net/watch/3accec760b23ad4",
        "Connection": "close",
        "Accept-Language": "en-US,en;q=0.5",
        "Range": "bytes=0-",
    }
    get_response = requests.get("http://85.217.223.24/vids/joe_goulds_secret_2000.mp4",
                                headers=headers, stream=True)
    # file_name = url.split("/")[-1]
    file_name = r'c:\tmp\joe_goulds_secret_2000.mp4'
    with open(file_name, 'wb') as f:
        count = 0
        for chunk in get_response.iter_content(chunk_size=1024):
            print('chunk: ' + str(count))
            count += 1
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)

download()

How to send a GET request with headers via python

I used Fiddler to capture a GET request, and I want to re-send the exact request with Python.
This is the request I captured:
GET https://example.com/api/content/v1/products/search?page=20&page_size=25&q=&type=image HTTP/1.1
Host: example.com
Connection: keep-alive
Search-Version: v3
Accept: application/json
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36
Referer: https://example.com/search/?q=&type=image&page=20
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
You can use the requests module.
The requests module automatically supplies most of the headers for you so you most likely do not need to manually include all of them.
Since you are sending a GET request, you can use the params parameter to neatly form the query string.
Example:
import requests
BASE_URL = "https://example.com/api/content/v1/products/search"
headers = {
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
}
params = {
    "page": 20,
    "page_size": 25,
    "type": "image"
}
response = requests.get(BASE_URL, headers=headers, params=params)
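As a quick check (illustrative output, not part of the original answer), requests encodes the params dict into the query string for you:
print(response.url)
# e.g. https://example.com/api/content/v1/products/search?page=20&page_size=25&type=image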
import requests
headers = {
    'authority': 'stackoverflow.com',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'referer': 'https://stackoverflow.com/questions/tagged/python?sort=newest&page=2&pagesize=15',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,tr-TR;q=0.8,tr;q=0.7',
    'cookie': 'prov=6bb44cc9-dfe4-1b95-a65d-5250b3b4c9fb; _ga=GA1.2.1363624981.1550767314; __qca=P0-1074700243-1550767314392; notice-ctt=4%3B1550784035760; _gid=GA1.2.1415061800.1552935051; acct=t=4CnQ70qSwPMzOe6jigQlAR28TSW%2fMxzx&s=32zlYt1%2b3TBwWVaCHxH%2bl5aDhLjmq4Xr',
}
response = requests.get('https://stackoverflow.com/questions/55239787/how-to-send-a-get-request-with-headers-via-python', headers=headers)
This is an example of how to send a GET request to this page with headers.
You may open an SSL socket (https://docs.python.org/3/library/ssl.html) to example.com:443, write your captured request into the socket as raw bytes, and then read the HTTP response from the socket.
You may also try to use the http.client.HTTPResponse class to read and parse the HTTP response from your socket, but this class is not supposed to be instantiated directly, so some unexpected obstacles could emerge.
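A rough sketch of the raw-socket approach (assumptions: host and path come from the captured request above, and Connection: close is substituted for the captured keep-alive so the read loop terminates when the server closes the connection):
import socket
import ssl

# Raw bytes of the captured request; header lines are separated by CRLF
# and the request ends with an empty line.
raw_request = (
    b"GET /api/content/v1/products/search?page=20&page_size=25&q=&type=image HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"Accept: application/json\r\n"
    b"Connection: close\r\n"
    b"\r\n"
)

context = ssl.create_default_context()
with socket.create_connection(("example.com", 443)) as sock:
    with context.wrap_socket(sock, server_hostname="example.com") as tls:
        tls.sendall(raw_request)
        response = b""
        while True:
            chunk = tls.recv(4096)
            if not chunk:  # server closed the connection
                break
            response += chunk

print(response.decode("utf-8", errors="replace"))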

Avoiding detection while scraping

I am trying to scrape the website lacentrale.fr with Scrapy, but even though I rotate my user agents and IP addresses (thanks to Tor), the website detects my robot and sends me false values.
Please can you check my code used in the middlewares and settings and tell me if something is wrong?
Code in middlewares:
from tutorial.settings import *  # USER_AGENT_LIST
import random
from stem.control import Controller
from toripchanger import TorIpChanger
from stem import Signal

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        ua = random.choice(USER_AGENT_LIST)
        if ua:
            request.headers.setdefault('User-Agent', ua)

def _set_new_ip():
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password='')
        controller.signal(Signal.NEWNYM)

ip_changer = TorIpChanger(reuse_threshold=10)

class ProxyMiddleware(object):
    _requests_count = 0

    def process_request(self, request, spider):
        self._requests_count += 1
        if self._requests_count > 10:
            self._requests_count = 0
            ip_changer.get_new_ip()
            print("New Tor connection processed")
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])
Code used in settings:
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1
RANDOMIZE_DOWNLOAD_DELAY = True

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'tutorial.middlewares.RandomUserAgentMiddleware': 400,
    # disabled to avoid "IOError: Not a gzipped file" exceptions
    'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': None,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'tutorial.middlewares.ProxyMiddleware': 100,
}

USER_AGENT_LIST = [
    {'User-agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'},
    {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1'},
    {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML like Gecko) Chrome/28.0.1469.0 Safari/537.36'},
    {'User-agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML like Gecko) Chrome/28.0.1469.0 Safari/537.36'},
    {'User-agent': 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'},
    {'User-agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36'},
    {'User-agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36'},
    {'User-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'},
    {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:35.0) Gecko/20100101 Firefox/35.0'},
    {'User-agent': 'Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0'},
]
EDIT II
It seems that Tor uses the same IP each time and there is no rotation of the IP address. I don't know what to change in my middlewares file to resolve this. Any ideas?
You may be detected based on several factors, including whether your scraper downloads/runs the JavaScript files. If that's the case, you may need to use a tool like Selenium in conjunction with Python/Scrapy to further pretend to be a normal human user.
This Stack Overflow post offers some help in getting started:
https://stackoverflow.com/a/17979285/9693088
I don't think I can offer much guidance on what may be going wrong with your Tor setup.
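If JavaScript execution turns out to be the missing piece, a minimal sketch of the Selenium route looks like this (illustrative only; it assumes Firefox and geckodriver are installed, and handing the HTML to Scrapy's selectors is just one possible integration):
from scrapy.selector import Selector
from selenium import webdriver

# Render the page in a real browser so any JavaScript checks can run.
driver = webdriver.Firefox()
driver.get("https://www.lacentrale.fr/")
html = driver.page_source  # DOM after JavaScript has executed
driver.quit()

# Hand the rendered HTML to Scrapy's selector machinery for parsing.
sel = Selector(text=html)
print(sel.css("title::text").get())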

HTTP response is not allowed or not handled in Python using Scrapy?

Even though I'm passing headers like below, I'm getting a 416 error: HTTP status code is not handled or not allowed.
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'AlexaToolbar-ALX_NS_PH': 'AlexaToolbar/alx-4.0.1',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Host': 'www.links.com',
    'Referer': 'https://www.links.com/',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
}
Try limiting the headers to the ones below and see if it helps:
headers = {
    'Accept-Language': 'en-US,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Host': 'www.mdlinx.com',
    'Referer': 'https://www.mdlinx.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
}
Basically, your site is throwing a 416 and of course Scrapy won't handle that by default. So you need to work out which headers are causing the issue. The best approach is to use Chrome dev tools, copy the request as cURL, and see if that works. It may also be related to no cookies being present.
You need to figure out what works and what doesn't, and work it out from there.
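If you want to inspect the 416 response instead of having Scrapy discard it, you can whitelist the status code on the spider (a hedged sketch; the spider name and URL are illustrative):
import scrapy

class DebugSpider(scrapy.Spider):
    name = "debug"
    start_urls = ["https://www.mdlinx.com/"]
    # Let 416 responses reach the callback instead of being filtered out.
    handle_httpstatus_list = [416]

    def parse(self, response):
        self.logger.info("Got status %s", response.status)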