I am a novice programmer attempting to access Google Insights using Python. I can access sites which don't require cookies fine, but I can't seem to properly pass the cookies along. The cookies file was exported from Mozilla Firefox and is in the Z: drive, which is also where I'm running Python from.
I'm also pretty sure my code for saving the file could be done better than reading and writing, but I don't know how to do that either. Any help would be appreciated.
import urllib2
import cookielib
import os

url = "http://www.google.com/insights/search/overviewReport?q=eagles%2Ccsco&geo=US&cmpt=q&content=1&export=2"

# Load the cookies exported from Firefox (Netscape/Mozilla cookies.txt format)
cj = cookielib.MozillaCookieJar()
cj.load('cookies6.txt')

# Build an opener that sends those cookies with every request
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
response = opener.open(url)

# Save the response body to disk
output = open('test2.csv', 'wb')
output.write(response.read())
output.close()
I haven't tested your code, however:
As far as I can tell, there seems to be nothing wrong with your code.
I've tried the URL you're searching and had no problems downloading the CSV without any cookies.
In my previous experience with Google, you might be looking at the problem the wrong way: it is not that you don't have the right cookies, but that Google automatically blocks requests from bots. If this is the case, you must replace the User-Agent HTTP header to mimic an actual browser, for example as sketched below. Beware, however, that this is against Google's terms of service, and if you make too many requests per minute Google will block all requests from your IP for about 8 hours.
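For instance, a minimal sketch of setting a browser-like User-Agent on the opener from the question; the UA string here is just an illustrative example, not anything Google prescribes:

import urllib2
import cookielib

cj = cookielib.MozillaCookieJar()
cj.load('cookies6.txt')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# Replace the default "Python-urllib" User-Agent with a browser-like one
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0')]
response = opener.open("http://www.google.com/insights/search/overviewReport?q=eagles%2Ccsco&geo=US&cmpt=q&content=1&export=2")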
I have a question with probably a well-known answer; however, I couldn't articulate it well enough to find answers on Google.
Let's say you are using the developer interface of the Chrome browser (press F12). If you click on the Network tab and go to any website, a lot of files will be queried there, for example images, stylesheets and JSON responses.
I want to parse these JSON responses using Python now.
Thanks in advance!
You can save the network requests to a .har file (JSON format) and analyze that.
In your network tools panel, there is a download button to export as HAR format.
import json

# Load the exported HAR file (it is plain JSON)
with open('myrequests.har') as f:
    network_data = json.load(f)

print(network_data)
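Building on that, here is a rough sketch of pulling just the JSON responses out of the HAR structure. The 'log'/'entries' layout is standard HAR, but whether the response bodies are included (and under which mimeType) depends on how the browser exported the file, so treat the field access below as something to verify against your own export:

import json

with open('myrequests.har') as f:
    har = json.load(f)

for entry in har['log']['entries']:
    content = entry['response']['content']
    mime = content.get('mimeType', '')
    body = content.get('text')
    if 'json' in mime and body:
        data = json.loads(body)  # the parsed JSON response
        print(entry['request']['url'], type(data))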
Or, as Jack Deeth answered, you can make the requests using Python instead of your browser and get the JSON response data that way.
Though, this can sometimes be difficult, depending on the website and the nature of the request(s) (for example, needing to log in, or figuring out how to get all the proper arguments to make the request).
I use requests to get the data, and it comes back as a Python dictionary:
import requests
r = requests.get("url/spotted/with/devtools")
r.json()["keys_observed_in_devtools"]
Perhaps you can try using Selenium.
Maybe the answers on this question can help you.
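If you do go down the Selenium route, note that plain Selenium only gives you the rendered page, not the individual network responses. One option worth trying is the third-party selenium-wire package, which records the requests the browser makes; a rough sketch, assuming the response bodies are uncompressed JSON (compressed bodies would need decoding first):

import json
from seleniumwire import webdriver  # pip install selenium-wire

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

for request in driver.requests:
    response = request.response
    if response and 'json' in (response.headers.get('Content-Type') or ''):
        try:
            data = json.loads(response.body)
            print(request.url, type(data))
        except ValueError:
            pass  # body may be compressed or not valid JSON

driver.quit()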
I am trying to download a torrent file with this code:
url = "https://itorrents.org/torrent/0BB4C10F777A15409A351E58F6BF37E8FFF53CDB.torrent"
r = requests.get(url, allow_redirects=True)
open('test123.torrent', 'wb').write(r.content)
It downloads a torrent file, but when I load it into BitTorrent an error occurs.
It says: Unable to Load, Torrent Is Not Valid Bencoding.
Can anybody please help me resolve this problem? Thanks in advance.
This page uses Cloudflare to prevent scraping. I am sorry to say that bypassing Cloudflare is very hard if you only use requests; the measures Cloudflare takes are updated frequently. The page checks whether your browser supports JavaScript, and if it does not, they won't give you the bytes of the file. That's why you couldn't download it with requests. (You could use r.text to see the response content: it is an HTML page, not a file.)
Under this circumstance, I think you should consider using Selenium.
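As an illustration only (not a guaranteed Cloudflare bypass), here is a rough sketch of driving Chrome with Selenium and letting the real browser download the file into a chosen directory. The preference keys are standard Chrome download settings, but the outcome still depends on whether Cloudflare lets the automated browser through:

import time
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "download.default_directory": "/tmp/torrents",  # assumed target directory
    "download.prompt_for_download": False,
})

driver = webdriver.Chrome(options=options)
driver.get("https://itorrents.org/torrent/0BB4C10F777A15409A351E58F6BF37E8FFF53CDB.torrent")
time.sleep(15)  # crude wait for the JavaScript challenge and the download to finish
driver.quit()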
Bypassing Cloudflare can be a pain, so I suggest using a library that handles it. Please don't forget that your code may break in the future, because Cloudflare changes its techniques periodically. Well, if you use the library, you will just need to update the library (at least you should hope so).
I used a similar library, though only in NodeJS, but I see Python also has something like that - cloudscraper
Example:
import cloudscraper
scraper = cloudscraper.create_scraper() # returns a CloudScraper instance
# Or: scraper = cloudscraper.CloudScraper() # CloudScraper inherits from requests.Session
print(scraper.get("http://somesite.com").text)  # => "<!DOCTYPE html><html><head>..."
Depending on your usage you may need to consider using proxies - CloudFlare can still block you if you send too many requests.
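Since CloudScraper inherits from requests.Session, passing proxies should work the same way as with plain requests; a minimal sketch with placeholder proxy addresses:

import cloudscraper

scraper = cloudscraper.create_scraper()
proxies = {
    "http": "http://10.10.1.10:3128",   # placeholder proxy
    "https": "http://10.10.1.10:1080",  # placeholder proxy
}
response = scraper.get("http://somesite.com", proxies=proxies)
print(response.status_code)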
Also, if you are working with video torrents, you may be interested in Torrent Stream Server. It is a server that downloads and streams video at the same time, so you can watch the video without fully downloading it.
We can do this by adding the cookies to the request headers, as sketched below.
But after some time the cookie expires.
Therefore the only solution is to download it by opening a browser.
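For what it's worth, a minimal sketch of the cookies-in-headers idea with requests; cf_clearance is the cookie Cloudflare typically sets, but the actual cookie names and values have to be copied from your own browser session and will stop working once they expire:

import requests

url = "https://itorrents.org/torrent/0BB4C10F777A15409A351E58F6BF37E8FFF53CDB.torrent"
headers = {
    # Copy the User-Agent and Cookie values from the browser session that passed the challenge
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0",
    "Cookie": "cf_clearance=PASTE-VALUE-FROM-BROWSER",
}
r = requests.get(url, headers=headers, allow_redirects=True)
with open('test123.torrent', 'wb') as f:
    f.write(r.content)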
I recently wanted to extract data from a website that seems to use cookies to grant me access. I do not know very much about those procedures, but apparently this interferes with my method of getting the HTML content of the website via Python and its requests module.
The code I am running to extract the information contains the following lines:
import requests
#...
response = requests.get(url, proxies=proxies)
content = response.text
Where the website I am referring to is http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6675630&tag=1 and proxies is a valid dict of my proxy servers (I tested those settings on websites that seemed to work fine). However, instead of the content of the article on this site, I receive the HTML content of the page that you get when you do not accept cookies in your browser.
As I am not really aware of what the website is actually doing and lack real web development experience, I could not find a solution so far, even if a similar question might have been asked before. Is there any solution to access the content of this website via Python?
import requests

# First request the login page so the server sets its session cookies
startr = requests.get('https://viennaairport.com/login/')
# Then pass those cookies along with the next request
secondr = requests.post('http://xxx/', cookies=startr.cookies)
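A small sketch of an equivalent approach that may read more naturally: let a requests.Session keep the cookies between calls automatically (the URLs are just the placeholders from the snippet above):

import requests

session = requests.Session()
# The session stores any cookies the login page sets...
session.get('https://viennaairport.com/login/')
# ...and sends them automatically on subsequent requests
response = session.post('http://xxx/')
print(response.status_code)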
I would like to download a document I have in my Google Drive authenticating to Google (I only want certain users to be able to access it and do not want to publish it on the web).
I have tried using requests but apparently I am doing something wrong.
From a browser I can download my document going to the address
https://docs.google.com/spreadsheets/d/<document key>/export?format=xls.
So in my python script I do the following:
import os
import requests
import shutil
from requests.auth import HTTPBasicAuth
remote = "https://docs.google.com/spreadsheets/d/<document key>/export?format=xls"
username = os.environ['GOOGLEUSERNAME']
password = os.environ['GOOGLEPASSWORD']
r = requests.get(remote, auth=HTTPBasicAuth(username,password))
if r.status_code == 200:
with open("document.xls","wb") as f:
shutil.copyfileobj(r.raw, f)
However, the resulting document.xls is empty.
What am I doing wrong?
What you are trying to do might actually be possible, but here are some reasons why it will be non-trivial (by no means a complete list):
Google usually blocks user agents that are not browsers (like your Python script) for browser-intended content, for security reasons; you would have to spoof it, which is actually easy.
Multi-factor authentication - you would have to turn that off (easy, but you open yourself up to being hacked...).
Session cookie - aka security cookie (not so easy to get hold of).
What you should do instead
Use the official google-drive API. Also, the Python client library has a nice tutorial and this page describes how to download files from google-drive.
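To give an idea of what the official route looks like, here is a rough sketch using the google-api-python-client; it assumes you have already completed the OAuth flow from the tutorial and have a creds object, and FILE_ID is a placeholder for your spreadsheet's key. A native Google Sheet is exported rather than downloaded directly:

import io
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

# creds comes from the OAuth flow described in the official tutorial (assumed here)
service = build('drive', 'v3', credentials=creds)

FILE_ID = 'PASTE-YOUR-DOCUMENT-KEY'
request = service.files().export_media(
    fileId=FILE_ID,
    mimeType='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')

with io.FileIO('document.xlsx', 'wb') as fh:
    downloader = MediaIoBaseDownload(fh, request)
    done = False
    while not done:
        status, done = downloader.next_chunk()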
If you want to write even less code, then libraries like PyDrive will make your life even easier.
Hope this helps!
I might have a simple solution for you, depending on what exactly the auth requirements are. You are saying
I only want certain users to be able to access it and do not want to
publish it on the web
From this statement alone, it may be sufficient for you to create a "secret" link for your document, and share this among your users. You can then easily retrieve this document automatically, for instance with wget, and specify the format, e.g. csv:
wget -O data.csv "https://docs.google.com/spreadsheets/d/***SHARED-SECRET***/export?format=csv"
Or, in Python (2):
import urllib2
from cookielib import CookieJar
spreadsheet_url = "https://docs.google.com/spreadsheets/d/***SHARED-SECRET***/export?format=csv"
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(CookieJar()))
response = opener.open(spreadsheet_url)
with open("data.csv", "wb") as f:
f.write(response.read())
I am actually using that in production; it works reliably, without faking the user agent.
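In case you are on Python 3, the same idea should translate roughly as follows with requests (a Session takes the place of the cookie-handling opener; the shared-secret URL is the same placeholder as above):

import requests

spreadsheet_url = "https://docs.google.com/spreadsheets/d/***SHARED-SECRET***/export?format=csv"

with requests.Session() as session:  # keeps any cookies Google sets along the way
    response = session.get(spreadsheet_url)
    response.raise_for_status()
    with open("data.csv", "wb") as f:
        f.write(response.content)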
I have a very simple problem and I am absolutely amazed that I haven't seen anything on this specifically. I am attempting to follow best practices for copying a file that is hosted on a web server, going through a proxy server (which does not require auth), using Python 3.
I have done similar things using Python 2.5, but I am really coming up short here. I am trying to make this into a function that I can reuse for future scripts on this network. Any assistance that can be provided would be greatly appreciated.
I have the feeling that my issue lies with attempting to use urllib.request or http.client without any clear documentation on how to incorporate the use of a proxy (without auth).
I've been looking here and pulling out my hair...
http://docs.python.org/3.1/library/urllib.request.html#urllib.request.ProxyHandler
http://docs.python.org/3.1/library/http.client.html
http://diveintopython3.org/http-web-services.html
even this stackoverflow article:
Proxy with urllib2
but in Python 3, urllib2 no longer exists...
here is a function to retrieve a file through an HTTP proxy:
import urllib.request

def retrieve(url, filename):
    # Route plain-http traffic through the proxy (use your proxy's host:port here)
    proxy = urllib.request.ProxyHandler({'http': '127.0.0.1'})
    opener = urllib.request.build_opener(proxy)
    remote = opener.open(url)
    local = open(filename, 'wb')
    # Copy the remote content to disk in small chunks
    data = remote.read(100)
    while data:
        local.write(data)
        data = remote.read(100)
    local.close()
    remote.close()
(error handling is left as an exercise to the reader...)
You can also keep the opener object around for later use, in case you need to retrieve multiple files. The content is written as-is into the file, but it may need to be decoded if a fancy encoding has been used.
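A quick usage sketch, with placeholder values for the proxy address and the files to fetch; shutil.copyfileobj could also replace the manual read/write loop if you prefer:

import shutil
import urllib.request

proxy = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:3128'})  # assumed proxy host:port
opener = urllib.request.build_opener(proxy)

# Reuse the same opener for several downloads
for url, filename in [("http://example.com/a.bin", "a.bin"),
                      ("http://example.com/b.bin", "b.bin")]:
    with opener.open(url) as remote, open(filename, 'wb') as local:
        shutil.copyfileobj(remote, local)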