So i am calling out urls i.e. "domain.xyz" from a .csv file. The purpose is to use the requests module to GET/HEAD responses. Using this code as a work around to add a string.
x = "http://www."+str('domain.com')
response = requests.head(x)
The problem here is not all "domain.com" entries in my .csv start with standard http://www.. What's the best way to complete the URL before using the requests module?
p.s. I am looking for something similar to what Chromes address bar does to complete a url. For instance when we enter 'abc.com'. it completes it to "http://www.abc.xyz".
Related
I found a tutorial and I'm trying to run this script, I did not work with python before.
tutorial
I've already seen what is running through logging.debug, checking whether it is connecting to google and trying to create csv file with other scripts
from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get
import csv
def scrape_run():
with open('/Users/Work/Desktop/searches.txt') as searches:
for search in searches:
userQuery = search
raw = get("https://www.google.com/search?q=" + userQuery).text
page = fromstring(raw)
links = page.cssselect('.r a')
csvfile = '/Users/Work/Desktop/data.csv'
for row in links:
raw_url = row.get('href')
title = row.text_content()
if raw_url.startswith("/url?"):
url = parse_qs(urlparse(raw_url).query)['q']
csvRow = [userQuery, url[0], title]
with open(csvfile, 'a') as data:
writer = csv.writer(data)
writer.writerow(csvRow)
print(links)
scrape_run()
The TL;DR of this script is that it does three basic functions:
Locates and opens your searches.txt file.
Uses those keywords and searches the first page of Google for each
result.
Creates a new CSV file and prints the results (Keyword, URLs, and
page titles).
Solved
Google add captcha couse i use to many request
its work when i use mobile internet
Assuming the links variable is full and contains data - please verify.
if empty - test the api call itself you are making, maybe it returns something different than you expected.
Other than that - I think you just need to tweak a little bit your file handling.
https://www.guru99.com/reading-and-writing-files-in-python.html
here you can find some guidelines regarding file handling in python.
in my perspective, you need to make sure you create the file first.
start on with a script which is able to just create a file.
after that enhance the script to be able to write and append to the file.
from there on I think you are good to go and continue with you're script.
other than that I think that you would prefer opening the file only once instead of each loop, it could mean much faster execution time.
let me know if something is not clear.
I'm trying to download a file from a website but it looks like it is detecting urllib and doesn't allow it to download (I'm getting the error "urllib.error.HTTPError: HTTP Error 403: Forbidden").
How can I fix this? I found on the internet that I had to add a header but the answers weren't going the way I need (It was using Request and I didn't find anything about an argument to add in urllib.request.urlretrieve() for a header)
I'm using Python 3.6
Here's the code:
import urllib.request
filelink = 'https://randomwebsite.com/changelog.txt'
filename = filelink.rsplit('/', 1)
filename = str(filename[1])
urllib.request.urlretrieve(filelink, filename)
I want to include a header to give me the permission to download the file but I need to keep a line like the last one, using the two variables (one for the link of the file and one for the name that depends of the link).
Thanks already for your help !
Check the below link:
https://stackoverflow.com/a/7244263/5903276
The most correct way to do this would be to use the urllib.request.urlopen function to return a file-like object that represents an HTTP response and copy it to a real file using shutil.copyfileobj.
I have a file, gather.htm which is a valid HTML file with header/body and forms. If I double click the file on the Desktop, it properly opens in a web browser, auto-submits the form data (via <SCRIPT LANGUAGE="Javascript">document.forms[2].submit();</SCRIPT>) and the page refreshes with the requested data.
I want to be able to have Python make a requests.post(url) call using gather.htm. However, my research and my trail-and-error has provided no solution.
How is this accomplished?
I've tried things along these lines (based on examples found on the web). I suspect I'm missing something simple here!
myUrl = 'www.somewhere.com'
filename='/Users/John/Desktop/gather.htm'
f = open (filename)
r = requests.post(url=myUrl, data = {'title':'test_file'}, files = {'file':f})
print r.status_code
print r.text
And:
htmfile = 'file:///Users/John/Desktop/gather.htm'
files = {'file':open('gather.htm')}
webbrowser.open(url,new=2)
response = requests.post(url)
print response.text
Note that in the 2nd example above, the webbrowser.open() call works correctly but the requests.post does not.
It appears that everything I tried failed in the same way - the URL is opened and the page returns default data. It appears the website never receives the gather.htm file.
Since your request is returning 200 OK, there is nothing wrong getting your post request to the server. It's hard to give you an exact answer, but the problem lies with how the server is handling the request. Either your post request is being formatted in a way that the server doesn't recognise, or the server hasn't been set up to deal with them at all. If you're managing the website yourself, some additional details would help.
Just as a final check, try the following:
r = requests.post(url=myUrl, data={'title':'test_file', 'file':f})
I try to upload a file on a random website using Python and HTTP requests. For this, I use the handy library named Requests.
According to the documentation, and some answers on StackOverflow here and there, I just have to add a files parameter in my application, after studying the DOM of the web page.
The method is simple:
Look in the source code for the URL of the form ("action" attribute);
Look in the source code for the "name" attribute of the uploading
button ;
Look in the source code for the "name" and "value" attributes of the submit form button ;
Complete the Python code with the required parameters.
Sometimes this works fine. Indeed, I managed to upload a file on this site : http://pastebin.ca/upload.php
After looking in the source code, the URL of the form is upload.php, the buttons names are file and s, the value is Upload, so I get the following code:
url = "http://pastebin.ca/upload.php"
myFile = open("text.txt","rb")
r = requests.get(url,data={'s':'Upload'},files={'file':myFile})
print r.text.find("The uploaded file has been accepted.")
# ≠ -1
But now, let's look at that site: http://www.pictureshack.us/
The corresponding code is as follows:
url = "http://www.pictureshack.us/index2.php"
myFile = open("text.txt","rb")
r = requests.get(url,data={'Upload':'upload picture'},files={'userfile':myFile})
print r.text.find("Unsupported File Type!")
# = -1
In fact, the only difference I see between these two sites is that for the first, the URL where the work is done when submitting the form is the same as the page where the form is and where the files are uploaded.
But that does not solve my problem, because I still do not know how to submit my file in the second case.
I tried to make my request on the main page instead of the .php, but of course it does not work.
In addition, I have another question.
Suppose that some form elements do not have "name" attribute. How am I supposed to designate it at my request with Python?
For example, this site: http://imagesup.org/
The submitting form button looks like this: <input type="submit" value="Héberger !">
How can I use it in my data parameters?
The forms have another component you must honour: the method attribute. You are using GET requests, but the forms you are referring to use method="post". Use requests.post to send a POST request.
I want to download some files from this site: http://www.emuparadise.me/soundtracks/highquality/index.php
But I only want to get certain ones.
Is there a way to write a python script to do this? I have intermediate knowledge of python
I'm just looking for a bit of guidance, please point me towards a wiki or library to accomplish this
thanks,
Shrub
Here's a link to my code
I looked at the page. The links seem to redirect to another page, where the file is hosted, clicking which downloads the file.
I would use mechanize to follow the required links to the right page, and then use BeautifulSoup or lxml to parse the resultant page to get the filename.
Then it's a simple matter of opening the file using urlopen and writing its contents out into a local file like so:
f = open(localFilePath, 'w')
f.write(urlopen(remoteFilePath).read())
f.close()
Hope that helps
Make a url request for the page. Once you have the source, filter out and get urls.
The files you want to download are urls that contain a specific extension. It is with this that you can do a regular expression search for all urls that match your criteria.
After filtration, then do a url request for each matched url's data and write it to memory.
Sample code:
#!/usr/bin/python
import re
import sys
import urllib
#Your sample url
sampleUrl = "http://stackoverflow.com"
urlAddInfo = urllib.urlopen(sampleUrl)
data = urlAddInfo.read()
#Sample extensions we'll be looking for: pngs and pdfs
TARGET_EXTENSIONS = "(png|pdf)"
targetCompile = re.compile(TARGET_EXTENSIONS, re.UNICODE|re.MULTILINE)
#Let's get all the urls: match criteria{no spaces or " in a url}
urls = re.findall('(https?://[^\s"]+)', data, re.UNICODE|re.MULTILINE)
#We want these folks
extensionMatches = filter(lambda url: url and targetCompile.search(url), urls)
#The rest of the unmatched urls for which the scrapping can also be repeated.
nonExtMatches = filter(lambda url: url and not targetCompile.search(url), urls)
def fileDl(targetUrl):
#Function to handle downloading of files.
#Arg: url => a String
#Output: Boolean to signify if file has been written to memory
#Validation of the url assumed, for the sake of keeping the illustration short
urlAddInfo = urllib.urlopen(targetUrl)
data = urlAddInfo.read()
fileNameSearch = re.search("([^\/\s]+)$", targetUrl) #Text right before the last slash '/'
if not fileNameSearch:
sys.stderr.write("Could not extract a filename from url '%s'\n"%(targetUrl))
return False
fileName = fileNameSearch.groups(1)[0]
with open(fileName, "wb") as f:
f.write(data)
sys.stderr.write("Wrote %s to memory\n"%(fileName))
return True
#Let's now download the matched files
dlResults = map(lambda fUrl: fileDl(fUrl), extensionMatches)
successfulDls = filter(lambda s: s, dlResults)
sys.stderr.write("Downloaded %d files from %s\n"%(len(successfulDls), sampleUrl))
#You can organize the above code into a function to repeat the process for each of the
#other urls and in that way you can make a crawler.
The above code is written mainly for Python2.X. However, I wrote a crawler that works on any version starting from 2.X
Why yes! 5 years later and, not only is this possible, but you've now got a lot of ways to do it.
I'm going to avoid code-examples here, because mainly want to help break your problem into segments and give you some options for exploration:
Segment 1: GET!
If you must stick to the stdlib, for either python2 or python3, urllib[n]* is what you're going to want to use to pull-down something from the internet.
So again, if you don't want dependencies on other packages:
urllib or urllib2 or maybe another urllib[n] I'm forgetting about.
If you don't have to restrict your imports to the Standard Library:
you're in luck!!!!! You've got:
requests with docs here. requests is the golden standard for gettin' stuff off the web with python. I suggest you use it.
uplink with docs here. It's relatively new & for more programmatic client interfaces.
aiohttp via asyncio with docs here. asyncio got included in python >= 3.5 only, and it's also extra confusing. That said, it if you're willing to put in the time it can be ridiculously efficient for exactly this use-case.
...I'd also be remiss not to mention one of my favorite tools for crawling:
fake_useragent repo here. Docs like seriously not necessary.
Segment 2: Parse!
So again, if you must stick to the stdlib and not install anything with pip, you get to use the extra-extra fun and secure (<==extreme-sarcasm) xml builtin module. Specifically, you get to use the:
xml.etree.ElementTree() with docs here.
It's worth noting that the ElementTree object is what the pip-downloadable lxml package is based on, and made make easier to use. If you want to recreate the wheel and write a bunch of your own complicated logic, using the default xml module is your option.
If you don't have to restrict your imports to the Standard Library:
lxml with docs here. As i said before, lxml is a wrapper around xml.etree that makes it human-usable & implements all those parsing tools you'd need to make yourself. However, as you can see by visiting the docs, it's not easy to use by itself. This brings us to...
BeautifulSoup aka bs4 with docs here. BeautifulSoup makes everything easier. It's my recommendation for this.
Segment 3: GET GET GET!
This section is nearly exactly the same as "Segment 1," except you have a bunch of links not one.
The only thing that changes between this section and "Segment 1" is my recommendation for what to use: aiohttp here will download way faster when dealing with several URLs because it's allows you to download them in parallel.**
* - (where n was decided-on from python-version to ptyhon-version in a somewhat frustratingly arbitrary manner. Look up which urllib[n] has .urlopen() as a top-level function. You can read more about this naming-convention clusterf**k here, here, and here.)
** - (This isn't totally true. It's more sort-of functionally-true at human timescales.)
I would use a combination of wget for downloading - http://www.thegeekstuff.com/2009/09/the-ultimate-wget-download-guide-with-15-awesome-examples/#more-1885 and BeautifulSoup http://www.crummy.com/software/BeautifulSoup/bs4/doc/ for parsing the downloaded file