Python 3, functions.
There is the following exercise:
In this exercise, we will write code that allows us to
download an image from the web! We will use an external module and understand how to read the documentation of the functions and use them.
We want to write a function that accepts a photo Url and
downloads the photo to your computer.
You can choose any picture you want from the Internet by clicking the right button and selecting "Copy Image Url”.
the image I chose:
https://images.theconversation.com/files/377569/original/file-20210107-17-q20ja9.jpg?ixlib=rb-1.1.0&rect=278%2C340%2C4644%2C3098&q=45&auto=format&w=926&fit=clip
the wanted name:
"4.jpg"
the given instructions:
Write code that selects a random number between 1 and 1000. The number will be the file name (search how to select a random number in python).
The image name has to have the right suffix, so concat“.png" to the end of the number you selected (You can also use .jpg, .bmp, etc).
For example: if the selected number was 2T0, you should have a variable that holds the string: “270.png”.
Import module urllib.request: https://docs.python.org/3/library/urllib.request.html#module-urllib.request
Read "how to use the urlretrieve function":
http://shecodesconnect.com/shecodes_python_blog/urlretrieve_en.php?lang=%27en%27
Use the function you read about in order to
download the image and save it under the name
you prepared.
Run the code you wrote and check the folder where your code is saved if the image was downloaded!
my code that doesn't work:
import urllib.request
local_filename, headers = urllib.request.urlretrieve(https://images.theconversation.com/files/377569/original/file-20210107-17-q20ja9.jpg?ixlib=rb-1.1.0&rect=278%2C340%2C4644%2C3098&q=45&auto=format&w=926&fit=clip)
html = open(local_filename)
html.close()
headers.items()
headers["content-type"]
image_url="https://images.theconversation.com/files/377569/original/file-20210107-17-q20ja9.jpg?ixlib=rb-1.1.0&rect=278%2C340%2C4644%2C3098&q=45&auto=format&w=926&fit=clip.jpg"urllib.request.urlretrieve(url=image_url, filename="4.jpg")
import random()
a=random.randint(1,1000)
print(4)
Hope you could help, my code doesn't work, thank you in advance!
After fixing the errors, looks like you end up with the below:
import urllib.request
local_filename, headers = urllib.request.urlretrieve('https://images.theconversation.com/files/377569/original/file-20210107-17-q20ja9.jpg?ixlib=rb-1.1.0&rect=278%2C340%2C4644%2C3098&q=45&auto=format&w=926&fit=clip')
html = open(local_filename)
html.close()
headers.items()
headers["content-type"]
image_url="https://images.theconversation.com/files/377569/original/file-20210107-17-q20ja9.jpg?ixlib=rb-1.1.0&rect=278%2C340%2C4644%2C3098&q=45&auto=format&w=926&fit=clip.jpg"
urllib.request.urlretrieve(url=image_url, filename="4.jpg")
import random
a=random.randint(1,1000)
print(4)
Assuming the code in the question is what were you were able to come up with, that is.
Related
I found a tutorial and I'm trying to run this script, I did not work with python before.
tutorial
I've already seen what is running through logging.debug, checking whether it is connecting to google and trying to create csv file with other scripts
from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get
import csv
def scrape_run():
with open('/Users/Work/Desktop/searches.txt') as searches:
for search in searches:
userQuery = search
raw = get("https://www.google.com/search?q=" + userQuery).text
page = fromstring(raw)
links = page.cssselect('.r a')
csvfile = '/Users/Work/Desktop/data.csv'
for row in links:
raw_url = row.get('href')
title = row.text_content()
if raw_url.startswith("/url?"):
url = parse_qs(urlparse(raw_url).query)['q']
csvRow = [userQuery, url[0], title]
with open(csvfile, 'a') as data:
writer = csv.writer(data)
writer.writerow(csvRow)
print(links)
scrape_run()
The TL;DR of this script is that it does three basic functions:
Locates and opens your searches.txt file.
Uses those keywords and searches the first page of Google for each
result.
Creates a new CSV file and prints the results (Keyword, URLs, and
page titles).
Solved
Google add captcha couse i use to many request
its work when i use mobile internet
Assuming the links variable is full and contains data - please verify.
if empty - test the api call itself you are making, maybe it returns something different than you expected.
Other than that - I think you just need to tweak a little bit your file handling.
https://www.guru99.com/reading-and-writing-files-in-python.html
here you can find some guidelines regarding file handling in python.
in my perspective, you need to make sure you create the file first.
start on with a script which is able to just create a file.
after that enhance the script to be able to write and append to the file.
from there on I think you are good to go and continue with you're script.
other than that I think that you would prefer opening the file only once instead of each loop, it could mean much faster execution time.
let me know if something is not clear.
Firstly, there are plenty of related questions about extracting GIFs from a URL, but majority of them are in different languages to Python. Secondly, a google search provides many examples of how to do this using requests and a parser, like lxml or beautifulsoup. However, my problem is specific to this URL I think, and I cannot quite figure out why the image in question does not have a specific url attached to it ( http://cactus.nci.nih.gov/chemical/structure/3-Methylamino-1-%28thien-2-yl%29-propane-1-ol/image)
This is what I have tried
molecule_name = "3-Methylamino-1-(thien-2-yl)-propane-1-ol"
molecule = urllib.pathname2url(molecule_name)
response = requests.get("http://cactus.nci.nih.gov/chemical/structure/"+ molecule+"/image")
response.encoding = 'ISO-8859-1'
print type(response.content)
and I just get back a string that says GIF87au. I know it is something to do with GIF being in binary etc. But I cant quite work out how to donwload that GIF file in that particular page using the script.
Furthermore, if I do manage to download the GIF file, what are the best modules to use, to make tables (csv or excel style) with GIF files embedded in the last column for example?
As far as I can tell your code is working for me.
molecule_name = "3-Methylamino-1-(thien-2-yl)-propane-1-ol"
molecule = urllib.pathname2url(molecule_name)
response = requests.get("http://cactus.nci.nih.gov/chemical/structure/"+molecule+"/image")
response.encoding = 'ISO-8859-1'
print len(response.content)
It outputs "1080".
As for second task in hand ... putting it into document. I would use xlsxwriter like this:
import xlsxwriter
# Create an new Excel file and add a worksheet.
workbook = xlsxwriter.Workbook('molecules.xlsx')
worksheet = workbook.add_worksheet()
# Input data
worksheet.write(0, 0, "My molecule") # A1 == 0, 0
worksheet.insert_image('B1', 'molecule1234.png')
workbook.close()
See http://xlsxwriter.readthedocs.org/index.html
You will have to convert that .gif into .png, because as of now xlsxwriter does not support gifs (as jmcnamara pointed out). Here you can look how to do that using PIL - How to change gif file to png file using python pil.
You can display the gif using many various methods. I would just save it to file and used some other software. If you want to view them programmatically, you can use for instance Tkinter as used here Play Animations in GIF with Tkinter.
I am trying to learn python and also create a web utility. One task I am trying to accomplish is creating a single html file which can be run locally but link to everything it needs to look like the original web page. (if you are going to ask why i want this, its because it may act of a part of a utility i am creating, or if not, just for education) So i have two questions, a theoretical one and a practical one:
1) Is this, for visual (as opposed to functional) purposes, possible? Can a html page work offline while linking to everything it needs online? or if their something fundamental about having the html file itself execute on the web server which does not allow this to be possible? How far can I go with it?
2) I have started a python script which de-relativises (made that one up) linked elements on a html page, but I am a noob so most likely I missed some elements or attributes which would also link to outside resources. I have noticed after trying a few pages that the one in the code below does not work properly, their appears to be a .js file which is not linking correctly. (the first of many problems to come) Assuming the answer to my first question was at least a partial yes, can anyone help me fix the code for this website?
Thank you.
Update, I missed the script tag on this, but even after I added it it still does not work correctly.
import lxml
import sys
from lxml import etree
from StringIO import StringIO
from lxml.html import fromstring, tostring
import urllib2
from urlparse import urljoin
site = "www.script-tutorials.com/advance-php-login-system-tutorial/"
output_filename = "output.html"
def download(site):
response = urllib2.urlopen("http://"+site)
html_input = response.read()
return html_input
def derealitivise(site, html_input):
active_html = lxml.html.fromstring(html_input)
for element in tags_to_derealitivise:
for tag in active_html.xpath(str(element+"[#"+"src"+"]")):
tag.attrib["src"] = urljoin("http://"+site, tag.attrib.get("src"))
for tag in active_html.xpath(str(element+"[#"+"href"+"]")):
tag.attrib["href"] = urljoin("http://"+site, tag.attrib.get("href"))
return lxml.html.tostring(active_html)
active_html = ""
tags_to_derealitivise = ("//img", "//a", "//link", "//embed", "//audio", "//video", "//script")
print "downloading..."
active_html = download(site)
active_html = derealitivise(site, active_html)
print "writing file..."
output_file = open (output_filename, "w")
output_file.write(active_html)
output_file.close()
Furthermore, I could make the code more through by checking all of the elements...
It would look kind of like this, but I do not know the exact way to iterate through all of the elements. This is a seperate problem, and I will most likely figure it out by the time anyone responds...:
def derealitivise(site, html_input):
active_html = lxml.html.fromstring(html_input)
for element in active_html.xpath:
for tag in active_html.xpath(str(element+"[#"+"src"+"]")):
tag.attrib["src"] = urljoin("http://"+site, tag.attrib.get("src"))
for tag in active_html.xpath(str(element+"[#"+"href"+"]")):
tag.attrib["href"] = urljoin("http://"+site, tag.attrib.get("href"))
return lxml.html.tostring(active_html)
update
Thanks to Burhan Khalid's solution, which seemed too simple to be viable at first glance, I got it working. The code is so simple most of you will most likely not require it, but I will post it anyway incase it helps:
import lxml
import sys
from lxml import etree
from StringIO import StringIO
from lxml.html import fromstring, tostring
import urllib2
from urlparse import urljoin
site = "www.script-tutorials.com/advance-php-login-system-tutorial/"
output_filename = "output.html"
def download(site):
response = urllib2.urlopen(site)
html_input = response.read()
return html_input
def derealitivise(site, html_input):
active_html = html_input.replace('<head>', '<head> <base href='+site+'>')
return active_html
active_html = ""
print "downloading..."
active_html = download(site)
active_html = derealitivise(site, active_html)
print "writing file..."
output_file = open (output_filename, "w")
output_file.write(active_html)
output_file.close()
Despite all of this, and its great simplicity, the .js object running on the website I have listed in the script still will not load correctly. Does anyone know if this is possible to fix?
while i am trying to make only the html file offline, while using the
linked resources over the web.
This is a two step process:
Copy the HTML file and save it to your local directory.
Add a BASE tag in the HEAD section, and point the href attribute of it to the absolute URL.
Since you want to learn how to do it yourself, I will leave it at that.
#Burhan has an easy answer using <base href="..."> tag in the <head>, and it works as you have found out. I ran the script you posted, and the page downloaded fine. As you noticed, some of the JavaScript now fails. This can be for multiple reasons.
If you are opening the HTML file as a local file:/// URL, the page may not work. Many browsers heavily sandbox local HTML files, not allowing them to perform network requests or examine local files.
The page may perform XmlHTTPRequests or other network operations to the remote site, which will be denied for cross domain scripting reasons. Looking in the JS console, I see the following errors for the script you posted:
XMLHttpRequest cannot load http://www.script-tutorials.com/menus.php?give=menu. Origin http://localhost:8000 is not allowed by Access-Control-Allow-Origin.
Unfortunately, if you do not have control of www.script-tutorials.com, there is no easy way around this.
I want to download some files from this site: http://www.emuparadise.me/soundtracks/highquality/index.php
But I only want to get certain ones.
Is there a way to write a python script to do this? I have intermediate knowledge of python
I'm just looking for a bit of guidance, please point me towards a wiki or library to accomplish this
thanks,
Shrub
Here's a link to my code
I looked at the page. The links seem to redirect to another page, where the file is hosted, clicking which downloads the file.
I would use mechanize to follow the required links to the right page, and then use BeautifulSoup or lxml to parse the resultant page to get the filename.
Then it's a simple matter of opening the file using urlopen and writing its contents out into a local file like so:
f = open(localFilePath, 'w')
f.write(urlopen(remoteFilePath).read())
f.close()
Hope that helps
Make a url request for the page. Once you have the source, filter out and get urls.
The files you want to download are urls that contain a specific extension. It is with this that you can do a regular expression search for all urls that match your criteria.
After filtration, then do a url request for each matched url's data and write it to memory.
Sample code:
#!/usr/bin/python
import re
import sys
import urllib
#Your sample url
sampleUrl = "http://stackoverflow.com"
urlAddInfo = urllib.urlopen(sampleUrl)
data = urlAddInfo.read()
#Sample extensions we'll be looking for: pngs and pdfs
TARGET_EXTENSIONS = "(png|pdf)"
targetCompile = re.compile(TARGET_EXTENSIONS, re.UNICODE|re.MULTILINE)
#Let's get all the urls: match criteria{no spaces or " in a url}
urls = re.findall('(https?://[^\s"]+)', data, re.UNICODE|re.MULTILINE)
#We want these folks
extensionMatches = filter(lambda url: url and targetCompile.search(url), urls)
#The rest of the unmatched urls for which the scrapping can also be repeated.
nonExtMatches = filter(lambda url: url and not targetCompile.search(url), urls)
def fileDl(targetUrl):
#Function to handle downloading of files.
#Arg: url => a String
#Output: Boolean to signify if file has been written to memory
#Validation of the url assumed, for the sake of keeping the illustration short
urlAddInfo = urllib.urlopen(targetUrl)
data = urlAddInfo.read()
fileNameSearch = re.search("([^\/\s]+)$", targetUrl) #Text right before the last slash '/'
if not fileNameSearch:
sys.stderr.write("Could not extract a filename from url '%s'\n"%(targetUrl))
return False
fileName = fileNameSearch.groups(1)[0]
with open(fileName, "wb") as f:
f.write(data)
sys.stderr.write("Wrote %s to memory\n"%(fileName))
return True
#Let's now download the matched files
dlResults = map(lambda fUrl: fileDl(fUrl), extensionMatches)
successfulDls = filter(lambda s: s, dlResults)
sys.stderr.write("Downloaded %d files from %s\n"%(len(successfulDls), sampleUrl))
#You can organize the above code into a function to repeat the process for each of the
#other urls and in that way you can make a crawler.
The above code is written mainly for Python2.X. However, I wrote a crawler that works on any version starting from 2.X
Why yes! 5 years later and, not only is this possible, but you've now got a lot of ways to do it.
I'm going to avoid code-examples here, because mainly want to help break your problem into segments and give you some options for exploration:
Segment 1: GET!
If you must stick to the stdlib, for either python2 or python3, urllib[n]* is what you're going to want to use to pull-down something from the internet.
So again, if you don't want dependencies on other packages:
urllib or urllib2 or maybe another urllib[n] I'm forgetting about.
If you don't have to restrict your imports to the Standard Library:
you're in luck!!!!! You've got:
requests with docs here. requests is the golden standard for gettin' stuff off the web with python. I suggest you use it.
uplink with docs here. It's relatively new & for more programmatic client interfaces.
aiohttp via asyncio with docs here. asyncio got included in python >= 3.5 only, and it's also extra confusing. That said, it if you're willing to put in the time it can be ridiculously efficient for exactly this use-case.
...I'd also be remiss not to mention one of my favorite tools for crawling:
fake_useragent repo here. Docs like seriously not necessary.
Segment 2: Parse!
So again, if you must stick to the stdlib and not install anything with pip, you get to use the extra-extra fun and secure (<==extreme-sarcasm) xml builtin module. Specifically, you get to use the:
xml.etree.ElementTree() with docs here.
It's worth noting that the ElementTree object is what the pip-downloadable lxml package is based on, and made make easier to use. If you want to recreate the wheel and write a bunch of your own complicated logic, using the default xml module is your option.
If you don't have to restrict your imports to the Standard Library:
lxml with docs here. As i said before, lxml is a wrapper around xml.etree that makes it human-usable & implements all those parsing tools you'd need to make yourself. However, as you can see by visiting the docs, it's not easy to use by itself. This brings us to...
BeautifulSoup aka bs4 with docs here. BeautifulSoup makes everything easier. It's my recommendation for this.
Segment 3: GET GET GET!
This section is nearly exactly the same as "Segment 1," except you have a bunch of links not one.
The only thing that changes between this section and "Segment 1" is my recommendation for what to use: aiohttp here will download way faster when dealing with several URLs because it's allows you to download them in parallel.**
* - (where n was decided-on from python-version to ptyhon-version in a somewhat frustratingly arbitrary manner. Look up which urllib[n] has .urlopen() as a top-level function. You can read more about this naming-convention clusterf**k here, here, and here.)
** - (This isn't totally true. It's more sort-of functionally-true at human timescales.)
I would use a combination of wget for downloading - http://www.thegeekstuff.com/2009/09/the-ultimate-wget-download-guide-with-15-awesome-examples/#more-1885 and BeautifulSoup http://www.crummy.com/software/BeautifulSoup/bs4/doc/ for parsing the downloaded file
I've written code that generated the links to videos such as the one below.
Once obtained, I try to download it in this manner:
import urllib.request
import os
url = 'http://www.videodetective.net/flash/players/?customerid=300120&playerid=351&publishedid=319113&playlistid=0&videokbrate=750&sub=RTO&pversion=5.2%22%20width=%22670%22%20height=%22360%22'
response = urllib.request.urlopen(url).read()
outpath = os.path.join(os.getcwd(), 'video.mp4')
videofile = open(outpath , 'wb')
videofile.write(response)
videofile.close()
All I get is a 58kB file in that directory that can't be read. Could someone point me in the right direction?
With your code, you aren't downloading the encoded video file here, but the flash application (in CWS-format) that is used to play the video. It is executed in the browser and dynamically loads and plays the video. You'd need to apply some reverse-engineering to figure out the actual video source. The following is my attempt at it:
Decompressing the SWF file
First, save the 58K file you mentioned on your hard disk under the name test.swf (or similiar).
You can then use the small Perl script cws2fws for that:
perl cws2fws test.swf
This results in a new file named test.fws.swf in the same directory
Searching for the configuration URL in the FWS file
I used a simple
strings test.fws.swf | grep http
Which yields:
...
cookieOhttp://www.videodetective.net/flash/players/flashconfiguration.aspx?customerid=
...
Interesting. Let's try to put our known customerid, playerid and publishedid arguments to this URL:
http://www.videodetective.net/flash/players/flashconfiguration.aspx?customerid=300120&playerid=351&publishedid=319113
If we open that in a browser, we can see the player configuration XML, which in turn points us to
http://www.videodetective.net/flash/players/playlist.aspx?videokbrate=450&version=4.6&customerid=300120&fmt=3&publishedid=&sub=
Now if we open that, we can finally see the source URL:
http://cdn.videodetective.net/svideo/mp4/450/6993/293732.mp4?c=300120&r=450&s=293732&d=153&sub=&ref=&fmt=4&e=20111228220329&h=03e5d78201ff0d2f7df9a
Now we can download this h264 video file and we are finished.
Automating the whole process in a Python script
This is an entirely different task (left as an exercise to the reader).