I'm trying to download multiple image files from two websites, and am doing it with the multiprocessing module, hoping to shorten the time needed (synchronously it would take about five minutes). This is the code being executed in a separate process:
def _get_image(self):
    if not os.path.isdir(self.file_path + self.folder):
        os.makedirs(self.file_path + self.folder)
    rand = Random()
    rand_num = rand.randint(0, sys.maxint)
    self.url += str(rand_num)
    opener = urllib.FancyURLopener()
    opener.retrieve(self.url, self.file_path + self.folder + '/' + str(rand_num) + '.jpg')
The above code is executed in separate processes and works fine, though I'd like it not to save each file right after it's downloaded, but at the end of the process's execution. After downloading, I'd like the files to be stored in some internal list or dict. Sadly, FancyURLopener doesn't allow storing files in memory and insists on writing them to disk right after download. Is there a tool like FancyURLopener, but without the disk writes?
URLopener.open() returns a file-like object. You can read() it to retrieve the data as a byte string, then store it wherever you want.
Why do you need a URLopener in the first place? How about a simple urllib2.urlopen()?
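For illustration, here is a minimal sketch of that idea, assuming Python 2, the same imports as the question (os, sys, urllib2, random.Random), and a hypothetical self.images dict that collects the downloads until a final _save_images() step:

def _get_image(self):
    rand_num = Random().randint(0, sys.maxint)
    data = urllib2.urlopen(self.url + str(rand_num)).read()  # bytes kept in memory
    self.images[rand_num] = data  # hypothetical dict collecting the downloads

def _save_images(self):
    folder = self.file_path + self.folder
    if not os.path.isdir(folder):
        os.makedirs(folder)
    for rand_num, data in self.images.items():
        with open(folder + '/' + str(rand_num) + '.jpg', 'wb') as f:
            f.write(data)

Keep in mind that with multiprocessing each worker process has its own copy of self.images, so the downloaded data has to be handed back to the parent (for example as the worker function's return value) if it is needed there.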
I want to build a script which finds out which files on an FTP server are new and which are already processed.
For each file on the FTP server we read out the information, parse it, and write what we need from it to our database. The files are XML files, but have to be translated.
At the moment I'm using mlsd() to get a list, but this takes up to 4 minutes because there are already 15,000 files in this directory, and it grows every day.
Instead of comparing this list with an older list saved in a text file, I would like to know if there are better options.
Because this task has to run "live", it would end up as a cron job running every 1 or 2 minutes. If this method takes too long, that won't work.
The solution should be either in PHP or Python.
from ftplib import FTP_TLS

def handle(self, *args, **options):
    ftp = FTP_TLS(host=host)
    ftp.login(user, passwd)
    ftp.prot_p()
    listing = ftp.mlsd("...")  # avoid shadowing the built-in name 'list'
    for item in listing:
        print(item[0] + " => " + item[1]['modify'])
This code example already takes 4 minutes to run.
I have always tried to avoid browsing a folder to find out what could have changed, and prefer setting up a dedicated workflow instead. When files can only be added (or new versions of existing files), I use a workflow where files are added to one directory and then move on to other directories where they are archived. Processing can happen in a directory where files are deleted after being used, or when they are copied/moved from one folder to another.
As a slight goody, I also use a copy/rename pattern: the files are first copied using a temporary name (for example a .t prefix or suffix) and renamed when the copy has finished. This prevents trying to process a file which is not fully copied. Admittedly this mattered more when we had slow lines, but race conditions should be avoided as much as possible, and it allows a daemon to poll a folder every 10 seconds or less.
I'm unsure whether this is really relevant here because it could require some refactoring, but it gives bulletproof solutions.
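For illustration, a minimal sketch of that copy/rename pattern (the .t suffix, the deliver name, and the directory names are just placeholders):

import os, shutil

def deliver(src, incoming_dir):
    # copy under a temporary name first...
    tmp = os.path.join(incoming_dir, os.path.basename(src) + '.t')
    shutil.copy2(src, tmp)
    # ...then rename once the copy has finished, so a polling daemon
    # never picks up a half-written file
    os.rename(tmp, os.path.join(incoming_dir, os.path.basename(src)))

On POSIX filesystems the final rename is atomic when source and destination are on the same filesystem, which is what makes the pattern safe.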
If FTP is your only interface to the server, there's no better way than what you are already doing.
Except maybe, if your server supports the non-standard -t switch to the LIST/NLST commands, which returns the list sorted by timestamp.
See How to get files in FTP folder sorted by modification time.
That helps if what takes long is the transfer of the file listing itself (not the initiation of the listing). In that case you can request a sorted list, but download only the leading new entries, aborting the listing once you find the first already-processed file.
For an example, how to abort download of a file list, see:
Download the first N rows of text file in ftp with ftplib.retrlines
Something like this:
class AbortedListing(Exception):
    pass

def collectNewFiles(s):
    if isProcessedFile(s):  # your code to detect if the file was processed already
        print("We know this file already: " + s + " - aborting")
        raise AbortedListing()
    print("New file: " + s)

try:
    ftp.retrlines("NLST -t /path", collectNewFiles)
except AbortedListing:
    # read/skip response
    ftp.getmultiline()
After creating a zip file in Python 2, how do I get the details of the zip? It's not about the files it contains, but about the zip itself.
On Linux, opening the zip file with the 'Archive Manager' displays these properties:
"Last modified, Archive size, Content size, Compression ratio, Number of files"
How can I get those properties from within a Python script?
This information is not available in the ZIP archive as a single structure to access. I am not sure how Archive Manager implements it and I do not have it around to check, but I presume it is a combination of a stat of the archive itself to retrieve its last modification time and size, e.g. for archive ar.zip:
os.stat('ar.zip').st_mtime # last modification of the archive
os.stat('ar.zip').st_size # size of the archive
And iterating over the archive members' information for the rest. For a ZIP file, this operation should not be prohibitively expensive, as there is a directory pointing to all entries at the end of the archive, so it does not have to be read in its entirety.
For instance:
osize = csize = cnt = 0
for item in z.infolist():
    osize += item.file_size
    csize += item.compress_size
    cnt += 1
will give you osize as the original (uncompressed) size of all files, csize as the compressed size in the archive, and cnt as the number of entries in the archive.
With that, you can get the compression ratio by dividing csize by osize, with one caveat: since you mention/flag using Python 2.7, do not forget to convert (at least) one of them to float to force the result to be a float as well: ratio = float(csize) / osize. On Python 3, / would produce a float in any case.
You can of course wrap all of that into a convenient function you can pass an open zip archive to:
def zip_details(archive_obj):
    archive_info = {'original_size': 0,
                    'compressed_size': 0,
                    'total_entries': 0}
    archive_info['total_size'] = os.fstat(archive_obj.fp.fileno()).st_size
    archive_info['last_change'] = os.fstat(archive_obj.fp.fileno()).st_mtime
    for item in archive_obj.infolist():
        archive_info['original_size'] += item.file_size
        archive_info['compressed_size'] += item.compress_size
        archive_info['total_entries'] += 1
    archive_info['compression_ratio'] = (float(archive_info['compressed_size'])
                                         / archive_info['original_size'])
    return archive_info
and get a dictionary with the desired details in return. Or you could subclass zipfile.ZipFile and add this functionality as its method.
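A quick usage sketch, assuming the ar.zip archive from earlier and the keys set by the function above:

import zipfile

archive = zipfile.ZipFile('ar.zip')
details = zip_details(archive)
print("entries: %d, ratio: %.2f" % (details['total_entries'], details['compression_ratio']))
archive.close()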
You've expressed a limitation in the question title that excludes using the content, but I'm afraid that condition is impossible to fulfill for an existing archive, except for the overall size and the time of last modification. Everything else can only be learned by looking into the archive itself: the file count from the directory at its end, and further details from the information stored on individual files. This is not Python-specific and holds for any tool or language.
As long as you are working with bash (as on Linux), here is a simple method to zip a given file/directory list while capturing the zip archive properties:
import os
bashCommand = "zip -r -v" \
" " + "./my-extension.zip" \
" " + "file1 file2 fileN dir1 dir2 dirN" \
" " + "| grep 'total bytes=' > zip.log"
os.system(bashCommand)
Note: sure, this can be executed directly at the OS prompt, but the intent is to include the call in a bigger Python script.
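If you would rather capture that summary line in Python instead of redirecting it to zip.log, one option is subprocess (just a sketch with the same placeholder file names; shell=True is needed because of the pipe, and check_output assumes Python 2.7 or later):

import subprocess

output = subprocess.check_output(
    "zip -r -v ./my-extension.zip file1 file2 fileN dir1 dir2 dirN"
    " | grep 'total bytes='",
    shell=True)
print(output)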
I'm trying to wrap my head around multiprocessing in Python, but I simply can't. Note that I was, am, and probably forever will be a noob in everything programming. Ah, anyways. Here it goes.
I'm writing a Python script that compresses images downloaded to a folder with ImageMagick, using predefined variables from the user, stored in an ini file. The script searches for folders matching a pattern in a download dir, checks if they contain JPGs, PNGs or other image files and, if yes, recompresses and renames them, storing the results in a "compressed" folder.
Now, here's the thing: I'd love it if I was able to "parallelize" the whole compression thingy, but... I can't understand how I'm supposed to do that.
I don't want to tire you with the existing code since it simply sucks. It's just a simple "for file in directory" loop. THAT's what I'd love to parallelize - could somebody give me an example of how multiprocessing could be used with files in a directory?
I mean, let's take this simple piece of code:
for f in matching_directory:
    print('I\'m going to process file:', f)
For those that DO have to peek at the code, here's the part where I guess the whole parallelization bit will stick:
for f in ImageFolders:
    print(splitter)
    print(f)
    print(splitter)
    PureName = CleanName(f)
    print(PureName)
    for root, dirs, files in os.walk(f):
        padding = int(round(math.log(len(files), 10))) + 1
        padding = max(minpadding, padding)
        filecounter = 0
        for filename in files:
            if filename.endswith(('.jpg', '.jpeg', '.gif', '.png')):
                filecounter += 1
                imagefile, ext = os.path.splitext(filename)
                newfilename = "%s_%s%s" % (PureName, str(filecounter).rjust(padding, '0'), '.jpg')
                startfilename = os.path.join(f, filename)
                finalfilename = os.path.join(Dir_Images_To_Publish, PureName, newfilename)
                print(filecounter, ':', startfilename, ' >>> ', finalfilename)
                Original_Image_FileList.append(startfilename)
                Processed_Image_FileList.append(finalfilename)
...and here I'd like to be able to add a piece of code where a worker takes the first file from Original_Image_FileList and compresses it to the first filename from Processed_Image_FileList, a second worker takes the next one, and so on, up to a specific number of workers, depending on a user setting in the ini file.
Any ideas?
You can create a pool of workers using the Pool class, to which you can distribute the image compression. See the Using a pool of workers section of the multiprocessing documentation.
If your compression function is called compress_image(image), for example, then you can use the Pool.map method to apply this function to an iterable that yields the filenames, i.e. your list matching_directory:
from multiprocessing import Pool

def compress_image(image):
    """Define how you'd like to compress `image`..."""
    pass

def distribute_compression(images, pool_size=4):
    with Pool(processes=pool_size) as pool:
        pool.map(compress_image, images)
There's a variety of map-like methods available, see map for starters. You may like to experiment with the pool size, to see what works best.
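For the specific source-to-target pairing described in the question, one possibility is to zip the two lists and map over the pairs. This is only a sketch, assuming Original_Image_FileList and Processed_Image_FileList have already been filled as in the question's loop and that ImageMagick's convert does the compressing (the quality option is just a placeholder):

import subprocess
from multiprocessing import Pool

def compress_pair(pair):
    source, target = pair
    # placeholder ImageMagick call; substitute your real options here
    subprocess.call(['convert', source, '-quality', '85', target])

if __name__ == '__main__':
    pairs = list(zip(Original_Image_FileList, Processed_Image_FileList))
    pool = Pool(processes=4)  # e.g. read the worker count from the ini file
    pool.map(compress_pair, pairs)
    pool.close()
    pool.join()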
The script receives two variables from a previous web page. From those variables, the code determines which images are desired. It sends those images to a temp folder, zips up that folder and places it in an output folder for pickup. That's where things go south. I'm trying to allow the webpage to provide a button for the user to click on and download the zip file. Because the zip file's name needs to change based on the variables the script receives, I cannot just make a generic link to the zip file.
import arcpy, sys, shutil, os
path = "C:/output/exportedData/raw/"
pathZip = "C:/output/exportedData/zip/"
#First arg is the mxd base filename which is the same as the geodatabase name
geodatabaseName = "C:/output/" + sys.argv[1] + ".gdb"
#this is where the images are determined and sent to a folder
zipFileName = sys.argv[1]
zipFile = shutil.make_archive(path + zipFileName,"zip")
movedZip = os.rename(zipFile, pathZip + zipFileName + ".zip")
shutil.rmtree(path + zipFileName)
print """<h3>Download zip file</h3>""".format(movedZip)
And the last line indicates where the problem comes in. Firebug indicates the link made is
<a href="None">Download zip file</a>
The string substitution isn't working in this case and I'm at a loss as to why. Thank you in advance for any assistance you can provide.
os.rename() doesn't return anything, which means that movedZip becomes None.
Here's what you probably want to do instead:
movedZip = pathZip + zipFileName + ".zip"
os.rename(zipFile, movedZip)
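Put together, the end of the script would then look something like this (same paths and variables as in the question):

zipFile = shutil.make_archive(path + zipFileName, "zip")
movedZip = pathZip + zipFileName + ".zip"
os.rename(zipFile, movedZip)
shutil.rmtree(path + zipFileName)
print """<h3><a href="{0}">Download zip file</a></h3>""".format(movedZip)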
The os.rename method does not return any value; see the official documentation. It renames the file or directory src to dst. Some exceptions might be thrown, but nothing is returned.
I have a fairly simple Python loop that calls a few functions and writes the output to a file. To do this it creates a folder and saves the file in this folder.
When I run the program the first time with a unique file name, it runs fine. However, if I try to run it again, it will not work and I do not understand why. I am quite certain that it is not a problem of overwriting the file, as I delete the folder before re-running, and this is the only place the file is stored. Is there a concept that I am misunderstanding?
The problematic file is 'buff1.shp'. I am using Python 2.5 to run some analysis in ArcGIS
Thanks for any advice (including suggestions about how to improve my coding style). One other note is that my loops currently only use one value as I am testing this at the moment.
# Import system modules
import sys, string, os, arcgisscripting, shutil

# Create the Geoprocessor object
gp = arcgisscripting.create()

# Load required toolboxes...
gp.AddToolbox("C:/Program Files/ArcGIS/ArcToolbox/Toolboxes/Spatial Statistics Tools.tbx")
gp.AddToolbox("C:/Program Files/ArcGIS/ArcToolbox/Toolboxes/Analysis Tools.tbx")

# specify workspace
gp.Workspace = "C:/LEED/Cities_20_Oct/services"
path = "C:\\LEED\\Cities_20_Oct\\services\\"
results = 'results\\'
os.mkdir(path + results)
newpath = path + results

# Loop through each file (0 -> 20)
for j in range(0, 1):
    in_file = "ser" + str(j) + ".shp"
    in_file_2 = "ser" + str(j) + "_c.shp"
    print "Analyzing " + str(in_file) + " and " + str(in_file_2)

    # Loop through a range of buffers - in this case, 1,2
    for i in range(1, 2):
        print "Buffering....."

        # Local variables...
        center_services = in_file_2
        buffer_shp = newpath + "buff" + str(i) + ".shp"
        points = in_file_2
        buffered_analysis_count_shp = newpath + "buffered_analysis_count.shp"
        count_txt = newpath + "count.txt"

        # Buffer size
        b_size = 1000 + 1000 * i
        b_size_input = str(b_size) + ' METERS'
        print "Buffer:" + b_size_input + "\n"

        # Process: Buffer...
        gp.Buffer_analysis(center_services, buffer_shp, b_size_input, "FULL", "ROUND", "ALL", "")

print "over"
(To clarify this question I edited a few parts that did not make sense without the rest of the code. The error still remains in the program.)
Error message:
ExecuteError: ERROR 000210: Cannot create output C:\LEED\Cities_20_Oct\services\results\buff1.shp Failed to execute (Buffer).
I can't see how the file name in the error message blahblah\buff1.shp can arise from your code.
for i in range(0,1):
    buffer_shp = newpath + "buff" + str(i) + ".shp"
    gp.Buffer_analysis(center_services, buffer_shp, etc etc)
should produce blahblah\buff0.shp not blahblah\buff1.shp... I strongly suggest that the code you display should be the code that you actually ran. Throw in a print statement just before the gp.Buffer_analysis() call to show the value of i and repr(buffer_shp). Show all print results.
Also the comment #Loop through a range of buffers (1 ->100) indicates you want to start at 1, not 0. It helps (you) greatly if the comments match the code.
Don't repeat yourself; instead of
os.mkdir( path + results )
newpath = path + results
do this:
newpath = path + results # using os.path.join() is even better
os.mkdir(newpath)
You might like to get into the habit of constructing all paths using os.path.join().
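For example, a quick sketch using the paths from the script above (the buffer line belongs inside the inner loop, as before):

path = os.path.join("C:\\", "LEED", "Cities_20_Oct", "services")
newpath = os.path.join(path, "results")
buffer_shp = os.path.join(newpath, "buff" + str(i) + ".shp")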
You need to take the call to os.mkdir() outside the loops i.e. do it once per run of the script, not once each time round the inner loop.
The results of these statements are not used:
buffered_analysis_count_shp = newpath + "buffered_analysis_count.shp"
count_txt = newpath + "count.txt"
Update
Googling with the first few words in your error message (always a good idea!) brings up this: troubleshooting geoprocessing errors which provides the following information:
geoprocessing errors that occur when reading or writing ArcSDE/DBMS data receive a generic 'catch-all' error message, such as error 00210 when writing output
This goes on to suggest some ways of determining what your exact problem is. If that doesn't help you, you might like to try asking in the relevant ESRI forum or on GIS StackExchange.
I see this is a 3-year-old posting, but for others I will add:
As I generate python script to work with Arc, I always include right after my import:
arcpy.env.overwriteOutput = True  # This allows the script to overwrite files.
Also, you mentioned you delete your "folder"? That would be part of your directory, and I do not see where you are creating a directory in the script. You would want to clear the folder, not delete it (maybe you meant you delete the file, though).
JJH
I'd be tempted to look again at
path = "C:\LEED\Cities_20_Oct\services\"
Surely you want double front slashes, not double back slashes?