Using the zipfile module I have created a script to extract my archived files, but the method is corrupting everything other than txt files.
def unzip(zip):
    filelist = []
    dumpfold = r'M:\SVN_EReportingZones\eReportingZones\data\input\26012012'
    storage = r'M:\SVN_EReportingZones\eReportingZones\data\input\26012012__download_dump'
    file = storage + '\\' + zip
    unpack = dumpfold + '\\' + str(zip)
    print file
    try:
        time.sleep(1)
        country = str(zip[:2])
        countrydir = dumpfold + '\\' + country
        folderthere = 0
        if exists(countrydir):
            folderthere = 1
        if folderthere == 0:
            os.makedirs(countrydir)
        zfile = zipfile.ZipFile(file, 'r')
##        print zf.namelist()
        time.sleep(1)
        shapepresent = 0
Here I have a problem: by reading and writing the zipped data, the zipfile read/write below seems to render it unusable by the programs in question. I am trying to unzip shapefiles for use in ArcGIS...
        for info in zfile.infolist():
            fname = info.filename
            data = zfile.read(fname)
            zfilename = countrydir + '\\' + fname
            fout = open(zfilename, 'w')  # reads and copies the data
            fout.write(data)
            fout.close()
            print 'New file created ----> %s' % zfilename
    except:
        traceback.print_exc()
        time.sleep(5)
Would it be possible to call WinRar using a system command and get it to do my unpacking for me? Cheers, Alex
EDIT
Having used the wb method, it works for most of my files, but some are still being corrupted. When I use WinRAR to manually unzip the problematic files they load properly, and they also show a larger file size.
Please could somebody point me in the direction of calling WinRAR for the complete unzip process?
You are opening the file in text mode. Try:
fout = open(zfilename, 'wb')# reads and copies the data
The b opens the file in binary mode, where the runtime libraries don't try to do any newline conversion.
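For example, this is the extraction loop from the question with only the mode changed (a sketch reusing the question's variable names):

for info in zfile.infolist():
    fname = info.filename
    data = zfile.read(fname)
    zfilename = countrydir + '\\' + fname
    fout = open(zfilename, 'wb')  # binary mode writes the shapefile bytes unchanged
    fout.write(data)
    fout.close()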
To answer the second part of your question, I suggest the envoy library. To use WinRAR with envoy:
import envoy
r = envoy.run('unrar e {0}'.format(zfilename))
if r.status_code > 0:
    print r.std_err
print r.std_out
To do it without envoy:
import subprocess
r = subprocess.call('unrar e {0}'.format(zfilename), shell=True)
print "Return code for {0}: {1}".format(zfilename, r)
I am trying to write a block of code which opens a new file every time a Python3 script is run.
I am constructing the filename using an incrementing number.
For example, the following are some examples of valid filenames which should be produced:
output_0.csv
output_1.csv
output_2.csv
output_3.csv
On the next run of the script, the next filename to be used should be output_4.csv.
In C/C++ I would do this in the following way:
Enter an infinite loop
Try to open the first filename, in "read" mode
If the file is open, increment the filename number and repeat
If the file is not open, break out of the loop and re-open the file in "write" mode
This doesn't seem to work in Python 3, as opening a non-existing file in read mode causes an exception to be raised.
One possible solution might be to move the file-opening code inside a try-except block, but that doesn't seem like a particularly elegant solution.
Here is what I have tried so far in code:
# open a file to store output data
filename_base = "output"
filename_ext = "csv"
filename_number = 0

while True:
    filename_full = f"{filename_base}_{filename_number}.{filename_ext}"
    with open(filename_full, "r") as f:
        if f.closed:
            print(f"Writing data to {filename_full}")
            break
        else:
            print(f"File {filename_full} exists")
            filename_number += 1

with open(filename_full, "w") as f:
    pass
As explained above this code crashes when trying to open a file which does not exist in "read" mode.
Using pathlib, you can check with Path.is_file(), which returns True when it encounters a file or a symbolic link to a file.
from pathlib import Path
filename_base = "output"
filename_ext = "csv"
filename_number = 0
filename_full = f"{filename_base}_{filename_number}.{filename_ext}"
p = Path(filename_full)
while p.is_file() or p.is_dir():
    filename_number += 1
    p = Path(f"{filename_base}_{filename_number}.{filename_ext}")
This loop should exit when the file isn’t there so you can open it for writing.
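For example, a small sketch continuing from that loop (the newline argument is my own addition for CSV writing):

print(f"Writing data to {p}")
with p.open("w", newline="") as f:
    pass  # write the CSV rows here, e.g. with csv.writer(f)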
You can check whether a file exists beforehand using
os.path.exists(filename)
You could use the os module to check whether the file path is a file, and then open it:
import os

file_path = './file.csv'
if os.path.isfile(file_path):
    with open(file_path, "r") as f:
        ...  # the file exists, read it here
This should work:
filename_base = "output"
filename_ext = "csv"
filename_number = 0
while True:
    filename_full = f"{filename_base}_{filename_number}.{filename_ext}"
    try:
        with open(filename_full, "r") as f:
            print(f"File {filename_full} exists")
            filename_number += 1
    except FileNotFoundError:
        print("Creating new file")
        open(filename_full, 'w')
        break
You might use os.path.exists to check whether a file already exists, for example:
import os
print(os.path.exists("output_0.csv"))
or take advantage of the fact that your names
output_0.csv
output_1.csv
output_2.csv
output_3.csv
are so regular and use glob.glob, like so:
import glob
existing = glob.glob("output_*.csv")
print(existing) # list of existing files
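As a rough sketch of taking that one step further (my own extension, not part of the original answer), you could derive the next free number from that list:

import glob
import re

existing = glob.glob("output_*.csv")
numbers = []
for name in existing:
    m = re.search(r"output_(\d+)\.csv$", name)
    if m:
        numbers.append(int(m.group(1)))
next_number = max(numbers) + 1 if numbers else 0
print(f"output_{next_number}.csv")  # next filename to create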
Directory 1: I have some number of txt files and one xml file which I want to change in each iteration according to txt file content.
Directory 2: I want to copy altered xml file to this directory after each iteration.
After execution I can observe the altered xml file in directory 1, which contains the last txt file's content as expected. However, directory 2 contains empty files with the expected names.
Maybe there is some issue with my cp command? Could you please help?
os.system('cp /home/username/xmlFile.xml /home/username/NewFolder/%s.xml' % myString)
Entire script:
#!/usr/bin/python
import os
import re
from shutil import copyfile

arr = os.listdir('/di/rec/to/ry')
newArr = []
for j in arr:
    m = re.search('.*txt', j)
    if m != None:
        newArr.append(m.group(0))

for i in newArr:
    myString = ""
    f = open('/home/username/xmlFile.xml', 'r+')
    i = i[:-4]
    data = f.readlines()
    myString += str(i)
    data[10] = data[10][:36] + i + data[10][64:]
    f.truncate(0)
    f.seek(0)
    f.writelines(data)
    #os.system('cp /home/username/xmlFile.xml /home/username/NewFolder/%s.xml' % myString)
    copyfile('/home/username/xmlFile.xml', '/home/username/NewFolder/%s.xml' % myString)
It seems likely you're encountering synchronization issues. Files aren't immediately written to disk; their contents are buffered in memory to increase overall throughput. This means that the copyfile call isn't seeing the latest changes you have made to the file. Try calling f.flush() before copyfile to ensure the change is committed to disk.
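For example, a minimal sketch of that change at the end of the loop (the os.fsync call is my own addition and may not be strictly necessary):

f.writelines(data)
f.flush()                 # push Python's internal buffer out to the OS
os.fsync(f.fileno())      # optionally ask the OS to commit it to disk as well
copyfile('/home/username/xmlFile.xml', '/home/username/NewFolder/%s.xml' % myString)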
I modified the code based on the comments from experts in this thread. Now the script reads and writes all the individual files: it iterates over them, highlights the matches and writes the output. The current issue is that, after highlighting the last instance of the search term, the script drops all remaining content that follows that last match in each file's output.
Here is the modified code:
import os
import sys
import re

source = raw_input("Enter the source files path:")
listfiles = os.listdir(source)

for f in listfiles:
    filepath = source + '\\' + f
    infile = open(filepath, 'r+')
    source_content = infile.read()
    color = ('red')
    regex = re.compile(r"(\b be \b)|(\b by \b)|(\b user \b)|(\bmay\b)|(\bmight\b)|(\bwill\b)|(\b's\b)|(\bdon't\b)|(\bdoesn't\b)|(\bwon't\b)|(\bsupport\b)|(\bcan't\b)|(\bkill\b)|(\betc\b)|(\b NA \b)|(\bfollow\b)|(\bhang\b)|(\bbelow\b)", re.I)
    i = 0; output = ""
    for m in regex.finditer(source_content):
        output += "".join([source_content[i:m.start()],
                           "<strong><span style='color:%s'>" % color[0:],
                           source_content[m.start():m.end()],
                           "</span></strong>"])
        i = m.end()
    outfile = open(filepath, 'w+')
    outfile.seek(0)
    outfile.write(output)
    print "\nProcess Completed!\n"
    infile.close()
    outfile.close()

raw_input()
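As an aside (my own reading of the code above, not something raised in the answers below): the truncation happens because the loop only ever appends text up to the end of the last match, so the slice after the final match never makes it into output. A likely fix is to append that remaining slice before writing:

output += source_content[i:]  # keep the text after the last match
outfile.write(output)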
The error message tells you what the error is:
No such file or directory: 'sample1.html'
Make sure the file exists. Or use a try statement to give it a default behavior.
The reason you get that error is that the Python script doesn't know where the files you want to open are located.
You have to provide the file path to open them, as I have done below. I have simply concatenated the source path + '\\' + filename and saved the result in a variable named filepath. Now simply pass this variable to open().
import os
import sys

source = raw_input("Enter the source files path:")
listfiles = os.listdir(source)

for f in listfiles:
    filepath = source + '\\' + f  # This is the file path
    infile = open(filepath, 'r')
Also, there are a couple of other problems with your code: if you want to open the file for both reading and writing, you have to use r+ mode. Moreover, on Windows, if you open a file in r+ mode you may have to call file.seek() before file.write() to avoid another issue. You can read the reason for using file.seek() here.
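A minimal sketch of that r+/seek pattern, reusing the filepath variable from the snippet above (what you actually change in content is up to you; the truncate call is my addition to drop leftover bytes if the new text is shorter):

with open(filepath, 'r+') as fh:
    content = fh.read()
    fh.seek(0)        # reposition before writing, as noted above for Windows
    fh.write(content)
    fh.truncate()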
I'm a photographer and doing many backups. Over the years I found myself with a lot of hard drives. Now I bought a NAS and copied all my pictures on one 3TB raid 1 using rsync. According to my script about 1TB of those files are duplicates. That comes from doing multiple backups before deleting files on my laptop and being very messy. I do have a backup of all those files on the old hard drives, but it would be a pain if my script messes things up. Can you please have a look at my duplicate finder script and tell me if you think I can run it or not? I tried it on a test folder and it seems ok, but I don't want to mess things up on the NAS.
The script has three steps in three files. In this first part I find all image and metadata files and put them into a shelve database (datenbank) with their size as the key.
import os
import shelve

datenbank = shelve.open(os.path.join(os.path.dirname(__file__), "shelve_step1"), flag='c', protocol=None, writeback=False)

#path_to_search = os.path.join(os.path.dirname(__file__),"test")
path_to_search = "/volume1/backup_2tb_wd/"
file_exts = ["xmp", "jpg", "JPG", "XMP", "cr2", "CR2", "PNG", "png", "tiff", "TIFF"]

walker = os.walk(path_to_search)
counter = 0

for dirpath, dirnames, filenames in walker:
    if filenames:
        for filename in filenames:
            counter += 1
            print str(counter)
            for file_ext in file_exts:
                if file_ext in filename:
                    filepath = os.path.join(dirpath, filename)
                    filesize = str(os.path.getsize(filepath))
                    if not filesize in datenbank:
                        datenbank[filesize] = []
                    tmp = datenbank[filesize]
                    if filepath not in tmp:
                        tmp.append(filepath)
                        datenbank[filesize] = tmp

datenbank.sync()
print "done"
datenbank.close()
The second part. Now I drop all file sizes which only have one file in their list and create another shelve database with the md5 hash as key and a list of files as value.
import os
import shelve
import hashlib

datenbank = shelve.open(os.path.join(os.path.dirname(__file__), "shelve_step1"), flag='c', protocol=None, writeback=False)
datenbank_step2 = shelve.open(os.path.join(os.path.dirname(__file__), "shelve_step2"), flag='c', protocol=None, writeback=False)

counter = 0
space = 0

def md5Checksum(filePath):
    with open(filePath, 'rb') as fh:
        m = hashlib.md5()
        while True:
            data = fh.read(8192)
            if not data:
                break
            m.update(data)
        return m.hexdigest()

for filesize in datenbank:
    filepaths = datenbank[filesize]
    filepath_count = len(filepaths)
    if filepath_count > 1:
        counter += filepath_count - 1
        space += (filepath_count - 1) * int(filesize)
        for filepath in filepaths:
            print counter
            checksum = md5Checksum(filepath)
            if checksum not in datenbank_step2:
                datenbank_step2[checksum] = []
            temp = datenbank_step2[checksum]
            if filepath not in temp:
                temp.append(filepath)
                datenbank_step2[checksum] = temp

print counter
print str(space)

datenbank_step2.sync()
datenbank_step2.close()
print "done"
And finally the most dangerous part. For every md5 key I retrieve the file list and do an additional sha1. If it matches, I delete every file in that list except the first one and create a hard link to replace the deleted files.
import os
import shelve
import hashlib

datenbank = shelve.open(os.path.join(os.path.dirname(__file__), "shelve_step2"), flag='c', protocol=None, writeback=False)

def sha1Checksum(filePath):
    with open(filePath, 'rb') as fh:
        m = hashlib.sha1()
        while True:
            data = fh.read(8192)
            if not data:
                break
            m.update(data)
        return m.hexdigest()

for hashvalue in datenbank:
    switch = True
    for path in datenbank[hashvalue]:
        if switch:
            original = path
            original_checksum = sha1Checksum(path)
            switch = False
        else:
            if sha1Checksum(path) == original_checksum:
                os.unlink(path)
                os.link(original, path)
                print "delete: ", path

print "done"
What do you think?
Thank you very much.
*In case it's somehow important: it's a Synology 713+ with an ext3 or ext4 filesystem.
This looked good, and after sanitizing a bit (to make it work with python 3.4), I ran this on my NAS. While I had hardlinks for files that had not been modified between backups, files that had moved were being duplicated. This recovered that lost disk space for me.
A minor nitpick is that files that are already hardlinks are deleted and relinked. This does not affect the end result anyway.
I did slightly alter the third file ("3.py"):
if sha1Checksum(path) == original_checksum:
    tmp_filename = path + ".deleteme"
    os.rename(path, tmp_filename)
    os.link(original, path)
    os.unlink(tmp_filename)
    print("Deleted {} ".format(path))
This makes sure that in the case of a power failure or some other similar error, no files are lost, though a trailing ".deleteme" file is left behind. A recovery script should be quite trivial.
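For instance, a possible recovery sketch (my own, not from the original post), run over the same path_to_search as in step one: if the real path exists the hard link was already created and the ".deleteme" copy can go, otherwise the link was never made and the file is renamed back.

import os

for dirpath, dirnames, filenames in os.walk("/volume1/backup_2tb_wd/"):
    for name in filenames:
        if name.endswith(".deleteme"):
            tmp_path = os.path.join(dirpath, name)
            real_path = tmp_path[:-len(".deleteme")]
            if os.path.exists(real_path):
                os.unlink(tmp_path)             # link exists, drop the leftover
            else:
                os.rename(tmp_path, real_path)  # link was never made, restore the file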
Why not compare the files byte for byte instead of computing a second checksum? Once in a billion, two checksums might accidentally match, but a direct comparison shouldn't fail. It shouldn't be slower, and might even be faster. It could be slower when there are more than two files and you have to read the original file for each of the others. If you really wanted to, you could get around that by comparing blocks of all the files at once.
EDIT:
I don't think it would require more code, just different. Something like this for the loop body:
data1 = fh1.read(8192)
data2 = fh2.read(8192)
if data1 != data2: return False
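Fleshed out into a full function (a sketch with my own naming, not code from the original answer), the comparison could look like this:

def files_identical(path1, path2, blocksize=8192):
    # Read both files in fixed-size blocks and stop at the first difference.
    with open(path1, 'rb') as fh1, open(path2, 'rb') as fh2:
        while True:
            data1 = fh1.read(blocksize)
            data2 = fh2.read(blocksize)
            if data1 != data2:
                return False
            if not data1:  # both files ended at the same point
                return True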
Note: if you're not wedded to Python, there are existing tools to do the heavy lifting for you:
https://unix.stackexchange.com/questions/3037/is-there-an-easy-way-to-replace-duplicate-files-with-hardlinks
How do you create a hard link?
In Linux you do:
sudo ln sourcefile linkfile
Sometimes this can fail (for me it fails sometimes). Also, your Python script needs to run in sudo mode.
So I use symbolic links:
ln -s sourcefile linkfile
I can check for them with os.path.islink
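Note that Python can also create and check these links directly, without shelling out. A small sketch (the file names are placeholders):

import os

os.link("sourcefile", "linkfile")       # hard link, like plain "ln"
os.symlink("sourcefile", "linkfile2")   # symbolic link, like "ln -s"
print(os.path.islink("linkfile2"))      # True only for the symlink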
You can call the commands like this in Python:
os.system("ln -s sourcefile linkfile")
or like this using subprocess:
import subprocess
subprocess.call(["ln", "-s", sourcefile, linkfile])  # pass the list without shell=True so the arguments reach ln
Have a look at execution from command line and hard vs. soft links
When it works, could you post your whole code? I would like to use it, too.
So I am pulling jpgs from a URL. I am able to save the image files as long as they are being saved to the same folder the Python file is in. As soon as I attempt to change the folder (seen here as the outpath), the image files do not get created. I imagine it has something to do with my outpath, but it seems to be fine when I am printing it and watching it in the console.
Ubuntu 11.10 OS by the way. I'm a newbie with both linux and python, so it could easily be either. :)
If I were to print the sequence taken from the CSV file it would look like: [['Champ1', 'Subname1', 'imgurl1'],['Champ2', 'subname2', 'imgurl2'],['Champ3','subname3','imgurl3']...]
(It was scraped from a website)
import csv
from urlparse import urlsplit
from urllib2 import urlopen, build_opener
from urllib import urlretrieve
import webbrowser
import os
import sys

reader = csv.reader(open('champdata.csv', "rb"), delimiter = ",", skipinitialspace=True)
champInfo = []
for champs in reader:
    champInfo.append(champs)

size = len(champInfo)

def GetImages(x, out_folder="/home/sean/Home/workspace/CP/images"):
    b = 1
    size = len(champInfo)
    print size
    while b < size:
        temp_imgurls = x.pop(b)
        filename = os.path.basename(temp_imgurls[2])
        print filename
        outpath = os.path.join(out_folder, filename)
        print outpath
        u = urlopen(temp_imgurls[2])
        localFile = open(outpath, 'wb')
        localFile.write(u.read())
        localFile.close()
        b += 1

GetImages(champInfo)
I understand it's quite crude, but it does work, as long as I'm not attempting to change the save path.
Try providing the complete image path everywhere
E:/../home/sean/Home/workspace/CD/images
def GetImages(x):
    b = 1
    size = len(champInfo)
    print size
    while b < size:
        temp_imgurls = x.pop(b)
        filename = temp_imgurls[2]
        u = urlopen(temp_imgurls[2])
        localFile = open(filename, 'wb')
        localFile.write(u.read())
        localFile.close()
        b += 1  # advance to the next entry (missing in the original snippet)
And this code will save the files in the same directory where the script is.
Updated Answer:
I think the answer to your problem is just to add a check for the output directory's existence, and create it if needed, i.e. add:
if not os.path.exists(out_folder):
    os.makedirs(out_folder)
to your existing code.
More generally, you could try something more like this:
import csv
from urllib2 import urlopen
import os
import sys

default_outfolder = "/home/sean/Home/workspace/CD/images"

# simple arg passing without error checking
out_folder = sys.argv[1] if len(sys.argv) == 2 else default_outfolder

if not os.path.exists(out_folder):
    os.makedirs(out_folder)  # creates out_folder, including any required parent ones
else:
    if not os.path.isdir(out_folder):
        raise RuntimeError('output path must be a directory')

reader = csv.reader(open('champdata.csv', "rb"), delimiter=",", skipinitialspace=True)

for champs in reader:
    img_url = champs[2]
    filename = os.path.basename(img_url)
    outpath = os.path.join(out_folder, filename)
    print 'Downloading %s to %s' % (img_url, outpath)
    with open(outpath, 'wb') as f:
        u = urlopen(img_url)
        f.write(u.read())
The above code works for a champdata.csv of the form stuff,more_stuff,http://www.somesite.com.au/path/to/image.png, but will need to be adapted if I have not understood the actual format of your incoming data.