How to save multiple files from URLs in one zip using Python?

I am scraping files from URLs using Beautiful Soup and want to store them in a single zip file using Python. Below is my code snippet for one URL.
fz = zipfile.ZipFile('C:\\Users\\ADMIN\\data\\data.zip', 'w')
response = urllib2.urlopen(url/file_name.txt)
file = open('C:\\Users\\ADMIN\\data\\filename.txt','w')
file.write(response.read())
file.close()
fz.write('C:\\Users\\ADMIN\\data\\filename.txt',compress_type=zipfile.ZIP_DEFLATED)
fz.close()
This snippet is not working for me. Can anyone please help? I am getting the error below:
WindowsError: [Error 2] The system cannot find the file specified:
'C:\Users\ADMIN\data\filename.txt'
But the file is present at this location.

Use:
fz.writestr("file_name", response.read())
as many times as you need, i.e. one writestr() call per file. Select the compression method (ZIP_DEFLATED) when you open the new ZIP.
You do not need to save each file to disk and then pack it. Just take each file's name and content and feed them to writestr(). ZIP uses '/' as the path separator, so use a name like "some_dir/some_subdir/index.html" for subdirectories or "index.html" to put a file in the archive root. Avoid a leading '/': the ZIP format does not allow one in entry names.
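The advice above can be sketched as follows. This is a minimal sketch, assuming Python 3 (where urllib2 became urllib.request); the names fetch, zip_from_urls and fetcher are hypothetical helpers, not part of the original snippet.

```python
import io
import zipfile
from urllib.request import urlopen


def fetch(url):
    """Download one URL and return its body as bytes."""
    with urlopen(url) as response:
        return response.read()


def zip_from_urls(named_urls, zip_target, fetcher=fetch):
    """Write each (archive_name, url) pair into one deflated zip.

    zip_target may be a path or a writable binary file object.
    Nothing is saved to disk except the zip itself.
    """
    with zipfile.ZipFile(zip_target, "w", compression=zipfile.ZIP_DEFLATED) as fz:
        for archive_name, url in named_urls:
            # One writestr() call per downloaded file.
            fz.writestr(archive_name, fetcher(url))
```

The fetcher parameter exists only so the download step can be swapped out, for example with a function that adds headers or one that returns canned bytes while testing.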

Related

Archive files directly from memory in Python

I'm writing a program that takes a number of files and zips each one with encryption using pyzipper. I use io.BytesIO() as the target so the archives stay in memory. Now, after some other additions, I want to take all of these in-memory archives and zip them together into a single encrypted zip file, again with pyzipper.
The code looks something like this:
# Create the in-memory file object
in_memory = BytesIO()
# Create the zip file and open in write mode
with pyzipper.AESZipFile(in_memory, "w", compression=pyzipper.ZIP_LZMA, encryption=pyzipper.WZ_AES) as zip_file:
    # Set password
    zip_file.setpassword(b"password")
    # Save "data" with file_name
    zip_file.writestr(file_name, data)
# Go to the beginning
in_memory.seek(0)
# Read the zip file data
data = in_memory.read()
# Add the data to a list
files.append(data)
So, as you may guess the "files" list is an attribute from a class and the whole thing above is a function that does this a number of times and then you get the full files list. For simplicity's sake, I removed most of the irrelevant parts.
I get no errors for now, but when I try to write all files to a new zip file I get an error. Here's the code:
with pyzipper.AESZipFile(test_name, "w", compression=pyzipper.ZIP_LZMA, encryption=pyzipper.WZ_AES) as zfile:
    zfile.setpassword(b"pass")
    for file in files:
        zfile.write(file)
I get a ValueError because of os.stat:
File "C:\Users\vulka\AppData\Local\Programs\Python\Python310\lib\site-packages\pyzipper\zipfile.py", line 820, in from_file
st = os.stat(filename)
ValueError: stat: embedded null character in path
[WHAT I TRIED]
So, I tried using mmap for this purpose but I don't think this can help me and if it can - then I have no idea how to make it work.
I also tried using fs.memoryfs.MemoryFS to temporarily create a virtual filesystem in memory, store all the files there, then read them back to zip everything together and save the result to disk. Again, it failed. I got tons of different errors in my tests and, TBH, there's very little information out there on this fs method, so even if what I'm trying to do is possible, I couldn't figure it out.
P.S.: I don't know whether pyzipper (almost 1:1 zipfile with the addition of encryption) supports nested zip files at all. This could be the problem I'm facing, but if it doesn't, I'm open to any suggestions for a new approach. Also, I don't want to rely on third-party software, even if it is open source! (I'm talking about using 7zip to do all the archiving and encryption, even though it shouldn't be possible to use it without saving the files to disk in the first place, which is the main thing I'm trying to avoid.)
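The traceback points at the likely cause: write() expects a filesystem path and calls os.stat() on it, so passing raw zip bytes fails with "embedded null character in path". Bytes that are already in memory go through writestr() instead. Below is a minimal sketch using the standard-library zipfile so it runs without extra packages; it assumes the same calls work on pyzipper.AESZipFile (the question already uses writestr() on it). The names make_inner_zip, bundle and archive_N.zip are hypothetical.

```python
import io
import zipfile


def make_inner_zip(file_name, data):
    """Build one zip archive in memory and return its raw bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(file_name, data)
    # Read the buffer only after the archive has been closed,
    # so the central directory is included.
    return buf.getvalue()


def bundle(inner_zips, target):
    """Store each in-memory archive as a member of one outer zip."""
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as outer:
        for index, blob in enumerate(inner_zips):
            # writestr() takes (member name, bytes); write() would
            # treat the bytes as a path on disk.
            outer.writestr(f"archive_{index}.zip", blob)
```

With pyzipper, the same loop would presumably become zfile.writestr(name, file) inside the question's existing with block, after setpassword().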

os.listdir() adds characters to the beginning of file name?

I had a quick google of this but couldn't find anything. I'm using os to get a list of all the file names in the current working directory using the following code:
path = os.getcwd()
files = os.listdir(path)
The list of files returns fine, but the last element has an extra '~$' that isn't in the actual file name. For example:
files
['File1.xlsx', 'File2.xlsx', '~$File3.xlsx']
This is then causing an issue when I iterate through these files to try and import them, as I get the error of:
[Errno 2] No such file or directory: 'C:\\Users\\$File3.xlsx'
If anyone knows why this happens and how I can fix/prevent it, that would be great!
Just thought I'd answer in case anyone else has this issue.
It's nothing to do with os. It happened because I had File3 open in Excel while pulling the list of file names. I've found out that opening a Microsoft Office document creates a temporary 'lock' file, whose name starts with '~$' (this is how it can recover unsaved data if it crashes, etc.).
I found the following explanation:
The files you are describing are so-called owner files (sometimes
referred to as "lock" files). An owner file is created when you work
with a document ... and it should be deleted when you save your
document and exit.
There's also an SO question about these files in Microsoft applications.
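To keep these owner files out of the import loop, the listing can be filtered on the '~$' prefix. A minimal sketch; is_owner_file and list_real_files are hypothetical helper names.

```python
import os


def is_owner_file(name):
    """True for the '~$' owner/lock files Office creates next to open documents."""
    return name.startswith("~$")


def list_real_files(path):
    """Return the directory listing without Office owner files."""
    return [name for name in os.listdir(path) if not is_owner_file(name)]
```

Closing the workbook in Excel before running the script also makes the '~$' entry disappear, but filtering keeps the script robust when a file happens to be open.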

How to iterate through multiple excel files using python

I am trying to develop a python script that will iterate through several Excel .xlsx files, search each file for a set of values and save them to a new .xlsx template.
The issue I'm having is when I'm trying to get a proper list of files in the folder I'm looking at. I'm saving these filenames in a list variable 'fileList' to manage iteration.
When I run os.chdir(sourcepath), I constantly get a
FileNotFoundError: [WinError 2] The system cannot find the file specified: C:\\Users\\username\\PycharmProjects\\projectName\\venv\\Site List\\siteListfolder
I think this has to do with the '\\' displayed in the error, but when I run print(sourcepath) in this code, the path is displayed properly, with just one '\' between each subdirectory instead of two.
I need to be able to get the list of files in the siteListfolder, and be able to iterate through them using this kind of logic:
priCLLI = sys.argv[1]
secCLLI = sys.argv[2]
sourcepath = os.path.join(homepath, 'Site List', f'{priCLLI}_{secCLLI}')
siteListfolder = os.listdir(sourcepath)
for file in siteListfolder:
    for row in file:
        <script does its work>
'siteListfolder = os.listdir(sourcepath)' is generating the error
Thanks to all in advance for supporting this kind of forum.
import os
directory = 'your/path/directory'
Source_Workbook = []
for filename in os.listdir(directory):
    if filename.endswith(".xlsx"):
        Source_Workbook.append(filename)
print(Source_Workbook)
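On the doubled backslashes: error messages show the repr() of the path, which escapes each backslash, so '\\' in the message is still a single separator in the actual string. The error itself means the directory was not found at that location. A minimal sketch of both points; list_xlsx is a hypothetical helper and the sample path is made up.

```python
import os

path = "C:\\Users\\username"   # one backslash per separator in the actual string
print(path)        # C:\Users\username
print(repr(path))  # 'C:\\Users\\username'


def list_xlsx(sourcepath):
    """Return full paths of the .xlsx files in sourcepath.

    Fails early with a readable message if the directory does not exist.
    """
    if not os.path.isdir(sourcepath):
        raise FileNotFoundError(f"Not a directory: {sourcepath}")
    return [
        os.path.join(sourcepath, name)
        for name in sorted(os.listdir(sourcepath))
        if name.lower().endswith(".xlsx")
    ]
```

Returning full paths matters for the later steps: os.listdir() yields bare file names, which only open correctly if the working directory happens to be sourcepath.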

WinError 32: The process cannot access the file because it is being used by another process

I have written the following code to extract zip files in a directory and delete a particular Excel file in the extracted directory:
def extractZipFiles(dest_directory):
    "This function extracts zip files in the destination directory for further processing"
    fileFullPath = dest_directory + '\\'
    extractedDirList = list()
    for file in os.listdir(dest_directory):
        dn = fileFullPath+file
        dn = re.sub(r'\.zip$', "", fileFullPath+file) #remove the trailing .zip.
        extractedDirList.append(dn)
        zf = zipfile.ZipFile(fileFullPath+file, mode='r')
        zf.extractall(dn) # extract the contents of that zip to the empty directory
        zf.close()
    return extractedDirList
def removeSelectedReports(extractedDirList):
    "This function removes the selected reports from extracted directory"
    for i in range(len(extractedDirList)):
        for filename in os.listdir(extractedDirList[i]):
            if filename.startswith("ABC_8"):
                logger.info("File to be removed::"+filename)
                fullPathName= "%s/%s" % (extractedDirList[i],filename)
                os.remove(fullPathName)
    return
extractedDirList = extractZipFiles(attributionRptDestDir)
logger.info("ZIP FILES EXTRACTED:"+str(extractedDirList))
removeSelectedReports(extractedDirList)
I intermittently get the following error even though I have closed the zip file handle:
[WinError 32] The process cannot access the file because it is being used by another process: '\\\\share\\Workingdirectory\\report.20180517.zip'
Can you please help resolve this issue?
You should try to figure out what has the file open. Based on your code, it looks like you are on Microsoft Windows.
I would stop all applications on your workstation, including browsers, run with only a minimum number of apps open, and reproduce the problem. Once it is reproduced, you can use a tool to list all handles open to a particular file.
A handy utility would be handle.exe, but please use any tool with similar functionality.
Once you find the offending application, you can further investigate why the file is open, and take countermeasures.
I would be careful not to close any application which has the file open, until you know it is safe to do so.
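Because the error is intermittent, one possible countermeasure while the investigation is ongoing is to retry the failing call after a short pause. This is only a sketch of that idea, not something the answer above prescribes; it assumes Python 3, where WinError 32 surfaces as PermissionError, and retry_on_permission_error is a hypothetical helper.

```python
import time


def retry_on_permission_error(action, attempts=5, delay=0.5):
    """Call action(); if it raises PermissionError, wait and try again.

    The last failure is re-raised so a persistent lock is still reported.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except PermissionError:
            if attempt == attempts:
                raise
            time.sleep(delay)
```

For example, the deletion in the question could be wrapped as retry_on_permission_error(lambda: os.remove(fullPathName)). If the error persists across all attempts, some process is holding the file for longer, and the handle inspection described above is the way to find it.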

Determine Filename of Unzipped File

Say you unzip a file called file123.zip with zipfile.ZipFile, which yields an unzipped file saved to a known path. However, this unzipped file has a completely random name. How do you determine this completely random filename? Or is there some way to control what the name of the unzipped file is?
I am trying to implement this in python.
By "random" I assume that you mean that the files are named arbitrarily.
You can use ZipFile.read() which unzips the file and returns its contents as a string of bytes. You can then write that string to a named file of your choice.
from zipfile import ZipFile
with ZipFile('file123.zip') as zf:
    for i, name in enumerate(zf.namelist()):
        with open('outfile_{}'.format(i), 'wb') as f:
            f.write(zf.read(name))
This will write each file from the archive to a file named outfile_n in the current directory. The names of the files contained in the archive are obtained with ZipFile.namelist(). I've used enumerate() as a simple method of generating the file names; however, you could substitute whatever naming scheme you require.
If the filename is completely random you can first check for all filenames in a particular directory using os.listdir(). Now you know the filename and can do whatever you want with it :)
See this topic for more information.
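If the original member names are acceptable and the goal is only to learn what was written, there is no need to scan the directory afterwards: ZipFile.extract() returns the path of the file it created. A minimal sketch; extract_and_report is a hypothetical helper name.

```python
import os
import zipfile


def extract_and_report(zip_path, out_dir):
    """Extract every file member and return the paths that were written."""
    with zipfile.ZipFile(zip_path) as zf:
        return [
            zf.extract(info, out_dir)   # returns the path it created
            for info in zf.infolist()
            if not info.is_dir()        # skip pure directory entries
        ]
```

Calling extract_and_report('file123.zip', 'out') would return a list with one path per extracted file, whatever the names inside the archive turn out to be.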
