Is there a Python module for regex matching in zip files?

I have over a million text files compressed into 40 zip files. I also have a list of about 500 model names of phones. I want to find out the number of times a particular model was mentioned in the text files.
Is there any Python module that can do a regex match on the files without unzipping them? Is there a simple way to solve this problem without unzipping everything?

There's nothing that will automatically do what you want.
However, Python's zipfile module makes this easy to do. Here's how to iterate over the lines in each file:
#!/usr/bin/python
import zipfile

with zipfile.ZipFile('myfile.zip') as f:
    for subfile in f.namelist():
        print(subfile)
        # read() returns bytes, so decode before splitting into lines
        data = f.read(subfile).decode('utf-8')
        for line in data.split('\n'):
            print(line)

You could loop through the zip files, reading individual members with the zipfile module and running your regex on those, so you never have to extract everything to disk at once.
I'm fairly certain that you can't run a regex over the zipped data, at least not meaningfully.
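For instance, here's a minimal sketch of that loop (the *.zip glob, the sample model names, and the UTF-8 decoding are assumptions to adapt):
import glob
import re
import zipfile
from collections import Counter

# Hypothetical model list; in practice, load your ~500 names from a file.
model_names = ['iPhone 4', 'Galaxy S II', 'Lumia 800']
# One alternation of escaped literal names.
pattern = re.compile('|'.join(re.escape(m) for m in model_names))

counts = Counter()
for zip_path in glob.glob('*.zip'):              # the 40 zip files
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():          # the text files inside
            text = archive.read(name).decode('utf-8', errors='replace')
            counts.update(pattern.findall(text))

for model, n in counts.most_common():
    print(model, n)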

To access the contents of a zip file you have to unzip it, although the zipfile package makes this fairly easy, as you can unzip each file within an archive individually.
Python zipfile module

Isn't it (at least theoretically) possible to read in the ZIP's Huffman coding and translate the regexp into the Huffman-coded domain? Might this be more efficient than first decompressing the data and then running the regexp?
(Note: I know it wouldn't be quite that simple: you'd also have to deal with other aspects of the ZIP coding—file layout, block structures, back-references—but one imagines this could be fairly lightweight.)
EDIT: Also note that it's probably much more sensible to just use the zipfile solution.

Related

Archive files directly from memory in Python

I'm writing a program where I take a number of files and zip them with encryption using pyzipper, writing them to io.BytesIO() so they stay in memory. Now, after some other additions, I want to take all of these in-memory files and zip them together into a single encrypted zip file, again using pyzipper.
The code looks something like this:
import pyzipper
from io import BytesIO

# Create the in-memory file object
in_memory = BytesIO()
# Create the zip file and open it in write mode
with pyzipper.AESZipFile(in_memory, "w", compression=pyzipper.ZIP_LZMA, encryption=pyzipper.WZ_AES) as zip_file:
    # Set password
    zip_file.setpassword(b"password")
    # Save "data" under file_name
    zip_file.writestr(file_name, data)
# Go back to the beginning of the buffer
in_memory.seek(0)
# Read the zip file data
data = in_memory.read()
# Add the data to a list
files.append(data)
So, as you may guess, the files list is an attribute of a class, and the whole thing above is a function that runs a number of times to build the full files list. For simplicity's sake, I removed most of the irrelevant parts.
I get no errors for now, but when I try to write all files to a new zip file I get an error. Here's the code:
with pyzipper.AESZipFile(test_name, "w", compression=pyzipper.ZIP_LZMA, encryption=pyzipper.WZ_AES) as zfile:
    zfile.setpassword(b"pass")
    for file in files:
        zfile.write(file)
I get a ValueError because of os.stat:
File "C:\Users\vulka\AppData\Local\Programs\Python\Python310\lib\site-packages\pyzipper\zipfile.py", line 820, in from_file
st = os.stat(filename)
ValueError: stat: embedded null character in path
[WHAT I TRIED]
So, I tried using mmap for this purpose, but I don't think it can help me, and if it can, I have no idea how to make it work.
I also tried using fs.memoryfs.MemoryFS to temporarily create a virtual filesystem in memory to store all the files and then pull them back out to zip everything together and save it to disk. Again, it failed: I got tons of different errors in my tests and, to be honest, there's very little information out there on this fs method; even if what I'm trying to do is possible, I couldn't figure it out.
P.S.: I don't know if pyzipper (almost 1:1 zipfile with the addition of encryption) supports nested zip files at all. That could be the problem I'm facing, but if it doesn't, I'm open to suggestions for a new approach. Also, I don't want to rely on third-party software, even if it is open source (I'm talking about using 7zip to do all the archiving and encryption, which shouldn't even be possible without saving the files to disk first, which is the main thing I'm trying to avoid).
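For reference, the traceback points at the likely cause: ZipFile.write() (which pyzipper inherits) expects a filesystem path and calls os.stat() on it, so passing raw zip bytes makes it treat the data as a path, hence the embedded null character error. writestr() accepts in-memory data instead. A minimal sketch, assuming the files list is changed to hold (file_name, zip_bytes) tuples:
import pyzipper

# Assumption: `files` now stores (file_name, zip_bytes) tuples
# collected by the earlier function, e.g. [("inner1.zip", b"PK...")].
with pyzipper.AESZipFile("outer.zip", "w", compression=pyzipper.ZIP_LZMA, encryption=pyzipper.WZ_AES) as zfile:
    zfile.setpassword(b"pass")
    for name, data in files:
        # writestr() takes an archive name plus the raw bytes,
        # so nothing has to touch the disk.
        zfile.writestr(name, data)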

How to open files in a particular folder with randomly generated names?

I have a folder named 2018 and the files within that folder are named randomly. I want to iterate through all of the files and open them up.
I will post three of the file names as an example, but note that there are over a thousand files in this folder, so it has to work at scale without any hard-coding.
0a2ec2da-628d-417d-9520-b0889886e2ac_1.xml
00a6b260-951d-46b5-ab27-b2e8729e664d_1.xml
00a6b260-951d-46b5-ab27-b2e8729e664d_2.xml
You're looking for os.walk().
In general, if you want to do something with files, it's worth glancing at the os, os.path, pathlib and other built-in modules. They're all documented.
You could also use glob expansion to expand "folder/*" into a list of all the filenames, but os.walk is probably better.
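For example, a minimal os.walk() sketch (the folder name 2018 comes from the question; the .xml filter is inferred from the sample names):
import os

xml_paths = []
for root, dirs, filenames in os.walk("2018"):
    for name in filenames:
        if name.endswith(".xml"):
            xml_paths.append(os.path.join(root, name))

for path in xml_paths:
    with open(path) as f:
        contents = f.read()  # process each file here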
With os.listdir() or os.walk(), depending on whether you want to do it recursively or not.
You can go through the Python docs:
https://docs.python.org/3/library/os.html#os.walk
https://docs.python.org/3/library/os.html#os.listdir
Once you have the list of files, you can read them simply:
for file in files:
    with open(file, "r") as f:
        data = f.read()  # perform file operations here

Zip list of files python

I have created some csv files in my code and I would like to zip them into one archive to send by e-mail. I already have the e-mail function; the problem is the zipping.
I tried to use the approach from here, but I am not extracting or finding files in a directory: my program creates the csv files itself and keeps them in a list.
My list of files is like this:
lista_files = [12.csv,13.csv,14.csv]
It probably seems easy to experienced developers, but as a beginner I find it hard. I would really appreciate it if someone could help me.
I believe you're looking for the zipfile library. And given that you're looking at a list of filenames, I'd just iterate using a for loop. If you have directories listed as well, you could use os.walk.
import zipfile

lista_files = ["12.csv", "13.csv", "14.csv"]
with zipfile.ZipFile('out.zip', 'w') as zipMe:
    for file in lista_files:
        zipMe.write(file, compress_type=zipfile.ZIP_DEFLATED)

Regex to match the first file in a rar archive file set in Python

I need to uncompress all the files in a directory, and for that I need to find the first file in the set. I'm currently doing this with a bunch of if statements and loops. Can I do it with a regex?
Here's a list of files that I need to match:
yes.rar
yes.part1.rar
yes.part01.rar
yes.part001.rar
yes.r01
yes.r001
These should NOT be matched:
no.part2.rar
no.part02.rar
no.part002.rar
no.part011.rar
no.r002
no.r02
I found a similar regex on this thread, but it seems that Python doesn't support variable-length lookbehinds. A single-line regex would be complicated, but I'll document it well, so that's not a problem. It's just one of those problems you beat your head against.
Thanks in advance guys.
:)
Don't rely on the names of the files to determine which one is first. You're going to end up finding an edge case where you get the wrong file.
RAR's headers will tell you which file is the first one in the volume set, assuming the archives were created by a somewhat-recent version of RAR.
HEAD_FLAGS (2 bytes) - Bit flags:
0x0100 - First volume (set only by RAR 3.0 and later)
So open up each file and examine the RAR headers, looking specifically for the flag that indicates which file is the first volume. This will never fail, as long as the archive isn't corrupt.
Update: I've just confirmed this by looking at some spanning archives in a hex editor. The file headers are constructed exactly as the link above indicates. It's just a matter of opening the files and reading the header for that flag; the file with the flag set is the first volume.
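A minimal sketch of that check, assuming the older RAR 2.x/3.x format (the Rar!\x1a\x07\x00 signature; RAR5 archives use a different header layout):
import struct

RAR4_MARKER = b"Rar!\x1a\x07\x00"

def is_first_volume(path):
    # Return True if `path` looks like the first volume of a RAR 2.x/3.x set.
    with open(path, "rb") as f:
        if f.read(7) != RAR4_MARKER:
            return False  # not a RAR4 archive (could be RAR5)
        # Archive header: HEAD_CRC (2) HEAD_TYPE (1) HEAD_FLAGS (2) HEAD_SIZE (2)
        head = f.read(7)
    if len(head) < 7 or head[2] != 0x73:  # 0x73 = main archive header
        return False
    flags = struct.unpack("<H", head[3:5])[0]
    return bool(flags & 0x0100)  # 0x0100 = first volume (RAR 3.0+)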
There's no need to use look behind assertions for this. Since you start looking from the beginning of the string, you can do everything with look-aheads that you can with look-behinds. This should work:
^((?!\.part(?!0*1\.rar$)\d+\.rar$).)*\.(?:rar|r?0*1)$
To capture the first part of the filename as you requested, you could do this:
^((?:(?!\.part\d+\.rar$).)*)\.(?:(?:part0*1\.)?rar|r?0*1)$
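For instance, a quick check of the first pattern against the sample names:
import re

first_volume = re.compile(r"^((?!\.part(?!0*1\.rar$)\d+\.rar$).)*\.(?:rar|r?0*1)$")

for name in ["yes.rar", "yes.part01.rar", "yes.r01", "no.part02.rar", "no.r02"]:
    print(name, bool(first_volume.match(name)))
# yes.rar True / yes.part01.rar True / yes.r01 True / no.part02.rar False / no.r02 False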
Are you sure you want to match these cases?
yes.r01
They are not the first archives: .rar always is.
It's bla.rar, bla.r00 and only then bla.r01. You'll probably extract the files twice if you treat both .r01 and .rar as the first archive.
yes.r001
.r001 doesn't exist. Do you mean the .001 files that WinRAR supports?
After .r99, it's .s00. If it does exist, then somebody manually renamed the files.
In theory, matching on filename should be as reliable as matching on the 0x0100 flag to find the first archive.

Count number of files with certain extension in Python

I am fairly new to Python and I am trying to figure out the most efficient way to count the number of .TIF files in a particular sub-directory.
Doing some searching, I found one example (I have not tested), which claimed to count all of the files in a directory:
file_count = sum((len(f) for _, _, f in os.walk(myPath)))
This is fine, but I need to count only TIF files. My directory will contain other file types, but I only want to count TIFs.
Currently I am using the following code:
tifCounter = 0
for root, dirs, files in os.walk(myPath):
    for file in files:
        if file.endswith('.tif'):
            tifCounter += 1
It works fine, but the looping seems to be excessive/expensive to me. Any way to do this more efficiently?
Thanks.
Something has to iterate over all the files in the directory and look at every single file name, whether that's your code or a library routine. So no matter what the specific solution is, they will all have roughly the same cost.
If you think it's too much code, and you don't actually need to search subdirectories recursively, you can use the glob module (glob.glob1 exists but is an undocumented helper, so the public glob.glob is the safer choice):
import glob
import os

tifCounter = len(glob.glob(os.path.join(myPath, "*.tif")))
For this particular use case, if you don't want to search the subdirectories recursively, you can use os.listdir:
len([f for f in os.listdir(myPath)
     if f.endswith('.tif') and os.path.isfile(os.path.join(myPath, f))])
Your code is fine.
Yes, you're going to need to loop over those files to filter out the .tif files, but looping over a small in-memory array is negligible compared to the work of scanning the file directory to find these files in the first place, which you have to do anyway.
I wouldn't worry about optimizing this code.
If you do need to search recursively, or for some other reason don't want to use the glob module, you could use
file_count = sum(len([f for f in fs if f.lower().endswith('.tif')]) for _, _, fs in os.walk(myPath))
This is the "Pythonic" way to adapt the example you found for your purposes. But it's not going to be significantly faster or more efficient than the loop you've been using; it's just a really compact syntax for more or less the same thing.
Try using fnmatch:
https://docs.python.org/2/library/fnmatch.html
import fnmatch
import os

num_files = len(fnmatch.filter(os.listdir(your_dir), '*.tif'))
print(num_files)
