Extracting comments from multiple zip files with Python

I started learning Python a few months ago, and I'm trying to solve a challenge that requires me to go through many (about 2,000) zip files in a folder, collect all the comments in them, and find a clue.
The part that I'm struggling with is the extraction of the comments.
I imported the zipfile module, but I'm not sure how to make it go through all the zip files in the folder and collect all the comments.
I'm using PyCharm, and I wouldn't mind whether the result is shown in the preview area inside PyCharm or exported to a new .txt file.
Can anyone help me?

For looping over files, I tend to use the glob module in Python. It returns a list of paths that match the pattern you specify (see docs). Once you have the list of files, you can loop over them and run some function/code on each one in turn.
import glob

# Collect every path in the directory that ends in .zip
list_of_files = glob.glob("/path/to/directory/*.zip")
for f in list_of_files:
    pass  # insert code for each zip file here

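import os
for i in os.listdir(path_to_folder):
    if i.endswith('.zip'):
        # os.listdir returns bare names, so build the full path before opening
        full_path = os.path.join(path_to_folder, i)  # your code here
Try this and let me know if there are any issues.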

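The comment attached to an individual file inside an archive can be read through the getinfo method of a ZipFile object in the zipfile module, i.e., getinfo(file_name).comment, as explained in this post. Note that comment is an attribute holding a bytes object, not a method, so it is not called with parentheses. The comment for the archive as a whole is available as the comment attribute of the ZipFile object itself.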
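Putting the answers together, here is a minimal sketch of how the comments could be collected from every archive in a folder. The function names collect_zip_comments and write_comments are made up for illustration, and the sketch assumes the comments decode as UTF-8 (undecodable bytes are replaced rather than raising an error).

```python
import glob
import os
import zipfile


def collect_zip_comments(folder):
    """Return (archive_name, member_name_or_None, comment_text) tuples.

    member_name is None for the archive-level comment.
    """
    results = []
    for path in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        archive = os.path.basename(path)
        try:
            with zipfile.ZipFile(path) as zf:
                # Comment attached to the archive as a whole (bytes).
                if zf.comment:
                    text = zf.comment.decode("utf-8", errors="replace")
                    results.append((archive, None, text))
                # Comments attached to individual members (bytes).
                for info in zf.infolist():
                    if info.comment:
                        text = info.comment.decode("utf-8", errors="replace")
                        results.append((archive, info.filename, text))
        except zipfile.BadZipFile:
            # Skip files that end in .zip but are not valid archives.
            continue
    return results


def write_comments(results, out_path):
    """Write one tab-separated line per comment to a text file."""
    with open(out_path, "w", encoding="utf-8") as out:
        for archive, member, text in results:
            out.write(f"{archive}\t{member or '<archive>'}\t{text}\n")
```

Calling print(collect_zip_comments("/path/to/directory")) shows the result in the PyCharm run window, while write_comments(results, "comments.txt") exports it to a text file.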

Related

How to extract a continuously updated zip file from a website with Selenium and then unzip it for specific file names

I have used Selenium with Python to download a zip file daily, but I am currently facing a few issues after downloading it to my local download folder.
Is it possible to use Python to read those files dynamically? Let's say the date in the file name is always different. Can we simply add a wildcard (*)? I am trying to move the file from the download folder to another folder, but it always requires me to give the full file name.
How do I unzip a file and look for specific files in it? Let's say those files will always have names of the form "ABC202103xx.csv".
I would much appreciate your help! Any sample code would be truly appreciated!
Not knowing the exact name of a file in a local folder is usually not a problem. You can list all filenames in the local folder and then use a for loop to find the filename you need. For example, let's assume that you have downloaded a zip file into a Downloads folder and you know it is named "file-X.zip", with X being any date.
import os

for filename in os.listdir("Downloads"):
    if filename.startswith("file-") and filename.endswith(".zip"):
        filename_you_are_looking_for = filename
        break
To unzip files, I will refer you to this stackoverflow thread. Again, to look for specific files in there, you can use os.listdir.
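As a rough sketch of the wildcard idea from the question, the glob and fnmatch modules can match both the downloaded archive and the files inside it. The helper names find_latest_zip and extract_matching, and the default patterns "file-*.zip" and "ABC202103*.csv", are assumptions made for illustration and should be adapted to the real names.

```python
import fnmatch
import glob
import os
import zipfile


def find_latest_zip(download_dir, pattern="file-*.zip"):
    """Return the most recently modified archive matching the wildcard, or None."""
    matches = glob.glob(os.path.join(download_dir, pattern))
    return max(matches, key=os.path.getmtime) if matches else None


def extract_matching(zip_path, target_dir, member_pattern="ABC202103*.csv"):
    """Extract only the members whose base name matches the wildcard."""
    os.makedirs(target_dir, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            # Compare against the base name so files in subfolders also match.
            if fnmatch.fnmatchcase(os.path.basename(name), member_pattern):
                extracted.append(zf.extract(name, target_dir))
    return extracted
```

Once the archive path is known, shutil.move(zip_path, other_folder) moves it to another folder without the name ever being typed out in full.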

How to detect and separate Corrupt/Unreadable PDFs and password protected PDFs from a directory using python?

I have a directory containing about 100,000 multipage PDFs.
I want to separate corrupt/unreadable and password-protected PDFs from this directory using Python.
I need a good and fast solution, as I might need to do this for a large number of files in the future.
Thanks in advance.
You can try PyPDF2. Loop over all files in the directory using os.listdir(), try opening each one, and store the name of each one that gives you an error. Using a simple try/except, you can also move the files into two different directories depending on whether opening them raises an error.
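A minimal sketch of that loop might look like the following. It assumes the third-party PyPDF2 package (version 2 or later) is installed; the function names check_with_pypdf2 and sort_pdfs and the subfolder names "protected" and "corrupt" are made up for illustration. Treating every exception as "corrupt" is a simplification, not something PyPDF2 prescribes.

```python
import os
import shutil


def check_with_pypdf2(path):
    """Classify one PDF as 'ok', 'protected' or 'corrupt' (assumes PyPDF2 >= 2)."""
    from PyPDF2 import PdfReader  # imported lazily; third-party dependency

    try:
        reader = PdfReader(path)
        if reader.is_encrypted:
            return "protected"
        len(reader.pages)  # forces the page tree to be parsed
        return "ok"
    except Exception:
        # Simplification: any failure to open or parse counts as corrupt.
        return "corrupt"


def sort_pdfs(directory, check=check_with_pypdf2):
    """Move protected and corrupt PDFs into subfolders; return what was moved."""
    moved = {"protected": [], "corrupt": []}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not (os.path.isfile(path) and name.lower().endswith(".pdf")):
            continue
        status = check(path)
        if status in moved:
            target = os.path.join(directory, status)
            os.makedirs(target, exist_ok=True)
            shutil.move(path, os.path.join(target, name))
            moved[status].append(name)
    return moved
```

The check parameter lets a different classifier be swapped in. For a very large directory, the same check function could also be run in parallel, since each file is examined independently.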

Want to compile all the images in a folder and its subfolders to pdf file with names

I am just a beginner in Python, so please help me learn and write the code for a small but complex problem. I've tried many things, but I am lost as to where to start and where to go:
Problem:
I have a folder and its subfolders with heaps of different product images (let's say in .jpeg and .png). I want to compile a single PDF of all these photos with the links/locations of these photos in the PDF. This list and the photos could be in the form of a table or a very simple format.
I am doing it because sometimes I forget the product name so I have to look at its image by going into each folder and sub-folder. This will give me an opportunity to look at all the photos in these folders and sub-folder without opening them one-by-one.
Your issue breaks down into 3 steps.
1- Search the directories for files (for which you can use the os module's walk()).
Here is a great tutorial:
https://www.pythoncentral.io/how-to-traverse-a-directory-tree-in-python-guide-to-os-walk/
2- Add the found files to a list of tuples, each holding the path of the image and its name.
3- Add these images to a single PDF file. You can use the Python module fpdf to do this. This has already been addressed here:
Create PDF from a list of images
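Steps 1 and 2 can be sketched with the standard library alone. The function name find_images and the extension list are assumptions for illustration; step 3 (writing the PDF with fpdf) is left out here because it depends on a third-party package.

```python
import os

# Extensions treated as images; adjust to match the real files.
IMAGE_EXTENSIONS = (".jpeg", ".jpg", ".png")


def find_images(root):
    """Walk root and its subfolders, returning sorted (path, name) tuples."""
    images = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            # Compare in lower case so .PNG and .JPEG are also found.
            if filename.lower().endswith(IMAGE_EXTENSIONS):
                images.append((os.path.join(dirpath, filename), filename))
    return sorted(images)
```

Each tuple holds the full location of an image and its file name, which is the information needed to label the photo on its page in the PDF.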

Using files I downloaded with python

So I want to download a bunch of clinical trial information from clinicaltrials.gov. They have a system that lets you download searches by using a custom URL. The url format is https://clinicaltrials.gov/ct2/results/download_fields?cond=&term=genentech&locn=pennsylvania&down_count=1000&down_fmt=xml
First of all, how do I download that file using Python? I'm assuming it's something like
file = requests.get('https://clinicaltrials.gov/ct2/results/download_fields?cond=&term=genentech&locn=pennsylvania&down_count=1000&down_fmt=xml')
Then can I also rename the file and put it in my working directory?
In the end I would like to process about three to four hundred downloads and parse the files for certain information. I think I can handle that part but getting all the files into my working directory is what I'm having trouble with now.
Any help would be greatly appreciated.
Thanks!

Python: Scour a folder of files for a specific string, then list all the files containing it

I have a directory of ~100 plaintext files that I wish to search for a pre-defined string via a Python script. All file names of the files in said directory that contain this string will then be output into a CSV text file. How would I approach this best, and what classes would be useful for this?
Have a look at these links.
Python Input Output
Python CSV
Common String Operations
Please refer to these two answers of mine; maybe they will help you.
multiple search and replace in python
regex search replace in batch
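As a starting point, here is a minimal sketch that uses only file I/O, the in operator for strings, and the csv module. The function names files_containing and write_csv are made up for illustration, and the files are assumed to be readable as UTF-8 text.

```python
import csv
import os


def files_containing(directory, needle):
    """Return the names of files in directory whose text contains needle."""
    matches = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8", errors="replace") as handle:
            # Checked line by line, so a string split across lines is not found.
            if any(needle in line for line in handle):
                matches.append(name)
    return matches


def write_csv(names, out_path):
    """Write the matching file names to a one-column CSV file."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename"])
        writer.writerows([name] for name in names)
```

Writing the CSV file outside the directory being searched keeps it from showing up as a match on a later run.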
