Split image/pdf based on specific text with Python

I want to split a PDF (or an image, if needed) based on the text in it, so that each question and its options ends up as a separate image, like a screenshot of just that question and its options.
Sample PDF link: https://drive.google.com/file/d/1UtMropzRdfJwQjaRf9kZa1UpAzrKlH-K/view?usp=sharing
Is it even possible? If yes, what code is needed to accomplish this? I am new to Python, so some explanation would help. I have almost 100 of these PDFs and want to automate extracting each individual question with its options.

Step 1: Install pdftotext and put the .exe in your working directory.
Step 2: Copy the code below into a .py file in the same directory.
Step 3: Keep the PDF files in that directory as well.
Step 4: Run the .py file.
Complete code that worked for me:
import glob
import os
import subprocess

# Remember to put pdftotext.exe in the folder with your PDF files.
files = []
for filename in glob.glob(os.path.join(os.getcwd(), '*.pdf')):
    txt_name = filename[:-4] + '.txt'
    files.append(txt_name)
    subprocess.call([os.path.join(os.getcwd(), 'pdftotext'), filename, txt_name])

all_files = []
for txt_name in files:
    with open(txt_name, 'r') as f:
        text = f.read()
    # Keep only the part between the instructions header and the footer.
    text = text.split('carry one mark each')[1].split('WWW.UNITOPERATION.COM')[0]
    ques = []
    counter = 1
    for line in text.splitlines():
        if line.startswith(str(counter) + '.'):
            # A line such as "1." starts the next question.
            ques.append(line)
            counter += 1
        elif ques:
            # Option or continuation line: attach it to the current question.
            ques[-1] += '\n' + line
    all_files.append(ques)
# all_files now holds one list of question strings per PDF.
# Take the elements out of all_files one by one and write each to a .txt file,
# and that completes the task.
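The last step described in the comments above (writing each question to its own .txt file) could be sketched as follows. This is only an illustration: the function name write_questions and the pdf<N>_q<M>.txt naming scheme are assumptions, and it assumes all_files is a list of lists of question strings.

```python
import os
import tempfile

def write_questions(all_files, out_dir):
    """Write every question to its own text file and return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for file_no, ques in enumerate(all_files, start=1):
        for q_no, question in enumerate(ques, start=1):
            # Hypothetical naming scheme: pdf<N>_q<M>.txt
            out_path = os.path.join(out_dir, f"pdf{file_no}_q{q_no}.txt")
            with open(out_path, "w", encoding="utf-8") as out:
                out.write(question)
            written.append(out_path)
    return written

if __name__ == "__main__":
    # Made-up sample data standing in for the real all_files list
    sample = [["1. First question\n(a) x (b) y", "2. Second question"]]
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_questions(sample, tmp):
            print(os.path.basename(path))
```

Note that this produces text files, not cropped images of each question.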

Related

Duplicate in list created from filenames (python)

I'm trying to create a list of Excel files saved in a specific directory, but the generated list contains a duplicate entry for one of the file names (I am absolutely certain there is not actually a duplicate of the file).
import glob
# get data file names
path = r'D:\larvalSchooling\data'
filenames = glob.glob(path + "/*.xlsx")
Output:
>>> filenames
['D:\\larvalSchooling\\data\\copy.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Sat_70dpf_GroupA_n5_20200808_1015-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Sat_84dpf_GroupABCD_n5_20200822_1440-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx']
You'll note that 'D:\larvalSchooling\data\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx' appears to be listed twice.
Rather than removing duplicates after the fact, I'd like to understand why it happens in the first place.
I'm using Python 3.7 on Windows 10 Pro.

If you wrote the code to remove duplicates (which can be as simple as filenames = set(filenames)), you'd see that you still have two file names. Print them one above the other to make a visual comparison easier:
'D:\\larvalSchooling\\data\\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx',
'D:\\larvalSchooling\\data\\~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx'
The second one has a leading ~$ (probably an auto-backup).

Whenever you open an Excel file, Excel creates a hidden temporary file next to it, named after the original with a ~$ prefix. In this case:
Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx
~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx
This means the file is currently open in some program, and glob is picking up that temporary file (it is usually hidden in Explorer as well).
Find the program that has the file open and close it. If you want to avoid the problem entirely, also add a check so that files whose names start with "~$" are ignored.
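A minimal sketch of such a check, keeping the glob call from the question and dropping any name that starts with "~$". The function name list_workbooks is made up for illustration.

```python
import glob
import os

def list_workbooks(folder):
    """Return .xlsx paths in folder, skipping Excel's '~$' temporary files."""
    paths = glob.glob(os.path.join(folder, "*.xlsx"))
    # Compare against the base name, because glob returns full paths
    return sorted(p for p in paths if not os.path.basename(p).startswith("~$"))
```

It would be called as filenames = list_workbooks(r'D:\larvalSchooling\data').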

You can use os.path.splitext to get the file extension and loop through the directory using os.listdir. Files that are currently open in Excel can be skipped with the following code:
import os

filenames = []
for file in os.listdir(r'D:\larvalSchooling\data'):
    filename, file_extension = os.path.splitext(file)
    # keep .xlsx files, but skip Excel's '~$' temporary files
    if file_extension == '.xlsx' and not file.startswith('~$'):
        filenames.append(file)
Note: this might not be the best solution, but it'll get the job done :)

Copying/pasting specific files in batch from one folder to another

I'm really new to Python scripting, but I'm pretty sure there is a way to copy files from one folder (given by a path in a .txt file) to another.
The .txt file would contain the direct path to the folder that contains the photo files.
I'm working with huge numbers of photos that contain GPS metadata, so I must not lose it.
I'd really appreciate any help, thanks.

Here is a short and simple solution:
import os
import shutil

# replace my_dir with your copy dir (it must already exist)
copy_dir = "my_dir"
# replace file_list.txt with your list of files, one path per line
with open("file_list.txt", "r") as file_list:
    for f in file_list.read().splitlines():
        print(f"copying file: {f}")
        shutil.copyfile(f, os.path.join(copy_dir, os.path.split(f)[1]))
print("done")
It loops over all the files in the file list and copies them. The copy is byte for byte, so GPS metadata embedded in the photos is kept. It should be fast enough.
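If the .txt file holds the path of a folder rather than a list of files, as the question suggests, a variant could look like the sketch below. It assumes the folder path is on the first line of the text file. The function name copy_listed_folder is hypothetical, and shutil.copy2 is used here instead of copyfile so that file timestamps are copied too where the operating system allows it.

```python
import os
import shutil

def copy_listed_folder(list_file, copy_dir):
    """Copy every file from the folder named on the first line of list_file."""
    with open(list_file, "r", encoding="utf-8") as fh:
        src_dir = fh.readline().strip()
    os.makedirs(copy_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):  # skip subfolders
            # copy2 copies the file contents plus timestamps where possible
            shutil.copy2(src, os.path.join(copy_dir, name))
            copied.append(name)
    return copied
```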

How can I read files with similar names on python, rename them and then work with them?

I've already posted this question here, but sadly I couldn't come up with a solution (some of you gave me awesome answers, but most of them weren't what I was looking for), so I'll try again, this time with more information about what I'm trying to do.
I'm using a program called GMAT to get some outputs (.txt files with numerical values). These outputs have different names, but because I'm using them for more than one thing I end up with something like this:
GMATd_1.txt
GMATd_2.txt
GMATf_1.txt
GMATf_2.txt
Now I need to use these outputs as inputs in my code. I need to work with them in other functions of my script, and since I will have a lot of these .txt files, I want to rename them because I don't want to refer to them as './path/etc'.
So I want to write a loop that gets these files and renames them inside the script, so that I can use the files under their new names in other functions (outside the loop).
So instead of having to do this individually:
GMATds1 = './path/GMATd_1.txt'
GMATds2 = './path/GMATd_2.txt'
I want a loop that does it for me.
I've already tried using a dictionary:
import os
import fnmatch
examples = {}
for filename in os.listdir('.'):
    if fnmatch.fnmatch(filename, 'thing*.txt'):
        examples[filename[:6]] = filename
This does work, but I can't use the dictionary key outside the loop.

If I understand correctly, you are trying to fetch files with similar names (or at least a recurring pattern) and rename them. This can be accomplished with the following code:
import glob
import os
all_files = glob.glob('path/to/directory/with/files/GMAT*.txt')
for file in all_files:
    # create_new_path is a function you write yourself: it could split the
    # file name and change the directory and/or the file name
    new_path = create_new_path(file)
    os.rename(file, new_path)
The glob library searches for files using * wildcards, which makes it possible to find files whose names follow a specific pattern. It lists all matching files in a directory (or in multiple directories if you include a * wildcard in the directory part). When you iterate over the files, you can either work directly with their contents (as you apparently intend to do) or rename them as shown in this snippet. To rename them, you need to generate a new path, so you have to write the create_new_path function, which takes the old path and returns a new one.
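If the goal is only to refer to the files by short names inside the script, without renaming anything on disk, a dictionary built once and passed to other functions may be enough. The sketch below is one possible approach, not the method from the answer above: the function name collect_outputs is made up, and the key scheme (replacing "_" with "s", so GMATd_1.txt becomes GMATds1) is an assumption based on the names in the question.

```python
from pathlib import Path

def collect_outputs(folder, pattern="GMAT*.txt"):
    """Map a short name to the path of each matching output file."""
    outputs = {}
    for path in sorted(Path(folder).glob(pattern)):
        # Assumed key scheme: 'GMATd_1.txt' -> 'GMATds1'
        key = path.stem.replace("_", "s")
        outputs[key] = path
    return outputs
```

The returned dictionary stays available after the loop, so outputs["GMATds1"] can be used anywhere in the script that has access to it.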

Since Python 3.4 you can use the built-in pathlib module instead of os or glob:
from pathlib import Path

for file_src in list(Path("path/to/files").glob("GMAT*.txt")):
    # rename e.g. GMATd_1.txt to GMATds1.txt, changing only the file name
    file_dest = file_src.with_name(file_src.name.replace("d_", "ds"))
    file_src.rename(file_dest)

You can use:
import os
path = '.....'   # path where these files are located
path1 = '.....'  # path where you want to store the renamed files
i = 1
for file in os.listdir(path):
    if file.endswith('.txt'):
        os.rename(os.path.join(path, file), os.path.join(path1, str(i) + '.txt'))
        i += 1
This renames every .txt file in the source folder to 1.txt, 2.txt, 3.txt, ..., n.txt in the destination folder.

Python Pillow Library opening and editing images ending with specific names

Currently I am using Python Pillow Library to edit images. Since I am dealing with large data-sets and need to edit some images with only specific name endings (say only image names that end with cropped or images of specific file type like png or bmp), is there a way to write code in such a way that allows me to open and edit these images? If so please give me hints or suggestions! Thanks!
Also Pillow version is 5.0.0 and Python version is 3.6.

If your question is whether there is a way to write code that edits only image files with a specific file type or a specific name ending, then the answer is yes. You can do it with Python.
Sample code:
import os
from PIL import Image  # Pillow
directory = os.fsencode("images_folder")
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith(".png") or filename.endswith(".bmp") or "cropped" in filename:
        # Do the editing using Pillow, for example:
        # img = Image.open(os.path.join("images_folder", filename))
        continue

Certainly you can do this in Python, but the specific way of doing it depends on the specifics of the problem. Are all your images stored in one directory or many? Will you be running the script from the same directory as the images or from some other directory? And so on.
To get you started, take a look at the documentation for the os module.
This module has a listdir function that returns a list of all entries inside a directory. You can iterate through that list and find all the file names that end with a specific set of characters by using the built-in endswith method on strings. For example:
import os
# path is the directory that holds your images
fileslist = [f for f in os.listdir(path) if f.endswith('.jpg')]
Now that you have a list of all the files in a directory that end with certain characters, you can use Pillow to open the images from that list.
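Combining the file type check with the name ending from the question could look like this sketch. The function name select_images and its parameters are hypothetical, and it requires both conditions at once (allowed extension and a name ending in "cropped"); swap the "and" for "or" if either one should be enough.

```python
from pathlib import Path

def select_images(folder, extensions=(".png", ".bmp"), stem_suffix="cropped"):
    """Return files with an allowed extension whose name ends with stem_suffix."""
    selected = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        # path.stem is the file name without its extension
        if path.suffix.lower() in extensions and path.stem.endswith(stem_suffix):
            selected.append(path)
    return selected

# With Pillow installed, each selected path could then be opened with
# PIL.Image.open(path) and edited.
```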

Convert all pdf in a folder to text files and store them in different folders using python

I'm trying to convert all the PDFs stored in one folder (say, 60 PDFs) into text documents and store them in different folders. Each folder should have a unique name.
I tried this code. The folders were created, but the pdftotext conversion command doesn't work in the loop:
import os
def listfiles(path):
    for root, dirs, files in os.walk(path):
        for f in files:
            print(f)
            newpath = r'/home/user/files/'
            p=f.replace("pdf","")
            newpath=newpath+p
            if not os.path.exists(newpath): os.makedirs(newpath)
            os.system("pdftotext f f.txt")
f=listfiles("/home/user/reports")

One problem here is the os.system("pdftotext f f.txt") call. I assume you want the f's here replaced with the current file in the loop. If that is the case, you need to change this to os.system("pdftotext {0} {0}.txt".format(f)).
Another issue may be that the working directory is not set, so the call to system looks for the file in the wrong place. Try using os.chdir every time you change folders.
To place the text file in a different folder, try:
os.system("pdftotext {0} {1}/{0}.txt".format(f, newpath))
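As an alternative sketch that avoids depending on the working directory, the full paths can be built with os.path.join and passed to pdftotext as an argument list through subprocess.run. The function names plan_conversions and convert_all are made up for illustration, and the code assumes the pdftotext command is available on the PATH.

```python
import os
import subprocess

def plan_conversions(src_dir, out_root):
    """Yield (pdf_path, txt_path) pairs, with one output folder per PDF."""
    for root, _dirs, files in os.walk(src_dir):
        for name in sorted(files):
            stem, ext = os.path.splitext(name)
            if ext.lower() == ".pdf":
                out_dir = os.path.join(out_root, stem)
                yield os.path.join(root, name), os.path.join(out_dir, stem + ".txt")

def convert_all(src_dir, out_root):
    """Run pdftotext on every planned pair (needs pdftotext on the PATH)."""
    for pdf_path, txt_path in plan_conversions(src_dir, out_root):
        os.makedirs(os.path.dirname(txt_path), exist_ok=True)
        # A list of arguments avoids shell quoting problems with spaces in names
        subprocess.run(["pdftotext", pdf_path, txt_path], check=True)
```

With the paths from the question, the call would be convert_all("/home/user/reports", "/home/user/files").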

I don't know Python, but I think I can clearly see a mistake there. It looks like you are just replacing the ".pdf" with a ".txt". Since a PDF isn't just plain text, this won't work.
For the conversion, look at the top answer of this post:
Python module for converting PDF to text
