Storing text into a string buffer from multiple files using Python

I want to extract text from multiple text files; the idea is that I have a folder and all the text files are in that folder.
I have tried it and I successfully get the text, but when I use that string buffer somewhere else, only the first file's text is visible to me.
I want to store all of these texts in a single string buffer.
What I have done:
import glob
import io

Raw_txt = " "
files = [file for file in glob.glob(r'C:\\Users\\Hp\\Desktop\\RAW\\*.txt')]
for file_name in files:
    with io.open(file_name, 'r') as image_file:
        content1 = image_file.read()
        Raw_txt = content1
        print(Raw_txt)
This Raw_txt buffer only works inside the loop, but I want to use it somewhere else.
Thanks!

I think the issue is in how you store the content of your text files: Raw_txt is overwritten with each file.
I would recommend doing something like this, where the text is appended instead:
import glob

Raw_txt = ""
files = [file for file in glob.glob(r'C:\\Users\\Hp\\Desktop\\RAW\\*.txt')]
for file_name in files:
    with open(file_name, "r+") as file:
        Raw_txt += file.read() + "\n"  # newline added in case you want to separate the contents of each file
print(Raw_txt)
Also, you don't need the io module to read a text file; the built-in open is enough.
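If you need Raw_txt outside the loop, another option is to wrap the concatenation in a small helper function and call it wherever the combined text is needed. This is just a sketch (the function name is mine; the folder path is the one from the question):

import glob
import os

def read_all_text(folder):
    """Return the contents of every .txt file in `folder`, joined into one string."""
    parts = []
    for file_name in glob.glob(os.path.join(folder, '*.txt')):
        with open(file_name, 'r') as f:
            parts.append(f.read())
    return "\n".join(parts)

# The returned string can be used anywhere, not just inside the loop.
Raw_txt = read_all_text(r'C:\Users\Hp\Desktop\RAW')
print(Raw_txt)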

Related

Getting "Xref table not zero-indexed. ID numbers for objects will be corrected" warning

I have the following code (comments explain what is occurring):
import os
from io import StringIO
from PyPDF2 import PdfFileReader

# Path to the directory containing the PDF files
pdf_dir = '/path/to/pdf/files'

# Iterate over the files in the directory
for filename in os.listdir(pdf_dir):
    # Check if the file is a PDF file
    if filename.endswith('.pdf'):
        # Construct the full path to the file
        filepath = os.path.join(pdf_dir, filename)
        # Open the PDF file and read its contents
        with open(filepath, 'rb') as f:
            pdf = PdfFileReader(f)
            # Extract the text from the PDF file
            text = ''
            for page in pdf.pages:
                text += page.extractText()
        # Construct the name of the output text file
        txt_filename = filename[:-4] + '.txt'
        # Write the text to the output file
        with open(txt_filename, 'w') as f:
            f.write(text)
When I run the code, it produces an "Xref table not zero-indexed. ID numbers for objects will be corrected" warning. It is not a hard error, but it makes me wonder if there's a different way I should be doing this.
Thanks for any suggestions.
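The warning itself is informational: the reader is repairing a malformed cross-reference table in the PDF. As far as I know, the PyPDF2 1.x releases emit this message through Python's warnings module, so one option is simply to filter it around the read. This is a sketch under that assumption, reusing filepath from the loop above:

import warnings
from PyPDF2 import PdfFileReader

with open(filepath, 'rb') as f:
    with warnings.catch_warnings():
        # Assumption: the message is raised via warnings.warn(); if your PyPDF2
        # version writes it straight to stderr instead, this filter has no effect.
        warnings.simplefilter('ignore')
        pdf = PdfFileReader(f)
        text = ''
        for page in pdf.pages:
            text += page.extractText()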

How to print the content of the last saved text file in a folder using Python

I want to print the content of the last saved text file in a folder using Python. I wrote the code below, but it prints only the path of the file, not its content.
folder_path = r'C:\Users\Siciid\Desktop\restaurant\bill'
file_type = r'\*txt'
files = glob.glob(folder_path + file_type)
max_file = max(files, key=os.path.getctime)
filename=tempfile.mktemp('.txt')
open(filename,'w').write(max_file)
os.startfile(filename,"print")
Is it possible to do this in Python? Any suggestions? I would appreciate your help. Thank you.
You can do that using the following code. Just replace the line where you open and write a file with these two lines:
with open(max_file, "r") as f, open(filename, 'w') as f2:
    f2.write(f.read())
The max_file variable contains a file name, not the contents of the file, so writing it to the temp file and printing that will simply print the file name instead of its contents. To put its contents into the temporary file, you need to open the file and then read it. That is what the above two lines of code do.
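For completeness, here is the whole flow with the imports the snippets assume (glob, os, tempfile) pulled together, using the folder path from the question:

import glob
import os
import tempfile

folder_path = r'C:\Users\Siciid\Desktop\restaurant\bill'
files = glob.glob(os.path.join(folder_path, '*.txt'))
max_file = max(files, key=os.path.getctime)   # newest .txt by creation time

filename = tempfile.mktemp('.txt')
with open(max_file, 'r') as f, open(filename, 'w') as f2:
    f2.write(f.read())                        # copy the newest file's contents

os.startfile(filename, 'print')               # send the copy to the default printer (Windows only)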

Extract zip to memory, parse contents

I want to read the contents of a zip file into memory rather than extracting them to disc, find a particular file in the archive, open the file and extract a line from it.
Can a StringIO instance be opened and parsed? Suggestions? Thanks in advance.
zfile = ZipFile('name.zip', 'r')
for name in zfile.namelist():
    if fnmatch.fnmatch(name, '*_readme.xml'):
        name = StringIO.StringIO()
        print name       # prints StringIO instances
        open(name, 'r')  # IOError: No such file or directory...
I found a few similar posts, but none that seem to address this issue: Extracting a zipfile to memory?
IMO just using read is enough:
zfile = ZipFile('name.zip', 'r')
files = []
for name in zfile.namelist():
    if fnmatch.fnmatch(name, '*_readme.xml'):
        files.append(zfile.read(name))
This will make a list with the contents of the files that match the pattern.
You can then parse the contents afterwards by iterating through the list:
for file in files:
    print(file[0:min(35, len(file))].decode())  # "parsing"
Or, better, use a function:
import zipfile as zip
import os
import fnmatch

zip_name = os.sys.argv[1]
zfile = zip.ZipFile(zip_name, 'r')

def parse(contents, member_name=""):
    if len(member_name) > 0:
        print("Parsed `{}`:".format(member_name))
    print(contents[0:min(35, len(contents))].decode())  # "parsing"

for name in zfile.namelist():
    if fnmatch.fnmatch(name, '*.cpp'):
        parse(zfile.read(name), name)
This way no data is kept in memory for no reason, and the memory footprint is smaller. That might be important if the files are big.
Don't overthink it. It Just Works:
import zipfile

# 1) I want to read the contents of a zip file ...
with zipfile.ZipFile('A-Zip-File.zip') as zipper:
    # 2) ... find a particular file in the archive, open the file ...
    with zipper.open('A-Particular-File.txt') as fp:
        # 3) ... and extract a line from it.
        first_line = fp.readline()
print first_line
The question you linked to shows that you need to read the file. Depending on your use case, that may already be enough. In your code you replace the loop variable holding a filename with an empty string buffer. Try something like this:
zfile = ZipFile('name.zip', 'r')
for name in zfile.namelist():
    if fnmatch.fnmatch(name, '*_readme.xml'):
        ex_file = zfile.open(name)  # this is a file-like object
        content = ex_file.read()    # now the file contents are a single string
If you really want a buffer that you can manipulate, then simply instantiate it with the contents:
buf = StringIO(zfile.open(name).read())
You may also want to look at BytesIO and note that there are differences between Python 2 and 3.
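In Python 3 specifically, ZipFile.open returns a binary file-like object, so wrapping it in io.TextIOWrapper (or loading it into io.BytesIO) plays the role that StringIO plays above. A minimal sketch, assuming the archive member is UTF-8 text:

import fnmatch
import io
from zipfile import ZipFile

with ZipFile('name.zip', 'r') as zfile:
    for name in zfile.namelist():
        if fnmatch.fnmatch(name, '*_readme.xml'):
            with zfile.open(name) as raw:             # binary file-like object
                text = io.TextIOWrapper(raw, encoding='utf-8')
                first_line = text.readline()          # str rather than bytes
                print(first_line)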
Thank you to everyone that contributed solutions. This is what ended up working for me:
zfile = ZipFile('name.zip', 'r')
for name in zfile.namelist():
    if fnmatch.fnmatch(name, '*_readme.xml'):
        zopen = zfile.open(name)
        for line in zopen:
            if re.match('(.*)<foo>(.*)</foo>(.*)', line):
                print line

Python - How to convert many separate PDFs to text?

Question: How can I read in many PDFs in the same path using Python package "slate"?
I have a folder with over 600 PDFs.
I know how to use the slate package to convert single PDFs to text, using this code:
migFiles = [filename for filename in os.listdir(path)
            if re.search(r'(.*\.pdf$)', filename) != None]
with open(migFiles[0]) as f:
    doc = slate.PDF(f)
len(doc)
However, this limits you to one PDF at a time, specified by "migFiles[0]", 0 being the first PDF in my path.
How can I read in many PDFs to text at once, retaining them as separate strings or txt files? Should I use another package? How could I create a "for loop" to read in all the PDFs in the path?
Try this version:
import glob
import os
import slate

for pdf_file in glob.glob("{}/{}".format(path, "*.pdf")):
    with open(pdf_file) as pdf:
        txt_file = "{}.txt".format(os.path.splitext(pdf_file)[0])
        with open(txt_file, 'w') as txt:
            txt.write("\n".join(slate.PDF(pdf)))  # slate.PDF returns a list of page strings
This will create a text file with the same name as the PDF, in the same directory as the PDF file, containing the converted contents.
Or, if you want to keep the contents in memory, try this version; but keep in mind that if the converted content is large, you may exhaust your available memory:
import glob
import os
import slate

pdf_as_text = {}
for pdf_file in glob.glob("{}/{}".format(path, "*.pdf")):
    with open(pdf_file) as pdf:
        file_without_extension = os.path.splitext(pdf_file)[0]
        pdf_as_text[file_without_extension] = slate.PDF(pdf)
Now you can use pdf_as_text['somefile'] to get the text contents.
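For example, you could then dump everything back out to .txt files in one pass. A small sketch building on the dictionary above (slate.PDF gives a list of page strings, so the pages are joined before writing):

# Write each collected document back out as plain text next to the original PDF.
for file_without_extension, pages in pdf_as_text.items():
    with open("{}.txt".format(file_without_extension), 'w') as txt:
        txt.write("\n".join(pages))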
What you can do is use a simple loop:
docs = []
for filename in migFiles:
    with open(filename) as f:
        docs.append(slate.PDF(f))
        # or, instead of saving the text to memory, just process it here
Then, docs[i] will hold the text of the (i+1)-th pdf file, and you can do whatever you want with the file whenever you want. Alternatively, you can process the file inside the for loop.
If you want to convert to text, you can do:
docs = []
separator = ' '  # the character used to separate the contents of consecutive pages;
                 # if you want each page's contents separated by a newline, use separator = '\n'
for filename in migFiles:
    with open(filename) as f:
        docs.append(separator.join(slate.PDF(f)))  # turn the pages into plain text
or
separator = ' '
for filename in migFiles:
    with open(filename) as f:
        # if filename == "abc.pdf", then filename[:-4] == "abc"
        txtfile = open(filename[:-4] + ".txt", 'w')
        txtfile.write(separator.join(slate.PDF(f)))
        txtfile.close()

Python blank txt file creation

I am trying to create text files in bulk based on a list. A text file contains a number of lines/titles, and the aim is to create one text file per title. Below is what my titles.txt looks like, along with my non-working code and the expected output.
titles = open("C:\\Dropbox\\Python\\titles.txt", 'r')
for lines in titles.readlines():
    d_path = 'C:\\titles'
    output = open((d_path.lines.strip()) + '.txt', 'a')
    output.close()
titles.close()
titles.txt
Title-A
Title-B
Title-C
new blank files to be created under directory c:\\titles\\
Title-A.txt
Title-B.txt
Title-C.txt
It's a little difficult to tell what you're attempting here, but hopefully this will be helpful:
import os.path

with open('titles.txt') as f:
    for line in f:
        newfile = os.path.join('C:\\titles', line.strip()) + '.txt'
        ff = open(newfile, 'a')
        ff.close()
If you want to replace existing files with blank files, you can open your files with mode 'w' instead of 'a'.
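On Python 3 you could also express the same idea with pathlib; this is an alternative sketch, not a change to the code above (Path.touch() creates the file if it does not already exist):

from pathlib import Path

d_path = Path(r'C:\titles')
with open('titles.txt') as f:
    for line in f:
        title = line.strip()
        if title:                                # skip blank lines
            (d_path / (title + '.txt')).touch()  # create an empty file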
The following should work.
import os

titles = 'C:/Dropbox/Python/titles.txt'
d_path = 'c:/titles'
with open(titles, 'r') as f:
    for l in f:
        with open(os.path.join(d_path, l.strip() + '.txt'), 'w') as _:
            pass
