Find characters in a text - python

I want to create a python script which finds an expression in a text.
If it succeeds, then print 'expression found'.
My text file is named "file_id_ascii"; it comes from a link named "clinical_file_link".
import csv
import sys
import io
file_id_ascii =
io.open(clinical_file_link, 'r',
encoding='us-ascii', errors='ignore')
acc = 'http://tcga.nci/bcr/xml/clinical/acc/'
if acc in file_id_ascii:
print('expression found')
That code doesn't work...what's wrong?

if 'http://tcga.nci/bcr/xml/clinical/acc/' \
in open('clinical_file_link.txt').read():
print "true"
... works if your file has small content

It doesn't work because you are not parsing the XML.
Check this thread for more info about parsing XML.
If it is indeed a text file you might edit it like this:
if acc in open(clinical_file_link.txt).read():
print('expression FOUND')
EDIT: This solution wont work if files are big. Try this one:
with open('clinical_file_link.txt') as f:
found = False
for line in f:
if acc in line:
print('found')
found = True
if not found:
print('The string cannot be found!')

Related

how to do docx to pdf conversion using python library without subprocess in linux? [duplicate]

I'am tasked with converting tons of .doc files to .pdf. And the only way my supervisor wants me to do this is through MSWord 2010. I know I should be able to automate this with python COM automation. Only problem is I dont know how and where to start. I tried searching for some tutorials but was not able to find any (May be I might have, but I don't know what I'm looking for).
Right now I'm reading through this. Dont know how useful this is going to be.
A simple example using comtypes, converting a single file, input and output filenames given as commandline arguments:
import sys
import os
import comtypes.client
wdFormatPDF = 17
in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])
word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
You could also use pywin32, which would be the same except for:
import win32com.client
and then:
word = win32com.client.Dispatch('Word.Application')
You can use the docx2pdf python package to bulk convert docx to pdf. It can be used as both a CLI and a python library. It requires Microsoft Office to be installed and uses COM on Windows and AppleScript (JXA) on macOS.
from docx2pdf import convert
convert("input.docx")
convert("input.docx", "output.pdf")
convert("my_docx_folder/")
pip install docx2pdf
docx2pdf input.docx output.pdf
Disclaimer: I wrote the docx2pdf package. https://github.com/AlJohri/docx2pdf
I have tested many solutions but no one of them works efficiently on Linux distribution.
I recommend this solution :
import sys
import subprocess
import re
def convert_to(folder, source, timeout=None):
args = [libreoffice_exec(), '--headless', '--convert-to', 'pdf', '--outdir', folder, source]
process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
filename = re.search('-> (.*?) using filter', process.stdout.decode())
return filename.group(1)
def libreoffice_exec():
# TODO: Provide support for more platforms
if sys.platform == 'darwin':
return '/Applications/LibreOffice.app/Contents/MacOS/soffice'
return 'libreoffice'
and you call your function:
result = convert_to('TEMP Directory', 'Your File', timeout=15)
All resources:
https://michalzalecki.com/converting-docx-to-pdf-using-python/
I have worked on this problem for half a day, so I think I should share some of my experience on this matter. Steven's answer is right, but it will fail on my computer. There are two key points to fix it here:
(1). The first time when I created the 'Word.Application' object, I should make it (the word app) visible before open any documents. (Actually, even I myself cannot explain why this works. If I do not do this on my computer, the program will crash when I try to open a document in the invisible model, then the 'Word.Application' object will be deleted by OS. )
(2). After doing (1), the program will work well sometimes but may fail often. The crash error "COMError: (-2147418111, 'Call was rejected by callee.', (None, None, None, 0, None))" means that the COM Server may not be able to response so quickly. So I add a delay before I tried to open a document.
After doing these two steps, the program will work perfectly with no failure anymore. The demo code is as below. If you have encountered the same problems, try to follow these two steps. Hope it helps.
import os
import comtypes.client
import time
wdFormatPDF = 17
# absolute path is needed
# be careful about the slash '\', use '\\' or '/' or raw string r"..."
in_file=r'absolute path of input docx file 1'
out_file=r'absolute path of output pdf file 1'
in_file2=r'absolute path of input docx file 2'
out_file2=r'absolute path of outputpdf file 2'
# print out filenames
print in_file
print out_file
print in_file2
print out_file2
# create COM object
word = comtypes.client.CreateObject('Word.Application')
# key point 1: make word visible before open a new document
word.Visible = True
# key point 2: wait for the COM Server to prepare well.
time.sleep(3)
# convert docx file 1 to pdf file 1
doc=word.Documents.Open(in_file) # open docx file 1
doc.SaveAs(out_file, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 1
word.Visible = False
# convert docx file 2 to pdf file 2
doc = word.Documents.Open(in_file2) # open docx file 2
doc.SaveAs(out_file2, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 2
word.Quit() # close Word Application
unoconv (writen in Python) and OpenOffice running as a headless daemon.
https://github.com/unoconv/unoconv
http://dag.wiee.rs/home-made/unoconv/
Works very nicely for doc, docx, ppt, pptx, xls, xlsx.
Very useful if you need to convert docs or save/convert to certain formats on a server.
As an alternative to the SaveAs function, you could also use ExportAsFixedFormat which gives you access to the PDF options dialog you would normally see in Word. With this you can specify bookmarks and other document properties.
doc.ExportAsFixedFormat(OutputFileName=pdf_file,
ExportFormat=17, #17 = PDF output, 18=XPS output
OpenAfterExport=False,
OptimizeFor=0, #0=Print (higher res), 1=Screen (lower res)
CreateBookmarks=1, #0=No bookmarks, 1=Heading bookmarks only, 2=bookmarks match word bookmarks
DocStructureTags=True
);
The full list of function arguments is: 'OutputFileName', 'ExportFormat', 'OpenAfterExport', 'OptimizeFor', 'Range', 'From', 'To', 'Item', 'IncludeDocProps', 'KeepIRM', 'CreateBookmarks', 'DocStructureTags', 'BitmapMissingFonts', 'UseISO19005_1', 'FixedFormatExtClassPtr'
It's worth noting that Stevens answer works, but make sure if using a for loop to export multiple files to place the ClientObject or Dispatch statements before the loop - it only needs to be created once - see my problem: Python win32com.client.Dispatch looping through Word documents and export to PDF; fails when next loop occurs
If you don't mind using PowerShell have a look at this Hey, Scripting Guy! article. The code presented could be adopted to use the wdFormatPDF enumeration value of WdSaveFormat (see here).
This blog article presents a different implementation of the same idea.
I have modified it for ppt support as well. My solution support all the below-specified extensions.
word_extensions = [".doc", ".odt", ".rtf", ".docx", ".dotm", ".docm"]
ppt_extensions = [".ppt", ".pptx"]
My Solution: Github Link
I have modified code from Docx2PDF
I tried the accepted answer but wasn't particularly keen on the bloated PDFs Word was producing which was usually an order of magnitude bigger than expected. After looking how to disable the dialogs when using a virtual PDF printer I came across Bullzip PDF Printer and I've been rather impressed with its features. It's now replaced the other virtual printers I used previously. You'll find a "free community edition" on their download page.
The COM API can be found here and a list of the usable settings can be found here. The settings are written to a "runonce" file which is used for one print job only and then removed automatically. When printing multiple PDFs we need to make sure one print job completes before starting another to ensure the settings are used correctly for each file.
import os, re, time, datetime, win32com.client
def print_to_Bullzip(file):
util = win32com.client.Dispatch("Bullzip.PDFUtil")
settings = win32com.client.Dispatch("Bullzip.PDFSettings")
settings.PrinterName = util.DefaultPrinterName # make sure we're controlling the right PDF printer
outputFile = re.sub("\.[^.]+$", ".pdf", file)
statusFile = re.sub("\.[^.]+$", ".status", file)
settings.SetValue("Output", outputFile)
settings.SetValue("ConfirmOverwrite", "no")
settings.SetValue("ShowSaveAS", "never")
settings.SetValue("ShowSettings", "never")
settings.SetValue("ShowPDF", "no")
settings.SetValue("ShowProgress", "no")
settings.SetValue("ShowProgressFinished", "no") # disable balloon tip
settings.SetValue("StatusFile", statusFile) # created after print job
settings.WriteSettings(True) # write settings to the runonce.ini
util.PrintFile(file, util.DefaultPrinterName) # send to Bullzip virtual printer
# wait until print job completes before continuing
# otherwise settings for the next job may not be used
timestamp = datetime.datetime.now()
while( (datetime.datetime.now() - timestamp).seconds < 10):
if os.path.exists(statusFile) and os.path.isfile(statusFile):
error = util.ReadIniString(statusFile, "Status", "Errors", '')
if error != "0":
raise IOError("PDF was created with errors")
os.remove(statusFile)
return
time.sleep(0.1)
raise IOError("PDF creation timed out")
I was working with this solution but I needed to search all .docx, .dotm, .docm, .odt, .doc or .rtf and then turn them all to .pdf (python 3.7.5). Hope it works...
import os
import win32com.client
wdFormatPDF = 17
for root, dirs, files in os.walk(r'your directory here'):
for f in files:
if f.endswith(".doc") or f.endswith(".odt") or f.endswith(".rtf"):
try:
print(f)
in_file=os.path.join(root,f)
word = win32com.client.Dispatch('Word.Application')
word.Visible = False
doc = word.Documents.Open(in_file)
doc.SaveAs(os.path.join(root,f[:-4]), FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
word.Visible = True
print ('done')
os.remove(os.path.join(root,f))
pass
except:
print('could not open')
# os.remove(os.path.join(root,f))
elif f.endswith(".docx") or f.endswith(".dotm") or f.endswith(".docm"):
try:
print(f)
in_file=os.path.join(root,f)
word = win32com.client.Dispatch('Word.Application')
word.Visible = False
doc = word.Documents.Open(in_file)
doc.SaveAs(os.path.join(root,f[:-5]), FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
word.Visible = True
print ('done')
os.remove(os.path.join(root,f))
pass
except:
print('could not open')
# os.remove(os.path.join(root,f))
else:
pass
The try and except was for those documents I couldn't read and won't exit the code until the last document.
You should start from investigating so called virtual PDF print drivers.
As soon as you will find one you should be able to write batch file that prints your DOC files into PDF files. You probably can do this in Python too (setup printer driver output and issue document/print command in MSWord, later can be done using command line AFAIR).
import docx2txt
from win32com import client
import os
files_from_folder = r"c:\\doc"
directory = os.fsencode(files_from_folder)
amount = 1
word = client.DispatchEx("Word.Application")
word.Visible = True
for file in os.listdir(directory):
filename = os.fsdecode(file)
print(filename)
if filename.endswith('docx'):
text = docx2txt.process(os.path.join(files_from_folder, filename))
print(f'{filename} transfered ({amount})')
amount += 1
new_filename = filename.split('.')[0] + '.txt'
try:
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
except:
os.mkdir(files_from_folder + r'\txt_files')
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
elif filename.endswith('doc'):
doc = word.Documents.Open(os.path.join(files_from_folder, filename))
text = doc.Range().Text
doc.Close()
print(f'{filename} transfered ({amount})')
amount += 1
new_filename = filename.split('.')[0] + '.txt'
try:
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
except:
os.mkdir(files_from_folder + r'\txt_files')
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
word.Quit()
The Source Code, see here:
https://neculaifantanaru.com/en/python-full-code-how-to-convert-doc-and-docx-files-to-pdf-from-the-folder.html
I would suggest ignoring your supervisor and use OpenOffice which has a Python api. OpenOffice has built in support for Python and someone created a library specific for this purpose (PyODConverter).
If he isn't happy with the output, tell him it could take you weeks to do it with word.

Writing a Python 3.9 script to parse a text file and act based on first char of line

I'm new to python and am struggling to figure out how to load a line from a file into a string so I can parse it and act on it. The input text file is a script for a video that will create Amazon Polly mp3 files, based on the text in the file.
The file uses the first character as either an 'A:'(action, ignore),'F:'(filename),'T:'(text for Polly),or '#' (comment, ignore) then followed by string text, For example:
#This example says that the author should move the cursor, then the corresponding Audio
#created from Polly ("Move the cursor to the specified location") is dumped into Move_Cursor.mp3
A:Move the cursor
T:Move the cursor to the specified location
F:Move_Cursor.mp3
#end of example
I've been trying to use the read or readline, but I can't figure out how to take the value returned by those to create a string so I can parse it.
script_file = open(input("Enter the script filename: "),'r')
lines = script_file.readlines()
cnt = 0
for line in lines:
eol = len(line)
print ("line length is: ",eol)
action = line.read #<<THIS IS WHERE I'M HAVING A TYPE ISSUE
print ("action is ",action)
print ("string is ",line)
script_file.close()
Thank you!!!!
Here is a possible solution:
filename = input('Enter the script filename: ')
actions = {'#': 'comment', 'A': 'action', 'T': 'text for polly', 'F': 'filename'}
with open(filename, 'r') as file:
for line in file:
line = line.strip()
lof = len(line)
action = actions.get(line[0])
text = line[1:].lstrip(':')
print(
f'Length of line: {lof}\n'
f' Action: {action}\n'
f' Text: {text}\n'
)
Also I would suggest that for choosing files you use the built-in tkinter module which has a function for displaying the file chooser (it will limit human error in typing out the correct filename):
from tkinter import Tk, filedialog
Tk().withdraw()
filename = filedialog.askopenfilename()
Or just skip the filename part:
from tkinter import Tk, filedialog
Tk().withdraw()
with filedialog.askopenfile(mode='r') as file:
For those of you interested. Being new to python, I got sloppy with my spaces and tabs. My issue with the code, even though it was pointing to a syntax error on a line that should have passed, was that I was mixing spaces and tabs in my indentation, so it lost it's mind. BUT, THANK YOU for everyone that answered, as it did help me figure out a better way of doing what I was trying to do, and of course, I did find a missing ':' also.
You can check the first character of every line with:
action = line[0]

Parsing the file name from list of url links

Ok so I am using a script that is downloading a files from urls listed in a urls.txt.
import urllib.request
with open("urls.txt", "r") as file:
linkList = file.readlines()
for link in linkList:
urllib.request.urlretrieve(link)
Unfortunately they are saved as temporary files due to lack of second argument in my urllib.request.urlretrieve function. As there are thousand of links in my text file naming them separately is not an option. The thing is that the name of the file is contained in those links, i.e. /DocumentXML2XLSDownload.vm?firsttime=true&repengback=true&d‌​ocumentId=XXXXXX&xsl‌​FileName=rher2xml.xs‌​l&outputFileName=XXX‌​X_2017_06_25_4.xls where the name of the file comes after outputFileName=
Is there an easy way to parse the file names and then use them in urllib.request.urlretrieve function as secondary argument? I was thinking of extracting those names in excel and placing them in another text file that would be read in similar fashion as urls.txt but I'm not sure how to implement it in Python. Or is there a way to make it exclusively in python without using excel?
You could parse the link on the go.
Example using a regular expression:
import re
with open("urls.txt", "r") as file:
linkList = file.readlines()
for link in linkList:
regexp = '((?<=\?outputFileName=)|(?<=\&outputFileName=))[^&]+'
match = re.search(regexp, link.rstrip())
if match is None:
# Make the user aware that something went wrong, e.g. raise exception
# and/or just print something
print("WARNING: Couldn't find file name in link [" + link + "]. Skipping...")
else:
file_name = match.group(0)
urllib.request.urlretrieve(link, file_name)
You can use urlparse and parse_qs to get the query string
from urlparse import urlparse,parse_qs
parse = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html?name=Python&version=2')
print(parse_qs(parse.query)['name'][0]) # prints Python

Python: Remove everything before certain chars

I have several files on which I should work on. The files are xml-files, but before " < ?xml version="1.0"? > ", there are some debugging and status lines coming from the command line. Since I'd like to pare the file, these lines must be removed. My question is: How is this possible? Preferably inplace, i.e. the filename stays the same.
Thanks for any help.
An inefficient solution would be to read the whole contents and find where this occurs:
fileName="yourfile.xml"
with open(fileName,'r+') as f:
contents=f.read()
contents=contents[contents.find("< ?xml version="1.0"? >"):]
f.seek(0)
f.write(contents)
f.truncate()
The file will now contain the original files contents from "< ?xml version="1.0"? >" onwards.
What about trimming the file headers as you read the file?
import xml.etree.ElementTree as et
with open("input.xml", "rb") as inf:
# find starting point
offset = 0
for line in inf:
if line.startswith('<?xml version="1.0"'):
break
else:
offset += len(line)
# read the xml file starting at that point
inf.seek(offset)
data = et.parse(inf)
(This assumes that the xml header starts on its own line, but works on my test file:
<!-- This is a line of junk -->
<!-- This is another -->
<?xml version="1.0" ?>
<abc>
<def>xy</def>
<def>hi</def>
</abc>
Since you say you have several files, using fileinput might be better than open. You can then do something like:
import fileinput
import sys
prolog = '< ?xml version="1.0"? >'
reached_prolog = False
files = ['file1.xml', 'file2.xml'] # The paths of all your XML files
for line in fileinput.input(files, inplace=1):
# Decide how you want to remove the lines. Something like:
if line.startswith(prolog) and not reached_prolog:
continue
else:
reached_prolog = True
sys.stdout.write(line)
Reading the docs for fileinput should make things clearer.
P.S. This is just a quick response; I haven't ran/tested the code.
A solution with regexp:
import re
import shutil
with open('myxml.xml') as ifile, open('tempfile.tmp', 'wb') as ofile:
for line in ifile:
matches = re.findall(r'< \?xml version="1\.0"\? >.+', line)
if matches:
ofile.write(matches[0])
ofile.writelines(ifile)
break
shutil.move('tempfile.tmp', 'myxml.xml')

display an error message when file is empty - proper way?

hi im slowly trying to learn the correct way to write python code. suppose i have a text file which i want to check if empty, what i want to happen is that the program immediately terminates and the console window displays an error message if indeed empty. so far what ive done is written below. please teach me the proper method on how one ought to handle this case:
import os
def main():
f1name = 'f1.txt'
f1Cont = open(f1name,'r')
if not f1Cont:
print '%s is an empty file' %f1name
os.system ('pause')
#other code
if __name__ == '__main__':
main()
There is no need to open() the file, just use os.stat().
>>> #create an empty file
>>> f=open('testfile','w')
>>> f.close()
>>> #open the empty file in read mode to prove that it doesn't raise IOError
>>> f=open('testfile','r')
>>> f.close()
>>> #get the size of the file
>>> import os
>>> import stat
>>> os.stat('testfile')[stat.ST_SIZE]
0L
>>>
The pythonic way to do this is:
try:
f = open(f1name, 'r')
except IOError as e:
# you can print the error here, e.g.
print(str(e))
Maybe a duplicate of this.
From the original answer:
import os
if (os.stat(f1name).st_size == 0)
print 'File is empty!'
If file open succeeds the value of 'f1Cont` will be a file object and will not be False (even if the file is empty).One way you can check if the file is empty (after a successful open) is :
if f1Cont.readlines():
print 'File is not empty'
else:
print 'File is empty'
Assuming you are going to read the file if it has data in it, I'd recommend opening it in append-update mode and seeing if the file position is zero. If so, there's no data in the file. Otherwise, we can read it.
with open("filename", "a+") as f:
if f.tell():
f.seek(0)
for line in f: # read the file
print line.rstrip()
else:
print "no data in file"
one can create a custom exception and handle that using a try and except block as below
class ContentNotFoundError(Exception):
pass
with open('your_filename','r') as f:
try:
content=f.read()
if not content:
raise ContentNotFoundError()
except ContentNotFoundError:
print("the file you are trying to open has no contents in it")
else:
print("content found")
print(content)
This code will print the content of the file given if found otherwise will print the message
the file you are trying to open has no contents in it

Categories

Resources