Issue locating files due to foreign characters

Issue locating files due to foreign characters - python

It's my first time writing here, so I hope I'm doing everything all right.
I'm using python 3.5 on Win10, and I'm trying to "sync" music from Itunes to my Android device. Basically, I'm reading the Itunes Library XML file and getting all the files location ( so I can copy/paste them into my phone ) but I have problems with songs containing foreign characters.
import getpass
import re
import os
from urllib.parse import unquote
user = getpass.getuser()
ITUNES_LIB_PATH = "C:\\Users\\%s\\Music\\Itunes\\iTunes Music Library.xml" % user
ITUNES_SONGS_FILE = "ya.txt"
def write(file, what, newline=True):
with open(file, 'a', encoding="utf8") as f:
if not os.path.isfile(what):
print("Issue locating file %s\n" % what)
if newline:
what+"\n"
f.write(what)
def get_songs(file=ITUNES_LIB_PATH):
with open(file, 'r', encoding="utf8") as f:
f = f.read()
songs_location = re.findall("<key>Location</key><string>file://localhost/(.*?)</string>", f)
for song in songs_location:
song = unquote(song.replace("/", '\\'))
write(ITUNES_SONGS_FILE, song)
get_songs()
Output:
Issue locating file C:\Users\Dymy\Desktop\Media\Norin &amp; Rad - Bird Is The Word.mp3
How should I handle that "&" in the file name?

There are a couple of related issues in your code e.g., unescaped xml character references, hardcoded character encodings cause by using regular expressions to parse xml. To fix them, use xml parser such as xml.etree.ElementTree or use a more specific pyitunes library (I haven't tried it).

Related

how to do docx to pdf conversion using python library without subprocess in linux? [duplicate]

I'am tasked with converting tons of .doc files to .pdf. And the only way my supervisor wants me to do this is through MSWord 2010. I know I should be able to automate this with python COM automation. Only problem is I dont know how and where to start. I tried searching for some tutorials but was not able to find any (May be I might have, but I don't know what I'm looking for).
Right now I'm reading through this. Dont know how useful this is going to be.

A simple example using comtypes, converting a single file, input and output filenames given as commandline arguments:
import sys
import os
import comtypes.client
wdFormatPDF = 17
in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])
word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
You could also use pywin32, which would be the same except for:
import win32com.client
and then:
word = win32com.client.Dispatch('Word.Application')

You can use the docx2pdf python package to bulk convert docx to pdf. It can be used as both a CLI and a python library. It requires Microsoft Office to be installed and uses COM on Windows and AppleScript (JXA) on macOS.
from docx2pdf import convert
convert("input.docx")
convert("input.docx", "output.pdf")
convert("my_docx_folder/")
pip install docx2pdf
docx2pdf input.docx output.pdf
Disclaimer: I wrote the docx2pdf package. https://github.com/AlJohri/docx2pdf

I have tested many solutions but no one of them works efficiently on Linux distribution.
I recommend this solution :
import sys
import subprocess
import re
def convert_to(folder, source, timeout=None):
args = [libreoffice_exec(), '--headless', '--convert-to', 'pdf', '--outdir', folder, source]
process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
filename = re.search('-> (.*?) using filter', process.stdout.decode())
return filename.group(1)
def libreoffice_exec():
# TODO: Provide support for more platforms
if sys.platform == 'darwin':
return '/Applications/LibreOffice.app/Contents/MacOS/soffice'
return 'libreoffice'
and you call your function:
result = convert_to('TEMP Directory', 'Your File', timeout=15)
All resources:
https://michalzalecki.com/converting-docx-to-pdf-using-python/

I have worked on this problem for half a day, so I think I should share some of my experience on this matter. Steven's answer is right, but it will fail on my computer. There are two key points to fix it here:
(1). The first time when I created the 'Word.Application' object, I should make it (the word app) visible before open any documents. (Actually, even I myself cannot explain why this works. If I do not do this on my computer, the program will crash when I try to open a document in the invisible model, then the 'Word.Application' object will be deleted by OS. )
(2). After doing (1), the program will work well sometimes but may fail often. The crash error "COMError: (-2147418111, 'Call was rejected by callee.', (None, None, None, 0, None))" means that the COM Server may not be able to response so quickly. So I add a delay before I tried to open a document.
After doing these two steps, the program will work perfectly with no failure anymore. The demo code is as below. If you have encountered the same problems, try to follow these two steps. Hope it helps.
import os
import comtypes.client
import time
wdFormatPDF = 17
# absolute path is needed
# be careful about the slash '\', use '\\' or '/' or raw string r"..."
in_file=r'absolute path of input docx file 1'
out_file=r'absolute path of output pdf file 1'
in_file2=r'absolute path of input docx file 2'
out_file2=r'absolute path of outputpdf file 2'
# print out filenames
print in_file
print out_file
print in_file2
print out_file2
# create COM object
word = comtypes.client.CreateObject('Word.Application')
# key point 1: make word visible before open a new document
word.Visible = True
# key point 2: wait for the COM Server to prepare well.
time.sleep(3)
# convert docx file 1 to pdf file 1
doc=word.Documents.Open(in_file) # open docx file 1
doc.SaveAs(out_file, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 1
word.Visible = False
# convert docx file 2 to pdf file 2
doc = word.Documents.Open(in_file2) # open docx file 2
doc.SaveAs(out_file2, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 2
word.Quit() # close Word Application

unoconv (writen in Python) and OpenOffice running as a headless daemon.
https://github.com/unoconv/unoconv
http://dag.wiee.rs/home-made/unoconv/
Works very nicely for doc, docx, ppt, pptx, xls, xlsx.
Very useful if you need to convert docs or save/convert to certain formats on a server.

As an alternative to the SaveAs function, you could also use ExportAsFixedFormat which gives you access to the PDF options dialog you would normally see in Word. With this you can specify bookmarks and other document properties.
doc.ExportAsFixedFormat(OutputFileName=pdf_file,
ExportFormat=17, #17 = PDF output, 18=XPS output
OpenAfterExport=False,
OptimizeFor=0, #0=Print (higher res), 1=Screen (lower res)
CreateBookmarks=1, #0=No bookmarks, 1=Heading bookmarks only, 2=bookmarks match word bookmarks
DocStructureTags=True
);
The full list of function arguments is: 'OutputFileName', 'ExportFormat', 'OpenAfterExport', 'OptimizeFor', 'Range', 'From', 'To', 'Item', 'IncludeDocProps', 'KeepIRM', 'CreateBookmarks', 'DocStructureTags', 'BitmapMissingFonts', 'UseISO19005_1', 'FixedFormatExtClassPtr'

It's worth noting that Stevens answer works, but make sure if using a for loop to export multiple files to place the ClientObject or Dispatch statements before the loop - it only needs to be created once - see my problem: Python win32com.client.Dispatch looping through Word documents and export to PDF; fails when next loop occurs

If you don't mind using PowerShell have a look at this Hey, Scripting Guy! article. The code presented could be adopted to use the wdFormatPDF enumeration value of WdSaveFormat (see here).
This blog article presents a different implementation of the same idea.

I have modified it for ppt support as well. My solution support all the below-specified extensions.
word_extensions = [".doc", ".odt", ".rtf", ".docx", ".dotm", ".docm"]
ppt_extensions = [".ppt", ".pptx"]
My Solution: Github Link
I have modified code from Docx2PDF

I tried the accepted answer but wasn't particularly keen on the bloated PDFs Word was producing which was usually an order of magnitude bigger than expected. After looking how to disable the dialogs when using a virtual PDF printer I came across Bullzip PDF Printer and I've been rather impressed with its features. It's now replaced the other virtual printers I used previously. You'll find a "free community edition" on their download page.
The COM API can be found here and a list of the usable settings can be found here. The settings are written to a "runonce" file which is used for one print job only and then removed automatically. When printing multiple PDFs we need to make sure one print job completes before starting another to ensure the settings are used correctly for each file.
import os, re, time, datetime, win32com.client
def print_to_Bullzip(file):
util = win32com.client.Dispatch("Bullzip.PDFUtil")
settings = win32com.client.Dispatch("Bullzip.PDFSettings")
settings.PrinterName = util.DefaultPrinterName # make sure we're controlling the right PDF printer
outputFile = re.sub("\.[^.]+$", ".pdf", file)
statusFile = re.sub("\.[^.]+$", ".status", file)
settings.SetValue("Output", outputFile)
settings.SetValue("ConfirmOverwrite", "no")
settings.SetValue("ShowSaveAS", "never")
settings.SetValue("ShowSettings", "never")
settings.SetValue("ShowPDF", "no")
settings.SetValue("ShowProgress", "no")
settings.SetValue("ShowProgressFinished", "no") # disable balloon tip
settings.SetValue("StatusFile", statusFile) # created after print job
settings.WriteSettings(True) # write settings to the runonce.ini
util.PrintFile(file, util.DefaultPrinterName) # send to Bullzip virtual printer
# wait until print job completes before continuing
# otherwise settings for the next job may not be used
timestamp = datetime.datetime.now()
while( (datetime.datetime.now() - timestamp).seconds < 10):
if os.path.exists(statusFile) and os.path.isfile(statusFile):
error = util.ReadIniString(statusFile, "Status", "Errors", '')
if error != "0":
raise IOError("PDF was created with errors")
os.remove(statusFile)
return
time.sleep(0.1)
raise IOError("PDF creation timed out")

I was working with this solution but I needed to search all .docx, .dotm, .docm, .odt, .doc or .rtf and then turn them all to .pdf (python 3.7.5). Hope it works...
import os
import win32com.client
wdFormatPDF = 17
for root, dirs, files in os.walk(r'your directory here'):
for f in files:
if f.endswith(".doc") or f.endswith(".odt") or f.endswith(".rtf"):
try:
print(f)
in_file=os.path.join(root,f)
word = win32com.client.Dispatch('Word.Application')
word.Visible = False
doc = word.Documents.Open(in_file)
doc.SaveAs(os.path.join(root,f[:-4]), FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
word.Visible = True
print ('done')
os.remove(os.path.join(root,f))
pass
except:
print('could not open')
# os.remove(os.path.join(root,f))
elif f.endswith(".docx") or f.endswith(".dotm") or f.endswith(".docm"):
try:
print(f)
in_file=os.path.join(root,f)
word = win32com.client.Dispatch('Word.Application')
word.Visible = False
doc = word.Documents.Open(in_file)
doc.SaveAs(os.path.join(root,f[:-5]), FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
word.Visible = True
print ('done')
os.remove(os.path.join(root,f))
pass
except:
print('could not open')
# os.remove(os.path.join(root,f))
else:
pass
The try and except was for those documents I couldn't read and won't exit the code until the last document.

You should start from investigating so called virtual PDF print drivers.
As soon as you will find one you should be able to write batch file that prints your DOC files into PDF files. You probably can do this in Python too (setup printer driver output and issue document/print command in MSWord, later can be done using command line AFAIR).

import docx2txt
from win32com import client
import os
files_from_folder = r"c:\\doc"
directory = os.fsencode(files_from_folder)
amount = 1
word = client.DispatchEx("Word.Application")
word.Visible = True
for file in os.listdir(directory):
filename = os.fsdecode(file)
print(filename)
if filename.endswith('docx'):
text = docx2txt.process(os.path.join(files_from_folder, filename))
print(f'{filename} transfered ({amount})')
amount += 1
new_filename = filename.split('.')[0] + '.txt'
try:
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
except:
os.mkdir(files_from_folder + r'\txt_files')
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
elif filename.endswith('doc'):
doc = word.Documents.Open(os.path.join(files_from_folder, filename))
text = doc.Range().Text
doc.Close()
print(f'{filename} transfered ({amount})')
amount += 1
new_filename = filename.split('.')[0] + '.txt'
try:
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
except:
os.mkdir(files_from_folder + r'\txt_files')
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
word.Quit()
The Source Code, see here:
https://neculaifantanaru.com/en/python-full-code-how-to-convert-doc-and-docx-files-to-pdf-from-the-folder.html

I would suggest ignoring your supervisor and use OpenOffice which has a Python api. OpenOffice has built in support for Python and someone created a library specific for this purpose (PyODConverter).
If he isn't happy with the output, tell him it could take you weeks to do it with word.

I need an epub to text solution in Python

I need to get text from an epub
from epub_conversion.utils import open_book, convert_epub_to_lines
f = open("demofile.txt", "a")
book = open_book("razvansividra.epub")
lines = convert_epub_to_lines(book)
I use this but if I use print(lines) it does print only one line. And the library is 6 years old. Do you guys know a good way ?

What about https://github.com/aerkalov/ebooklib
EbookLib is a Python library for managing EPUB2/EPUB3 and Kindle
files. It's capable of reading and writing EPUB files programmatically
(Kindle support is under development).
The API is designed to be as simple as possible, while at the same
time making complex things possible too. It has support for covers,
table of contents, spine, guide, metadata and etc.
import ebooklib
from ebooklib import epub
book = epub.read_epub('test.epub')
for doc in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
print doc

convert_epub_to_lines returns an iterator to lines, which you need to iterate one by one to get.
Instead, you can get all lines with "convert", see in the documentation of the library:
https://pypi.org/project/epub-conversion/

Epublib has the problem of modifying your epub metadata, so if you want the original file with maybe only a few things changed you can simply unpack the epub into a directory and parse it with Beautifulsoup:
from os import path, listdir
with ZipFile(FILE_NAME, "r") as zip_ref:
zip_ref.extractall(extract_dir)
for filename in listdir(extract_dir):
if filename.endswith(".xhtml"):
print(filename)
with open(path.join(extract_dir, filename), "r", encoding="utf-8") as f:
soup = BeautifulSoup(f.read(), "lxml")
for text_object in soup.find_all(text=True):

Here is a sloppy script that extracts the text from an .epub in the right order. Improvements could be made
Quick explanation:
Takes input(epub) and output(txt) file paths as first and second arguments
Extracts epub content in temporary directory
Parses 'content.opf' file for xhtml content and order
Extracts text from each xhtml
Dependency: lxml
#!/usr/bin/python3
import shutil, os, sys, zipfile, tempfile
from lxml import etree
if len(sys.argv) != 3:
print(f"Usage: {sys.argv[0]} <input.epub> <output.txt>")
exit(1)
inputFilePath=sys.argv[1]
outputFilePath=sys.argv[2]
print(f"Input: {inputFilePath}")
print(f"Output: {outputFilePath}")
with tempfile.TemporaryDirectory() as tmpDir:
print(f"Extracting input to temp directory '{tmpDir}'.")
with zipfile.ZipFile(inputFilePath, 'r') as zip_ref:
zip_ref.extractall(tmpDir)
with open(outputFilePath, "w") as outFile:
print(f"Parsing 'container.xml' file.")
containerFilePath=f"{tmpDir}/META-INF/container.xml"
tree = etree.parse(containerFilePath)
for rootFilePath in tree.xpath( "//*[local-name()='container']"
"/*[local-name()='rootfiles']"
"/*[local-name()='rootfile']"
"/#full-path"):
print(f"Parsing '{rootFilePath}' file.")
contentFilePath = f"{tmpDir}/{rootFilePath}"
contentFileDirPath = os.path.dirname(contentFilePath)
tree = etree.parse(contentFilePath)
for idref in tree.xpath("//*[local-name()='package']"
"/*[local-name()='spine']"
"/*[local-name()='itemref']"
"/#idref"):
for href in tree.xpath( f"//*[local-name()='package']"
f"/*[local-name()='manifest']"
f"/*[local-name()='item'][#id='{idref}']"
f"/#href"):
outFile.write("\n")
xhtmlFilePath = f"{contentFileDirPath}/{href}"
subtree = etree.parse(xhtmlFilePath, etree.HTMLParser())
for ptag in subtree.xpath("//html/body/*"):
for text in ptag.itertext():
outFile.write(f"{text}")
outFile.write("\n")
print(f"Text written to '{outputFilePath}'.")

Parsing the file name from list of url links

Ok so I am using a script that is downloading a files from urls listed in a urls.txt.
import urllib.request
with open("urls.txt", "r") as file:
linkList = file.readlines()
for link in linkList:
urllib.request.urlretrieve(link)
Unfortunately they are saved as temporary files due to lack of second argument in my urllib.request.urlretrieve function. As there are thousand of links in my text file naming them separately is not an option. The thing is that the name of the file is contained in those links, i.e. /DocumentXML2XLSDownload.vm?firsttime=true&repengback=true&d‌ocumentId=XXXXXX&xsl‌FileName=rher2xml.xs‌l&outputFileName=XXX‌X_2017_06_25_4.xls where the name of the file comes after outputFileName=
Is there an easy way to parse the file names and then use them in urllib.request.urlretrieve function as secondary argument? I was thinking of extracting those names in excel and placing them in another text file that would be read in similar fashion as urls.txt but I'm not sure how to implement it in Python. Or is there a way to make it exclusively in python without using excel?

You could parse the link on the go.
Example using a regular expression:
import re
with open("urls.txt", "r") as file:
linkList = file.readlines()
for link in linkList:
regexp = '((?<=\?outputFileName=)|(?<=\&outputFileName=))[^&]+'
match = re.search(regexp, link.rstrip())
if match is None:
# Make the user aware that something went wrong, e.g. raise exception
# and/or just print something
print("WARNING: Couldn't find file name in link [" + link + "]. Skipping...")
else:
file_name = match.group(0)
urllib.request.urlretrieve(link, file_name)

You can use urlparse and parse_qs to get the query string
from urlparse import urlparse,parse_qs
parse = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html?name=Python&version=2')
print(parse_qs(parse.query)['name'][0]) # prints Python

python unicode csv export using pyramid

I'm trying to export mongodb that has non ascii characters into csv format.
Right now, I'm dabbling with pyramid and using pyramid.response.
from pyramid.response import Response
from mycart.Member import Member
#view_config(context="mycart:resources.Member", name='', request_method="POST", permission = 'admin')
def member_export( context, request):
filename = 'member-'+time.strftime("%Y%m%d%H%M%S")+".csv"
download_path = os.getcwd() + '/MyCart/mycart/static/downloads/'+filename
member = Members(request)
my_list = [['First Name,Last Name']]
record = member.get_all_member( )
for r in record:
mystr = [ r['fname'], r['lname']]
my_list.append(mystr)
with open(download_path, 'wb') as f:
fileWriter = csv.writer(f, delimiter=',',quotechar='|', quoting=csv.QUOTE_MINIMAL)
for l in my_list:
print(l)
fileWriter.writerow(l)
size = os.path.getsize(download_path)
response = Response(content_type='application/force-download', content_disposition='attachment; filename=' + filename)
response.app_iter = open(download_path , 'rb')
response.content_length = size
return response
In mongoDB, first name is showing 王, when I'm using print, it too is showing 王. However, when I used excel to open it up, it shows random stuff - ç¾…
However, when I tried to view it in shell
$ more member-20130227141550.csv
It managed to display the non ascii character correctly.
How should I rectify this problem?

I'm not a Windows guy, so I am not sure whether the problem may be with your code or with excel just not handling non-ascii characters nicely. But I have noticed that you are writing your file with python csv module, which is notorious for headaches with unicode.
Other users have reported success with using unicodecsv as a replacement for the csv module. Perhaps you could try dropping in this module as a csv writer and see if your problem magically goes away.

Undefined entity error while using ElementTree

I have a set of XML files that I need to read and format into a single CSV file. In order to read from the XML files, I have used the solution mentioned here.
My code looks like this:
from os import listdir
import xml.etree.cElementTree as et
files = listdir(".../blogs/")
for i in range(len(files)):
# fname = ".../blogs/" + files[i]
f = open(".../blogs/" + files[i], 'r')
contents = f.read()
tree=et.fromstring(contents)
for el in tree.findall('post'):
post = el.text
f.close()
This gives me the error cElementTree.ParseError: undefined entity: at the line tree=et.fromstring(contents). Oddly enough, when I run each of the commands on command line Python (without the for-loop though), it runs perfectly.
In case you want to know the XML structure, it is like this:
<Blog>
<date> some date </date>
<post> some blog post </post>
</Blog>
So what is causing this error, and how come it doesn't run from the Python file, but runs from the command line?
Update: After reading this link I checked files[0] and found that '&' symbol occurs a few times. I think that might be causing the problem. I used a random file to read when I ran the same commands on command line.

As I mentioned in the update, there were some symbols that I suspected might be causing a problem.
The reason the error didn't come up when I ran the same lines on the command line is because I would randomly pick a file that didn't have any such characters.
Since I mainly required the content between the <post> and </post> tags, I created my own parser (as was suggested in the link mentioned in the update).
from os import listdir
files = listdir(".../blogs/")
for i in range(len(files)):
f = open(".../blogs/" + files[i], 'r')
contents = f.read()
seek1 = contents.find('<post>')
seek2 = contents.find('</post>', seek1+1)
while(seek1!=-1):
post = contents[seek1+5:seek2+6]
seek1 = contents.find('<post>', seek1+1)
seek2 = contents.find('</post>', seek1+1)
f.close()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Issue locating files due to foreign characters - python

There are a couple of related issues in your code e.g., unescaped xml character references, hardcoded character encodings cause by using regular expressions to parse xml. To fix them, use xml parser such as xml.etree.ElementTree or use a more specific pyitunes library (I haven't tried it).

Related

how to do docx to pdf conversion using python library without subprocess in linux? [duplicate]

I need an epub to text solution in Python

Parsing the file name from list of url links

python unicode csv export using pyramid

Undefined entity error while using ElementTree

Categories

Resources