I am trying to write code that pulls all .xml file names from a particular directory in a repo, then filters out the files that contain a certain keyword. The part of the code that collects the file names and writes them to the first text file works properly. The second part, which reads the first text file and filters it into a second text file, does not work as expected. I am not receiving an error, but the array of lines is empty, which is odd because the text file is not. I am wondering if anyone sees anything obvious that I am missing. I have been at this for a long time, so it's easy to miss simple things; any help would be appreciated. I have looked at other examples, and their approach is logically similar to mine, so I am not sure where I went wrong. Thank you in advance.
This is the code:
#!/usr/bin/python
import glob
import re
import os
import fnmatch
from pprint import pprint
import xml.etree.ElementTree as ET
import cPickle as pickle
# open a text output file
text_file = open("TestCases.txt", "w")
matches = []
# initialize the array called matches with all the xml files in the selected directories (bbap, bbsc, bbtest, and bbrtm)
for bbdir in ['bbap/nr', 'bbsc/nr', 'bbtest/nr', 'bbrtm/nr']:
    for root, dirnames, filenames in os.walk('/repo/bob/ebb/' + bbdir):
        for filename in fnmatch.filter(filenames, '*.xml'):
            matches.append(os.path.join(root, filename))
# for each listing in matches test it against the desired filter to achieve the wanted tests
for each_xml in matches:
    if each_xml.find('dlL1NrMdbfUPe') != -1:
        tree = ET.parse(each_xml)
        root = tree.getroot()
        for child in root:
            for test_suite in child:
                for test_case in test_suite:
                    text_file.write(pickle.dumps(test_case.attrib))
# modify the text so it is easy to read
with open("TestCases.txt") as f:
    with open("Formatted_File", "w") as f1:
        for line in f:
            if "DLL1NRMDBFUPE" in line:
                f1.write(line)
Just as I had anticipated, the error came from looking at the code for too long. I simply forgot to close text_file before opening f, so the buffered writes had not yet reached the file when it was read back.
Fix:
import glob
import re
import os
import fnmatch
from pprint import pprint
import xml.etree.ElementTree as ET
import cPickle as pickle
# open a text output file
text_file = open("TestCases.txt", "w")
matches = []
# initialize the array called matches with all the xml files in the selected directories (bbap, bbsc, bbtest, and bbrtm)
for bbdir in ['bbap/nr', 'bbsc/nr', 'bbtest/nr', 'bbrtm/nr']:
    for root, dirnames, filenames in os.walk('/repo/bob/ebb/' + bbdir):
        for filename in fnmatch.filter(filenames, '*.xml'):
            matches.append(os.path.join(root, filename))
# for each listing in matches test it against the desired filter to achieve the wanted tests
for each_xml in matches:
    if each_xml.find('dlL1NrMdbfUPe') != -1:
        tree = ET.parse(each_xml)
        root = tree.getroot()
        for child in root:
            for test_suite in child:
                for test_case in test_suite:
                    text_file.write(pickle.dumps(test_case.attrib))
# the fix: close the output file so everything is flushed before it is read back
text_file.close()
# modify the text so it is easy to read
# (the with blocks close f and f1 automatically, so no explicit close() is needed)
with open("TestCases.txt") as f:
    with open("Formatted_File", "w") as f1:
        for line in f:
            if "DLL1NRMDBFUPE" in line:
                f1.write(line)
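The underlying issue is buffering: data written to a file object can sit in memory until the file is flushed or closed, so a second handle opened on the same path may see an empty file. A minimal Python 3 sketch, assuming made-up sample lines and a temporary directory, that lets a `with` block close the writer before the file is read back:

```python
import os
import tempfile


def write_then_filter(lines, keyword, directory):
    """Write lines to one file, then copy the lines containing keyword to a second file."""
    raw_path = os.path.join(directory, "TestCases.txt")
    filtered_path = os.path.join(directory, "Formatted_File")

    # Leaving this block closes the file, which flushes buffered data to disk.
    with open(raw_path, "w") as raw:
        for line in lines:
            raw.write(line + "\n")

    # Only now is it safe to read the file back.
    with open(raw_path) as src, open(filtered_path, "w") as dst:
        for line in src:
            if keyword in line:
                dst.write(line)

    with open(filtered_path) as result:
        return result.read().splitlines()


# Hypothetical sample data, only for illustration.
with tempfile.TemporaryDirectory() as tmp:
    kept = write_then_filter(
        ["DLL1NRMDBFUPE_case_1", "other_case", "DLL1NRMDBFUPE_case_2"],
        "DLL1NRMDBFUPE",
        tmp,
    )
```

Using `with` for the writer removes the need to remember an explicit `close()` call.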
When parsing a single xml file to find a specific named node and its text, I get the output I want. However, I need to extend this code to a collection of xml files. When I do so, I get no output.
Here is my code on a single xml file, which works as desired:
from lxml import etree
parsed_single = etree.parse('Downloads/one_file.xml')
xp_eval = etree.XPathEvaluator(parsed_single)
d = dict((item.text, item.tag) for item in xp_eval('//node_of_interest'))
The above code outputs the following, as expected:
d=
{'interesting text': 'Node_of_interest',
'more inter text': 'Node_of_interest',
'yet more txt': 'Node_of_interest'}
However, when applying the above code file by file to a folder of very similar xml files, my dictionary is empty. Here is the code attempt at parsing a collection of xml files:
import glob
import os
path = '/Users/Downloads/XML_FOLDER/'
read_files = glob.glob(os.path.join(path, '*.xml'))
for file in read_files:
    file_pars = etree.parse(file)
    xpatheval = etree.XPathEvaluator(file_pars)
    full_dict = dict((item.text, item.tag) for item in xpatheval('//node_of_interest'))
Unfortunately, full_dict = {}
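One likely cause, judging only from the code shown: full_dict is rebuilt on every pass through the loop, so only the result for the last file survives, and it is empty if that file has no matching node. A minimal sketch that accumulates across files instead; it swaps in the standard library's xml.etree.ElementTree for lxml so it runs without extra dependencies, and the sample files are made up for illustration:

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET


def collect_nodes(folder, tag):
    """Map each matching node's text to its tag across every .xml file in folder."""
    full_dict = {}  # created once, before the loop
    for xml_path in sorted(glob.glob(os.path.join(folder, "*.xml"))):
        root = ET.parse(xml_path).getroot()
        # update() adds to the dictionary instead of replacing it
        full_dict.update((node.text, node.tag) for node in root.iter(tag))
    return full_dict


# Hypothetical sample files: the last one has no matching node.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "a.xml": "<doc><node_of_interest>first</node_of_interest></doc>",
        "b.xml": "<doc><other>ignored</other></doc>",
    }
    for name, content in samples.items():
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write(content)
    result = collect_nodes(tmp, "node_of_interest")
```

With the loop from the question, the same two files would leave the dictionary empty, because b.xml is processed last.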
I need to get text from an epub
from epub_conversion.utils import open_book, convert_epub_to_lines
f = open("demofile.txt", "a")
book = open_book("razvansividra.epub")
lines = convert_epub_to_lines(book)
I use this, but print(lines) prints only one line, and the library is six years old. Does anyone know a better way?
What about https://github.com/aerkalov/ebooklib?
EbookLib is a Python library for managing EPUB2/EPUB3 and Kindle
files. It's capable of reading and writing EPUB files programmatically
(Kindle support is under development).
The API is designed to be as simple as possible, while at the same
time making complex things possible too. It has support for covers,
table of contents, spine, guide, metadata and etc.
import ebooklib
from ebooklib import epub
book = epub.read_epub('test.epub')
for doc in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    print(doc)
convert_epub_to_lines returns an iterator over the lines, so you need to iterate over it to get them one by one.
Alternatively, you can get all lines with "convert"; see the library's documentation:
https://pypi.org/project/epub-conversion/
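To illustrate the iterator point in general terms, here is a small sketch. `fake_lines` is a hypothetical stand-in written for this example, not part of epub-conversion; it only mimics a function that yields lines lazily:

```python
def fake_lines():
    """Hypothetical stand-in for a function that yields lines lazily."""
    yield "first line"
    yield "second line"


lines = fake_lines()
as_text = repr(lines)    # describes the generator object, not the lines themselves
all_lines = list(lines)  # consumes the iterator and collects every line
```

Printing the iterator itself shows only its description; wrapping it in `list()` or looping over it with `for` yields the actual lines.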
Epublib has the problem of modifying your epub metadata, so if you want the original file with maybe only a few things changed, you can simply unpack the epub into a directory and parse it with BeautifulSoup:
from os import path, listdir
from zipfile import ZipFile
from bs4 import BeautifulSoup

# FILE_NAME (path to the .epub) and extract_dir (target directory) must be defined beforehand
with ZipFile(FILE_NAME, "r") as zip_ref:
    zip_ref.extractall(extract_dir)
for filename in listdir(extract_dir):
    if filename.endswith(".xhtml"):
        print(filename)
        with open(path.join(extract_dir, filename), "r", encoding="utf-8") as f:
            soup = BeautifulSoup(f.read(), "lxml")
        for text_object in soup.find_all(text=True):
            # the original snippet is cut off here; printing the text is an assumed placeholder
            print(text_object)
Here is a rough script that extracts the text from an .epub in the right order. Improvements could be made.
Quick explanation:
Takes input (epub) and output (txt) file paths as first and second arguments
Extracts epub content into a temporary directory
Parses 'container.xml' to locate the package file (e.g. 'content.opf'), then reads the xhtml content and its order from it
Extracts text from each xhtml
Dependency: lxml
#!/usr/bin/python3
import shutil, os, sys, zipfile, tempfile
from lxml import etree

if len(sys.argv) != 3:
    print(f"Usage: {sys.argv[0]} <input.epub> <output.txt>")
    sys.exit(1)

inputFilePath = sys.argv[1]
outputFilePath = sys.argv[2]
print(f"Input: {inputFilePath}")
print(f"Output: {outputFilePath}")

with tempfile.TemporaryDirectory() as tmpDir:
    print(f"Extracting input to temp directory '{tmpDir}'.")
    with zipfile.ZipFile(inputFilePath, 'r') as zip_ref:
        zip_ref.extractall(tmpDir)

    with open(outputFilePath, "w") as outFile:
        print("Parsing 'container.xml' file.")
        containerFilePath = f"{tmpDir}/META-INF/container.xml"
        tree = etree.parse(containerFilePath)
        # container.xml lists the package (.opf) file(s) in the full-path attribute
        for rootFilePath in tree.xpath("//*[local-name()='container']"
                                       "/*[local-name()='rootfiles']"
                                       "/*[local-name()='rootfile']"
                                       "/@full-path"):
            print(f"Parsing '{rootFilePath}' file.")
            contentFilePath = f"{tmpDir}/{rootFilePath}"
            contentFileDirPath = os.path.dirname(contentFilePath)
            tree = etree.parse(contentFilePath)
            # the spine gives the reading order as a list of manifest item ids
            for idref in tree.xpath("//*[local-name()='package']"
                                    "/*[local-name()='spine']"
                                    "/*[local-name()='itemref']"
                                    "/@idref"):
                # look up the file behind each id in the manifest
                for href in tree.xpath(f"//*[local-name()='package']"
                                       f"/*[local-name()='manifest']"
                                       f"/*[local-name()='item'][@id='{idref}']"
                                       f"/@href"):
                    outFile.write("\n")
                    xhtmlFilePath = f"{contentFileDirPath}/{href}"
                    subtree = etree.parse(xhtmlFilePath, etree.HTMLParser())
                    for ptag in subtree.xpath("//html/body/*"):
                        for text in ptag.itertext():
                            outFile.write(f"{text}")
                        outFile.write("\n")

print(f"Text written to '{outputFilePath}'.")
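The reading-order lookup in the script above can also be sketched with only the standard library, reading straight from the archive instead of extracting it first. This is a sketch of that one step under stated assumptions, not a replacement for the full script: `spine_documents` is a name made up here, and the tiny in-memory archive is sample data rather than a complete EPUB.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the EPUB container and package documents.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_documents(epub_file):
    """Return the archive paths of the content documents in reading order."""
    with zipfile.ZipFile(epub_file) as zf:
        # container.xml points at the package (.opf) file
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.get("full-path")

        package = ET.fromstring(zf.read(opf_path))
        # manifest: item id -> href (relative to the .opf file)
        manifest = {
            item.get("id"): item.get("href")
            for item in package.findall("opf:manifest/opf:item", OPF_NS)
        }
        base = posixpath.dirname(opf_path)
        # spine: ordered list of item ids
        return [
            posixpath.join(base, manifest[ref.get("idref")])
            for ref in package.findall("opf:spine/opf:itemref", OPF_NS)
        ]


# Minimal made-up archive: the manifest order differs from the spine order.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as zf:
    zf.writestr(
        "META-INF/container.xml",
        '<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles><rootfile full-path="OEBPS/content.opf"/></rootfiles>'
        "</container>",
    )
    zf.writestr(
        "OEBPS/content.opf",
        '<package xmlns="http://www.idpf.org/2007/opf">'
        '<manifest><item id="c2" href="two.xhtml"/><item id="c1" href="one.xhtml"/></manifest>'
        '<spine><itemref idref="c1"/><itemref idref="c2"/></spine>'
        "</package>",
    )
sample.seek(0)
order = spine_documents(sample)
```

Each returned path can then be read with `ZipFile.read` and passed to whichever HTML parser is preferred.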
I was able to run the regex on multiple files. Now I want to save the output for each file under a name like name_of_file_clean.txt.
I am trying to find the best way to do this.
import os, re
import glob
pattern = re.compile(r'(?<=CN=)(.*?)(?=,)')
for file in glob.glob('*.txt'):
    with open(file) as fp:
        for result in pattern.findall(fp.read()):
            print(result)
We'll just open the output file and use the print function's file keyword argument to write to it:
import os, re
import glob
pattern = re.compile(r'(?<=CN=)(.*?)(?=,)')
for file in glob.glob('*.txt'):
    with open(file) as fp:
        with open(file[:-4] + '_clean.txt', 'w') as outfile:
            for result in pattern.findall(fp.read()):
                print(result, file=outfile)
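A variant of the same idea, as a sketch: os.path.splitext builds the output name without relying on the extension being exactly four characters long, and files ending in _clean.txt are skipped so that a second run does not process its own output. The helper names, the sample file, and its contents are assumptions made for illustration.

```python
import glob
import os
import re
import tempfile

PATTERN = re.compile(r"(?<=CN=)(.*?)(?=,)")


def clean_name(path):
    """Turn 'name.txt' into 'name_clean.txt'."""
    root, ext = os.path.splitext(path)
    return root + "_clean" + ext


def write_clean_files(folder):
    """Write the regex matches of every .txt file in folder to a *_clean.txt file."""
    written = []
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        if path.endswith("_clean.txt"):
            continue  # skip output left over from an earlier run
        with open(path) as src, open(clean_name(path), "w") as dst:
            for result in PATTERN.findall(src.read()):
                print(result, file=dst)
        written.append(clean_name(path))
    return written


# Hypothetical sample input.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "certs.txt"), "w") as handle:
        handle.write("CN=alice,OU=x\nCN=bob,OU=y\n")
    outputs = [os.path.basename(p) for p in write_clean_files(tmp)]
    with open(os.path.join(tmp, "certs_clean.txt")) as handle:
        cleaned = handle.read().splitlines()
```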
I am new to Python and I am using the following code to get sentiment analysis output:
import json
from watson_developer_cloud import ToneAnalyzerV3Beta
import urllib.request
import codecs
import csv
import os
import re
import sys
import collections
import glob
ipath = 'C:/TEMP/' # input folder
opath = 'C:/TEMP/matrix/' # output folder
reader = codecs.getreader("utf-8")
tone_analyzer = ToneAnalyzerV3Beta(
    url='https://gateway.watsonplatform.net/tone-analyzer/api',
    username='ABCID',
    password='ABCPASS',
    version='2016-02-11')
path = 'C:/TEMP/*.txt'
file = glob.glob(path)
text = file.read()
data=tone_analyzer.tone(text='text')
for cat in data['document_tone']['tone_categories']:
    print('Category:', cat['category_name'])
    for tone in cat['tones']:
        print('-', tone['tone_name'], tone['score'])
#create file
In the above code, all I am trying to do is read every text file stored in the C:/TEMP folder and run sentiment analysis on it, but I keep getting an error: 'list' object has no attribute 'read'
I am not sure where I am going wrong, and I would really appreciate any help with this one. Also, is there a way I can write the output to a CSV file, so that if I read the file ABC.txt, I create an output CSV file called ABC.csv with the output values?
Thank you
glob returns a list of file names, so you need to iterate over the list, open each file, and then call .read on the file object:
files = glob.glob(path)
# iterate over the list getting each file
for fle in files:
    # open the file and then call .read() to get the text
    with open(fle) as f:
        text = f.read()
Not sure what exactly you want to write, but the csv lib will do it:
from csv import writer
files = glob.glob(path)
# iterate over the list getting each file
for fle in files:
    # open the input file and a .csv file named after it (ABC.txt -> ABC.csv)
    with open(fle) as f, open("{}.csv".format(fle.rsplit(".", 1)[0]), "w", newline="") as out:
        text = f.read()
        wr = writer(out)
        # pass the file's contents, not the literal string 'text'
        data = tone_analyzer.tone(text=text)
        wr.writerow(["some", "column", "names"])  # write the col names
Then call writerow, passing a list of whatever you want to write for each row.
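As a sketch of what those rows could look like, the nested loop from the question can be flattened into one row per tone. The `sample` dictionary below is hypothetical data shaped like the fields the question reads (`document_tone`, `tone_categories`, `tones`); it is not real API output, and the column names are an assumption.

```python
import csv
import io


def tone_rows(data):
    """Flatten a tone response into [category, tone, score] rows."""
    rows = []
    for cat in data["document_tone"]["tone_categories"]:
        for tone in cat["tones"]:
            rows.append([cat["category_name"], tone["tone_name"], tone["score"]])
    return rows


def write_tone_csv(data, out):
    """Write a header row followed by one row per tone to an open text file."""
    wr = csv.writer(out)
    wr.writerow(["category", "tone", "score"])
    wr.writerows(tone_rows(data))


# Hypothetical sample, only for illustration.
sample = {
    "document_tone": {
        "tone_categories": [
            {
                "category_name": "Emotion Tone",
                "tones": [{"tone_name": "Joy", "score": 0.5}],
            },
        ]
    }
}
buffer = io.StringIO()
write_tone_csv(sample, buffer)
csv_text = buffer.getvalue()
```

In the loop above, `out` would take the place of `buffer`.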
I've been trying to make my Python code fill in a form in Word with data that I scraped off the Internet. I wrote the data to a txt file and am now trying to fill the Word file with this code:
import zipfile
import os
import tempfile
import shutil
import codecs
def getXml(docxFilename,ReplaceText):
    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    xmlString= zip.read("word/document.xml")
    for key in ReplaceText.keys():
        xmlString = xmlString.replace(str(key), str(ReplaceText.get(key)))
    return xmlString
def createNewDocx(originalDocx,xmlString,newFilename):
    tmpDir = tempfile.mkdtemp()
    zip = zipfile.ZipFile(open(originalDocx,"rb"))
    zip.extractall(tmpDir)
    #3tmpDir=tmpDir.decode("utf-8")
    with open(os.path.join(tmpDir,"word/document.xml"),"w") as f:
        f.write(xmlString)
    filenames = zip.namelist()
    zipCopyFilename = newFilename
    with zipfile.ZipFile(zipCopyFilename,"w") as docx:
        for filename in filenames:
            docx.write(os.path.join(tmpDir,filename),filename)
    shutil.rmtree(tmpDir)
f=open('test.txt', 'r',)
text=f.read().split("\n")
print text[1]
Pavarde = text[1]
Replace = {"PAVARDE1":Pavarde}
createNewDocx("test.docx",getXml("test.docx",Replace),"test2.docx")
The file is created, but I can't open it.
I get the following error:
Illegal xml character
My guess is that there is something wrong with the encoding, but I can't find a solution.
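One possible cause, not confirmed by anything in the question: if the scraped text contains characters such as & or <, inserting it unescaped leaves word/document.xml malformed. A minimal Python 3 sketch that escapes the replacement text and copies the archive entry by entry; it assumes the document part is UTF-8 encoded, `fill_docx` is a name made up here, and the in-memory "template" is a stand-in rather than a real Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def fill_docx(template, replacements, output):
    """Copy a .docx archive, replacing placeholder text in word/document.xml."""
    with zipfile.ZipFile(template) as src, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                xml_text = data.decode("utf-8")  # assumption: UTF-8 document part
                for key, value in replacements.items():
                    # escape() turns &, < and > into entities so the XML stays well-formed
                    xml_text = xml_text.replace(key, escape(value))
                data = xml_text.encode("utf-8")
            dst.writestr(item, data)


# Stand-in template containing only the part that gets modified.
template = io.BytesIO()
with zipfile.ZipFile(template, "w") as zf:
    zf.writestr("word/document.xml", "<document><t>PAVARDE1</t></document>")
template.seek(0)

filled = io.BytesIO()
fill_docx(template, {"PAVARDE1": "Smith & Sons"}, filled)

# The result still parses as XML and contains the original text.
with zipfile.ZipFile(filled) as zf:
    parsed = ET.fromstring(zf.read("word/document.xml"))
```

If the scraped data also contains control characters, those would need to be stripped separately, since escaping does not make them valid in XML.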