Read in Python code/regular expressions from external file - python

I have a section of Python (Sigil) code:
for (id, href) in bk.text_iter():
html = bk.readfile(id)
html = re.sub(r'<title></title>', '<title>Mara’s Tale</title>', html)
html = re.sub(r'<p>Mara’s Tale</p>', '<p class="title">Mara’s Tale</p>',html)
bk.writefile(id, html)
Ideally, I'd like to read the regular expressions in from an external text-file (or just read in that block of code). Any suggestions? I've done similar in Perl with a try, but I'm a Python-novice.
Also, quick supplementary question - shouldn't bk.writefile be indented? And, if so, why is my code working? It looks as though it's outside the for block, and therefore will only write to the final file, if that (it's an epub, so there are several html files), but it's updating all relevant files.
Regarding bk, my understanding is that this object is the whole epub, and what this code is doing is reading each html file that makes up an epub via text_iter, so id is each individual file.
EDIT TO ADD
Ah! That bk.writefile should indeed be indented. I got away with it because, at the point I run this code, I only have a single html file.

As for the reading something from a file - it's easy. Assume you have the file 'my_file.txt' in the same folder where the script is saved:
f = open('my_file.txt', 'r')
content = f.read() # read all content of the file in the sting 'content'
lines = f.read().splitlines() # read lines of the file in array 'lines'
f.close()
print(lines[0]) # first line
print(lines[1]) # second line
# etc
As for shouldn't bk.writefile be indented? Yep, it seems the loop makes and changes the variable html for several times but saves only the last iteration. It looks weird. Perhaps it should be indented. But it's just a guess.

Related

Trying to compare two text files using python

I have a code in python that creates and compares the content of two text files and tells me when they are different:
data=str(information)
#this creates the first file, the one used as a control group of sorts
f=open("text1.txt", "w+")
f.write(data)
f.close()
while True:
#the other file keeps updating, so it's inside a loop
data2=str(newinfo)
f=open("text2.txt", "w+")
f.write(data2)
f.close()
#I'm guessing the error is probably here
read = str(open("text1.txt", "r"))
read2 = str(open("text2.txt", "r"))
if read2 != read:
Notifica = True
break
Both data and data2 are the html from a website I'm reading with BeautifulSoup, that part is working.
However the program keeps thinking the the two text files are different even when they are exactly the same. I think I'm doing this the wrong way, any help?
the open() function returns an object, not the content of the file.
you are comparing two references to different file objects.
you should read the contents of the file and then compare it.
you should do:
read = open("text1.txt", "r").read()
read2 = open("text2.txt", "r").read()
Since HTML file can contain a Huge number of characters your code can use lot of resources. A good way (might be the most use) is to use MD5 Hash.
The MD5 algorithm is deprecated for cryptographic purpose but should be enough for your aim.

Searching for and manipulating the content of a keyword in a huge file

I have a huge HTML file that I have converted to text file. (The file is Facebook home page's source). Assume the text file has a specific keyword in some places of it. For example: "some_keyword: [bla bla]". How would I print all the different bla blas that are followed by some_keyword?
{id:"1126830890",name:"Hillary Clinton",firstName:"Hillary"}
Imagine there are 50 different names with this format in the page. How would I print all the names followed by "name:", considering the text is very large and crashes when you read() it or try to search through its lines.
Sample File:
shortProfiles:{"100000094503825":{id:"100000094503825",name:"Bla blah",firstName:"Blah",vanity:"blah",thumbSrc:"https://scontent-lax3-1.xx.fbcdn.net/v/t1.0-1/c19.0.64.64/p64x64/10354686_10150004552801856_220367501106153455_n.jpg?oh=3b26bb13129d4f9a482d9c4115b9eeb2&oe=5883062B",uri:"https://www.facebook.com/blah",gender:2,i18nGender:16777216,type:"friend",is_friend:true,mThumbSrcSmall:null,mThumbSrcLarge:null,dir:null,searchTokens:["Bla"],alternateName:"",is_nonfriend_messenger_contact:false},"1347968857":
Based on your comment, since you are the person responsible for writting the data to the file. Write the data in JSON format and read it from file using json.loads() as:
import json
json_file = open('/path/to/your_file')
json_str = json_file.read()
json_data = json.loads(json_str)
for item in json_data:
print item['name']
Explanation:
Lets say data is the variable storing
{id:"1126830890",name:"Hillary Clinton",firstName:"Hillary"}
which will be dynamically changing within your code where you are performing write operation in the file. Instead append it to the list as:
a = []
for item in page_content:
# data = some xy logic on HTML file
a.append(data)
Now write this list to the file using: json.dump()
I just wanted to throw this out there even though I agree with all the comments about just dealing with the html directly or using Facebook's API (probably the safest way), but open file objects in Python can be used as a generator yielding lines without reading the entire file into memory and the re module can be used to extract information from text.
This can be done like so:
import re
regex = re.compile(r"(?:some_keyword:\s\[)(.*?)\]")
with open("filename.txt", "r") as fp:
for line in fp:
for match in regex.findall(line):
print(match)
Of course this only works if the file is in a "line-based" format, but the end effect is that only the line you are on is loaded into memory at any one time.
here is the Python 2 docs for the re module
here is the Python 3 docs for the re module
I cannot find documentation which details the generator capabilities of file objects in Python, it seems to be one of those well-known secrets...Please feel free to edit and remove this paragraph if you know where in the Python docs this is detailed.

write into file without reading all the file

I have a template of a file (html) with the header and footer. I try to insert text into right after <trbody>.
The way i'm doing it right now is with fileinput.input()
def write_to_html(self,path):
for line in fileinput.input(path, inplace=1):
line = re.sub(r'CURRENT_APPLICATION', obj, line)
line = re.sub(r'IN_PROGRESS', time.strftime("%Y-%m-%d %H:%M:%S"), line)
line = re.sub(r'CURRENT_VERSION', svers, line)
print line, # preserve old content
if "<tbody>" in line:
print ("<tr>")
###PRINT MY STUFFS
print ("</tr>")
I call this for each Table-line I have to add in my html table. but I have around 5k table-lines to add (each line is about 30 lines of hmtl code). It starts fast, but each line takes more and more times to be added. It's because it has to write the file all over again for each line right ?
Is there a way to speed up the process?
EDIT thanks for the responses :
I like the idee of creating my big string, and the just go through the file just once.
I'll have to change some stuff because right now because the function I showed is in a Classe. and in my main programe, I just iterate on a folder containing .json.
for json in jsonfolder :
Object_a = CLASS-A(json) #unserialization
Object_a.write_to_html() (the function i showed)
I should turn that into :
block_of_lines=''
for json in jsonfolder :
Object_a = CLASS-A(json) #unserialization
block_of_line += Object_a.to_html_sting()
Create_html(block_of_line)
Would that be faster ?
Re-reading the question a couple more times, the following thought occurs.
Could you split the writing into 3 blocks - one for the header, one for the table lines and another for the footer. It does rather seem to depend on what those three substitution lines are doing, but if I'm right, they can only update lines the first time the template is used, ie. while acting on the first json file, and then remain unchanged for the others.
file_footer = CLASS-A.write_html_header(path)
for json in jsonfolder :
Object_a = CLASS-A(json) #unserialization
Object_a.write_to_html(path) #use the part of the function
# that just handles the json file here
CLASS-A.write_html_footer(path, footer)
Then in your class, define the two new functions to write the header and footer as static methods (which means they can be used from the class rather than just on an instance)
i.e. (using a copy from your own code)
#staticmethod
def write_html_header(path):
footer = []
save_for_later = false
for line in fileinput.input(path, inplace=1):
line = re.sub(r'CURRENT_APPLICATION', obj, line)
line = re.sub(r'IN_PROGRESS', time.strftime("%Y-%m-%d %H:%M:%S"), line)
line = re.sub(r'CURRENT_VERSION', svers, line)
# this blocks prints the header, and saves the
# footer from your template.
if save_for_later:
footer.append(line)
else:
print line, # preserve old content
if "<tbody>" in line:
save_for_later = true
return footer
I do wonder why you're editing 'inplace' doesn't that mean the template get's overwritten, and thus it's less of a template and more of a single use form. Normally when I use a template, I read in from the template, and write out to a new file an edited version of the template. Thus the template can be re-used time and time again.
For the footer section, open your file in append mode, and then write the lines in the footer array created by the call to the header writing function.
I do think not editing the template in place would be of benefit to you. then you'd just need to :
open the template (read only)
open the new_file (in new, write mode)
write the header into new_file
loop over json files
append table content into new_file
append the footer into new_file
That way you're never re-reading the bits of the file you created while looping over the json files. Nor are you trying to store the whole file in memory if that is a concern.
5000 lines is nothing. Read the entire file using f.readlines() to get a list of lines:
with open(path) as f:
lines = f.readlines()
Then process each line, and eventually join them to one string and write the entire thing back to the file.

How can I extract a text from a bytes file using python

I am trying to code a script that gets the code of a website, saves all html in a file and after that extracts some information.
For the moment I´ve done the first part, I've saved all html into a text file.
Now I have to extract the relevant information and then save it in another text file.
But I'm having problems with encoding and also I don´t know very well how to extract the text in python.
Parsing a website:
import urllib.request
file name to store the data
file_name = r'D:\scripts\datos.txt'
I want to get the text that goes after this tag <p class="item-description"> and before this other one </p>
tag_starts_with = '<p class="item-description">'
tag_ends_with = '</p>'
I get the website code and I save it into a text file
with urllib.request.urlopen("http://www.website.com/") as response, open(file_name, 'wb') as out_file:
data = response.read()
out_file.write(data)
print (out_file) # First question how can I print the file? Gives me an error, I can´t print bytes
the file is now full of html text so I want to open it and process it
file_for_results = open(r'D:\scripts\datos.txt',encoding="utf8")
Extract information from the file
second question how to do a substring of the lines that contain the file and get the text between p class="item-description" and
/p so i can store in file_for_results
Here is the pseudocode that I'm not capable to code.
for line in file_to_filter:
if line contains word_starts_with
copy in file_for_results until you find </p>
I am assuming this is an assignment of some sort, where you need to parse the html given an algorithm, if not just use Beautiful Soup.
The pseudocode actually translates to python code quite easily:
file_to_filter = open("file.html", 'r')
out_file = open("text_output",'w')
for line in file_to_filter:
if word_starts_with in line:
print(line, end='', file=out_file) # Store data in another file
if word_ends_with in line:
break
And of course you need to close the files, make sure you remove the tags and so on, but this is roughly what your code should be given this algorithm.

How to update/modify an XML file in python?

I have an XML document that I would like to update after it already contains data.
I thought about opening the XML file in "a" (append) mode. The problem is that the new data will be written after the root closing tag.
How can I delete the last line of a file, then start writing data from that point, and then close the root tag?
Of course I could read the whole file and do some string manipulations, but I don't think that's the best idea..
Using ElementTree:
import xml.etree.ElementTree
# Open original file
et = xml.etree.ElementTree.parse('file.xml')
# Append new tag: <a x='1' y='abc'>body text</a>
new_tag = xml.etree.ElementTree.SubElement(et.getroot(), 'a')
new_tag.text = 'body text'
new_tag.attrib['x'] = '1' # must be str; cannot be an int
new_tag.attrib['y'] = 'abc'
# Write back to file
#et.write('file.xml')
et.write('file_new.xml')
note: output written to file_new.xml for you to experiment, writing back to file.xml will replace the old content.
IMPORTANT: the ElementTree library stores attributes in a dict, as such, the order in which these attributes are listed in the xml text will NOT be preserved. Instead, they will be output in alphabetical order.
(also, comments are removed. I'm finding this rather annoying)
ie: the xml input text <b y='xxx' x='2'>some body</b> will be output as <b x='2' y='xxx'>some body</b>(after alphabetising the order parameters are defined)
This means when committing the original, and changed files to a revision control system (such as SVN, CSV, ClearCase, etc), a diff between the 2 files may not look pretty.
Useful Python XML parsers:
Minidom - functional but limited
ElementTree - decent performance, more functionality
lxml - high-performance in most cases, high functionality including real xpath support
Any of those is better than trying to update the XML file as strings of text.
What that means to you:
Open your file with an XML parser of your choice, find the node you're interested in, replace the value, serialize the file back out.
The quick and easy way, which you definitely should not do (see below), is to read the whole file into a list of strings using readlines(). I write this in case the quick and easy solution is what you're looking for.
Just open the file using open(), then call the readlines() method. What you'll get is a list of all the strings in the file. Now, you can easily add strings before the last element (just add to the list one element before the last). Finally, you can write these back to the file using writelines().
An example might help:
my_file = open(filename, "r")
lines_of_file = my_file.readlines()
lines_of_file.insert(-1, "This line is added one before the last line")
my_file.writelines(lines_of_file)
The reason you shouldn't be doing this is because, unless you are doing something very quick n' dirty, you should be using an XML parser. This is a library that allows you to work with XML intelligently, using concepts like DOM, trees, and nodes. This is not only the proper way to work with XML, it is also the standard way, making your code both more portable, and easier for other programmers to understand.
Tim's answer mentioned checking out xml.dom.minidom for this purpose, which I think would be a great idea.
While I agree with Tim and Oben Sonne that you should use an XML library, there are ways to still manipulate it as a simple string object.
I likely would not try to use a single file pointer for what you are describing, and instead read the file into memory, edit it, then write it out.:
inFile = open('file.xml', 'r')
data = inFile.readlines()
inFile.close()
# some manipulation on `data`
outFile = open('file.xml', 'w')
outFile.writelines(data)
outFile.close()
For the modification, you could use tag.text from xml. Here is snippet:
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
for rank in root.iter('rank'):
new_rank = int(rank.text) + 1
rank.text = str(new_rank)
tree.write('output.xml')
The rank in the code is example of tag, which depending on your XML file contents.
What you really want to do is use an XML parser and append the new elements with the API provided.
Then simply overwrite the file.
The easiest to use would probably be a DOM parser like the one below:
http://docs.python.org/library/xml.dom.minidom.html
To make this process more robust, you could consider using the SAX parser (that way you don't have to hold the whole file in memory), read & write till the end of tree and then start appending.
You should read the XML file using specific XML modules. That way you can edit the XML document in memory and rewrite your changed XML document into the file.
Here is a quick start: http://docs.python.org/library/xml.dom.minidom.html
There are a lot of other XML utilities, which one is best depends on the nature of your XML file and in which way you want to edit it.
As Edan Maor explained, the quick and dirty way to do it (for [utc-16] encoded .xml files), which you should not do for the resons Edam Maor explained, can done with the following python 2.7 code in case time constraints do not allow you to learn (propper) XML parses.
Assuming you want to:
Delete the last line in the original xml file.
Add a line
substitute a line
Close the root tag.
It worked in python 2.7 modifying an .xml file named "b.xml" located in folder "a", where "a" was located in the "working folder" of python. It outputs the new modified file as "c.xml" in folder "a", without yielding encoding errors (for me) in further use outside of python 2.7.
pattern = '<Author>'
subst = ' <Author>' + domain + '\\' + user_name + '</Author>'
line_index =0 #set line count to 0 before starting
file = io.open('a/b.xml', 'r', encoding='utf-16')
lines = file.readlines()
outFile = open('a/c.xml', 'w')
for line in lines[0:len(lines)]:
line_index =line_index +1
if line_index == len(lines):
#1. & 2. delete last line and adding another line in its place not writing it
outFile.writelines("Write extra line here" + '\n')
# 4. Close root tag:
outFile.writelines("</phonebook>") # as in:
#http://tizag.com/xmlTutorial/xmldocument.php
else:
#3. Substitue a line if it finds the following substring in a line:
pattern = '<Author>'
subst = ' <Author>' + domain + '\\' + user_name + '</Author>'
if pattern in line:
line = subst
print line
outFile.writelines(line)#just writing/copying all the lines from the original xml except for the last.

Categories

Resources