Python - html2text write to file

I have text files which contain HTML tags, which I want to remove using html2text in Python:
import html2text

html = open("textFileWithHtml.txt").read()
print(html2text.html2text(html))
My question is: how can I write the output to a .txt file? (I want to create a new text file without the HTML elements; the file does not previously exist.)

You need to open another file for writing.
import html2text
# Read the HTML, convert it, and write the result to a second file
html_file = open("textFileWithHtml.txt")
html = html_file.read()
w = open("out.txt", "w")
w.write(html2text.html2text(html).encode('utf-8'))
html_file.close()
w.close()

You should open a file and write to it.
import html2text
# Open your file
with open("textFileWithHtml.txt", 'r') as f_html:
html = f_html.read()
# Open a file and write to it
with open('your_file.txt', 'w') as f:
f.write(html2text.html2text(html).encode('utf-8'))
It is good practice to use the with keyword when dealing with file objects; it is more Pythonic, too.
See the tutorial for more on reading and writing files: https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
Edit
If you have issues with encoding, try using .encode('utf-8'); I've added it in my code snippet. Look into Python and Unicode if you have issues regarding this (https://docs.python.org/2/howto/unicode.html).
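Alternatively, a minimal sketch (assuming the input file is UTF-8) that sidesteps manual encoding by opening both files with an explicit encoding; io.open is the built-in open on Python 3 and behaves the same way on Python 2:
import io
import html2text

# Read the HTML with an explicit encoding...
with io.open("textFileWithHtml.txt", "r", encoding="utf-8") as f_html:
    html = f_html.read()

# ...and let the file object encode the text on the way out,
# so no manual .encode() call is needed.
with io.open("out.txt", "w", encoding="utf-8") as f:
    f.write(html2text.html2text(html))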

Related

Open a file (not .txt) with notepad++ and replace a line with string

I'm trying to open a file (it needs to be with Notepad++) and replace a certain line and position with a string.
I saw some examples using the open function with readlines, write, append, replace, ... but the normal Notepad messes up my file. Below is the code I wrote to illustrate:
import subprocess

mac = "F5:FF:60:60:10:FF"  # example
test = subprocess.Popen([r"C:\Program Files (x86)\Notepad++\notepad++.exe", r"C:\Users\mypc\Project\test.axp"])
test.replace[5][10:27](mac)  # replace with mac in line 5, position 10 to 27 (this is what I want to do)
test.close()
Any ideas?
As far as I can tell, .axp might be an XML-formatted file. This could be confirmed if the following prints out anything sensible:
from bs4 import BeautifulSoup

with open('test.axp', 'r') as f:
    file = f.read()

soup = BeautifulSoup(file, 'xml')
print(soup)
pip install beautifulsoup4 lxml
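As for the original goal: if the field really does sit at a fixed position, here's a minimal sketch that edits the file directly with Python, no Notepad++ involved (this assumes the file is plain text and that line 5, columns 10 to 27, hold the value to replace):
mac = "F5:FF:60:60:10:FF"  # example value, 17 characters
path = r"C:\Users\mypc\Project\test.axp"

with open(path, "r") as f:
    lines = f.readlines()

# Line 5 is index 4; replace characters 10 through 26 (slice 10:27)
lines[4] = lines[4][:10] + mac + lines[4][27:]

with open(path, "w") as f:
    f.writelines(lines)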

How to increase version number of a xml file after each change in the file using ETree

I'm trying to manipulate an XML file. I use a loop, and for each iteration I want the version number of the XML file to be increased. For manipulating the XML file I am using ElementTree. Here is what I have tried so far:
def main():
    import xml.etree.ElementTree as ET
    import os
    version = "0"
    while os.path.exists(f"/Users/tt/sumoTracefcdfile_{version}.xml"):
        # use parse() function to load and parse an xml file
        fileDirect = "/Users/tt/sumoTracefcdfile_{version}.xml"
        version = int(version)
        version += 1
        doc = ET.parse(fileDirect)
        .....
        # at the end, after adding some data to the xml file, I do the following
        # to write the changes into the xml file:
        save_path_file = "/Users/tt/sumoTracefcdfile_{version}.xml"
        b_xml = ET.tostring(valeurs)
        with open(save_path_file, "wb") as f:
            f.write(b_xml)
However I get the following error for the line 'doc = ET.parse(fileDirect)':
FileNotFoundError: [Errno 2] No such file or directory:
'/Users/tt/sumoTracefcdfile_{version}.xml'
It looks like you wanted to use f-strings and forgot the f prefix on two lines.
Changing fileDirect = "/Users/tt/sumoTracefcdfile_{version}.xml" to fileDirect = f"/Users/tt/sumoTracefcdfile_{version}.xml", and save_path_file = "/Users/tt/sumoTracefcdfile_{version}.xml" to save_path_file = f"/Users/tt/sumoTracefcdfile_{version}.xml", might solve your issues.
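A sketch of the corrected naming. Note that with the f-strings fixed, the original while loop would keep finding the file it just wrote and never terminate; the sketch below instead advances past the existing versions first, then writes a single new version (it assumes sumoTracefcdfile_0.xml exists, and elides the tree manipulation):
import os
import xml.etree.ElementTree as ET

version = 0
# Advance past the versions that already exist on disk
while os.path.exists(f"/Users/tt/sumoTracefcdfile_{version}.xml"):
    version += 1

# Parse the newest existing version (version - 1)...
fileDirect = f"/Users/tt/sumoTracefcdfile_{version - 1}.xml"
doc = ET.parse(fileDirect)
# ... add data to the tree here ...

# ...and write it back out under the new version number
save_path_file = f"/Users/tt/sumoTracefcdfile_{version}.xml"
doc.write(save_path_file, encoding="utf-8", xml_declaration=True)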

Parsing the file name from list of url links

I am using a script that downloads files from the URLs listed in urls.txt:
import urllib.request

with open("urls.txt", "r") as file:
    linkList = file.readlines()

for link in linkList:
    urllib.request.urlretrieve(link)
Unfortunately they are saved as temporary files due to the lack of a second argument in my urllib.request.urlretrieve call. As there are thousands of links in my text file, naming them separately is not an option. The thing is, the name of the file is contained in those links, i.e. /DocumentXML2XLSDownload.vm?firsttime=true&repengback=true&documentId=XXXXXX&xslFileName=rher2xml.xsl&outputFileName=XXXX_2017_06_25_4.xls, where the name of the file comes after outputFileName=.
Is there an easy way to parse out the file names and then use them as the second argument to urllib.request.urlretrieve? I was thinking of extracting those names in Excel and placing them in another text file that would be read in a similar fashion to urls.txt, but I'm not sure how to implement it in Python. Or is there a way to do it exclusively in Python, without using Excel?
You could parse the link on the go.
Example using a regular expression:
import re
import urllib.request

regexp = r'((?<=\?outputFileName=)|(?<=&outputFileName=))[^&]+'

with open("urls.txt", "r") as file:
    linkList = file.readlines()

for link in linkList:
    match = re.search(regexp, link.rstrip())
    if match is None:
        # Make the user aware that something went wrong, e.g. raise an exception
        # and/or just print something
        print("WARNING: Couldn't find file name in link [" + link + "]. Skipping...")
    else:
        file_name = match.group(0)
        urllib.request.urlretrieve(link, file_name)
You can use urlparse and parse_qs to get the query string (on Python 3 they live in urllib.parse):
from urllib.parse import urlparse, parse_qs

parse = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html?name=Python&version=2')
print(parse_qs(parse.query)['name'][0])  # prints Python
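Applied to the links in the question, a sketch combining both approaches: parse_qs to pull out the name, urlretrieve to download (this assumes every URL in urls.txt is a full URL carrying an outputFileName parameter):
import urllib.request
from urllib.parse import urlparse, parse_qs

with open("urls.txt", "r") as file:
    for link in file:
        link = link.strip()
        query = parse_qs(urlparse(link).query)
        # A KeyError here means the link has no outputFileName parameter
        file_name = query["outputFileName"][0]
        urllib.request.urlretrieve(link, file_name)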

Python not printing beautifulsoup data to .txt file (or I can't find it)

I'm trying to put all the anchor links (hrefs) on a page into a txt file:
print(anchor.get('href'))
with open('file.txt', 'a') as fd:
    fd.write(anchor.get('href') + '\n')
and the script executes with no errors, but I cannot find file.txt anywhere on my computer. Am I missing something really obvious?
With file open mode a (append), the file will be created if it doesn't already exist; otherwise writes will be appended to the end of the file. As written in the question, the file will be created in the current working directory of the running process, so look there.
from bs4 import BeautifulSoup
import requests

response = requests.get('http://httpbin.org/')
soup = BeautifulSoup(response.text, 'html.parser')

with open('file.txt', 'a') as fd:
    # href=True skips any anchors that have no href attribute
    links = (link.get('href') for link in soup.find_all('a', href=True))
    fd.write('\n'.join(links) + '\n')
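If you still can't locate the file, a quick sanity check is to print where the process thinks it is running and where the relative name resolves to:
import os

print(os.getcwd())                  # directory the script is running in
print(os.path.abspath('file.txt'))  # where 'file.txt' resolves to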
Try setting a full path for the filename, so you know where to find it.
Example:
print(anchor.get('href'))
with open('/home/wbk/file.txt', 'a') as fd:
    fd.write(anchor.get('href') + '\n')
And yes, you can use 'a' to create a new file, although this is not good practice. 'a' is for appending data to the end of an existing file, as the docs describe.
The following code works for me:
with open("example.txt","a") as a:
a.write("hello")
Are you sure the file isn't on your computer? How have you verified that? In my case, I work with the Eclipse IDE, so I have to refresh the Eclipse file explorer to see it.
Try this:
with open("prova.txt","a") as a:
a.write("ciao")
try: open("prova.txt","r")
except: print "file doesn't exist"

Python- need to append characters to the beginning and end of each line in text file

I should preface this by saying that I am a complete Python newbie.
I'm trying to create a script that will loop through a directory and its subdirectories looking for text files. When it encounters a text file it will parse the file, convert it to NITF XML, and upload it to an FTP directory.
At this point I am still working on reading the text file into variables so that they can be inserted into the XML document in the right places. An example of the text file is as follows:
Headline
Subhead
By A person
Paragraph text.
And here is the code I have so far:
with open("path/to/textFile.txt") as f:
#content = f.readlines()
head,sub,auth = [f.readline().strip() for i in range(3)]
data=f.read()
pth = os.getcwd()
print head,sub,auth,data,pth
My question is: how do I iterate through the body of the text file (data) and wrap each line in HTML P tags? For example:
<P>line of text in file</P> <P>Next line in text file</P>
Something like
output_format = '<p>{}</p>\n'.format

with open('input') as fin, open('output', 'w') as fout:
    fout.writelines(output_format(line.strip()) for line in fin)
This assumes that you want to write the new content back to the original file:
with open('path/to/textFile.txt') as f:
    content = f.readlines()

with open('path/to/textFile.txt', 'w') as f:
    for line in content:
        f.write('<p>' + line.strip() + '</p>\n')
import shutil

with open('infile') as fin, open('outfile', 'w') as fout:
    for line in fin:
        fout.write('<P>{0}</P>\n'.format(line[:-1]))  # slice off the newline; same as line.rstrip('\n')

# Only do this once you're sure the script works :)
shutil.move('outfile', 'infile')  # replace the input file with the output file
In your case, you should probably replace
data=f.read()
with:
data = '\n'.join("<p>%s</p>" % l.strip() for l in f)
Use data = f.readlines() here, then iterate over data and try something like this:
for line in data:
    line = "<p>" + line.strip() + "</p>"
    # write line + '\n' to a file or do something else
Append the <p> and </p> for each line, e.g.:
data_new = []
data = f.readlines()
for lines in data:
    data_new.append("<p>%s</p>\n" % lines.strip())
You could use the fileinput module to modify one or more files in-place, with optional backup file creation if desired (see its documentation for details). Here it is being used to process one file:
import fileinput

for line in fileinput.input('testinput.txt', inplace=1):
    print('<P>' + line[:-1] + '</P>')
The 'testinput.txt' argument could also be a sequence of two or more file names instead of a single one, which could be especially useful if you're using os.walk() to generate the list of files in the directory and its subdirectories to process (as you probably should be doing).
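A sketch of that combination, gathering text files with os.walk() and feeding them all to fileinput (the 'articles' directory name is a hypothetical stand-in for your top-level directory):
import fileinput
import os

# Gather every .txt file under 'articles/' and its subdirectories
txt_files = [os.path.join(root, name)
             for root, dirs, names in os.walk('articles')
             for name in names
             if name.endswith('.txt')]

# fileinput accepts a sequence of file names; inplace=True redirects
# print() into each file as it is rewritten.
for line in fileinput.input(txt_files, inplace=True):
    print('<P>' + line.rstrip('\n') + '</P>')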
