I am using the difflib library to generate a diff file in HTML format. It works most of the time, but occasionally the generated HTML is malformed. Sometimes the generated HTML is missing part of the content, and sometimes the lines are not in the proper place.
Below is the code I am using for it:
import difflib
import os

try:
    print("Reading file from first file")
    firstfile = open(firstFilePath, "r")
    contentsFirst = firstfile.readlines()
    print("Reading file from second file")
    secondfile = open(secondFilePath, "r")
    contentsSecond = secondfile.readlines()
    print("Creating diff file:")
    config_diff = difflib.HtmlDiff(wrapcolumn=70).make_file(contentsSecond, contentsFirst)
    if not os.path.exists(diff_file_path):
        os.makedirs(diff_file_path)
    final_path = diff_file_path + "/" + diff_file_name + '.html'
    diff_file = open(final_path, 'w')
    diff_file.write(config_diff)
    print("Diff file is generated :")
except Exception as error:
    print("Exception occurred in create_diff_file " + str(error))
    raise Exception(str(error))
This piece of code is called from a threaded program. With a retry I eventually get the desired result, but I don't know the reason for the malformed and inconsistent diff file. It would be helpful if someone could point out the actual cause and propose a solution.
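For what it's worth, the snippet never closes diff_file (or the two input files), so in a threaded program the write buffer may not be flushed before the HTML is read or copied elsewhere, which could explain truncated or malformed output. A minimal sketch using context managers (assuming the same variable names as above) would be:

import difflib
import os

# a sketch, not the asker's exact code: same inputs, but files are closed via "with"
with open(firstFilePath, "r") as firstfile, open(secondFilePath, "r") as secondfile:
    contentsFirst = firstfile.readlines()
    contentsSecond = secondfile.readlines()

config_diff = difflib.HtmlDiff(wrapcolumn=70).make_file(contentsSecond, contentsFirst)

os.makedirs(diff_file_path, exist_ok=True)  # exist_ok avoids a race between threads
final_path = diff_file_path + "/" + diff_file_name + '.html'
with open(final_path, 'w') as diff_file:
    diff_file.write(config_diff)  # flushed and closed when the "with" block exits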
I have written a script in which the input URLs and their filenames are hardcoded. Instead, I want to read the URLs from a saved text file and save the downloads, named automatically in chronological order, to a specific folder.
My code (works):
import requests

# input urls and filenames
urls = ['https://www.northwestknowledge.net/metdata/data/pr_1979.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1980.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1981.nc']
fns = [r'C:\Users\HBI8\Downloads\pr_1979.nc',
       r'C:\Users\HBI8\Downloads\pr_1980.nc',
       r'C:\Users\HBI8\Downloads\pr_1981.nc']

# defining the inputs
inputs = zip(urls, fns)

# define download function
def download_url(args):
    url, fn = args[0], args[1]
    try:
        r = requests.get(url)
        with open(fn, 'wb') as f:
            f.write(r.content)
    except Exception as e:
        print('Failed:', e)

# loop through all inputs and run download function
for i in inputs:
    result = download_url(i)
Trying to fetch the links from text (error in code):
import requests

# getting all URLs from textfile
file = open('C:\\Users\\HBI8\\Downloads\\testing.txt', 'r')
#for each_url in enumerate(f):
list_of_urls = [(line.strip()).split() for line in file]
file.close()

# input urls and filenames
urls = list_of_urls
fns = [r'C:\Users\HBI8\Downloads\pr_1979.nc',
       r'C:\Users\HBI8\Downloads\pr_1980.nc',
       r'C:\Users\HBI8\Downloads\pr_1981.nc']

# defining the inputs
inputs = zip(urls, fns)

# define download function
def download_url(args):
    url, fn = args[0], args[1]
    try:
        r = requests.get(url)
        with open(fn, 'wb') as f:
            f.write(r.content)
    except Exception as e:
        print('Failed:', e)

# loop through all inputs and run download function
for i in inputs:
    result = download_url(i)
testing.txt has those 3 links in it, one per line.
Error :
Failed: No connection adapters were found for "['https://www.northwestknowledge.net/metdata/data/pr_1979.nc']"
Failed: No connection adapters were found for "['https://www.northwestknowledge.net/metdata/data/pr_1980.nc']"
Failed: No connection adapters were found for "['https://www.northwestknowledge.net/metdata/data/pr_1981.nc']"
PS:
I am new to Python, and it would be helpful if someone could advise me on how to loop through the files listed in a text file and save them individually, in chronological order, as opposed to hardcoding the names (as I have done).
When you do list_of_urls = [(line.strip()).split() for line in file], you produce a list of lists: for each line of the file, you produce the list of URLs on that line, and then you make a list of those lists.
What you want is a flat list of URLs.
You could do
list_of_urls = [url for line in file for url in (line.strip()).split()]
Or:
list_of_urls = []
for line in file:
    list_of_urls.extend((line.strip()).split())
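To address the automatic-naming part of the question as well, one option (a sketch, not tested against the asker's folder layout) is to derive each filename from the last path segment of the URL, so the list of filenames no longer has to be hardcoded:

import os
import requests

download_dir = r'C:\Users\HBI8\Downloads'  # assumed target folder from the question

with open(r'C:\Users\HBI8\Downloads\testing.txt') as f:
    list_of_urls = [url for line in f for url in line.strip().split()]

for url in list_of_urls:
    fn = os.path.join(download_dir, url.split('/')[-1])  # e.g. pr_1979.nc
    r = requests.get(url)
    with open(fn, 'wb') as out:
        out.write(r.content)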
By far the simplest method in this simple case is to use OS commands:
Go to the working directory C:\Users\HBI8\Downloads.
Invoke cmd (you can simply type that in the Explorer address bar).
Write/paste your list using notepad testing.txt (if you don't already have the file there).
Note: these .nc (NetCDF/HDF) files are not .pdf files.
https://www.northwestknowledge.net/metdata/data/pr_1979.nc
https://www.northwestknowledge.net/metdata/data/pr_1980.nc
https://www.northwestknowledge.net/metdata/data/pr_1981.nc
then run
for /F %i in (testing.txt) do curl -O %i
About 92 seconds later, the three files are downloaded.
I used ',' as a delimiter and split each line on it.
To give the files automated names, I used the index positions of the resulting list.
The data is saved in the txt file in the following manner:
FileName | Object ID | Base URL
url_file = open('C:\\Users\\HBI8\\Downloads\\testing.txt', 'r')

fns = []
list_of_urls = []
for line in url_file:
    stripped_line = line.split(',')
    print(stripped_line)
    list_of_urls.append(stripped_line[2] + stripped_line[1])
    fns.append(stripped_line[0])

url_file.close()
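Presumably (this is a sketch of how the two lists are then consumed, reusing the download_url function from the question) the collected URLs and file names feed into the same zip/download loop as before:

# assumes download_url() from the question and the fns / list_of_urls built above
inputs = zip(list_of_urls, fns)
for i in inputs:
    download_url(i)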
I am downloading multiple PDFs. I have a list of URLs and the code is written to download them and also create one big PDF containing them all. The code works for the first 144 PDFs, then it throws this error:
PdfReadError: EOF marker not found
I've tried making all the PDFs end in %%EOF, but that doesn't work; it still reaches the same point and then I get the error again.
Here's my code:
# my file and converting to list for python to read each separately
with open('minutelinks.txt', 'r') as file:
    data = file.read()
    links = data.split()

# download pdfs
from PyPDF2 import PdfFileMerger
import requests

urls = links

merger = PdfFileMerger()

for url in urls:
    response = requests.get(url)
    title = url.split("/")[-1]
    with open(title, 'wb') as f:
        f.write(response.content)
    merger.append(title)

merger.write("allminues.pdf")
merger.close()
I want to be able to download all of them and create one big PDF, which it appears to do until it throws this error. I have about 750 PDFs and it only gets to 144.
This is how I changed my code so it now downloads all of the PDFs and skips the one (or more) that may be corrupted. I also had to add the self argument to the function.
from PyPDF2 import PdfFileMerger
import requests
import sys
# note: PdfReadError is not imported here; depending on the PyPDF2 version it lives in
# PyPDF2.errors (newer releases) or PyPDF2.utils (older ones)

urls = links

def download_pdfs(self):
    merger = PdfFileMerger()

    for url in urls:
        try:
            response = requests.get(url)
            title = url.split("/")[-1]
            with open(title, 'wb') as f:
                f.write(response.content)
        except PdfReadError:
            print(title)
            sys.exit()
        merger.append(title)

    merger.write("allminues.pdf")
    merger.close()
The end-of-file marker '%%EOF' is meant to be the very last line; it is how the PDF parser knows that the PDF document ends there.
My solution is to force this marker to stay at the end:
import PyPDF4

def reset_eof(self, pdf_file):
    with open(pdf_file, 'rb') as p:
        txt = p.readlines()
    # find the last line that contains the %%EOF marker
    for i, x in enumerate(txt[::-1]):
        if b'%%EOF' in x:
            actual_line = len(txt) - i - 1
            break
    # keep everything before that line and append a clean %%EOF as the final line
    txtx = txt[:actual_line] + [b'%%EOF']
    with open(pdf_file, 'wb') as f:
        f.writelines(txtx)
    return PyPDF4.PdfFileReader(pdf_file)
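Presumably (a sketch, assuming the function above lives on the same class as the download code) it would be called on each downloaded file before handing it to the merger:

# hypothetical usage inside the download loop from the question:
# rewrite the file so it ends cleanly at %%EOF, then append it by path as before
self.reset_eof(title)
merger.append(title)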
I read that EOF is a kind of tag included in PDF files (link in Portuguese).
However, I guess some kinds of PDF files do not have the 'EOF marker', and PyPDF2 does not recognize those.
So, what I did to fix "PdfReadError: EOF marker not found" was to open my PDF with Google Chrome and print it as .pdf once more, so that the file is converted to .pdf by Chrome, hopefully with the EOF marker.
I ran my script with the new .pdf file converted by Chrome and it worked fine.
I have been learning and practicing Python, and while doing so I ran into an error in my program that I'm unable to resolve. I want to return a list of the rows retrieved from a CSV file. I tried the code below and it gives me an error.
import csv

def returnTheRowsInTheFile(fileName):
    READ = 'r'
    listOfRows = []
    try:
        with open(fileName, READ) as myFile:
            listOfRows = csv.reader(myFile)
            return listOfRows
    except FileNotFoundError:
        print('The file ' + fileName + ' is not found')
    except:
        print('Something went wrong')
    finally:
        #myFile.close()
        print()

def main():
    fullString = returnTheRowsInTheFile('ABBREVATIONS.CSV')
    for eachRow in fullString:
        print(eachRow)
    return

main()
And the error is
Traceback (most recent call last):
  File "C:\Users\santo\workspace\PyProject\hello\FinalChallenge.py", line 36, in <module>
    main()
  File "C:\Users\santo\workspace\PyProject\hello\FinalChallenge.py", line 32, in main
    for eachRow in fullString:
ValueError: I/O operation on closed file.
The easy way to solve this problem is to return a list from your function. I know you assigned listOfRows = [] but this was overwritten when you did listOfRows = csv.reader(myFile).
So, the easy solution is:
def returnTheRowsInTheFile(fileName):
    READ = 'r'
    try:
        with open(fileName, READ) as myFile:
            listOfRows = csv.reader(myFile)
            return list(listOfRows)  # convert to a list
    except FileNotFoundError:
        print('The file ' + fileName + ' is not found')
    except:
        print('Something went wrong')
You should also read PEP 8, the style guide for Python, to understand how to name your variables and functions.
When you use with open, the file is closed when the context ends. Now listOfRows is the return type of csv.reader, and so then is fullString (not a list). When you try to iterate over it, it iterates over the underlying file object, which is already closed.
As JulienD already pointed out, the file is already closed when you try to read the rows from it. You can get rid of this exception using this, for example:
with open(fileName, READ) as myFile:
    listOfRows = csv.reader(myFile)
    for row in listOfRows:
        yield row
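With that change returnTheRowsInTheFile becomes a generator, so the file stays open while the rows are being consumed. A hypothetical caller (matching main() from the question) would use it unchanged:

# rows are read lazily while the file is still open inside the generator
for eachRow in returnTheRowsInTheFile('ABBREVATIONS.CSV'):
    print(eachRow)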
UPDATE
By the way, the way you handle exceptions makes it pretty hard to debug. I'd suggest something like this:
except Exception as e:
    print('Something went wrong: "%s"' % e)
This way you can at least see the error message.
Hi, I'm trying to download an Excel file from this URL (http://www.sicom.gov.co/precios/controller?accion=ExportToExcel) and then I need to parse it using xlrd.
The problem is that when I put that URL in a browser I get an Excel file of more or less 2 MB, but when I download the file using urllib2, httplib2 or even curl from the command line I only get a 4 kB file, and obviously parsing that incomplete file fails miserably.
Strangely enough, xlrd seems to be able to read the correct sheet name from the downloaded file, so I'm guessing the file is the right one, but it is clearly incomplete.
Here is some sample code of what I'm trying to achieve
import urllib2
from xlrd import open_workbook

excel_url = 'http://www.sicom.gov.co/precios/controller?accion=ExportToExcel'
result = urllib2.urlopen(excel_url)
wb = open_workbook(file_contents=result.read())
response = ""
for s in wb.sheets():
    response += 'Sheet:' + s.name + '<br>'
    for row in range(s.nrows):
        values = []
        for col in range(s.ncols):
            value = s.cell(row, col).value
            if (value):
                values.append(str(value) + " not empty")
            else:
                values.append("Value at " + str(col) + ", " + str(row) + " was empty")
        response += str(values) + "<br>"
You have to call the first URL first; it seems to set a cookie or something like that. Then call the second one to download the Excel file. For that kind of job, you should prefer requests (http://docs.python-requests.org/en/latest/#), because it's much easier to use than the standard library tools and it handles special cases (like cookies) much better by default.
import requests

s = requests.Session()
s.get('http://www.sicom.gov.co/precios/controller?accion=Home&option=SEARCH_PRECE')
response = s.get('http://www.sicom.gov.co/precios/controller?accion=ExportToExcel')
with open('out.xls', 'wb') as f:
    f.write(response.content)
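From there, the bytes in response.content could presumably be handed straight to xlrd, as in the question's original code, without going through a file on disk (a sketch reusing the question's imports):

from xlrd import open_workbook

# parse the downloaded bytes directly
wb = open_workbook(file_contents=response.content)
for s in wb.sheets():
    print('Sheet: %s (%d rows x %d cols)' % (s.name, s.nrows, s.ncols))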
I'm trying to create a script which makes requests to random URLs from a txt file, e.g.:
import urllib2

with open('urls.txt') as urls:
    for url in urls:
        try:
            r = urllib2.urlopen(url)
        except urllib2.URLError as e:
            r = e
        if r.code in (200, 401):
            print '[{}]: '.format(url), "Up!"
But I want that when some URL returns 404 Not Found, the line containing that URL is erased from the file. There is one unique URL per line, so basically the goal is to erase every URL (and its corresponding line) that returns 404 Not Found.
How can I accomplish this?
You could simply save all the URLs that worked, and then rewrite them to the file:
good_urls = []
with open('urls.txt') as urls:
    for url in urls:
        try:
            r = urllib2.urlopen(url)
        except urllib2.URLError as e:
            r = e
        if r.code in (200, 401):
            print '[{}]: '.format(url), "Up!"
            good_urls.append(url)
with open('urls.txt', 'w') as urls:
    urls.write("".join(good_urls))
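Note that as written this keeps only the URLs that answered 200 or 401 and drops everything else (timeouts, 5xx errors, and so on), while the question only asks to remove 404s. A minimal variant that drops only confirmed 404s (a sketch, untested):

good_urls = []
with open('urls.txt') as urls:
    for url in urls:
        try:
            r = urllib2.urlopen(url)
        except urllib2.URLError as e:
            r = e
        # keep the line unless the server explicitly answered 404
        if getattr(r, 'code', None) != 404:
            good_urls.append(url)
with open('urls.txt', 'w') as urls:
    urls.write("".join(good_urls))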
The easiest way is to read all the lines, loop over the saved lines and try to open them, and then when you are done, if any URLs failed you rewrite the file.
The way to rewrite the file is to write a new file, and then when the new file is successfully written and closed, then you use os.rename() to change the name of the new file to the name of the old file, overwriting the old file. This is the safe way to do it; you never overwrite the good file until you know you have the new file correctly written.
I think the simplest way to do this is just to create a list where you collect the good URLs, plus have a count of failed URLs. If the count is not zero, you need to rewrite the text file. Or, you can collect the bad URLs in another list. I did that in this example code. (I haven't tested this code but I think it should work.)
import os
import urllib2

input_file = "urls.txt"
debug = True

good_urls = []
bad_urls = []

bad, good = range(2)

def track(url, good_flag, code):
    if good_flag == good:
        good_str = "good"
    elif good_flag == bad:
        good_str = "bad"
    else:
        good_str = "ERROR! (" + repr(good) + ")"
    if debug:
        print("DEBUG: %s: '%s' code %s" % (good_str, url, repr(code)))
    if good_flag == good:
        good_urls.append(url)
    else:
        bad_urls.append(url)

with open(input_file) as f:
    for line in f:
        url = line.strip()
        try:
            r = urllib2.urlopen(url)
            if r.code in (200, 401):
                print '[{0}]: '.format(url), "Up!"
            if r.code == 404:
                # URL is bad if it is missing (code 404)
                track(url, bad, r.code)
            else:
                # any code other than 404, assume URL is good
                track(url, good, r.code)
        except urllib2.URLError as e:
            track(url, bad, "exception!")

# if any URLs were bad, rewrite the input file to remove them.
if bad_urls:
    # simple way to get a filename for temp file: append ".tmp" to filename
    temp_file = input_file + ".tmp"
    with open(temp_file, "w") as f:
        for url in good_urls:
            f.write(url + '\n')
    # if we reach this point, temp file is good. Remove old input file
    os.remove(input_file)  # only needed for Windows
    os.rename(temp_file, input_file)  # replace original input file with temp file
EDIT: In comments, @abarnert suggests that there might be a problem with using os.rename() on Windows (at least I think that is what he/she means). If os.rename() doesn't work, you should be able to use shutil.move() instead.
EDIT: Rewrite code to handle errors.
EDIT: Rewrite to add verbose messages as URLs are tracked. This should help with debugging. Also, I actually tested this version and it works for me.
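As the first EDIT mentions, if os.rename() is a problem on Windows, the replacement step could presumably be written with shutil.move() instead; a sketch of just that last part:

import shutil

# if any URLs were bad, replace the input file with the rewritten temp file
if bad_urls:
    temp_file = input_file + ".tmp"
    with open(temp_file, "w") as f:
        for url in good_urls:
            f.write(url + '\n')
    # shutil.move() falls back to copy-and-delete when a plain rename fails,
    # so the explicit os.remove() step is not needed here
    shutil.move(temp_file, input_file)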