MD5 Hashing HTML Giving 2 Different Results - Python

Can someone explain why this is happening? If I scrape HTML from a site using the requests module and use hashlib to compute the MD5 checksum, I get one answer. But if I save the HTML to a file, read it back, and compute the same MD5 checksum, I get a different result.
import requests
import hashlib

resp = requests.post("http://casesearch.courts.state.md.us/", timeout=120)
html = resp.text
print("CheckSum 1: " + hashlib.md5(html.encode('utf-8')).hexdigest())

f = open("test.html", "w+")
f.write(html)
f.close()

with open('test.html', "r", encoding='utf-8') as f:
    html2 = f.read()

print("CheckSum 2: " + hashlib.md5(html2.encode('utf-8')).hexdigest())
The results look like:
CheckSum 1: e0b253903327c7f68a752c6922d8b47a
CheckSum 2: 3aaf94e0df9f1298d61830d99549ddb0

When reading from a file in text mode, Python may translate newline characters, depending on the value of the newline argument passed to open. The documentation says:
When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated.
This difference will affect the generated hash value.
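One way to make the two checksums agree is to avoid the decode/encode round trip and newline translation entirely: hash the raw response bytes, write them in binary mode, and read them back in binary mode. A minimal sketch, reusing the URL and filename from the question:

import requests
import hashlib

resp = requests.post("http://casesearch.courts.state.md.us/", timeout=120)

# Hash the raw bytes exactly as received from the server.
checksum1 = hashlib.md5(resp.content).hexdigest()

# Binary mode writes and reads the bytes untouched: no newline translation,
# no implicit encoding.
with open("test.html", "wb") as f:
    f.write(resp.content)
with open("test.html", "rb") as f:
    checksum2 = hashlib.md5(f.read()).hexdigest()

print(checksum1 == checksum2)  # True

If you need to stay in text mode, pass newline='' and an explicit encoding='utf-8' to both open() calls so the characters round-trip unchanged.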

Related

How to determine how a (downloaded) byte string is encoded in Python?

I am trying to download a file and write it to disk, but somehow I am lost in encoding/decoding land.
from urllib.request import urlopen

url = "http://export.arxiv.org/e-print/supr-con/9608001"
with urlopen(url) as response:
    data = response.read()

filename = 'test.txt'
file_ = open(filename, 'wb')
file_.write(data)
file_.close()
Here, data is a byte string. If I open the file, I see a bunch of strange characters. I tried
import chardet
the_encoding = chardet.detect(data)['encoding']
but the detected encoding is None, so I still don't know how the data I downloaded is encoded.
If I just type "http://export.arxiv.org/e-print/supr-con/9608001" into the browser, it downloads a file that I can view with a text editor and it's a perfectly fine .tex file.
Use the python-magic library.
python-magic is a Python interface to the libmagic file type identification library. libmagic identifies file types by checking their headers according to a predefined list of file types. This functionality is exposed to the command line by the Unix command file.
Commented script (works on Windows 10, Python 3.8.6):
# stage #1: read raw data from a url
from urllib.request import urlopen
import gzip

url = "http://export.arxiv.org/e-print/supr-con/9608001"
with urlopen(url) as response:
    rawdata = response.read()

# stage #2: detect raw data type by its signature
print("file signature", rawdata[0:2])
import magic
print(magic.from_buffer(rawdata[0:1024]))

# stage #3: decompress raw data and write to a file
data = gzip.decompress(rawdata)
filename = 'test.tex'
file_ = open(filename, 'wb')
file_.write(data)
file_.close()

# stage #4: detect encoding of the data ( == encoding of the written file)
import chardet
print(chardet.detect(data))
Result of running .\SO\68307124.py:
file signature b'\x1f\x8b'
gzip compressed data, was "9608001.tex", last modified: Thu Aug 8 04:57:44 1996, max compression, from Unix
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
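Since chardet reports plain ASCII with confidence 1.0, the decompressed bytes can be decoded directly. A quick check, building on the script above:

# Peek at the start of the recovered .tex source.
text = data.decode('ascii')
print(text[:200])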

Fixing damaged text after the fact

Last month I made a scraper for this Latin dictionary. It finally finished executing (that website gave me response times of 6 to 8 seconds per page). Too bad I found out that a good chunk of my data is severely compromised...
eg. commandūcor ----> command\xc5\xabcor || commandūcāris ----> command\xc5\xabc\xc4\x81ris
I made the stupid mistake of using the str() function on the raw data I got from requests. Just like this:
import requests

r = requests.get("https://www.dizionario-latino.com/dizionario-latino-flessione.php?lemma=COMMANDUCOR100", verify=False)
out = str(r.content)
with open("test.html", 'w') as file:
    file.write(out)
I'd really appreciate it if anyone could help me restore the broken text.
Thank you in advance!
Just .decode them using UTF-8 (the default). You can read more about character encodings in Python's Unicode HOWTO.
b'command\xc5\xabcor'.decode() # 'commandūcor'
b'command\xc5\xabc\xc4\x81ris'.decode() # 'commandūcāris'
r.content returns bytes. (In contrast, r.text returns a str: the requests module guesses the correct decoding based on the HTTP headers and decodes the bytes for you. In the future, that is probably what you want to use instead; see the sketch below.)
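For future scrapes, a minimal sketch of the approach that avoids the problem entirely (same URL as in the question):

import requests

r = requests.get("https://www.dizionario-latino.com/dizionario-latino-flessione.php?lemma=COMMANDUCOR100", verify=False)

# r.text is already a str, decoded by requests using the declared charset.
with open("test.html", 'w', encoding='utf-8') as file:
    file.write(r.text)

Back to repairing the data you already scraped.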
If r.content contained bytes such as b'command\xc5\xabcor', then
str(r.content) returns a str which begins with the characters b' and ends with a literal '.
In [45]: str(b'command\xc5\xabcor')
Out[45]: "b'command\\xc5\\xabcor'"
You can use ast.literal_eval to recover the bytes:
In [46]: ast.literal_eval(str(b'command\xc5\xabcor'))
Out[46]: b'command\xc5\xabcor'
You could then decode these bytes to a str. The URL you posted declares the content is UTF-8 encoded:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Assuming all the data you downloaded uses the same encoding, you could recover the content as a str by calling the bytes.decode('utf-8') method:
In [47]: ast.literal_eval(str(b'command\xc5\xabcor')).decode('utf-8')
Out[47]: 'commandūcor'
import ast
import requests

r = requests.get("https://www.dizionario-latino.com/dizionario-latino-flessione.php?lemma=COMMANDUCOR100", verify=False)
out = str(r.content)
with open("test.html", 'w') as file:
    file.write(out)

# Read the damaged file back, recover the original bytes, and decode them.
# encoding='utf-8' on the output file ensures 'ū' and friends survive on
# any platform.
with open("test.html", 'r') as f_in, open("test-fixed.html", 'w', encoding='utf-8') as f_out:
    broken_text = f_in.read()
    content = ast.literal_eval(broken_text)
    assert content == r.content
    text = content.decode('utf-8')
    f_out.write(text)

Python how to "ignore" ascii text?

I'm trying to scrape some stuff off a page using Selenium, but some of the text has non-ASCII characters in it, so I get this:
f.write(database_text.text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 1462: ordinal not in range(128)
I was wondering, is there any way to just simply ignore these characters?
Thanks!
print("â")
I'm not looking to write it in my text file, but ignore it.
Note: it's not just "â"; it has other chars like that too.
window_before = driver.window_handles[0]
nmber_one = 1
f = open(str(unique_filename) + ".txt", 'w')
for i in range(5, 37):
    time.sleep(3)
    driver.find_element_by_xpath("""/html/body/center/table[2]/tbody/tr[2]/td/table/tbody/tr""" + "[" + str(i) + "]" + """/td[2]/a""").click()
    time.sleep(3)
    driver.switch_to.window(driver.window_handles[nmber_one])
    nmber_one = nmber_one + 1
    database_text = driver.find_element_by_xpath("/html/body/pre")
    f = open(str(unique_filename) + ".txt", 'w')
    f.write(database_text.text)
    driver.switch_to.window(window_before)
import uuid
import io
unique_filename = uuid.uuid4()
which generates a new filename, well it should anyway, it worked before.
The problem is that some of the text is not ASCII. database_text.text is likely Unicode text (you can do print type(database_text.text) to verify) and contains non-English text. If you are on Windows, it may be "codepage" text, which depends on how your user account is configured.
Often, one wants to store text like this as UTF-8, so open your output file accordingly:
import io

text = u"â"
with io.open('somefile.txt', 'w', encoding='utf-8') as f:
    f.write(text)
If you really do want to drop the non-ASCII characters from the file completely, you can set up an error policy:
text = u"ignore funky â character"
with io.open('somefile.txt', 'w', encoding='ascii', errors='ignore') as f:
f.write(text)
In the end, you need to choose what representation you want to use for non-ASCII (roughly speaking, non-English) text.
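A sketch of a third option, if you prefer a visible placeholder over silently dropping characters: errors='replace' substitutes ? for anything ASCII cannot represent:

import io

text = u"ignore funky â character"
with io.open('somefile.txt', 'w', encoding='ascii', errors='replace') as f:
    f.write(text)  # writes "ignore funky ? character"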
A Try Except block would work:
try:
    f.write(database_text.text)
except UnicodeEncodeError:
    pass
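Note that this skips the whole write whenever a single character fails to encode. A sketch of a variant that keeps the ASCII portion of each string instead (assuming Python 2, as in the question):

try:
    f.write(database_text.text)
except UnicodeEncodeError:
    # Fall back to writing only the characters that ASCII can represent.
    f.write(database_text.text.encode('ascii', 'ignore'))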

Cannot execute auto-generated Python script encoded in UTF8-sig

I am using a Python script to take some text from the internet and put it as comments into another Python script, which the first one generates.
Originally I was simply using open() to create the new Python script and write() to print to it.
outputFile = open(fileName, 'w')
outputFile.write('#!/usr/bin/python\n')
outputFile.write('\n')
outputFile.write('# ' + lineFromTheInternet + '\n')
outputFile.write('print \'Hello, World!\'\n')
This works most of the time: the new script is generated and I can run it. However, sometimes the text that I am taking from the internet has Unicode characters and gives me problems (UnicodeEncodeError: 'ascii' codec can't encode character u'\xd7' in position 55: ordinal not in range(128)). I replaced the code then with:
outputFile = codecs.open(fileName, 'w', 'utf-8-sig')
outputFile.write('#!/usr/bin/python\n')
outputFile.write('\n')
outputFile.write('# ' + lineFromTheInternet + '\n')
outputFile.write('print \'Hello, World!\'\n')
And this would generate the file correctly, but when I try to execute it I get ./autogenerated.py: line 1: #!/usr/bin/python: No such file or directory
This has to be the encoding, since it's the only thing changing, but I do not know how to solve it.
Linux or Windows? This works on Windows. Make sure to write Unicode strings to the file opened with codecs.open:
#!/usr/bin/python2
import codecs

with codecs.open('y.py', 'w', 'utf-8-sig') as outputFile:
    outputFile.write(u'#!/usr/bin/python2\n')
    outputFile.write(u'\n')
    outputFile.write(u'# ' + u'Syst\xe9m' + u'\n')
    outputFile.write(u'print \'Hello, World!\'\n')
AFAIK, Linux does not like the UTF-8 BOM: the three BOM bytes land before the #!, so the kernel does not recognize the shebang line. Try removing the BOM and declaring the encoding instead, e.g. #coding:utf8 at the top of the file:
#!/usr/bin/python2
import codecs

with codecs.open('y.py', 'w', 'utf8') as outputFile:
    outputFile.write(u'#!/usr/bin/python2\n')
    outputFile.write(u'#coding:utf8\n')
    outputFile.write(u'\n')
    outputFile.write(u'# ' + u'Syst\xe9m' + u'\n')
    outputFile.write(u'print \'Hello, World!\'\n')
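To see why the BOM version fails, inspect the first bytes of the generated file: with 'utf-8-sig' the file starts with the three BOM bytes ahead of the #!, so the kernel never sees a shebang. A quick check (assuming the y.py filename used above):

with open('y.py', 'rb') as f:
    print(repr(f.read(4)))  # b'\xef\xbb\xbf#' with a BOM, b'#!/u' without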

Unable to decode HTML page with urllib.request

I've written the following piece of code, which searches a URL and saves the HTML to a text file. However, I have two issues:
Most importantly, it does not save € and £ in the HTML as-is. This is likely a decoding issue, which I've tried to fix, but so far without success.
The code also does not replace the "\n" in the HTML with "". This isn't as important to me, but I am curious as to why it is not working.
Any ideas?
import urllib.request

while True:  # this is an infinite loop
    with urllib.request.urlopen('WEBSITE_URL') as f:
        fDecoded = f.read().decode('utf-8')
        data = str(fDecoded.read()).replace('\n', '')  # does not seem to work?
        myfile = open("TestFile.txt", "r+")
        myfile.write(data)
        print('----------------')
When you do this:
fDecoded = f.read().decode('utf-8')
fDecoded is already of type str: you read the byte string from the response and decoded it into a str using the utf-8 encoding.
After that, you cannot call:
str(fDecoded.read()).replace('\n', '')
str has no read() method, and you do not need to convert it to str again. Just do:
data = fDecoded.replace('\n', '')
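Putting it together, a minimal corrected sketch ('WEBSITE_URL' is the placeholder from the question, and the infinite loop is kept as in the original). Note that preserving € and £ also requires opening the output file with an explicit utf-8 encoding; the original "r+" mode additionally fails if TestFile.txt does not exist yet:

import urllib.request

while True:  # still an infinite loop, as in the question
    with urllib.request.urlopen('WEBSITE_URL') as f:
        data = f.read().decode('utf-8').replace('\n', '')
    # 'w' creates/truncates the file; encoding='utf-8' preserves € and £.
    with open("TestFile.txt", "w", encoding='utf-8') as myfile:
        myfile.write(data)
    print('----------------')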
