Fixing damaged text after the fact - Python

Last month I made a scraper for this Latin dictionary. It finally finished executing (that website gave me response times of 6 to 8 seconds per page). Too bad I then found out that a good chunk of my data is severely compromised...
e.g. commandūcor ----> command\xc5\xabcor || commandūcāris ----> command\xc5\xabc\xc4\x81ris
I made the stupid mistake of using the str() function on the raw data I got from requests. Just like this:
import requests

r = requests.get("https://www.dizionario-latino.com/dizionario-latino-flessione.php?lemma=COMMANDUCOR100", verify=False)
out = str(r.content)
with open("test.html", 'w') as file:
    file.write(out)
I'd really appreciate it if anyone could help me restore the broken text.
Thank you in advance!

Just .decode them using utf-8 (the default). You can read more about character encodings in Python's Unicode HOWTO.
b'command\xc5\xabcor'.decode() # 'commandūcor'
b'command\xc5\xabc\xc4\x81ris'.decode() # 'commandūcāris'

r.content returns bytes. (In contrast, r.text returns a str: the requests module guesses the correct encoding from the HTTP headers and decodes the bytes for you. In the future, that is probably what you want to use instead.)
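For your next scrape, a minimal sketch of that approach (the page declares UTF-8, so the file is written with that encoding explicitly):

import requests

r = requests.get("https://www.dizionario-latino.com/dizionario-latino-flessione.php?lemma=COMMANDUCOR100", verify=False)
with open("test.html", 'w', encoding='utf-8') as f:
    f.write(r.text)  # r.text is already a str, decoded for you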
If r.content contained bytes such as b'command\xc5\xabcor', then
str(r.content) returns a str which begins with the characters b' and ends with a literal '.
In [45]: str(b'command\xc5\xabcor')
Out[45]: "b'command\\xc5\\xabcor'"
You can use ast.literal_eval to recover the bytes:
In [46]: ast.literal_eval(str(b'command\xc5\xabcor'))
Out[46]: b'command\xc5\xabcor'
You could then decode these bytes to a str. The URL you posted declares the content is UTF-8 encoded:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Assuming all the data you downloaded uses the same encoding, you could recover the content as a str by calling the bytes.decode('utf-8') method:
In [47]: ast.literal_eval(str(b'command\xc5\xabcor')).decode('utf-8')
Out[47]: 'commandūcor'
import ast
import requests

r = requests.get("https://www.dizionario-latino.com/dizionario-latino-flessione.php?lemma=COMMANDUCOR100", verify=False)
out = str(r.content)
with open("test.html", 'w') as file:
    file.write(out)

with open("test.html", 'r') as f_in, open("test-fixed.html", 'w') as f_out:
    broken_text = f_in.read()
    content = ast.literal_eval(broken_text)
    assert content == r.content
    text = content.decode('utf-8')
    f_out.write(text)
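Since your scrape has already finished, you don't need to re-download anything. A hedged sketch for repairing the files already on disk; the *.html glob and the -fixed suffix are assumptions, so adjust them to your layout:

import ast
import glob

for path in glob.glob("*.html"):
    if path.endswith("-fixed.html"):
        continue  # skip files we already repaired
    with open(path, 'r') as f_in:
        broken_text = f_in.read()  # looks like "b'command\\xc5\\xabcor ...'"
    content = ast.literal_eval(broken_text)  # recover the original bytes
    with open(path[:-len(".html")] + "-fixed.html", 'w', encoding='utf-8') as f_out:
        f_out.write(content.decode('utf-8'))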

Related

MD5 Encoding HTML Giving 2 Different Results

Can someone help explain why this is happening? If I scrape HTML from a site using the requests module and use hashlib to get the md5 checksum, I get one answer. Then if I save the HTML as an html file, open it, and compute the same md5 checksum, it gives me a different checksum.
import requests
import hashlib

resp = requests.post("http://casesearch.courts.state.md.us/", timeout=120)
html = resp.text
print("CheckSum 1: " + hashlib.md5(html.encode('utf-8')).hexdigest())

f = open("test.html", "w+")
f.write(html)
f.close()

with open('test.html', "r", encoding='utf-8') as f:
    html2 = f.read()
print("CheckSum 2: " + hashlib.md5(html2.encode('utf-8')).hexdigest())
The results look like:
CheckSum 1: e0b253903327c7f68a752c6922d8b47a
CheckSum 2: 3aaf94e0df9f1298d61830d99549ddb0
When reading from a file in text mode, Python may translate newline characters, depending on the value of the newline argument provided to open. From the docs:
When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated.
This difference will affect the generated hash value.
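A minimal sketch: passing newline='' to open on both the write and the read disables that translation, so the two checksums should now agree:

import hashlib
import requests

resp = requests.post("http://casesearch.courts.state.md.us/", timeout=120)
html = resp.text

with open("test.html", "w", encoding="utf-8", newline="") as f:
    f.write(html)  # newline="" writes '\n' characters untranslated

with open("test.html", "r", encoding="utf-8", newline="") as f:
    html2 = f.read()  # newline="" returns line endings untranslated too

print(hashlib.md5(html.encode("utf-8")).hexdigest())
print(hashlib.md5(html2.encode("utf-8")).hexdigest())  # now identical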

Writing Persian text into a text file in a way which can be read in Python

I have developed a simple program which sends a request to a Persian web server and gets the source code of the main page. Then I convert it to a string, use open(new_file, 'w'), and paste the string into it.
When I print the string in the Python IDLE I can see the right words in Persian, but the text file created in the directory contains strings like \xd9\x8a\xd8\xb9\n.
Here is the code:
import urllib.request as ul
import sys

url = 'http://www.uut.ac.ir/'
resp = ul.urlopen(url).read()
string = str(resp)
create_file(filename, string)  # this function creates a text file on the desktop
I also used:
file.open(new_file, 'w', encoding='utf-8')
string = resp.encode('utf-8')
But nothing changed. Any help would be appreciated.
So look at your code:
>>> resp = ul.urlopen(url).read()
>>> type(resp)
<class 'bytes'>
resp has the type bytes. Next, you used:
string = str(resp)
But you forgot to set the encoding. The right call is:
string = str(resp, encoding="utf-8")
Now you get the right string and can write it directly to your file.
Your second attempt is wrong: you must use decode instead of encode.
string = resp.decode('utf-8')
Decode the website content before writing it into the file:
import urllib.request as ul

url = 'http://www.uut.ac.ir/'
resp = ul.urlopen(url).read()
string = resp.decode()  # decode() already returns a str, no extra str() call needed
with open("a.txt", 'w', encoding='utf-8') as f:
    f.write(string)

Writing on text file, accents and special characters not displaying correctly

Here's what I'm doing: I'm web crawling a website, for my personal use, to copy the text and put the chapters of a book in text format, then transform it with another program to PDF automatically to put it in my cloud. Everything is fine until this happens: special characters are not copied correctly. For example, the accent is shown as \xe2\x80\x99 in the text file and the - is shown as \xe2\x80\x93. I used this (Python 3):
for text in soup.find_all('p'):
    texta = text.text
    f.write(str(str(texta).encode("utf-8")))
    f.write('\n')
Because I had a bug when reading those characters that just stopped my program, I encoded everything to utf-8 and transformed it back to a string with Python's str() method.
I will post the whole code in case anyone has a better solution to my problem. Here's the part that crawls the website from page 1 to max_pages; you can modify it on line 21 to get more or fewer chapters of the book:
import requests
from bs4 import BeautifulSoup

def crawl_ATG(max_pages):
    page = 1
    while page <= max_pages:
        x = page
        url = 'http://www.wuxiaworld.com/atg-index/atg-chapter-' + str(x) + "/"
        source = requests.get(url)
        chapter = source.content
        soup = BeautifulSoup(chapter.decode('utf-8', 'ignore'), 'html.parser')
        f = open('atg_chapter' + str(x) + '.txt', 'w+')
        for text in soup.find_all('p'):
            texta = text.text
            f.write(str(str(texta).encode("utf-8")))
            f.write('\n')
        f.close()
        page += 1

crawl_ATG(10)
I will clean up the first useless lines that get copied later, once I have a solution to this problem. Thank you.
The easiest way I found to fix this problem is to add encoding='utf-8' in the open function:
with open('file.txt', 'w', encoding='utf-8') as file:
    file.write('ñoño')
For some reason, you (wrongly) end up with utf8-encoded data inside a Python 3 string. The real cause is probably the manual decode: source.content is raw bytes, and BeautifulSoup can work out the encoding itself, so don't decode it yourself but pass it in directly:
url = 'http://www.wuxiaworld.com/atg-index/atg-chapter-' + str(x) + "/"
source = requests.get(url)
chapter = source.content
soup = BeautifulSoup(chapter, 'html.parser')
If that is not enough, meaning you still see ’ and – (Unicode u'\u2019' and u'\u2013') displayed as \xe2\x80\x99 and \xe2\x80\x93, the HTML page may not be declaring its encoding correctly. In that case you should first encode the string back to a byte string with the latin1 encoding, and then decode it as utf8:
chapter = source.content.encode('latin1', 'ignore').decode('utf8', 'ignore')
soup = BeautifulSoup(chapter, 'html.parser')
Demonstration:
t = u'\xe2\x80\x99 \xe2\x80\x93'
t = t.encode('latin1').decode('utf8')
Displays: u'\u2019 \u2013'
print(t)
Displays: ’ –
The only error I can spot is:
str(texta).encode("utf-8")
Here you are forcing a conversion to str and then encoding it. It should be replaced with:
texta.encode("utf-8")
EDIT:
The error stems from the server not sending the correct encoding for the page, so requests assumes 'ISO-8859-1'. As noted in this bug, it is a deliberate decision.
Luckily, the chardet library correctly detects the 'utf-8' encoding, so you can do:
source.encoding = source.apparent_encoding
chapter = source.text
There is then no need to manually decode the text in chapter, since requests uses that encoding to decode the content for you.
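Putting it together, a minimal sketch of the crawler with those fixes applied (apparent_encoding for the decode, an explicit utf-8 encoding for the output file):

import requests
from bs4 import BeautifulSoup

def crawl_ATG(max_pages):
    for page in range(1, max_pages + 1):
        url = 'http://www.wuxiaworld.com/atg-index/atg-chapter-' + str(page) + "/"
        source = requests.get(url)
        source.encoding = source.apparent_encoding  # let chardet detect utf-8
        soup = BeautifulSoup(source.text, 'html.parser')
        with open('atg_chapter' + str(page) + '.txt', 'w', encoding='utf-8') as f:
            for text in soup.find_all('p'):
                f.write(text.text + '\n')  # already a str; no manual encode needed

crawl_ATG(10)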

Unable to decode HTML page with urllib.request

I've written the following piece of code, which requests a URL and saves the HTML to a text file. However, I have two issues.
Most importantly, it does not save € and £ in the HTML as those characters. This is likely a decoding issue which I've tried to fix, but so far without success.
The code also does not replace the "\n" in the HTML with "". This isn't as important to me, but I am curious as to why it is not working.
Any ideas?
import urllib.request

while True:  # this is an infinite loop
    with urllib.request.urlopen('WEBSITE_URL') as f:
        fDecoded = f.read().decode('utf-8')
    data = str(fDecoded.read()).replace('\n', '')  # does not seem to work?
    myfile = open("TestFile.txt", "r+")
    myfile.write(data)
    print('----------------')
When you do this -
fDecoded = f.read().decode('utf-8')
fDecoded is already of type str; you are reading the byte string from the request and decoding it into a str using the utf-8 encoding.
Then after this you cannot call -
str(fDecoded.read()).replace('\n', '')
str has no method read() and you do not actually need to convert it to str again. Just do -
data = fDecoded.replace('\n', '')
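That fixes the replace. For the € and £ issue, the likely culprit is writing the file with the platform's default encoding; a minimal sketch, assuming the page really is UTF-8 ('WEBSITE_URL' stands in for the real URL, as in the question):

import urllib.request

with urllib.request.urlopen('WEBSITE_URL') as f:
    fDecoded = f.read().decode('utf-8')

data = fDecoded.replace('\n', '')
with open("TestFile.txt", "w", encoding="utf-8") as myfile:
    myfile.write(data)  # an explicit utf-8 encoding keeps € and £ intact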

How do you base-64 encode a PNG image for use in a data-uri in a CSS file?

I want to base-64 encode a PNG file, to include it in a data:url in my stylesheet. How can I do that?
I’m on a Mac, so something on the Unix command line would work great. A Python-based solution would also be grand.
This should do it in Python:
import base64
binary_fc = open(filepath, 'rb').read() # fc aka file_content
base64_utf8_str = base64.b64encode(binary_fc).decode('utf-8')
ext = filepath.split('.')[-1]
dataurl = f'data:image/{ext};base64,{base64_utf8_str}'
Thanks to @cnst's comment, we need the prefix data:image/{ext};base64,
Thanks to @ramazanpolat's answer, we need the decode('utf-8')
In Python 3, base64.b64encode returns a bytes instance, so it's necessary to call decode to get a str if you are working with Unicode text.
# Image data from Wikipedia
>>> image_data = b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x05\x00\x00\x00\x05\x08\x06\x00\x00\x00\x8do&\xe5\x00\x00\x00\x1cIDAT\x08\xd7c\xf8\xff\xff?\xc3\x7f\x06 \x05\xc3 \x12\x84\xd01\xf1\x82X\xcd\x04\x00\x0e\xf55\xcb\xd1\x8e\x0e\x1f\x00\x00\x00\x00IEND\xaeB`\x82'
# String representation of bytes object includes leading "b" and quotes,
# making the uri invalid.
>>> encoded = base64.b64encode(image_data) # Creates a bytes object
>>> 'data:image/png;base64,{}'.format(encoded)
"data:image/png;base64,b'iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=='"
# Calling .decode() gets us the right representation
>>> encoded = base64.b64encode(image_data).decode('ascii')
>>> 'data:image/png;base64,{}'.format(encoded)
'data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=='
If you are working with bytes directly, you can use the output of base64.b64encode without further decoding.
>>> encoded = base64.b64encode(image_data)
>>> b'data:image/png;base64,' + encoded
b'data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=='
import base64

def image_to_data_url(filename):
    ext = filename.split('.')[-1]
    prefix = f'data:image/{ext};base64,'
    with open(filename, 'rb') as f:
        img = f.read()
    return prefix + base64.b64encode(img).decode('utf-8')
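Hypothetical usage, where 'logo.png' is a placeholder filename:

print(image_to_data_url('logo.png'))  # e.g. data:image/png;base64,iVBORw0KG...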
This should do it in Unix:
b64encode filename.png X | sed '1d;$d' | tr -d '\n' > b64encoded.png
The encoded image produced by b64encode includes a header and footer, and no line longer than 76 characters. This format is typical of SMTP communications.
To make the encoded image embeddable in HTML/CSS, the sed and tr commands remove the header/footer (first & last lines) and all newlines, respectively.
Then simply use the long encoded string in HTML
<img src="data:image/png;base64,ENCODED_PNG">
or in CSS
url(data:image/png;base64,ENCODED_PNG)
b64encode is not installed by default in some distros (see @Clint Pachl's answer), but Python is. So, to get a base64-encoded image from the command line, just use:
python -m base64 image.jpeg | tr -d '\n' > b64encoded.txt
The remaining steps were already answered by @Clint Pachl (https://stackoverflow.com/a/20467682/1522342).
This should work in Python3:
from io import BytesIO
import requests, base64
def encode_image(image_url):
buffered = BytesIO(requests.get(image_url).content)
image_base64 = base64.b64encode(buffered.getvalue())
return b'data:image/png;base64,'+image_base64
Call decode to get a str, since in Python 3 base64.b64encode returns a bytes instance.
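Hypothetical usage, with a placeholder URL:

data_uri = encode_image('https://example.com/logo.png')
print(data_uri.decode('ascii'))  # decode to a str if you need to paste it into CSS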
And just for the record, if you want to do it in Node.js instead:
const fs = require('fs');
const base64encodedString = fs.readFileSync('image_file.jpg', {encoding:'base64'});
