I'm writing a website/search engine with a Python back end, and every time a paragraph symbol shows up in my search results, the page gets a 500 server error. Does anyone know how I can reformat the results string so that it gets rid of the paragraph symbol?
Thanks!
This sounds like a text-encoding problem. Make sure you're using Unicode strings as much as possible, and wherever you have to work with byte strings, always specify the encoding explicitly.
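For example, something along these lines might work. This is a minimal sketch in Python 3, and the raw bytes and the Latin-1 guess are assumptions, not your actual data:

raw_results = b"First hit \xb6 Second hit"  # bytes as they might come back from the search
# Decode explicitly rather than relying on the default codec;
# \xb6 is the pilcrow / paragraph sign in Latin-1 and CP1252.
text = raw_results.decode("latin-1")
# If you simply don't want the character in the output, replace it.
clean = text.replace("\u00b6", " ")
print(clean)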
Related
I recently used the Google Vision API to extract text from a PDF, and now I am searching for a keyword in the response text (from the API). When I compare the given string and the found string, they do not match even though they contain the same characters.
The only reason I can see is that the given and found strings have different font faces, which leads to different ASCII/UTF-8 codes for the characters in the string. (I have never come across a problem like this before.)
How can I solve this? How can I bring these two strings to the same characters? I am using a Jupyter notebook, but even when I paste the comparison into a terminal it still evaluates to False.
Here are the strings I am trying to match:
'КА Р5259' == 'KA P5259'
But they look the same on Stack Overflow so here's a screenshot:
Thanks everyone for your comments.
I found the solution and am posting it here in case it helps someone. It is correct that Python does not deal in font faces. So if you copy a character rendered in a particular font face and paste it into the Python console or a Jupyter notebook (which renders font faces because it uses HTML to display information), it can turn out to be a different Unicode character.
So the idea is to first bring the text response into a plain-text format, which I achieved by storing the response in a .txt file (or, more precisely, a .pkl file), something I had to do anyway to preserve the response objects for later data analysis. Once the response is stored in a plain text file, you can read it back without the font-face problem I hit above.
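As a small illustration of why the comparison is False, you can inspect the code points of the two strings from the question:

import unicodedata

a, b = 'КА Р5259', 'KA P5259'
for x, y in zip(a, b):
    print(x, unicodedata.name(x), '|', y, unicodedata.name(y))
# 'К' is CYRILLIC CAPITAL LETTER KA while 'K' is LATIN CAPITAL LETTER K,
# so the two strings can never compare equal even though they look identical.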
I have some scraped data that I have output to an HTML file held locally as a 'raw' version before I do any data manipulation.
The issue is that when I process the website, I have a troublesome time dealing with a "’" character.
After much research I am getting to the end of my tether. I have seen a lot about that apostrophe causing issues, and I have tried many combinations of encoding and decoding, chardet, etc., and still cannot get it to work.
A word in a few tables is: CA’BELLAVISTA
When I run the script, the IDE screen prints it correctly once I get the right encoding/decoding pattern; however, when I view the output HTML file I get CA\x92BELLAVISTA every time.
The script is simply a urllib.response.read() then encoding.
Is it the web browser doing it, or is the script genuinely not getting the correct character?
The next step involves loading the HTML file back in for further manipulation and output to JSON/CSV, so I thought nailing the encoding of the HTML file output would be the best option.
I think it's an ISO-8859-1/Latin-1 charset, although that seems to change on the odd webpage.
I hope I'm doing the correct thing in trying to put it into UTF-8.
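In case it helps: \x92 is the Windows-1252 byte for the right single quotation mark (’), and plain ISO-8859-1 maps that byte to an unprintable control character, which would explain the literal escape in your output file. A rough sketch of the decode-then-write-UTF-8 step (the URL, the file name and the cp1252 guess are assumptions):

import urllib.request

raw = urllib.request.urlopen("http://example.com/page.html").read()  # bytes from the server
text = raw.decode("cp1252")  # cp1252 turns \x92 into ’ ; latin-1 would keep it as a control char
with open("raw.html", "w", encoding="utf-8") as f:
    f.write(text)  # everything downstream now sees UTF-8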
I am writing a Python script for mass-replacement of links (actually image and script sources) in HTML files; I am using lxml. There is one problem: the HTML files are quizzes and they have data packaged like this (there is also some Cyrillic here):
<input class="question_data" value="{"text":"<p>[1] је наука која се бави чувањем, обрадом и преносом информација помоћу рачунара.</p>","fields":[{"id":"1","type":"fill","element":{"sirina":"103","maxDuzina":"12","odgovor":["Информатика"]}}]}" name="question:1:data" id="id3a1"/>
When I try to print out this data in Python using:
print "OLD_DATA:", data
It just prints out the error "UnicodeEncodeError: character maps to <undefined>". There are more of these elements. My goal is to change the links of the images in the value part of the input, but I can't change the links if I don't know how to print this data (or how it should be written to the file). How does Python handle (interpret) this? Please help. Thanks!!! :)
You're running into the same problem I've hit many times in the past. That error almost always means that the console environment you're using can't display the characters it's trying to print. It might be worth trying to log to a file instead, then opening the log in an editor that can display the characters.
If you really want to be able to see it on your console, it might be worth writing a function that screens the strings you're printing for unprintable characters.
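Something like this might do as that screening function; it's only a sketch, and the replacement policy (here just "?") is up to you:

import sys

def printable_only(text):
    # Re-encode with errors="replace" so anything the console encoding
    # cannot represent becomes "?" instead of raising UnicodeEncodeError.
    enc = sys.stdout.encoding or "ascii"
    return text.encode(enc, "replace").decode(enc)

You could then print printable_only(data) instead of data while debugging.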
I also found a couple other StackOverflow posts that might be helpful in your efforts:
How do I get Cyrillic in the output, Python?
What is right way to use cyrillic in python lxml library
I would also recommend the Python manual entry and this article:
https://docs.python.org/2/howto/unicode.html
http://www.joelonsoftware.com/articles/Unicode.html
I am indexing data in Elasticsearch using the official Python library for this: elasticsearch-py. The data is taken directly from Oracle using the cx_Oracle Python library, cast into a document format, and sent to Elasticsearch for indexing. For the most part this works great, but sometimes I run into problems with characters like ö. Sometimes this character is indexed as \xc3\xb6 and sometimes as ö. This happens even within the same database entry: one field can have the ö indexed correctly while another does not.
Does anyone have an idea what might cause this?
Thanks in advance.
If your "ö" is sometimes right - and sometimes not, the data must be corrupted in your database. This is not a problem of Elasticsearch. (I had the exact same problem one month ago!)
Strings with various encodings are likely put in your database without being all converted to a single format before.
text = "ö"
asUtf = text.encode('UTF-8')
print(asUtf)
print(asUtf.decode())
Result:
b'\xc3\xb6'
ö
This problem should be solved before the data is inserted into Elasticsearch. Find the text sequences matching '\xXX\xXX', treat them as UTF-8, and decode them to Unicode. Try to sanitize your database and fix the way you put information into it.
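A rough sketch of that repair step, assuming the bad values are UTF-8 byte sequences that were decoded as Latin-1 at some point (the function name is made up):

def fix_double_encoded(value):
    # If a string like 'Ã¶' is really the UTF-8 bytes of 'ö' read as Latin-1,
    # round-tripping through Latin-1 recovers the intended text.
    try:
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value  # already clean, or broken in some other way

print(fix_double_encoded("Ã¶"))  # -> ö

Run every text field through something like this before building the Elasticsearch document.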
PS: a better practice for moving information from a database to Elasticsearch is to use rivers, or to write a script that sends the data directly to Elasticsearch without saving it to a file first.
2016 edit: rivers are deprecated now, so you should find an alternative such as Logstash.
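For the "script that sends the data directly" route, a minimal sketch with elasticsearch-py's bulk helper; the host, index name and documents are placeholders, and the connection arguments vary a bit between client versions:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(["http://localhost:9200"])  # placeholder host
docs = [{"name": "Jörg"}, {"name": "Björk"}]   # rows already fetched from Oracle
actions = ({"_index": "people", "_source": doc} for doc in docs)
bulk(es, actions)  # sends the documents in batched bulk requests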
After crawling many websites, from some of them I receive data with a broken encoding. I can't do anything with that data; I just need to detect it. For example, text like:
·ç¼wÃdª«¦Ê³f
or
ãà³n³¾å¢
How can I recognize text like that? The sites can be in any language, so searching for non-English text is not an option. The only option I can think of is the guess-language module.
There's NLTK, which has a guess_encoding function that takes a byte string and tries all of the available encodings. Would this serve your purpose?
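If you would rather not pull in all of NLTK, the standalone chardet package makes the same kind of guess from a byte string; a small sketch (the sample bytes are your first example re-encoded as Latin-1, which is an assumption about how it arrived):

import chardet

raw = "·ç¼wÃdª«¦Ê³f".encode("latin-1")  # pretend these are the bytes as crawled
print(chardet.detect(raw))  # {'encoding': ..., 'confidence': ..., ...}

A low confidence value, or an encoding that contradicts the page's declared charset, is a reasonable signal that the text is broken.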
Take a look at https://github.com/LuminosoInsight/python-ftfy
If I understand correctly, it will attempt to 'repair' incorrectly encoded/decoded text.
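For reference, fix_text is the main entry point; the sample string here is my own (a typical UTF-8-decoded-as-Latin-1 case), not one of the examples above:

import ftfy

broken = "sÃ©curitÃ©"  # 'sécurité' whose UTF-8 bytes were read back as Latin-1
print(ftfy.fix_text(broken))  # -> sécurité

# Comparing input and output gives a rough detector, though fix_text also
# normalizes things like full-width characters, so it may flag some valid text.
def looks_broken(text):
    return ftfy.fix_text(text) != text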