Python detect broken encoding - python

After crawling many websites, I receive data with a broken encoding from some of them. I can't do anything with it; I just need to detect it. For example, text like:
·ç¼wÃdª«¦Ê³f
or
ãà³n³¾å¢
How can I recognize text like that? The text can be in any language, so searching for non-English text is not an option. The only option I can think of is the guess-language module.

There's NLTK, which has a guess_encoding function that takes a byte string and tries all of the available encodings. Would this serve your purpose?

Take a look at https://github.com/LuminosoInsight/python-ftfy
If I understand correctly, it will attempt to 'repair' incorrectly encoded/decoded text.
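For the detection-only case, a rough stdlib heuristic can be sketched as follows. This is not ftfy's algorithm; it only checks for one common failure, UTF-8 bytes that were wrongly decoded as cp1252 or Latin-1, and the function name looks_like_mojibake is made up for this example. Samples that come from a different encoding mix-up, as the ones in the question may, would need additional heuristics.

```python
def looks_like_mojibake(text):
    """Heuristic: True if text looks like UTF-8 that was mis-decoded
    as cp1252 or Latin-1 (only one of many possible kinds of breakage)."""
    if text.isascii():
        return False  # pure ASCII cannot be this kind of mojibake
    for codec in ("cp1252", "latin-1"):
        try:
            # Undo the suspected wrong decoding, then check whether the
            # resulting bytes form valid UTF-8.
            text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
        return True
    return False


print(looks_like_mojibake("cafÃ©"))   # True: this is "café" decoded wrongly
print(looks_like_mojibake("café"))    # False: ordinary accented text
```

Correctly decoded non-Latin text (Cyrillic, Chinese, and so on) cannot be encoded to cp1252 or Latin-1 at all, so it is not flagged by this check.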

Related

Filter Emojis like \xe2\x80\x9e from HTML in python 3

So I'm working on a project where I need to manually filter the HTML of social media comment threads with split, replace, re.sub and that stuff; I wouldn't get the required information otherwise (BeautifulSoup filters out important information too). In the end, I'm left with something like this:
Best of luck to you now that there's some real competition \xf0\x9f\x98\x8f
Thanks \xf0\x9f\x98\x82
I searched for any way to get rid of these or replace them with actual emojis, but I found nothing. I did find commands that filter out emojis when they look like this U+1F600 or like this :cowboy hat face: or like this \U0001F606, and I did find someone who filtered things like this \xe2\x80\x99, but he only did it for semicolons and quotation marks, not emojis. I also couldn't find a way to use encode and decode for this.
Short: I want "Thanks \xf0\x9f\x98\x82" to become "Thanks".
So I'm new to working with websites and maybe the answer is quite simple, but as I said, I found nothing on this on the internet. Any help is very appreciated!
If you only want ASCII characters in your text, you can encode and decode the text with ascii:
text = """Best of luck to you now that there's some real competition \xf0\x9f\x98\x8f
Thanks \xf0\x9f\x98\x82"""
# 'ignore' drops every character that is not ASCII instead of raising an error
text = text.encode('ascii', 'ignore').decode('ascii')
>>> print(text)
Best of luck to you now that there's some real competition
Thanks
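If the comment data is still available as bytes (an assumption; the question does not say how the text was obtained), the \xf0\x9f\x98\x82 sequences are simply UTF-8 encoded emojis, so decoding the bytes recovers them and the ASCII round trip removes them. A minimal sketch, with helper names invented for this example:

```python
def decode_utf8_bytes(raw):
    """Turn raw UTF-8 bytes into a str, so emoji bytes become real emojis."""
    return raw.decode("utf-8")


def strip_non_ascii(text):
    """Drop every non-ASCII character and trim leftover whitespace."""
    return text.encode("ascii", "ignore").decode("ascii").strip()


raw = b"Thanks \xf0\x9f\x98\x82"          # bytes, not a str
with_emoji = decode_utf8_bytes(raw)       # "Thanks " followed by U+1F602
without_emoji = strip_non_ascii(with_emoji)
print(without_emoji)                      # Thanks
```

Note that strip_non_ascii also removes accented letters and any other non-ASCII text, which may or may not be acceptable for comment threads.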

HTML Decoding in Python

I am writing a Python script for mass replacement of links (actually image and script sources) in HTML files; I am using lxml. There is one problem: the HTML files are quizzes and they have data packaged like this (there is also some Cyrillic here):
<input class="question_data" value="{"text":"<p>[1] је наука која се бави чувањем, обрадом и преносом информација помоћу рачунара.</p>","fields":[{"id":"1","type":"fill","element":{"sirina":"103","maxDuzina":"12","odgovor":["Информатика"]}}]}" name="question:1:data" id="id3a1"/>
When I try to print out this data in python using:
print "OLD_DATA:", data
It just prints out the error "UnicodeEncodeError: character maps to undefined". There are more of these elements. My goal is to change the image links in the value part of the input, but I can't change the links if I don't know how to print this data (or how it should be written to the file). How does Python handle (interpret) this? Please help. Thanks!!! :)
You're running into the same problem I've hit many times in the past. That error almost always means that the console environment you're using can't display the characters it's trying to print. It might be worth trying to log to a file instead, then opening the log in an editor that can display the characters.
If you really want to be able to see it on your console, it might be worth writing a function to screen the strings you're printing for unprintable characters.
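One way such a screening function could look is sketched here in Python 3 syntax (the question uses the Python 2 print statement). The name console_safe is made up for this example; it replaces every character the console encoding cannot represent with a backslash escape instead of raising UnicodeEncodeError.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent
    replaced by backslash escapes such as \\u0418."""
    # Fall back to ASCII if stdout has no known encoding (e.g. when piped).
    enc = encoding or sys.stdout.encoding or "ascii"
    return text.encode(enc, errors="backslashreplace").decode(enc)


data = "Информатика"
# Forcing "ascii" here only to show the effect on any machine.
print("OLD_DATA:", console_safe(data, "ascii"))
```

This only affects what is displayed; the original string stays intact, so it can still be modified and written to a file with an explicit encoding such as UTF-8.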
I also found a couple of other Stack Overflow posts that might be helpful in your efforts:
How do I get Cyrillic in the output, Python?
What is right way to use cyrillic in python lxml library
I would also recommend this Python manual entry and this article:
https://docs.python.org/2/howto/unicode.html
http://www.joelonsoftware.com/articles/Unicode.html

Search text for valid Python code

I have chunks of text that may or may not contain Python code. I need a way to search the text for code and if it is there, do something. I could easily search for specific strings that match a regex, but I need something general.
One thought I had would be to run the text through ast, but that would require parsing out all possible substrings and submitting each of them to ast.
To be clear, the text comes from a Q&A forum for Python. Users frequently post code in their questions, but the code is all smushed into one long, incoherent line when it should be formatted to be displayed properly. I need to check if there is code included in the text and if there is and it isn't formatted properly, complain to the user. Checking formatting is something I can handle, I just need to check the text for any Python code.
Any help would be appreciated.
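A rough sketch of the ast idea, checked line by line rather than over every possible substring. Everything here is an assumption made for illustration: the function names, the line-by-line split, and the choice of node types that count as code. Plain English words such as "thanks" parse as valid Python names, so the sketch only counts a line when it contains a more telling construct. It will not catch code that has been smushed into a single line together with prose.

```python
import ast

# Node types treated as evidence of real code (an arbitrary choice).
CODE_NODES = (ast.Assign, ast.Call, ast.FunctionDef, ast.Import,
              ast.ImportFrom, ast.For, ast.While, ast.If)


def line_looks_like_python(line):
    """True if the line parses as Python and contains a telling construct."""
    try:
        tree = ast.parse(line.strip())
    except (SyntaxError, ValueError):
        return False
    return any(isinstance(node, CODE_NODES) for node in ast.walk(tree))


def contains_python(text):
    """True if any single line of the text looks like Python code."""
    return any(line_looks_like_python(line) for line in text.splitlines())


print(contains_python("Hi all\nprint(x)\nthanks"))   # True
print(contains_python("I have a question"))           # False
```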

Paragraph Symbol Causing Web Page To Crash

I'm writing a website/search engine with a Python back end, and every time a paragraph symbol shows up in my search results, the page gets a 500 server error. Does anyone know how I might be able to reformat the string containing the results so that it will get rid of the paragraph symbol?
Thanks!
This sounds like a problem with text encoding. Make sure you're using Unicode strings as much as possible, and if it's not possible, always specify the encoding.
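As a small illustration of that advice, the sketch below decodes bytes to text at the point where data enters the program and encodes explicitly where it leaves. The helper names and the choice of UTF-8 are assumptions; the question does not show the actual back end or its encoding.

```python
def to_text(raw, encoding="utf-8"):
    """Decode incoming bytes with an explicit encoding; pass str through."""
    if isinstance(raw, bytes):
        # errors="replace" keeps the page alive if a byte is invalid.
        return raw.decode(encoding, errors="replace")
    return raw


def to_response_body(text, encoding="utf-8"):
    """Encode outgoing text explicitly instead of relying on a default."""
    return text.encode(encoding)


result = to_text("\u00b6 First paragraph".encode("utf-8"))
body = to_response_body(result)
print(body)
```

The response should then also declare the same encoding to the browser, for example in its Content-Type header.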

BeautifulSoup for Mandarin

I'm trying to scrape a site in Mandarin using BeautifulSoup. Unfortunately, when I do, BeautifulSoup finds the html, head, and body tags, but everything in between the opening and closing body tags is gibberish. I've tried using multiple parsers, and as far as I can tell only html5lib is able to find all of the page because it returns by far the longest result. So I think I'm using the right parser, but the encoding is wrong. The website lists 'gb2312' as its encoding, but using that encoding, it is still gibberish. I also tried chardet to determine the encoding, which returned 'windows-1252', but it also didn't seem correct. Indeed, I have gone through many of the standard Chinese character encodings (found here), but none of them return anything coherent, although some have one or two Chinese characters. I also created an output file for every possible Python encoding, but it looks like none of them are correct.
Other than going through the different encodings, I'm not sure what else to try. Any help would be greatly appreciated, thanks!
Never mind! I guess it was an encoding issue, but mainly that the requests library is far better than urllib! Sorry about that.
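The follow-up does not say why requests worked. One plausible cause (an assumption, not confirmed by the poster) is that the server sent a gzip-compressed body: requests decompresses such responses transparently, while urllib returns the raw bytes, which look like gibberish in every encoding. A minimal sketch of handling that by hand, with a made-up function name:

```python
import gzip


def decode_body(raw, encoding="gb2312"):
    """Decompress the body if it is gzip data, then decode it as text."""
    if raw[:2] == b"\x1f\x8b":      # gzip magic number
        raw = gzip.decompress(raw)
    return raw.decode(encoding, errors="replace")


page = "中文".encode("gb2312")
print(decode_body(gzip.compress(page)) == decode_body(page))   # True
```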
