I am trying to solve a "simple" problem in Python (2.7).
Suppose that I have two files:
key.txt - which holds a key to search for.
content.txt - which holds web content (an HTML file).
Both files are saved in UTF-8.
content.txt is a mixed file, meaning it contains non-English characters (a web HTML page).
I am trying to check whether the key in key.txt is found in the content or not.
I tried comparing the files as binary (bytes), which didn't work; I also tried decoding, which didn't work.
I would also appreciate any help on how I can search with a regex that is mixed (my pattern is built from English and non-English characters).
You should let the Python interpreter know that you are using UTF-8 encoding by adding this statement at the beginning of the file:
# encoding: utf-8
Then you can use u'yourString' to indicate that the string is a Unicode string.
Sample code:
import re

text = u'someString'
keyString = u'someKey'
f = re.findall(keyString, text)
You may need to call the encode('utf-8') method on the strings while performing some other operations on them.
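Putting it together for the two files from the question, a minimal sketch for Python 2.7 might look like this (assuming key.txt holds a single pattern on one line):
# encoding: utf-8
import io
import re

# io.open decodes the UTF-8 files to unicode objects as they are read
with io.open('key.txt', encoding='utf-8') as f:
    key = f.read().strip()
with io.open('content.txt', encoding='utf-8') as f:
    content = f.read()

# both operands are unicode, so a plain membership test works ...
if key in content:
    print u'key found'

# ... and so does a regex built from English and non-English characters
matches = re.findall(key, content, re.UNICODE)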
I have a CSV file encoded in UTF-8 (filled with information from a website through scraping with a Python script, with str(data_scraped.encode('utf-8')) at the end for the content).
When I import it into Excel (even if I pick 65001: Unicode (UTF-8) in the options), it doesn't display the special characters.
For example, it shows \xc3\xa4 instead of ä.
Any ideas what is going on?
I solved the problem.
The reason is that in the original code, I removed items such as \t and \n that were "polluting" the output with the replace function. I guess I removed too much and the result was no longer readable for Excel.
In the final version, I didn't use
str(data_scrapped.encode('utf-8')) but
data_scrapped.encode('utf-8','ignore').decode('utf-8')
Then I used split and join to remove the "polluting" terms:
string_split=data_scrapped.split()
data_scrapped=" ".join(string_split)
I need help with parsing XML data. Here's the scenario:
I have XML files loaded as strings into a PostgreSQL database.
I downloaded them to a text file for further analysis. Each line corresponds to an XML file.
The strings have different encodings. Some explicitly specify UTF-8, others windows-1252. There might be others as well; some don't specify the encoding in the string at all.
I need to parse these strings for data. The best approach I've found is the following:
from lxml import etree

encoded_string = bytes(bytearray(xml_data, encoding='utf-8'))
root = etree.fromstring(encoded_string)
When it doesn't work, I get two types of error messages:
"Extra content at the end of the document, line 1, column x (<string>, line 1)"
# x varies with string; I think it corresponds to the last character in the line
Looking at the lines raising exceptions, it looks like the "Extra content" error is raised by files with a windows-1252 encoding.
I need to be able to parse every string, ideally without having to alter them in any way after download. I've tried the following:
Apply 'windows-1252' as the encoding instead.
Reading the string as binary and then applying the encoding
Reading the string as binary and converting it directly with etree.fromstring
The last attempt produced this error: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
What can I do? I need to be able to read these strings but can't figure out how to parse them. The xml strings with the windows encoding all start with <?xml version="1.0" encoding="windows-1252" ?>
Given that the table column is text, all the XML content is being presented to Python as UTF-8; as a result, attempting to parse a conflicting XML encoding attribute will cause problems.
Maybe try stripping that attribute from the string.
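A minimal sketch of that idea (assuming each line of the downloaded text file holds one XML document as a str; the file name is a placeholder):
import re
from lxml import etree

# strip any leading <?xml ... ?> declaration so its encoding attribute
# cannot conflict with the already-decoded str from the database
declaration = re.compile(r'^\s*<\?xml[^>]*\?>')

with open('exported_xml.txt', encoding='utf-8') as f:
    for line in f:
        root = etree.fromstring(declaration.sub('', line.strip()))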
I solved the problem by removing encoding information, newline literals and carriage return literals. Every string was parsed successfully if I opened the files returning errors in vim and ran the following three commands:
:%s/\\r//g
:%s/\\n//g
:%s/<?.*?>//g
Then lxml parsed the strings without issue.
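If editing each file in vim is impractical, the same three substitutions can be done in Python (a sketch mirroring the commands above):
import re

def clean_line(line):
    # the literal \r and \n sequences (first two vim commands)
    line = line.replace('\\r', '').replace('\\n', '')
    # the <?...?> declaration (third command)
    return re.sub(r'<\?.*?\?>', '', line)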
Update:
I have a better solution. The problem was \n and \r literals in UTF-8 encoded strings I was copying to text files. I just needed to remove these characters from the strings with regexp_replace like so:
select regexp_replace(xmlcolumn, '\\n|\\r', '', 'g') from table;
now I can run the following and read the data with lxml without further processing:
psql -d database -c "copy (select regexp_replace(xml_column, '\\n|\\r', '', 'g') from resource ) to stdout" > output.txt
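Reading the exported file back in might then look like this (a sketch; output.txt is the file produced by the psql command above):
from lxml import etree

with open('output.txt', 'rb') as f:
    for line in f:
        # one XML document per exported line; bytes input avoids the
        # "encoding declaration" ValueError that str input raises
        root = etree.fromstring(line.strip())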
I have a .txt file which should contain German umlauts like ä, ö, ß, ü. But these characters don't appear as such; instead, what appears is Ã¤ instead of ä, Ãœ instead of Ü, and so on. This happens because the .txt file is stored in ANSI encoding. Now, when I import this file, with the respective columns as strings, into either SAS (data step) or Python (with .read_csv), these strange characters appear in the .sas7bdat file and the Python DataFrame as such, instead of the proper characters like ä, ö, ü, ß.
One workaround to solve this issue is:
Open the file in standard Notepad.
Press 'Save As' and then a window appears.
Then in the drop down, change encoding to UTF-8.
Now, when you import the file in SAS or Python, everything is imported correctly.
But sometimes the .txt files that I have are very big (in the GBs), so I cannot open them and apply this hack to solve the issue.
I could use the .replace() function to replace these strange characters with the real ones, but there could be combinations of strange characters that I am not aware of; that's why I wish to avoid that.
Is there any Python library which can automatically translate these strange characters into their proper characters - like Ã¤ gets translated to ä, and so on?
Did you try using the codecs library?
import codecs
your_file = codecs.open('your_file.extension', 'r', 'encoding_type')  # e.g. 'cp1252' for an ANSI file
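For the ANSI file from the question, that might be used like this (a sketch; cp1252 is assumed to be the "ANSI" code page, and the file names are placeholders):
import codecs

# read the ANSI original and write a UTF-8 copy line by line, so even
# multi-GB files never need to fit into memory at once
with codecs.open('input.txt', 'r', 'cp1252') as src, \
        codecs.open('input_utf8.txt', 'w', 'utf-8') as dst:
    for line in src:
        dst.write(line)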
If the file contains the correct code points, you just have to specify the correct encoding. Python 3 will default to UTF-8 on most sane platforms, but if you need your code to also run on Windows, you probably want to spell out the encoding.
with open(filename, 'r', encoding='utf-8') as f:
    # do things with f
If the file actually contains mojibake, there is no simple way in the general case to revert every possible way to screw up text, but a common mistake is assuming text was in Latin-1 and converting it to UTF-8 when in fact the input was already UTF-8. What you can do then is say you want Latin-1, and probably make sure you save the result in the correct format as soon as you have read it.
with open(filename, 'r', encoding='latin-1') as inp, \
        open('newfile', 'w', encoding='utf-8') as outp:
    for line in inp:
        outp.write(line)
The ftfy library claims to be able to identify and correct a number of common mojibake problems.
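A sketch of what that looks like (ftfy.fix_text is the library's main entry point; the sample string is only an illustration):
import ftfy  # pip install ftfy

broken = 'Die StraÃŸe ist schÃ¶n'   # mojibake for 'Die Straße ist schön'
print(ftfy.fix_text(broken))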
The HTML files I am dealing with are generally UTF-8 but have some broken encodings and therefore can't be decoded to Unicode. My idea is to parse them as binary and, as a first step, replace all the proper UTF-8 sequences with HTML codes.
e.g. "\xc2\xa3" to £
In a second step I would replace the broken encodings with proper ones.
I got stuck at the first step. Replacing a single character works with replace:
string.replace(b'\xc3\x84', b'&Auml;')
Taking the code mappings from a table doesn't work. When reading the table, the UTF-8 codes come out escaped (b'\\xc3\\x84') and I can't find a way to get rid of the doubled backslashes.
I can think of some dirty ways of solving this problem, but there should be a clean one, shouldn't there?
The best way is either to pre-filter them:
iconv -t utf8 -c SRCFILE > NEWFILE
Or, in Python:
with open("somefile_with_bad_utf8.txt", "rt", encoding="utf-8", errors="ignore") as myfile:
    for line in myfile:
        process(line)  # whatever per-line processing you need
I was going to say always use Python 3 for UTF-8, but I see you are already.
Hope that helps....
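As for the escaped table entries mentioned in the question, one possible way to turn them back into real bytes (a sketch, assuming each cell is read as the literal eight-character text \xc3\x84 and that the example data below is illustrative):
data = b'Stra\xc3\x9fe, \xc3\x84rger und \xc2\xa3'   # example HTML bytes

def unescape_cell(cell):
    # latin-1 maps code points 0-255 one-to-one onto byte values, so this
    # round trip only interprets the backslash escapes and nothing else
    return cell.encode('latin-1').decode('unicode_escape').encode('latin-1')

raw = unescape_cell(r'\xc3\x84')        # the text '\xc3\x84' -> the bytes b'\xc3\x84'
data = data.replace(raw, b'&Auml;')     # the same replace as in the question, driven by the table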
I'm working from an OpenOffice-produced .csv with mixed Roman and Chinese characters. This is an example of one row:
b'\xe5\xbc\x80\xe5\xbf\x83'b'K\xc4\x81i x\xc4\xabn'b'Open heart 'b'Happy '
This section contains two Chinese characters stored in binary, which I would like displayed as Chinese characters on the command line from a very basic Python 3 program (see bottom). How do I do this?
b'\xe5\xbc\x80\xe5\xbf\x83'b'K\xc4\x81i x\xc4\xabn'
When I open the .csv in OpenOffice I need to select "Chinese Simplified EUC-CN" as the character set, if that helps. I have searched extensively, but I do not understand Unicode and the pages I've found do not make sense.
import csv
f = open('Chinese.csv', encoding="utf-8")
file = csv.reader(f)
for line in file:
    for word in line:
        print(word.encode('utf-8'), end='')
    print("\n")
Thank you in advance for any suggestions.
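For what it's worth, a small variation on the loop above that prints the already-decoded strings instead of their re-encoded bytes should show the characters on a UTF-8-capable terminal (a sketch of one possible approach, not necessarily what solved it here):
import csv

with open('Chinese.csv', encoding='utf-8', newline='') as f:
    for row in csv.reader(f):
        # items in row are already str; printing them directly lets the
        # terminal render 开心 as characters instead of b'\xe5\xbc\x80...'
        print(' '.join(row))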
Thanks to a suggestion by @eryksun, I solved my issue by re-encoding the source file to UTF-8 from ASCII. The question is different, but the solution is here:
http://www.stackoverflow.com/a/542899/792015
Alternatively, if you are using Eclipse, you can paste a non-Roman character (such as a Chinese character like 大) into your source code and save the file. If the source is not already UTF-8, Eclipse will offer to change it for you.
Thank you for all your suggestions and my apologies for answering my own question.
Footnote: If anyone knows why changing the source file type affects the compiled program, I would love to know. According to https://docs.python.org/3/tutorial/interpreter.html the interpreter treats source files as UTF-8 by default.