I've got this function, which I modified from material in chapter 1 of the online NLTK book. It's been very useful to me but, despite reading the chapter on Unicode, I feel just as lost as before.
def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    tokens = nltk.wordpunct_tokenize(rawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab
When I tried it the other day on Also Sprach Zarathustra, it clobbered words with an umlaut over the o's and u's. I'm sure some of you will know why that happened. I'm also sure that it's quite easy to fix. I know that it just has to do with calling a function that re-encodes the tokens into unicode strings. If so, it seems to me it might not happen inside that function definition at all, but here, where I prepare to write to file:
def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)
    jottedf = '\n'.join(jotted)
    filemydata.write(jottedf)
    filemydata.close()
    return 0
I heard that what I had to do was encode the string into unicode after reading it from the file. I tried amending the function like so:
def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    unirawness = rawness.decode('utf-8')
    tokens = nltk.wordpunct_tokenize(unirawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab
But that brought this error when I used it on Hungarian. When I used it on German, I had no errors.
>>> import bookroutines
>>> elles1 = bookroutines.openbookreturnvocab("lk1-les1")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "bookroutines.py", line 9, in openbookreturnvocab
nltktext = nltk.Text(tokens)
File "/usr/lib/pymodules/python2.6/nltk/text.py", line 285, in __init__
self.name = " ".join(map(str, tokens[:8])) + "..."
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 4: ordinal not in range(128)
I fixed the function that files the data like so:
def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)
    jottedf = u'\n'.join(jotted)
    filemydata.write(jottedf)
    filemydata.close()
    return 0
However, that brought this error when I tried to file the German text:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "bookroutines.py", line 23, in jotindex
filemydata.write(jottedf)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 414: ordinal not in range(128)
>>>
...which is what you get when you try to write the u'\n'.join'ed data.
>>> jottedf = u'\n'.join(elles1)
>>> filemydata.write(jottedf)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 504: ordinal not in range(128)
Each string that you read from your file can be converted to unicode by calling rawness.decode('utf-8'), provided the text is in UTF-8. You will end up with unicode objects. Also, I don't know what "jotted" is, but you may want to make sure it's a unicode object and use u'\n'.join(jotted) instead.
Update:
It appears that the NLTK library doesn't like unicode objects. Fine, then you have to make sure that you are using str instances with UTF-8 encoded text. Try using this:
tokens = nltk.wordpunct_tokenize(unirawness)
nltktext = nltk.Text([token.encode('utf-8') for token in tokens])
and this:
jottedf = u'\n'.join(jotted)
filemydata.write(jottedf.encode('utf-8'))
but if jotted is really a list of UTF-8-encoded str, then you don't need the encoding step and this should be enough:
jottedf = '\n'.join(jotted)
filemydata.write(jottedf)
By the way, it looks as though NLTK isn't very careful with respect to unicode and encoding (at least in the demos). Better be careful and check that it has processed your tokens correctly. Also, check your encodings; a mismatch there may be why you get errors with the Hungarian text but not the German.
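Putting those pieces together, a minimal sketch of both functions might look like this (assuming the input file really is UTF-8 and that jotted holds the UTF-8-encoded tokens produced by the first function):
import nltk

def openbookreturnvocab(book):
    # Read the raw bytes and decode them from UTF-8 into a unicode object.
    rawness = open(book).read().decode('utf-8')
    tokens = nltk.wordpunct_tokenize(rawness)
    # This NLTK version chokes on unicode tokens, so feed it UTF-8 str instances.
    nltktext = nltk.Text([token.encode('utf-8') for token in tokens])
    nltkwords = [w.lower() for w in nltktext]
    return sorted(set(nltkwords))

def jotindex(jotted, filename, readmethod):
    # jotted is assumed to be a list of UTF-8-encoded str, so a plain join suffices.
    filemydata = open(filename, readmethod)
    filemydata.write('\n'.join(jotted))
    filemydata.close()
    return 0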
Related
I am running Python 3.7.x and am trying to figure out how to encode a string, {CTF-FLAG1}, using zero width steganography.
I am using zwsp-steg-py to do so, but I do not know how to use it to encode text into other text; see below.
I want to encode {CTF-FLAG1} inside the text "Now you see me, now you don't." using zero width steganography.
I installed zwsp-steg-py and tried:
#coding=utf-8
import zwsp_steg
encoded = zwsp_steg.encode("{CTF-Flag1}", zwsp_steg.MODE_ZWSP)
decoded = zwsp_steg.decode(encoded)
print(decoded)
Yet, the result is:
C:\Users\jerry\Desktop>python decode.py
Traceback (most recent call last):
File "decode.py", line 5, in <module>
decoded = zwsp_steg.decode(encoded)
File "C:\Python367-64\lib\site-packages\zwsp_steg\steganography.py", line 72, in decode
raise TypeError('Unknown encoding detected!')
TypeError: Unknown encoding detected!
I don't think I'm doing it right.
#coding=utf-8
import zwsp_steg
encoded = zwsp_steg.encode("{CTF-Flag1}", zwsp_steg.MODE_ZWSP)
decoded = zwsp_steg.decode(encoded, zwsp_steg.MODE_ZWSP)
print(decoded)
# example with string padding
encoded += "This is a test string"
print(encoded)
decoded_the_string = zwsp_steg.decode(encoded, zwsp_steg.MODE_ZWSP)
print(decoded_the_string)
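For the cover text from the question, a sketch along the same lines might be (this assumes, as the padding example above does, that decode accepts the mode argument and tolerates visible characters mixed in with the zero-width payload):
#coding=utf-8
import zwsp_steg

cover = "Now you see me, now you don't."
# Turn the flag into zero-width characters, then add the visible cover text,
# mirroring the padding example above.
stego = zwsp_steg.encode("{CTF-FLAG1}", zwsp_steg.MODE_ZWSP) + cover
print(stego)  # looks like the plain cover sentence

# Decode with the same mode that was used to encode.
print(zwsp_steg.decode(stego, zwsp_steg.MODE_ZWSP))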
I started off by importing Tweepy and managed to get a program that produces correct output for a search parameter. However, while trying to create a program that can save and store a person's timeline for data analysis, I came across a TypeError: must be str, not ResultSet.
import tweepy
#API keys access
auth = tweepy.OAuthHandler("", "")
auth.set_access_token("", "")
client = tweepy.API(auth)
#Opening a file with the name of the user wanted
screen = input("Enter the screen name: ")
filename = (screen+".txt")
file = open(filename, "w")
#Getting the Users time line
user = client.get_user(screen_name=screen)
timeline = user.timeline()
#Writing new found data to the file.
file.write(timeline)
file.close()
This code keeps on spitting out the error:
Traceback (most recent call last):
File "GetTimeLine.py", line 20, in <module>
file.write(timeline)
TypeError: must be str, not ResultSet
However, for the offending line:
file.write(timeline)
I tried converting it to str with
file.write(str(timeline))
which threw an entirely different error:
Traceback (most recent call last):
File "GetTimeLine.py", line 20, in <module>
file.write(str(timeline))
File "C:\Program Files (x86)\Python\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 4122-4123: character maps to <undefined>
To try to fix this I added .encode("utf-8"), but with no luck.
Any help greatly appreciated.
Try this. I searched for how to convert a ResultSet to a string and found this:
unicode.join(u'\n',map(unicode,result))
Here "result" is your "timeline" variable; substitute it and you will get the data in string format. Simply calling str(timeline) will not encode it, which is why you are getting the other error.
Just try it :) Cheers
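If you are on Python 3 (your traceback paths suggest so, and unicode no longer exists there), another option is to join the tweet texts yourself and open the file with an explicit encoding so the Windows default (cp1252) isn't used. This is a sketch that assumes each item in timeline is a tweepy Status with a .text attribute:
import tweepy

auth = tweepy.OAuthHandler("", "")
auth.set_access_token("", "")
client = tweepy.API(auth)

screen = input("Enter the screen name: ")
user = client.get_user(screen_name=screen)
timeline = user.timeline()

# Join the text of each tweet; str() on the whole ResultSet only gives you reprs.
jotted = "\n".join(status.text for status in timeline)

# Open with an explicit encoding so the Windows default (cp1252) is not used.
with open(screen + ".txt", "w", encoding="utf-8") as f:
    f.write(jotted)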
Please go through the archival data USA GOV Sample Data.
I want to read this file in R, but I get the error below:
result = fromJSON(textFileName)
Error in fromJSON(textFileName) : unexpected character 'u'
When I try to read it in Python, I get the error below:
import json
records = [json.loads(line) for line in open(path)]
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4088: character maps to <undefined>
Can someone please help me with how to read this kind of file?
I couldn't get the code the OP provided in the question to work on my system either (Windows/RStudio/Jupyter). I dug around and found this for R, adapting it to this case:
library(jsonlite)
out <- lapply(readLines("usagov_bitly_data2013-05-17-1368817803"), fromJSON)
df<-data.frame(Reduce(rbind, out))
Although the error I got in R is curiously different from yours.
result = fromJSON("usagov_bitly_data2013-05-17-1368817803")
#Error in parse_con(txt, bigint_as_char) : parse error: trailing garbage
# [ 34.730400, -86.586098 ] } { "a": "Mozilla\/5.0 (Windows N
# (right here) ------^
For Python, as mentioned by juanpa, it seems to be a matter of encoding. The following code works for me.
import json
import os
path=os.path.abspath("usagov_bitly_data2013-05-17-1368817803")
print(path)
file = open(path, encoding="utf8")
records = [json.loads(line) for line in file]
Solution in R:
library(jsonlite)
# if you have a local file
conn <- gzcon(file("usagov_bitly_data2013-05-17-1368817803.gz", "rb"))
# if you read it from URL
conn <- gzcon(url("http://1usagov.measuredvoice.com/bitly_archive/usagov_bitly_data2013-05-17-1368817803.gz"))
data <- stream_in(conn)
Python 2.6, upgrading not an option
The script is designed to take fields from an ArcGIS database and create Oracle INSERT statements in a text file that can be used at a later date. There are 7500 records; after about 3000 records it errors out and says the problem lies at:
fieldValue = unicode(str(row.getValue(field.name)),'utf-8',errors='ignore')
I have tried seemingly every variation of unicode and encode. I am new to Python and really just need someone with experience to look at my code and see where the problem is.
import arcpy
#Where the GDB Table is located
fc = "D:\GIS Data\eMaps\ALDOT\ALDOT_eMaps_SignInventory.gdb/SignLocation"
fields = arcpy.ListFields(fc)
cursor = arcpy.SearchCursor(fc)
#text file output
outFile = open(r"C:\Users\kkieliszek\Desktop\transfer.text", "w")
#insert values into table billboard.sign_inventory
for row in cursor:
    outFile.write("Insert into billboard.sign_inventory() Values (")
    for field in fields:
        fieldValue = unicode(str(row.getValue(field.name)),'utf-8',errors='ignore')
        if row.isNull(field.name) or fieldValue.strip() == "": #if field is Null or a Empty String print NULL
            value = "NULL"
            outFile.write('"' + value + '",')
        else: #print the value in the field
            value = str(row.getValue(field.name))
            outFile.write('"' + value + '",')
    outFile.write("); \n\n ")
outFile.close() # This closes the text file
Error Code:
Traceback (most recent call last):
File "tablemig.py", line 25, in <module>
fieldValue = unicode(str(row.getValue(field.name)),'utf-8',errors='ignore')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 76: ordinal not in range(128)
Never call str() on a unicode object:
>>> str(u'\u2019')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 0: ordinal not in range(128)
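If what you need is a byte string, encode the unicode object explicitly instead:
>>> u'\u2019'.encode('utf-8')
'\xe2\x80\x99'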
To write rows that contain Unicode strings in csv format, use UnicodeWriter instead of formatting the fields manually. It should fix several issues at once.
Text-mode file wrappers make manual encoding/decoding unnecessary.
Assuming the value coming from the row is a unicode string, simply use io.open() with the encoding parameter set to the required encoding.
For example:
import io
with io.open(r"C:\Users\kkieliszek\Desktop\transfer.text", "w", encoding='utf-8') as my_file:
    my_file.write(my_unicode)
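Applied to your original loop, a sketch under that assumption (row values coming back as unicode strings, numbers, or None) might look like:
# -*- coding: utf-8 -*-
import io
import arcpy

fc = "D:\GIS Data\eMaps\ALDOT\ALDOT_eMaps_SignInventory.gdb/SignLocation"
fields = arcpy.ListFields(fc)
cursor = arcpy.SearchCursor(fc)

with io.open(r"C:\Users\kkieliszek\Desktop\transfer.text", "w", encoding='utf-8') as outFile:
    for row in cursor:
        outFile.write(u"Insert into billboard.sign_inventory() Values (")
        for field in fields:
            fieldValue = row.getValue(field.name)
            # unicode() is safe here as long as the value is not a non-ASCII byte string.
            if row.isNull(field.name) or unicode(fieldValue).strip() == "":
                outFile.write(u'"NULL",')
            else:
                outFile.write(u'"' + unicode(fieldValue) + u'",')
        outFile.write(u"); \n\n ")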
The problem is that you need to decode or encode the string instead of just calling str on it. If you have a byte string, call decode on it to turn it into a unicode object, ignoring any non-ASCII content. Conversely, if you have a unicode object, call encode on it to turn it into a byte string, again ignoring anything that isn't ASCII. So, just use this function instead:
import re

def remove_unicode(string_data):
    """ (str|unicode) -> (str|unicode)
    recovers ascii content from string_data
    """
    if string_data is None:
        return string_data
    if isinstance(string_data, str):
        string_data = str(string_data.decode('ascii', 'ignore'))
    else:
        string_data = string_data.encode('ascii', 'ignore')
    remove_ctrl_chars_regex = re.compile(r'[^\x20-\x7e]')
    return remove_ctrl_chars_regex.sub('', string_data)

fieldValue = remove_unicode(row.getValue(field.name))
It should fix the problem.
I try to read an email from a file, like this:
import email
with open("xxx.eml") as f:
msg = email.message_from_file(f)
and I get this error:
Traceback (most recent call last):
File "I:\fakt\real\maildecode.py", line 53, in <module>
main()
File "I:\fakt\real\maildecode.py", line 50, in main
decode_file(infile, outfile)
File "I:\fakt\real\maildecode.py", line 30, in decode_file
msg = email.message_from_file(f) #, policy=mypol
File "C:\Python33\lib\email\__init__.py", line 56, in message_from_file
return Parser(*args, **kws).parse(fp)
File "C:\Python33\lib\email\parser.py", line 55, in parse
data = fp.read(8192)
File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1920: character maps to <undefined>
The file contains a multipart email, where the part is encoded in UTF-8. The file's content or encoding might be broken, but I have to handle it anyway.
How can I read the file, even if it has Unicode errors? I cannot find the policy object compat32, and there seems to be no way to handle an exception and let Python continue right where the exception occurred.
What can I do?
To parse an email message in Python 3 without unicode errors, read the file in binary mode and use email.message_from_binary_file(f) (or email.message_from_bytes(f.read())) to parse the content (see the documentation of the email.parser module).
Here is code that parses a message in a way that is compatible with Python 2 and 3:
import email
with open("xxx.eml", "rb") as f:
try:
msg = email.message_from_binary_file(f) # Python 3
except AttributeError:
msg = email.message_from_file(f) # Python 2
(tested with Python 2.7.13 and Python 3.6.0)
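Once the message is parsed from bytes, you can decode each text part's payload yourself, for example (a sketch; it falls back to UTF-8 with replacement characters when a part doesn't declare a charset):
for part in msg.walk():
    if part.get_content_maintype() == 'text':
        raw = part.get_payload(decode=True)  # bytes, transfer encoding undone
        charset = part.get_content_charset() or 'utf-8'
        print(raw.decode(charset, 'replace'))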
I can't test on your message, so I don't know if this will actually work, but you can do the string decoding yourself:
with open("xxx.eml", encoding='utf-8', errors='replace') as f:
text = f.read()
msg = email.message_from_string(f)
That's going to get you a lot of replacement characters if the message isn't actually in UTF-8. But if it's got \x81 in it, UTF-8 is my guess.
with open('email.txt','rb') as f:
    # decode, replacing undecodable bytes with \xNN escapes
    # (backslashreplace for decoding needs Python 3.5+)
    ascii_text = f.read().decode('ascii', 'backslashreplace')
with open('email.txt','w') as f:
    f.write(ascii_text)
#now do your processing stuff
I doubt it is the best way to handle this ... but it's at least a way ...
A method which works on Python 3: it finds the encoding and reloads the file with the correct one.
msg = email.message_from_file(open('file.eml', errors='replace'))
codes = [x for x in msg.get_charsets() if x is not None]
if len(codes) >= 1:
    msg = email.message_from_file(open('file.eml', encoding=codes[0]))
I have tried msg.get_charset(), but it sometimes returns None while another encoding is available, hence the slightly more involved encoding detection.