Goal: I just want to take the comma away, as that is the only character that will break my (course-required) file parsing for Bayesian analysis (i.e. word,2,4 instead of, say, word,,2,4).
So I'm currently trying to read in an email in the form of a text file from the Enron public corpus online and building a bayesian spam filter.
I've noticed that reading in some of the files raises errors when trying to manipulate the strings that are present. I am fully aware that some of these files contain viruses, so the encoding of some of the characters might not be valid. However, I'm trying to simply replace a comma within a string and I'm getting the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 1169: ordinal not in range(128)
I have tried everything that this forum has to offer and I've searched everywhere for a solution, such as:
with open(file+file_path_stings[i],'r') as filehandle:
    words = str(filehandle.read())
    words = words.replace(',','')
    words = words.split()
I've also tried many regex attempts... this is one of the versions:
with open(file+file_path_stings[i],'r') as filehandle:
    words = str(filehandle.read())
    words = re.sub(',','',words)
    words = words.split()
Now, I could simply regex out everything that isn't A-Za-z, but I'm noticing that spam accuracy is heavily affected by the fact that a lot of the spam files contain such special characters.
Any suggestion would be most appreciated. Thanks.
-Robert
If you just want to remove the extra comma and, as you said, nothing else is working out, you can use a simple split and join (assuming the comma is the only delimiter here):
','.join([s for s in 'word,,2,4'.split(',') if s])
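If the UnicodeDecodeError itself is what's blocking you, it may also help to decode the file explicitly with a lenient error handler rather than relying on the implicit ASCII codec. A minimal sketch, assuming Python 2 (the 'ascii' codec error suggests it) and guessing latin-1 for the corpus:

import io

# Decode explicitly and substitute undecodable bytes so .replace() never
# has to guess an encoding. 'latin-1' is only a guess for the Enron corpus.
with io.open(file + file_path_stings[i], 'r', encoding='latin-1', errors='replace') as filehandle:
    words = filehandle.read().replace(u',', u'').split()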
So I ended up using another implementation that I found useful as well. It turns out that, for some reason, Python retains any prior information it had for the strings that were originally present, so I've learned it's always a good idea to re-assign the result to a different (new) variable, as follows:
with open(file+file_path_stings[i],'r') as filehandle:
    words = str(filehandle.read()).split()

new_array = []
for word in words:
    new_array.append(word.replace(',','').lower())

return new_array
It's a little more expensive as far as storing and assigning data to a whole other variable goes. However, I've noticed it's a lot safer in terms of your string not getting cast to a Unicode string. The original problem was this output:
print words
[u'hello,',u'what?',u'is',u'going',u'on?']
The comma in 'hello,' would not get replaced. With the code above you're guaranteed that the comma will be stripped from each word without it being cast into a Unicode string:
print new_array
['hello','what?',u'is',u'going',u'on?']
As far as the performance of the code goes, I'm still training on massive files at a decent speed, so it shouldn't affect you that much.
Hope this helps!
-Robert
I have a list of 77 items. I have placed all 77 items in a text file (one per line).
I am trying to read this into my python script (where I will then compare each item in a list, to another list pulled via API).
Problem: for some reason, 2/77 of the items on the list have encoding issues, giving me characters like \u00c2 and \u00a2, which means they are not comparing correctly and are being missed. I have no idea why these 2/77 have this encoding while the other 75 are fine, and I don't know how to get rid of the encoding in Python.
Question:
In Python, how can I get rid of the encoding to ensure none of the items have any special/weird characters and they are just plain text?
Is there a method I can use to do this upon reading the file in?
Here is how I am reading the text file into python:
with open("name_list2.txt", "r") as myfile:
policy_match_list = myfile.readlines()
policy_match_list = [x.strip() for x in policy_match_list]
Note - "policy_match_list" is the list of 77 policies read in from the text file.
Here is how I am comparing my two lists:
for policy_name in policy_match_list:
    for us_policy in us_policies:
        if policy_name == us_policy["name"]:
            print(f"Match #{match} | {policy_name}")
            match += 1
Note - "us_policies" is another list of thousands of policies, pulled via API that I am comparing to
This results in 75/77 expected matches, because the other 2 policies end up comparing e.g. "text12 - text" to "text12\u00c2-\u00a2text" rather than "text12 - text" to "text12 - text".
I hope this makes sense, let me know if I can add any further info
Cheers!
Did you try to open the file while decoding from UTF-8? Because I can't see the file I can't tell whether this is the problem, but the file might have characters that the default decoding option (which I think is Latin) can't process.
Try doing:
with open("name_list2.txt", "r", encoding="utf-8") as myfile:
Also, you can watch this question about how to treat control characters: Python - how to delete hidden signs from string?
Sorry about not posting it as a comment (as I really don't know if this is the solution), I don't have enough reputation for that.
Certain Unicode characters aren't properly decoded in some cases. In your case, the characters \u00c2 and \u00a2 caused the issue. As of now, I see two fixes:
1. Try to resolve the encoding by replacing the characters (refer to https://stackoverflow.com/a/56967370).
2. Copy the text to a new plain text file (if possible) and save it. These extra characters tend to get ignored in that case and consequently removed.
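For the first option, a minimal sketch of replacing the offending characters at read time (assuming \u00c2 and \u00a2 are the only offenders; the helper name clean_name is mine):

def clean_name(s):
    # Strip out the stray characters seen in the two problem lines;
    # extend this tuple if other junk characters turn up.
    for junk in ("\u00c2", "\u00a2"):
        s = s.replace(junk, "")
    return s.strip()

with open("name_list2.txt", "r", encoding="utf-8") as myfile:
    policy_match_list = [clean_name(x) for x in myfile.readlines()]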
I currently have several texts coming in which sometimes contain the 'invalid character' character, e.g. \uf0b7 or \uf077. I don't have a way of knowing which of the invalid character codes a specific text might contain, and I wondered if there was a way to make sure that a string is cleaned of all types of 'invalid character', since a process later on (which depends on a third-party package) cannot receive a string which contains it.
I've tried searching for a solution, but all I get is answers regarding regular characters that people want removed (e.g. '^%$&*') and have classified as invalid characters; however, I want to remove/replace the actual 'invalid character' character in all its forms.
The Python library codecs might be of help. Take a look at the documentation here: https://docs.python.org/2/library/codecs.htm
In my use case, I was doing some analysis of documents that had non-ASCII text. For my purposes, ignoring the invalid characters was acceptable. I opened the files with the following line and was able to parse the corpus.
for filename in os.listdir(ROOT_DIR):
    with codecs.open(os.path.join(ROOT_DIR, filename), encoding='UTF8', errors='replace') as f:
        text = f.read()  # parse the decoded text from here on
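Note that errors='replace' swaps each undecodable byte for the U+FFFD replacement character, while errors='ignore' drops it entirely; either way the decode no longer raises. A quick illustration of the difference:

data = b"caf\xc1"                             # 0xc1 is not valid UTF-8
print(data.decode("utf8", errors="replace"))  # 'caf\ufffd'
print(data.decode("utf8", errors="ignore"))   # 'caf'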
I had a similar issue. It turns out private use area characters are in the Co general category, as returned by category() in unicodedata.
I fixed my problem as follow:
import unicodedata

def is_pua(c):
    return unicodedata.category(c) == 'Co'

content = "This\uf0b7 is a \uf0b7string \uf0c7with private \uf0b7use are\uf0a7as blocks\uf0d7."
"".join([char for char in content if not is_pua(char)])
This outputs:
'This is a string with private use areas blocks.'
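If text keeps arriving continuously, it may be handy to wrap the filter in a small helper (the name strip_pua is mine, just a sketch of the same idea):

import unicodedata

def strip_pua(text):
    # Drop every character whose Unicode general category is 'Co'
    # (private use areas), keeping everything else untouched.
    return "".join(c for c in text if unicodedata.category(c) != 'Co')

print(strip_pua("bullet\uf0b7 point"))  # 'bullet point'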
I am a complete beginner when it comes to Python, and currently getting closer towards the end of LPTHW. Now, in exercise 41, there is a line of code in a for-loop that I do not quite get.
I have searched online to the best of my abilities, but as I am still learning, I was not completely sure how to even search for this.
To clarify:
WORD_URL is just a series of words.
WORDS is an empty list.
This is the loop:
for word in urlopen(WORD_URL).readlines():
    WORDS.append(str(word.strip(), encoding="utf-8"))
Now what I do not really understand is what this WORDS.append(str(word.strip(), encoding="utf-8")) does. Why is encoding="utf-8" included, and what does it do in this context? I suspect the use of str here is connected to this in some way, but I'm not completely sure. Would it not be possible to simply have it like this:
.append(word.strip())?
Thanks!
The code snippet
WORDS.append(str(word.strip(), encoding='utf-8'))
strips the whitespace from word and decodes word (a bytes object, since urlopen returns bytes) into a string, using UTF-8.
The str call takes care of the decoding.
The code you provided
WORDS.append(word.strip())
would place the variable word, with the whitespace stripped, at the end of the list WORDS.
The difference is that no decoding happens, so word stays a bytes object in whatever encoding it arrived in. In the first example, every item placed in WORDS is a proper string decoded from UTF-8.
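A quick way to see the difference (the byte string below is made up purely for illustration): readlines() on a urlopen response yields bytes, and bytes stay bytes until they are decoded.

line = b"hello\r\n"                         # what urlopen().readlines() hands you
print(line.strip())                         # b'hello'  -- still a bytes object
print(str(line.strip(), encoding="utf-8"))  # 'hello'   -- a real str, decoded from UTF-8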
I am working on a Python program which contains an Arabic-English database and allows the user to update this database and also to study the vocabulary. I am almost done with implementing all the functions I need, but the most important part is missing: the encoding of the Arabic strings. To append new vocabulary to the database txt file, a dictionary is created and then its content is appended to the file. To study vocabulary, the content of the txt file is converted into a dictionary again, a random word is printed to the console and the user is asked for its translation. Now the idea is that the user has the possibility to write the English word as well as the Arabic word in Latin letters, and the program will internally convert the pseudo-Arabic string to Arabic letters. For example, if the user writes 'b' when asked for the Arabic word, I want to append 'ب'.
1. There are about 80 signs I have to consider in the implementation. Is there a way of creating some mapping between the Latin-letter input string and the respective Arabic signs? For me, the most intuitive idea would be to write one if statement after another, but that's probably super slow.
2. I have trouble printing the Arabic string to the console. This input
print('bla{}!'.format(chr(0xfe9e)))
print('bla{}!'.format(chr(int('0x'+'0627',16))))
will result in printing the Arabic sign whereas this won't:
print('{}'.format(chr(0xfe9e)))
What can I do in order to avoid this problem, since I want a sequence which consists of unicode symbols only?
Did you try the encode/decode functions? For example, you can write:
u = ("سلام".encode('utf-8'))
print(u.decode('utf-8'))
This is not the final answer but can give you a start.
First of all check your encoding:
import sys
sys.getdefaultencoding()
Edit:
sys.setdefaultencoding('UTF8') was removed from the sys module. But still, you can comment with what sys.getdefaultencoding() returns on your box.
However, for Arabic characters, you can range them all at once:
According to this website, Arabic characters are from 0x620 to 0x64B and Basic Latin characters are from 0x0061 to 0x007B (for lower cases).
So:
arabic_chr = [chr(k) for k in range(0x620, 0x064B, 1)]
latin_chr = [chr(k) for k in range(0x0061, 0x007B, 1)]
Now, all you have to do is find a relation between the two lists, or maybe extend the ranges further (I speak Arabic and I know that there are many forms of one character, and a character can change from one word to another).
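Rather than one if statement after another, a plain dictionary lookup handles the mapping in a single pass. A small sketch, where the table below is made up and covers only a few of the ~80 signs you would need:

# Hypothetical transliteration table; extend it to all the signs you need.
LATIN_TO_ARABIC = {
    'b': '\u0628',  # ب
    't': '\u062a',  # ت
    's': '\u0633',  # س
}

def to_arabic(latin_word):
    # Fall back to the original character when there is no mapping.
    return ''.join(LATIN_TO_ARABIC.get(c, c) for c in latin_word)

print(to_arabic('bts'))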
So I'm trying to parse a bunch of citations from a text file using the re module in Python 3.4 (on, if it matters, a Mac running Mavericks). Here's some minimal code. Note that there are two commented lines: they represent two alternative searches. (Obviously, the little one, r'Rawls', is the one that works.)
def makeRefList(reffile):
    print(reffile)
    # namepattern = r'(^[A-Z1][A-Za-z1]*-?[A-Za-z1]*),.*( \(?\d\d\d\d[a-z]?[.)])'
    # namepattern = r'Rawls'
    refsTuplesList = re.findall(namepattern, reffile, re.MULTILINE)
    print(refsTuplesList)
The string in question is ugly, and so I stuck it in a gist: https://gist.github.com/paultopia/6c48c398a42d4834f2ae
As noted, the search string r'Rawls' produces expected output ['Rawls', 'Rawls']. However, the other search string just produces an empty list.
I've confirmed that this regex (partially) works using the regex101 tester. Confirmation here: https://regex101.com/r/kP4nO0/1 -- it matches what I expect it to match. Since it works in the tester, it should work in the code, right?
(n.b. I copied the text from terminal output from the first print command, then manually replaced \n characters in the string with carriage returns for regex101.)
One possible issue is that Python has appended the bytes prefix (is the little b called a "flag"?) to the string. This is an artifact of my attempt to convert the text from utf-8 to ascii, and I haven't figured out how to make it go away.
Yet re clearly is able to parse strings in that form. I know this because I'm converting two text files from utf-8 to ascii, and the following code works perfectly fine on the other string, converted from the other text file, which also has a little b in front of it:
def makeCiteList(citefile):
    print(citefile)
    citepattern = r'[\s(][A-Z1][A-Za-z1]*-?[A-Za-z1]*[ ,]? \(?\d\d\d\d[a-z]?[\s.,)]'
    rawCitelist = re.findall(citepattern, citefile)
    cleanCitelist = cleanup(rawCitelist)
    finalCiteList = list(set(cleanCitelist))
    print(finalCiteList)
    return(finalCiteList)
The other chunk of text, which the code immediately above matches correctly: https://gist.github.com/paultopia/a12eba2752638389b2ee
The only hypothesis I can come up with is that the first, broken regex is choking on the combination of newline characters and the string being treated as a bytes object, even though a) I know the regex is correct for newlines (confirmed by the linked regex101), and b) I know it matches the strings (confirmed by the successful match on the other string).
If that's true, though, I don't know what to do about it.
Thus, questions:
1) Is my hypothesis right that it's the combination of newlines and b that blows up my regex? If not, what is?
2) How do I fix that?
a) replace the newlines with something in the string?
b) rewrite the regex somehow?
c) somehow get rid of that b and make it into a normal string again? (how?)
thanks!
Addition
In case this is a problem I need to fix upstream, here's the code I'm using to get the text files and convert them to ASCII, replacing non-ASCII characters. This function gets called on UTF-8 .txt files saved by TextWrangler in Mavericks:
def makeCorpoi(citefile, reffile):
    citebox = open(citefile, 'r')
    refbox = open(reffile, 'r')
    citecorpus = citebox.read()
    refcorpus = refbox.read()
    citebox.close()
    refbox.close()
    corpoi = [str(citecorpus), str(refcorpus)]
    return corpoi
and then this function gets called on each element of the list the above function returns.
def conv2ASCII(bigstring):
    def convHandler(error):
        return ('1FOREIGN', error.start + 1)
    codecs.register_error('foreign', convHandler)
    bigstring = bigstring.encode('ascii', 'foreign')
    stringstring = str(bigstring)
    return stringstring
Aah. I've tracked it down and answered my own question. Apparently one needs to call a decode method on the encoded thing, not just str(). The following code produces an actual string, with newlines and everything, out the other end (though now I have to fix a bunch of other bugs before I can figure out if the final output is as expected):
def conv2ASCII(bigstring):
    def convHandler(error):
        return ('1FOREIGN', error.start + 1)
    codecs.register_error('foreign', convHandler)
    bigstring = bigstring.encode('ascii', 'foreign')
    newstring = bigstring.decode('ascii', 'foreign')
    return newstring
Apparently the str() function doesn't do the same job, for reasons that are mysterious to me. This is despite an answer here (How to make new line commands work in a .txt file opened from the internet?) which suggests that it does.
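For anyone else hitting the same wall: in Python 3, str() on a bytes object just produces the repr (with the leading b and literal \n escapes, all on one line), while .decode() returns actual text with real newlines, which is what a re.MULTILINE pattern needs. A tiny illustration (the citation bytes are made up):

data = b"Rawls, John. 1971\nRawls, John. 1999\n"
print(str(data))             # "b'Rawls, John. 1971\\nRawls, John. 1999\\n'" -- the repr, no real newlines
print(data.decode('ascii'))  # two real lines that a MULTILINE regex can anchor on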