With Python, I'm trying to take a text file and create one long list of words (with the words in the order they appear in the document).
What I have so far goes through each line and then just basically adds the words to the long list.
It is supposed to lowercase each word, and remove any punctuation it finds.
wordstory=[a.lower().strip(string.punctuation) for b in [line.split() for line in open('alice.txt')] for a in b]
It seems that some punctuation isn't recognized by .strip(string.punctuation) for removal, and further, in some cases, the punctuation gets converted to odd codes.
I end up with situations like this, where \xe2\x80\x94 isn't supposed to be there at all:
..
'she',
'spoke\xe2\x80\x94fancy',
'curtseying',
..
Also, when an apostrophe occurs next to a double quotation mark, the apostrophe isn't removed by .strip(string.punctuation). I end up with:
..
'she',
"couldn't",
'answer',
..
Can someone provide some code that will help, and/or point me to a resource that will help me understand what is going on?
I think you're having Unicode problems, and the list comprehension is making things unnecessarily hard to follow.
I'd recommend doing something like this:
# -*- coding: utf-8 -*-
import string
with open("text_file.txt", "r") as f:
    raw_text = f.read()
# stripping punctuation
punctuation = set(string.punctuation)
trimmed_text = ''.join(char for char in raw_text if char not in punctuation)
# splitting into list
word_list = trimmed_text.split()  # split on any whitespace, not just single spaces
# removing duplicates
unique_word_list = set(word_list)
# or if you're preserving the order, maybe try:
unique_word_list = []
for word in word_list:
    if word not in unique_word_list:
        unique_word_list.append(word)
print(unique_word_list)
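If the goal is still a single list of lowercased words in document order (no deduplication), here is a minimal sketch for Python 3, assuming alice.txt is UTF-8 encoded; the extra characters added to the strip set are my own guess at what's in the text:
import string

# the em dash and curly quotes aren't in string.punctuation, so handle them explicitly
extra = '\u2014\u2018\u2019\u201c\u201d'
words = []
with open('alice.txt', encoding='utf-8') as f:  # decode explicitly so \xe2\x80\x94 becomes a single em dash
    text = f.read().lower()
    for ch in extra:
        text = text.replace(ch, ' ')   # treat them as word separators
    for token in text.split():
        word = token.strip(string.punctuation)
        if word:
            words.append(word)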
If you want to remove all punctuation, use translate and string.maketrans:
In [94]: import string
In [95]: a ="she's all foo!"
In [96]: a.lower().translate(string.maketrans("",""), string.punctuation)
Out[96]: 'shes all foo'
str.strip only removes chars from the end or start of a string.
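Note that this two-argument translate / string.maketrans combination is Python 2 only; a rough Python 3 equivalent uses str.maketrans, whose third argument lists the characters to delete:
import string

a = "she's all foo!"
print(a.lower().translate(str.maketrans('', '', string.punctuation)))
# shes all foo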
I have a file, some lines in a .csv file that are jamming up a database import because of funky characters in some field in the line.
I have searched, found articles on how to replace non-ascii characters in Python 3, but nothing works.
When I open the file in vi and do :set list, there is a $ at the end of a line where there should not be, and ^I^I at the beginning of the next line. The two lines should be one joined line and no ^I there. I know that $ is end of line '\n' and have tried to replace those, but nothing works.
I don't know what the ^I represents, possibly a tab.
I have tried this function to no avail:
def remove_non_ascii(text):
    new_text = re.sub(r"[\n\t\r]", "", text)
    new_text = ''.join(new_text.split("\n"))
    new_text = ''.join([i if ord(i) < 128 else ' ' for i in new_text])
    new_text = "".join([x for x in new_text if ord(x) < 128])
    new_text = re.sub(r'[^\x00-\x7F]+', ' ', new_text)
    new_text = new_text.rstrip('\r\n')
    new_text = new_text.strip('\n')
    new_text = new_text.strip('\r')
    new_text = new_text.strip('\t')
    new_text = new_text.replace('\n', '')
    new_text = new_text.replace('\r', '')
    new_text = new_text.replace('\t', '')
    new_text = filter(lambda x: x in string.printable, new_text)
    new_text = "".join(list(new_text))
    return new_text
Is there some tool that will show me exactly what this offending character is, and then a method to replace it?
I am opening the file like so (the .csv was saved as UTF-8)
f_csv_in = open(csv_in, "r", encoding="utf-8")
Below are the two lines that should be one, with the problem non-ASCII characters visible. Notice the $ at the end of line 37, and that line 38 begins with ^I^I. Part of the problem, which vi is showing, is that there is a newline ($) at the end of line 37 where I don't want one; these should be a single line.
37 Cancelled,01-19-17,,basket,00-00-00,00-00-00,,,,98533,SingleSource,,,17035 Cherry Hill Dr,"L/o 1-19-17 # 11:45am$
38 ^I^IVictorville",SAN BERNARDINO,CA,92395,,,,,0,,,,,Lock:6111 ,,,No,No,,0.00,0.00,No,01-19-17,0.00,0.00,,01-19-17,00-00-00,,provider,,,Unread,00-00-00,,$
A simple way to remove non-ASCII chars could be:
new_text = "".join([c for c in text if c.isascii()])
NB: if you are reading this text from a file, make sure you read it with the correct encoding.
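For example, a small sketch (the filename here is just a placeholder):
with open('input.txt', encoding='utf-8') as f:   # read with an explicit encoding
    text = f.read()

clean = "".join(c for c in text if c.isascii())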
For non-printable characters, the built-in string module provides the string.printable constant, and strings have an isprintable() method, so you can filter out non-printable or non-ASCII characters.
A concise way of filtering the whole string at once is shown below:
>>> import string
>>>
>>> str1 = '\nsomestring'
>>> str1.isprintable()
False
>>> str2 = 'otherstring'
>>> str2.isprintable()
True
>>>
>>> res = filter(lambda x: x in string.printable, '\x01mystring')
>>> "".join(list(res))
'mystring'
This question has had some discussion on SO in the past, and since you can use anything from regular expressions to str.translate(), I understand it may be confusing that there are so many ways to do this.
Another thing one could do is to take a look at Unicode Categories, and filter out your data based on the set of symbols you need.
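For example, here is a rough sketch that drops everything in the Unicode 'C' (control, format and similar) categories while keeping normal text:
import unicodedata

def drop_control_chars(text):
    # unicodedata.category() returns codes like 'Ll', 'Po' or 'Cc';
    # the 'C*' categories cover control, format, surrogate and unassigned characters
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith('C'))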
It looks as if you have a csv file that contains quoted values, that is values such as embedded commas or newlines which have to be surrounded with quotes so that csv readers handle them correctly.
If you look at the example data you can see there's an opening doublequote but no closing doublequote at the end of the first line, and a closing doublequote with no opening doublequote on the second line, indicating that the quotes contain a value with an embedded newline.
The fact that the lines are broken in two may be an artefact of the application used to view them, or the code that's processing them: if the software doesn't understand csv quoting it will assume each newline character denotes a new line.
It's not clear exactly what problem this is causing in the database, but it's quite likely that quote characters - especially unmatched quotes - could be causing a problem, particularly if the data isn't being properly escaped before insertion.
This snippet rewrites the file, removing embedded commas, newlines and tabs, and instructs the writer not to quote any values. It will fail with the error message _csv.Error: need to escape, but no escapechar set if it finds a value that needs to be escaped. Depending on your data, you may need to adjust the regex pattern.
import csv
import re

with open('lines.csv') as f, open('fixed.csv', 'w') as out:
    reader = csv.reader(f)
    writer = csv.writer(out, quoting=csv.QUOTE_NONE)
    for line in reader:
        new_row = [re.sub(r'\t|\n|,', ' ', x) for x in line]
        writer.writerow(new_row)
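(On Python 3, the csv documentation also recommends opening the files passed to the reader and writer with newline='' so that the csv module controls line endings itself.)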
Another approach, using re to keep only printable ASCII characters:
import re
import string
string_with_printable = re.sub(f'[^{re.escape(string.printable)}]', '', original_string)
re.escape escapes special characters in the given pattern.
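A quick usage example with a couple of non-ASCII characters mixed in:
import re
import string

original_string = 'caf\xe9 \u2014 latte'   # contains an accented e and an em dash
print(re.sub(f'[^{re.escape(string.printable)}]', '', original_string))
# prints: caf  latte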
I've been trying to solve this problem for a few hours now and can't come up with the right solution. This is the question:
Write a loop that creates a new word list, using a
string method to strip the words from the list created in Problem 3
of all leading and trailing punctuation. Hint: the string library,
which is imported above, contains a constant named punctuation.
Three lines of code.
Here is my code:
import string
def litCricFriend(wordList, text):
    theList = text.lower().replace('-', ' ').split()  # problem 3
    # problem below
    for word in theList:
        word.strip(string.punctuation)
    return theList
You've got a couple bits in your code that... well, I'm not really sure why they're there, to be honest, haha. Let's work through this together!
I'm assuming you have been given some text: text = "My hovercraft is full of eels!". Let's split this into words, make the words lowercase, and remove all punctuation. We know we need string.punctuation and str.split(), and you've also figured out that str.replace() is useful. So let's use these and get our result!
import string
def remove_punctuation(text):
    # First, let's remove the punctuation.
    # We do this by looping through each punctuation mark in
    # `string.punctuation` and replacing that mark with the empty
    # string, re-assigning the result to the same variable.
    for punc in string.punctuation:
        text = text.replace(punc, '')
    # Now our text is all de-punctuated! So let's make a list of
    # the words, all lowercased, and return it in one go:
    return text.lower().split()
Looks to me like the function is only three lines, which is what you said you wanted!
For the advanced reader, you could also use functools and do it in one line (I split it into two for readability, but it's still "one line"):
import string
import functools
def remove_punctuation(text):
    return functools.reduce(lambda newtext, punc: newtext.replace(punc, ''),
                            string.punctuation, text).lower().split()
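Either version can be called the same way; with the example sentence from earlier:
print(remove_punctuation("My hovercraft is full of eels!"))
# ['my', 'hovercraft', 'is', 'full', 'of', 'eels']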
I have defined the following code
exclude = set(string.punctuation)
lmtzr = nltk.stem.wordnet.WordNetLemmatizer()
wordList= ['"the']
answer = [lmtzr.lemmatize(word.lower()) for word in list(set(wordList)-exclude)]
print answer
I have previously printed exclude and the quotation mark " is part of it. I expected answer to be [the]. However, when I printed answer, it shows up as ['"the']. I'm not entirely sure why it's not taking out the punctuation correctly. Would I need to check each character individually instead?
When you create a set from wordList it stores the string '"the' as the only element,
>>> set(wordList)
set(['"the'])
So using set difference will return the same set,
>>> set(wordList) - set(string.punctuation)
set(['"the'])
If you want to just remove punctuation you probably want something like,
>>> [word.translate(None, string.punctuation) for word in wordList]
['the']
Here I'm using the translate method of strings, passing None for the translation table and string.punctuation as the set of characters to delete.
You can then perform the lemmatization on the new list.
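Alternatively, if you only want to trim leading and trailing punctuation from each word (rather than deleting every punctuation character), a sketch using str.strip before lemmatizing:
answer = [lmtzr.lemmatize(word.lower().strip(string.punctuation)) for word in wordList]
# ['the']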
I have a long text file (a screenplay). I want to turn this text file into a list (where every word is separated) so that I can search through it later on.
The code I have at the moment is:
file = open('screenplay.txt', 'r')
words = list(file.read().split())
print words
I think this works to split up all the words into a list; however, I'm having trouble removing all the extra stuff like commas and periods at the end of words. I also want to make capital letters lower case (because I want to be able to search in lower case and have both capitalized and lower case words show up). Any help would be fantastic :)
This is a job for regular expressions!
For example:
import re
file = open('screenplay.txt', 'r')
# .lower() returns a version with all upper case characters replaced with lower case characters.
text = file.read().lower()
file.close()
# replaces anything that is not a lowercase letter, a space, or an apostrophe with a space:
text = re.sub('[^a-z\ \']+', " ", text)
words = list(text.split())
print words
A screenplay should be short enough to be read into memory in one fell swoop. If so, you could then remove all punctuation using the translate method. Finally, you can produce your list simply by splitting on whitespace using str.split:
import string
with open('screenplay.txt', 'rb') as f:
    content = f.read()
content = content.translate(None, string.punctuation).lower()
words = content.split()
print words
Note that this will change Mr.Smith into mrsmith. If you'd like it to become ['mr', 'smith'], then you could replace all punctuation with spaces and then use str.split:
def using_translate(content):
    table = string.maketrans(
        string.punctuation,
        ' ' * len(string.punctuation))
    content = content.translate(table).lower()
    words = content.split()
    return words
One problem you might encounter using a positive regex pattern such as [a-z]+ is that it will only match ascii characters. If the file has accented characters, the words would get split apart.
Gruyère would become ['Gruy','re'].
You could fix that by using re.split to split on punctuation.
For example,
def using_re(content):
    words = re.split(r"[ %s\t\n]+" % (string.punctuation,), content.lower())
    return words
However, using str.translate is faster:
In [72]: %timeit using_re(content)
100000 loops, best of 3: 9.97 us per loop
In [73]: %timeit using_translate(content)
100000 loops, best of 3: 3.05 us per loop
Use the replace method.
mystring = mystring.replace(",", "")
If you want a more elegant solution that you will use many times over, read up on regular expressions. Most languages support them, and they are extremely useful for more complicated replacements.
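For the screenplay example, a minimal sketch chaining replace() for a few specific marks might look like this:
text = open('screenplay.txt', 'r').read().lower()
for mark in (',', '.', '!', '?', ';', ':'):
    text = text.replace(mark, '')
words = text.split()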
You could use a dictionary to specify what characters you don't want, and format the current string based on your choices.
replaceChars = {'.':'',',':'', ' ':''}
print reduce(lambda x, y: x.replace(y, replaceChars[y]), replaceChars, "ABC3.2,1,\nCda1,2,3....".lower())
Output:
abc321
cda123
You can use a simple regexp for creating a set with all words (sequences of one or more alphabetic characters):
import re
words = set(re.findall("[a-z]+", f.read().lower()))
Using a set each word will be included just once.
Just using findall will instead give you all the words in order.
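For example, keeping the document order instead of building a set:
import re

with open('screenplay.txt') as f:
    words_in_order = re.findall("[a-z]+", f.read().lower())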
You can try something like this. Probably need some work on the regexp though.
import re
text = file.read()
words = map(lambda x: re.sub("[,.!?]", "", x).lower(), text.split())
I have tried this code and it works in my case:
from string import punctuation, whitespace
s = ''
with open("path of your file", "r") as myfile:
    content = myfile.read().split()
for word in content:
    if (word in punctuation) or (word in whitespace):
        pass
    else:
        s += word.lower()
print(s)
I have multilingual strings consisting of both languages that use whitespace as a word separator (English, French, etc.) and languages that don't (Chinese, Japanese, Korean).
Given such a string, I want to separate the English/French/etc part into words using whitespace as separator, and to separate the Chinese/Japanese/Korean part into individual characters.
And I want to put all of those separated components into a list.
Some examples would probably make this clear:
Case 1: English-only string. This case is easy:
>>> "I love Python".split()
['I', 'love', 'Python']
Case 2: Chinese-only string:
>>> list(u"我爱蟒蛇")
[u'\u6211', u'\u7231', u'\u87d2', u'\u86c7']
In this case I can turn the string into a list of Chinese characters. But within the list I'm getting unicode representations:
[u'\u6211', u'\u7231', u'\u87d2', u'\u86c7']
How do I get it to display the actual characters instead of the unicode? Something like:
['我', '爱', '蟒', '蛇']
??
Case 3: A mix of English & Chinese:
I want to take an input string such as
"我爱Python"
and turn it into a list like this:
['我', '爱', 'Python']
Is it possible to do something like that?
I thought I'd show the regex approach, too. It doesn't feel right to me, but that's mostly because all of the language-specific i18n oddness I've seen makes me worried that a regular expression might not be flexible enough for all of it; then again, you may well not need any of that. (In other words: overdesign.)
# -*- coding: utf-8 -*-
import re
def group_words(s):
    regex = []
    # Match a whole word:
    regex += [ur'\w+']
    # Match a single CJK character:
    regex += [ur'[\u4e00-\ufaff]']
    # Match one of anything else, except for spaces:
    regex += [ur'[^\s]']
    regex = "|".join(regex)
    r = re.compile(regex)
    return r.findall(s)

if __name__ == "__main__":
    print group_words(u"Testing English text")
    print group_words(u"我爱蟒蛇")
    print group_words(u"Testing English text我爱蟒蛇")
In practice, you'd probably want to only compile the regex once, not on each call. Again, filling in the particulars of character grouping is up to you.
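A sketch of that, compiling the same combined pattern once at module level (the WORD_RE name is just for illustration):
import re

WORD_RE = re.compile(ur'\w+|[\u4e00-\ufaff]|[^\s]')

def group_words(s):
    return WORD_RE.findall(s)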
This works in Python 3, and it also splits out numbers if you need that.
import re

def spliteKeyWord(str):
    regex = r"[\u4e00-\ufaff]|[0-9]+|[a-zA-Z]+\'*[a-z]*"
    matches = re.findall(regex, str, re.UNICODE)
    return matches

print(spliteKeyWord("Testing English text我爱Python123"))
=> ['Testing', 'English', 'text', '我', '爱', 'Python', '123']
Formatting a list shows the repr of its components. If you want to view the strings naturally rather than escaped, you'll need to format it yourself. (repr should not be escaping these characters; repr(u'我') should return "u'我'", not "u'\\u6211'". Apparently this is fixed in Python 3; only 2.x is stuck with the English-centric escaping for Unicode strings.)
A basic algorithm you can use is assigning a character class to each character, then grouping letters by class. Starter code is below.
I didn't use a doctest for this because I hit some odd encoding issues that I don't want to look into (out of scope). You'll need to implement a correct grouping function.
Note that if you're using this for word wrapping, there are other per-language considerations. For example, you don't want to break on non-breaking spaces; you do want to break on hyphens; for Japanese you don't want to split apart きゅ; and so on.
# -*- coding: utf-8 -*-
import itertools, unicodedata
def group_words(s):
    # This is a closure for key(), encapsulated in an array to work around
    # 2.x's lack of the nonlocal keyword.
    sequence = [0x10000000]

    def key(part):
        val = ord(part)
        if part.isspace():
            return 0
        # This is incorrect, but serves this example; finding a more
        # accurate categorization of characters is up to the user.
        asian = unicodedata.category(part) == "Lo"
        if asian:
            # Never group asian characters, by returning a unique value for each one.
            sequence[0] += 1
            return sequence[0]
        return 2

    result = []
    for key, group in itertools.groupby(s, key):
        # Discard groups of whitespace.
        if key == 0:
            continue
        str = "".join(group)
        result.append(str)
    return result

if __name__ == "__main__":
    print group_words(u"Testing English text")
    print group_words(u"我爱蟒蛇")
    print group_words(u"Testing English text我爱蟒蛇")
I modified Glenn's solution to drop symbols and work for Russian, French, etc. alphabets:
import re

def rec_group_words():
    regex = []
    # Match a whole word:
    regex += [r'[A-za-z0-9\xc0-\xff]+']
    # Match a single CJK character:
    regex += [r'[\u4e00-\ufaff]']
    regex = "|".join(regex)
    return re.compile(regex)
The following works for Python 3.7:
import re
def group_words(s):
    return re.findall(u'[\u4e00-\u9fff]|[a-zA-Z0-9]+', s)

if __name__ == "__main__":
    print(group_words(u"Testing English text"))
    print(group_words(u"我爱蟒蛇"))
    print(group_words(u"Testing English text我爱蟒蛇"))
['Testing', 'English', 'text']
['我', '爱', '蟒', '蛇']
['Testing', 'English', 'text', '我', '爱', '蟒', '蛇']
For some reason, I cannot adapt Glenn Maynard's answer to Python 3.