I have the following code, which alternates for loops and if statements.
lines = ['apple berry Citrus ', 34, 4.46, 'Audi Apple ']
corpus = []
for line in lines:
    # Check if the element is a string before processing it
    if isinstance(line, str):
        # Split the element and check if the first character is upper case
        for word in line.split():
            if word[0].isupper():
                # Append the word to the corpus result
                corpus.append(word)
print(corpus)
# Output: ['Citrus', 'Audi', 'Apple']
I am trying to do this with a list comprehension but failing. I have tried the following:
# corpus = [ word if word[0].isupper() for word in line.split() for line in lines if isinstance(line, str)]
How can I achieve this with a list comprehension?
The following will work:
corpus = [
    word for line in lines if isinstance(line, str)
    for word in line.split() if word[0].isupper()
]
The scope in nested comprehensions can be confusing at first, but you'll notice that the order of for and if is the same as in the nested loops.
You can do:
[word for line in lines if isinstance(line, str)
      for word in line.split() if word[0].isupper()]
An alternate approach is to prefilter the list lines with filter:
[word for line in filter(lambda e: isinstance(e, str), lines)
      for word in line.split() if word[0].isupper()]
Or, as pointed out in comments, you can eliminate the lambda with:
[word for line in filter(str.__instancecheck__, lines)
      for word in line.split() if word[0].isupper()]
Or even two filters:
[word for line in filter(str.__instancecheck__, lines)
      for word in filter(lambda w: w[0].isupper(), line.split())]
As a general rule, you can take the inner part of the nested loops where you do corpus.append(xxx) and place it (the xxx) at the beginning of the list comprehension. Then add the nested for loops and conditions, without the ':'.
corpus = [ word                     # from corpus.append(word)
           for line in lines        # ':' removed ...
           if isinstance(line, str)
           for word in line.split()
           if word[0].isupper() ]

# on a single line:
corpus = [word for line in lines if isinstance(line, str) for word in line.split() if word[0].isupper()]

print(corpus)  # ['Citrus', 'Audi', 'Apple']
Not sure how to remove the "\n" at the end of the output
Basically, I have this txt file with sentences such as:
"What does Bessie say I have done?" I asked.
"Jane, I don't like cavillers or questioners; besides, there is something truly forbidding in a child
taking up her elders in that manner.
Be seated somewhere; and until you can speak pleasantly, remain silent."
I managed to split the sentences by semicolon with code:
import re

with open("testing.txt") as file:
    read_file = file.readlines()

for i, word in enumerate(read_file):
    low = word.lower()
    re.split(';', low)
But I am not sure how to count the words of the split sentences, as len() doesn't work.
The output of the sentences:
['"what does bessie say i have done?" i asked.\n']
['"jane, i don\'t like cavillers or questioners', ' besides, there is something truly forbidding in a
child taking up her elders in that manner.\n']
['be seated somewhere', ' and until you can speak pleasantly, remain silent."\n']
For the third sentence, for example, I am trying to count the 3 words on the left and the 8 words on the right.
Thanks for reading!
The number of words is the number of spaces plus one:
e.g.
Two spaces, three words:
World is wonderful
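As a quick sanity check in the interpreter (this rule assumes words are separated by single spaces):
>>> 'World is wonderful'.count(' ') + 1
3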
Code:
import re
import string

lines = []
with open('file.txt', 'r') as f:
    lines = f.readlines()

DELIMITER = ';'
word_count = []
for i, sentence in enumerate(lines):
    # Skip empty sentences
    if not sentence.strip():
        continue
    # Remove punctuation besides our delimiter ';'
    sentence = sentence.translate(str.maketrans('', '', string.punctuation.replace(DELIMITER, '')))
    # Split by our delimiter
    splitted = re.split(DELIMITER, sentence)
    # The number of words is the number of spaces plus one
    word_count.append([1 + x.strip().count(' ') for x in splitted])

# [[9], [7, 9], [7], [3, 8]]
print(word_count)
Use str.rstrip('\n') to remove the \n at the end of each sentence.
To count the words in a sentence, you can use len(sentence.split(' '))
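Note that split(' ') and split() behave differently on runs of whitespace: split(' ') yields empty strings for consecutive spaces, while split() with no argument collapses any whitespace. A quick illustration:
>>> 'be  seated somewhere'.split(' ')
['be', '', 'seated', 'somewhere']
>>> 'be  seated somewhere'.split()
['be', 'seated', 'somewhere']
The code below therefore uses split() with no argument.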
To transform a list of sentences into a list of counts, you can use the map function.
So here it is:
import re

with open("testing.txt") as file:
    for i, line in enumerate(file.readlines()):
        # Ignore empty lines
        if line.strip(' ') != '\n':
            line = line.lower()
            # Split by semicolons
            parts = re.split(';', line)
            print("SENTENCES:", parts)
            counts = list(map(lambda part: len(part.split()), parts))
            print("COUNTS:", counts)
Outputs
SENTENCES: ['"what does bessie say i have done?" i asked.']
COUNTS: [9]
SENTENCES: ['"jane, i don\'t like cavillers or questioners', ' besides, there is something truly forbidding in a child ']
COUNTS: [7, 9]
SENTENCES: [' taking up her elders in that manner.']
COUNTS: [7]
SENTENCES: ['be seated somewhere', ' and until you can speak pleasantly, remain silent."']
COUNTS: [3, 8]
You'll need the nltk library.
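If nltk isn't set up yet, a one-time install and download is typically needed; the punkt tokenizer models are what sent_tokenize and word_tokenize use under the hood:
# pip install nltk
import nltk
nltk.download('punkt')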
from nltk import sent_tokenize, word_tokenize

mytext = """I have a dog.
The dog is called Bob."""

for sent in sent_tokenize(mytext):
    print(len(word_tokenize(sent)))
Output
5
6
Step by step explanation:
for sent in sent_tokenize(mytext):
    print('Sentence >>>', sent)
    print('List of words >>>', word_tokenize(sent))
    print('Count words per sentence >>>', len(word_tokenize(sent)))
Output:
Sentence >>> I have a dog.
List of words >>> ['I', 'have', 'a', 'dog', '.']
Count words per sentence>>> 5
Sentence >>> The dog is called Bob.
List of words >>> ['The', 'dog', 'is', 'called', 'Bob', '.']
Count words per sentence>>> 6
import re

sentences = []  # empty list for storing the result
with open('testtext.txt') as fileObj:
    # make a list of lines already stripped of '\n'
    lines = [line.strip() for line in fileObj if line.strip()]
    for line in lines:
        # split lines by ';' and store the result in sentences
        sentences += re.split(';', line)

for sentence in sentences:
    print(sentence + ' ' + str(len(sentence.split())))
Try this one:
import re

with open("testing.txt") as file:
    read_file = file.readlines()

for i, word in enumerate(read_file):
    low = word.lower()
    low = low.strip()  # strip() already removes the trailing '\n', so no replace() is needed
    parts = re.split(';', low)  # store the result; re.split returns the parts rather than modifying low
List comprehension to check for the presence of any of the items
I have some text and would like to check it for some keywords. It should return the sentence if it contains any of the keywords.
An example:
text = [t for t in string.split('. ')
        if 'drink' in t or 'eat' in t
        or 'sleep' in t]
This works. However, I am thinking if there is a better way, as the list of keywords may grow.
I tried putting the keywords in a list but it would not work in this list comprehension.
Or using if with any:
pattern = ['drink', 'eat', 'sleep']
[t for t in string.split('. ') if any(l in pattern for l in t)]
You were almost there:
pattern = ['drink', 'eat', 'sleep']
[t for t in string.split('. ') if any(word in t for word in pattern)]
The key is to check, for each word in pattern, whether that word is inside the sentence:
any(word in t for word in pattern)
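For example, with a made-up input string (the string variable and its contents here are purely illustrative):
string = 'I eat rice. I walk home. I sleep early'
pattern = ['drink', 'eat', 'sleep']
print([t for t in string.split('. ') if any(word in t for word in pattern)])
# ['I eat rice', 'I sleep early']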
Your use of any is backwards. This is what you want:
[t for t in string.split('. ') if any(l in t for l in pattern)]
An alternative approach is using a regex:
import re
regex = re.compile('|'.join(pattern))
[t for t in string.split('. ') if regex.search(t)]
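One caveat: joining raw keywords matches substrings too ('eat' would also match 'heater'), and any regex metacharacters in the keywords get interpreted. A slightly more defensive sketch, assuming whole-word matches are wanted (this refinement is an addition, not part of the answer above):
import re
regex = re.compile(r'\b(?:' + '|'.join(map(re.escape, pattern)) + r')\b')
[t for t in string.split('. ') if regex.search(t)]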
I want to find all the "phrases" in a list and remove them from it, so that I have only words (without spaces) left. I'm making a hangman-type game and want the computer to choose a random word. I'm new to Python and coding, so I'm happy to hear other suggestions for my code as well.
import random

fhand = open('common_words.txt')
words = []
for line in fhand:
    line = line.strip()
    words.append(line)
for word in words:
    if ' ' in word:
        words.remove(word)
print(words)
Sets are more efficient than lists for membership tests, and filtering the words as you read the file means you only make one pass over the data.
# Load all words
words = set()
with open('common_words.txt') as file:
    for line in file.readlines():
        line = line.strip()
        if " " not in line:
            words.add(line)

# Can be converted to a one-liner using the magic of Python
words = set(filter(lambda x: " " not in x, map(str.strip, open('common_words.txt').readlines())))

# Get a random word (random.choice needs an indexable sequence, so convert the set)
import random
print(random.choice(list(words)))
Use str.split(). It splits on any whitespace (spaces, tabs, and newlines) by default.
>>> 'some words\nsome more'.split()
['some', 'words', 'some', 'more']
>>> 'this is a sentence.'.split()
['this', 'is', 'a', 'sentence.']
>>> 'dfsonf 43 SDFd fe#2'.split()
['dfsonf', '43', 'SDFd', 'fe#2']
Read the file normally and make a list this way:
words = []
with open('filename.txt', 'r') as file:
    words = file.read().split()
That should be good.
with open('common_words.txt', 'r') as f:
    words = [word for word in filter(lambda x: len(x) > 0 and ' ' not in x, map(lambda x: x.strip(), f.readlines()))]
with is used because file objects are context managers. The strange list-like syntax is a list comprehension, so it builds a list from the statements inside the brackets. map is a function which takes an iterable, applying a provided function to each item and placing each transformed result into a new list*. filter is a function which takes an iterable, testing each item against the provided predicate and placing each item which evaluates to True into a new list*. lambda is used to define a function (with a specific signature) in-line.
*: The actual return types are generators, which function like iterators so they can still be used with for loops.
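A tiny demonstration of that footnote (Python 3, where map and filter return lazy iterator objects):
m = map(str.strip, [' a ', ' b '])
print(m)        # <map object at 0x...>, not a list
print(list(m))  # ['a', 'b'] once the iterator is consumed
In Python 2, both functions returned plain lists instead.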
I am not sure if I understand you correctly, but I guess the split() method is what you're looking for, like:
with open('common_words.txt') as f:
    words_nested = [line.split() for line in f]
words = [word for sublist in words_nested for word in sublist]  # flatten the nested list
As mentioned, the .split() method could be a solution.
Also, the NLTK module might be useful for future language processing tasks.
Hope this helps!
So I have a list of words in a text file. I want to perform lemmatization on them to remove words which have the same meaning but are in different tenses. Like try, tried etc. When I do this, I keep getting an error like TypeError: unhashable type: 'list'
from nltk.stem import WordNetLemmatizer

results = []
with open('/Users/xyz/Documents/something5.txt', 'r') as f:
    for line in f:
        results.append(line.strip().split())

lemma = WordNetLemmatizer()
lem = []
for r in results:
    lem.append(lemma.lemmatize(r))

with open("lem.txt", "w") as t:
    for item in lem:
        print>>t, item
How do I lemmatize words which are already tokens?
The method WordNetLemmatizer.lemmatize is probably expecting a string but you are passing it a list of strings. This is giving you the TypeError exception.
The result of line.split() is a list of strings which you are appending as a list to results i.e. a list of lists.
You want to use results.extend(line.strip().split())
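The difference in a nutshell, with throwaway values:
results = []
results.append('a b'.split())  # results is now [['a', 'b']] -- the list goes in whole
results.extend('a b'.split())  # results is now [['a', 'b'], 'a', 'b'] -- items go in one by one
With that change the corrected version looks like this: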
results = []
with open('/Users/xyz/Documents/something5.txt', 'r') as f:
    for line in f:
        results.extend(line.strip().split())

lemma = WordNetLemmatizer()
lem = map(lemma.lemmatize, results)

with open("lem.txt", "w") as t:
    for item in lem:
        print >> t, item
or refactored without the intermediate results list
def words(fname):
    with open(fname, 'r') as document:
        for line in document:
            for word in line.strip().split():
                yield word

lemma = WordNetLemmatizer()
lem = map(lemma.lemmatize, words('/Users/xyz/Documents/something5.txt'))
Open the text file and read the lines into a list, as shown below:
fo = open(filename)
results1 = fo.readlines()

results1
['I have a list of words in a text file', ' \n I want to perform lemmatization on them to remove words which have the same meaning but are in different tenses', '']
# Tokenize the lists
results2 = [line.split() for line in results1]

# Remove empty lists
results2 = [x for x in results2 if x != []]

# Lemmatize each word from a list using WordNetLemmatizer
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemma_list_of_words = []
for i in range(0, len(results2)):
    l1 = results2[i]
    l2 = ' '.join([lemmatizer.lemmatize(word) for word in l1])
    lemma_list_of_words.append(l2)

lemma_list_of_words
['I have a list of word in a text file', 'I want to perform lemmatization on them to remove word which have the same meaning but are in different tense']
Note the difference between the lemmatized lemma_list_of_words and the original results1.
I am learning python from an introductory Python textbook and I am stuck on the following problem:
You will implement function index() that takes as input the name of a text file and a list of words. For every word in the list, your function will find the lines in the text file where the word occurs and print the corresponding line numbers.
Ex:
>>> index('raven.txt', ['raven', 'mortal', 'dying', 'ghost', 'ghastly', 'evil', 'demon'])
ghost 9
dying 9
demon 122
evil 99, 106
ghastly 82
mortal 30
raven 44, 53, 55, 64, 78, 97, 104, 111, 118, 120
Here is my attempt at the problem:
def index(filename, lst):
    infile = open(filename, 'r')
    lines = infile.readlines()
    lst = []
    dic = {}
    for line in lines:
        words = line.split()
        lst.append(words)
    for i in range(len(lst)):
        for j in range(len(lst[i])):
            if lst[i][j] in lst:
                dic[lst[i][j]] = i
    return dic
When I run the function, I get back an empty dictionary. I do not understand why I am getting an empty dictionary. So what is wrong with my function? Thanks.
You are overwriting the value of lst. You use it both as a parameter to the function (in which case it is a list of strings) and as the list of words in the file (in which case it's a list of lists of strings). When you do:
if lst[i][j] in lst
The comparison always returns False because lst[i][j] is a str, but lst contains only lists of strings, not strings themselves. This means that the assignment to dic is never executed and you get an empty dict as a result.
To avoid this you should use a different name for the list in which you store the words, for example:
In [4]: !echo 'a b c\nd e f' > test.txt
In [5]: def index(filename, lst):
   ...:     infile = open(filename, 'r')
   ...:     lines = infile.readlines()
   ...:     words = []
   ...:     dic = {}
   ...:     for line in lines:
   ...:         line_words = line.split()
   ...:         words.append(line_words)
   ...:     for i in range(len(words)):
   ...:         for j in range(len(words[i])):
   ...:             if words[i][j] in lst:
   ...:                 dic[words[i][j]] = i
   ...:     return dic
   ...:
In [6]: index('test.txt', ['a', 'b', 'c'])
Out[6]: {'a': 0, 'c': 0, 'b': 0}
There are also a lot of things you can change.
When you want to iterate over a list you don't have to use indices explicitly. If you need the index you can use enumerate:
for i, line_words in enumerate(words):
    for word in line_words:
        if word in lst:
            dic[word] = i
You can also iterate directly on a file (refer to Reading and Writing Files section of the python tutorial for a bit more information):
# use the with statement to make sure that the file gets closed
with open('test.txt') as infile:
    for i, line in enumerate(infile):
        print('Line {}: {}'.format(i, line))
In fact I don't see why you would first build that words list of lists. Just iterate over the file directly while building the dictionary:
def index(filename, lst):
    with open(filename, 'r') as infile:
        dic = {}
        for i, line in enumerate(infile):
            for word in line.split():
                if word in lst:
                    dic[word] = i
        return dic
Your dic values should be lists, since more than one line can contain the same word. As it stands your dic would only store the last line where a word is found:
from collections import defaultdict

def index(filename, words):
    # make the 'in' check below faster
    words = frozenset(words)
    with open(filename) as infile:
        dic = defaultdict(list)
        for i, line in enumerate(infile):
            for word in line.split():
                if word in words:
                    dic[word].append(i)
        return dic
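For reference, defaultdict(list) creates the empty list automatically on first access, which is why dic[word].append(i) works without any initialization:
from collections import defaultdict
d = defaultdict(list)
d['raven'].append(9)
print(d)  # defaultdict(<class 'list'>, {'raven': [9]})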
If you don't want to use the collections.defaultdict you can replace dic = defaultdict(list) with dic = {} and then change the:
dic[word].append(i)
With:
if word not in dic:
    dic[word] = [i]
else:
    dic[word].append(i)
Or, alternatively, you can use dict.setdefault:
dic.setdefault(word, []).append(i)
although this last way is a bit slower than the original code.
Note that all these solutions share the property that if a word isn't found in the file it will not appear in the result at all. However, you may want it in the result with an empty list as its value. In that case it's simpler to initialize the dict with empty lists before starting the loop, as in:
dic = {word: [] for word in words}
for i, line in enumerate(infile):
    for word in line.split():
        if word in words:
            dic[word].append(i)
Refer to the documentation about List Comprehensions and Dictionaries to understand the first line.
You can also iterate over words instead of the line, like this:
dic = {word: [] for word in words}
for i, line in enumerate(infile):
    for word in words:
        if word in line.split():
            dic[word].append(i)
Note however that this is going to be slower because:
line.split() returns a list, so word in line.split() will have to scan the whole list.
You are repeating the computation of line.split().
You can try to solve these two problems by doing:
dic = {word: [] for word in words}
for i, line in enumerate(infile):
    line_words = frozenset(line.split())
    for word in words:
        if word in line_words:
            dic[word].append(i)
Note that here we are iterating once over line.split() to build the set and also over words. Depending on the sizes of the two collections this may be slower or faster than the original version (iterating over line.split()).
However at this point it's probably faster to intersect the sets:
dic = {word: [] for word in words}
for i, line in enumerate(infile):
    line_words = frozenset(line.split())
    for word in words & line_words:  # & stands for set intersection
        dic[word].append(i)
Try this,
def index(filename, lst):
    dic = {w: [] for w in lst}
    for n, line in enumerate(open(filename, 'r')):
        for word in lst:
            if word in line.split(' '):
                dic[word].append(n + 1)
    return dic
There are some features of the language introduced here that you should be aware of because they will make life a lot easier in the long run.
The first is a dictionary comprehension. It basically initializes a dictionary using the words in lst as keys and an empty list [] as the value for each key.
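For instance:
>>> {w: [] for w in ['raven', 'mortal']}
{'raven': [], 'mortal': []}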
Next the enumerate command. This allows us to iterate over the items in a sequence but also gives us the index of those items. In this case, because we passed a file object to enumerate it will loop over the lines. For each iteration, n will be the 0-based index of the line and line will be the line itself. Next we iterate over the words in lst.
Notice that we don't need any indices here. Python encourages looping over the objects in a sequence rather than looping over indices and then accessing the objects by index (for example, it discourages doing for i in range(len(lst)): do something with lst[i]).
Finally, the in operator is a very straightforward way to test membership for many types of objects and the syntax is very intuitive. In this case, we are asking is the current word from lst in the current line.
Note that we use line.split(' ') to get a list of the words in the line. If we don't do this, 'the' in 'there was a ghost' would return True as the is a substring of one of the words.
On the other hand 'the' in ['there', 'was', 'a', 'ghost'] would return False. If the conditional returns True, we append it to the list associated to the key in our dictionary.
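Seen in the interpreter, the two checks from the example above:
>>> 'the' in 'there was a ghost'
True
>>> 'the' in ['there', 'was', 'a', 'ghost']
False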
That might be a lot to chew on, but these concepts make problems like this more straightforward.
First, your function parameter holding the words is named lst, and the list where you put all the words from the file is also named lst, so you lose the words passed to your function: on line 4 you're rebinding the name to a new empty list.
Second, you are iterating over each line in the file (the first for) and collecting the words in that line, so after that loop lst holds one list of words per line. In the for i loop you then iterate over those per-line lists, and in the for j loop over the words inside each one.
In short, in that if you are asking "is this single word in the list of word lists?", which it never is, so the dict never gets filled.
for i in range(len(lst)):
    if words[i] in lst:
        dic[words[i]] = dic[words[i]] + i  # To count repetitions
You need to rethink the problem; even this snippet will fail because the word won't exist in the dict yet, raising an error, but you get the idea. Good luck!