I have a question regarding how to store target words in a list.
I have a text file:
apple tree apple_tree
banana juice banana_juice
dinner time dinner_time
divorce lawyer divorce_lawyer
breakfast table breakfast_table
I would like to read this file and store only the nouns, but I am struggling with the code in Python.
file = open("text.txt","r")
for f in file.readlines():
words.append(f.split(" "))
I don't know how to split the lines by whitespace and eliminate the compounds with "_"...
list = ['apple', 'tree', 'banana', 'juice', 'dinner', 'time', ...]
Try this code. It works fine.
Split the whole string and add to the list only those values that are not compound words (i.e., words that do not contain "_").
Code :
temp = """apple tree apple_tree
banana juice banana_juice
dinner time dinner_time
divorce lawyer divorce_lawyer
breakfast table breakfast_table"""
new_arr = [i for i in temp.split() if '_' not in i]
print(new_arr)
Output :
['apple', 'tree', 'banana', 'juice', 'dinner', 'time', 'divorce', 'lawyer', 'breakfast', 'table']
This code stores only words without the underscore, and all in one list instead of a nested list:
words = []
file = open("text.txt","r")
for f in file.readlines():
    words += [i for i in f.split(" ") if '_' not in i]
print(words)
import re
file = ["apple tree apple_tree apple_tree_tree apple_tree_ _",
"banana juice banana_juice",
"dinner time dinner_time",
"divorce lawyer divorce_lawyer",
"breakfast table breakfast_table"]
#approach 1 - list comprehensions
words=[]
for f in file:
words += [x for x in f.split(" ") if '_' not in x]
print(words)
#approach 2 - regular expressions
words=[]
for f in file:
f = re.sub(r"\s*\w*_[\w_]*\s*", "", f)
words += f.split(" ")
print(words)
Both of the above approaches work.
IMO the first is better (regular expressions can be costly) and it is also more Pythonic.
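If you want to check the cost claim on your own data, here is a rough timeit sketch; the sample input and repetition count below are just placeholders:
import re
import timeit

# placeholder input: repeat a couple of lines so the timings are measurable
file = ["apple tree apple_tree", "banana juice banana_juice"] * 1000

def comprehension_approach():
    words = []
    for f in file:
        words += [x for x in f.split(" ") if '_' not in x]
    return words

def regex_approach():
    words = []
    for f in file:
        words += re.sub(r"\s*\w*_[\w_]*\s*", "", f).split(" ")
    return words

# time each approach over 100 runs and compare
print(timeit.timeit(comprehension_approach, number=100))
print(timeit.timeit(regex_approach, number=100))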
Example:
myList = []
text = ["salmonella in black pepper from brazil", "aflatoxins in fish from germany", "pseudomonas in meat from italy"]
findmatches = re.compile(r"\b" +
                         r"\b|\b".join(re.escape(hazard) for hazard in hazards_set) +
                         r"\b")
for i in text:
    for possible_match in set(findmatches.findall(i)):
        if possible_match in hazards_set:
            myList.append(possible_match)
    myList.append("")
print(myList)
This is what I get:
['salmonella', '', 'aflatoxins', '', '']
This is what I would like to get:
['salmonella','aflatoxins', '']
since "pseudomonas" is not in the set hazards_set.
How can I solve the problem?
Add an if condition in the outer for-loop that uses .isdisjoint() to append the empty string only when none of the found matches is in hazards_set.
import re

myList = []
text = ["salmonella in black pepper from brazil", "aflatoxins in fish from germany", "pseudomonas in meat from italy"]
# e.g. hazards_set could be a list (or a set) of words
hazards_set = ['brrrrrrrr', 'aflatoxins', 'salmonella']
findmatches = re.compile(r"\b" +
                         r"\b|\b".join(re.escape(hazard) for hazard in hazards_set) +
                         r"\b")
for i in text:
    for possible_match in set(findmatches.findall(i)):
        if possible_match in hazards_set:
            myList.append(possible_match)
    if set(findmatches.findall(i)).isdisjoint(hazards_set):
        myList.append("")
print(myList)
['salmonella', 'aflatoxins', '']
You can also improve the code using a list comprehension and re.finditer() with a different regex pattern:
myList = [match.group(0) for i in text
          for match in re.finditer(r'\b(?:%s)\b' % '|'.join(hazards_set), i)]
myList += [''] * (len(text) - len(myList))
print(myList)
Will produce the same output as the traditional for-loop and append approach.
Note: I am anticipating that hazards_set could either be a list of words like:
hazards_set = ['brrrrrrrr', 'aflatoxins', 'salmonella']
or set of words like:
hazards_set = {'brrrrrrrr', 'aflatoxins', 'salmonella'}
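Note: the compiled pattern in the question escaped each hazard with re.escape(); if your hazards might contain regex metacharacters, you can keep that in the one-liner version too. A minimal sketch, assuming text and hazards_set are defined as above:
import re
# escape each hazard so metacharacters are matched literally
pattern = r'\b(?:%s)\b' % '|'.join(re.escape(hazard) for hazard in hazards_set)
myList = [match.group(0) for i in text for match in re.finditer(pattern, i)]
myList += [''] * (len(text) - len(myList))
print(myList)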
Not sure how to remove the "\n" at the end of the output.
Basically, I have this txt file with sentences such as:
"What does Bessie say I have done?" I asked.
"Jane, I don't like cavillers or questioners; besides, there is something truly forbidding in a child
taking up her elders in that manner.
Be seated somewhere; and until you can speak pleasantly, remain silent."
I managed to split the sentences by semicolon with code:
import re
with open("testing.txt") as file:
    read_file = file.readlines()
for i, word in enumerate(read_file):
    low = word.lower()
    re.split(';', low)
But I am not sure how to count the words of the split sentences, as len() doesn't work:
The output of the sentences:
['"what does bessie say i have done?" i asked.\n']
['"jane, i don\'t like cavillers or questioners', ' besides, there is something truly forbidding in a
child taking up her elders in that manner.\n']
['be seated somewhere', ' and until you can speak pleasantly, remain silent."\n']
For the third sentence, for example, I am trying to count the 3 words on the left and the 8 words on the right.
Thanks for reading!
The number of words is the number of spaces plus one:
e.g.
Two spaces, three words:
World is wonderful
Code:
import re
import string
lines = []
with open('file.txt', 'r') as f:
    lines = f.readlines()
DELIMITER = ';'
word_count = []
for i, sentence in enumerate(lines):
    # Skip empty sentences
    if not sentence.strip():
        continue
    # Remove punctuation besides our delimiter ';'
    sentence = sentence.translate(str.maketrans('', '', string.punctuation.replace(DELIMITER, '')))
    # Split by our delimiter
    splitted = re.split(DELIMITER, sentence)
    # The number of words is the number of spaces plus one
    word_count.append([1 + x.strip().count(' ') for x in splitted])
# [[9], [7, 9], [7], [3, 8]]
print(word_count)
Use str.rstrip('\n') to remove the \n at the end of each sentence.
To count the words in a sentence, you can use len(sentence.split()).
To transform a list of sentences into a list of counts, you can use the map function.
So here it is:
import re
with open("testing.txt") as file:
    for i, line in enumerate(file.readlines()):
        # Ignore empty lines
        if line.strip(' ') != '\n':
            # Lowercase and remove the trailing '\n'
            line = line.lower().rstrip('\n')
            # Split by semicolons
            parts = re.split(';', line)
            print("SENTENCES:", parts)
            counts = list(map(lambda part: len(part.split()), parts))
            print("COUNTS:", counts)
Outputs
SENTENCES: ['"what does bessie say i have done?" i asked.']
COUNTS: [9]
SENTENCES: ['"jane, i don\'t like cavillers or questioners', ' besides, there is something truly forbidding in a child ']
COUNTS: [7, 9]
SENTENCES: [' taking up her elders in that manner.']
COUNTS: [7]
SENTENCES: ['be seated somewhere', ' and until you can speak pleasantly, remain silent."']
COUNTS: [3, 8]
You'll need the nltk library:
from nltk import sent_tokenize, word_tokenize
mytext = """I have a dog.
The dog is called Bob."""
for sent in sent_tokenize(mytext):
    print(len(word_tokenize(sent)))
Output
5
6
Step by step explanation:
for sent in sent_tokenize(mytext):
    print('Sentence >>>',sent)
    print('List of words >>>',word_tokenize(sent))
    print('Count words per sentence>>>', len(word_tokenize(sent)))
Output:
Sentence >>> I have a dog.
List of words >>> ['I', 'have', 'a', 'dog', '.']
Count words per sentence>>> 5
Sentence >>> The dog is called Bob.
List of words >>> ['The', 'dog', 'is', 'called', 'Bob', '.']
Count words per sentence>>> 6
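Note that sent_tokenize and word_tokenize rely on nltk's punkt tokenizer models; on a fresh install you may have to download them once:
import nltk
nltk.download('punkt')  # newer nltk versions may ask for 'punkt_tab' instead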
import re
sentences = []  # empty list for storing the result
with open('testtext.txt') as fileObj:
    lines = [line.strip() for line in fileObj if line.strip()]  # make a list of lines already stripped of '\n'
for line in lines:
    sentences += re.split(';', line)  # split lines by ';' and store the result in sentences
for sentence in sentences:
    print(sentence + ' ' + str(len(sentence.split())))  # print each sentence followed by its word count
Try this one:
import re
with open("testing.txt") as file:
    read_file = file.readlines()
for i, word in enumerate(read_file):
    low = word.lower()
    low = low.strip()
    low = low.replace('\n', '')
    parts = re.split(';', low)  # store the split sentences
    print(parts)
I have a list of phrases (n-grams) that need to be removed from a given sentence.
removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
I want to get:
new_sentence = 'Oranges are the main ingredient for a wide of'
I tried the approach from "Remove list of phrases from string", but it doesn't work ('Oranges' turns into 'Os', and 'drinks' is removed instead of the phrase 'food and drinks').
Does anyone know how to solve it? Thank you!
Since you want to match on whole words only, I think the first step is to turn everything into lists of words, and then iterate from longest to shortest phrase in order to find things to remove:
>>> removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
>>> sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
>>> words = sentence.split()
>>> for ngram in sorted([r.split() for r in removed], key=len, reverse=True):
...     for i in range(len(words) - len(ngram) + 1):
...         if words[i:i+len(ngram)] == ngram:
...             words = words[:i] + words[i+len(ngram):]
...             break
...
>>> " ".join(words)
'Oranges are the main ingredient for a wide of'
Note that there are some flaws with this simple approach -- multiple copies of the same n-gram won't be removed, and you can't simply continue the loop after modifying words either (its length will have changed), so if you want to handle duplicates, you'll need to batch the updates.
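A rough sketch of that batching idea (same removed and sentence as above): first mark every word position covered by a matching n-gram, then rebuild the sentence in a single pass, so repeated n-grams are handled too.
removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'

words = sentence.split()
to_drop = [False] * len(words)  # positions covered by any matched n-gram

for ngram in sorted([r.split() for r in removed], key=len, reverse=True):
    for i in range(len(words) - len(ngram) + 1):
        # only claim positions not already taken by a longer n-gram
        if words[i:i+len(ngram)] == ngram and not any(to_drop[i:i+len(ngram)]):
            for j in range(i, i + len(ngram)):
                to_drop[j] = True

print(" ".join(w for w, drop in zip(words, to_drop) if not drop))
# Oranges are the main ingredient for a wide of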
Regular expression time!
In [116]: removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
...: removed = sorted(removed, key=len, reverse=True)
...: sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
...: new_sentence = sentence
...: import re
...: removals = [r'\b' + phrase + r'\b' for phrase in removed]
...: for removal in removals:
...:     new_sentence = re.sub(removal, '', new_sentence)
...: new_sentence = ' '.join(new_sentence.split())
...: print(sentence)
...: print(new_sentence)
Oranges are the main ingredient for a wide range of food and drinks
Oranges are the main ingredient for a wide of
import re
removed = ['range', 'drinks', 'food and drinks', 'summer drinks']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
# sort the removed tokens according to their length
removed = sorted(removed, key=len, reverse=True)
# remove each phrase using word boundaries
for r in removed:
    sentence = re.sub(r"\b{}\b".format(r), " ", sentence)
# replace multiple whitespaces with a single one
sentence = re.sub(' +', ' ', sentence)
I hope this helps. First, you need to sort the removed strings by length so that 'food and drinks' is replaced before 'drinks'.
Here you go
removed = ['range', 'drinks', 'food and drinks', 'summer drinks','are']
sentence = 'Oranges are the main ingredient for a wide range of food and drinks'
words = sentence.split()
resultwords = [word for word in words if word.lower() not in removed]
result = ' '.join(resultwords)
print(result)
Results:
Oranges the main ingredient for a wide of food and
I have a file that contains sentences. I want to extract those sentences to a list and remove the words with length <=3
This is what I have by now:
with open("./data/pos/train-pos.txt", "r", encoding="utf8") as f:
    train_pos = [line.strip().lower() for line in f]
newDoc = [word for word in train_pos if len(word) >= 3]
print(newDoc)
train_pos = ['i like apples', 'apples are my favorite fruits']
And I want to obtain: ['like apples', 'apples favorite fruits'], but I obtain the same list. What is the problem? I want to do this efficiently, because train-pos.txt contains thousands of sentences, so a solution completely different from my (wrong) one is fine.
You can do something like this:
>>> newDoc = [' '.join(word for word in sentence.split() if len(word) >= 3) for sentence in train_pos]
>>> newDoc
['like apples', 'apples are favorite fruits']
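Since train-pos.txt contains thousands of sentences, you can also filter while streaming the file line by line instead of building train_pos first; a minimal sketch assuming the same file path:
with open("./data/pos/train-pos.txt", "r", encoding="utf8") as f:
    # strip, lowercase and filter each line as it is read
    newDoc = [' '.join(word for word in line.strip().lower().split() if len(word) >= 3)
              for line in f]
print(newDoc)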
I have an input file with sentences like this:
I like apples
My mother is called Anna.
I transfer these sentences to a list and then I want to remove words that have the length < 3.
I've tried this:
with open("fis.txt", "r", encoding="utf8") as f:
    lst = [w.lower() for w in f.readlines() if len(w) >= 3]
print(lst)
but it gives me ['i like apples', 'my mother is called anna.']
and I want to obtain ['like apples', 'mother called anna.']
What seems to be the problem here?
f.readlines() gives you a list with two items which correspond to the two lines of the file.
You need to iterate over the lines (no need to read them into memory first, iterating over f will do), split each line, and then filter the words.
with open("fis.txt", "r", encoding="utf8") as f:
    lst = [' '.join(w.lower() for w in line.split() if len(w) >= 3) for line in f]
Try:
with open("fis.txt", "r", encoding="utf8") as f:
print( [" ".join(j for j in w.split() if len(j) >= 3 ) for w in f.readlines() ] )
Output:
['like apples', 'mother called Anna.']
Your code is checking the length of each entire sentence rather than of the individual words; iterate through the words in w and then check each word's length.