I have a file that contains sentences. I want to extract those sentences to a list and remove the words with length <=3
This is what I have so far:
with open("./data/pos/train-pos.txt", "r", encoding="utf8") as f:
    train_pos = [line.strip().lower() for line in f]
newDoc = [word for word in train_pos if len(word) >= 3]
print(newDoc)
where train_pos is, for example:
train_pos = ['i like apples', 'apples are my favorite fruits']
And I want to obtain ['like apples', 'apples favorite fruits'], but I obtain the same list. What is the problem? I also want to do this efficiently, because train-pos.txt contains thousands of sentences, so a solution different from my wrong one is fine.
The problem is that your list comprehension iterates over whole sentences, so len(word) is the length of an entire sentence, not of a single word. Split each sentence into words and filter those instead:
>>> newDoc = [' '.join(word for word in sentence.split() if len(word) > 3) for sentence in train_pos]
>>> newDoc
['like apples', 'apples favorite fruits']
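Since the file contains thousands of sentences, you can also filter each line as you read it, without building the intermediate train_pos list first. A minimal sketch, assuming the same file path and the same length cutoff:
with open("./data/pos/train-pos.txt", "r", encoding="utf8") as f:
    # filter each line as it is read, keeping only words longer than 3 characters
    newDoc = [
        " ".join(w for w in line.strip().lower().split() if len(w) > 3)
        for line in f
    ]
print(newDoc)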
Related
I have two lists: the first with adjectives and the second with sentences.
I need to return a sentence if there is an adjective from the list in it, and write the sentence into a dictionary with the matching adjective as the value. It would return something like (['Have a good day'], 'adj').
Or at least it could just return the sentences with a matching adjective.
sents_cleaned = ['have a good day', "don't forget your yellow umbrella", 'bold seagull']
adjectives = ['good', 'red', 'green', 'yellow']
This is what I've tried so far. It didn't work as expected; sorry, I'm a newbie.
for sents in sents_cleaned:
    sents = sents.strip().split(" ")
    for words in sents:
        for adj in adjectives:
            if adj in sents:
                print(sents)
The desired output would be ['have a good day', 'adj'],
["don't forget your yellow umbrella", 'adj']
Suppose you want to store it in a dictionary called d, with sentences as keys and adjectives as values.
The following code assumes you want only one adjective from each sentence. If you require multiple adjectives, keeping a dictionary of string to list of strings would help (see the sketch after the code below).
Note that you must keep the original sentence string around to use it as the key; reassigning sents to the split list would make the key an unhashable list.
d = dict()
for sent in sents_cleaned:
    words = sent.strip().split(" ")
    for word in words:
        if word in adjectives:
            d[sent] = word
print(d)
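If you need every matching adjective per sentence, a minimal sketch of the dictionary-of-lists variant mentioned above, using the same variable names:
d = {}
for sent in sents_cleaned:
    # collect every adjective that appears in this sentence
    matches = [word for word in sent.strip().split(" ") if word in adjectives]
    if matches:
        d[sent] = matches
print(d)
# {'have a good day': ['good'], "don't forget your yellow umbrella": ['yellow']}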
Not sure how to remove the "\n" at the end of the output.
Basically, I have this txt file with sentences such as:
"What does Bessie say I have done?" I asked.
"Jane, I don't like cavillers or questioners; besides, there is something truly forbidding in a child
taking up her elders in that manner.
Be seated somewhere; and until you can speak pleasantly, remain silent."
I managed to split the sentences by semicolon with this code:
import re
with open("testing.txt") as file:
    read_file = file.readlines()
for i, word in enumerate(read_file):
    low = word.lower()
    re.split(';', low)
But I'm not sure how to count the words of the split sentences, as len() doesn't work.
The output of the split sentences:
['"what does bessie say i have done?" i asked.\n']
['"jane, i don\'t like cavillers or questioners', ' besides, there is something truly forbidding in a
child taking up her elders in that manner.\n']
['be seated somewhere', ' and until you can speak pleasantly, remain silent."\n']
For the third sentence, for example, I am trying to count the 3 words on the left and the 8 words on the right.
Thanks for reading!
The number of words is the number of spaces plus one (assuming words are separated by single spaces):
e.g.
Two spaces, three words:
World is wonderful
Code:
import re
import string

lines = []
with open('file.txt', 'r') as f:
    lines = f.readlines()

DELIMITER = ';'
word_count = []
for i, sentence in enumerate(lines):
    # Skip empty lines
    if not sentence.strip():
        continue
    # Remove punctuation besides our delimiter ';'
    sentence = sentence.translate(str.maketrans('', '', string.punctuation.replace(DELIMITER, '')))
    # Split by our delimiter
    splitted = re.split(DELIMITER, sentence)
    # The number of words is the number of spaces plus one
    word_count.append([1 + x.strip().count(' ') for x in splitted])

# [[9], [7, 9], [7], [3, 8]]
print(word_count)
Use str.rstrip('\n') to remove the \n at the end of each sentence.
To count the words in a sentence, you can use len(sentence.split()).
To transform a list of sentences into a list of counts, you can use the map function.
So here it is:
import re

with open("testing.txt") as file:
    for i, line in enumerate(file.readlines()):
        # Ignore empty lines
        if line.strip(' ') != '\n':
            # Lowercase and strip the trailing newline
            line = line.lower().rstrip('\n')
            # Split by semicolons
            parts = re.split(';', line)
            print("SENTENCES:", parts)
            counts = list(map(lambda part: len(part.split()), parts))
            print("COUNTS:", counts)
Outputs
SENTENCES: ['"what does bessie say i have done?" i asked.']
COUNTS: [9]
SENTENCES: ['"jane, i don\'t like cavillers or questioners', ' besides, there is something truly forbidding in a child ']
COUNTS: [7, 9]
SENTENCES: [' taking up her elders in that manner.']
COUNTS: [7]
SENTENCES: ['be seated somewhere', ' and until you can speak pleasantly, remain silent."']
COUNTS: [3, 8]
You'll need the nltk library (pip install nltk; the tokenizers may also require the punkt data, which you can fetch once with nltk.download('punkt')):
from nltk import sent_tokenize, word_tokenize

mytext = """I have a dog.
The dog is called Bob."""

for sent in sent_tokenize(mytext):
    print(len(word_tokenize(sent)))
Output
5
6
Step by step explanation (note that word_tokenize counts punctuation such as the final '.' as a token):
for sent in sent_tokenize(mytext):
    print('Sentence >>>', sent)
    print('List of words >>>', word_tokenize(sent))
    print('Count words per sentence>>>', len(word_tokenize(sent)))
Output:
Sentence >>> I have a dog.
List of words >>> ['I', 'have', 'a', 'dog', '.']
Count words per sentence>>> 5
Sentence >>> The dog is called Bob.
List of words >>> ['The', 'dog', 'is', 'called', 'Bob', '.']
Count words per sentence>>> 6
import re

sentences = []  # empty list for storing the result
with open('testtext.txt') as fileObj:
    # make a list of lines, already stripped of '\n's
    lines = [line.strip() for line in fileObj if line.strip()]
    for line in lines:
        sentences += re.split(';', line)  # split each line by ';' and collect the parts
for sentence in sentences:
    print(sentence + ' ' + str(len(sentence.split())))
Try this one:
import re
with open("testing.txt") as file:
    read_file = file.readlines()
for i, word in enumerate(read_file):
    low = word.lower().strip()
    parts = re.split(';', low)
    print(parts)
Note that the result of re.split must be assigned to a variable; also, strip() already removes the trailing '\n', so an extra replace('\n', '') is unnecessary.
I am working on a project and I want to write code that finds words containing a certain letter in a sentence and then returns them (prints them out).
sentence = "I am asking a question on Stack Overflow"
lst = []
# this gives me a list of all words in a sentence
change = sentence.split()
# NOTE: I know this isn't correct syntax, but that's basically what I want to do.
lst.append(only words containing "a")
print(lst)
Now the part I am struggling with is: how do I append only words containing the letter "a", for example?
You can do it like this:
words = sentence.split()
lst = [word for word in words if 'a' in word]
print(lst)
# ['am', 'asking', 'a', 'Stack']
Try this! I hope it's easy to follow:
sentence = "I am asking a question on Stack Overflow"
lst = []
change = sentence.split()
#we are going to check in every word of the sentence, if letter 'a' is in it.
for a in change:
if 'a' in a:
print(a+" has an a! ")
lst.append(a)
print(lst)
This will print a message for each matching word, and lst will contain:
['am', 'asking', 'a', 'Stack']
Write a program that asks a user for a file name, then reads in the file. The program should then determine how frequently each word in the file is used. The words should be counted regardless of case; for example, Spam and spam would both be counted as the same word. You should disregard punctuation. The program should then output the words and how frequently each word is used. The output should be sorted from the most frequent word to the least frequent word.
The only problem I am having is getting the code to count "The" and "the" as the same thing; the code counts them as different words.
userinput = input("Enter a file to open:")
if len(userinput) < 1: userinput = 'ran.txt'
f = open(userinput)
di = dict()
for lin in f:
    lin = lin.rstrip()
    wds = lin.split()
    for w in wds:
        di[w] = di.get(w, 0) + 1
lst = list()
for k, v in di.items():
    newtup = (v, k)
    lst.append(newtup)
lst = sorted(lst, reverse=True)
print(lst)
I need to count "the" and "The" as one single word.
We start by getting the words into a list, updating the list so that all words are lowercase. You can disregard punctuation by replacing it in the string with a space:
punctuations = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
s = "I want to count how many Words are there.i Want to Count how Many words are There"
for punc in punctuations:
    s = s.replace(punc, ' ')
words = s.split()
words = [word.lower() for word in words]
We then iterate through the list, and update a frequency map.
freq = {}
for word in words:
    if word in freq:
        freq[word] += 1
    else:
        freq[word] = 1
print(freq)
# {'i': 2, 'want': 2, 'to': 2, 'count': 2, 'how': 2, 'many': 2,
#  'words': 2, 'are': 2, 'there': 2}
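The task also asks for the output sorted from the most frequent word to the least frequent. A minimal sketch building on the freq dict above:
# sort the (word, count) pairs by count, descending
for word, count in sorted(freq.items(), key=lambda item: item[1], reverse=True):
    print(word, count)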
You can use Counter and re like this:
from collections import Counter
import re

sentence = 'Egg ? egg Bird, Goat afterDoubleSpace\nnewline'
# some punctuation to remove (you can add more here)
punctuationsToBeremoved = r",|\n|\?"
# make everything lowercase
sentence = sentence.lower()
# clean up the punctuation
sentence = re.sub(punctuationsToBeremoved, " ", sentence)
# get the word list
words = sentence.split()
# print the frequency of each word
print(Counter(words))
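Counter can also give you the counts already sorted from most to least common via most_common():
# most_common() returns (word, count) pairs sorted by count, descending
for word, count in Counter(words).most_common():
    print(word, count)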
I have an input file with sentences like this:
I like apples
My mother is called Anna.
I transfer these sentences to a list and then I want to remove words that have the length < 3.
I've tried this:
with open("fis.txt", "r", encoding="utf8") as f:
lst = [w.lower() for w in f.readlines() if len(w) >= 3]
print(lst)
but it gives me ['i like apples', 'my mother is called anna.']
and I want to obtain ['like apples', 'mother called anna.']
What seems to be the problem here?
f.readlines() gives you a list with two items which correspond to the two lines of the file.
You need to iterate over the lines (no need to read them into memory first, iterating over f will do), split each line, and then filter the words.
with open("fis.txt", "r", encoding="utf8") as f:
lst = [' '.join(w.lower() for w in line.split() if len(w) >= 3) for line in f]
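With the sample file above, lst becomes ['like apples', 'mother called anna.'], which is the desired output.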
Try:
with open("fis.txt", "r", encoding="utf8") as f:
    print([" ".join(j for j in w.lower().split() if len(j) >= 3) for w in f.readlines()])
Output:
['like apples', 'mother called anna.']
Your code is checking the length of each entire sentence rather than of the individual words; split each line and check the length of every word instead.