I'm learning Python on my own, and I'm starting to refactor my Python code to learn new, more efficient ways to write it.
I tried to turn word_dict into a dictionary comprehension, but I couldn't find a way to do it. I had two problems with it:
I tried to express word_dict[word] += 1 inside the dictionary comprehension as word_dict[word] := word_dict[word] + 1.
I wanted to check whether the element was already in the dictionary being built by the comprehension, using if word not in word_dict, and it didn't work.
The dictionary comprehension I tried is:
word_dict = {word_dict[word] := 0 if word not in word_dict else word_dict[word] := word_dict[word] + 1 for word in text_split}
Here is the code; it reads a text and counts the distinct words in it. If you know a better way to do it, just let me know.
text = "hello Hello, water! WATER:HELLO. water , HELLO"
# clean the text
text_cleaned = re.sub(r':|!|,|\.', " ", text)
# Output 'hello Hello water WATER HELLO water HELLO'
# creates list without spaces elements
text_split = [element for element in text_cleaned.split(' ') if element != '']
# Output ['hello', 'Hello', 'water', 'WATER', 'HELLO', 'water', 'HELLO']
word_dict = {}
for word in text_split:
    if word not in word_dict:
        word_dict[word] = 0
    word_dict[word] += 1
word_dict
# Output {'hello': 1, 'Hello': 1, 'water': 2, 'WATER': 1, 'HELLO': 2}
Right now you're using a regex to remove some undesirable characters, and then you split on whitespace to get a list of words. Why not use a regex to get the words right away? You can also take advantage of collections.Counter to create a dictionary, where the keys are words, and the associated values are counts/occurrences:
import re
from collections import Counter
text = "hello Hello, water! WATER:HELLO. water , HELLO"
pattern = r"\b\w+\b"
print(Counter(re.findall(pattern, text)))
Output:
Counter({'water': 2, 'HELLO': 2, 'hello': 1, 'Hello': 1, 'WATER': 1})
Here's what the regex pattern is composed of:
\b - represents a word boundary (will not be included in the match)
\w+ - one or more word characters: letters, digits, and underscore (equivalent to [a-zA-Z0-9_] for ASCII; in Python 3 it also matches Unicode word characters by default)
\b - another word boundary (will also not be included in the match)
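If you specifically want the dictionary comprehension you were originally after, one option is to build it from the distinct words. This is just a sketch, and it is less efficient than Counter because .count() rescans the word list once per distinct word:
words = re.findall(pattern, text)
word_dict = {word: words.count(word) for word in set(words)}
# e.g. {'water': 2, 'HELLO': 2, 'hello': 1, 'Hello': 1, 'WATER': 1}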
Welcome to Python. The standard library module collections (https://docs.python.org/3/library/collections.html) has a class called Counter, which seems like a very good fit for your code. Would that work for you?
from collections import Counter
...
word_dict = Counter(text_split)
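Counter behaves like a regular dict, so the usual lookups work, and its most_common() method is handy if you also want the words ranked by frequency. A quick sketch, assuming text_split is the list from your question:
from collections import Counter

text_split = ['hello', 'Hello', 'water', 'WATER', 'HELLO', 'water', 'HELLO']
word_dict = Counter(text_split)
print(word_dict['water'])        # 2
print(word_dict.most_common(2))  # e.g. [('water', 2), ('HELLO', 2)]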
Related
I am creating a list with pairs of words in a large text. I am going to use those pairs for other tasks later on.
Let's say these are the words I am looking for:
word_list = ["and", "car", "melon"]
And I'm trying to find all instances of these exact words and change them into "banana".
Method 1:
for word in text.split():
    if word in word_list:
        word = "banana"
Method 2:
for word in text.split():
    word = word.replace("and", "banana")
    word = word.replace("car", "banana")
    word = word.replace("melon", "banana")
I feel like both of these options are far from efficient. What are some better ways to deal with the problem?
Things to note:
The end result will be a list of lists: [["He","has"],["has","a"],["a","banana"]]
Only exact matches should be replaced (watermelon should not become waterbanana)
You could use a dictionary to do that,
value = 'banana'
d = {'and': value, 'car': value, 'melon': value}
result = ' '.join(d.get(i, i) for i in text.split())
You can create the mapping dictionary like this,
value = 'banana'
word_list = ["and", "car", "melon"]
d = dict(zip(word_list,[value]*len(word_list)))
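A slightly shorter way to build the same mapping is dict.fromkeys, and once the words are replaced you can pair up consecutive words with zip to get the list of lists you mentioned. A sketch, where text is a hypothetical example input:
value = 'banana'
word_list = ["and", "car", "melon"]
d = dict.fromkeys(word_list, value)   # {'and': 'banana', 'car': 'banana', 'melon': 'banana'}

text = "He has a melon"               # hypothetical example input
replaced = [d.get(w, w) for w in text.split()]
pairs = [list(p) for p in zip(replaced, replaced[1:])]
# pairs -> [['He', 'has'], ['has', 'a'], ['a', 'banana']]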
So I've got a variable list which is constantly being fed a new line of text, and a variable words which is a big list of single-word strings.
Every time list updates, I want to compare it to words and see whether any strings from words are in list.
If they do match, say the word "and" is in both of them, I then want to print "and: 1". If the next sentence contains it as well, print "and: 2", and so on. If another word such as "the" comes in, I want to add 1 to its count too.
So far I have split the incoming text into a list with text.split() - unfortunately that is where I'm stuck. I can see some use in [x for x in words if x in list], but I don't know how I would use it, or how I would extract the specific word that is matching.
You can use a collections.Counter object to keep a tally for each of the words that you are tracking. To improve performance, use a set for your word list (you said it's big). To keep things simple assume there is no punctuation in the incoming line data. Case is handled by converting all incoming words to lowercase.
from collections import Counter
words = {'and', 'the', 'in', 'of', 'had', 'is'} # words to keep counts for
word_counts = Counter()
lines = ['The rabbit and the mole live in the ground',
         'Here is a sentence with the word had in it',
         'Oh, it also had in in it. AND the and is too']

for line in lines:
    tracked_words = [w for word in line.split() if (w := word.lower()) in words]
    word_counts.update(tracked_words)
    print(*[f'{word}: {word_counts[word]}'
            for word in set(tracked_words)], sep=', ')
Output
the: 3, and: 1, in: 1
the: 4, in: 2, is: 1, had: 1
the: 5, and: 3, in: 4, is: 2, had: 2
Basically this code takes a line of input, splits it into words (assuming no punctuation), converts those words to lowercase, and discards any that are not in the main set of tracked words. Then the counter is updated. Finally, the current counts of the relevant words are printed.
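Note that the (w := word.lower()) filter relies on the walrus operator, which needs Python 3.8 or newer. On older versions the same filtering can be written with a nested generator, a small sketch:
tracked_words = [w for w in (word.lower() for word in line.split()) if w in words]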
This does the trick:
sentence = "Hello this is a sentence"
list_of_words = ["this", "sentence"]
dict_of_counts = {}  # This will hold all words that have a minimum count of 1.

for word in sentence.split():  # sentence.split() returns a list with each word of the sentence, and we loop over it.
    if word in list_of_words:  # Check if the current word is in list_of_words.
        if word in dict_of_counts:
            dict_of_counts[word] += 1  # If this key already exists in the dictionary, add one to its value.
        else:
            dict_of_counts[word] = 1  # If the key does not exist, create it with a value of 1.
        print(f"{word}: {dict_of_counts[word]}")  # Print your statement.
The total count is kept in dict_of_counts and would look like this if you print it:
{'this': 1, 'sentence': 1}
You could use defaultdict here; it keeps the loop simple by removing the explicit key-existence check.
from collections import defaultdict
input_string = "This is an input string"
list_of_words = ["input", "is"]
counts = defaultdict(int)
for word in input_string.split():
    if word in list_of_words:
        counts[word] += 1
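If list_of_words is large, it may be worth turning it into a set first, since membership tests on a set are constant time on average while a list is scanned linearly. A small sketch along the same lines:
from collections import defaultdict

input_string = "This is an input string"
words_to_track = {"input", "is"}   # a set instead of a list
counts = defaultdict(int)

for word in input_string.split():
    if word in words_to_track:
        counts[word] += 1

# counts -> defaultdict(<class 'int'>, {'is': 1, 'input': 1})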
Let's say I have a .txt file with phrases separated by newlines (\n).
I split them into a list of phrases:
["Rabbit eats banana", "Fox eats apple", "bear eats sandwich", "Tiger sleeps"]
What I need to do:
I need to make a list of word objects; each word should have:
name
frequency (how many times it occurred in the phrases)
list of phrases it belongs to
For word eats the result will be:
{'name':'eats', 'frequency': '3', 'phrases': [0,1,2]}
What I've already done:
Right now I am doing it in a simple but inefficient way:
I get the list of words (by splitting the .txt file on the space character " "):
words = split_my_input_file_by_spaces
#["banana", 'eats', 'apple', ....]
And loop for every word and every phrase:
for word in words:
    for phrase in phrases:
        if word in phrase:
            # add word freq +1
What is the problem with the current approach:
I will have up to 10k phrases, so I've run into problems with speed and performance, and I want to make it faster.
I saw this interesting and promising way of counting occurrences (but I don't know how I can make the list of phrases each word belongs to):
from collections import Counter
list1=['apple','egg','apple','banana','egg','apple']
counts = Counter(list1)
print(counts)
# Counter({'apple': 3, 'egg': 2, 'banana': 1})
What you're referring to (the interesting and promising way of counting occurrences) is called a HashMap, or a dictionary in Python. These are key-value stores, which allow you to store & update some value (such as a count, a list of phrases, or a Word object) with constant time retrieval.
You mentioned that you ran into some runtime issues. Switching to a HashMap-based approach will speed up the runtime of your algorithm significantly (from quadratic to linear).
phrases = ["Hello there", "Hello where"];
wordCounts = {};
wordPhrases = {};
for phrase in phrases:
for word in phrase.split():
if (wordCounts.get(word)):
wordCounts[word] = wordCounts[word] + 1
wordPhrases[word].append(phrase)
else:
wordCounts[word] = 1
wordPhrases[word] = [phrase]
print(wordCounts)
print(wordPhrases)
Output:
{'there': 1, 'where': 1, 'Hello': 2}
{'there': ['Hello there'], 'where': ['Hello where'],
'Hello': ['Hello there', 'Hello where']
}
This will leave you with two dictionaries:
For each word, how often does it appear: {word: count}
For each word, in what phrases does it appear: {word: [phrases...]}
Some small effort is required from this point to achieve the output you're looking for.
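For example, here is a sketch that records phrase indices (as in your desired output) instead of the phrase text, using enumerate; the variable names are just placeholders:
phrases = ["Rabbit eats banana", "Fox eats apple", "bear eats sandwich", "Tiger sleeps"]

word_counts = {}
word_phrases = {}
for i, phrase in enumerate(phrases):
    for word in phrase.split():
        word_counts[word] = word_counts.get(word, 0) + 1
        word_phrases.setdefault(word, []).append(i)

word_objects = [{'name': w, 'frequency': word_counts[w], 'phrases': word_phrases[w]}
                for w in word_counts]
# e.g. {'name': 'eats', 'frequency': 3, 'phrases': [0, 1, 2]}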
Here is my code so far:
def a(nameOfFile):
    f = open(nameOfFile)
    text = f.read()
    lines = text.split()  # splits each word into string
    d = {}
    for x in range(len(lines)-1):  # for each word in new line
        lines[x] = lines[x+1]
    return d
I am trying to go from a text file to a dictionary that lists each word and the possible words that can follow it. For instance, if the text file contains the sentence, "John is tall. Sunny thinks he will win," then the output should be {'John': [is], 'is': [tall] … } and so on.
I can't seem to grasp how to define the dictionary. I saw some examples that use key/value code, but we haven't learned that yet, so I don't think I need it. The examples in our class material use for loops, so I am trying to use those.
Thanks, any help is much appreciated.
Here is a start:
>>> st="John is tall. Sunny thinks he will win, and win he will if she thinks so."
>>> wl=[e.rstrip('.,') for e in st.split()]
>>> words={}
>>> for w1, w2 in zip(wl, wl[1:]):
...     words.setdefault(w1, []).append(w2)
...
>>> words
{'John': ['is'], 'is': ['tall'], 'tall': ['Sunny'], 'Sunny': ['thinks'], 'thinks': ['he', 'so'],
 'he': ['will', 'will'], 'will': ['win', 'if'], 'win': ['and', 'he'], 'and': ['win'],
 'if': ['she'], 'she': ['thinks']}
Your for loop doesn't use d at all.
split() creates a list of the words in your string. Its default is to split on any kind of whitespace.
d = {}
s = "Put us in the dictionary"
words = s.split()  # ['Put', 'us', 'in', 'the', 'dictionary']
start = 0

for word in words:
    index = words.index(word, start) + 1
    start = index
    try:
        d[word] = words[index]
    except IndexError:
        pass
Output:
>>> print(d)
{'Put': 'us', 'the': 'dictionary', 'us': 'in', 'in': 'the'}
Note that dicts don't have an obvious ordering (before Python 3.7 they don't preserve insertion order)
I'm having a little problem with an exercise I have to do:
Basically, the assignment is to open a URL, convert its contents into a given format, and count the number of occurrences of given strings in the text.
import urllib2 as ul

def word_counting(url, code, words):
    page = ul.urlopen(url)
    text = page.read()
    decoded = text.decode(code)
    result = {}
    for word in words:
        count = decoded.count(word)
        counted = str(word) + ":" + " " + str(count)
        result.append(counted)
    return result
The result I should get is something like "word1: x, word2: y, word3: z", with x, y, z being the numbers of occurrences. But it seems I only get ONE number: when I run the test program, I get only something like 9 for the first set of occurrences, 14 for the second list, and 5 for the third, missing the other occurrences and the whole counted value.
What am I doing wrong? Thanks in advance.
You're not appending to the dictionary correctly.
The correct way is result[key] = value.
So for your loop it would be
for word in words:
    count = decoded.count(word)
    result[word] = str(count)
An example without decode but using .count()
words = ['apple', 'apple', 'pear', 'banana']
result = {}

for word in words:
    count = words.count(word)
    result[word] = count
>>> result
{'pear': 1, 'apple': 2, 'banana': 1}
Or you can use collections.Counter:
>>> from collections import Counter
>>> words = ['apple', 'apple', 'pear', 'banana']
>>> Counter(words)
Counter({'apple': 2, 'pear': 1, 'banana': 1})
Don't forget about list and dictionary comprehensions. They can be quite efficient on larger sets of data (especially if you are analysing a large web page, as in your example). At the end of the day, if your data set is small, one could argue that the dict comprehension syntax is cleaner and more Pythonic.
So in this case I would use something like:
result = {word: decoded.count(word) for word in words}
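For instance, with a plain string standing in for decoded (a hypothetical example; note that str.count counts substrings, so it would also match words embedded in longer words):
decoded = "spam and eggs and more spam"
words = ["spam", "eggs", "ham"]
result = {word: decoded.count(word) for word in words}
# {'spam': 2, 'eggs': 1, 'ham': 0}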