Counting the number of times a word appears in n tweets - python

I have a data frame of about 118,000 tweets. Here's a made up sample:
Tweets
1 The apple is red
2 The grape is purple
3 The tree is green
I have also used the 'set' function to arrive at a list of every unique word found in my data frame of tweets. For the example above it looks like this (in no particular order):
Words
1 The
2 is
3 apple
4 grape
....so on
Basically I need to find out how many tweets contain a given word. For example, "The" is found in 3 tweets, "apple" is found in 1 tweet, "is" is found in 3 tweets, and so on.
I have tried using a nested for loop that looks like:
number_words = [0]*len(words)
for i in range(len(words)):
    for j in range(len(tweets)):
        if words[i] in tweets[j]:
            number_words[i] += 1
number_words
Which creates a new list and counts the number of tweets that contain the given word, for each word down the list. However, I have found that this is incredibly inefficient; the code block takes forever to run.
What is a better way to do this?

You could use str.count:
df.Tweets.str.count(word).sum()
For example, assuming Words is the list:
for word in Words:
    print(f'{word} count: {df.Tweets.str.count(word).sum()}')
Full sample:
import pandas as pd
data = """
Tweets
The apple is red
The grape is purple
The tree is green
"""
datb = """
Words
The
is
apple
grape
"""
# pd.compat.StringIO was removed in newer pandas versions; use io.StringIO instead
import io
dfa = pd.read_csv(io.StringIO(data), sep=';')
dfb = pd.read_csv(io.StringIO(datb), sep=';')
Words = dfb['Words'].values
dico = {}
for word in Words:
    dico[word] = dfa.Tweets.str.count(word).sum()
print(dico)
output:
{'The': 3, 'is': 3, 'apple': 1, 'grape': 1}
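Note that str.count sums every occurrence and matches substrings, so a word appearing twice in one tweet is counted twice. Since the question asks how many tweets contain each word, a sketch using str.contains with word boundaries may be closer to what is wanted (using the same sample data as above):

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO("Tweets\nThe apple is red\nThe grape is purple\nThe tree is green\n"))
words = ['The', 'is', 'apple', 'grape']

# str.contains tests each tweet once, so a repeated word still counts as one tweet;
# \b word boundaries avoid substring hits like 'grape' inside 'grapefruit'
counts = {w: int(df.Tweets.str.contains(rf'\b{w}\b').sum()) for w in words}
print(counts)  # {'The': 3, 'is': 3, 'apple': 1, 'grape': 1}
```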

You can use a defaultdict to store all the word counts, like this:
from collections import defaultdict

word_counts = defaultdict(int)
for tweet in tweets:
    for word in tweet.split():  # split into words; iterating the string itself yields characters
        word_counts[word] += 1
# print(word_counts['some_word']) will output the occurrences of some_word
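If the goal is the number of tweets that contain each word (rather than total occurrences), a defaultdict sketch over each tweet's unique words:

```python
from collections import defaultdict

tweets = ["The apple is red", "The grape is purple", "The tree is green"]

tweet_counts = defaultdict(int)
for tweet in tweets:
    # a set of the tweet's words, so a word repeated within one tweet counts once
    for word in set(tweet.split()):
        tweet_counts[word] += 1

print(tweet_counts['The'], tweet_counts['apple'])  # 3 1
```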

This will take your list of words and turn it into a dictionary:
import collections

words = tweets.split()  # assumes tweets is a single string; join the list first if it isn't
counter = collections.Counter(words)
for key, value in sorted(counter.items()):
    print("`{}` is repeated {} times".format(key, value))

Related

Checking whether sentences in a database contain certain words from a dictionary

I have a huge dictionary (in Python) containing millions of words and a score for each of them indicating popularity. (Note: I have it as a dictionary, but I can easily work with it as a dataframe too.) I also have a database/SQL table with a few hundred sentences, each sentence having an ID.
I want to see whether each sentence contains a popular word, that is, whether it contains a word with score below some number n. Is it inefficient to iterate through every sentence and each time check through every word to see if it is in the dictionary, and what score it has?
Is there any other more efficient way to do it?
Here is an approach you can go with; 6 in the example code is the value of n from the question:
import re

words = {
    'dog': 5,
    'ant': 6,
    'elephant': 1
}
n = 6
sentences = ['an ant', 'a dog', 'an elephant']

# Get all the popular words
popular_words = [key for key, val in words.items() if int(val) < int(n)]
popular_words = "|".join(popular_words)

for sentence in sentences:
    # Check if the sentence contains any of the popular words
    if re.search(rf"{popular_words}", sentence):
        print(sentence)
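Note that the joined pattern does no escaping or word-boundary anchoring, so e.g. 'ant' would also match inside 'wants'. A set-based sketch with the same inputs avoids regex entirely:

```python
words = {'dog': 5, 'ant': 6, 'elephant': 1}
n = 6
sentences = ['an ant', 'a dog', 'an elephant']

# words scoring below n, kept in a set for O(1) membership tests
popular = {w for w, score in words.items() if score < n}

# keep only sentences whose word set overlaps the popular words
matches = [s for s in sentences if popular & set(s.split())]
print(matches)  # ['a dog', 'an elephant']
```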

How to define a function that counts how many times a word is repeated

I would like to make a function that outputs a list of how many times each word appears in the sentence:
def words(text: str, number: int):
    text = text.split()
    for freq in text:
        print(freq, ",", text.count(freq))
I know this isn't correct, because I want the output to look like
[(cat,4),(dog,10),(pig,3)]
I am unsure how to approach this problem.
I also want it to count only as many different repeated words as the number passed to the function.
After splitting the string, I created a set of unique entries. With re.findall you can find the count of every word in the text. The function words returns a dictionary with each word as key and its count as value.
import re

def words(text: str):
    textsp = text.split()
    unique = set(textsp)
    count = []
    for word in unique:
        count.append(len(re.findall(word, text)))
    out = dict(zip(unique, count))
    return out

te = "cat dog cat pig dog dog pig cat"
print(words(te))
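One caveat: re.findall with the bare word also matches substrings, so 'cat' would be counted inside 'catfish'. A variant with \b word boundaries and re.escape, keeping the same dictionary output:

```python
import re

def words(text: str):
    # \b restricts matches to whole words; re.escape guards against regex metacharacters
    return {w: len(re.findall(rf'\b{re.escape(w)}\b', text))
            for w in set(text.split())}

print(words("cat dog cat pig dog dog pig cat"))  # {'cat': 3, 'dog': 3, 'pig': 2} (order may vary)
```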
Learn how to use sets - https://realpython.com/python-sets/
This is a possible approach using sets:
s = "cat cat cat cat dog dog dog".split()

def words(s):
    counter = []
    for i in set(s):
        counter.append((i, s.count(i)))
    return counter

print(words(s))
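The original signature also took a number argument to limit how many words are reported; collections.Counter.most_common covers both requirements in a couple of lines (a sketch, keeping the asker's list-of-tuples output):

```python
from collections import Counter

def words(text: str, number: int):
    # most_common(n) returns the n most frequent words as (word, count) tuples
    return Counter(text.split()).most_common(number)

print(words("cat dog cat pig dog dog pig cat", 2))  # [('cat', 3), ('dog', 3)]
```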

Store specific lines from a multiline file as values in a dictionary (Python)

I have a multiline transcript file that contains lines of text and corresponding timestamps. It looks like this:
00:02:01,640 00:02:04,409
word word CHERRY word word
00:02:04,409 00:02:07,229
word APPLE word word
00:02:07,229 00:02:09,380
word word word word
00:02:09,380 00:02:12,060
word BANANA word word word
Now, if the text contains specific words (types of fruit) which I have already stored in a list, these words shall be stored as keys in a dictionary. My code for this:
Dict = {}
FruitList = []
for w in transcript.split():
    if w in my_list:
        FruitList.append(w)
keys = FruitList
The output of printing keys is: ['CHERRY', 'APPLE', 'BANANA'].
Moving on, my problem is that I want to extract the timestamps belonging to the lines containing fruits, and store them in the dictionary as values - but only those timestamps which correspond to the line underneath in which a type of fruit is given.
For this task, I have several code snippets:
values = []  # shall contain timestamps later
timestamp_pattern = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} \d{2}:\d{2}:\d{2},\d{3}")
for i in keys:
    Dict[i] = values[i]
Unfortunately, I have no idea how to write the code in order to get only the relevant timestamps and store them as values with their keys (fruits) in the Dict.
The desired output (Dict) should look like this:
{'CHERRY': '00:02:01,640 -> 00:02:04,409',
'APPLE': '00:02:04,409 -> 00:02:07,229',
'BANANA': '00:02:09,380 -> 00:02:12,060'}
Can anyone help?
Thank you very much!
This looks like something you can do using zip and avoid regex, considering the pattern of lines:
d = {}
lines = transcript.split('\n')
for x, y in zip(lines, lines[1:]):
    for w in my_list:
        if w in y.split():
            splits = x.split()
            d[w] = f'{splits[0]} -> {splits[1]}'
print(d)
You may use
^(\d{2}:\d{2}:\d{2},\d{3} \d{2}:\d{2}:\d{2},\d{3})\n.*\b(CHERRY|APPLE|BANANA)\b
See the regex demo. With this pattern, you capture the time span line and the keyword into separate groups that can be retrieved with re.findall. After swapping the two captured values, you may cast the list of tuples into a dictionary.
If you read the data from a file, you need to use with open(fpath, 'r') as r: and then contents = r.read() to read the whole contents into a single string variable.
See Python demo:
import re
text = "00:02:01,640 00:02:04,409\nword word CHERRY word word\n\n00:02:04,409 00:02:07,229\nword APPLE word word\n\n00:02:07,229 00:02:09,380\nword word word word\n\n00:02:09,380 00:02:12,060\nword BANANA word word word"
t = r"\d{2}:\d{2}:\d{2},\d{3}"
keys = ['CHERRY', 'APPLE', 'BANANA']
rx = re.compile(fr"^({t} {t})\n.*\b({'|'.join(keys)})\b", re.M)
print(dict((y, x) for x, y in rx.findall(text)))
Output:
{'CHERRY': '00:02:01,640 00:02:04,409', 'APPLE': '00:02:04,409 00:02:07,229', 'BANANA': '00:02:09,380 00:02:12,060'}

Pandas Dataframe: Count unique words in a column and return count in another column

I have a dataframe which has the following columns
df['Album'] (contains album names of artistX )
df['Tracks'] (contains Tracks in the albums of artistX)
df['Lyrics'] (contains Lyrics of the tracks )
I am trying to count the number of words in df['Lyrics'] and return a new column called df['wordcount'] as well count the number of unique words in df['Lyrics'] and return a new column called df['uniquewordcount'].
I have been able to get df['wordcount'] by counting every string in df['lyrics'] minus white space.
totalscore = df.Lyrics.str.count(r'[^\s]+') # count every word (run of non-whitespace) in a track
df['wordcount'] = totalscore
df
I have been able to count unique words in df['Lyrics']
import collections
from collections import Counter
results = Counter()
count_unique = df.Lyrics.str.lower().str.split().apply(results.update)
unique_counts = sum((results).values())
df['uniquewordcount'] = unique_counts
And that gives me the count of all the unique words across df['Lyrics'] as a whole, which is what the code is meant to do, but I want the unique words in the lyrics of each track. My Python isn't great currently, so the solution might be obvious to everyone but me. I would love someone to point me in the right direction on how to get the count of the unique words for each track.
expected output:
Album Tracks Lyrics wordcount uniquewordcount
A Ball Ball is life and Ball is key 7 5
Pass Pass me the hookah Pass me the 7 4
what I got:
Album Tracks Lyrics wordcount uniquewordcount
A Ball Ball is life and Ball is key 7 9
Pass Pass me the hookah Pass me the 7 9
Here is one alternative solution:
import pandas as pd
df = pd.DataFrame({'Lyrics': ['This is some life some collection of words',
                              'Lyrics abound lyrics here there eveywhere',
                              'Come fly come fly away']})
# Split list into new series
lyrics = df['Lyrics'].str.lower().str.split()
# Get amount of unique words
df['LyricsCounter'] = lyrics.apply(set).apply(len)
# Get amount of words
df['LyricsWords'] = lyrics.apply(len)
print(df)
Returns:
Lyrics LyricsCounter LyricsWords
0 This is some life some collection of words 7 8
1 Lyrics abound lyrics here there eveywhere 5 6
2 Come fly come fly away 3 5
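Applied to the asker's own example rows, the same split/set idea reproduces the expected wordcount and uniquewordcount columns (note the lowercasing, which makes 'Ball' and 'ball' the same word, as the expected counts imply):

```python
import pandas as pd

df = pd.DataFrame({'Lyrics': ['Ball is life and Ball is key',
                              'Pass me the hookah Pass me the']})

tokens = df['Lyrics'].str.lower().str.split()
df['wordcount'] = tokens.str.len()                             # words per track
df['uniquewordcount'] = tokens.apply(lambda ws: len(set(ws)))  # distinct words per track
print(df[['wordcount', 'uniquewordcount']])
```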
Using just the standard library, you can indeed use collections.Counter. However, nltk is advisable since there are many edge cases that may interest you, e.g. dealing with punctuation, plurals, etc.
Here's a step-by-step guide for Counter. Note we go further here than required, since we are also calculating the counts of each word. This data held in Counter dictionaries is discarded when we drop df['LyricsCounter'].
from collections import Counter
df = pd.DataFrame({'Lyrics': ['This is some life some collection of words',
                              'Lyrics abound lyrics here there eveywhere',
                              'Come fly come fly away']})
# convert to lowercase, split to list
df['LyricsList'] = df['Lyrics'].str.lower().str.split()
# for each set of lyrics, create a Counter dictionary
df['LyricsCounter'] = df['LyricsList'].apply(Counter)
# calculate length of list
df['LyricsWords'] = df['LyricsList'].apply(len)
# calculate number of Counter items for each set of lyrics
df['LyricsUniqueWords'] = df['LyricsCounter'].apply(len)
res = df.drop(['LyricsList', 'LyricsCounter'], axis=1)
print(res)
Lyrics LyricsWords LyricsUniqueWords
0 This is some life some collection of words 8 7
1 Lyrics abound lyrics here there eveywhere 6 5
2 Come fly come fly away 5 3

Most efficient way to compare words in list / dict in Python

I have the following sentence and dict:
sentence = "I love Obama and David Card, two great people. I live in a boat"
dico = {
    'dict1': ['is', 'the', 'boat', 'tree'],
    'dict2': ['apple', 'blue', 'red'],
    'dict3': ['why', 'Obama', 'Card', 'two'],
}
I want to count how many elements appear both in the sentence and in a given dict. The heavy-handed method is the following procedure:
classe_sentence = []
text_splited = sentence.split(" ")
dic_keys = dico.keys()
for key_dics in dic_keys:
    for values in dico[key_dics]:
        if values in text_splited:
            classe_sentence.append(key_dics)

from collections import Counter
Counter(classe_sentence)
Which gives the following output:
Counter({'dict1': 1, 'dict3': 2})
However, it's not efficient at all, since there are two loops and raw comparison. I was wondering if there is a faster way to do it, maybe using an itertools object. Any idea?
Thanks in advance !
You can use the set data type for all your comparisons, and the set.intersection method to get the number of matches.
It will increase the algorithm's efficiency, but it will only count each word once, even if it shows up in several places in the sentence.
sentence = set("I love Obama and David Card, two great people. I live in a boat".split())
dico = {
    'dict1': {'is', 'the', 'boat', 'tree'},
    'dict2': {'apple', 'blue', 'red'},
    'dict3': {'why', 'Obama', 'Card', 'two'}
}

results = {}
for key, words in dico.items():
    results[key] = len(words.intersection(sentence))
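If repeated matches in the sentence should count, a Counter-based sketch works too; note that stripping trailing punctuation means 'Card,' now matches 'Card', so dict3 comes out as 3 rather than 2:

```python
from collections import Counter

sentence = "I love Obama and David Card, two great people. I live in a boat"
dico = {
    'dict1': ['is', 'the', 'boat', 'tree'],
    'dict2': ['apple', 'blue', 'red'],
    'dict3': ['why', 'Obama', 'Card', 'two'],
}

# count every token, stripping leading/trailing punctuation
word_counts = Counter(w.strip('.,;!?') for w in sentence.split())
results = {key: sum(word_counts[w] for w in words) for key, words in dico.items()}
print(results)  # {'dict1': 1, 'dict2': 0, 'dict3': 3}
```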
Assuming you want case-sensitive matching:
from collections import defaultdict

sentence_words = defaultdict(lambda: 0)
for word in sentence.split(' '):
    # strip off any trailing or leading punctuation
    word = word.strip('\'";.,!?')
    sentence_words[word] += 1

for name, words in dico.items():
    count = 0
    for x in words:
        count += sentence_words.get(x, 0)
    print('Dictionary [%s] has [%d] matches!' % (name, count,))
