Count occurrences of list items in dataframe - python

I have a dataframe and a list of words. Now I want to count how often all words in the list occur in each cell of a dataframe.
text
this is a test sentence
another sentence
list = ["this", "test", "break"]
Result:
text                     occurence_count
this is a test sentence  2
another sentence         0
My code does not work:
df["occurence_count"] = [df["text"].count(x) for x in list]

Perhaps you can do this:
a = ['this', 'test', 'break'] # 'list' shouldn't be used as a variable name
df['occurence_count'] = (
df['text'].str.split().explode()
.isin(set(a)).groupby(level=0).sum()
)
>>> df
text occurence_count
0 this is a test sentence 2
1 another sentence 0

You can do :
import re
l = ['this', 'test', 'break']
s = set(l)
df['occurence_count'] = df['text'].apply(
    lambda x: len(set(re.split(r'\s+', x)).intersection(s)))
This splits each text into words, turns them into a set, intersects that set with the one built from your list, and takes the length of the result.
(BTW, don't use list as a variable name; it shadows Python's built-in list type.)
output:
text occurence_count
0 this is a test sentence 2
1 another sentence 0
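Note that the two answers above can disagree when a word repeats within a cell: the explode/isin version counts every matching token, while the set intersection counts each distinct word once. A minimal pure-Python sketch of both behaviours (the function names count_all and count_distinct are made up for illustration, not taken from either answer):

```python
def count_all(text, words):
    """Count every token in `text` that appears in `words` (repeats included)."""
    wanted = set(words)
    return sum(token in wanted for token in text.split())


def count_distinct(text, words):
    """Count how many distinct entries of `words` appear in `text`."""
    return len(set(words) & set(text.split()))


words = ['this', 'test', 'break']
print(count_all('this is a test sentence', words))       # 2
print(count_distinct('this is a test sentence', words))  # 2
print(count_all('test this test', words))                # 3
print(count_distinct('test this test', words))           # 2
```

Either function could presumably be applied per cell with something like df['text'].apply(lambda t: count_all(t, words)), depending on which meaning of "count" is wanted.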

Related

How to filter a list via specific positioning of variable letters

What would be the most pythonic way of filtering a list using variables in exact positions of a word/list?
e.g. If my list is [HELLO, JELLO, JUUNO, ELNOO], how would I filter the list for any word containing 'EL' in a specific position of the word but exclude words which do not have the correct positioning of the letters? So if I was to filter for 'EL' in position 2 and 3 of a word I would want it to output [HELLO, JELLO] but exclude [ELNOO]
wordlist = ['HELLO', 'JELLO', 'JUUNO', 'ELNOO']
#filter wordlist for 'EL' in position 2 and 3
#filter wordlist for starting with 'J' and ending with 'O'
varA = 'E'
varB = 'L'
varC = 'J'
varD = 'O'
I thought about importing re and using re.search, is there a better way of doing this?
For a simple fixed string like this, a slice comparison is probably easiest:
words = ['HELLO', 'JELLO', 'JUUNO', 'ELNOO']
[x for x in words if x[1:3] == 'EL']
Extending @match's answer, if you need it as a function:
words = ['HELLO', 'JELLO', 'JUUNO', 'ELNOO']
filter_word = lambda wordlist, keyword, pos1, pos2: [x for x in wordlist if x[pos1-1:pos2] == keyword]
filtered_words = filter_word(words, "EL", 2, 3)
print(filtered_words)
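The question's second filter (words starting with 'J' and ending with 'O') is not covered by the answers above. A minimal sketch using str.startswith and str.endswith; the helper name filter_by_ends is hypothetical:

```python
def filter_by_ends(wordlist, first, last):
    """Keep words that start with `first` and end with `last`."""
    return [w for w in wordlist if w.startswith(first) and w.endswith(last)]


words = ['HELLO', 'JELLO', 'JUUNO', 'ELNOO']
print(filter_by_ends(words, 'J', 'O'))  # ['JELLO', 'JUUNO']
```

Both methods are safe on empty strings, which indexing with w[0] or w[-1] would not be.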

Matching 2 words in 2 lines and +1 to the matching pair?

So I've got a variable list which is continually being fed a new line,
and a variable words which is a big list of single-word strings.
Every time list updates, I want to compare it to words and see whether any strings from words are in list.
If they match, say the word "and" is in both, I want to print "And : 1". If the next sentence contains it as well, I want to print "And : 2", and so on. If another word comes in, like "The", I want to add 1 to that word's count and print it.
So far I have split the incoming text into an array with text.split(); unfortunately that is where I'm stuck. I can see some use in [x for x in words if x in list], but I don't know how I would use it, or how I would extract the specific word that matches.
You can use a collections.Counter object to keep a tally for each of the words that you are tracking. To improve performance, use a set for your word list (you said it's big). To keep things simple assume there is no punctuation in the incoming line data. Case is handled by converting all incoming words to lowercase.
from collections import Counter
words = {'and', 'the', 'in', 'of', 'had', 'is'} # words to keep counts for
word_counts = Counter()
lines = ['The rabbit and the mole live in the ground',
'Here is a sentence with the word had in it',
'Oh, it also had in in it. AND the and is too']
for line in lines:
    tracked_words = [w for word in line.split() if (w:=word.lower()) in words]
    word_counts.update(tracked_words)
    print(*[f'{word}: {word_counts[word]}'
            for word in set(tracked_words)], sep=', ')
Output
the: 3, and: 1, in: 1
the: 4, in: 2, is: 1, had: 1
the: 5, and: 3, in: 4, is: 2, had: 2
Basically this code takes a line of input, splits it into words (assuming no punctuation), converts these words to lowercase, and discards any words that are not in the main set of words. Then the counter is updated. Finally the current counts of the words seen in that line are printed.
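If the incoming lines do contain punctuation, one option is to tokenize with a regular expression instead of str.split. This is only a sketch under that assumption: the helper name update_counts and the token pattern (lowercase letters and apostrophes) are illustrative choices, not part of the answer above.

```python
import re
from collections import Counter

words = {'and', 'the', 'in', 'of', 'had', 'is'}  # words to keep counts for
word_counts = Counter()


def update_counts(line):
    """Update the running tally and return the counts of tracked words in this line."""
    # Lowercase first, then pull out runs of letters/apostrophes,
    # so "it." and "AND" become "it" and "and".
    tokens = re.findall(r"[a-z']+", line.lower())
    tracked = [t for t in tokens if t in words]
    word_counts.update(tracked)
    return {t: word_counts[t] for t in tracked}


print(update_counts('Oh, it also had in in it. AND the and is too'))
print(update_counts('The end, and... the rest.'))
```

The returned dict maps each tracked word found in the line to its running total, which can then be printed in whatever format is needed.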
This does the trick:
sentence = "Hello this is a sentence"
list_of_words = ["this", "sentence"]
dict_of_counts = {} #This will hold all words that have a minimum count of 1.
for word in sentence.split():  # sentence.split() returns a list of the words in the sentence; loop over it.
    if word in list_of_words:  # Check whether the current word is one we are tracking.
        if word in dict_of_counts:
            dict_of_counts[word] += 1  # The key already exists, so add one to its value.
        else:
            dict_of_counts[word] = 1  # The key does not exist yet, so create it with a value of 1.
        print(f"{word}: {dict_of_counts[word]}")  # Print the running count.
The total count is kept in dict_of_counts and would look like this if you print it:
{'this': 1, 'sentence': 1}
You could use a defaultdict here; it removes the explicit check for whether the key already exists.
from collections import defaultdict
input_string = "This is an input string"
list_of_words = ["input", "is"]
counts = defaultdict(int)
for word in input_string.split():
    if word in list_of_words:
        counts[word] += 1

Extracting all words starting with a certain character

I have a list of lists, in which I store sentences as strings. What I want to do is to get only the words starting with #. In order to do that, I split the sentences into words and now trying to pick only the words that start with # and exclude all the other words.
# to create the empty list:
lst = []
# to iterate through the columns:
for i in range(0,len(df)):
    lst.append(df['col1'][i].split())
If I am not mistaken, you just need a flat list containing all words that start with a particular character. For that I would use list flattening (via itertools):
import itertools
first = 'f' #look for words starting with f letter
nested_list = [['This is first sentence'],['This is following sentence']]
flat_list = list(itertools.chain.from_iterable(nested_list))
nested_words = [i.split(' ') for i in flat_list]
words = list(itertools.chain.from_iterable(nested_words))
lst = [i for i in words if i.startswith(first)]  # startswith also copes with empty strings from split(' ')
print(lst) #output: ['first', 'following']
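Applied to the question's actual goal (words starting with #), the same idea can be written as a single nested comprehension. The sample sentences below are invented for illustration:

```python
# Hypothetical sample data: a list of lists, each holding sentence strings.
sentences = [['Loving the #weather today'], ['#python and #pandas are fun']]

hashtags = [word
            for inner in sentences        # each inner list
            for sentence in inner         # each sentence string
            for word in sentence.split()  # each whitespace-separated word
            if word.startswith('#')]

print(hashtags)  # ['#weather', '#python', '#pandas']
```

Using split() without an argument avoids empty strings when a sentence contains consecutive spaces.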

Extracting the first word from every value in a list

So I have a long list of column headers. All are strings, some are several words long. I've yet to find a way to write a function that extracts the first word from each value in the list and returns a list of just those singular words.
For example, this is what my list looks like:
['Customer ID', 'Email','Topwater -https:', 'Plastics - some uml']
And I want it to look like:
['Customer', 'Email', 'Topwater', 'Plastics']
I currently have this:
def first_word(cur_list):
    my_list = []
    for word in cur_list:
        my_list.append(word.split(' ')[:1])
and it returns None when I run it on a list.
You can use a list comprehension to take the first element of each string after splitting it on whitespace.
my_list = [x.split()[0] for x in your_list]
To address "and it returns None when I run it on a list."
You didn't return my_list. The function builds a new list rather than modifying the original cur_list, and since that new list is never returned, the call evaluates to None.
To extract the first word from every value in a list
Following @dfundako's answer, you can simplify it to
my_list = [x.split()[0] for x in cur_list]
The final code would be
def first_word(cur_list):
    my_list = [x.split()[0] for x in cur_list]
    return my_list
Here is a demo. Please note that some punctuation may be left behind especially if it is right after the last letter of the name:
names = ["OMG FOO BAR", "A B C", "Python Strings", "Plastics: some uml"]
first_word(names) would be ['OMG', 'A', 'Python', 'Plastics:']
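If the leftover punctuation is unwanted, one possible variant strips it from both ends of the first word. This is a sketch, not part of the answer above; first_word_clean is a hypothetical name, and it skips items that are empty or whitespace-only rather than raising an IndexError:

```python
import string


def first_word_clean(cur_list):
    """Return the first whitespace-separated word of each item, minus edge punctuation."""
    return [x.split()[0].strip(string.punctuation)
            for x in cur_list
            if x.split()]  # skip empty or whitespace-only items


names = ["OMG FOO BAR", "A B C", "Python Strings", "Plastics: some uml"]
print(first_word_clean(names))  # ['OMG', 'A', 'Python', 'Plastics']
```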
>>> l = ['Customer ID', 'Email','Topwater -https://karls.azureedge.net/media/catalog/product/cache/1/image/627x470/9df78eab33525d08d6e5fb8d27136e95/f/g/fgh55t502_web.jpg', 'Plastics - https://www.bass.co.za/1473-thickbox_default/berkley-powerbait-10-power-worm-black-blue-fleck.jpg']
>>> list(next(zip(*map(str.split, l))))
['Customer', 'Email', 'Topwater', 'Plastics']
[column.split(' ')[0] for column in my_list] should do the trick.
and if you want it in a function:
def first_word(my_list):
    return [column.split(' ')[0] for column in my_list]
(?<=\d\d\d)\d* try using this in a loop to extract the words using regex

Extracting multiple substring from a string

I have a complicated string and would like to extract multiple substrings from it.
The string consists of a set of items separated by commas. Each item has an identifier (id-n) followed by a pair of words enclosed in brackets. I want only the word inside the brackets that has a number attached to its end (e.g. 'This-1'). The number indicates the position the word should take when the words are arranged after extraction.
#Example of how the individual items would look like
id1(attr1, is-2) #The number 2 here indicates word 'is' should be in position 2
id2(attr2, This-1) #The number 1 here indicates word 'This' should be in position 1
id3(attr3, an-3) #The number 3 here indicates word 'an' should be in position 3
id4(attr4, example-4) #The number 4 here indicates word 'example' should be in position 4
id5(attr5, example-4) #This is a duplicate of the word 'example'
#Example of string - this is how the string with the items looks like
string = "id1(attr1, is-1), id2(attr2, This-2), id3(attr3, an-3), id4(attr4, example-4), id5(atttr5, example-4)"
#This is how the result should look after extraction
result = 'This is an example'
Is there an easier way to do this? Regex doesn't work for me.
A trivial/naive approach:
>>> from collections import defaultdict
>>> z = [x.split(',')[1].strip().strip(')') for x in s.split('),')]
>>> d = defaultdict(list)
>>> for i in z:
...     b = i.split('-')
...     d[b[1]].append(b[0])
...
>>> ' '.join(' '.join(d[t]) for t in sorted(d.keys(), key=int))
'is This an example example'
You have duplicated positions for example in your sample string, which is why example is repeated in the output.
However, your sample string does not match your stated requirements either; this result follows your description, with the words arranged according to their position indicators.
Now, if you want to get rid of duplicates:
>>> ' '.join(e for t in sorted(d.keys(), key=int) for e in set(d[t]))
'is This an example'
Why not regex? This works.
In [44]: s = "id1(attr1, is-2), id2(attr2, This-1), id3(attr3, an-3), id4(attr4, example-4), id5(atttr5, example-4)"
In [45]: z = [(m.group(2), m.group(1)) for m in re.finditer(r'(\w+)-(\d+)\)', s)]
In [46]: [x for y, x in sorted(set(z))]
Out[46]: ['This', 'is', 'an', 'example']
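The regex approach above can be wrapped into a function that returns the joined string. In this sketch (the name ordered_words is hypothetical) the position is converted to int before sorting, so a position such as 10 would sort after 2 rather than before it:

```python
import re


def ordered_words(s):
    """Join the words tagged `word-N)` in order of N, dropping exact duplicates."""
    # A set of (position, word) pairs removes repeated entries like example-4.
    pairs = {(int(m.group(2)), m.group(1))
             for m in re.finditer(r'(\w+)-(\d+)\)', s)}
    return ' '.join(word for _, word in sorted(pairs))


s = ("id1(attr1, is-2), id2(attr2, This-1), id3(attr3, an-3), "
     "id4(attr4, example-4), id5(atttr5, example-4)")
print(ordered_words(s))  # This is an example
```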
OK, how about this:
sample = ("id1(attr1, is-2), id2(attr2, This-1), "
          "id3(attr3, an-3), id4(attr4, example-4), id5(atttr5, example-4)")
def make_cryssie_happy(s):
    words = {} # we will use this dict later
    ll = s.split(',')[1::2]
    # we only want items like This-1, an-3, etc.
    for item in ll:
        tt = item.replace(')','').lstrip()
        (word, pos) = tt.split('-')
        words[pos] = word
        # there can only be one word at a particular position
        # using a dict with the numbers as position keys
        # is an alternative to using sets
    res = [words[i] for i in sorted(words, key=int)]
    # sort the keys numerically; a dict is not ordered by key
    # create a list of the values of the dict in sorted order
    return ' '.join(res)
    # return a nice string
print(make_cryssie_happy(sample))
