I want to improve the performance of a loop that counts word occurrences in text; right now it takes around 5 minutes for just 5 records.
DataFrame
No Text
1 I love you forever...*500 other words
2 No , i know that you know xxx *100 words
My word list:
wordlist = ['i', 'love', 'David', 'Mary', ......]
My code to count words:
for i in wordlist:
    df[i] = df['Text'].str.count(i)
Result:
No Text I love other_words
1 I love you ... 1 1 4
2 No, i know ... 1 0 5
You can do this by building a Counter from the words in each Text value and converting it into columns (via pd.Series), then summing the columns that don't appear in wordlist into other_words and dropping those columns:
import re
import pandas as pd
from collections import Counter
# normalise the word list to lower-case
wordlist = list(map(str.lower, wordlist))
# build a Counter of the words in each Text value
counters = df['Text'].apply(lambda t: Counter(re.findall(r'\b[a-z]+\b', t.lower())))
# expand the counters into one column per word
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)
# sum the columns that aren't in wordlist into other_words, then drop them
other_words = list(set(df.columns) - set(wordlist) - {'No', 'Text'})
df['other_words'] = df[other_words].sum(axis=1)
df = df.drop(other_words, axis=1)
Output (for the sample data in your question):
No Text i love other_words
0 1 I love you forever... other words 1 1 4
1 2 No , i know that you know xxx words 1 0 7
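For reference, here is a self-contained version of this answer's code, with the sample data rebuilt from the question (column names as shown there):

```python
import re
import pandas as pd
from collections import Counter

# sample data reconstructed from the question
df = pd.DataFrame({'No': [1, 2],
                   'Text': ['I love you forever... other words',
                            'No , i know that you know xxx words']})
wordlist = ['i', 'love']

# one Counter of words per Text value
counters = df['Text'].apply(lambda t: Counter(re.findall(r'\b[a-z]+\b', t.lower())))
# expand the counters into one column per word
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)
# roll all the non-wordlist word columns up into other_words
other_words = list(set(df.columns) - set(wordlist) - {'No', 'Text'})
df['other_words'] = df[other_words].sum(axis=1)
df = df.drop(other_words, axis=1)
print(df)
```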
Note:
I've converted all the words to lower-case so you're not counting I and i separately.
I've used re.findall rather than the more obvious split() so that forever... gets counted as the word forever rather than forever...
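To see the difference between the two tokenisers on the sample text:

```python
import re

text = 'I love you forever...'

# split() keeps trailing punctuation attached to the word
print(text.lower().split())
# ['i', 'love', 'you', 'forever...']

# re.findall with \b[a-z]+\b extracts only the alphabetic words
print(re.findall(r'\b[a-z]+\b', text.lower()))
# ['i', 'love', 'you', 'forever']
```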
If you only want to count the words in wordlist (and don't want an other_words count), you can simplify this to:
wordlist = list(map(str.lower, wordlist))
counters = df['Text'].apply(lambda t: Counter(w for w in re.findall(r'\b[a-z]+\b', t.lower()) if w in wordlist))
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)
Output:
No Text i love
0 1 I love you forever... other words 1 1
1 2 No , i know that you know xxx words 1 0
Another way of generating the other_words value is to build two sets of counters: one of all the words, and one of only the words in wordlist. Subtracting the second from the first gives the count of words in the text which are not in the wordlist:
wordlist = list(map(str.lower, wordlist))
counters = df['Text'].apply(lambda t: Counter(w for w in re.findall(r'\b[a-z]+\b', t.lower()) if w in wordlist))
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)
c2 = df['Text'].apply(lambda t: Counter(re.findall(r'\b[a-z]+\b', t.lower())))
df['other_words'] = (c2 - counters).apply(lambda d: sum(d.values()))
Output of this is the same as for the first code sample. Note that in Python 3.10 and later, you can use the new Counter.total() method:
(c2 - counters).apply(Counter.total)
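For illustration, here is the counter subtraction on a single text, using 'i' and 'love' as the wordlist:

```python
from collections import Counter

text = 'i love you forever'
all_words = Counter(text.split())                              # every word in the text
listed = Counter(w for w in text.split() if w in ['i', 'love'])  # only wordlist words

# subtracting leaves the counts of words NOT in the wordlist
other = all_words - listed
print(other)                 # Counter({'you': 1, 'forever': 1})
print(sum(other.values()))   # 2

# on Python 3.10+ the same sum is available as other.total()
```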
As an alternative, you could try this:
counts = (df['Text'].str.lower().str.findall(r'\b[a-z]+\b')
          .apply(lambda x: pd.Series(x).value_counts())
          .filter(map(str.lower, wordlist)).fillna(0))
df[counts.columns] = counts
print(df)
   No                                 Text    i  love
0   1    I love you forever... other words  1.0   1.0
1   2  No , i know that you know xxx words  1.0   0.0
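The 1.0/0.0 values appear because value_counts produces NaN for words missing from a row, and fillna(0) leaves the columns as floats. If you prefer integer counts, an astype(int) at the end casts them back (a self-contained sketch with sample data rebuilt from the question):

```python
import pandas as pd

df = pd.DataFrame({'Text': ['I love you forever', 'No i know that you know xxx']})
wordlist = ['i', 'love']

counts = (df['Text'].str.lower().str.findall(r'\b[a-z]+\b')
          .apply(lambda x: pd.Series(x).value_counts())
          .filter(wordlist).fillna(0).astype(int))   # cast the float counts back to int
df[counts.columns] = counts
print(df)
```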
Related
I have searched a lot here and I couldn't find the answer. I have a dataframe with a column "Description" which contains a long string, and I'm trying to count the number of occurrences of a specific word, "restaurant":
df['has_restaurants'] = 0
for index, text in enumerate(df['Description']):
    text = text.split()
    df['has_restaurants'][index] = sum(map(lambda count: 1 if 'restaurant' in count else 0, text))
I did the above and it works, but it doesn't look like a good way to do it, and it also generates this warning:
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df['has_restaurants'][index] = (sum(map(lambda count : 1 if 'restaurant' in count else 0, text)))
You might simplify that by using the .str.count method; consider the following simple example:
import pandas as pd
df = pd.DataFrame({"description":["ABC DEF GHI","ABC ABC ABC","XYZ XYZ XYZ"]})
df['ABC_count'] = df.description.str.count("ABC")
print(df)
Output:
description ABC_count
0 ABC DEF GHI 1
1 ABC ABC ABC 3
2 XYZ XYZ XYZ 0
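One caveat worth noting: str.count treats its argument as a regular expression and counts every non-overlapping match, including matches inside longer words. If whole-word counts are needed, a word-boundary pattern avoids that:

```python
import pandas as pd

df = pd.DataFrame({'description': ['ABCDEF ABC', 'ABC ABC ABC']})

print(df.description.str.count('ABC').tolist())       # [2, 3] -- also matches inside ABCDEF
print(df.description.str.count(r'\bABC\b').tolist())  # [1, 3] -- whole words only
```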
You could use Python's native .count() method:
df['has_restaurants'] = 0
for index, text in enumerate(df['Description']):
    df['has_restaurants'][index] = text.count('restaurant')
I have a DataFrame idf as below.
feature_name idf_weights
2488 kralendijk 11.221923
3059 night 0
1383 ebebf 0
I have another DataFrame df:
message Number of Words in each message
0 night kralendijk ebebf 3
I want to use the idf weights from idf to add, for each message in df, a new column counting the words with idf_weights > 0.
The output will look like the below:
message Number of Words in each message Number of words with idf_score>0
0 night kralendijk ebebf 3 1
Here is what I've tried so far, but it gives the total count of words instead of the count of words having idf_weights > 0:
words_weights = dict(idf[['feature_name', 'idf_weights']].values)
df['> zero'] = df['message'].apply(lambda x: len([words_weights.get(word, 11.221923) for word in x.split()]))
Output
message Number of Words in each message Number of words with idf_score>0
0 night kralendijk ebebf 3 3
Thank you.
Try using a list comprehension:
# set up a dictionary for easy feature->weight indexing
d = idf.set_index('feature_name')['idf_weights'].to_dict()
# {'kralendijk': 11.221923, 'night': 0.0, 'ebebf': 0.0}
df['> zero'] = [sum(d.get(w, 0)>0 for w in x.split()) for x in df['message']]
## OR, slightly faster alternative
# df['> zero'] = [sum(1 for w in x.split() if d.get(w, 0)>0) for x in df['message']]
output:
message Number of Words in each message > zero
0 night kralendijk ebebf 3 1
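For reference, a self-contained version of the above, with the sample frames rebuilt from the question:

```python
import pandas as pd

idf = pd.DataFrame({'feature_name': ['kralendijk', 'night', 'ebebf'],
                    'idf_weights': [11.221923, 0.0, 0.0]})
df = pd.DataFrame({'message': ['night kralendijk ebebf']})

# set up a dictionary for easy feature->weight indexing
d = idf.set_index('feature_name')['idf_weights'].to_dict()
df['> zero'] = [sum(d.get(w, 0) > 0 for w in x.split()) for x in df['message']]
print(df['> zero'].tolist())   # [1] -- only 'kralendijk' has a positive weight
```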
You can use str.findall: the goal here is to create a list of feature names with a weight greater than 0 to find in each message.
pattern = fr"({'|'.join(idf.loc[idf['idf_weights'] > 0, 'feature_name'])})"
df['Number of words with idf_score>0'] = df['message'].str.findall(pattern).str.len()
print(df)
# Output
message Number of Words in each message Number of words with idf_score>0
0 night kralendijk ebebf 3 1
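One thing to watch with this approach: the feature names are interpolated into the pattern as raw regex, and they will also match inside longer words. Escaping the names and adding word boundaries makes the pattern stricter (a sketch; not strictly needed for the sample data):

```python
import re
import pandas as pd

idf = pd.DataFrame({'feature_name': ['kralendijk', 'night', 'ebebf'],
                    'idf_weights': [11.221923, 0.0, 0.0]})
df = pd.DataFrame({'message': ['night kralendijk ebebf']})

# escape each name and require whole-word matches
words = idf.loc[idf['idf_weights'] > 0, 'feature_name']
pattern = r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b'
df['Number of words with idf_score>0'] = df['message'].str.findall(pattern).str.len()
print(df['Number of words with idf_score>0'].tolist())   # [1]
```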
I have texts in one column and a corresponding dictionary in another column. I have tokenized the text and want to replace those tokens that match a key in that row's dictionary. The text and the dictionary are specific to each record of a pandas DataFrame.
import pandas as pd
data =[['1','i love mangoes',{'love':'hate'}],['2', 'its been a long time we have not met',{'met':'meet'}],['3','i got a call from one of our friends',{'call':'phone call','one':'couple of'}]]
df = pd.DataFrame(data, columns = ['id', 'text','dictionary'])
The final output DataFrame should be:
data = [['1', 'i hate mangoes'], ['2', 'its been a long time we have not meet'], ['3', 'i got a phone call from couple of of our friends']]
df = pd.DataFrame(data, columns=['id', 'modified_text'])
I am using Python 3 on a Windows machine.
You can use the dict.get method after zipping the two columns and splitting the sentence:
df['modified_text'] = [' '.join([b.get(i, i) for i in a.split()])
                       for a, b in zip(df['text'], df['dictionary'])]
print(df)
Output:
id text \
0 1 i love mangoes
1 2 its been a long time we have not met
2 3 i got a call from one of our friends
dictionary \
0 {'love': 'hate'}
1 {'met': 'meet'}
2 {'call': 'phone call', 'one': 'couple of'}
modified_text
0 i hate mangoes
1 its been a long time we have not meet
2 i got a phone call from couple of of our friends
I added spaces around the keys and values to distinguish a whole word from part of one:
def replace(text, mapping):
    new_s = text
    for key in mapping:
        k = ' ' + key + ' '
        val = ' ' + mapping[key] + ' '
        new_s = new_s.replace(k, val)
    return new_s

df_out = (df.assign(modified_text=lambda f:
                    f.apply(lambda row: replace(row.text, row.dictionary), axis=1))
          [['id', 'modified_text']])
print(df_out)
print(df_out)
id modified_text
0 1 i hate mangoes
1 2 its been a long time we have not met
2 3 i got a phone call from couple of of our friends
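Note that padding only the keys means a word at the very start or end of the text has no surrounding spaces to match, which is why 'met' survives in row 1 above. A small variation (a sketch, with a hypothetical name replace_padded) pads the text itself to close that gap:

```python
def replace_padded(text, mapping):
    # pad the text itself so words at the start/end also have spaces around them
    new_s = ' ' + text + ' '
    for key, val in mapping.items():
        new_s = new_s.replace(' ' + key + ' ', ' ' + val + ' ')
    return new_s.strip()

print(replace_padded('its been a long time we have not met', {'met': 'meet'}))
# its been a long time we have not meet
```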
I have a Pandas DataFrame (df) where some of the words contain encoding replacement characters. I want to replace these words with replacement words from a dictionary (translations).
translations = {'gr�nn': 'gronn', 'm�nst': 'menst'}
df = pd.DataFrame(["gr�nn Y", "One gr�nn", "Y m�nst/line X"])
df.replace(translations, regex=True, inplace=True)
However, it doesn't seem to capture all the instances.
Current output:
0
0 gronn Y
1 One gr�nn
2 Y m�nst/line X
Do I need to specify any regex patterns to enable the replacement to also capture partial words within a string?
Expected output:
0
0 gronn Y
1 One gronn
2 Y menst/line X
Turn your translations into regex find/replace strings:
translations = {r'(.*)gr�nn(.*)': r'\1gronn\2', r'(.*)m�nst(.*)': r'\1menst\2'}
df = pd.DataFrame(["gr�nn Y", "One gr�nn", "Y m�nst/line X"])
df.replace(translations, regex=True)
Returns:
0
0 gronn Y
1 One gronn
2 Y menst/line X
Can anyone help me understand this piece of code?
def remove_digit(data):
    newData = ''.join([i for i in data if not i.isdigit()])
    i = newData.find('(')
    if i > -1:
        newData = newData[:i]
    return newData.strip()
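In plain terms, the function removes every digit from the string, truncates at the first '(' if one is present, and strips surrounding whitespace. A quick demonstration (restating the function so the example is self-contained):

```python
def remove_digit(data):
    # 1) drop every digit character
    newData = ''.join([i for i in data if not i.isdigit()])
    # 2) truncate at the first '(' if there is one
    i = newData.find('(')
    if i > -1:
        newData = newData[:i]
    # 3) trim surrounding whitespace
    return newData.strip()

print(remove_digit('abc123'))          # abc
print(remove_digit('value42 (note)'))  # value
```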
Why don't you use a regex? [0-9()] matches any character that is a digit, '(' or ')':
import re
newData = re.sub('[0-9()]', '', data)
Given this df:
data
0 a43
1 b((
2 cr3r3
3 d
You can remove digits and parentheses from the column in this way:
df['data'] = df['data'].str.replace(r'\d|\(|\)', '', regex=True)
Output:
data
0 a
1 b
2 crr
3 d