I have a dataframe idf as below:
feature_name idf_weights
2488 kralendijk 11.221923
3059 night 0
1383 ebebf 0
I have another dataframe df:
message Number of Words in each message
0 night kralendijk ebebf 3
For each word in each message of df, I want to look up its weight in idf and add a new column counting the words whose idf score is greater than 0.
The expected output looks like this:
message Number of Words in each message Number of words with idf_score>0
0 night kralendijk ebebf 3 1
Here is what I've tried so far, but it gives the total count of words instead of the count of words having idf_weights > 0:
words_weights = dict(idf[['feature_name', 'idf_weights']].values)
df['> zero'] = df['message'].apply(lambda x: len([words_weights.get(word, 11.221923) for word in x.split()]))
Output
message Number of Words in each message Number of words with idf_score>0
0 night kralendijk ebebf 3 3
Thank you.
Try using a list comprehension:
# set up a dictionary for easy feature->weight indexing
d = idf.set_index('feature_name')['idf_weights'].to_dict()
# {'kralendijk': 11.221923, 'night': 0.0, 'ebebf': 0.0}
df['> zero'] = [sum(d.get(w, 0)>0 for w in x.split()) for x in df['message']]
## OR, slightly faster alternative
# df['> zero'] = [sum(1 for w in x.split() if d.get(w, 0)>0) for x in df['message']]
Output:
message Number of Words in each message > zero
0 night kralendijk ebebf 3 1
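For reference, here is a minimal, self-contained sketch that rebuilds both frames from the sample data above, so the comprehension can be run end to end:
import pandas as pd

# rebuild the sample frames from the question
idf = pd.DataFrame({'feature_name': ['kralendijk', 'night', 'ebebf'],
                    'idf_weights': [11.221923, 0.0, 0.0]})
df = pd.DataFrame({'message': ['night kralendijk ebebf']})

d = idf.set_index('feature_name')['idf_weights'].to_dict()
df['> zero'] = [sum(d.get(w, 0) > 0 for w in x.split()) for x in df['message']]
print(df['> zero'].tolist())  # [1]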
You can use str.findall: the goal here is to create a list of feature names with a weight greater than 0 to find in each message.
pattern = fr"({'|'.join(idf.loc[idf['idf_weights'] > 0, 'feature_name'])})"
df['Number of words with idf_score>0'] = df['message'].str.findall(pattern).str.len()
print(df)
# Output
message Number of Words in each message Number of words with idf_score>0
0 night kralendijk ebebf 3 1
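One caveat, as a hedged addition to this answer: the joined pattern matches anywhere in the string, so a feature name could match inside a longer word, and any regex metacharacters in a feature name would corrupt the pattern. A sketch that escapes each name and anchors on word boundaries:
import re

features = idf.loc[idf['idf_weights'] > 0, 'feature_name'].map(re.escape)
pattern = fr"\b({'|'.join(features)})\b"
df['Number of words with idf_score>0'] = df['message'].str.findall(pattern).str.len()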
Related
I want to improve the performance of a loop that counts word occurrences in text; it currently takes around 5 minutes for just 5 records.
DataFrame
No Text
1 I love you forever...*500 other words
2 No , i know that you know xxx *100 words
My word list
wordlist =['i','love','David','Mary',......]
My code to count words:
for i in wordlist :
df[i] = df['Text'].str.count(i)
Result:
No Text I love other_words
1 I love you ... 1 1 4
2 No, i know ... 1 0 5
You can do this by making a Counter from the words in each Text value, converting that into columns (using pd.Series), summing the columns that don't exist in wordlist into other_words, and then dropping those columns:
import re
import pandas as pd
from collections import Counter
wordlist = list(map(str.lower, wordlist))
counters = df['Text'].apply(lambda t: Counter(re.findall(r'\b[a-z]+\b', t.lower())))
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)
other_words = list(set(df.columns) - set(wordlist) - {'No', 'Text'})
df['other_words'] = df[other_words].sum(axis=1)
df = df.drop(other_words, axis=1)
Output (for the sample data in your question):
No Text i love other_words
0 1 I love you forever... other words 1 1 4
1 2 No , i know that you know xxx words 1 0 7
Note:
I've converted all the words to lower-case so you're not counting I and i separately.
I've used re.findall rather than the more obvious split() so that forever... gets counted as the word forever rather than forever...
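A quick demonstration of that last point:
import re
text = 'I love you forever...'.lower()
print(text.split())                     # ['i', 'love', 'you', 'forever...']
print(re.findall(r'\b[a-z]+\b', text))  # ['i', 'love', 'you', 'forever']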
If you only want to count the words in wordlist (and don't want an other_words count), you can simplify this to:
wordlist = list(map(str.lower, wordlist))
counters = df['Text'].apply(lambda t: Counter(w for w in re.findall(r'\b[a-z]+\b', t.lower()) if w in wordlist))
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)
Output:
No Text i love
0 1 I love you forever... other words 1 1
1 2 No , i know that you know xxx words 1 0
Another way of generating the other_words value is to build two sets of counters: one of all the words, and one of only the words in wordlist. These can then be subtracted from each other to find the count of words in the text that are not in the wordlist:
wordlist = list(map(str.lower, wordlist))
counters = df['Text'].apply(lambda t: Counter(w for w in re.findall(r'\b[a-z]+\b', t.lower()) if w in wordlist))
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)
c2 = df['Text'].apply(lambda t: Counter(re.findall(r'\b[a-z]+\b', t.lower())))
df['other_words'] = (c2 - counters).apply(lambda d: sum(d.values()))
Output of this is the same as for the first code sample. Note that in Python 3.10 and later, you should be able to use the new Counter.total method:
(c2 - counters).apply(Counter.total)
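For illustration, a small standalone example of the Counter subtraction and the 3.10+ total method:
from collections import Counter

all_words = Counter(['i', 'love', 'you', 'forever'])
listed = Counter(['i', 'love'])
rest = all_words - listed       # Counter({'you': 1, 'forever': 1})
print(sum(rest.values()))       # 2 (works on any Python 3)
print(rest.total())             # 2 (Python 3.10+)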
As an alternative, you could try this:
counts = (df['Text'].str.lower().str.findall(r'\b[a-z]+\b')
          .apply(lambda x: pd.Series(x).value_counts())
          .filter(list(map(str.lower, wordlist))).fillna(0))
df[counts.columns] = counts
print(df)
'''
   No                                 Text    i  love
0   1    I love you forever... other words  1.0   1.0
1   2  No , i know that you know xxx words  1.0   0.0
'''
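If integer counts are preferred, the float columns (an artifact of the NaN-filling step) can be cast when assigning back, e.g.:
df[counts.columns] = counts.astype(int)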
I have a dataframe with data in a format similar to this:
song lyric tokenized_lyrics
0 Song 1 Look at her face, it's a wonderful face [look, at, her, face, it's, a, wonderful, face]
1 Song 2 Some lyrics of the song taken [Some, lyrics, of, the, song, taken]
I want to count the number of words in the lyrics per song, with output like:
song count
song 1 8
song 2 6
I tried an aggregate function, but it is not yielding the correct result.
Code I tried:
df.groupby(['song']).agg(
    word_count=pd.NamedAgg(column='text', aggfunc='count')
)
How can I achieve the desired result?
I couldn't copy tokenized_lyrics as a list (it came in as a string), so I tokenized the lyrics myself, assuming the delimiter is whitespace:
df['token_count'] = df.lyric.str.replace(',','').str.split().str.len()
df.filter(['song','token_count'])
song token_count
0 Song 1 8
1 Song 2 6
Note that you can just apply str.len to the tokenized lyrics to get your count; since each entry is a list, it counts the individual items.
Use Series.str.len to count the values, and if there are duplicated song values, aggregate with sum:
df1 = (df.assign(count=df['tokenized_lyrics'].str.len())
         .groupby('song', as_index=False)['count'].sum())
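A minimal sketch reconstructing the sample frame (assuming tokenized_lyrics holds real Python lists) and running the aggregation:
import pandas as pd

df = pd.DataFrame({
    'song': ['Song 1', 'Song 2'],
    'tokenized_lyrics': [['look', 'at', 'her', 'face', "it's", 'a', 'wonderful', 'face'],
                         ['Some', 'lyrics', 'of', 'the', 'song', 'taken']],
})
df1 = (df.assign(count=df['tokenized_lyrics'].str.len())
         .groupby('song', as_index=False)['count'].sum())
print(df1)
#      song  count
# 0  Song 1      8
# 1  Song 2      6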
I have texts in one column and a respective dictionary in another column. I have tokenized the text and want to replace those tokens that match a key in the respective dictionary. The text and the dictionary are specific to each record of a pandas dataframe.
import pandas as pd
data =[['1','i love mangoes',{'love':'hate'}],['2', 'its been a long time we have not met',{'met':'meet'}],['3','i got a call from one of our friends',{'call':'phone call','one':'couple of'}]]
df = pd.DataFrame(data, columns = ['id', 'text','dictionary'])
The final output dataframe should be:
data = [['1', 'i hate mangoes'], ['2', 'its been a long time we have not meet'], ['3', 'i got a phone call from couple of of our friends']]
df = pd.DataFrame(data, columns=['id', 'modified_text'])
I am using Python 3 on a Windows machine.
You can use the dict.get method after zipping the two columns and splitting each sentence:
df['modified_text'] = [' '.join([b.get(i, i) for i in a.split()])
                       for a, b in zip(df['text'], df['dictionary'])]
print(df)
Output:
id text \
0 1 i love mangoes
1 2 its been a long time we have not met
2 3 i got a call from one of our friends
dictionary \
0 {'love': 'hate'}
1 {'met': 'meet'}
2 {'call': 'phone call', 'one': 'couple of'}
modified_text
0 i hate mangoes
1 its been a long time we have not meet
2 i got a phone call from couple of of our friends
I added spaces to the keys and values to distinguish a whole word from part of one:
def replace(text, mapping):
    new_s = text
    for key in mapping:
        k = ' ' + key + ' '
        val = ' ' + mapping[key] + ' '
        new_s = new_s.replace(k, val)
    return new_s
df_out = (df.assign(modified_text=lambda f:
                    f.apply(lambda row: replace(row.text, row.dictionary), axis=1))
          [['id', 'modified_text']])
print(df_out)
id modified_text
0 1 i hate mangoes
1 2 its been a long time we have not met
2 3 i got a phone call from couple of of our friends
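Note the caveat visible in row 1 above: because the keys are padded with spaces, a word at the very start or end of the string (met here) is never replaced. A hedged sketch using word-boundary regexes instead, which handles the string edges:
import re

def replace_words(text, mapping):
    # \b matches whole-word boundaries anywhere, including string start/end
    for key, val in mapping.items():
        text = re.sub(rf'\b{re.escape(key)}\b', val, text)
    return text

df['modified_text'] = df.apply(lambda r: replace_words(r.text, r.dictionary), axis=1)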
I am doing some natural language processing on some Twitter data. I managed to load and clean up some tweets and place them into the data frame below.
id text
1104159474368024599 repmiketurner the only time that michael cohen told the truth is when he pled that he is guilty also when he said no collusion and i did not tell him to lie
1104155456019357703 rt msnbc president trump and first lady melania trump view memorial crosses for the 23 people killed in the alabama tornadoes t
I am trying to construct a term frequency matrix where each row is a tweet and each column holds the number of times that word occurs in that tweet. My only problem is that other posts only mention term frequency distributions for text files. Here is the code I used to generate the data frame above:
import pandas as pd
import nltk.classify
from nltk.tokenize import word_tokenize
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist

df_tweetText = df_tweet
# Make a dataframe of just the text and ID to make it easier to tokenize
df_tweetText = pd.DataFrame(df_tweetText['text'].str.replace(r'[^\w\s]+', '', regex=True).str.lower())
# Removing stop words
# nltk.download('stopwords')
stop = stopwords.words('english')
# df_tweetText['text'] = df_tweetText.apply(lambda x: [item for item in x if item not in stop])
# Remove the https links
df_tweetText['text'] = df_tweetText['text'].replace("[https]+[a-zA-Z0-9]{14}", '', regex=True, inplace=False)
# Tokenize the words
df_tweetText
At first I tried to use the function word_dist = nltk.FreqDist(df_tweetText['text']), but it ended up counting each entire sentence as one value instead of each word in the row.
Another thing I tried was to tokenize each word using df_tweetText['text'] = df_tweetText['text'].apply(word_tokenize) and then call FreqDist again, but that gives me an error saying unhashable type: 'list'.
1104159474368024599 [repmiketurner, the, only, time, that, michael, cohen, told, the, truth, is, when, he, pled, that, he, is, guilty, also, when, he, said, no, collusion, and, i, did, not, tell, him, to, lie]
1104155456019357703 [rt, msnbc, president, trump, and, first, lady, melania, trump, view, memorial, crosses, for, the, 23, people, killed, in, the, alabama, tornadoes, t]
Is there some alternative way to construct this term frequency matrix? Ideally, I want my data to look something like this:
id |collusion | president |
------------------------------------------
1104159474368024599 | 1 | 0 |
1104155456019357703 | 0 | 2 |
EDIT 1: So I decided to take a look at the textmining library and recreated one of their examples. The only problem is that it creates the term-document matrix with a single row containing every tweet.
import textmining

# Create the term-document matrix
tweetDocumentmatrix = textmining.TermDocumentMatrix()
for column in df_tweetText:
    tweetDocumentmatrix.add_doc(df_tweetText['text'].to_string(index=False))
    # print(df_tweetText['text'].to_string(index=False))
for row in tweetDocumentmatrix.rows(cutoff=1):
    print(row)
EDIT 2: So I tried scikit-learn, and that sort of worked, but the problem is that I'm finding Chinese/Japanese characters in my columns, which should not exist. Also, my columns are showing up as numbers for some reason.
from sklearn.feature_extraction.text import CountVectorizer
corpus = df_tweetText['text'].tolist()
vec = CountVectorizer()
X = vec.fit_transform(corpus)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
print(df)
00 007cigarjoe 08 10 100 1000 10000 100000 1000000 10000000 \
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
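On the numeric column names: CountVectorizer's default token_pattern keeps any token of two or more word characters, digits included. Passing a stricter pattern (a sketch; tune it to your data) drops the pure-number columns:
from sklearn.feature_extraction.text import CountVectorizer

# only alphabetic tokens of length >= 2; tokens like '00' or '10' are dropped
vec = CountVectorizer(token_pattern=r'\b[a-zA-Z]{2,}\b')
X = vec.fit_transform(corpus)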
Iterating over each row is probably not optimal, but it works. Mileage may vary based on how long the tweets are and how many tweets are being processed.
import pandas as pd
from collections import Counter

# example df
df = pd.DataFrame()
df['tweets'] = [['test', 'xd'], ['hehe', 'xd'], ['sam', 'xd', 'xd']]

# result dataframe: build one row of word counts per tweet, then concatenate
# (DataFrame.append was removed in pandas 2.0, so collect the rows in a list)
rows = []
for i, row in df.iterrows():
    rows.append(pd.DataFrame.from_dict(Counter(row.tweets), orient='index').transpose())
df2 = pd.concat(rows, ignore_index=True)
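A vectorized alternative sketch (assuming pandas 0.25+ for explode) that avoids growing a frame row by row:
# one row per word, then pivot the per-tweet counts back into columns
counts = (df['tweets'].explode()
          .groupby(level=0)
          .value_counts()
          .unstack(fill_value=0))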
I have a dataframe df like this, with the column name title.
title
I have a pen tp001
I have rt0024 apple
I have wtw003 orange
I need to return a new title like the following (each code begins with letters and ends with digits):
title
tp001
rt0024
wtw003
So I used df['new_title'] = df['title'].str.extract(r'^[a-z].*\d$'), but it didn't work. The error is ValueError: pattern contains no capture groups.
I updated the question; each word has a different number of letters and digits.
You can use:
df['title'] = df['title'].str.extract(r'(\w+\d+)', expand=False)
>>> df
title
0 tp001
1 rt0024
2 wtw003
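If the match must literally begin with letters and end with digits, a slightly stricter pattern (an assumption on my part, not from the original answer) avoids the digit-first tokens that \w+ would also accept:
df['title'] = df['title'].str.extract(r'\b([a-z]+\d+)\b', expand=False)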
By using extract:
df.title.str.extract(r'([a-z]{2}[0-9]{3})',expand=True)
Out[250]:
0
0 tp001
1 rt002
2 wt003
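Note that the fixed-width quantifiers {2} and {3} clip the longer codes, which is why rt0024 and wtw003 come out truncated above; a variable-length version captures them whole:
df.title.str.extract(r'([a-z]+[0-9]+)', expand=True)
#         0
# 0   tp001
# 1  rt0024
# 2  wtw003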