I have a pandas Series named 'traindata'.
0 Published: 4:53AM Friday August 29, 2014 Sourc...
1 8 Have your say\n\n\nPlaying low-level club c...
2 Rohit Shetty has now turned producer. But the ...
3 A TV reporter in Serbia almost lost her job be...
4 THE HAGUE -- Tony de Brum was 9 years old in 1...
5 Australian TV cameraman Harry Burton was kille...
6 President Barack Obama sharply rebuked protest...
7 The car displaying the DIE FOR SYRIA! sticker....
8 \nIf you've ever been, you know that seeing th...
9 \nThe former executive director of JBWere has ...
10 Waterloo Road actor Joe Slater has revealed hi...
...
Name: traindata, Length: 2284, dtype: object
What I want to do is replace the Series values with the stemmed sentences.
My thought is to build a new Series and put the stemmed sentences into it.
My code is below:
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

stem_word_data = np.zeros([2284, 1])
ps = PorterStemmer()
for i in range(0, len(traindata)):
    tst = word_tokenize(traindata[i])
    for word in tst:
        word = ps.stem(word)
        stem_word_data[i] = word
and then an error occurs:
ValueError: could not convert string to float: 'publish'
Does anyone know how to fix this error, or have a better idea of how to replace the Series values with the stemmed sentences? Thanks.
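For context on the error: np.zeros([2284, 1]) allocates a float64 array, so assigning a string such as 'publish' to one of its cells raises exactly this ValueError, and the inner loop also overwrites word on every pass instead of collecting the stems. One way to keep the loop style is to collect plain strings instead; a rough sketch, assuming traindata is the Series shown above:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
stemmed_sentences = []
for sentence in traindata:
    # stem every token, then join the tokens back into one sentence
    stemmed_sentences.append(' '.join(ps.stem(w) for w in word_tokenize(sentence)))

traindata = pd.Series(stemmed_sentences, index=traindata.index, name=traindata.name)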
You can use apply on a series and avoid writing loops.
import pandas as pd
from nltk import word_tokenize
from nltk.stem import PorterStemmer

## initialise the stemmer
pst = PorterStemmer()

## sample data frame
df = pd.DataFrame({'senten': ['I am not dancing', 'You are playing']})

## tokenize, then stem each token and join back into a sentence
df['senten'] = df['senten'].apply(word_tokenize)
df['senten'] = df['senten'].apply(lambda x: ' '.join([pst.stem(y) for y in x]))
print(df)
senten
0 i am not danc
1 you are play
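Applied to the Series from the question, the same idea is a one-liner; a sketch, assuming traindata is the Series shown above and word_tokenize/pst are imported and initialised as here:
traindata = traindata.apply(lambda s: ' '.join(pst.stem(w) for w in word_tokenize(s)))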
I have a dataframe called DF which contains two types of information: a datetime and a sentence (string).
0 2019-02-01 point say give choice invest motor today money...
1 2019-02-01 get inside car drive drunk excuse bad driving ...
2 2019-02-01 look car snow know buy car snow
3 2019-02-01 drive home car day terrify experience stay least
4 2019-02-01 quid way ferry nice trip enjoy land list celeb...
... ... ...
35818 2021-09-30 choice life drive type car holiday type carava...
35819 2021-09-30 scarlet carson bloody marvellous big car lover...
35820 2021-09-30 podcast adriano great episode dude weird car d...
35821 2021-09-30 scarlet carson smugly cruise traffic know driv...
35822 2021-09-30 hornet know fuel shortage brexit destroy suppl...
Now I have a word list, and I want to check whether each sentence contains any of these strings:
word_list=['drive','car','buy','fuel','electric','panic','tax','second hand','petrol','auto']
I only need to count a sentence once if any word from the word list appears in it. Here is my solution:
set_list = []
for word in word_list:
    for sentence in DF['new_processed_text']:
        if word in sentence:
            set_list.append(sentence)
count = len(set(set_list))
However, this works on the whole dataset, and I want to do the counting per day.
I don't know much about dataframe.groupby; do I need it?
You can use a regular expression, the pd.DataFrame.loc method, the pd.Series.value_counts() method and the pandas string methods for this:
In aggregate:
len(df.loc[df['new_processed_text'].str.contains('|'.join(word_list), regex=True)].index)
Processing per day:
df.loc[df['new_processed_text'].str.contains('|'.join(word_list), regex=True), 'date_column'].value_counts()
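A quick illustration on a toy frame (the column names date_column and new_processed_text are just assumptions matching the snippets above):
import pandas as pd

word_list = ['drive', 'car', 'buy']
df = pd.DataFrame({
    'date_column': ['2019-02-01', '2019-02-01', '2019-02-02'],
    'new_processed_text': ['look car snow', 'drive home day', 'nice trip enjoy'],
})

mask = df['new_processed_text'].str.contains('|'.join(word_list), regex=True)
print(len(df.loc[mask].index))                     # 2 matching rows in aggregate
print(df.loc[mask, 'date_column'].value_counts())  # 2019-02-01    2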
You can remove duplicates first and then use the string methods of pandas Series objects.
import pandas as pd
s = pd.Series(['abc def', 'def xyz ijk', 'xyz ijk', 'abc def', 'abc def', 'ijk mn', 'def xyz'])
words = ['abc', 'xyz']
s_prime = s.drop_duplicates()
contains_word = s_prime.str.contains("|".join(words))
print(contains_word.sum())
In your case, s = DF['new_processed_text'] and words = word_list.
I forgot the per-day part, which @Vladimir Vilimaitis provides in his answer with his last line.
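Since the question also mentioned groupby, one possible way to combine the duplicate handling with the per-day split is sketched below; the date column name date_column is an assumption, as above:
pattern = '|'.join(word_list)
per_day = (DF.drop_duplicates(subset=['date_column', 'new_processed_text'])
             .assign(hit=lambda d: d['new_processed_text'].str.contains(pattern, regex=True))
             .groupby('date_column')['hit']
             .sum())
print(per_day)  # number of unique matching sentences per day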
I have checked the other two answers to a similar question, but in vain.
from gensim.corpora.dictionary import Dictionary
from gensim.models.tfidfmodel import TfidfModel
from nltk.tokenize import regexp_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

wordnetLem = WordNetLemmatizer()
article = "Aziz Ismail Ansari[1] (/ənˈsɑːri/; born February 23, 1983) is an American actor, writer, producer, director, and comedian. He is known for his role as Tom Haverford on the NBC series Parks and Recreation (2009–2015) and as creator and star of the Netflix series Master of None (2015–) for which he won several acting and writing awards, including two Emmys and a Golden Globe, which was the first award received by an Asian American actor for acting on television.[2][3][4]Ansari began performing comedy in New York City while a student at New York University in 2000. He later co-created and starred in the MTV sketch comedy show Human Giant, after which he had acting roles in a number of feature films.As a stand-up comedian, Ansari released his first comedy special, Intimate Moments for a Sensual Evening, in January 2010 on Comedy Central Records. He continues to perform stand-up on tour and on Netflix. His first book, Modern Romance: An Investigation, was released in June 2015. He was included in the Time 100 list of most influential people in 2016.[5] In July 2019, Ansari released his fifth comedy special Aziz Ansari: Right Now, which was nominated for a Grammy Award for Best Comedy Album.[6]"
tokens = regexp_tokenize(article, r"\w+|\d+")
az_only_alpha = [t for t in tokens if t.isalpha()]
#below line will take some time to execute.
az_no_stop_words = [t for t in az_only_alpha if t not in stopwords.words('english')]
az_lemmetize = [wordnetLem.lemmatize(t) for t in az_no_stop_words]
dictionary = Dictionary([az_lemmetize])
x = [dictionary.doc2bow(doc) for doc in [az_lemmetize]]
cp = TfidfModel(x)
cp[x[0]]
The last line gives me an empty array, whereas I expect a list of tuples with ids and their weights.
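This is likely because the corpus passed to TfidfModel contains a single document: every term then appears in all (one) documents, its IDF is log2(1/1) = 0, and gensim drops zero-weight entries, leaving an empty list. A minimal sketch with two documents (the tokens are made up for illustration) gives non-empty weights:
from gensim.corpora.dictionary import Dictionary
from gensim.models.tfidfmodel import TfidfModel

docs = [['aziz', 'ansari', 'actor', 'comedian'],
        ['aziz', 'ansari', 'netflix', 'series']]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
model = TfidfModel(corpus)
print(model[corpus[0]])  # only the terms unique to the first document get a weight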
Would it be possible to take a specific n-gram from the whole list of them and look for it in a list of sentences?
For example:
I have the following sentences (from a dataframe column):
example = ['Mary had a little lamb. Jack went up the hill',
           'Jack went to the beach',
           'i woke up suddenly',
           'it was a really bad dream...']
and n-grams (bigrams) obtained from
from sklearn.feature_extraction.text import CountVectorizer
word_v = CountVectorizer(ngram_range=(2,2), analyzer='word')
mat = word_v.fit_transform(df['Example'])
frequencies = sum(mat).toarray()[0]
which generates the n-gram frequencies.
I would like to select
the most frequent bi-grams
a bi-gram selected manually
within the example list above.
So, let's say that the most frequent bi-gram is Jack went; how could I look for it in the example above?
Also, if I want to look not at the most frequent bi-grams but at the hill / the beach in the example, how could I do it?
To select the rows that contain the most frequent n-gram, you can do:
df.loc[mat.toarray()[:, frequencies==frequencies.max()].astype(bool).any(axis=1)]
Example
0 Mary had a little lamb. Jack went up the hill
1 Jack went to the beach
but if two n-grams share the max frequency, you would get all the rows where either of them is present.
If you want the top 3 n-grams and all the rows that contain any of them:
top = 3
print (df.loc[mat.toarray()[:, np.argsort(frequencies)][:, -top:].any(axis=1)])
Example
0 Mary had a little lamb. Jack went up the hill
1 Jack went to the beach
2 i woke up suddenly
3 it was a really bad dream...
#here you get every row of the example, since the top 3 n-grams cover them all
#and with the 3 least frequent n-grams instead:
bottom = 3
print (df.loc[mat.toarray()[:, np.argsort(frequencies)][:, :bottom].any(axis=1)])
Example
1 Jack went to the beach
3 it was a really bad dream...
Finally, if you want a specific n-gram:
ng = 'it was'
df.loc[mat.toarray()[:, np.array(word_v.get_feature_names())==ng].astype(bool).any(axis=1)]
Example
3 it was a really bad dream...
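One caveat about that last line: in recent scikit-learn releases get_feature_names() was deprecated (1.0) and removed (1.2), so you may need get_feature_names_out() instead, which already returns an array:
df.loc[mat.toarray()[:, word_v.get_feature_names_out() == ng].astype(bool).any(axis=1)]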
I am trying to remove stopwords from df['Sentences'], as I need to plot the word frequencies.
My sample is
... 13
London 12
holiday 11
photo 7
. 7
..
walk 1
dogs 1
I have built my own dictionary and I would like to use it to remove the stopwords from that list.
What I have done is as follows:
import matplotlib.pyplot as plt
df['Sentences'] = df['Sentences'].apply(lambda x: ' '.join([item for item in x.split() if item not in my_dict]))
w_freq=df.Sentences.str.split(expand=True).stack().value_counts()
Although it does not give me any error, the stopwords and punctuation are still there. Also, I would like not to alter the column, but just look at the results for a short analysis (for example, by working on a copy of the original column).
How could I remove them?
Let's assume you have this dataframe with this really interesting conversation.
import pandas as pd

df = pd.DataFrame({'Sentences':['Hello, how are you?',
                                'Hello, I am fine. Have you watched the news',
                                'Not really the news ...']})
print (df)
Sentences
0 Hello, how are you?
1 Hello, I am fine. Have you watched the news
2 Not really the news ...
Now you want to remove the punctuation and the stopwords from my_dict; you can do it like this:
my_dict = ['a','i','the','you', 'am', 'are', 'have']
s = (df['Sentences'].str.lower()                  #to prevent any case problem
     .str.replace(r'[^\w\s]+', '', regex=True)    # remove the punctuation
     .str.split(' ')                              # create a list of words
     .explode()                                   # create a row per word of the lists
     .value_counts()                              # get occurrences
     )
s = s[~s.index.isin(my_dict)] #remove the stopwords
print (s) #you can see you don't have punctuation nor stopwords
news 2
hello 2
watched 1
fine 1
not 1
really 1
how 1
1
Name: Sentences, dtype: int64
This might not be the fastest way, though.
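Since the goal was a plot without touching the original column: s is a brand-new Series, so df['Sentences'] stays as it was, and the counts can be plotted directly, for example:
import matplotlib.pyplot as plt

s.head(10).plot.bar()  # ten most frequent remaining words
plt.tight_layout()
plt.show()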
How do I remove the same words in the dataframe named df3?
My code below doesn't seem to work...
df3 = pd.DataFrame(np.array(c3), columns=["content"]).drop_duplicates()
def text_processing_cat3(df3):
    '''=== Removal of common words ==='''
    freq = pd.Series(' '.join(df3['content']).split()).value_counts()[:10]
    freq = list(freq.index)
    df3['content'] = df3['content'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))

    '''=== Removal of rare words ==='''
    freq = pd.Series(' '.join(df3['content']).split()).value_counts()[-10:]
    freq = list(freq.index)
    df3['content'] = df3['content'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))

    return df3

print(text_processing_cat3(df3))
The sample output for the above is:
cat_id content
0 3 male malay man nkda walking stick home ambulant ws void deck able walk bendemeer mall home bus stop away adli stays daughter family husband none image image image order cancellation note ct brain duplicate image
1 3 yo chinese man nkda phx hypertension hyperlipidemia benign hyperplasia open cholecystectomy gallbladder empyema distal gastrectomy pud penetrating aortic
Please help me check and improve the code above. Thank you!
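One possible tidy-up, since the request is to improve the function: compute the word frequencies once and drop the 10 most common and the 10 rarest words in a single pass, working on a copy so the original frame is left untouched. This is a sketch, roughly equivalent to the two-step version above:
import pandas as pd

def remove_common_and_rare(df3, n=10):
    # count every word across the whole 'content' column once
    counts = pd.Series(' '.join(df3['content']).split()).value_counts()
    to_drop = set(counts.index[:n]) | set(counts.index[-n:])
    out = df3.copy()
    out['content'] = out['content'].apply(
        lambda s: ' '.join(w for w in s.split() if w not in to_drop))
    return out

print(remove_common_and_rare(df3))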