I am trying to import Twitter data saved as a text file and use a keyword function to add columns that flag whether each tweet contains a given term.
I have used this code in an IPython 3 notebook:
#definition for collecting keyword
import re

def word_in_text(word, text):
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)
    if match:
        return True
    return False
The next cell has the following code:
#adding column
tweets['Trade'] = tweets['text'].apply(lambda tweet: word_in_text('Trade', tweet))
The error I get is as follows:
AttributeError                            Traceback (most recent call last)
<ipython-input-35-b172c4e07d29> in <module>()
      1 #adding column
----> 2 tweets['Trade'] = tweets['text'].apply(lambda tweet: word_in_text('Trade', tweet))

/usr/lib/python3/dist-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   2292             else:
   2293                 values = self.asobject
-> 2294             mapped = lib.map_infer(values, f, convert=convert_dtype)
   2295
   2296         if len(mapped) and isinstance(mapped[0], Series):

pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:66124)()

<ipython-input-35-b172c4e07d29> in <lambda>(tweet)
      1 #adding column
----> 2 tweets['Trade'] = tweets['text'].apply(lambda tweet: word_in_text('Trade', tweet))

<ipython-input-34-daa2f94a8fec> in word_in_text(word, text)
      2 def word_in_text(word, text):
      3     word = word.lower()
----> 4     text = text.lower()
      5     match = re.search(word, text)
      6     if match:

AttributeError: 'float' object has no attribute 'lower'
Update: I was able to reproduce your error. The field text might be missing in some of your tweets.
from pandas.io.json import json_normalize

tweet_data = [
    {'text': "let's trade!", 'lang': 'en', 'place': {'country': 'uk'}, 'created_at': 'now', 'coordinates': 'x,y', 'user': {'location': 'here'}},
    {'lang': 'en', 'place': {'country': 'uk'}, 'created_at': 'now', 'coordinates': 'z,w', 'user': {'location': 'there'}}
]
tweets = json_normalize(tweet_data)[["text", "lang", "place.country", "created_at", "coordinates", "user.location"]]
I get the error with:
tweets['Trade'] = tweets['text'].apply(lambda tweet: word_in_text('Trade', tweet))
Output:
>> AttributeError: 'float' object has no attribute 'lower'
If every entry in tweet_data has the 'text' key, I don't get the error, so making sure the field is always present would be one option. Another option would be to ignore NaN cases in your lambda.
tweets['Trade'] = tweets['text'].apply(lambda tweet: word_in_text('Trade', tweet) if type(tweet) == str else False)
This way you get the correct output:
>>> tweets
text lang place.country created_at coordinates user.location Trade
0 let's trade! en uk now x,y here True
1 NaN en uk now z,w there False
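An equivalent shortcut, if a case-insensitive substring check is all you need, is to let pandas handle the missing values itself. A small sketch using str.contains, whose na parameter controls what NaN rows map to:

tweets['Trade'] = tweets['text'].str.contains('trade', case=False, na=False)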
This is old content, left here for completeness.
Somehow you are passing a float instead of the text to your word_in_text method. I've tried a simple example of what you want to achieve:
import pandas as pd
import re

def word_in_text(word, text):
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)
    if match:
        return True
    return False

tweets = pd.DataFrame(['Hello, I like to trade', 'Trade', 'blah blah', 'Nice tradeoff here!'], columns=['text'])
tweets['Trade'] = tweets['text'].apply(lambda tweet: word_in_text('Trade', tweet))
The output is:
>>> tweets
                     text  Trade
0  Hello, I like to trade   True
1                   Trade   True
2               blah blah  False
3     Nice tradeoff here!   True
Also, for this sort of task, you can always use pandas' built-in str.contains method. This code will give you the same result as the example above:
tweets['Trade'] = tweets['text'].str.contains("Trade", case=False) == True
I guess you want exact-word matching, meaning "Nice tradeoff here!" should not be counted as containing the word. You can solve that as well:
tweets['Trade_[strict]'] = tweets['text'].str.contains(r"Trade\b.*", case=False) == True
The output being:
>>> tweets
                     text  Trade  Trade_[strict]
0  Hello, I like to trade   True            True
1                   Trade   True            True
2               blah blah  False           False
3     Nice tradeoff here!   True           False
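Just as a variant (a sketch, not required), a pattern bounded on both sides gives the same strict behaviour without the trailing .*:

tweets['Trade_[strict]'] = tweets['text'].str.contains(r"\btrade\b", case=False) == True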
Plus, I tried your json_normalize call with 'fake' data and it also worked. Make sure your text column does not contain floats instead of str.
from pandas.io.json import json_normalize

tweet_data = [
    {'text': '0', 'lang': 'en', 'place': {'country': 'uk'}, 'created_at': 'now', 'coordinates': 'x,y', 'user': {'location': 'here'}},
    {'text': 'Trade', 'lang': 'en', 'place': {'country': 'uk'}, 'created_at': 'now', 'coordinates': 'z,w', 'user': {'location': 'there'}}
]
tweets = json_normalize(tweet_data)[["text", "lang", "place.country", "created_at", "coordinates", "user.location"]]
And applying your method worked.
def word_in_text(word, text):
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)
    if match:
        return True
    return False

tweets['Trade'] = tweets['text'].apply(lambda tweet: word_in_text('Trade', tweet))
ERROR:
<ipython-input-34-daa2f94a8fec> in word_in_text(word, text)
      2 def word_in_text(word, text):
      3     word = word.lower()
----> 4     text = text.lower()
      5     match = re.search(word, text)
      6     if match:
You need to check whether the text parameter is of type str. So either check it with an if/else, as shown in the answer by @Guiem Bosch.
Or else simply convert the text parameter first:
text = str(text).lower()
Hope this helps.
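Putting that type check into the helper itself would look roughly like this (a sketch; anything that is not a string, such as a NaN float, is simply treated as a non-match):

import re

def word_in_text(word, text):
    if not isinstance(text, str):  # NaN values in the column arrive here as floats
        return False
    return re.search(word.lower(), text.lower()) is not None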
Related
I'm trying to remove random words from text in a column with nltk.
Here is my code:
import random

import pandas as pd
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

df = pd.read_excel("Output_Summarization/OUTPUT_ocr_OPENAIGOOD.xlsx", usecols=["Open_AI_Text"])

for index, row in df.iterrows():
    words = word_tokenize(row["Open_AI_Text"])
    word_to_remove = random.choice(words)
    new_text = row["Open_AI_Text"].replace(word_to_remove, "")
    df.at[index, "Open_AI_Text"] = new_text

df.to_excel("Texte_Trou.xlsx", index=False)
Next, I get this error:
TypeError                                 Traceback (most recent call last)
<ipython-input-51-3cb2fde32407> in <module>
     11 for index, row in df.iterrows():
     12     # tokeniser le texte de la ligne en mots individuels
---> 13     words = word_tokenize(row["Open_AI_Text"])
     14
     15     # choisir un mot au hasard à enlever

/usr/local/lib/python3.6/site-packages/nltk/tokenize/__init__.py in word_tokenize(text, language, preserve_line)
    126     :type preserver_line: bool
    127     """
--> 128     sentences = [text] if preserve_line else sent_tokenize(text, language)
    129     return [token for sent in sentences
    130             for token in _treebank_word_tokenizer.tokenize(sent)]

/usr/local/lib/python3.6/site-packages/nltk/tokenize/__init__.py in sent_tokenize(text, language)
     93     """
     94     tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
---> 95     return tokenizer.tokenize(text)
     96
     97 # Standard word tokenizer.

/usr/local/lib/python3.6/site-packages/nltk/tokenize/punkt.py in tokenize(self, text, realign_boundaries)
   1235         Given a text, returns a list of the sentences in that text.
   1236         """
-> 1237         return list(self.sentences_from_text(text, realign_boundaries))
   1238
   1239     def debug_decisions(self, text):

/usr/local/lib/python3.6/site-packages/nltk/tokenize/punkt.py in sentences_from_text(self, text, realign_boundaries)
   1283         follows the period.
   1284         """
-> 1285         return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
   1286
   1287     def _slices_from_text(self, text):

/usr/local/lib/python3.6/site-packages/nltk/tokenize/punkt.py in span_tokenize(self, text, realign_boundaries)
   1274         if realign_boundaries:
   1275             slices = self._realign_boundaries(text, slices)
-> 1276         return [(sl.start, sl.stop) for sl in slices]
   1277
   1278     def sentences_from_text(self, text, realign_boundaries=True):

/usr/local/lib/python3.6/site-packages/nltk/tokenize/punkt.py in <listcomp>(.0)
   1274         if realign_boundaries:
   1275             slices = self._realign_boundaries(text, slices)
-> 1276         return [(sl.start, sl.stop) for sl in slices]
   1277
   1278     def sentences_from_text(self, text, realign_boundaries=True):

/usr/local/lib/python3.6/site-packages/nltk/tokenize/punkt.py in _realign_boundaries(self, text, slices)
   1314         """
   1315         realign = 0
-> 1316         for sl1, sl2 in _pair_iter(slices):
   1317             sl1 = slice(sl1.start + realign, sl1.stop)
   1318             if not sl2:

/usr/local/lib/python3.6/site-packages/nltk/tokenize/punkt.py in _pair_iter(it)
    310     """
    311     it = iter(it)
--> 312     prev = next(it)
    313     for el in it:
    314         yield (prev, el)

/usr/local/lib/python3.6/site-packages/nltk/tokenize/punkt.py in _slices_from_text(self, text)
   1287     def _slices_from_text(self, text):
   1288         last_break = 0
-> 1289         for match in self._lang_vars.period_context_re().finditer(text):
   1290             context = match.group() + match.group('after_tok')
   1291             if self.text_contains_sentbreak(context):

TypeError: expected string or bytes-like object
I tried replacing my variable with a list, but that didn't work. How can I solve this issue?
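For reference, the usual cause of this TypeError is a row whose Open_AI_Text cell is not a string (for example a NaN read from the spreadsheet), which word_tokenize cannot handle. A minimal sketch of a guard, assuming such rows should simply be skipped:

import random
import pandas as pd
from nltk.tokenize import word_tokenize

df = pd.read_excel("Output_Summarization/OUTPUT_ocr_OPENAIGOOD.xlsx", usecols=["Open_AI_Text"])

for index, row in df.iterrows():
    text = row["Open_AI_Text"]
    if not isinstance(text, str) or not text.strip():
        continue  # skip NaN / empty cells instead of passing them to word_tokenize
    words = word_tokenize(text)
    word_to_remove = random.choice(words)
    df.at[index, "Open_AI_Text"] = text.replace(word_to_remove, "")

df.to_excel("Texte_Trou.xlsx", index=False)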
Using cosine similarity, I am trying to do a semantic comparison between words. I have posted the code below for reference. In the code I have added the stopwords, which are the words I don't want to be found during the search. I have opened the text file against which the reference words (also given below) should be compared. I am also adding a limit of three characters to the search, meaning any word shorter than three characters is treated as a stop word. When I run the code it just gives me "Process finished with exit code 0" and I can't get any output from it. I would really appreciate some help. Thank you in advance.
import math
import re

stopwords = set(["is", "a", "about", "above", "above", "across", "after", "afterwards", "again", "against", "all", "almost",
                 "alonll", "with", "within", "without", "would", "yet", "you", "your",
                 "yours", "yourself", "yourselves", "the"])

with open("ref.txt", "r") as f:
    lines = f.readlines()

def build_frequency_vector(content: str) -> dict[str, int]:
    vector = {}
    word_seq = re.split("[ ,;.!?]+", content)
    for words in word_seq:
        if words not in stopwords and len(words) >= 3:
            words = words.lower()
            if words in vector:
                vector[words] = vector[words] + 1
            else:
                vector[words] = 1
    return vector

refWords = ['spain', 'anchovy',
            'france', 'internet', 'china', 'mexico', 'fish', 'industry', 'agriculture', 'fishery', 'tuna', 'transport',
            'italy', 'web', 'communication', 'labour', 'fish', 'cod']

refWordsDict = {}
for refWord in refWords:
    refWordsDict[refWord] = {}
    for line in lines:
        line = line.lower()
        temp = build_frequency_vector(line)
        if refWord not in temp:
            continue
        for word in temp:
            if word not in stopwords and len(word) >= 3 and word != refWord:
                refWordsDict[refWord][word] = refWordsDict[refWord].get(word, 0) + temp[word]

def product(v1: dict[str, int], v2: dict[str, int]) -> float:
    sp = 0.0
    for word in v1:
        sp += v1[word] * v2.get(word, 0)
    return sp

def cosineSimilarity(s1: str, s2: str) -> float:
    d1 = build_frequency_vector(word1)
    d2 = build_frequency_vector(word2)
    return product(d1, d2) / (math.sqrt(product(d1, d1) * product(d2, d2)))

bests = {}
for word1 in refWords:
    bestSimilarity = 0
    for word2 in refWords:
        if word1 != word2:
            similarity: float = cosineSimilarity(refWordsDict[word1], refWordsDict[word2])
            if similarity > bestSimilarity:
                bestSimilarity = similarity
                bests[word1] = (word2, bestSimilarity)

for item in bests:
    print(item, "->", bests[item])
I am very new to Python and have not been able to find a solution.
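For what it's worth, one thing that stands out is that cosineSimilarity never uses its parameters: it rebuilds frequency vectors from the loop variables word1 and word2 (single keywords), so every similarity evaluates to 0, bests stays empty, and the final loop prints nothing, which would explain the "exit code 0, no output" behaviour. A minimal sketch of the function as presumably intended, assuming the per-keyword dictionaries passed in are already the vectors to compare (product is the helper defined above):

def cosineSimilarity(v1: dict[str, int], v2: dict[str, int]) -> float:
    denominator = math.sqrt(product(v1, v1) * product(v2, v2))
    if denominator == 0:
        return 0.0  # empty vector: avoid ZeroDivisionError
    return product(v1, v2) / denominator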
I have a question about a problem with cleaning text for my NLP model. I don't know why I get this error: AttributeError: 'list' object has no attribute 'split'.
Below is my df['Text'].sample(5):
26278 [RT, #davidsirota:, subset, people, website, t...
63243 [RT, #jmartNYT:, The, presses, Team, Biden, As...
61059 [RT, #caitoz:, BREAKING:, Biden, nominate, "Li...
43160 [RT, #K_JeanPierre:, I, profoundly, honored, P...
Name: Text, dtype: object
Below is my code:
def tokenizer(text):
    tokenized = [w for w in text.split() if w not in stopset]
    return tokenized

df['Text'] = df['Text'].apply(tokenizer)

def remove_emoji(string):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

def remove_nonwords(Text):
    if re.findall('\d', Text):
        return ''
    else:
        return Text

def clean_text(Text):
    text = ' '.join([i for i in Text.split() if i not in stopset])
    text = ' '.join([stem.stem(word) for word in Text.split()])
    return Text

df['text2'] = df['Text'].apply(clean_text)
Could someone help me?
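For context, the error most likely comes from the order of operations: df['Text'] = df['Text'].apply(tokenizer) replaces every row with a list of tokens, so when clean_text later calls Text.split() it receives a list rather than a string. A rough sketch of two ways around it (stopset and stem are the objects from the original code):

# run the string-based cleaning first, while the rows are still plain str
df['text2'] = df['Text'].apply(clean_text)
df['Text'] = df['Text'].apply(tokenizer)

# ...or let clean_text cope with rows that have already been tokenised
def clean_text(text):
    if isinstance(text, list):
        text = ' '.join(text)  # re-join token lists into a plain string
    words = [w for w in text.split() if w not in stopset]
    return ' '.join(stem.stem(w) for w in words)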
I have an issue with my function. The design is to aggregate word tokens into dictionaries.
This is the code:
def preprocess(texts):
    case = truecase.get_true_case(texts)
    doc = nlp(case)
    return doc

def summarize_texts(texts):
    doc = preprocess(texts)
    actions = {}
    entities = {}
    for token in doc:
        if token.pos_ == "VERB":
            actions[token.lemma_] = actions.get(token.text, 0) + 1
    for token in doc.ents:
        entities[token.label_] = [token.text]
    return {
        'actions': actions,
        'entities': entities
    }
The problem I am having is the function works as expected for a single input:
summarize_texts("Play something by Billie Holiday")
{'actions': {'play': 1}, 'entities': {'PERSON': ['Billie']}}
but the objective is to be able to pass a list (or a CSV file) through it and have it aggregate everything.
When I try:
docs = [
    "Play something by Billie Holiday",
    "Set a timer for five minutes",
    "Play it again, Sam"
]
summarize_texts(docs)
I get the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-18-200347d5cac5> in <module>()
4 "Play it again, Sam"
5 ]
----> 6 summarize_texts(docs)
5 frames
<ipython-input-16-08c879553d6e> in summarize_texts(texts)
1 def summarize_texts(texts):
----> 2 doc = preprocess(texts)
3 actions = {}
4 entities = {}
5 for token in doc:
<ipython-input-12-fccf767830b1> in preprocess(texts)
1 def preprocess (texts):
----> 2 case = truecase.get_true_case(texts)
3 doc = nlp(case)
4 return doc
/usr/local/lib/python3.6/dist-packages/truecase/__init__.py in get_true_case(sentence, out_of_vocabulary_token_option)
14 return get_truecaser().get_true_case(
15 sentence,
---> 16 out_of_vocabulary_token_option=out_of_vocabulary_token_option)
/usr/local/lib/python3.6/dist-packages/truecase/TrueCaser.py in get_true_case(self, sentence, out_of_vocabulary_token_option)
97 as-is: Returns OOV tokens as is
98 """
---> 99 tokens = self.tknzr.tokenize(sentence)
100
101 tokens_true_case = []
/usr/local/lib/python3.6/dist-packages/nltk/tokenize/casual.py in tokenize(self, text)
293 """
294 # Fix HTML character entities:
--> 295 text = _replace_html_entities(text)
296 # Remove username handles
297 if self.strip_handles:
/usr/local/lib/python3.6/dist-packages/nltk/tokenize/casual.py in _replace_html_entities(text, keep, remove_illegal, encoding)
257 return "" if remove_illegal else match.group(0)
258
--> 259 return ENT_RE.sub(_convert_entity, _str_to_unicode(text, encoding))
260
261
TypeError: expected string or bytes-like object
I expect to get the output:
{'actions': {'play': 2, 'set': 1}, 'entities': {'PERSON': ['Billie', 'Sam'], 'TIME': ['five minutes']}}
Not sure what's wrong with my function syntax.
Looks like your problem is that truecase.get_true_case(texts) expects a string/bytes-like argument, and you're passing it a list of strings.
You'll need to iterate through texts and preprocess each item in the list separately:
def preprocess(text):
    case = truecase.get_true_case(text)
    doc = nlp(case)
    return doc

def summarize_texts(texts):
    actions = {}
    entities = {}
    for text in texts:
        doc = preprocess(text)
        for token in doc:
            if token.pos_ == "VERB":
                actions[token.lemma_] = actions.get(token.text, 0) + 1
        for token in doc.ents:
            entities[token.label_] = [token.text]
    return {
        'actions': actions,
        'entities': entities
    }
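One caveat, since the expected output accumulates counts and entity lists across texts: the loop above still looks actions up by token.text and overwrites each entity list on every hit. A sketch of the same function with those two updates adjusted (preprocess as defined above; this is an assumption about the intended aggregation, not part of the original answer):

def summarize_texts(texts):
    actions = {}
    entities = {}
    for text in texts:
        doc = preprocess(text)
        for token in doc:
            if token.pos_ == "VERB":
                # count by lemma so "Play" and "play" share one bucket
                actions[token.lemma_] = actions.get(token.lemma_, 0) + 1
        for ent in doc.ents:
            # append instead of overwrite so entities accumulate across texts
            entities.setdefault(ent.label_, []).append(ent.text)
    return {'actions': actions, 'entities': entities}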
Try using a for loop over texts before calling preprocess:
for i in texts:
    doc = preprocess(i)
I'm using Python 3.2.2 on Windows 7. This is part of my code; it reads from an Excel file. But when I run the code it just prints from 0 to 10 and then gives "TypeError: 'float' object is not iterable".
Thanks for any help!
pages = [i for i in range(0, 19634)]
for page in pages:
    x = df.loc[page, ["id"]]
    x = x.values
    x = str(x)[2:-2]
    text = df.loc[page, ["rev"]]

    def remove_punct(text):
        text = ''.join([ch.lower() for ch in text if ch not in exclude])
        tokens = re.split('\W+', text)
        tex = " ".join([wn.lemmatize(word) for word in tokens if word not in stopword])
        removetable = str.maketrans('', '', '1234567890')
        out_list = [s.translate(removetable) for s in tokens1]
        str_list = list(filter(None, out_list))
        line = [i for i in str_list if len(i) > 1]
        return line

    s = df.loc[page, ["rev"]].apply(lambda x: remove_punct(x))

    with open('FileNamex.csv', 'a', encoding="utf-8") as f:
        s.to_csv(f, header=False)

    print(s)
This is the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-54-c71f66bdaca6> in <module>()
33 return line
34
---> 35 s=df.loc[page,["rev"]].apply(lambda x:remove_punct(x))
36
37 with open('FileNamex.csv', 'a', encoding="utf-8") as f:
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
3190 else:
3191 values = self.astype(object).values
-> 3192 mapped = lib.map_infer(values, f, convert=convert_dtype)
3193
3194 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-54-c71f66bdaca6> in <lambda>(x)
33 return line
34
---> 35 s=df.loc[page,["rev"]].apply(lambda x:remove_punct(x))
36
37 with open('FileNamex.csv', 'a', encoding="utf-8") as f:
<ipython-input-54-c71f66bdaca6> in remove_punct(text)
22
23 def remove_punct(text):
---> 24 text=''.join([ch.lower() for ch in text if ch not in exclude])
25 tokens = re.split('\W+', text)
26 tex = " ".join([wn.lemmatize(word) for word in tokens if word not in stopword])
TypeError: 'float' object is not iterable
Thanks for any help!
You are trying to apply a function that iterates over text (whatever it is), and you are calling it with a float value.
Floats cannot be iterated. You could use text = str(text) to convert any input to text first, but looking at your code I hesitate to say that would make sense.
You can check if you are handling a float like this:
def remove_punct(text):
    if isinstance(text, float):
        pass    # do something sensible with floats here
        return  # something sensible
    text = ''.join([ch.lower() for ch in text if ch not in exclude])
    tokens = re.split('\W+', text)
    tex = " ".join([wn.lemmatize(word) for word in tokens if word not in stopword])
    removetable = str.maketrans('', '', '1234567890')
    out_list = [s.translate(removetable) for s in tokens1]
    str_list = list(filter(None, out_list))
    line = [i for i in str_list if len(i) > 1]
    return line
You can either tackle floats via isinstance, or take inspiration from "In Python, how do I determine if an object is iterable?" on how to detect whether something is iterable at all. Either way, you need to handle non-iterables differently.
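As an illustration of the isinstance branch, here is a rough sketch that simply maps a float cell (for example a NaN read from the Excel file) to an empty token list instead of iterating it (exclude, wn and stopword are the objects from the original code; it also iterates tokens rather than the undefined tokens1):

def remove_punct(text):
    if isinstance(text, float):
        return []  # NaN / numeric cell: nothing to tokenise
    text = ''.join([ch.lower() for ch in text if ch not in exclude])
    tokens = re.split(r'\W+', text)
    tokens = [wn.lemmatize(word) for word in tokens if word not in stopword]
    removetable = str.maketrans('', '', '1234567890')
    out_list = [t.translate(removetable) for t in tokens]
    str_list = list(filter(None, out_list))
    return [w for w in str_list if len(w) > 1]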