I am trying to make my code run on a Raspberry Pi 4 and have been stuck on this error for hours. This code segment throws the error on the Pi but runs perfectly on Windows within the same project:
def create_lem_texts(data):  # as a list
    def sent_to_words(sentences):
        for sentence in sentences:
            yield (gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes punctuations

    data_words = list(sent_to_words(data))
    bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)  # higher threshold fewer phrases.
    bigram_mod = gensim.models.phrases.Phraser(bigram)

    def remove_stopwords(texts):
        return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

    def make_bigrams(texts):
        return [bigram_mod[doc] for doc in texts]

    def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
        """https://spacy.io/api/annotation"""
        texts_out = []
        print(os.getcwd())
        for sent in texts:
            doc = nlp(" ".join(sent))
            texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
        return texts_out

    data_words_nostops = remove_stopwords(data_words)
    data_words_bigrams = make_bigrams(data_words_nostops)
    print(os.getcwd())
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    # Do lemmatization keeping only noun, adj, vb, adv
    data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
    return data_lemmatized
This code is in turn called by this function:
def assign_topics_tweet(tweets):
    owd = os.getcwd()
    print(owd)
    os.chdir('/home/pi/Documents/pycharm_project_twitter/topic_model/')
    print(os.getcwd())
    lda = LdaModel.load("LDA26")
    print(lda)
    id2word = Dictionary.load('Id2Word')
    print(id2word)
    os.chdir(owd)
    data = create_lem_texts(tweets)
    corpus = [id2word.doc2bow(text) for text in data]
    topics = []
    for tweet in corpus:
        topics_dist = lda.get_document_topics(tweet)
        topics.append(topics_dist)
    return topics
And here is the error message
Traceback (most recent call last):
File "/home/pi/Documents/pycharm_project_twitter/Twitter_Import.py", line 193, in <module>
main()
File "/home/pi/Documents/pycharm_project_twitter/Twitter_Import.py", line 169, in main
topics = assign_topics_tweet(data)
File "/home/pi/Documents/pycharm_project_twitter/TopicModel.py", line 238, in assign_topics_tweet
data = create_lem_texts(tweets)
File "/home/pi/Documents/pycharm_project_twitter/TopicModel.py", line 76, in create_lem_texts
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
File "/home/pi/Documents/pycharm_project_twitter/TopicModel.py", line 67, in lemmatization
texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
File "/home/pi/Documents/pycharm_project_twitter/TopicModel.py", line 67, in <listcomp>
texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
File "token.pyx", line 871, in spacy.tokens.token.Token.lemma_.__get__
File "strings.pyx", line 136, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '18446744073541552667'. This usually refers to an issue with the `Vocab` or `StringStore`."
Process finished with exit code 1
I tried reinstalling spaCy and the en model and running the script directly on the Pi; the spaCy versions are the same on my Windows machine and on the Pi. There is basically no information online about this error.
After three days of testing, the problem was solved by simply installing an older version of spaCy, 2.0.1.
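For anyone hitting the same E018 error, it usually points to a mismatch between the installed spaCy library and the serialized model/vocab it is loading. A quick sketch for comparing the two (the en_core_web_sm name is taken from the question; adapt it to whatever model you load):
import spacy

# Print the installed spaCy version and the version range the loaded model
# was built for, to spot a library/model mismatch.
print("spaCy:", spacy.__version__)
nlp = spacy.load('en_core_web_sm')
print("model:", nlp.meta.get('name'), nlp.meta.get('version'))
print("model expects spaCy:", nlp.meta.get('spacy_version'))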
I've been building a bot and it works exactly as intended, but only for one Tweet. Then it waits 60 seconds and, if it doesn't find a new Tweet to reply to (since it's configured to reply to the most recent Tweet), it throws an error. The error is 400, as in "400: Bad Authentication Data", but I don't think that's really the issue, since the bot posts on Twitter once without any problems; I do think it's possibly some kind of Bad Request error.
Whenever it crashes, I can just run "python (botname).py" from my command line and it works once if there is a new Tweet, but then it crashes again.
I want the bot to run properly by itself, so I would really appreciate some help!
This is the code in my file:
#!/usr/bin/env python
# tweepy-bots/bots/autoreply.py
import tweepy
import logging
from config import create_api
import time
import re
from googlesearch import search
import sys
import io

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

def check_mentions(api, since_id):
    logger.info("Collecting info")
    new_since_id = since_id
    for tweet in tweepy.Cursor(api.mentions_timeline,
                               since_id=since_id).items():
        new_since_id = max(tweet.id, new_since_id)
        if tweet.in_reply_to_status_id is not None:
            in_reply_to_status_id = tweet.id
            status_id = tweet.in_reply_to_status_id
            tweet_u = api.get_status(status_id, tweet_mode='extended')
            logger.info(f"Answering to {tweet.user.name}")
            # remove words between 1 and 3
            shortword = re.compile(r'\W*\b\w{1,3}\b')
            keywords_search = str(shortword.sub('', tweet_u.full_text))
            print(keywords_search)
            if keywords_search is not None:
                mystring = search(keywords_search, num_results=500)
            else:
                mystring = search("error", num_results=1)
            print(mystring)
            output_info = []
            for word in mystring:
                if "harvard" in word or "cornell" in word or "researchgate" in word or "yale" in word or "rutgers" in word or "caltech" in word or "upenn" in word or "princeton" in word or "columbia" in word or "journal" in word or "mit" in word or "stanford" in word or "gov" in word or "pubmed" in word or "theguardian" in word or "aaas" in word or "bbc" in word or "rice" in word or "ams" in word or "sciencemag" in word or "research" in word or "article" in word or "publication" in word or "nationalgeographic" in word or "ngenes" in word:
                    output_info.append(word)
            infostringa = ' '.join(output_info)
            if output_info:
                output_info4 = output_info[:5]
                infostring = ' '.join(output_info4)
                print(infostring)
                status = "Hi there! This may be what you're looking for " + infostring
                len(status) <= 280
                api.update_status(status, in_reply_to_status_id=tweet.id, auto_populate_reply_metadata=True)
            else:
                status = "Sorry, I cannot help you with that :(. You might want to try again with a distinctly sourced Tweet"
                api.update_status(status, in_reply_to_status_id=tweet.id, auto_populate_reply_metadata=True)
            print(status)
            return new_since_id
    return check_mentions

def main():
    api = create_api()
    since_id = 1  # the last mention you have.
    while True:
        since_id = check_mentions(api, since_id)
        logger.info("Waiting...")
        time.sleep(60)

main()
My config module:
# tweepy-bots/bots/config.py
import tweepy
import logging
import os

logger = logging.getLogger()

def create_api():
    consumer_key = os.getenv("CONSUMER_KEY")
    consumer_secret = os.getenv("CONSUMER_SECRET")
    access_token = os.getenv("ACCESS_TOKEN")
    access_token_secret = os.getenv("ACCESS_TOKEN_SECRET")

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True,
                     wait_on_rate_limit_notify=True)
    try:
        api.verify_credentials()
    except Exception as e:
        logger.error("Error creating API", exc_info=True)
        raise e
    logger.info("API created")
    return api
The error raised:
Traceback (most recent call last):
File "C:\Users\maria\OneDrive\Documentos\Lara\Python\Factualbot\botstring32.py", line 92, in <module>
main()
File "C:\Users\maria\OneDrive\Documentos\Lara\Python\Factualbot\botstring32.py", line 87, in main
since_id = check_mentions(api, since_id)
File "C:\Users\maria\OneDrive\Documentos\Lara\Python\Factualbot\botstring32.py", line 26, in check_mentions
for tweet in tweepy.Cursor(api.mentions_timeline, since_id=since_id).items():
File "C:\Users\maria\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\tweepy\cursor.py", line 51, in __next__
return self.next()
File "C:\Users\maria\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\tweepy\cursor.py", line 243, in next
self.current_page = self.page_iterator.next()
File "C:\Users\maria\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\tweepy\cursor.py", line 132, in next
data = self.method(max_id=self.max_id, parser=RawParser(), *self.args, **self.kwargs)
File "C:\Users\maria\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\tweepy\binder.py", line 253, in _call
return method.execute()
File "C:\Users\maria\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\tweepy\binder.py", line 234, in execute
raise TweepError(error_msg, resp, api_code=api_error_code)
tweepy.error.TweepError: Twitter error response: status code = 400
Thank you very much!
A 400 HTTP error status code usually means Bad Request, which is likely the case here. When there is no new Tweet to reply to, the for loop isn't entered, and check_mentions, the function itself, is returned. You then assign it to since_id when it's returned and use it as an ID the next time check_mentions is called. This probably ends up passing something like "<function check_mentions at 0x0000028E6C729040>" to the API as since_id.
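A minimal sketch of one possible fix, assuming the intent is for check_mentions to always return the latest mention ID: move the return out of the loop and drop the "return check_mentions" line entirely (the reply-building part stays as in the original file):
def check_mentions(api, since_id):
    logger.info("Collecting info")
    new_since_id = since_id
    for tweet in tweepy.Cursor(api.mentions_timeline, since_id=since_id).items():
        new_since_id = max(tweet.id, new_since_id)
        if tweet.in_reply_to_status_id is not None:
            # ... build and post the reply exactly as in the original code ...
            pass
    # Always return the integer ID, never the function object, so the next
    # call receives a valid since_id even when no new mention was found.
    return new_since_id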
import re
import spacy
from nltk.corpus import stopwords
import pdfplumber
def extract_All_data(path):
    text = ""
    try:
        with pdfplumber.open(path) as pdf:
            for i in pdf.pages:
                text += i.extract_text()
            return text
    except:
        return None

resume_text = extract_All_data(r"E:\AllResumesPdfs\37202883_Mumbai_6.pdf")
#resume_text = text.lower()

# load pre-trained model
nlp = spacy.load('en_core_web_lg')
# Grab all general stop words
STOPWORDS = set(stopwords.words('english'))
# Education Degrees
EDUCATION = [
    'BE', 'B.E.', 'B.E', 'BS', 'B.S',
    'ME', 'M.E', 'M.E.', 'MS', 'M.S', 'M.C.A.',
    'BTECH', 'B.TECH', 'M.TECH', 'MTECH',
    'SSC', 'HSC', 'CBSE', 'ICSE', 'X', 'XII'
]

def extract_education(resume_text):
    nlp_text = nlp(resume_text)
    # Sentence Tokenizer
    nlp_text = [sent.string.strip() for sent in nlp_text.sents]
    edu = {}
    # Extract education degree
    for index, text in enumerate(nlp_text):
        for tex in text.split():
            # Replace all special symbols
            tex = re.sub(r'[?|$|.|!|,]', r'', tex)
            if tex.upper() in EDUCATION and tex not in STOPWORDS:
                edu[tex] = text + nlp_text[index + 1]
    # Extract year
    education = []
    for key in edu.keys():
        year = re.search(re.compile(r'((20|19)(\d{2}))'), edu[key])
        if year:
            education.append((key, ''.join(year[0])))
        else:
            education.append(key)
    return education

Education = extract_education(resume_text)
print(Education)
I have downloaded the large model, but it still shows an error for string.
Please help me solve this issue.
Thanks in advance.
C:\Python37\python.exe E:/JobScan/Sample1.py
2021-05-22 09:33:10.781450: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2021-05-22 09:33:10.781933: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
File "E:/JobScan/Sample1.py", line 58, in <module>
Education= extract_education(resume_text)
File "E:/JobScan/Sample1.py", line 37, in extract_education
nlp_text = [sent.string.strip() for sent in nlp_text.sents]
File "E:/JobScan/Sample1.py", line 37, in <listcomp>
nlp_text = [sent.string.strip() for sent in nlp_text.sents]
AttributeError: 'spacy.tokens.span.Span' object has no attribute 'string'
Process finished with exit code 1
This is the console error.
As Tim Roberts said, you want to use the text attribute.
# change "string" to "text"
nlp_text = [sent.text.strip() for sent in nlp_text.sents]
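As a quick illustrative check (a sketch, not from the original post, using the same en_core_web_lg model as the question), the pattern works on any parsed document:
import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp("First sentence here. And a second one.")
# Span.text returns the sentence string; .strip() removes surrounding whitespace
sentences = [sent.text.strip() for sent in doc.sents]
print(sentences)  # ['First sentence here.', 'And a second one.']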
I want to train NLTK with the twitter_samples corpus, but I get an error when I try to load the samples by category.
First I tried it like this:
from nltk.corpus import twitter_samples

documents = [(list(twitter_samples.strings(fileid)), category)
             for category in twitter_samples.categories()
             for fileid in twitter_samples.fileids(category)]
but it gave me this error:
Traceback (most recent call last):
File "C:/Users/neptun/PycharmProjects/Thesis/First_sentimental.py", line 6, in <module>
for category in twitter_samples.categories()
File "C:\Users\neptun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nltk\corpus\util.py", line 119, in __getattr__
return getattr(self, attr)
AttributeError: 'TwitterCorpusReader' object has no attribute 'categories'
I don't know what attributes are available, so I can't work out how to get my list of tweets labelled with positive and negative sentiment.
If you inspect twitter_samples.fileids(), you'll see that there are separate positive and negative files:
>>> twitter_samples.fileids()
['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']
So to get the tweets classified as positive or negative, just select the corresponding file. It's not the usual way the nltk handles categorized corpora, but there you have it.
documents = ([(t, "pos") for t in twitter_samples.strings("positive_tweets.json")] +
[(t, "neg") for t in twitter_samples.strings("negative_tweets.json")])
This will get you a dataset of 10000 tweets. The third file contains another 20000, which apparently are not categorized.
You can then strip smilies from the categorized tweets and tokenize them, for example:
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import twitter_samples

categorized_tweets = ([(t, "pos") for t in twitter_samples.strings("positive_tweets.json")] +
                      [(t, "neg") for t in twitter_samples.strings("negative_tweets.json")])

smilies = [':-)', ':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
           ':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
           '=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
           'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)',
           '<3', ':L', ':-/', '>:/', ':S', '>:[', ':#', ':-(', ':[', ':-||', '=L', ':<',
           ':-[', ':-<', '=\\', '=/', '>:(', ':(', '>.<', ":'-(", ":'(", ':\\', ':-c',
           ':c', ':{', '>:\\', ';(', '(', ')', 'via']

categorized_tweets_tokens = []
for tweet in categorized_tweets:
    text = tweet[0]
    for smiley in smilies:
        text = re.sub(re.escape(smiley), '', text)
    categorized_tweets_tokens.append((word_tokenize(text), tweet[1]))
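From there, one possible way to train a classifier on the labelled tweets is a plain bag-of-words NaiveBayes setup. This is only a sketch: the bag_of_words helper and the 1000-tweet test split are illustrative choices, not from the original post.
import random
import nltk

random.shuffle(categorized_tweets_tokens)

def bag_of_words(tokens):
    # every token present in the tweet becomes a boolean feature
    return {token: True for token in tokens}

featuresets = [(bag_of_words(tokens), label) for tokens, label in categorized_tweets_tokens]
train_set, test_set = featuresets[1000:], featuresets[:1000]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))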
As part of our schoolwork we have been given this code:
>>> IN = re.compile(r'.*\bin\b(?!\b.+ing)')
>>> for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
... for rel in nltk.sem.extract_rels('ORG', 'LOC', doc,
... corpus='ieer', pattern = IN):
... print(nltk.sem.rtuple(rel))
We are asked to try it out with some sentences of our own to see the output, so for this I decided to define a function:
def extract(sentence):
    import re
    import nltk
    IN = re.compile(r'.*\bin\b(?!\b.+ing)')
    for rel in nltk.sem.extract_rels('ORG', 'LOC', sentence, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))
When I try and run this code:
>>> from extract import extract
>>> extract("The Whitehouse in Washington")
I get the following error:
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
extract("The Whitehouse in Washington")
File "C:/Python34/My Scripts\extract.py", line 6, in extract
for rel in nltk.sem.extract_rels('ORG', 'LOC', sentence, corpus='ieer', pattern = IN):
File "C:\Python34\lib\site-packages\nltk\sem\relextract.py", line 216, in extract_rels
pairs = tree2semi_rel(doc.text) + tree2semi_rel(doc.headline)
AttributeError: 'str' object has no attribute 'text'
Can anyone help me understand where I am going wrong in my function?
The correct output for the test sentence should be:
[ORG: 'Whitehouse'] 'in' [LOC: 'Washington']
If you look at the method definition of extract_rels, it expects the parsed document as the third argument, whereas here you are passing a plain sentence string. To overcome this error, you can do the following:
import re
import nltk

sentence = "The Whitehouse in Washington"
tokens = [nltk.word_tokenize(sentence)]
tagged_sentences = [nltk.pos_tag(token) for token in tokens]

class doc():
    pass

IN = re.compile(r'.*\bin\b(?!\b.+ing)')
doc.headline = ["test headline for sentence"]
for i, sent in enumerate(tagged_sentences):
    doc.text = nltk.ne_chunk(sent)
    for rel in nltk.sem.relextract.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))  # you can change this as needed
Try it out!
I am playing around with NLTK for an assignment on sentiment analysis. I am using Python 2.7 with NLTK 3.0 and NumPy 1.9.1.
This is the code:
__author__ = 'karan'

import nltk
import re
import sys

def main():
    print("Start");
    # getting the stop words
    stopWords = open("english.txt","r");
    stop_word = stopWords.read().split();
    AllStopWrd = []
    for wd in stop_word:
        AllStopWrd.append(wd);
    print("stop words-> ",AllStopWrd);
    # sample and also cleaning it
    tweet1= 'Love, my new toyí ½í¸í ½í¸#iPhone6. Its good https://twitter.com/Sandra_Ortega/status/513807261769424897/photo/1'
    print("old tweet-> ",tweet1)
    tweet1 = tweet1.lower()
    tweet1 = ' '.join(re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",tweet1).split())
    print(tweet1);
    tw = tweet1.split()
    print(tw)
    #tokenize
    sentences = nltk.word_tokenize(tweet1)
    print("tokenized ->", sentences)
    #remove stop words
    Otweet =[]
    for w in tw:
        if w not in AllStopWrd:
            Otweet.append(w);
    print("sans stop word-> ",Otweet)
    # get taggers for neg/pos/inc/dec/inv words
    taggers ={}
    negWords = open("neg.txt","r");
    neg_word = negWords.read().split();
    print("ned words-> ",neg_word)
    posWords = open("pos.txt","r");
    pos_word = posWords.read().split();
    print("pos words-> ",pos_word)
    incrWords = open("incr.txt","r");
    inc_word = incrWords.read().split();
    print("incr words-> ",inc_word)
    decrWords = open("decr.txt","r");
    dec_word = decrWords.read().split();
    print("dec wrds-> ",dec_word)
    invWords = open("inverse.txt","r");
    inv_word = invWords.read().split();
    print("inverse words-> ",inv_word)
    for nw in neg_word:
        taggers.update({nw:'negative'});
    for pw in pos_word:
        taggers.update({pw:'positive'});
    for iw in inc_word:
        taggers.update({iw:'inc'});
    for dw in dec_word:
        taggers.update({dw:'dec'});
    for ivw in inv_word:
        taggers.update({ivw:'inv'});
    print("tagger-> ",taggers)
    print(taggers.get('little'))
    # get parts of speech
    posTagger = [nltk.pos_tag(tw)]
    print("posTagger-> ",posTagger)

main();
This is the error that I am getting when running my code:
SyntaxError: Non-ASCII character '\xc3' in file C:/Users/karan/PycharmProjects/mainProject/sentiment.py on line 19, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
How do I fix this error?
I also tried the code using Python 3.4.2 with NLTK 3.0 and NumPy 1.9.1, but then I get this error:
Traceback (most recent call last):
File "C:/Users/karan/PycharmProjects/mainProject/sentiment.py", line 80, in <module>
main();
File "C:/Users/karan/PycharmProjects/mainProject/sentiment.py", line 72, in main
posTagger = [nltk.pos_tag(tw)]
File "C:\Python34\lib\site-packages\nltk\tag\__init__.py", line 100, in pos_tag
tagger = load(_POS_TAGGER)
File "C:\Python34\lib\site-packages\nltk\data.py", line 779, in load
resource_val = pickle.load(opened_resource)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)
Add the following to the top of your file: # coding=utf-8
If you go to the link in the error, you can see the reason why:
Defining the Encoding
Python will default to ASCII as standard encoding if no other encoding hints are given.
To define a source code encoding, a magic comment must be placed into the source files either as first or second line in the file, such as:
# coding=<encoding name>
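Applied to the file from the question, the top of sentiment.py would then start like this (the lines after the magic comment are simply the file's existing first lines):
# coding=utf-8
__author__ = 'karan'

import nltk
import re
import sys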