Pass elements of a list to a function - Python

I have a function that can create triples and relationships from text. However, when I create a list from a column that contains text and pass it through the function, it only processes the first item of the list. So I am wondering how the whole list can be processed within this function. Maybe a for loop would work?
The following code contains the list:
rez_dictionary = {'Decent Little Reader, Poor Tablet',
                  'Ok For What It Is',
                  'Too Heavy and Poor weld quality,',
                  'difficult mount',
                  'just got it installed'}
from transformers import pipeline
triplet_extractor = pipeline('text2text-generation', model='Babelscape/rebel-large', tokenizer='Babelscape/rebel-large')
# We need to use the tokenizer manually since we need special tokens.
extracted_text = triplet_extractor.tokenizer.batch_decode([triplet_extractor(rez_dictionary, return_tensors=True, return_text=False)[0]["generated_token_ids"]])
print(extracted_text[0])
If anyone has a suggestion, I am looking forward to it.
Would it also be possible to get the output adjusted to the following format:
# Function to parse the generated text and extract the triplets
def extract_triplets(text):
    triplets = []
    relation, subject, relation, object_ = '', '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
    return triplets
extracted_triplets = extract_triplets(extracted_text[0])
print(extracted_triplets)

You are discarding all but the first entry of rez_dictionary by indexing [0] inside the batch_decode call:
triplet_extractor(rez_dictionary, return_tensors=True, return_text=False)[0]["generated_token_ids"]
Use a list comprehension instead:
from transformers import pipeline
rez = ['Decent Little Reader, Poor Tablet',
       'Ok For What It Is',
       'Too Heavy and Poor weld quality,',
       'difficult mount',
       'just got it installed']
triplet_extractor = pipeline('text2text-generation', model='Babelscape/rebel-large', tokenizer='Babelscape/rebel-large')
model_output = triplet_extractor(rez, return_tensors=True, return_text=False)
extracted_text = triplet_extractor.tokenizer.batch_decode([x["generated_token_ids"] for x in model_output])
print("\n".join(extracted_text))
Output:
<s><triplet> Decent Little Reader <subj> Poor Tablet <obj> different from <triplet> Poor Tablet <subj> Decent Little Reader <obj> different from</s>
<s><triplet> Ok For What It Is <subj> film <obj> instance of</s>
<s><triplet> Too Heavy and Poor <subj> weld quality <obj> subclass of</s>
<s><triplet> difficult mount <subj> mount <obj> subclass of</s>
<s><triplet> 2008 Summer Olympics <subj> 2008 <obj> point in time</s>
Regarding the extension of the question: to run the function extract_triplets on each decoded string, a simple for loop is enough:
for text in extracted_text:
    print(extract_triplets(text))
Output:
[{'head': 'Decent Little Reader', 'type': 'different from', 'tail': 'Poor Tablet'}, {'head': 'Poor Tablet', 'type': 'different from', 'tail': 'Decent Little Reader'}]
[{'head': 'Ok For What It Is', 'type': 'instance of', 'tail': 'film'}]
[{'head': 'Too Heavy and Poor', 'type': 'subclass of', 'tail': 'weld quality'}]
[{'head': 'difficult mount', 'type': 'subclass of', 'tail': 'mount'}]
[{'head': '2008 Summer Olympics', 'type': 'point in time', 'tail': '2008'}]
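If you need this often, the pipeline call, decoding, and parsing can be wrapped into a single helper. A minimal sketch (the name extract_all_triplets is made up here, not part of transformers):
def extract_all_triplets(texts, extractor):
    # Run the extractor over all texts, decode, then parse each result
    model_output = extractor(texts, return_tensors=True, return_text=False)
    decoded = extractor.tokenizer.batch_decode(
        [x["generated_token_ids"] for x in model_output])
    return [extract_triplets(text) for text in decoded]

all_triplets = extract_all_triplets(rez, triplet_extractor)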

Related

I need to convert a doc string sentence into a list

The input is:
l1 = ['Passing much less urine', 'Bleeding from any body part', 'Feeling extremely lethargic/weak', 'Excessive sleepiness/restlessness', 'Altered mental status', 'Seizure/fits', 'Breathlessness', 'Blood in sputum', 'Chest pain', 'Sound/noise in breathing', 'Drooling of saliva', 'Difficulty in opening mouth']
k = []
for n in range(0, len(l1)):
    e = l1[n]
    doc = nlp(e)
    for token in doc:
        if token.lemma_ != "-PRON-":
            temp = token.lemma_.lower().strip()
        else:
            temp = token.lower_
        k.append(temp)
    cleaned_tokens = []
    t = []
    d = []
    for token in k:
        li = []
        if token not in stopwords and token not in punct:
            cleaned_tokens.append(token)
        li = " ".join(cleaned_tokens)
    t.append(li)
    print(t)
This code gives output:
['pass urine']
['pass urine bleed body']
['pass urine bleed body feel extremely lethargic weak']
But the output I need is:
["pass urine", "bleed body", "feel extremely lethargic weak"]
Please suggest how I can get this result.
This produces the results you want:
import spacy

nlp = spacy.load("en_core_web_md")
l1 = ['Passing much less urine', 'Bleeding from any body part', 'Feeling extremely lethargic/weak', 'Excessive sleepiness/restlessness', 'Altered mental status', 'Seizure/fits', 'Breathlessness', 'Blood in sputum', 'Chest pain', 'Sound/noise in breathing', 'Drooling of saliva', 'Difficulty in opening mouth']
docs = nlp.pipe(l1)
t = []
for doc in docs:
    clean_doc = " ".join([tok.text.lower() for tok in doc if not tok.is_stop and not tok.is_punct])
    t.append(clean_doc)
print(t)
['passing urine', 'bleeding body', 'feeling extremely lethargic weak', 'excessive sleepiness restlessness', 'altered mental status', 'seizure fits', 'breathlessness', 'blood sputum', 'chest pain', 'sound noise breathing', 'drooling saliva', 'difficulty opening mouth']
In case you need lemmas (note that nlp.pipe returns a generator, which the first loop has already exhausted, so re-create it):
docs = nlp.pipe(l1)
t = []
for doc in docs:
    clean_doc = " ".join([tok.lemma_.lower() for tok in doc if not tok.is_stop and not tok.is_punct])
    t.append(clean_doc)
print(t)
['pass urine', 'bleed body', 'feel extremely lethargic weak', 'excessive sleepiness restlessness', 'alter mental status', 'seizure fit', 'breathlessness', 'blood sputum', 'chest pain', 'sound noise breathing', 'drool saliva', 'difficulty open mouth']
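If you want to run both loops over the same docs, it may be simpler to materialize the generator once up front; a minimal sketch:
docs = list(nlp.pipe(l1))  # a list can be iterated as many times as needed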

How to set contents of a file that don't start with "\t" as keys, and those that start with "\t" and end with "\n" as values of the key before them?

I want to make a dictionary that looks like this: { 'The Dorms': {'Public Policy' : 50, 'Physics Building' : 100, 'The Commons' : 120}, ...}
This is the list:
['The Dorms\n', '\tPublic Policy, 50\n', '\tPhysics Building, 100\n', '\tThe Commons, 120\n', 'Public Policy\n', '\tPhysics Building, 50\n', '\tThe Commons, 60\n', 'Physics Building\n', '\tThe Commons, 30\n', '\tThe Quad, 70\n', 'The Commons\n', '\tThe Quad, 15\n', '\tBiology Building, 20\n', 'The Quad\n', '\tBiology Building, 35\n', '\tMath Psych Building, 50\n', 'Biology Building\n', '\tMath Psych Building, 75\n', '\tUniversity Center, 125\n', 'Math Psych Building\n', '\tThe Stairs by Sherman, 50\n', '\tUniversity Center, 35\n', 'University Center\n', '\tEngineering Building, 75\n', '\tThe Stairs by Sherman, 25\n', 'Engineering Building\n', '\tITE, 30\n', 'The Stairs by Sherman\n', '\tITE, 50\n', 'ITE']
This is my code:
def load_map(map_file_name):
    # map_list = []
    map_dict = {}
    map_file = open(map_file_name, "r")
    map_list = map_file.readlines()
    for map in map_file:
        map_content = map.strip("\n").split(",")
        map_list.append(map_content)
    for map in map_list:
        map_dict[map[0]] = map[1:]
    print(map_dict)

if __name__ == "__main__":
    map_file_name = input("What is the map file? ")
    load_map(map_file_name)
Since your file's content is apparently literal Python data, you should use ast.literal_eval to parse it, not some ad-hoc method.
Then you can just loop around your values and process them:
import ast

def load_map(mapfile):
    with open(mapfile, encoding='utf-8') as f:
        data = ast.literal_eval(f.read())
    m = {}
    current_section = None
    for item in data:
        if not item.startswith('\t'):
            current_section = m[item.strip()] = {}
        else:
            k, v = item.split(',')
            current_section[k.strip()] = int(v.strip())
    print(m)
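Called on a file whose entire content is the list literal from the question, this prints the nested dictionary; for example, the first entry comes out as follows (map.txt is a hypothetical file name here):
load_map('map.txt')
# {'The Dorms': {'Public Policy': 50, 'Physics Building': 100, 'The Commons': 120}, ...}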

Is there a way to create a case-like IF statement in Python/Jupyter to monitor substrings in Tweets using Tweepy?

I am attempting to monitor several businesses' Twitter accounts by exporting tweets to a .csv to look at the positivity or negativity of tweets that include the name of the business, which will then be visualised.
To make it easier for myself I'm only assigning each tweet one number, between 1 (negative) and 10 (positive); however, the code I've written either gives no feedback (the rating remains at 0), gets stuck in a for loop, or raises a SyntaxError.
Using a Jupyter notebook I've tried to create a 10-line if/elif statement (since Python has no case statement) and inserted this code both in the 'get Tweets' method and in the 'write csv' method.
Get Tweets
api = tweepy.API(auth)
query = "ASOS"
language = "en"
results = api.search(q=query, lang=language, count=100)
for tweet in results:
    if (not tweet.retweeted) and ('RT #' not in tweet.text):
        print(tweet.user.screen_name, "Tweeted:", tweet.text, **rating**)
        print()
Write CSV
import csv

api = tweepy.API(auth)
csvFile = open('ASOS with emojis1.csv', 'a')
csvWriter = csv.writer(csvFile)
results = api.search(q=query, lang=language, count=100)
for tweet in results:
    if (not tweet.retweeted) and ('RT #' not in tweet.text):
        csvWriter.writerow([tweet.created_at, tweet.user.screen_name, tweet.text, **rating**])
csvFile.close()
If/Elif Statement I've written
rating = '0'
if 'abysmal' or 'appalling' or 'dreadful' or 'awful' or 'terrible' or 'very bad' or 'really bad' or '😑' or '😠' or '😷' in tweet.text:
    (rating = '1')
elif 'rubbish' or 'unsatisfactory' or 'bad' or 'poor' or '🙁' or '😞' or ':(' or '):' or '💀' in tweet.text:
    (rating = '2')
elif 'quite bad' or 'pretty bad' or 'somewhat bad' or 'below average' or '💔' or '😣' or '☹️' or '😒' or '😢' in tweet.text:
    (rating = '3')
elif 'mediocre' or '🙃' or '👎' or '🙄' or '🤔' or '😪' in tweet.text:
    (rating = '4')
elif 'average' or 'not bad' or 'fair' or 'alright' or 'ok' or 'satisfactory' or 'fine' or 'somewhat good' or '😳' or '😭' or '😩' or '😫' or '👀' or '😱' or '😬' or 'omg' in tweet.text:
    (rating = '5')
elif 'quite good' or 'decent' or 'above average' or 'pretty good' or 'good' or '🙂' or '💪' or '😅' or '😎' or '😈' in tweet.text:
    (rating = '6')
elif 'great' or 'gr8' or 'really good' or 'rlly good' or 'very good' or 'v good' or '💖' or '☺️' or '😘' or '😌' or '👍' or '👏' or '🙌' or ':)' or '(:' or '💥' or '💙' or '🤣' or '🖤' or '👌' in tweet.text:
    (rating = '7')
elif 'awesome' or 'fantastic' or '😂' or '💕' or '😍' or '😊' or '❤' or '♥' or '💜' or '💛' or '✅' or '🎉' or '🤗' or '🙏' or '✨' in tweet.text:
    (rating = '8')
elif 'superb' or 'brilliant' or 'incredible' or 'excellent' or 'outstanding' or '😁' or '😄' or '🥰' or '💯' in tweet.text:
    (rating = '9')
elif 'perfect' in tweet.text:
    (rating = '10')
else:
    (rating = 'N/A')
Expected: produces a .csv file with various numbers in it
Actual: (rating = '1') SyntaxError: invalid syntax
Your conditionals are not doing what you expect. To chain conditions, each one must be written out in full:
mylist = [1, 2, 3]
# Note that the full condition must be specified for
# each desired comparison
if 1 in mylist or 2 in mylist or 3 in mylist:
    print("True")
# True
The issue with what you are using is that you are approaching the logic the way you would say it rather than how the interpreter reads it. As an example:
if 'a' or 'b':
    print('True')
# True
Non-empty strings are truthy and will make your condition pass on their own, so the entire conditional must be spelled out:
# Evaluates to True, though it's not what you want
if 'a' and 'b' in 'bc':
    print(True)  # This is not what you want, but 'a' is read as true
# True

if 'a' in 'bc' and 'b' in 'bc':
    print(True)
# Doesn't print True because 'a' in 'bc' is False
The any function can help here, as it checks whether any of the values evaluates to True:
mylist = [1, 2, 3]
if any([i in mylist for i in range(2, 5)]):
    print("True")
# True
Furthermore, the parentheses around the variable assignments are what cause your SyntaxError; assignment is a statement in Python and cannot be wrapped in parentheses like an expression:
if 'abysmal' in tweet.text or 'horrible' in tweet.text:
    rating = 0
elif ...:
    rating = 1
# So on and so forth
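Putting this together, one way to avoid ten hand-written branches is a dictionary mapping each rating to its keyword list, checked with any(). This is only a sketch with placeholder keyword lists taken from the question; since Python 3.7 dicts preserve insertion order, so the tiers are tested from 1 upwards:
RATING_KEYWORDS = {
    '1': ['abysmal', 'appalling', 'dreadful', 'awful', 'terrible'],
    '2': ['rubbish', 'unsatisfactory', 'bad', 'poor'],
    # ... one entry per rating, emoji included
    '10': ['perfect'],
}

def rate(text):
    # Return the first rating whose keyword list matches the tweet text
    for rating, keywords in RATING_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return rating
    return 'N/A'

print(rate('this dress is perfect'))  # 10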

Python - Words in string affect outcome

I am new to Python and trying to develop a phone troubleshooting program: I ask the user what is wrong with their device, and if my program detects the word 'wet' or 'water' it replies with an outcome. Another example would be 'screen is cracked'. The problem is that if I input 'My screen is cracked', my code does not detect it. Any help appreciated!
Snippet of my code:
print(60 * '-')
print('Could you describe what is wrong with your device?')
print(60 * '-')
time.sleep(1)
user_problem = input('')
if user_problem in ('water', 'waterdamage', 'rain', 'toilet', 'pool', 'sea', 'ocean', 'river',):
    print('WATERDAMAGE VARIABLE')
elif user_problem in ('screen', 'cracked', 'shattered', 'smashed',):
    print('SCREEN VARIABLE')
user_problem is the whole of your user's input. You are checking whether it belongs to a tuple of keywords, so if the input is "My phone is wet", the string does not belong to ('water', 'waterdamage', 'rain', 'toilet', 'pool', 'sea', 'ocean', 'river') since it's not equal to any of these words. The same problem occurs in the second condition.
The correct solution is to ask whether any of these words is contained in the input, which is quite the opposite check. You would have something like:
user_problem_words = user_problem.split(' ')
water_related_words = ('water', 'waterdamage', 'rain', 'toilet', 'pool', 'sea', 'ocean', 'river')
break_related_words = ('screen', 'cracked', 'shattered', 'smashed')
if any(word in water_related_words for word in user_problem_words):
    print('WATERDAMAGE VARIABLE')
elif any(word in break_related_words for word in user_problem_words):
    print('SCREEN VARIABLE')
Or, if you don't like the list comprehension's readability in this case, you can use a plain for:
water_related_words = ('water', 'waterdamage', 'rain', 'toilet', 'pool', 'sea', 'ocean', 'river')
break_related_words = ('screen', 'cracked', 'shattered', 'smashed')
for word in user_problem.split(' '):
    if word in water_related_words:
        print('WATERDAMAGE VARIABLE')
        break
    elif word in break_related_words:
        print('SCREEN VARIABLE')
        break
You need to change your approach; here is an example of how you can do it:
print(60 * '-')
print('Could you describe what is wrong with your device?')
print(60 * '-')
time.sleep(1)
userproblem = input('')
water = ['water', 'waterdamage', 'rain', 'toilet', 'pool', 'sea', 'ocean', 'river']
screen = ['screen', 'cracked', 'shattered', 'smashed']
for item in water:
    if item in userproblem.split(' '):
        print('WATERDAMAGE VARIABLE')
        break
for item in screen:
    if item in userproblem.split(' '):
        print('SCREEN VARIABLE')
        break
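A slightly more compact variant (my own sketch, not from either answer above) splits the input once into a set and uses set intersection:
words = set(userproblem.split())
if words & {'water', 'waterdamage', 'rain', 'toilet', 'pool', 'sea', 'ocean', 'river'}:
    print('WATERDAMAGE VARIABLE')
elif words & {'screen', 'cracked', 'shattered', 'smashed'}:
    print('SCREEN VARIABLE')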

How to train a sense2vec model

The documentation of sense2vec mentions 3 primary files, the first of them being merge_text.py. I have tried several types of input: txt, csv, and a bzipped file, since merge_text.py tries to open files compressed by bzip2.
The file can be found at:
https://github.com/spacy-io/sense2vec/blob/master/bin/merge_text.py
What type of input format does this script require?
Further, could anyone please suggest how to train the model?
I extended and adjusted the code samples from sense2vec.
You go from this input text:
"As far as Saudi Arabia and its motives, that is very simple also. The Saudis are
good at money and arithmetic. Faced with the painful choice of losing money
maintaining current production at US$60 per barrel or taking two million barrels
per day off the market and losing much more money - it's an easy choice: take
the path that is less painful. If there are secondary reasons like hurting US
tight oil producers or hurting Iran and Russia, that's great, but it's really
just about the money."
To this:
as|ADV far|ADV as|ADP saudi_arabia|ENT and|CCONJ its|ADJ motif|NOUN that|ADJ is|VERB very|ADV simple|ADJ also|ADV saudis|ENT are|VERB good|ADJ at|ADP money|NOUN and|CCONJ arithmetic|NOUN faced|VERB with|ADP painful_choice|NOUN of|ADP losing|VERB money|NOUN maintaining|VERB current_production|NOUN at|ADP us$|SYM 60|MONEY per|ADP barrel|NOUN or|CCONJ taking|VERB two_million|CARDINAL barrel|NOUN per|ADP day|NOUN off|ADP market|NOUN and|CCONJ losing|VERB much_more_money|NOUN it|PRON 's|VERB easy_choice|NOUN take|VERB path|NOUN that|ADJ is|VERB less|ADV painful|ADJ if|ADP there|ADV are|VERB secondary_reason|NOUN like|ADP hurting|VERB us|ENT tight_oil_producer|NOUN or|CCONJ hurting|VERB iran|ENT and|CCONJ russia|ENT 's|VERB great|ADJ but|CCONJ it|PRON 's|VERB really|ADV just|ADV about|ADP money|NOUN
Double line breaks are interpreted as separate documents.
URLs are recognized as such, stripped down to domain.tld and tagged |URL
Nouns (including nouns that are part of noun phrases) are lemmatized (motives becomes motif above)
Words with POS tags like DET (determiner) and PUNCT (punctuation) are dropped
Here's the code. Let me know if you have questions.
I'll probably publish it on github.com/woltob soon.
import spacy
import re

nlp = spacy.load('en')
nlp.matcher = None

LABELS = {
    'ENT': 'ENT',
    'PERSON': 'PERSON',
    'NORP': 'ENT',
    'FAC': 'ENT',
    'ORG': 'ENT',
    'GPE': 'ENT',
    'LOC': 'ENT',
    'LAW': 'ENT',
    'PRODUCT': 'ENT',
    'EVENT': 'ENT',
    'WORK_OF_ART': 'ENT',
    'LANGUAGE': 'ENT',
    'DATE': 'DATE',
    'TIME': 'TIME',
    'PERCENT': 'PERCENT',
    'MONEY': 'MONEY',
    'QUANTITY': 'QUANTITY',
    'ORDINAL': 'ORDINAL',
    'CARDINAL': 'CARDINAL'
}

pre_format_re = re.compile(r'^[\`\*\~]')
post_format_re = re.compile(r'[\`\*\~]$')
url_re = re.compile(r'(https?:\/\/)?([a-z0-9-]+\.)?([\d\w]+?\.[^\/]{2,63})')
single_linebreak_re = re.compile('\n')
double_linebreak_re = re.compile('\n{2,}')
whitespace_re = re.compile(r'[ \t]+')
quote_re = re.compile(r'"|`|´')

def strip_meta(text):
    text = text.replace('per cent', 'percent')
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    text = pre_format_re.sub('', text)
    text = post_format_re.sub('', text)
    text = double_linebreak_re.sub('{2break}', text)
    text = single_linebreak_re.sub(' ', text)
    text = text.replace('{2break}', '\n')
    text = whitespace_re.sub(' ', text)
    text = quote_re.sub('', text)
    return text

def transform_doc(doc):
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
    for np in doc.noun_chunks:
        while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
            np = np[1:]
        np.merge(np.root.tag_, np.text, np.root.ent_type_)
    strings = []
    for sent in doc.sents:
        sentence = []
        if sent.text.strip():
            for w in sent:
                if w.is_space:
                    continue
                w_ = represent_word(w)
                if w_:
                    sentence.append(w_)
            strings.append(' '.join(sentence))
    if strings:
        return '\n'.join(strings) + '\n'
    else:
        return ''

def represent_word(word):
    if word.like_url:
        x = url_re.search(word.text.strip().lower())
        if x:
            return x.group(3) + '|URL'
        else:
            return word.text.lower().strip() + '|URL?'
    text = re.sub(r'\s', '_', word.text.strip().lower())
    tag = LABELS.get(word.ent_type_)
    # Dropping PUNCT such as commas and DET like "the"
    if tag is None and word.pos_ not in ['PUNCT', 'DET']:
        tag = word.pos_
    elif tag is None:
        return None
    # if not word.pos_:
    #     tag = '?'
    return text + '|' + tag

corpus = '''
As far as Saudi Arabia and its motives, that is very simple also. The Saudis are
good at money and arithmetic. Faced with the painful choice of losing money
maintaining current production at US$60 per barrel or taking two million barrels
per day off the market and losing much more money - it's an easy choice: take
the path that is less painful. If there are secondary reasons like hurting US
tight oil producers or hurting Iran and Russia, that's great, but it's really
just about the money.
'''

corpus_stripped = strip_meta(corpus)
doc = nlp(corpus_stripped)
corpus_ = []
for word in doc:
    # Only lemmatize NOUN and PROPN
    if word.pos_ in ['NOUN', 'PROPN'] and len(word.text) > 3 and len(word.text) != len(word.lemma_):
        # Keep the original first character, use the rest of the lemma,
        # then re-attach the trailing whitespace if it was there:
        lemma_ = str(word.text[:1] + word.lemma_[1:] + word.text_with_ws[len(word.text):])
        # print(word.text, lemma_)
        corpus_.append(lemma_)
    # All other words are added unchanged.
    else:
        corpus_.append(word.text_with_ws)

result = transform_doc(nlp(''.join(corpus_)))

sense2vec_filename = 'text.txt'
file = open(sense2vec_filename, 'w')
file.write(result)
file.close()
print(result)
You could visualise your model with Gensim in TensorBoard using this approach:
https://github.com/ArdalanM/gensim2tensorboard
I'll also adjust this code to work with the sense2vec approach (e.g. the words become lowercase in the preprocessing step; just comment that out in the code).
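If you would rather avoid that dependency, and assuming the trained vectors were saved in word2vec text format (the filename below is hypothetical), here is a minimal sketch that exports them for TensorBoard's standalone embedding projector:
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('sense2vec_vectors.txt', binary=False)
with open('vectors.tsv', 'w', encoding='utf-8') as vec_file, \
        open('metadata.tsv', 'w', encoding='utf-8') as meta_file:
    for token in kv.index_to_key:  # use kv.index2word on gensim < 4.0
        vec_file.write('\t'.join(str(x) for x in kv[token]) + '\n')
        meta_file.write(token + '\n')
# Load both TSV files at https://projector.tensorflow.org/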
Happy coding,
woltob
The input file should be a bzipped JSON. To use a plain text file, just edit merge_text.py as follows:
def iter_comments(loc):
    with bz2.BZ2File(loc) as file_:
        for i, line in enumerate(file_):
            yield line.decode('utf-8', errors='ignore')
            # yield ujson.loads(line)['body']
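For completeness, a plain-text corpus can be packed into the expected .bz2 container like this (a minimal sketch; corpus.txt is a hypothetical input file):
import bz2

with open('corpus.txt', encoding='utf-8') as src, bz2.open('corpus.txt.bz2', 'wt', encoding='utf-8') as dst:
    dst.write(src.read())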
