Let's say I have a string and want to mark some entities such as persons and locations.
string = 'My name is John Doe, and I live in USA'
string_tagged = 'My name is [John Doe], and I live in {USA}'
I want to mark persons with [ ] and locations with { }.
My code:
import spacy
nlp = spacy.load('en')
doc = nlp(string)
sentence = doc.text
for ent in doc.ents:
    if ent.label_ == 'PERSON':
        sentence = sentence[:ent.start_char] + sentence[ent.start_char:].replace(ent.text, '[' + ent.text + ']', 1)
    elif ent.label_ == 'GPE':
        sentence = sentence[:ent.start_char] + sentence[ent.start_char:].replace(ent.text, '{' + ent.text + '}', 1)
print(sentence[:ent.start_char] + sentence[ent.start_char:])
...so with the example string this works fine. But with more complicated sentences I get doubled brackets or braces around some entities. For the sentence:
string_bug = 'Canada, Canada, Canada, Canada, Canada, Canada'
returns >> {Canada}, {Canada}, {Canada}, {Canada}, {{Canada}}, Canada
The reason I split the sentence string in two was to only replace new words (at higher character positions). I think the bug is that I am looping over doc.ents, so I get the entity positions of the original string, while the string grows on each iteration with the new [ ] and { }. But it feels like there must be an easier way of dealing with this in spaCy.
Here's a slight modification that helped me work with your code.
string = 'My name is John Doe, and I live in USA'
import re
import spacy
nlp = spacy.load('en')
doc = nlp(string)
sentence = doc.text
for ent in doc.ents:
    if ent.label_ == 'PERSON':
        sentence = re.sub(ent.text, '[' + ent.text + ']', sentence)
    elif ent.label_ == 'GPE':
        sentence = re.sub(ent.text, '{' + ent.text + '}', sentence)
print(sentence)
Yields:
My name is [John Doe], and I live in {USA}
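If you still hit the doubled-marker problem with repeated entities, another option is to rebuild the string from the entity character offsets in a single pass, so nothing is replaced twice. A minimal sketch, assuming the same `nlp` pipeline and input string as above:

import spacy

nlp = spacy.load('en')
doc = nlp('My name is John Doe, and I live in USA')

markers = {'PERSON': ('[', ']'), 'GPE': ('{', '}')}

out = []
last = 0
for ent in doc.ents:
    open_, close = markers.get(ent.label_, ('', ''))
    # copy the untouched text before the entity, then the wrapped entity
    out.append(doc.text[last:ent.start_char])
    out.append(open_ + ent.text + close)
    last = ent.end_char
out.append(doc.text[last:])

print(''.join(out))
# My name is [John Doe], and I live in {USA}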
Related
I am cleaning company names in dataframe rows and want to remove places contained in a list, but only if the place appears at the end of the name and is not preceded by "of" or "in".
I am able to do this in Stata, but it takes ages to process big data.
So, I would like to remove Liverpool from "KFC liverpool" but not from "Taxis in Liverpool".
So far, I have something like this (I did not get to the "of" or "in" part yet):
places = ["liverpool", "west essex", "bristol", "sanders park"]
class remove_cities:
    def __init__(self, text, places_list):
        self.text = text
        self.places_list = places_list

    def remove_cities(self):
        for plac in self.places_list:
            self.text = self.text.removesuffix(plac)
        return self

    def identify_sentences(self):
        self = self.remove_cities()
        return self.text
places_list = places_all
tqdm.pandas(desc="Text Preprocessing")
clean_companyname_df = clean_companyname_df[["company_clean"]].progress_applymap(
    lambda x: remove_cities(x, places_list=places_list).identify_sentences()
)
But I get the error: TypeError: removesuffix() argument must be str, not list
I do not want to convert the text into a single string, since cities can have multi-word names such as "West Essex", and I do not want to eliminate the word "west" from the end of the company name.
Can somebody help me?
Since your case is quite specific, I don't think you can clean the company names directly with built-in pandas functions.
First, analyse the different cases:
Cases
Case 1: company_name is a single word.
Simply return company_name.
Case 2: company_name ends with "west essex" or "sanders park".
Case 2.1: "of" or "in" occurs just before "west essex" or "sanders park".
Return company_name as it is.
Case 2.2: "of" or "in" does not occur just before "west essex" or "sanders park".
Remove "west essex" or "sanders park" from company_name.
Case 3: company_name ends with "liverpool" or "bristol".
Case 3.1: "of" or "in" occurs just before "liverpool" or "bristol".
Return company_name as it is.
Case 3.2: "of" or "in" does not occur just before "liverpool" or "bristol".
Remove "liverpool" or "bristol" from company_name.
Then write the code:
places = ["liverpool", "west essex", "bristol", "sanders park"]
def FilterCompanyName(company_name):
    # convert company_name to a list of words
    words = company_name.split()
    # print(words)
    if len(words) == 1:  # make no change if company name is a single word
        return company_name
    last_word = words[-1]
    second_last_word = words[-2]  # word before last word
    last_two_words = second_last_word.lower() + " " + last_word.lower()
    # deal with case where last_two_words are "west essex" or "sanders park"
    if last_two_words in places:
        if len(words) == 2:  # company name is just "west essex" or "sanders park"
            return company_name
        third_last_word = words[-3]  # word before second_last_word
        if third_last_word.lower() == "in" or third_last_word.lower() == "of":
            return company_name
        # last_two_words are "west essex" or "sanders park" and
        # third_last_word is not "in" or "of", so drop the place
        words.remove(last_word)
        words.remove(second_last_word)
        return ' '.join(words)
    # deal with case where last word is "liverpool" or "bristol"
    if last_word.lower() in places:
        if (second_last_word.lower() == "in" or
                second_last_word.lower() == "of"):
            return company_name
        words.remove(last_word)
        return ' '.join(words)
    # company name does not end with a listed place
    return company_name
test_cases = ["KFC liverpool",
"Taxis in Liverpool",
"West Essex",
"The best in west essex",
"LIVERPOOL",
"big bristol forever",
"big bristol",
"KFC west essex",
"KFC of Sanders Park"]
for test_case_number in range(0, len(test_cases)):
test = test_cases[test_case_number]
print("Test case " + str(test_case_number) + ": " + test +
" -> " + FilterCompanyName(test)+"\n")
Output:
Test case 0: KFC liverpool -> KFC
Test case 1: Taxis in Liverpool -> Taxis in Liverpool
Test case 2: West Essex -> West Essex
Test case 3: The best in west essex -> The best in west essex
Test case 4: LIVERPOOL -> LIVERPOOL
Test case 5: big bristol forever -> big bristol forever
Test case 6: big bristol -> big
Test case 7: KFC west essex -> KFC
Test case 8: KFC of Sanders Park -> KFC of Sanders Park
You will have to integrate my function with your code.
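As a rough sketch of that integration, assuming the clean_companyname_df dataframe, the "company_clean" column, and the tqdm setup from your question:

from tqdm import tqdm

tqdm.pandas(desc="Text Preprocessing")

# apply the filter to every company name in the selected column
clean_companyname_df = clean_companyname_df[["company_clean"]].progress_applymap(
    FilterCompanyName
)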
reserved_chars = "? & | ! { } [ ] ( ) ^ ~ * : \ " ' + -"
list_vals = ['gold-bear#gmail.com', 'P&G#dom.com', 'JACKSON! BOT', 'annoying\name']
What is the fastest way to loop through every element in the list and add a \ in front of the reserved characters if an element contains any of them?
desired output:
fixed_list = ['gold\-bear#gmail.com', 'P\&G#dom.com', 'JACKSON\! BOT', 'annoying\\name']
You could make a translation table with str.maketrans() and pass that into translate. This takes a little setup, but you can reuse the translation table and it's quite fast:
reserved_chars = '''?&|!{}[]()^~*:\\"'+-'''
list_vals = ['gold-bear#gmail.com', 'P&G#dom.com', 'JACKSON! BOT', 'annoying\\name']
# make trans table
replace = ['\\' + l for l in reserved_chars]
trans = str.maketrans(dict(zip(reserved_chars, replace)))
# translate with trans table
fixed_list = [s.translate(trans) for s in list_vals]
print("\n".join(fixed_list))
Prints:
gold\-bear#gmail.com
P\&G#dom.com
JACKSON\! BOT
annoying\\name
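As a side note, the reserved_chars in the question is a space-separated string; if you start from that form, one small sketch (not the only way) is to collapse it into a plain character set first and build the same table:

# derive the character set from the space-separated form in the question
reserved_chars_spaced = r"""? & | ! { } [ ] ( ) ^ ~ * : \ " ' + -"""
reserved_chars = ''.join(reserved_chars_spaced.split())

replace = ['\\' + c for c in reserved_chars]
trans = str.maketrans(dict(zip(reserved_chars, replace)))

print('P&G#dom.com'.translate(trans))  # P\&G#dom.com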
There is no truly fast way: you have strings, strings are immutable, so you need to create new ones.
Probably the best way is to build your own translation dictionary and do the grunt work yourself:
reserved = """? & | ! { } [ ] ( ) ^ ~ * : \ " ' + -"""
tr = { c:f"\\{c}" for c in reserved}
print(tr)
data = ['gold-bear#gmail.com', 'P&G#dom.com', 'JACKSON! BOT', 'annoying\name']
transformed = [ ''.join(tr.get(letter,letter) for letter in word) for word in data]
for word in transformed:
print(word)
Output:
# translation dictionary
{'?': '\\?', ' ': '\\ ', '&': '\\&', '|': '\\|', '!': '\\!', '{': '\\{',
'}': '\\}', '[': '\\[', ']': '\\]', '(': '\\(', ')': '\\)', '^': '\\^',
'~': '\\~', '*': '\\*', ':': '\\:', '\\': '\\\\', '"': '\\"', "'": "\\'",
'+': '\\+', '-': '\\-'}
# transformed strings
gold\-bear#gmail.com
P\&G#dom.com
JACKSON\!\ BOT
annoying
ame
Sidenotes:
Your example output does not escape the space inside 'JACKSON\! BOT'.
The repr() of the transformed list looks "wrongly" escaped because repr() escapes each '\' again; what actually gets printed is shown in the word list above.
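Also, the 'annoying' / 'ame' split in the output comes from '\n' inside the non-raw literal being a real newline. With a raw string the backslash survives and gets escaped like any other reserved character; a small sketch reusing the tr dict from above:

# r'...' keeps the backslash literal instead of turning \n into a newline
data = [r'annoying\name']
transformed = [''.join(tr.get(letter, letter) for letter in word) for word in data]
print(transformed[0])  # annoying\\name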
Definitely not the fastest, but could be the easiest to code. Make a regex that does it for you, and run re.sub, like this:
import re
reserved_chars = "?&|!{}[]()^~*:\\\"'+-"
replace_regex = "([" + ''.join('\\x%x' % ord(x) for x in reserved_chars) + "])"
list_vals = ['gold-bear#gmail.com', 'P&G#dom.com', 'JACKSON! BOT', r'annoying\name']
escaped_vals = [re.sub(replace_regex, r"\\\1", x) for x in list_vals]
Again, just to clarify, regexes are SLOW.
I have a script that extracts info from Excel into a list; the list contains str values with phrases such as "I like cooking", "My dog´s name is Doug", etc.
So I tried this code that I found on the Internet, thinking the int function has a way to transform an actual phrase into numbers.
The code I used is:
lista=["I like cooking", "My dog´s name is Doug", "Hi, there"]
test_list = [int(i, 36) for i in lista]
Running the code I get the following error:
builtins.ValueError: invalid literal for int() with base 36: "I like cooking"
But I've tried the code without the spaces or punctuation and I get an actual value; however, I do need to take those characters into account.
To expand on the bytearray approach you could use int.to_bytes and int.from_bytes to actually get an int back, although the integers will be much longer than you show in your example.
def to_int(s):
    return int.from_bytes(bytearray(s, 'utf-8'), 'big', signed=False)

def to_str(s):
    return s.to_bytes((s.bit_length() + 7) // 8, 'big').decode()

lista = ["I like cooking",
         "My dog´s name is Doug",
         "Hi, there"]
encoded = [to_int(s) for s in lista]
decoded = [to_str(s) for s in encoded]
encoded:
[1483184754092458833204681315544679,
28986146900667755422058678317652141643897566145770855,
1335744041264385192549]
decoded:
['I like cooking',
'My dog´s name is Doug',
'Hi, there']
As noted in the comments, converting phrases to integers with int() won't work if the phrase contains whitespace or most non-alphanumeric characters with a few exceptions.
If your phrases all use a common encoding, then you might get something closer to what you want by converting your strings to bytearrays. For example:
s = 'My dog´s name is Doug'
b = bytearray(s, 'utf-8')
print(list(b))
# [77, 121, 32, 100, 111, 103, 194, 180, 115, 32, 110, 97, 109, 101, 32, 105, 115, 32, 68, 111, 117, 103]
From there you would have to figure out whether or not you want to preserve the list of integers representing each phrase or combine them in some way depending on what you intend to do with these numerical string representations.
Since you want to convert your text for an AI, you should do something like this:
import re
def clean_text(text, vocab):
    '''
    normalizes the string
    '''
    chars = {'\'': [u"\u0060", u"\u00B4", u"\u2018", u"\u2019"],
             'a': [u"\u00C0", u"\u00C1", u"\u00C2", u"\u00C3", u"\u00C4", u"\u00C5", u"\u00E0", u"\u00E1", u"\u00E2", u"\u00E3", u"\u00E4", u"\u00E5"],
             'e': [u"\u00C8", u"\u00C9", u"\u00CA", u"\u00CB", u"\u00E8", u"\u00E9", u"\u00EA", u"\u00EB"],
             'i': [u"\u00CC", u"\u00CD", u"\u00CE", u"\u00CF", u"\u00EC", u"\u00ED", u"\u00EE", u"\u00EF"],
             'o': [u"\u00D2", u"\u00D3", u"\u00D4", u"\u00D5", u"\u00D6", u"\u00F2", u"\u00F3", u"\u00F4", u"\u00F5", u"\u00F6"],
             'u': [u"\u00DA", u"\u00DB", u"\u00DC", u"\u00DD", u"\u00FA", u"\u00FB", u"\u00FC", u"\u00FD"]}
    # replace accented characters with their plain counterparts
    for gud in chars:
        for bad in chars[gud]:
            text = text.replace(bad, gud)
    if 'http' in text:
        return ''
    text = text.replace('&', ' and ')
    text = re.sub(r'\.( +\.)+', '..', text)
    #text = re.sub(r'\.\.+', ' ^ ', text)
    text = re.sub(r',+', ',', text)
    text = re.sub(r'\-+', '-', text)
    text = re.sub(r'\?+', ' ? ', text)
    text = re.sub(r'\!+', ' ! ', text)
    text = re.sub(r'\'+', "'", text)
    text = re.sub(r';+', ':', text)
    text = re.sub(r'/+', ' / ', text)
    text = re.sub(r'<+', ' < ', text)
    text = re.sub(r'>+', ' > ', text)
    text = text.replace('%', '% ')
    text = text.replace(' - ', ' : ')
    text = text.replace(' -', " - ")
    text = text.replace('- ', " - ")
    text = text.replace(" '", " ")
    text = text.replace("' ", " ")
    #for c in ".,:":
    #    text = text.replace(c + ' ', ' ' + c + ' ')
    text = re.sub(r' +', ' ', text.strip(' '))
    # drop any character that is not in the vocab
    for i in text:
        if i not in vocab:
            text = text.replace(i, '')
    return text

def arr_to_vocab(arr, vocabDict):
    '''
    returns a provided array converted with the provided vocab dict; all array elements have to be in the vocab,
    but not all vocab elements have to be in the input array; works with strings too
    '''
    try:
        return [vocabDict[i] for i in arr]
    except Exception as e:
        print(e)
        return []

def str_to_vocab(vocab):
    '''
    generates vocab dicts
    '''
    to_vocab = {}
    from_vocab = {}
    for index, i in enumerate(vocab):
        to_vocab[index] = i
        from_vocab[i] = index
    return to_vocab, from_vocab
vocab = sorted([chr(i) for i in range(32, 127)]) # a basic vocab for your model
vocab.insert(0, None)
toVocab, fromVocab = str_to_vocab(vocab) #converting vocab into usable form
your_data_str = ["I like cooking", "My dog´s name is Doug", "Hi, there"] #your data, a list of strings
X = []
for i in your_data_str:
    X.append(arr_to_vocab(clean_text(i, vocab), fromVocab))  # normalizing and converting each string to "ints"

# your data is now almost ready for your model, just pad it to the size of your input with zeros and it's done
print(X)
If you want to know how to convert an "int" string back to a string, tell me.
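For reference, a minimal sketch of that reverse direction (my own addition, not part of the original answer), using the toVocab dict defined above and the encoded list X from the snippet:

def vocab_to_str(encoded, to_vocab):
    # map each index back to its character and join; index 0 (None) is treated as padding
    return ''.join(to_vocab[i] for i in encoded if to_vocab[i] is not None)

decoded = [vocab_to_str(row, toVocab) for row in X]
print(decoded)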
I'm using the Stanford Named Entity Recognizer with Python to find the proper names in the novel "One Hundred Years of Solitude". Many of them are composed of a first and last name, e.g. "Aureliano Buendía" or "Santa Sofía de la Piedad". These tokens always come out separated, e.g. "Aureliano" "Buendía", because of the tokenizer I am using.
I would like to have them together as one token, so they can be tagged together as "PERSON" with Stanford NER.
The code I wrote:
import nltk
from nltk.tag import StanfordNERTagger
from nltk import word_tokenize
from nltk import FreqDist
sentence1 = open('book1.txt').read()
sentence = sentence1.split()
path_to_model = "C:\Python34\stanford-ner-2015-04-20\classifiers\english.muc.7class.distsim.crf.ser"
path_to_jar = "C:\Python34\stanford-ner-2015-04-20\stanford-ner.jar"
st = StanfordNERTagger(model_filename=path_to_model, path_to_jar=path_to_jar)
taggedSentence = st.tag(sentence)
def findtags(tagged_text, tag_prefix):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in taggedSentence
                                   if tag.endswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(1000)) for tag in cfd.conditions())

print(findtags('_', 'PERSON'))
The result looks like this:
{'PERSON': [('Aureliano', 397), ('José', 294), ('Arcadio', 286), ('Buendía', 251), ...
Does anybody have a solution? I would be more than grateful.
import nltk
from nltk.tag import StanfordNERTagger
sentence1 = open('book1.txt').read()
sentence = sentence1.split()
path_to_model = "C:\Python34\stanford-ner-2015-04-20\classifiers\english.muc.7class.distsim.crf.ser"
path_to_jar = "C:\Python34\stanford-ner-2015-04-20\stanford-ner.jar"
st = StanfordNERTagger(model_filename=path_to_model, path_to_jar=path_to_jar)
taggedSentence = st.tag(sentence)
test = []
test_dict = {}
for element in range(len(taggedSentence)):
    a = ''
    if element < len(taggedSentence):
        # collect consecutive PERSON tokens into a single name
        while element < len(taggedSentence) and taggedSentence[element][1] == 'PERSON':
            a += taggedSentence[element][0] + ' '
            taggedSentence.pop(element)
    if len(a) > 1:
        test.append(a.strip())

# `data` here refers to the source filename in the original answer
test_dict[data.split('.')[0]] = tuple(test)
print(test_dict)
The documentation of sense2vec mentions 3 primary files, the first of them being merge_text.py. I have tried several types of inputs (txt, csv, a bzipped file) since merge_text.py tries to open files compressed by bzip2.
The file can be found at:
https://github.com/spacy-io/sense2vec/blob/master/bin/merge_text.py
What type of input format does this script require?
Further, could anyone please suggest how to train the model?
I extended and adjusted the code samples from sense2vec.
You go from this input text:
"As far as Saudi Arabia and its motives, that is very simple also. The Saudis are
good at money and arithmetic. Faced with the painful choice of losing money
maintaining current production at US$60 per barrel or taking two million barrels
per day off the market and losing much more money - it's an easy choice: take
the path that is less painful. If there are secondary reasons like hurting US
tight oil producers or hurting Iran and Russia, that's great, but it's really
just about the money."
To this:
as|ADV far|ADV as|ADP saudi_arabia|ENT and|CCONJ its|ADJ motif|NOUN that|ADJ is|VERB very|ADV simple|ADJ also|ADV saudis|ENT are|VERB good|ADJ at|ADP money|NOUN and|CCONJ arithmetic|NOUN faced|VERB with|ADP painful_choice|NOUN of|ADP losing|VERB money|NOUN maintaining|VERB current_production|NOUN at|ADP us$|SYM 60|MONEY per|ADP barrel|NOUN or|CCONJ taking|VERB two_million|CARDINAL barrel|NOUN per|ADP day|NOUN off|ADP market|NOUN and|CCONJ losing|VERB much_more_money|NOUN it|PRON 's|VERB easy_choice|NOUN take|VERB path|NOUN that|ADJ is|VERB less|ADV painful|ADJ if|ADP there|ADV are|VERB secondary_reason|NOUN like|ADP hurting|VERB us|ENT tight_oil_producer|NOUN or|CCONJ hurting|VERB iran|ENT and|CCONJ russia|ENT 's|VERB great|ADJ but|CCONJ it|PRON 's|VERB really|ADV just|ADV about|ADP money|NOUN
Double line breaks are interpreted as separate documents.
URLs are recognized as such, stripped down to domain.tld and marked as |URL.
Nouns (including nouns that are part of noun phrases) are lemmatized (e.g. motives becomes motif).
Words with POS tags like DET (determiner) and PUNCT (punctuation) are dropped.
Here's the code. Let me know if you have questions.
I'll probably publish it on github.com/woltob soon.
import spacy
import re
nlp = spacy.load('en')
nlp.matcher = None
LABELS = {
'ENT': 'ENT',
'PERSON': 'PERSON',
'NORP': 'ENT',
'FAC': 'ENT',
'ORG': 'ENT',
'GPE': 'ENT',
'LOC': 'ENT',
'LAW': 'ENT',
'PRODUCT': 'ENT',
'EVENT': 'ENT',
'WORK_OF_ART': 'ENT',
'LANGUAGE': 'ENT',
'DATE': 'DATE',
'TIME': 'TIME',
'PERCENT': 'PERCENT',
'MONEY': 'MONEY',
'QUANTITY': 'QUANTITY',
'ORDINAL': 'ORDINAL',
'CARDINAL': 'CARDINAL'
}
pre_format_re = re.compile(r'^[\`\*\~]')
post_format_re = re.compile(r'[\`\*\~]$')
url_re = re.compile(r'(https?:\/\/)?([a-z0-9-]+\.)?([\d\w]+?\.[^\/]{2,63})')
single_linebreak_re = re.compile('\n')
double_linebreak_re = re.compile('\n{2,}')
whitespace_re = re.compile(r'[ \t]+')
quote_re = re.compile(r'"|`|´')
def strip_meta(text):
    text = text.replace('per cent', 'percent')
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    text = pre_format_re.sub('', text)
    text = post_format_re.sub('', text)
    text = double_linebreak_re.sub('{2break}', text)
    text = single_linebreak_re.sub(' ', text)
    text = text.replace('{2break}', '\n')
    text = whitespace_re.sub(' ', text)
    text = quote_re.sub('', text)
    return text
def transform_doc(doc):
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
    for np in doc.noun_chunks:
        while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
            np = np[1:]
        np.merge(np.root.tag_, np.text, np.root.ent_type_)
    strings = []
    for sent in doc.sents:
        sentence = []
        if sent.text.strip():
            for w in sent:
                if w.is_space:
                    continue
                w_ = represent_word(w)
                if w_:
                    sentence.append(w_)
            strings.append(' '.join(sentence))
    if strings:
        return '\n'.join(strings) + '\n'
    else:
        return ''
def represent_word(word):
    if word.like_url:
        x = url_re.search(word.text.strip().lower())
        if x:
            return x.group(3) + '|URL'
        else:
            return word.text.lower().strip() + '|URL?'
    text = re.sub(r'\s', '_', word.text.strip().lower())
    tag = LABELS.get(word.ent_type_)
    # Dropping PUNCT such as commas and DET like "the"
    if tag is None and word.pos_ not in ['PUNCT', 'DET']:
        tag = word.pos_
    elif tag is None:
        return None
    # if not word.pos_:
    #     tag = '?'
    return text + '|' + tag
corpus = '''
As far as Saudi Arabia and its motives, that is very simple also. The Saudis are
good at money and arithmetic. Faced with the painful choice of losing money
maintaining current production at US$60 per barrel or taking two million barrels
per day off the market and losing much more money - it's an easy choice: take
the path that is less painful. If there are secondary reasons like hurting US
tight oil producers or hurting Iran and Russia, that's great, but it's really
just about the money.
'''
corpus_stripped = strip_meta(corpus)
doc = nlp(corpus_stripped)
corpus_ = []
for word in doc:
    # only lemmatize NOUN and PROPN
    if word.pos_ in ['NOUN', 'PROPN'] and len(word.text) > 3 and len(word.text) != len(word.lemma_):
        # Keep the original first character, use the lemma for the rest, then re-add trailing whitespace if it was there:
        lemma_ = str(word.text[:1] + word.lemma_[1:] + word.text_with_ws[len(word.text):])
        # print(word.text, lemma_)
        corpus_.append(lemma_)
        # print(word.text, word.text[:len(word.lemma_)] + word.text_with_ws[len(word.text):])
    # All other words are added normally.
    else:
        corpus_.append(word.text_with_ws)
result = transform_doc(nlp(''.join(corpus_)))
sense2vec_filename = 'text.txt'
file = open(sense2vec_filename,'w')
file.write(result)
file.close()
print(result)
You could visualise your model in TensorBoard with Gensim, following this approach:
https://github.com/ArdalanM/gensim2tensorboard
I'll also adjust this code to work with the sense2vec approach (e.g. the words become lowercase in the preprocessing step, just comment it out in the code).
Happy coding,
woltob
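To actually train on the generated file, one option (my own suggestion, not the official sense2vec training script) is to feed the pipe-tagged tokens to gensim's Word2Vec, treating each token|TAG string as a vocabulary item:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# each line of text.txt is one sentence of token|TAG items
sentences = LineSentence('text.txt')
model = Word2Vec(sentences, min_count=1, workers=4)

print(model.wv.most_similar('money|NOUN'))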
The input file should be a bzipped JSON. To use a plain text file, just edit merge_text.py as follows:
def iter_comments(loc):
    with bz2.BZ2File(loc) as file_:
        for i, line in enumerate(file_):
            yield line.decode('utf-8', errors='ignore')
            # yield ujson.loads(line)['body']
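If your file is not bzip2-compressed at all, a further tweak along the same lines (my assumption, not part of the original merge_text.py) would be to open it as plain UTF-8 text:

def iter_comments(loc):
    # plain, uncompressed UTF-8 text: yield one line (document) at a time
    with open(loc, encoding='utf-8', errors='ignore') as file_:
        for i, line in enumerate(file_):
            yield line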