How to determine if a token is part of an entity in spaCy? - python

I have
import spacy

nlp = spacy.load("en_core_web_lg")
line = "Rio de Janeiro is the capital of.."
doc = nlp(line)
for tok in doc:
    print(tok.lemma_)
for ent in doc.ents:
    print(ent.lemma_)
I want to obtain the wikification: "[[Rio de Janeiro]] [[be|is]] [[the]] [[capital]] [[of]].."
How do I determine whether the token "Rio" is part of the entity "Rio de Janeiro"?

Use the ent_type or ent_type_ attribute; if the value is not an empty string, the token is part of an entity.
Edit: the ent_iob or ent_iob_ attribute is more informative: "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set.
import spacy
nlp = spacy.load("en_core_web_lg")
line = "Rio de Janeiro is the capital of.."
doc = nlp(line)
for tok in doc:
    print(tok, tok.ent_type_, tok.ent_iob_)
Output:
Rio GPE B
de GPE I
Janeiro GPE I
is O
the O
capital O
of O
.. O
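Putting both attributes together, here is a minimal sketch of the wikization itself. The [[entity]] / [[lemma|token]] bracket format is copied from the question; treating every non-entity token as [[lemma|text]] (or just [[text]] when the two coincide) is an assumption:

pieces = []
ent_by_start = {ent.start: ent for ent in doc.ents}  # token index -> entity span
i = 0
while i < len(doc):
    if i in ent_by_start:                  # this token begins an entity
        ent = ent_by_start[i]
        pieces.append("[[" + ent.text + "]]")
        i = ent.end                        # skip the rest of the entity's tokens
    else:
        tok = doc[i]
        if tok.lemma_ != tok.text:         # e.g. "is" -> [[be|is]]
            pieces.append("[[" + tok.lemma_ + "|" + tok.text + "]]")
        else:
            pieces.append("[[" + tok.text + "]]")
        i += 1
print(" ".join(pieces))
# roughly: [[Rio de Janeiro]] [[be|is]] [[the]] [[capital]] [[of]] [[..]]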

Entities have start and end properties: indices into the token stream.
I can write:
import spacy

nlp = spacy.load("en_core_web_lg")
line = "Rio de Janeiro is the capital of.."
doc = nlp(line)

j = 0
for ent in doc.ents:
    while j < ent.start:      # tokens before this entity
        print(doc[j])
        j += 1
    print(ent)                # the entity as a single unit
    j = ent.end               # skip over the entity's tokens
while j < len(doc):           # tokens after the last entity
    print(doc[j])
    j += 1

Related

Feature extraction with tweet IDs

I am trying to extract the information of a tweet from its ID. With the ID, I want to get the tweet creation date, the text, the user's location, followers, friends, favourites, profile description, whether they're verified, and the language, but I'm having trouble doing it. Below, I will show the steps I follow.
I have made the following code. To start with, I have the IDs of the tweets in a txt file and I read them as follows:
# Read txt file
txt = '/content/drive/MyDrive/Mini-proyecto Texto/archivo.txt'
with open(txt) as archivo:
    lines = archivo.readlines()
Next, I add each of the IDs to a list:
# Add the IDs to a list
IDs = []
for i in lines:
    IDs.append(i.rsplit())
    #print(i.rsplit())
IDs
#[['1206924075374956547'],
# ['1210912199402819584'],
# ['1210643148998938625'],
# ['1207776839697129472'],
# ['1203627609759920128'],
# ['1205895318212136961'],
# ['1208145724879364100'], ...
Finally, I start extracting the information I need as follows:
# Extract information from tweets
tweets_df2 = pd.DataFrame()

for i in IDs:
    try:
        info_tweet = api.get_status(i, tweet_mode="extended")
    except:
        pass
    tweets_df2 = tweets_df2.append(pd.DataFrame({'ID': info_tweet.id,
                                                 'Tweet': info_tweet.full_text,
                                                 'Creado_tweet': info_tweet.created_at,
                                                 'Locacion_usuario': info_tweet.user.location,
                                                 'Seguidores_usuario': info_tweet.user.followers_count,
                                                 'Amigos_usuario': info_tweet.user.friends_count,
                                                 'Favoritos_usuario': info_tweet.user.favourites_count,
                                                 'Descripcion_usuario': info_tweet.user.description,
                                                 'Verificado_usuario': info_tweet.user.verified,
                                                 'Idioma': info_tweet.lang}, index=[0]))

tweets_df2 = tweets_df2.reset_index(drop=True)
tweets_df2
The following image is the output of the tweets_df2 variable, but I don't understand why the values are repeated over and over again. Does anyone know what's wrong with my code?
If you need the txt I provide you with the link of the drive. https://drive.google.com/file/d/1vyohQMpLqlKqm6b4iTItcVVqL2wBMXWp/view?usp=sharing
Thank you very much in advance for your time :3
Your code basically runs fine for me with a few adjustments. I noticed that your indentation is not correct and that your list of IDs is a list of single-element lists rather than one flat list of IDs.
Try this:
api = tweepy.Client(consumer_key=api_key,
                    consumer_secret=api_key_secret,
                    access_token=access_token,
                    access_token_secret=access_token_secret,
                    bearer_token=bearer_token,
                    wait_on_rate_limit=True,
                    )
auth = tweepy.OAuth1UserHandler(
    api_key, api_key_secret, access_token, access_token_secret
)
api = tweepy.API(auth)

txt = "misocorpus-misogyny.txt"
with open(txt) as archivo:
    lines = archivo.readlines()

IDs = []
for i in lines:
    IDs.append(i.strip())  # <-- use strip() to remove the trailing \n rather than split

tweets_df2 = pd.DataFrame()

for i in IDs:
    try:
        info_tweet = api.get_status(i, tweet_mode="extended")
    except:
        pass
    tweets_df2 = tweets_df2.append(pd.DataFrame({'ID': info_tweet.id,
                                                 'Tweet': info_tweet.full_text,
                                                 'Creado_tweet': info_tweet.created_at,
                                                 'Locacion_usuario': info_tweet.user.location,
                                                 'Seguidores_usuario': info_tweet.user.followers_count,
                                                 'Amigos_usuario': info_tweet.user.friends_count,
                                                 'Favoritos_usuario': info_tweet.user.favourites_count,
                                                 'Descripcion_usuario': info_tweet.user.description,
                                                 'Verificado_usuario': info_tweet.user.verified,
                                                 'Idioma': info_tweet.lang}, index=[0]))

tweets_df2 = tweets_df2.reset_index(drop=True)
tweets_df2
Result:
ID Tweet Creado_tweet Locacion_usuario Seguidores_usuario Amigos_usuario Favoritos_usuario Descripcion_usuario Verificado_usuario Idioma
0 1206924075374956547 Las feminazis quieren por poco que este chico ... 2019-12-17 13:08:17+00:00 Argentina 1683 2709 28982 El Progresismo es un Cáncer que quiere destrui... False es
1 1210912199402819584 #CarlosVerareal #Galois2807 Los halagos con pi... 2019-12-28 13:15:40+00:00 Ecuador 398 1668 3123 Cuando te encuentres n una situación imposible... False es
2 1210643148998938625 #drummniatico No se vaya asustar! Ese es el gr... 2019-12-27 19:26:34+00:00 Samborondon - Ecuador 1901 1432 39508 Todo se alinea a nuestro favor. 💙💙💙💙 False es
3 1210643148998938625 #drummniatico No se vaya asustar! Ese es el gr... 2019-12-27 19:26:34+00:00 Samborondon - Ecuador 1901 1432 39508 Todo se alinea a nuestro favor. 💙💙💙💙 False es
4 1203627609759920128 Mostritas #Feminazi amenazando como ellas sabe... 2019-12-08 10:49:19+00:00 Lima, Perú 2505 3825 45087 Latam News Report. Regional and World affairs.... False
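As a side note, appending to a DataFrame inside the loop rebuilds the frame on every iteration, and a bare except: pass silently re-appends the previous tweet whenever a lookup fails, which is one source of duplicated rows. A minimal sketch of the usual alternative, collecting plain dicts and building the frame once (tweepy v4 is assumed, where the base error class is TweepyException):

import pandas as pd
import tweepy

rows = []
for tweet_id in IDs:
    try:
        t = api.get_status(tweet_id, tweet_mode="extended")
    except tweepy.TweepyException:
        continue  # skip IDs that cannot be fetched instead of re-appending the last tweet
    rows.append({'ID': t.id,
                 'Tweet': t.full_text,
                 'Creado_tweet': t.created_at,
                 'Locacion_usuario': t.user.location,
                 'Idioma': t.lang})  # ...plus the remaining user fields as above

tweets_df2 = pd.DataFrame(rows)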

Named Entity Extraction

I am trying to extract a list of persons using the Stanford Named Entity Recognizer (NER) in Python NLTK. The code and the obtained output are shown below.
Code
from nltk.tag import StanfordNERTagger

st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
sent = 'joel thompson tracy k smith new work world premierenew york philharmonic commission'
strin = sent.title()
value = st.tag(strin.split())

def get_continuous_chunks(tagged_sent):
    continuous_chunk = []
    current_chunk = []
    for token, tag in tagged_sent:
        if tag != "O":
            current_chunk.append((token, tag))
        else:
            if current_chunk:  # if the current chunk is not empty
                continuous_chunk.append(current_chunk)
                current_chunk = []
    # Flush the final current_chunk into the continuous_chunk, if any.
    if current_chunk:
        continuous_chunk.append(current_chunk)
    return continuous_chunk

named_entities = get_continuous_chunks(value)
named_entities_str = [" ".join([token for token, tag in ne]) for ne in named_entities]
print(named_entities_str)
Obtained Output
[('Joel Thompson Tracy K Smith New Work World Premierenew York Philharmonic Commission',
'PERSON')]
Desired Output
Person 1: Joel Thompson
Person 2: Tracy K Smith
Data : New Work World Premierenew York Philharmonic Commission
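For what it's worth, the chunking function alone cannot produce the desired split here: the tagger labelled the whole title-cased string as one contiguous PERSON span, so there is no "O" boundary between the two names. What can be recovered is the tag of each chunk; a small sketch on top of get_continuous_chunks:

# keep the tag of each chunk alongside its text (taking the tag of its first token)
named_entities_tagged = [(" ".join(token for token, tag in ne), ne[0][1])
                         for ne in named_entities]
for n, (text, tag) in enumerate(named_entities_tagged, 1):
    print(f"{tag} {n}: {text}")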

Looking at the next word

I would like to know how I can find pairs of consecutive words that both start with a capital letter.
For example:
ID Testo
141 Vivo in una piccola città
22 Gli Stati Uniti sono una grande nazione
153 Il Regno Unito ha votato per uscire dall'Europa
64 Hugh Laurie ha interpretato Dr. House
12 Mi piace bere birra.
My expected output would be:
ID Testo Estratte
141 Vivo in una piccola città []
22 Gli Stati Uniti sono una grande nazione [Gli Stati, Stati Uniti]
153 Il Regno Unito ha votato per uscire dall'Europa [Il Regno, Regno Unito]
64 Hugh Laurie ha interpretato Dr. House [Hugh Laurie, Dr House]
12 Mi piace bere birra. []
To extract the capitalised words I do:
df['Estratte'] = df['Testo'].str.findall(r'\b([A-Z][a-z]*)\b')
However, this column collects only single words, since the pattern does not look at the next word.
Could you please tell me which condition I should add to look at the next word?
Sometimes regex is not the best tool; let us try split with explode:
s = df.Testo.str.split(' ').explode()
s2 = s.groupby(level=0).shift(-1)
assign = (s + ' ' + s2)[s.str.istitle() & s2.str.istitle()].groupby(level=0).agg(list)

Out[244]:
1    [Gli Stati, Stati Uniti]
2    [Il Regno, Regno Unito]
3    [Hugh Laurie, Dr. House]
Name: Testo, dtype: object

df['New'] = assign
# notice that after the assign, rows with no match will be NaN
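If you want those unmatched rows to show up as empty lists (as in the expected output) rather than NaN, one way, sketched:

df['New'] = df['New'].apply(lambda v: v if isinstance(v, list) else [])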
Maybe you could use my code below
def getCapitalize(myStr):
    words = myStr.split()
    for i in range(0, len(words) - 1):
        if (words[i][0].isupper() and words[i+1][0].isupper()):
            yield f"{words[i]} {words[i+1]}"
This function creates a generator, so you will have to convert the result to a list (or whatever container you need), for example as below.
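A one-liner applied to the example frame (column names as in the question; list() materialises the generator):

df['Estratte'] = df['Testo'].apply(lambda s: list(getCapitalize(s)))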
import re
import pandas as pd

x = {141: 'Vivo in una piccola città', 22: 'Gli Stati Uniti sono una grande nazione',
     153: 'Il Regno Unito ha votato per uscire dall\'Europa', 64: 'Hugh Laurie ha interpretato Dr. House', 12: 'Mi piace bere birra.'}
df = pd.DataFrame(x.items(), columns=['id', 'testo'])

caps = []
vals = df.testo
for string in vals:
    string = string.split(' ')
    string = string[1:]      # drop the first word of the sentence
    string = ' '.join(string)
    caps.append(re.findall('([A-Z][a-z]+)', string))

df['Estratte'] = caps
Why not match a word that starts with a capital letter but is not at the start of the line:
df.Testo.str.findall('(?<!^)([A-Z]\w+)')
or
df.Testo.str.findall('(?<!^)[A-Z][a-z]+')
0 []
1 [Stati, Uniti]
2 [Regno, Unito, Europa]
3 [Laurie, Dr, House]
4 []
I think the simplest is to use the regex module and search for (pattern-space-pattern) with overlapping matches:
import regex as re
df['Estratte'] = df.Testo.apply(lambda x: re.findall('[A-Z][a-z]+[ ][A-Z][a-z]+', x, overlapped=True))
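On the sample column this should yield, for example, [Gli Stati, Stati Uniti] for ID 22 thanks to overlapped=True, and [Hugh Laurie] for ID 64 (Dr. House is not matched because the period breaks the [A-Z][a-z]+ pattern).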

Fetch the line that has two specific keywords

I have a list that contains pair of keywords ('k1', 'k2'). Here's a sample:
print (word_pairs)
--->[('salaire', 'dépense'), ('gratuité', 'argent'), ('causesmwedemwelamwemort', 'cadres'), ('caractèresmwedumwedispositif', 'historique'), ('psychomotricienmwediplôme', 'infirmier'), ('impôtmwesurmwelesmweréunionsmwesportives', 'compensation'), ('affichage', 'affichagemweopinion'), ('délaimweprorogation', 'défaillance'), ('créancemwenotion', 'généralités')]
I have a text file r_isa.txt (205 MB) that contains words that share an "isa" relationship. Here's a sample, where \t represents a literal tab character:
égalité de Parseval\tformule_0.9333\tégalité_1.0
filiation illégitime\tfiliation_1.0
Loi reconnaissant l'égalité\tloi_1.0
égalité entre les sexes\tégalité_1.0
liberté égalité fraternité\tliberté_1.0
This basically means that "égalité de Parseval" isa "formule" with a score of 0.9333 and isa "égalité" with a score of 1.0. And so on.
I want to know, based on the r_isa file, if keyword k1 isa k2, and if k2 isa k1. In the output file, I want to save on each line the pairs of words that do have the isa relationship.
Here's what I did:
#Reading data as list
import ast

keywords = [line for line in open('version_final_PMI_espace.txt', encoding='utf8')]
keywords = ast.literal_eval(keywords[0])

word_pairs = []
for k, v in keywords.items():
    if v:
        word_pairs.append((k, v[0][0]))
len(list(set(word_pairs)))
#####
with open("r_isa.txt", encoding="utf-8") as readfile, open('Hyperonymy_file_pair.txt', 'w') as writefile:
    for line in readfile:
        firstfield = line.split('\t')[0].lower()
        for w in word_pairs:
            if w[0] == firstfield:
                if w[1] in line:
                    writefile.write("".join(w[0]) + "\t" + "".join(w[1]) + "\n")
This returns random pairs to me, for example:
salaire\targent
dépense\tcadres
instead of (in case of an existing isa relationship):
salaire\tdépense
causesmwedemwelamwemort\tcadres
Where did I go wrong?
Updated Answer
The statement if w[1] in line: is highly suspect: it matches w[1] anywhere in the line, even inside the first field or as a substring of a longer word. See the following code for what I believe the logic should be. Since I don't have access to your files, I have turned readfile into a list of strings for testing purposes, and instead of writing output to writefile, I am just printing some results. I have added some values to word_pairs and readfile so that I get some results. Also note that if you are converting the input file to lower case, then your word pairs must also be lower case.
This code checks if k1 isa k2 and if not, then checks if k2 isa k1.
word_pairs = [('égalité de parseval', 'égalité'), ('salaire', 'dépense'), ('gratuité', 'argent'), ('causesmwedemwelamwemort', 'cadres'), ('caractèresmwedumwedispositif', 'historique'), ('psychomotricienmwediplôme', 'infirmier'), ('impôtmwesurmwelesmweréunionsmwesportives', 'compensation'), ('affichage', 'affichagemweopinion'), ('délaimweprorogation', 'défaillance'), ('créancemwenotion', 'généralités')]
word_pairs2 = [(pair[1], pair[0]) for pair in word_pairs]  # reverse the words
word_dict = dict(word_pairs)  # create a dictionary for fast searching
word_dict2 = dict(word_pairs2)

readfile = [
    'égalité de Parseval\tformule_0.9333\tégalité_1.0',
    'filiation illégitime\tfiliation_1.0',
    'Loi reconnaissant l\'égalité\tloi_1.0',
    'égalité entre les sexes\tégalité_1.0',
    'liberté égalité fraternité\tliberté_1.0',
    'dépense\tsalaire_.9'
]

for line in readfile:
    fields = line.lower().split('\t')
    first_word = fields.pop(0)
    isa_word = word_dict.get(first_word, word_dict2.get(first_word))  # check k2 isa k1 if k1 isa k2 fails
    if isa_word is not None:
        for field in fields:  # check each one
            fields2 = field.split('_')
            second_word, score = fields2
            if second_word == isa_word:
                print(first_word, second_word, score)
Prints:
égalité de parseval égalité 1.0
dépense salaire .9
If it is possible that k1 isa k2 and k2 isa k1, then you need the more general (but more complicated) code:
word_pairs = [('égalité de parseval', 'égalité'), ('salaire', 'dépense'), ('gratuité', 'argent'), ('causesmwedemwelamwemort', 'cadres'), ('caractèresmwedumwedispositif', 'historique'), ('psychomotricienmwediplôme', 'infirmier'), ('impôtmwesurmwelesmweréunionsmwesportives', 'compensation'), ('affichage', 'affichagemweopinion'), ('délaimweprorogation', 'défaillance'), ('créancemwenotion', 'généralités')]
word_pairs2 = [(pair[1], pair[0]) for pair in word_pairs]  # reverse the words
word_dict = dict(word_pairs)  # create a dictionary for fast searching
word_dict2 = dict(word_pairs2)

readfile = [
    'égalité de Parseval\tformule_0.9333\tégalité_1.0',
    'filiation illégitime\tfiliation_1.0',
    'Loi reconnaissant l\'égalité\tloi_1.0',
    'égalité entre les sexes\tégalité_1.0',
    'liberté égalité fraternité\tliberté_1.0',
    'salaire\tdépense_1.0',
    'dépense\tsalaire_.9'
]

for line in readfile:
    fields = line.lower().split('\t')
    first_word = fields.pop(0)
    # k1 isa k2?
    isa_word = word_dict.get(first_word)
    if isa_word is not None:
        for field in fields:  # check each one
            fields2 = field.split('_')
            second_word, score = fields2
            if second_word == isa_word:
                print(first_word, second_word, score)
    # k2 isa k1?
    isa_word = word_dict2.get(first_word)
    if isa_word is not None:
        for field in fields:  # check each one
            fields2 = field.split('_')
            second_word, score = fields2
            if second_word == isa_word:
                print(first_word, second_word, score)
Prints:
égalité de parseval égalité 1.0
salaire dépense 1.0
dépense salaire .9
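To run the same check over the real 205 MB r_isa.txt, the test list can be swapped for the file handle, which is iterated line by line, and matches written to the output file instead of printed. A sketch, reusing word_dict and word_dict2 from above:

with open('r_isa.txt', encoding='utf-8') as readfile, \
     open('Hyperonymy_file_pair.txt', 'w', encoding='utf-8') as writefile:
    for line in readfile:
        fields = line.lower().rstrip('\n').split('\t')
        first_word = fields.pop(0)
        isa_word = word_dict.get(first_word, word_dict2.get(first_word))
        if isa_word is not None:
            for field in fields:
                second_word = field.rsplit('_', 1)[0]  # strip the trailing _score
                if second_word == isa_word:
                    writefile.write(first_word + '\t' + isa_word + '\n')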
kw = [('salaire', 'dépense'),
      ('gratuité', 'argent'),
      ('causesmwedemwelamwemort', 'cadres'),
      ('caractèresmwedumwedispositif', 'historique'),
      ('psychomotricienmwediplôme', 'infirmier'),
      ('impôtmwesurmwelesmweréunionsmwesportives', 'compensation'),
      ('affichage', 'affichagemweopinion'),
      ('délaimweprorogation', 'défaillance'),
      ('créancemwenotion', 'généralités')]

lines_from_file = ['égalité de Parseval\tformule_0.9333\tégalité_1.0',
                   'filiation illégitime\tfiliation_1.0',
                   'Loi reconnaissant l\'égalité\tloi_1.0',
                   'égalité entre les sexes\tégalité_1.0',
                   'liberté égalité fraternité\tliberté_1.0',
                   'créancemwenotion\tgénéralités_1.0',
                   'généralités\tcréancemwenotion_1.0']

who_is_who_dict = {}
for line in lines_from_file:
    words = line.split('\t')
    key = words[0]
    other_words = [w.split('_')[0] for w in words[1:]]
    if key in who_is_who_dict:
        who_is_who_dict[key] = who_is_who_dict[key] + other_words
    else:
        who_is_who_dict[key] = other_words

pairs_to_write = []
for kw1, kw2 in kw:
    if (kw1 in who_is_who_dict and kw2 in who_is_who_dict[kw1]
            and kw2 in who_is_who_dict and kw1 in who_is_who_dict[kw2]):
        pairs_to_write.append((kw1, kw2))

print(pairs_to_write)
Output:
[('créancemwenotion', 'généralités')]

Parsing XML with Python

I have a problem with my Python parsing. I have this kind of XML file:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Trans SYSTEM "trans-14.dtd">
<Trans scribe="maria" audio_filename="agora_2007_11_05_a" version="11" version_date="080826" xml:lang="catalan">
  <Topics>
    <Topic id="to1" desc="music"/>
    <Topic id="to2" desc="bgnoise"/>
    <Topic id="to4" desc="silence"/>
    <Topic id="to5" desc="speech"/>
    <Topic id="to6" desc="speech+music"/>
  </Topics>
  <Speakers>
    <Speaker id="spk1" name="Xavi Coral" check="no" type="male" dialect="native" accent="catalan" scope="local"/>
    <Speaker id="spk2" name="Ferran Martínez" check="no" type="male" dialect="native" accent="catalan" scope="local"/>
    <Speaker id="spk3" name="Jordi Barbeta" check="no" type="male" dialect="native" accent="catalan" scope="local"/>
  </Speakers>
  <Section type="report" topic="to6" startTime="111.286" endTime="119.308">
    <Turn speaker="spk1" startTime="111.286" endTime="119.308" mode="planned" channel="studio">
      <Sync time="111.286"/>
      ha estat director del diari La Vanguàrdia,
      <Sync time="113.56"/>
      ha estat director general de Barcelona Televisió i director del Centre Territorial de Televisió Espanyola a Catalunya,
      <Sync time="119.308"/>
      actualment col·labora en el diari
      <Event desc="es" type="language" extent="begin"/>
      El Periódico
      <Event desc="es" type="language" extent="end"/>
      de Catalunya.
    </Turn>
  </Section>
And this is my Python code:
import xml.etree.ElementTree as etree
import os
import sys

xmlD = etree.parse(sys.stdin)
root = xmlD.getroot()
sections = root.getchildren()[2].getchildren()

for section in sections:
    turns = section.getchildren()
    for turn in turns:
        speaker = turn.get('speaker')
        mode = turn.get('mode')
        childs = turn.getchildren()
        for child in childs:
            time = child.get('time')
            opt = child.get('desc')
            extent = child.get('extent')
            if opt == 'es' and extent == 'begin':
                opt = "ESP:"
            elif opt == "la" and extent == 'begin':
                opt = "LAT:"
            elif opt == "en" and extent == 'begin':
                opt = "ENG:"
            else:
                opt = ""
            if not time:
                time = ""
            print time, opt + child.tail.encode('latin-1')
I need to mark words pronounced in another language with a LANG: tag. For example:
spanish words ENG:hello, spanish words. But when there are two consecutive words pronounced in the other language I don't know how to get this: spanish words ENG:hello ENG:man, spanish words. The change of language is marked by the Event XML tag.
Right now the output I have is:
actualment col·labora en el diari ESP:El Periódico de Catalunya. and I want: actualment col·labora en el diari ESP:El ESP:Periódico de Catalunya.
Could anyone help me?
Thank you!
You can do something like -
print time, (opt + (" " + opt).join(child.tail.split(' '))).encode('latin-1')
instead of your print statement.
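With opt == "ESP:" and a tail of "El Periódico", the join puts the prefix before every word in the tail rather than only once, giving ESP:El ESP:Periódico as desired; when opt is the empty string the text is printed unchanged.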
