Problems preserving the occurrence in a regex? - python

I have a very large string s. Each line of s consists of word_1, followed by word_2, an id and a number:
word_1 word_2 id number
I would like to create a regex that collects in a list every occurrence of three consecutive lines whose ids are RN_ _ _, then VA_ _ _ _, then VM_ _ _ _, where each _ is a free character of the id string (there can be more than three of them). The constraint is that the three lines must appear one right after another, e.g.:
casa casa NCFS000 0.979058
mejor mejor AQ0CS0 0.873665
que que PR0CN000 0.562517
mejor mejor AQ0CS0 0.873665
no no RN
esta estar VASI1S0
lavando lavar VMP00SM
. . Fp 1
This is the pattern I would like to extract, since the three lines are placed one after another. The desired output is a list:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM')]
The following, for example, must not match, since the lines are not one after another:
error error RN
error error VASI1S0
pues pues CS 0.998047
error error VMP00SM
So for the s string:
s = '''
No no RN 0.998045
sabía saber VMII3S0 0.592869
como como CS 0.999289
se se P00CN000 0.465639
ponía poner VMII3S0 0.65
una uno DI0FS0 0.951575
error error RN
actuar accion VMP00SM
lavadora lavadora NCFS000 0.414738
hasta hasta SPS00 0.957698
error error VMP00SM
que que PR0CN000 0.562517
conocí conocer VMIS1S0 1
esta este DD0FS0 0.986779
error error VA00SM
y y CC 0.999962
es ser VSIP3S0 1
que que CS 0.437483
es ser VSIP3S0 1
muy muy RG 1
sencilla sencillo AQ0FS0 1
de de SPS00 0.999984
utilizar utilizar VMN0000 1
! ! Fat 1
Todo todo DI0MS0 0.560961
un uno DI0MS0 0.987295
gustazo gustazo NCMS000 1
error error VA00SM
cuando cuando CS 0.985595
estamos estar VAIP1P0 1
error error VMP00RM
aprendiendo aprender VMG0000 1
para para SPS00 0.999103
emancipar emancipar VMN0000 1
nos nos PP1CP000 1
, , Fc 1
que que CS 0.437483
si si CS 0.99954
error error RN
nos nos PP1CP000 0.935743
ponen poner VMIP3P0 1
facilidad facilidad NCFS000 1
con con SPS00 1
las el DA0FP0 0.970954
error error VMP00RM
tareas tarea NCFP000 1
de de SPS00 0.999984
no no RN 0.998134
estás estar VAIP2S0 1
condicionado condicionar VMP00SM 0.491858
alla alla VASI1S0
la el DA0FS0 0.972269
casa casa NCFS000 0.979058
error error RN
error error VASI1S0
pues pues CS 0.998047
error error VMP00SM
mejor mejor AQ0CS0 0.873665
que que PR0CN000 0.562517
mejor mejor AQ0CS0 0.873665
no no RN 1
esta estar VASI1S0 0.908900
lavando lavar VMP00SM 0.9080972
. . Fp 1
'''
This is what I tried:
import re
weird_triple = re.findall(r'(?s)(\w+\s+RN)(?:(?!\s(?:RN|VA|VM)).)*?(\w+\s+VA\w+)(?:(?!\s(?:RN|VA|VM)).)*?(\w+\s+VM\w+)', s)
print "\n This is the weird triple\n"
print weird_triple
The problem with this approach is that it returns a list of RN_ _ _ _, VA_ _ _ _, VM_ _ _ triples, but without enforcing the one-after-another order (some ids and words in between are also being matched). Any idea how to fix this in order to obtain:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM'),('estar VAIP2S0','condicionar VMP00SM', 'alla VASI1S0')]
Thanks in advance guys!
UPDATE
I tried the approaches that other users recommended, but the problem is that if I add another one-after-another pattern like:
no no RN 0.998134
estás estar VAIP2S0 1
condicionado condicionar VMP00SM 0.491858
to the s string, the regexes recommended in the answers to this question don't work. They only catch:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM')]
Instead of:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM'),('estar VAIP2S0','condicionar VMP00SM', 'alla VASI1S0')]
Which is the right result. Any idea how to get the one-after-another pattern output:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM'),('estar VAIP2S0','condicionar VMP00SM', 'alla VASI1S0')]

Here you go (compile it with re.MULTILINE so that ^ and $ match at every line boundary):
^[\w]* (\w* RN) \d(?:\.\d*)?$\s^[^\s]* (\w* VA[^\s]*) \d(?:\.\d*)?$\s^[^\s]* (\w* VM[^\s]*) \d(?:\.\d*)?$
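A minimal usage sketch for the pattern above, written for Python 3 and assuming every line ends with a number (the pattern requires it). The sample text is a shortened, hypothetical excerpt of the question's data; the pattern itself is unchanged.

```python
import re

# The pattern from the answer, unchanged. re.MULTILINE makes ^ and $
# match at the start and end of every line, not only of the whole string.
pattern = re.compile(
    r'^[\w]* (\w* RN) \d(?:\.\d*)?$\s'
    r'^[^\s]* (\w* VA[^\s]*) \d(?:\.\d*)?$\s'
    r'^[^\s]* (\w* VM[^\s]*) \d(?:\.\d*)?$',
    re.MULTILINE,
)

# Shortened sample: one valid RN/VA/VM run and one run broken by a CS line.
sample = """de de SPS00 0.999984
no no RN 0.998134
estás estar VAIP2S0 1
condicionado condicionar VMP00SM 0.491858
error error RN 1
error error VASI1S0 1
pues pues CS 0.998047
error error VMP00SM 1
"""

triples = pattern.findall(sample)
print(triples)  # [('no RN', 'estar VAIP2S0', 'condicionar VMP00SM')]
```

Lines that lack the trailing number (such as "no no RN" in the question) will not match this pattern.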

A little late, but mine is similar:
import re
print re.findall(r'\w+ (\w+ RN.*)\n\s*\w+ (\w+ VA.*)\n\s*\w+ (\w+ VM.*)',s)
Output:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM')]
If you make the source string a Unicode string (u"xxxx") or use s.decode(encoding) to transform to a Unicode string, you can handle the accents added to your question update. Make sure to declare the source file encoding:
# coding: utf8
import re
s = u'''
(big string in question)
'''
print re.findall(ur'\w+ (\w+ RN.*)\n\s*\w+ (\w+ VA.*)\n\s*\w+ (\w+ VM.*)',s,re.UNICODE)
Output:
[(u'no RN 0.998134', u'estar VAIP2S0 1', u'condicionar VMP00SM 0.491858'), (u'no RN 1', u'estar VASI1S0 0.908900', u'lavar VMP00SM 0.9080972')]

(?:\s*\S+ (\S+ RN\S*)(?:\s*\S*))\n(?: *\S+ (\S+ VA\S*)(?:\s*\S*))\n(?: *\S+ (\S+ VM\S*)(?: *\S*))
It works for your example.
In [40]: s = '''
....: No no RN 0.998045
....: sabía saber VMII3S0 0.592869
....: . . Fp 1
....: '''
In [41]: import re
In [42]: p = re.compile(ur'(?:\s*\S+ (\S+ RN\S*)(?:\s*\S*))\n(?: *\S+ (\S+ VA\S*)(?:\s*\S*))\n(?: *\S+ (\S+ VM\S*)(?: *\S*))')
In [43]: re.findall(p, s)
Out[43]:
[('no RN', 'estar VAIP2S0', 'condicionar VMP00SM'),
('no RN', 'estar VASI1S0', 'lavar VMP00SM')]
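If the regex gets hard to maintain, the same one-after-another rule can also be checked without a regex. The following is only a sketch for Python 3, not one of the answers above: `consecutive_triples` is a hypothetical helper name, and it assumes each non-empty line has at least three whitespace-separated fields (word, lemma, id), with the number being optional.

```python
def consecutive_triples(text, prefixes=("RN", "VA", "VM")):
    """Return (lemma id) strings for runs of adjacent lines whose ids
    start with the given prefixes, in that order."""
    # One list of fields per non-empty line.
    rows = [line.split() for line in text.splitlines() if line.strip()]
    triples = []
    for i in range(len(rows) - len(prefixes) + 1):
        window = rows[i:i + len(prefixes)]
        # Every line in the window must carry the expected id prefix.
        if all(len(row) >= 3 and row[2].startswith(prefix)
               for row, prefix in zip(window, prefixes)):
            triples.append(tuple(f"{row[1]} {row[2]}" for row in window))
    return triples


sample = """
de de SPS00 0.999984
no no RN 0.998134
estás estar VAIP2S0 1
condicionado condicionar VMP00SM 0.491858
error error RN
error error VASI1S0
pues pues CS 0.998047
error error VMP00SM
no no RN
esta estar VASI1S0
lavando lavar VMP00SM
. . Fp 1
"""

triples = consecutive_triples(sample)
print(triples)
# [('no RN', 'estar VAIP2S0', 'condicionar VMP00SM'), ('no RN', 'estar VASI1S0', 'lavar VMP00SM')]
```

Because the check works on split fields, it does not depend on whether the trailing number is present or on how \w treats accented letters.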

Related

Identify the names of the interlocutors (most frequent words that follow a certain pattern) to separate the dialog lines of a chat using regex

import re
# Read the input data file
with open("dm_chat_data.txt") as input_data_file:
    print(input_data_file.read())
# Write the corrections to a new text file
with open('dm_chat_data_fixed.txt', 'w') as file:
    file.write('\n')
This is the text file extracted by web scraping. The dialog lines of the chat partners are not separated, so the program must identify where each user's message starts.
File dm_chat_data.txt
Desempleada_19: HolaaLucyGirl: hola como estas?Desempleada_19: Masomenos y vos LucyGirl?Desempleada_19: Q edad tenes LucyGirl: tengo 19LucyGirl: masomenos? que paso? (si se puede preguntar claro)Desempleada_19: Yo tmb 19 me llamo PriscilaDesempleada_19: Desempleada_19: Q hacías LucyGirl: Entre al chat para ver que onda, no frecuento mucho
Charge file [100%] (ddddfdfdfd)
LucyGirl: Desempleada_19: Gracias!
AndrewSC: HolaAndrewSC: Si quieres podemos hablar LyraStar: claro LyraStar: que cuentas amigaAndrewSC: Todo bien y tú?
Charge file [100%] (ddddfdfdfd)
LyraStar: LyraStar: que tal ese auto?AndrewSC: Creo que...Diria que... ya son las 19 : 00 hs AndrewSC: Muy bien la verdad
Bsco_Pra_Cap_: HolaBsco_Pra_Cap_: como vaBsco_Pra_Cap_: Jorge, 47, de Floresta, me presento a la entrevista, vos?Bsco_Pra_Cap_: es aqui, cierto?LucyFlame: holaaLucyFlame: estas?LucyFlame: soy una programadora de la ciudad de HudsonBsco_Pra_Cap_: de Hudson centro? o hudson alejado...?Bsco_Pra_Cap_: contame, Lu, que buscas en esta organizacion?
And this is the file that must be created, with the messages of each interlocutor in each chat on separate lines. The file edited_dm_chat_data.txt needs to look like this:
Desempleada_19: Holaa
LucyGirl: hola como estas?
Desempleada_19: Masomenos y vos LucyGirl?
Desempleada_19: Q edad tenes
LucyGirl: tengo 19
LucyGirl: masomenos? que paso? (si se puede preguntar claro)
Desempleada_19: Yo tmb 19 me llamo Priscila
Desempleada_19:
Desempleada_19: Q hacías
LucyGirl: Entre al chat para ver que onda, no frecuento mucho
Charge file [100%] (ddddfdfdfd)
LucyGirl:
Desempleada_19: Gracias!
AndrewSC: Hola
AndrewSC: Si quieres podemos hablar
LyraStar: claro
LyraStar: que cuentas amiga
AndrewSC: Todo bien y tú?
Charge file [100%] (ddddfdfdfd)
LyraStar: LyraStar: que tal ese auto?
AndrewSC: Creo que...Diria que... ya son las 19 : 00 hs
AndrewSC: Muy bien la verdad
Bsco_Pra_Cap_: Hola
Bsco_Pra_Cap_: como va
Bsco_Pra_Cap_: Jorge, 47, de Floresta, me presento a la entrevista, vos?Bsco_Pra_Cap_: es aqui, cierto?
LucyFlame: holaa
LucyFlame: estas?
LucyFlame: soy una programadora de la ciudad de Hudson
Bsco_Pra_Cap_: de Hudson centro? o hudson alejado...?
Bsco_Pra_Cap_: contame, Lu, que buscas en esta organizacion?
I have tried to use a regex in which each interlocutor is represented by a "word" that begins with an uppercase letter and is immediately followed by ": ".
But some lines break this logic, for example "Bsco_Pra_Cap_: HolaBsco_Pra_Cap_: como va". Here "Hola" is an ordinary word, not a name, but it is glued to the name that follows and starts with a capital letter, so the regex would take "HolaBsco_Pra_Cap_: " as a name. That is incorrect: the user's name is "Bsco_Pra_Cap_: ".
This problem arises because we don't know in advance what the nicknames will be. The only thing we know is their structure: they start with a capital letter and end in ":" followed by a space. One thing I've noticed is that in all chats the names of the conversation partners are the most repeated words, so I think I could use a regular expression as a word frequency counter, with a search criterion like "[INITIAL CAPITAL LETTER] hjasahshjas: ", and use as line separators those substrings that match it, as long as they are the ones repeated most often throughout the file.
input_data_file = open("dm_chat_data.txt", "r+")
#maybe you can use something like this to count the occurrences and thus identify the nicknames
input_data_file.count(r"[A-Z][^A-Z]*:\s")
I think this is quite hard, but you can build a set of rules, as in the code below:
import nltk
from collections import Counter
text = '''Desempleada_19: HolaaLucyGirl: hola como estas?Desempleada_19:
Masomenos y vos LucyGirl?Desempleada_19: Q edad tenes LucyGirl: tengo
19LucyGirl: masomenos? que paso? (si se puede preguntar claro)Desempleada_19: Yo
tmb 19 me llamo PriscilaDesempleada_19: Desempleada_19: Q hacías LucyGirl: Entre
al chat para ver que onda, no frecuento mucho
Charge file [100%] (ddddfdfdfd)
LucyGirl: Desempleada_19: Gracias!
AndrewSC: HolaAndrewSC: Si quieres podemos hablar LyraStar: claro LyraStar: que
cuentas amigaAndrewSC: Todo bien y tú?
Charge file [100%] (ddddfdfdfd)
LyraStar: LyraStar: que tal ese auto?AndrewSC: Creo que...Diria que... ya son
las 19 : 00 hs AndrewSC: Muy bien la verdad
Bsco_Pra_Cap_: HolaBsco_Pra_Cap_: como vaBsco_Pra_Cap_: Jorge, 47, de Floresta,
me presento a la entrevista, vos?Bsco_Pra_Cap_: es aqui, cierto?LucyFlame:
holaaLucyFlame: estas?LucyFlame: soy una programadora de la ciudad de
HudsonBsco_Pra_Cap_: de Hudson centro? o hudson alejado...?Bsco_Pra_Cap_:
contame, Lu, que buscas en esta organizacion?
'''
data = nltk.word_tokenize(text)
user_lst = []
for ind, val in enumerate(data):
    if val == ':':
        user_lst.append(data[ind - 1])
# Keep only the names that appear more than once, assuming each user speaks
# more than one time. If a user has a single dialog line, this fails.
users = [k for k, v in Counter(user_lst).items() if v > 1]
# function to replace a string:
def replacer(string, lst):
    for v in lst:
        string = string.replace(v, f' {v}')
    return string
# put a single space in front of every user name in the old text.
refined_text = replacer(text, users)
refined_data = nltk.word_tokenize(refined_text)
correct_users = []
dialog = []
for ind, val in enumerate(refined_data):
    if val == ':':
        correct_users.append(refined_data[ind - 1])
    if val not in users:
        dialog.append(val)
correct_dialog = ' '.join(dialog).replace(':', '<:').split('<')
strip_dialog = [i.strip() for i in correct_dialog if i.strip()]
chat = []
for i in range(len(correct_users)):
    chat.append(f'{correct_users[i]}{strip_dialog[i]}')
print(chat)
>>>> ['Desempleada_19: Holaa', 'LucyGirl: hola como estas ?', 'Desempleada_19: Masomenos y vos ?', 'Desempleada_19: Q edad tenes', 'LucyGirl: tengo 19', 'LucyGirl: masomenos ? que paso ? ( si se puede preguntar claro )', 'Desempleada_19: Yo tmb 19 me llamo Priscila', 'Desempleada_19:', 'Desempleada_19: Q hacías', 'LucyGirl: Entre al chat para ver que onda , no frecuento mucho Charge file [ 100 % ] ( ddddfdfdfd )', 'LucyGirl:', 'Desempleada_19: Gracias !', 'AndrewSC: Hola', 'AndrewSC: Si quieres podemos hablar', 'LyraStar: claro', 'LyraStar: que cuentas amiga', 'AndrewSC: Todo bien y tú ? Charge file [ 100 % ] ( ddddfdfdfd )', 'LyraStar:', 'LyraStar: que tal ese auto ?', 'AndrewSC: Creo que ... Diria que ... ya son las 19', '19: 00 hs', 'AndrewSC: Muy bien la verdad', 'Bsco_Pra_Cap_: Hola', 'Bsco_Pra_Cap_: como va', 'Bsco_Pra_Cap_: Jorge , 47 , de Floresta , me presento a la entrevista , vos ?', 'Bsco_Pra_Cap_: es aqui , cierto ?', 'LucyFlame: holaa', 'LucyFlame: estas ?', 'LucyFlame: soy una programadora de la ciudad de Hudson', 'Bsco_Pra_Cap_: de Hudson centro ? o hudson alejado ... ?', 'Bsco_Pra_Cap_: contame , Lu , que buscas en esta organizacion ?']

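The frequency idea from the question can also be sketched with the standard library only. This is a hedged alternative, not the method of the answer above: `split_chat` and `min_count` are hypothetical names, the sample nicknames are invented, and it assumes nicknames consist of word characters and are always followed by ": ". A candidate such as "HolaBob" is counted as a vote for "Bob" because it ends with it.

```python
import re
from collections import Counter


def split_chat(text, min_count=2):
    """Split glued chat text into one 'name: message' line per message."""
    # Candidate speaker labels: a run of word characters right before ": ".
    candidates = re.findall(r"(\w+): ", text)
    # Score each candidate by how many candidates end with it, so that a
    # glued token like "HolaBob" also counts towards "Bob".
    scores = Counter()
    for cand in set(candidates):
        scores[cand] = sum(1 for other in candidates if other.endswith(cand))
    names = [name for name, count in scores.items() if count >= min_count]
    if not names:
        return [text.strip()]
    # Longest names first, so a longer name wins over a shorter suffix.
    names.sort(key=len, reverse=True)
    label = re.compile("(" + "|".join(map(re.escape, names)) + "): ")
    # Insert a line break in front of every recognised speaker label.
    marked = label.sub(lambda m: "\n" + m.group(0), text)
    return [line.strip() for line in marked.split("\n") if line.strip()]


sample = "Ana_1: HolaBob: hola como estas?Ana_1: bien y vos Bob?Bob: todo bien"
lines = split_chat(sample)
print(lines)
# ['Ana_1: Hola', 'Bob: hola como estas?', 'Ana_1: bien y vos Bob?', 'Bob: todo bien']
```

Like the NLTK version, this misses a user whose label shows up only once, and it keeps the original punctuation spacing because the text is never tokenized.
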
How to make this regex ((?:\w\s*)+) extract substrings that include dots, commas and/or line breaks?

How can I make this regex capture the whole string in the middle without being cut off when it encounters periods, commas, colons or semicolons?
The only case where the regex should not capture the text is when there is a line break between the two delimiters.
import re
input_text = "cerca de abbaab como estas?. Creo yo que bien,aunque solo haya 9 de ellas pero : no estoy muy segura ccccrrru, y luego..."
some_text = "\s*((?:\w\s*)+)\s*" #need to fix this
regex_pattern = r"(?:abbaab)" + some_text + r"(?:ccccrrru)"
m1 = re.search(regex_pattern, input_text, re.IGNORECASE)
if(m1):
    association = m1.group()
    print(repr(association)) #output
And the correct output is:
' como estas?. Creo yo que bien,aunque solo haya 9 de ellas pero : no estoy muy segura '
And how should I modify the regex to cover line breaks as well? For example for this input:
input_text = """cerca de abbaab como estas?.
Creo yo que bien,aunque solo haya 9 de ellas.
pero : no estoy muy segura ccccrrru, y luego...
Quizas sea."""
And the correct output for this case is:
' como estas?.
Creo yo que bien,aunque solo haya 9 de ellas.
pero : no estoy muy segura '
You could just make a character class with the additional letters you want:
some_text = r'\s*((?:[a-z0-9.,:.?]+\s+)+)'
Or simplify life and just use . to match any character:
some_text = r'\s*(.*?)'
If you use the latter solution, making the regex match line breaks as well is as simple as adding the re.DOTALL flag:
import re
some_text = r'\s*(.*?)'
regex_pattern = r"(?:abbaab)" + some_text + r"(?:ccccrrru)"
input_text = "cerca de abbaab como estas?. Creo yo que bien,aunque solo haya 9 de ellas pero : no estoy muy segura ccccrrru, y luego..."
m1 = re.search(regex_pattern, input_text, re.IGNORECASE)
print(m1.group(1))
input_text = """cerca de abbaab como estas?.
Creo yo que bien,aunque solo haya 9 de ellas.
pero : no estoy muy segura ccccrrru, y luego...
Quizas sea."""
m1 = re.search(regex_pattern, input_text, re.IGNORECASE)
print(m1)
m1 = re.search(regex_pattern, input_text, re.IGNORECASE | re.DOTALL)
print(m1.group(1))
Output:
como estas?. Creo yo que bien,aunque solo haya 9 de ellas pero : no estoy muy segura
None
como estas?.
Creo yo que bien,aunque solo haya 9 de ellas.
pero : no estoy muy segura
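The two behaviours above (stop at a line break, or span line breaks) can be wrapped in one small function. This is only a sketch: `text_between` and its `allow_newlines` parameter are hypothetical names, and `re.escape` is used on the delimiters on the assumption that they are literal text rather than patterns.

```python
import re


def text_between(text, start, end, allow_newlines=False):
    """Return the shortest text between start and end, or None."""
    flags = re.IGNORECASE
    if allow_newlines:
        # With DOTALL the dot also matches "\n", so the capture may span lines.
        flags |= re.DOTALL
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, text, flags)
    return match.group(1) if match else None


single = "cerca de abbaab como estas?. Creo yo que bien ccccrrru, y luego..."
multi = "cerca de abbaab como estas?.\nCreo yo que bien ccccrrru, y luego..."

print(repr(text_between(single, "abbaab", "ccccrrru")))
# ' como estas?. Creo yo que bien '
print(text_between(multi, "abbaab", "ccccrrru"))
# None
print(repr(text_between(multi, "abbaab", "ccccrrru", allow_newlines=True)))
# ' como estas?.\nCreo yo que bien '
```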

Problems with python regex encoding?

I have a large .txt file that is made up of: word1, word2, id, number as follows:
s = '''
Vaya ir VMM03S0 0.427083
mañanita mañana RG 0.796611
, , Fc 1
buscando buscar VMG0000 1
una uno DI0FS0 0.951575
lavadora lavadora NCFS000 0.414738
con con SPS00 1
la el DA0FS0 0.972269
que que PR0CN000 0.562517
sorprender sorprender VMN0000 1
a a SPS00 0.996023
una uno DI0FS0 0.951575
persona persona NCFS000 0.98773
muy muy RG 1
especiales especial AQ0CS0 1
para para SPS00 0.999103
nosotros nosotros PP1MP000 1
, , Fc 1
y y CC 0.999962
la lo PP3FSA00 0.0277039
encontramos encontrar VMIP1P0 0.65
. . Fp 1
Pero pero CC 0.999764
vamos ir VMIP1P0 0.655914
a a SPS00 0.996023
lo el DA0NS0 0.457533
que que PR0CN000 0.562517
interesa interesar VMIP3S0 0.994868
LO_QUE_INTERESA_La lo_que_interesa_la NP00000 1
lavadora lavador AQ0FS0 0.585262
tiene tener VMIP3S0 1
una uno DI0FS0 0.951575
clasificación clasificación NCFS000 1
A+ a+ NP00000 1
, , Fc 1
de de SPS00 0.999984
las el DA0FP0 0.970954
que que PR0CN000 0.562517
ahorran ahorrar VMIP3P0 1
energía energía NCFS000 1
, , Fc 1
si si CS 0.99954
no no RN 0.998134
me me PP1CS000 0.89124
equivoco equivocar VMIP1S0 1
. . Fp 1
Lava lavar VMIP3S0 0.397388
hasta hasta SPS00 0.957698
7 7 Z 1
kg kilogramo NCMN000 1
, , Fc 1
no no RN 0.998134
está estar VAIP3S0 0.999201
nada nada RG 0.135196
mal mal RG 0.497537
, , Fc 1
se se P00CN000 0.465639
le le PP3CSD00 1
veía ver VMII3S0 0.62272
un uno DI0MS0 0.987295
gran gran AQ0CS0 1
tambor tambor NCMS000 1
( ( Fpa 1
de de SPS00 0.999984
acero acero NCMS000 0.973481
inoxidable inoxidable AQ0CS0 1
) ) Fpt 1
y y CC 0.999962
un uno DI0MS0 0.987295
consumo consumo NCMS000 0.948927
máximo máximo AQ0MS0 0.986111
de de SPS00 0.999984
49 49 Z 1
litros litro NCMP000 1
Mandos mandos NP00000 1
intuitivos intuitivo AQ0MP0 1
, , Fc 1
todo todo PI0MS000 0.43165
muy muy RG 1
bien bien RG 0.902728
explicado explicar VMP00SM 1
, , Fc 1
nada nada PI0CS000 0.850279
que que PR0CN000 0.562517
ver ver VMN0000 0.997382
con con SPS00 1
hola RG 0.90937838
como VMP00SM 1
estas AQ089FG 0.90839
la el DA0FS0 0.972269
lavadora lavadora NCFS000 0.414738
de de SPS00 0.999984
casa casa NCFS000 0.979058
de de SPS00 0.999984
mis mi DP1CPS 0.995868
padres padre NCMP000 1
Además además NP00000 1
también también RG 1
seca seco AQ0FS0 0.45723
preciadas preciar VMP00PF 1
. . Fp 1'''
For example, from the s "file" I would like to extract the ids that start with RG and AQ together with their word2, but only when they occur one right after the other. In the example above, these lines are in that one-after-another order:
muy muy RG 1
especiales especial AQ0CS0 1
These lines, for example, are not one after another, so I would not like to extract them in a tuple:
hola RG 0.90937838
como VMP00SM 1
estas AQ089FG 0.90839
I would like to create a regex that extracts, in a list of tuples, only the word2 followed by its id, like this: [('word2','id')], over the whole .txt file and for all the words that satisfy the one-after-another order. For the example above these are the only valid matches:
muy muy RG 1
especiales especial AQ0CS0 1
and
también también RG 1
seca seco AQ0FS0 0.45723
Then return them in a tuple with its full id, since they preserve the one after another order:
[('muy', 'RG', 'especial', 'AQ0CS0'), ('también', 'RG', 'seco', 'AQ0FS0')]
I tried the following:
in:
t = re.findall(r'(\w+)\s*(RG)[^\n]*\n[^\n]*?(\w+)\s*(AQ\w*)', s)
print t
But my output is wrong, since it drops the accented character and everything before it:
out:
[('muy', 'RG', 'especial', 'AQ0CS0'), ('n', 'RG', 'seco', 'AQ0FS0')]
instead of the correct result:
[('muy', 'RG', 'especial', 'AQ0CS0'), ('también', 'RG', 'seco', 'AQ0FS0')]
Could someone help me understand what happened in the example above, and how to fix it so that it catches the word2 and id pairs that preserve the one-after-another occurrence? Thanks in advance guys.
In Python 2, with 8-bit strings (str), \w matches [0-9a-zA-Z_]. However, if you use unicode strings and compile your pattern with the re.UNICODE flag, then \w matches word characters based on the Unicode database.
Python documentation 7.2.1 regular expression syntax:
When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
Thus you can do
u = s.decode('UTF-8') # or whatever encoding is in your text file
t = re.findall(r'(\w+)\s*(RG)[^\n]*\n[^\n]*?(\w+)\s*(AQ\w*)', u, re.UNICODE)
In Python 3 much of the str/unicode confusion is gone; when you open a file in text mode and read its contents, you will get a Python 3 str object that handles everything as Unicode characters.
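As a rough illustration of the Python 3 behaviour, the pattern from the question can be run unchanged on a str, with no flag and no decoding step. The sample below is a shortened, hypothetical excerpt of the question's data.

```python
import re

# Shortened sample: two valid RG/AQ runs and one run broken by a VM line.
s = """muy muy RG 1
especiales especial AQ0CS0 1
hola RG 0.90937838
como VMP00SM 1
estas AQ089FG 0.90839
también también RG 1
seca seco AQ0FS0 0.45723
"""

# Same pattern as in the question. In Python 3, \w on a str pattern is
# Unicode-aware by default, so "también" is matched as a whole word.
pairs = re.findall(r'(\w+)\s*(RG)[^\n]*\n[^\n]*?(\w+)\s*(AQ\w*)', s)
print(pairs)
# [('muy', 'RG', 'especial', 'AQ0CS0'), ('también', 'RG', 'seco', 'AQ0FS0')]
```

When the data comes from a file, opening it in text mode with an explicit encoding, for example open(path, encoding='utf-8'), yields such a str.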
It seems that \w+ doesn't recognize special characters such as é.
So if the fields in your text are strictly separated by whitespace, you can replace \w with \S.
The regex will be:
t = re.findall(r'(\S+)\s*(RG)[^\n]*\n[^\n]*?(\S+)\s*(AQ\S*)', s)

problems with a regex in python?

I have a big string like this:
está estar VAIP3S0 0.999201
en en SPS00 1
el el DA0MS0 1
punto punto NCMS000 1
medio medio AQ0MS0 0.314286
. . Fp 1
Es ser VSIP3S0 1
de de SPS00 0.999984
color color NCMS000 1
blanco blanco AQ0MS0 0.598684
y y CC 0.999962
tiene tener VMIP3S0 1
carga carga NCFS000 0.952569
frontal frontal AQ0CS0 0.657209
, , Fc 1
no no RN 0.902728
estaba estar VAII1S0 0.5
equilibrada equilibrar VMP00SF 1
. . Fp 1'''
I would like to extract the ids that match RN, VA_ _ _ _ _ and VMP_ _ _ _ _, where _ are free characters of the id string, together with the second word of each line. For example, for the list above:
[(no RN, estar VAII1S0, equilibrar VMP00SF)]
This is what I already tried:
weird_triple = re.findall(r'^(\w+)\s.+\s(RN)\s[0-9.]+\n^(.+)\s.+\s(VA)', big_string, re.M)
print "\n This is the weird triple\n", weird_triple
print "\n This is the striped weird triple\n", [x[::2] for x in weird_triple]
This is the output:
This is the weird triple
[('no', 'RN', 'estaba', 'VA')]
This is the striped weird triple
[('no', 'estaba')]
You can modify your regex as follows:
>>> re.findall(r'(\w+\s+RN).*?(\w+\s+VA\w+).*?(\w+\s+VM\w+)', big_string, re.S)
[('no RN', 'estar VAII1S0', 'equilibrar VMP00SF')]
Note: The re.M flag causes ^ and $ to match the begin/end of each line while the re.S flag allows the dot to match across newline sequences.
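One caveat worth illustrating: with re.S, .*? can skip over whole lines, so the pattern above also accepts RN, VA and VM lines that are not adjacent. The sketch below (Python 3, hypothetical sample data) compares it with a stricter variant that only lets each gap cross exactly one line break. The stricter pattern is an assumption about what "consecutive" should mean, not part of the answer.

```python
import re

big_string = """no no RN 0.902728
estaba estar VAII1S0 0.5
equilibrada equilibrar VMP00SF 1
error error RN
error error VASI1S0
pues pues CS 0.998047
error error VMP00SM
"""

# Pattern from the answer: .*? with re.S may jump over intermediate lines.
loose = re.findall(r'(\w+\s+RN).*?(\w+\s+VA\w+).*?(\w+\s+VM\w+)',
                   big_string, re.S)

# Stricter sketch: [^\n]* stays on the current line and each \n moves
# exactly one line down, so the three ids must sit on adjacent lines.
strict = re.findall(
    r'(\w+[ \t]+RN)\b[^\n]*\n'
    r'[^\n]*?(\w+[ \t]+VA\w+)[^\n]*\n'
    r'[^\n]*?(\w+[ \t]+VM\w+)',
    big_string,
)

print(loose)
# [('no RN', 'estar VAII1S0', 'equilibrar VMP00SF'), ('error RN', 'error VASI1S0', 'error VMP00SM')]
print(strict)
# [('no RN', 'estar VAII1S0', 'equilibrar VMP00SF')]
```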

Why I'm obtaining a null list of elements with this regex?

I have a text with some POS-tags and some words. I created a regex to generate some bigrams that look like this: [('word', 'POS-tag', 'word', 'POS-tag'), ('word', 'POS-tag', 'word', 'POS-tag')]
This is what I have already done:
# -- coding: utf-8 --
import re
test_string= '''
Es ser VSIP3S0 1
muy muy RG 1
fácil fácil AQ0CS0 1
de de SPS00 0.999984
Por por SPS00 1
decir decir VMN0000 0.997512
algo algo PI0CS000 0.900246
malo malo AQ0MS0 0.657087
de de SPS00 0.999984
ella él PP3FS000 1
, , Fc 1
sería ser VSIC1S0 0.5
que que CS 0.437483
cuando cuando CS 0.985595
centrifuga centrifugar VMIP3S0 0.994859
, , Fc 1
algo algo PI0CS000 0.900246
que que PR0CN000 0.562517
hace hacer VMIP3S0 1
muy muy RG 1
bien bien RG 0.902728
sitio sitio NCMS000 0.980769
'''
regex = re.findall(r'^(\w+)\s\w+\s(RG)\s[0-9.]+\n^(\w+)\s\w+\s(AQ0CS0)', test_string, re.M)
print "\n This is a bigram:"
print regex
The problem is that when I try to return all the words tagged RG and AQ0CS0 that appear consecutively, the resulting list is empty. How can I solve this? The output should look like this:
This is a bigram:
[('muy', 'RG'),('fácil','AQ0CS0')]
If you need to match Unicode characters, as you have in your example data, you need to set the Unicode flag, re.U or re.UNICODE:
>>> re.findall(r'^(\w+)\s\w+\s(RG)\s[0-9.]+\n^(\w+)\s\w+\s(AQ0CS0)', test_string, re.M|re.U)
[('muy', 'RG', 'f\xe1cil', 'AQ0CS0')]
The problem is the "á" character in fácil. It is not an ASCII letter, so \w does not match it. You can use the regex below, which matches any character with . instead of \w. It will solve your problem:
re.findall(r'^(\w+)\s.+\s(RG)\s[0-9.]+\n^(.+)\s.+\s(AQ0CS0)', test_string, re.M)
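To get the output in the shape the question asks for, a flat list of (word, tag) pairs, each four-element match can be split in two. This is a small sketch for Python 3, where \w already matches "á" on a str; the sample is a shortened, hypothetical excerpt of test_string and the variable names are illustrative.

```python
import re

test_string = """Es ser VSIP3S0 1
muy muy RG 1
fácil fácil AQ0CS0 1
de de SPS00 0.999984
muy muy RG 1
bien bien RG 0.902728
"""

# Pattern from the question, unchanged; re.M lets ^ match at each line start.
matches = re.findall(r'^(\w+)\s\w+\s(RG)\s[0-9.]+\n^(\w+)\s\w+\s(AQ0CS0)',
                     test_string, re.M)

# Split every (word, tag, word, tag) match into two (word, tag) pairs.
bigram = [pair for m in matches for pair in (m[:2], m[2:])]

print(matches)  # [('muy', 'RG', 'fácil', 'AQ0CS0')]
print(bigram)   # [('muy', 'RG'), ('fácil', 'AQ0CS0')]
```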
