problems with a regex in python?

I have a big string like this:
big_string = '''está estar VAIP3S0 0.999201
en en SPS00 1
el el DA0MS0 1
punto punto NCMS000 1
medio medio AQ0MS0 0.314286
. . Fp 1
Es ser VSIP3S0 1
de de SPS00 0.999984
color color NCMS000 1
blanco blanco AQ0MS0 0.598684
y y CC 0.999962
tiene tener VMIP3S0 1
carga carga NCFS000 0.952569
frontal frontal AQ0CS0 0.657209
, , Fc 1
no no RN 0.902728
estaba estar VAII1S0 0.5
equilibrada equilibrar VMP00SF 1
. . Fp 1'''
I would like to extract the ids that match RN, VA_ _ _ _ _ and VMP_ _ _ _ _ (where each _ is a free character of the id string), together with the second word of the line. For example, for the list above:
[(no RN, estar VAII1S0, equilibrar VMP00SF)]
This is what I already tried:
weird_triple = re.findall(r'^(\w+)\s.+\s(RN)\s[0-9.]+\n^(.+)\s.+\s(VA)', big_string, re.M)
print "\n This is the weird triple\n", weird_triple
print "\n This is the striped weird triple\n", [x[::2] for x in weird_triple]
This is the output:
This is the weird triple
[('no', 'RN', 'estaba', 'VA')]
This is the striped weird triple
[('no', 'estaba')]

You can modify your regex as follows:
>>> re.findall(r'(\w+\s+RN).*?(\w+\s+VA\w+).*?(\w+\s+VM\w+)', big_string, re.S)
[('no RN', 'estar VAII1S0', 'equilibrar VMP00SF')]
Note: The re.M flag causes ^ and $ to match at the beginning and end of each line, while the re.S flag lets the dot match newline characters as well.
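To make the difference between the two flags concrete, here is a minimal Python 3 sketch. The two-line `text` sample is made up for illustration and is not the asker's data.

```python
import re

# Hypothetical two-line sample in the same "word lemma tag score" layout.
text = "no no RN 0.9\nestaba estar VAII1S0 0.5"

# Without re.M, ^ only matches at the very start of the string.
starts_default = re.findall(r'^\w+', text)
# With re.M, ^ also matches right after each newline.
starts_multiline = re.findall(r'^\w+', text, re.M)

# Without re.S, the dot stops at the newline, so the pattern cannot span lines.
spans_default = re.search(r'RN.*VA', text)
# With re.S, the dot also matches "\n".
spans_dotall = re.search(r'RN.*VA', text, re.S)

print(starts_default)       # ['no']
print(starts_multiline)     # ['no', 'estaba']
print(spans_default)        # None
print(repr(spans_dotall.group()))
```

This is why the answer's pattern needs re.S: its `.*?` has to cross line breaks to get from the RN line to the VA and VM lines.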

Related

How to split a string with a regex so that part of the separator pattern is kept at the start of the following string?

import re
sentences_list = ["El coche ((VERB) es) rojo, la bicicleta ((VERB)está) allí; el monopatín ((VERB)ha sido pintado) de color rojo, y el camión también ((VERB)funciona) con cargas pesadas", "El árbol ((VERB es)) grande, las hojas ((VERB)son) doradas y ((VERB)son) secas, los juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos"]
aux_list = []
for i_input_text in sentences_list:
    #separator_symbols = r'(?:(?:,|;|\.|\s+)\s*y\s+|,\s*|;\s*)'
    separator_symbols = r'(?:(?:,|;|\.|)\s*y\s+|,\s*|;\s*)(?:[A-Z]|l[oa]s|la|[eé]l)'
    pattern = r"\(\(VERB\)\s*\w+(?:\s+\w+)*\)"
    # Split the sentence using separator_symbols
    frases = re.split(separator_symbols, i_input_text)
    aux_frases_list = []
    # Look for the pattern in each split phrase
    for i_frase in frases:
        verbos = re.findall(pattern, i_frase)
        if verbos:
            #print(f"Frase: {i_frase}")
            #print(f"Verbos encontrados: {verbos}")
            aux_frases_list.append(i_frase)
    aux_list = aux_list + aux_frases_list
sentences_list = aux_list
print(sentences_list)
How can I make these splits so that the text matched by (?:[A-Z]|l[oa]s|la|[eé]l) is not removed from the start of the following string?
With this code I get this wrong output:
['El coche ((VERB) es) rojo', ' bicicleta ((VERB)está) allí', ' monopatín ((VERB)ha sido pintado) de color rojo', ' camión también ((VERB)funciona) con cargas pesadas', ' hojas ((VERB)son) doradas y ((VERB)son) secas', ' juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos']
Curiously, the sentence "El árbol ((VERB es)) grande" disappeared from the final list entirely, although it should be there.
Instead, I should get this list of strings:
["El coche ((VERB) es) rojo", "la bicicleta ((VERB)está) allí", "el monopatín ((VERB)ha sido pintado) de color rojo", "el camión también ((VERB)funciona) con cargas pesadas", "El árbol ((VERB es)) grande", "las hojas ((VERB)son) doradas y ((VERB)son) secas", "los juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos"]

My guess is that the splitter regex should be this:
(?:[,.;]?\s*y\s+|[,;]\s*)(?=[A-Z]|l(?:[ao]s|a)|[eé]l)
https://regex101.com/r/jpWfvq/1
(?: [,.;]? \s* y \s+ | [,;] \s* )   # consumed
(?=                                 # not consumed
    [A-Z]
  | l
    (?: [ao] s | a )
  | [eé] l
)
This splits on punctuation and an optional y ("and") at the boundaries, while a lookahead group checks the qualifying text without consuming it. As a bonus, it also trims the leading whitespace.
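A minimal Python 3 sketch of that lookahead split, run on the two sentences from the question. One detail is an assumption on my part: the question's filter pattern expects a `)` right after `VERB`, so the piece containing `((VERB es))` never passes it. The relaxed `verb_tag` pattern below is a hypothetical variant that accepts both spellings.

```python
import re

sentences = [
    "El coche ((VERB) es) rojo, la bicicleta ((VERB)está) allí; el monopatín ((VERB)ha sido pintado) de color rojo, y el camión también ((VERB)funciona) con cargas pesadas",
    "El árbol ((VERB es)) grande, las hojas ((VERB)son) doradas y ((VERB)son) secas, los juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos",
]

# The separator is consumed; the article or capital letter is only looked at.
splitter = re.compile(r"(?:[,.;]?\s*y\s+|[,;]\s*)(?=[A-Z]|l(?:[ao]s|a)|[eé]l)")

# Assumption: relaxed tag pattern accepting "((VERB) es)" and "((VERB es))".
verb_tag = re.compile(r"\(\(VERB\)?\s*\w+(?:\s+\w+)*\)+")

result = [
    part
    for sentence in sentences
    for part in splitter.split(sentence)
    if verb_tag.search(part)
]
print(result)
```

Because the article sits inside `(?=...)`, `re.split` leaves it at the start of the next piece instead of discarding it with the separator.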

How to build this regex so that it extracts a word that starts with a capital letter if and only if it appears after a previous pattern?

I need a regex that extracts all the names in a sentence. Here a name is any word that starts with a capital letter and is preceded by certain patterns within the sentence. The extraction must follow the pattern described below and must also capture the content before and after each name, so that it can be printed next to the name extracted from that sequence.
This is the pseudo-regex pattern that I need:
the beginning of the input sentence or (,|;|.|y)
associated_sense_1: "some character string (alphanumeric)" or "nothing"
(con |juntos a |junto a |en compania de )
identified_person: "some word that starts with a capital letter (the name that I must extract)" and it ends when the regex finds one or more spaces
associated_sense_2: "some character string (alphanumeric)" or "nothing"
the end of the input sentence or (,|;|.|y |con |juntos a |junto a |en compania de )
The (,|;|.|y) tokens are just connectors between persons that are used to build the regex pattern. They carry no information beyond indicating which sequence a name belongs to, so they can be removed with a .replace( , "")
With this regex I need to extract these 3 string groups:
associated_sense_1
identified_person
associated_sense_2
associated_sense = associated_sense_1 + " " + associated_sense_2
This is the proto-code:
import re
#Example 1
sense = "puede ser peligroso ir solas, quizas sea mejor ir con Adrian y seguro que luego podemos esperar por Melisa, Marcos y Lucy en la parada"
#Example 2
#sense = "Adrian ya esta en la parada; y alli probablemente esten Lucy y May en la parada esperandonos"
person_identify_pattern = r"\s*(con |por |, y |, |,y |y )\s*[A-Z][^A-Z]*"
#person_identify_pattern = r"\s*(con |por |, y |, |,y |y )\s*[^A-Z]*"
for identified_person in re.split(person_identify_pattern, sense):
    identified_person = identified_person.strip()
    if identified_person:
        try:
            print(f"Write '{associated_sense}' to {identified_person}.txt")
        except:
            associated_sense = identified_person
The wrong output I get...
Write 'puede ser peligroso ir solas, quizas sea mejor ir' to con.txt
Write 'puede ser peligroso ir solas, quizas sea mejor ir' to Melisa.txt
Write 'puede ser peligroso ir solas, quizas sea mejor ir' to ,.txt
Write 'puede ser peligroso ir solas, quizas sea mejor ir' to Lucy en la parada.txt
Correct output for example 1:
Write 'quizas sea mejor ir con' to Adrian.txt
Write 'y seguro que luego podemos esperar por en la parada' to Melisa.txt
Write 'y seguro que luego podemos esperar por en la parada' to Marcos.txt
Write 'y seguro que luego podemos esperar por en la parada' to Lucy.txt
Correct output for example 2:
Write 'ya esta en la parada' to Adrian.txt
Write 'alli probablemente esten en la parada esperandonos' to Lucy.txt
Write 'alli probablemente esten en la parada esperandonos' to May.txt
I was trying with this other regex but I still have problems with this code:
import re
sense = "puede ser peligroso ir solas, quizas sea mejor ir con Adrian y seguro que luego podemos esperar por Melisa, Marcos y Lucy en la parada"
person_identify_pattern = r"\s*(?:,|;|.|y |con |juntos a |junto a |en compania de |)\s*((?:\w\s*)+)\s*(?<=con|por|a, | y )\s*([A-Z].*?\b)\s*((?:\w\s*)+)\s*(?:,|;|.|y |con |juntos a |junto a |en compania de )\s*"
for m in re.split(person_identify_pattern, sense):
    m = m.strip()
    if m:
        try:
            print(f"Write '{content}' to {m}.txt")
        except:
            content = m
But I keep getting this wrong output
Write 'puede ser peligroso ir solas' to quizas sea mejor ir con Adrian y seguro que luego podemos esperar por.txt
Write 'puede ser peligroso ir solas' to Melisa,.txt
Write 'puede ser peligroso ir solas' to Marcos y Lucy en la parad.txt

import re
sense = "puede ser peligroso ir solas, quizas sea mejor ir con Adrian y seguro que luego podemos esperar por Melisa, Marcos y Lucy en la parada"
if match := re.findall(r"(?<=con|por|a, | y )\s*([A-Z].*?\b)", sense):
    print(match)
The result is ['Adrian', 'Melisa', 'Marcos', 'Lucy'].
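A short sketch of how that lookbehind behaves on both example sentences from the question (Python 3). Every alternative inside `(?<=...)` is exactly three characters wide, which matters because Python's `re` only accepts fixed-width lookbehinds. The variable names are mine, and the second run is only there to show the limits of this approach: names at the start of the sentence or after other words are not found.

```python
import re

# The asker's two example sentences.
example_1 = ("puede ser peligroso ir solas, quizas sea mejor ir con Adrian y seguro "
             "que luego podemos esperar por Melisa, Marcos y Lucy en la parada")
example_2 = ("Adrian ya esta en la parada; y alli probablemente esten Lucy y May "
             "en la parada esperandonos")

# Each lookbehind alternative ("con", "por", "a, ", " y ") is 3 characters wide.
name_after_connector = re.compile(r"(?<=con|por|a, | y )\s*([A-Z].*?\b)")

names_1 = name_after_connector.findall(example_1)
names_2 = name_after_connector.findall(example_2)

print(names_1)  # ['Adrian', 'Melisa', 'Marcos', 'Lucy']
print(names_2)  # ['May']
```

In the second sentence "Adrian" has nothing before it and "Lucy" follows "esten", so neither satisfies the lookbehind. Covering those cases, and the associated text the question asks for, would need a broader pattern than this one.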

Problems preserving the occurrence in a regex?

I have a very large string s. Each line of s consists of word_1 followed by word_2, an id and a number:
word_1 word_2 id number
I would like to create a regex that collects in a list all the occurrences of words whose id is RN_ _ _, followed by the id VA_ _ _ _ and the id VM_ _ _ _. The constraint for extracting the RN_ _ _ _ _, VA_ _ _ _ _ _ and VM_ _ _ _ pattern is that the occurrences must appear one after another. Here _ stands for a free character of the id string, and there can be more than 3 of them, e.g.:
casa casa NCFS000 0.979058
mejor mejor AQ0CS0 0.873665
que que PR0CN000 0.562517
mejor mejor AQ0CS0 0.873665
no no RN
esta estar VASI1S0
lavando lavar VMP00SM
. . Fp 1
This is the pattern I would like to extract, since the lines are placed one after another. The desired output, as a list, would be:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM')]
For example this will be wrong, since they are not one after another:
error error RN
error error VASI1S0
pues pues CS 0.998047
error error VMP00SM
So for the s string:
s = '''
No no RN 0.998045
sabía saber VMII3S0 0.592869
como como CS 0.999289
se se P00CN000 0.465639
ponía poner VMII3S0 0.65
una uno DI0FS0 0.951575
error error RN
actuar accion VMP00SM
lavadora lavadora NCFS000 0.414738
hasta hasta SPS00 0.957698
error error VMP00SM
que que PR0CN000 0.562517
conocí conocer VMIS1S0 1
esta este DD0FS0 0.986779
error error VA00SM
y y CC 0.999962
es ser VSIP3S0 1
que que CS 0.437483
es ser VSIP3S0 1
muy muy RG 1
sencilla sencillo AQ0FS0 1
de de SPS00 0.999984
utilizar utilizar VMN0000 1
! ! Fat 1
Todo todo DI0MS0 0.560961
un uno DI0MS0 0.987295
gustazo gustazo NCMS000 1
error error VA00SM
cuando cuando CS 0.985595
estamos estar VAIP1P0 1
error error VMP00RM
aprendiendo aprender VMG0000 1
para para SPS00 0.999103
emancipar emancipar VMN0000 1
nos nos PP1CP000 1
, , Fc 1
que que CS 0.437483
si si CS 0.99954
error error RN
nos nos PP1CP000 0.935743
ponen poner VMIP3P0 1
facilidad facilidad NCFS000 1
con con SPS00 1
las el DA0FP0 0.970954
error error VMP00RM
tareas tarea NCFP000 1
de de SPS00 0.999984
no no RN 0.998134
estás estar VAIP2S0 1
condicionado condicionar VMP00SM 0.491858
alla alla VASI1S0
la el DA0FS0 0.972269
casa casa NCFS000 0.979058
error error RN
error error VASI1S0
pues pues CS 0.998047
error error VMP00SM
mejor mejor AQ0CS0 0.873665
que que PR0CN000 0.562517
mejor mejor AQ0CS0 0.873665
no no RN 1
esta estar VASI1S0 0.908900
lavando lavar VMP00SM 0.9080972
. . Fp 1
'''
This is what I tried:
import re
weird_triple = re.findall(r'(?s)(\w+\s+RN)(?:(?!\s(?:RN|VA|VM)).)*?(\w+\s+VA\w+)(?:(?!\s(?:RN|VA|VM)).)*?(\w+\s+VM\w+)', s)
print "\n This is the weird triple\n"
print weird_triple
The problem with this approach is that it returns a list of RN_ _ _ _, VA_ _ _ _, VM_ _ _ matches that does not respect the one-after-another order (some ids and words between the lines of the pattern are being matched). Any idea how to fix this in order to obtain:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM'),('estar VAIP2S0','condicionar VMP00SM', 'alla VASI1S0')]
Thanks in advance guys!
UPDATE
I tried the approaches that other users recommended, but the problem is that if I add another one-after-another pattern like:
no no RN 0.998134
estás estar VAIP2S0 1
condicionado condicionar VMP00SM 0.491858
to the s string, the regexes recommended in the answers to this question don't work. They only catch:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM')]
Instead of:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM'),('estar VAIP2S0','condicionar VMP00SM', 'alla VASI1S0')]
which is the right result. Any idea how to get the one-after-another output:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM'),('estar VAIP2S0','condicionar VMP00SM', 'alla VASI1S0')]

Here you go:
^[\w]* (\w* RN) \d(?:\.\d*)?$\s^[^\s]* (\w* VA[^\s]*) \d(?:\.\d*)?$\s^[^\s]* (\w* VM[^\s]*) \d(?:\.\d*)?$
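The pattern above anchors each of its three parts with ^ and $, so it only works if those anchors match at every line, which in Python means passing re.M (re.MULTILINE). A minimal Python 3 sketch on a made-up sample follows. Note that the pattern also expects a trailing number on every line, so rows like "no no RN" without a score would not match.

```python
import re

# Made-up sample in the question's layout; every line carries a trailing number.
sample = """\
no no RN 0.998134
estás estar VAIP2S0 1
condicionado condicionar VMP00SM 0.491858
error error RN 1
error error VASI1S0 1
pues pues CS 0.998047
error error VMP00SM 1
no no RN 1
esta estar VASI1S0 0.908900
lavando lavar VMP00SM 0.9080972
"""

# Same pattern as in the answer, split over three literals for readability.
strict_triple = re.compile(
    r'^[\w]* (\w* RN) \d(?:\.\d*)?$\s'
    r'^[^\s]* (\w* VA[^\s]*) \d(?:\.\d*)?$\s'
    r'^[^\s]* (\w* VM[^\s]*) \d(?:\.\d*)?$',
    re.M,  # ^ and $ must match at each line, not only at the string ends
)

triples = strict_triple.findall(sample)
print(triples)
```

The "error" rows are skipped because the CS line sits between the VA and VM lines, so the three tags are not on consecutive lines.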

A little late, but mine is similar:
import re
print re.findall(r'\w+ (\w+ RN.*)\n\s*\w+ (\w+ VA.*)\n\s*\w+ (\w+ VM.*)',s)
Output:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM')]
If you make the source string a Unicode string (u"xxxx") or use s.decode(encoding) to transform to a Unicode string, you can handle the accents added to your question update. Make sure to declare the source file encoding:
# coding: utf8
import re
s = u'''
(big string in question)
'''
print re.findall(ur'\w+ (\w+ RN.*)\n\s*\w+ (\w+ VA.*)\n\s*\w+ (\w+ VM.*)',s,re.UNICODE)
Output:
[(u'no RN 0.998134', u'estar VAIP2S0 1', u'condicionar VMP00SM 0.491858'), (u'no RN 1', u'estar VASI1S0 0.908900', u'lavar VMP00SM 0.9080972')]

(?:\s*\S+ (\S+ RN\S*)(?:\s*\S*))\n(?: *\S+ (\S+ VA\S*)(?:\s*\S*))\n(?: *\S+ (\S+ VM\S*)(?: *\S*))
It works for your example.
In [40]: s = '''
....: No no RN 0.998045
....: sabía saber VMII3S0 0.592869
....: . . Fp 1
....: '''
In [41]: import re
In [42]: p = re.compile(ur'(?:\s*\S+ (\S+ RN\S*)(?:\s*\S*))\n(?: *\S+ (\S+ VA\S*)(?:\s*\S*))\n(?: *\S+ (\S+ VM\S*)(?: *\S*))')
In [43]: re.findall(p, s)
Out[43]:
[('no RN', 'estar VAIP2S0', 'condicionar VMP00SM'),
('no RN', 'estar VASI1S0', 'lavar VMP00SM')]
You can play with the regex here
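As an alternative to a single regex, the one-after-another rule can also be checked line by line. The sketch below (Python 3) swaps the regex for plain string splitting and a sliding window of three rows. The function name `consecutive_triples` and the sample are hypothetical, and it assumes each row is whitespace-separated as "word lemma tag [number]".

```python
def consecutive_triples(text, prefixes=("RN", "VA", "VM")):
    """Return (lemma tag) triples whose tags appear on three consecutive lines."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    triples = []
    # Slide a window of three neighbouring rows over the whole text.
    for window in zip(rows, rows[1:], rows[2:]):
        if all(len(row) >= 3 and row[2].startswith(prefix)
               for row, prefix in zip(window, prefixes)):
            triples.append(tuple(f"{row[1]} {row[2]}" for row in window))
    return triples


# Made-up sample: the "error" rows are interrupted by a CS line.
sample = """
error error RN
error error VASI1S0
pues pues CS 0.998047
error error VMP00SM
no no RN 1
esta estar VASI1S0 0.908900
lavando lavar VMP00SM 0.9080972
. . Fp 1
"""

print(consecutive_triples(sample))
```

This version does not care whether the trailing number is present, which is one of the details the regex answers handle differently.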

Problems with python regex encoding?

I have a large .txt file that is made up of: word1, word2, id, number as follows:
s = '''
Vaya ir VMM03S0 0.427083
mañanita mañana RG 0.796611
, , Fc 1
buscando buscar VMG0000 1
una uno DI0FS0 0.951575
lavadora lavadora NCFS000 0.414738
con con SPS00 1
la el DA0FS0 0.972269
que que PR0CN000 0.562517
sorprender sorprender VMN0000 1
a a SPS00 0.996023
una uno DI0FS0 0.951575
persona persona NCFS000 0.98773
muy muy RG 1
especiales especial AQ0CS0 1
para para SPS00 0.999103
nosotros nosotros PP1MP000 1
, , Fc 1
y y CC 0.999962
la lo PP3FSA00 0.0277039
encontramos encontrar VMIP1P0 0.65
. . Fp 1
Pero pero CC 0.999764
vamos ir VMIP1P0 0.655914
a a SPS00 0.996023
lo el DA0NS0 0.457533
que que PR0CN000 0.562517
interesa interesar VMIP3S0 0.994868
LO_QUE_INTERESA_La lo_que_interesa_la NP00000 1
lavadora lavador AQ0FS0 0.585262
tiene tener VMIP3S0 1
una uno DI0FS0 0.951575
clasificación clasificación NCFS000 1
A+ a+ NP00000 1
, , Fc 1
de de SPS00 0.999984
las el DA0FP0 0.970954
que que PR0CN000 0.562517
ahorran ahorrar VMIP3P0 1
energía energía NCFS000 1
, , Fc 1
si si CS 0.99954
no no RN 0.998134
me me PP1CS000 0.89124
equivoco equivocar VMIP1S0 1
. . Fp 1
Lava lavar VMIP3S0 0.397388
hasta hasta SPS00 0.957698
7 7 Z 1
kg kilogramo NCMN000 1
, , Fc 1
no no RN 0.998134
está estar VAIP3S0 0.999201
nada nada RG 0.135196
mal mal RG 0.497537
, , Fc 1
se se P00CN000 0.465639
le le PP3CSD00 1
veía ver VMII3S0 0.62272
un uno DI0MS0 0.987295
gran gran AQ0CS0 1
tambor tambor NCMS000 1
( ( Fpa 1
de de SPS00 0.999984
acero acero NCMS000 0.973481
inoxidable inoxidable AQ0CS0 1
) ) Fpt 1
y y CC 0.999962
un uno DI0MS0 0.987295
consumo consumo NCMS000 0.948927
máximo máximo AQ0MS0 0.986111
de de SPS00 0.999984
49 49 Z 1
litros litro NCMP000 1
Mandos mandos NP00000 1
intuitivos intuitivo AQ0MP0 1
, , Fc 1
todo todo PI0MS000 0.43165
muy muy RG 1
bien bien RG 0.902728
explicado explicar VMP00SM 1
, , Fc 1
nada nada PI0CS000 0.850279
que que PR0CN000 0.562517
ver ver VMN0000 0.997382
con con SPS00 1
hola RG 0.90937838
como VMP00SM 1
estas AQ089FG 0.90839
la el DA0FS0 0.972269
lavadora lavadora NCFS000 0.414738
de de SPS00 0.999984
casa casa NCFS000 0.979058
de de SPS00 0.999984
mis mi DP1CPS 0.995868
padres padre NCMP000 1
Además además NP00000 1
también también RG 1
seca seco AQ0FS0 0.45723
preciadas preciar VMP00PF 1
. . Fp 1'''
For example, for the s "file" I would like to extract the ids that start with AQ and RG together with their word2, but they must occur one after the other. In the example above, these words hold the one-after-another order:
muy muy RG 1
especiales especial AQ0CS0 1
These words, by contrast, do not hold the one-after-another order, so I would not like to extract them in a tuple:
hola RG 0.90937838
como VMP00SM 1
estas AQ089FG 0.90839
I would like to create a regex that extracts, as a list of tuples, only the word2 followed by its id, like this: [('word2','id')], for the whole .txt file and for all the words that hold the one-after-another order. For the example above, these are the only valid matches:
muy muy RG 1
especiales especial AQ0CS0 1
and
también también RG 1
seca seco AQ0FS0 0.45723
Then return them in tuples with their full ids, since they preserve the one-after-another order:
[('muy', 'RG', 'especial', 'AQ0CS0'), ('también', 'RG', 'seco', 'AQ0FS0')]
I tried the following:
in:
t = re.findall(r'(\w+)\s*(RG)[^\n]*\n[^\n]*?(\w+)\s*(AQ\w*)', s)
print t
But my output is wrong, since it is dropping the accent and some characters:
out:
[('muy', 'RG', 'especial', 'AQ0CS0'), ('n', 'RG', 'seco', 'AQ0FS0')]
instead of the correct one:
[('muy', 'RG', 'especial', 'AQ0CS0'), ('también', 'RG', 'seco', 'AQ0FS0')]
Could someone help me understand what happened in my example above and how to fix it in order to catch the word2 and id that preserve the one-after-another occurrence? Thanks in advance guys.

In Python 2, with 8-bit strings (str), \w matches [0-9a-zA-Z_]. However, if you use unicode strings and compile your pattern with the re.UNICODE flag, then \w matches word characters based on the Unicode database.
The Python documentation, 7.2.1 Regular Expression Syntax, says about \w:
When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
Thus you can do
u = s.decode('UTF-8') # or whatever encoding is in your text file
t = re.findall(r'(\w+)\s*(RG)[^\n]*\n[^\n]*?(\w+)\s*(AQ\w*)', u, re.UNICODE)
In Python 3 much of the str/unicode confusion is gone; when you open a file in text mode and read its contents, you will get a Python 3 str object that handles everything as Unicode characters.
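A small Python 3 sketch of that difference, using two lines from the question as a sample. By default \w is Unicode-aware for str patterns, while the re.ASCII flag restricts it to [a-zA-Z0-9_] and reproduces the truncated 'n' the asker saw under Python 2. The variable names are only illustrative.

```python
import re

# Two consecutive lines taken from the question's data.
sample = "también también RG 1\nseca seco AQ0FS0 0.45723\n"
pattern = r'(\w+)\s*(RG)[^\n]*\n[^\n]*?(\w+)\s*(AQ\w*)'

# Python 3 default: \w matches accented letters such as "é".
unicode_matches = re.findall(pattern, sample)

# re.ASCII limits \w to [a-zA-Z0-9_], so the match can only start after "é".
ascii_matches = re.findall(pattern, sample, re.ASCII)

print(unicode_matches)  # [('también', 'RG', 'seco', 'AQ0FS0')]
print(ascii_matches)    # [('n', 'RG', 'seco', 'AQ0FS0')]
```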

It seems that \w+ doesn't recognize the special character é.
So if the fields in your text are strictly separated by spaces, you can replace \w with \S.
The regex will be:
t = re.findall(r'(\S+)\s*(RG)[^\n]*\n[^\n]*?(\S+)\s*(AQ\S*)', s)

Why am I getting an empty list with this regex?

I have a text with some POS-tags and some words. I created a regex to generate some bigrams that look like this: [('word', 'POS-tag', 'word', 'POS-tag'), ('word', 'POS-tag', 'word', 'POS-tag')]
This is what I have already done:
# -*- coding: utf-8 -*-
import re
test_string= '''
Es ser VSIP3S0 1
muy muy RG 1
fácil fácil AQ0CS0 1
de de SPS00 0.999984
Por por SPS00 1
decir decir VMN0000 0.997512
algo algo PI0CS000 0.900246
malo malo AQ0MS0 0.657087
de de SPS00 0.999984
ella él PP3FS000 1
, , Fc 1
sería ser VSIC1S0 0.5
que que CS 0.437483
cuando cuando CS 0.985595
centrifuga centrifugar VMIP3S0 0.994859
, , Fc 1
algo algo PI0CS000 0.900246
que que PR0CN000 0.562517
hace hacer VMIP3S0 1
muy muy RG 1
bien bien RG 0.902728
sitio sitio NCMS000 0.980769
'''
regex = re.findall(r'^(\w+)\s\w+\s(RG)\s[0-9.]+\n^(\w+)\s\w+\s(AQ0CS0)', test_string, re.M)
print "\n This is a bigram:"
print regex
The problem is that when I try to return all the words tagged RG and AQ0CS0 that appear consecutively, the final result is empty. How can I solve this? The output should look like this:
This is a bigram:
[('muy', 'RG'),('fácil','AQ0CS0')]

If you need to match Unicode characters, such as the ones in your example data, you need to set the Unicode flag re.U (or re.UNICODE):
>>> re.findall(r'^(\w+)\s\w+\s(RG)\s[0-9.]+\n^(\w+)\s\w+\s(AQ0CS0)', test_string, re.M|re.U)
[('muy', 'RG', 'f\xe1cil', 'AQ0CS0')]

The problem is the "á" character in fácil. It is not an ASCII letter, so \w does not recognize it (without the Unicode flag). You can use the regex below, which replaces \w+ with .+ and should solve your problem:
re.findall(r'^(\w+)\s.+\s(RG)\s[0-9.]+\n^(.+)\s.+\s(AQ0CS0)', test_string, re.M)
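The patterns in these answers return one 4-tuple per match, while the question asks for a flat list of (word, tag) pairs. A minimal Python 3 sketch of that last reshaping step follows. It uses the question's original \w+ pattern, which needs no extra flag in Python 3 because \w is Unicode-aware there. The short sample and variable names are assumptions for illustration.

```python
import re

# A few lines from the question's test_string.
sample = (
    "Es ser VSIP3S0 1\n"
    "muy muy RG 1\n"
    "fácil fácil AQ0CS0 1\n"
    "de de SPS00 0.999984\n"
)

matches = re.findall(
    r'^(\w+)\s\w+\s(RG)\s[0-9.]+\n^(\w+)\s\w+\s(AQ0CS0)', sample, re.M
)

# Flatten each (word1, tag1, word2, tag2) match into two (word, tag) pairs.
pairs = [
    pair
    for word1, tag1, word2, tag2 in matches
    for pair in ((word1, tag1), (word2, tag2))
]

print(matches)  # [('muy', 'RG', 'fácil', 'AQ0CS0')]
print(pairs)    # [('muy', 'RG'), ('fácil', 'AQ0CS0')]
```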
