How to understand regular expression with python? - python

I'm new with python. Could anybody help me on how I can create a regular expression given a list of strings like this:
test_string = "pero pero CC
tan tan RG
antigua antiguo AQ0FS0
que que CS
según según SPS00
mi mi DP1CSS
madre madre NCFS000"
How to return a tuple like this:
> ([madre, NCFS00],[antigua, AQ0FS0])
I would like to return the word with it's associated tag given test_string, this is what I've done:
# -- coding: utf-8 --
import re
#str = "pero pero CC " \
"tan tan RG " \
"antigua antiguo AQ0FS0" \
"que que CS " \
"según según SPS00 " \
"mi mi DP1CSS " \
"madre madre NCFS000"
tupla1 = re.findall(r'(\w+)\s\w+\s(AQ0FS0)', str)
print tupla1
tupla2 = re.findall(r'(\w+)\s\w+\s(NCFS00)',str)
print tupla2
The output is the following:
[('antigua', 'AQ0FS0')] [('madre', 'NCFS00')]
The problem with this output is that if I pass it along test_string I need to preserve the "order" or "occurrence" of the tags (i.e. I only can print a tuple if and only if they have the following order: AQ0FS0 and NCFS000 in other words: female adjective, female noun).

^([a-zA-Z]+)\s+[a-zA-Z]+\s+([\w]+(?=\d$)\d)
Dont really know the basis for this selection but still you can get it like this.Just grab the captures.Dont forget to set the flags g and m.See demo.
http://regex101.com/r/nA6hN9/38

Related

How to perform string separations using regex as a reference and that a part of the used separator pattern is not removed from the following string?

import re
sentences_list = ["El coche ((VERB) es) rojo, la bicicleta ((VERB)está) allí; el monopatín ((VERB)ha sido pintado) de color rojo, y el camión también ((VERB)funciona) con cargas pesadas", "El árbol ((VERB es)) grande, las hojas ((VERB)son) doradas y ((VERB)son) secas, los juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos"]
aux_list = []
for i_input_text in sentences_list:
#separator_symbols = r'(?:(?:,|;|\.|\s+)\s*y\s+|,\s*|;\s*)'
separator_symbols = r'(?:(?:,|;|\.|)\s*y\s+|,\s*|;\s*)(?:[A-Z]|l[oa]s|la|[eé]l)'
pattern = r"\(\(VERB\)\s*\w+(?:\s+\w+)*\)"
# Separar la frase usando separator_symbols
frases = re.split(separator_symbols, i_input_text)
aux_frases_list = []
# Buscar el patrón en cada frase separada
for i_frase in frases:
verbos = re.findall(pattern, i_frase)
if verbos:
#print(f"Frase: {i_frase}")
#print(f"Verbos encontrados: {verbos}")
aux_frases_list.append(i_frase)
aux_list = aux_list + aux_frases_list
sentences_list = aux_list
print(sentences_list)
How to make these separations without what is identified by (?:[A-Z]|l[oa]s|la|[eé]l) be removed from the following string after the split?
Using this code I am getting this wrong output:
['El coche ((VERB) es) rojo', ' bicicleta ((VERB)está) allí', ' monopatín ((VERB)ha sido pintado) de color rojo', ' camión también ((VERB)funciona) con cargas pesadas', ' hojas ((VERB)son) doradas y ((VERB)son) secas', ' juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos']
It is curious that the sentence "El árbol ((VERB es)) grande" directly dasappeared from the final list, although it should be
Instead you should get this list of strings:
["El coche ((VERB) es) rojo", "la bicicleta ((VERB)está) allí", "el monopatín ((VERB)ha sido pintado) de color rojo", "el camión también ((VERB)funciona) con cargas pesadas", "El árbol ((VERB es)) grande", "las hojas ((VERB)son) doradas y ((VERB)son) secas", "los juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos"]
I'm taking a guess the splitter regex should be this:
(?:[,.;]?\s*y\s+|[,;]\s*)(?=[A-Z]|l(?:[ao]s|a)|[eé]l)
https://regex101.com/r/jpWfvq/1
(?: [,.;]? \s* y \s+ | [,;] \s* ) # consumed
(?= # not consumed
[A-Z]
| l
(?: [ao] s | a )
| [eé] l
)
which splits on punctuation and y (ands, optional) at the boundarys
while maintaining a forward looking group of qualifying text without consuming them. And trimming leading whitespace as a bonus.

How to extract a substring using this regex pattern? It's give a ValueError: too many values to unpack (expected 1)

import re, random, os, datetime, time
from os import remove
from unicodedata import normalize
from glob import glob
def learn_in_real_time(input_text, text):
#Quita acentos y demas diacríticos excepto la ñ
input_text = re.sub(
r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+", r"\1",
normalize("NFD", input_text), 0, re.I
)
input_text = normalize( 'NFC', input_text) # -> NFC
input_text_to_check = input_text.lower() #Convierte a minuscula todo
words = []
words_associations = []
regex_what_who = r"(.*)\¿?(que sabes|que sabias|que sabrias|que te referis|que te refieres|que te referias|que te habias referido|que habias referido|a que|que|quienes|quien)\s*(con que|con lo que|con la que|con|acerca de que|acerca de quienes|acerca de quien|sobre de que|sobre que|sobre de quienes|sobre quienes|sobre de quien|sobre quien|)\s*(son|sean|es|serian|seria)\s*(iguales|igual|similares|similar|parecidos|parecido|comparables|comparable|asociables|asociable|distinguibles|distinguible|distintos|distinto|diferentes|diferente|diferenciables|diferenciable|)\s*(a |del |de |)\s*((?:\w+\s*)+)?"
l = re.search(regex_what_who, input_text_to_check, re.IGNORECASE) #Con esto valido la regex haber si entra o no en el bloque de code
if l:
#print("C")
association, = l.groups()
association = association.strip()
association_check = association + "\n" #Uso estas para las comparaciones, ya que sino las consideraria erroneamente como palabras que no estan en la lista solo por no tener el \n
return text
return text
I need it to extract the word that is in ((?: \ W + \ s *) +) and save it to a variable as a string, but the problem is that it gives me this error:
Traceback (most recent call last):
File "answer_about_learned_in_txt.py", line 106, in <module>
print(learn_in_real_time(input_t, text))
File "answer_about_learned_in_txt.py", line 72, in learn_in_real_time
association, = l.groups()
ValueError: too many values to unpack (expected 1)
How do I extract all what is in ((?: \ W + \ s *) +), and save it in a variable?
Taking advantage now that I ask how I would do to:
a) to extract everything that is in ((?: \ W + \ s *) +) and if there are blank spaces that it does not cut and save everything, for example: "Hello, how are you?"
b) to extract everything that is in ((?: \ W + \ s *) +) but to save up to the first white space, for example: "Hello"
I have the problem that if I put the following, position 6 of the tuple does not catch me
if l:
#print("C")
#association, = l.groups()
print(l.groups())
association, _temp = l.group(6)
And it gives me this error
File "answer_about_learned_in_txt.py", line 74, in learn_in_real_time
association, _temp = l.group(6)
ValueError: not enough values to unpack (expected 2, got 0)
In the end I was able to solve it with the following
If you enter
Que son los cometas
print (l.groups ())
('', 'que', '', 'son', '', '', 'los cometas')
I'm interested in the seventh position of the tuple, counting from 1
association = l.group (7)
And this give me :
'los cometas'
let's update patterns string to a logical view and follow main feature.
regex_what_who = r"(que sabes|que sabias|que sabrias|que te referis|que te refieres|que te referias|que te habias referido|que habias referido|a que|que|quienes|quien|con que|con lo que|con la que|con|acerca de que|acerca de quienes|acerca de quien|sobre de que|sobre que|sobre de quienes|sobre quienes|sobre de quien|sobre quien|son|sean|es|serian|seria|iguales|igual|similares|similar|parecidos|parecido|comparables|comparable|asociables|asociable|distinguibles|distinguible|distintos|distinto|diferentes|diferente|diferenciables|diferenciable).*(a|del|de)\s*((?:\w+\s*)+)?"
then, fix error first error in case if we got one result or many:
association, _temp = l.groups()
It Work's! -)

Need help for ValueError: substring not found

I want to make the sentences as the following:
(N(Hace calor.)-(S(De todas formas, no salgo a casa.)))
(N(Además, va a venir Peter.)-(S(Sin embargo, no lo sé a qué hora llegará exactamente.)))
But the program can only gives me the first sentence and gives an error as ValueError: substring not found for the second sentence. Any one can help? Thanks!
Here is my code:
from nltk import tokenize
sent = 'Hace calor. De todas formas, no salgo a casa. Además, va a venir Peter. Sin embargo, no lo sé a qué hora llegará exactamente.'
Ant = ['De todas formas', 'Sin embargo']
sent = tokenize.sent_tokenize(sent)
for i in sent:
for DMAnt in Ant:
if DMAnt in i:
sent = '(N(' + str(sent[sent.index(i)-1]) + ')-Antithesis-' +'(S(' + str(sent[sent.index(i)]) + '))'
print(sent)
you are changing your sent. I recommend creating a new variable, it will solve the issue.
import nltk
nltk.download('punkt')
from nltk import tokenize
sent = 'Hace calor. De todas formas, no salgo a casa. Además, va a venir Peter. Sin embargo, no lo sé a qué hora llegará exactamente.'
Ant = ['De todas formas', 'Sin embargo']
sent = tokenize.sent_tokenize(sent)
new=[]
for i in sent:
for DMAnt in Ant:
if DMAnt in i:
new.append('(N(' + str(sent[sent.index(i)-1]) + ')-Antithesis-' +'(S(' + str(sent[sent.index(i)]) + '))')
print(new)
Output:
['(N(Hace calor.)-Antithesis-(S(De todas formas, no salgo a casa.))', '(N(Además, va a venir Peter.)-Antithesis-(S(Sin embargo, no lo sé a qué hora llegará exactamente.))']
new variable will have your desirable output in the form of list.

Change unicode hardwrited in csv to corresponding character

I have a csv with 1 column having hard writed unicode character :
["Investir dans un parc d'activit\u00e9s"]
["S\u00e9curiser, restaurer et g\u00e9rer 1 372 ha de milieux naturels impact\u00e9s par la construction de l'autoroute"]
["Am\u00e9liorer la consommation \u00e9nerg\u00e9tique de b\u00e2timents publics"]
["Favoriser la recherche, am\u00e9liorer la qualit\u00e9 des traitements et assurer un \u00e9gal acc\u00e8s des soins \u00e0 tous les patients de Franche-Comt\u00e9."]
I'm trying to fix/replace them with the corresponding char, but I can't seems to make it, I tried with
df['Objectif(s)'] = df['Objectif(s)'].replace('\u00e9', 'é')
but the column don't change
Seing that the code below work, I tried to loop over the row to fix it with no success
s = "d'activit\u00e9s"
print(s) # d'activités
print(s.replace('\u00e9', 'é' )) # d'activités
for case in df['Objectif(s)']:
s = str(case)
df['Objectif(s)'][case] = s # ["Investir dans un parc d'activit\u00e9s"]
if this '\u00e9' is actually written into the file as \ u 0 0 e 9 as normal characters by the source of the data, you need to do a string replace.
the trick here is that you need to escape the \ character in the replace function first parameter
s.replace('\\u00e9', 'é' )
or use a "raw string literal" by prefixing r
s.replace(r'\u00e9', 'é' )
Try replacing
df['Objectif(s)'] = df['Objectif(s)'].replace('\u00e9', 'é')
to
df['Objectif(s)'] = df['Objectif(s)'].str.replace('\u00e9', 'é')

problems with a regex in python?

I have a big string like this:
está estar VAIP3S0 0.999201
en en SPS00 1
el el DA0MS0 1
punto punto NCMS000 1
medio medio AQ0MS0 0.314286
. . Fp 1
Es ser VSIP3S0 1
de de SPS00 0.999984
color color NCMS000 1
blanco blanco AQ0MS0 0.598684
y y CC 0.999962
tiene tener VMIP3S0 1
carga carga NCFS000 0.952569
frontal frontal AQ0CS0 0.657209
, , Fc 1
no no RN 0.902728
estaba estar VAII1S0 0.5
equilibrada equilibrar VMP00SF 1
. . Fp 1'''
I would like to extract the the ids that have the RN VA_ _ _ _ _ and VMP_ _ _ _ _ where _ are free characters of the string(id) and the second word of the line for example, for the above list:
[(no RN, estar VAII1S0, equilibrar VMP00SF)]
This is what I all ready tried:
weird_triple = re.findall(r'^(\w+)\s.+\s(RN)\s[0-9.]+\n^(.+)\s.+\s(VA)', big_string, re.M)
print "\n This is the weird triple\n", weird_triple
print "\n This is the striped weird triple\n", [x[::2] for x in weird_triple]
This is the output:
This is the weird triple
[('no', 'RN', 'estaba', 'VA')]
This is the striped weird triple
[('no', 'estaba')]
You can modify your regex as follows:
>>> re.findall(r'(\w+\s+RN).*?(\w+\s+VA\w+).*?(\w+\s+VM\w+)', big_string, re.S)
[('no RN', 'estar VAII1S0', 'equilibrar VMP00SF')]
Note: The re.M flag causes ^ and $ to match the begin/end of each line while the re.S flag allows the dot to match across newline sequences.

Categories

Resources