Need help for ValueError: substring not found - python

I want to make the sentences as the following:
(N(Hace calor.)-(S(De todas formas, no salgo a casa.)))
(N(Además, va a venir Peter.)-(S(Sin embargo, no lo sé a qué hora llegará exactamente.)))
But the program only gives me the first sentence and raises ValueError: substring not found for the second one. Can anyone help? Thanks!
Here is my code:
from nltk import tokenize
sent = 'Hace calor. De todas formas, no salgo a casa. Además, va a venir Peter. Sin embargo, no lo sé a qué hora llegará exactamente.'
Ant = ['De todas formas', 'Sin embargo']
sent = tokenize.sent_tokenize(sent)
for i in sent:
    for DMAnt in Ant:
        if DMAnt in i:
            sent = '(N(' + str(sent[sent.index(i)-1]) + ')-Antithesis-' +'(S(' + str(sent[sent.index(i)]) + '))'
            print(sent)

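You are reassigning sent inside the loop. After the first match, sent is a string instead of the list of sentences, so on the next match sent.index(i) is a string search and raises ValueError: substring not found. I recommend collecting the results in a new variable; that will solve the issue.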
import nltk
nltk.download('punkt')
from nltk import tokenize
sent = 'Hace calor. De todas formas, no salgo a casa. Además, va a venir Peter. Sin embargo, no lo sé a qué hora llegará exactamente.'
Ant = ['De todas formas', 'Sin embargo']
sent = tokenize.sent_tokenize(sent)
new=[]
for i in sent:
    for DMAnt in Ant:
        if DMAnt in i:
            new.append('(N(' + str(sent[sent.index(i)-1]) + ')-Antithesis-' +'(S(' + str(sent[sent.index(i)]) + '))')
print(new)
Output:
['(N(Hace calor.)-Antithesis-(S(De todas formas, no salgo a casa.))', '(N(Además, va a venir Peter.)-Antithesis-(S(Sin embargo, no lo sé a qué hora llegará exactamente.))']
The new variable will hold your desired output as a list.
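A variation on the same fix, as a minimal sketch: sent.index(i) returns the first matching position, so it could point at the wrong sentence if the same sentence appeared twice, whereas enumerate gives the position directly. The helper name pair_antithesis is hypothetical, the input list is written out by hand as a stand-in for what tokenize.sent_tokenize is expected to return, and the format string here closes the outer parenthesis, which the code above leaves open.

```python
def pair_antithesis(sentences, markers):
    """Pair every sentence containing a marker with the sentence before it."""
    pairs = []
    for idx, sentence in enumerate(sentences):
        # A marker in the very first sentence has no preceding sentence to pair with.
        if idx > 0 and any(marker in sentence for marker in markers):
            pairs.append('(N(' + sentences[idx - 1] + ')-Antithesis-(S(' + sentence + ')))')
    return pairs


# Assumed input: the sentences already split, as sent_tokenize would be expected to do.
sentences = [
    'Hace calor.',
    'De todas formas, no salgo a casa.',
    'Además, va a venir Peter.',
    'Sin embargo, no lo sé a qué hora llegará exactamente.',
]
Ant = ['De todas formas', 'Sin embargo']

result = pair_antithesis(sentences, Ant)
for line in result:
    print(line)
```

Because the original list is never reassigned, every match is looked up against the full list of sentences.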

Related

How to split strings using a regex as the separator, so that part of the separator pattern is not removed from the following string?

import re
sentences_list = ["El coche ((VERB) es) rojo, la bicicleta ((VERB)está) allí; el monopatín ((VERB)ha sido pintado) de color rojo, y el camión también ((VERB)funciona) con cargas pesadas", "El árbol ((VERB es)) grande, las hojas ((VERB)son) doradas y ((VERB)son) secas, los juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos"]
aux_list = []
for i_input_text in sentences_list:
    #separator_symbols = r'(?:(?:,|;|\.|\s+)\s*y\s+|,\s*|;\s*)'
    separator_symbols = r'(?:(?:,|;|\.|)\s*y\s+|,\s*|;\s*)(?:[A-Z]|l[oa]s|la|[eé]l)'
    pattern = r"\(\(VERB\)\s*\w+(?:\s+\w+)*\)"
    # Split the sentence using separator_symbols
    frases = re.split(separator_symbols, i_input_text)
    aux_frases_list = []
    # Look for the pattern in each split fragment
    for i_frase in frases:
        verbos = re.findall(pattern, i_frase)
        if verbos:
            #print(f"Frase: {i_frase}")
            #print(f"Verbos encontrados: {verbos}")
            aux_frases_list.append(i_frase)
    aux_list = aux_list + aux_frases_list
sentences_list = aux_list
print(sentences_list)
How can I make these splits without the text matched by (?:[A-Z]|l[oa]s|la|[eé]l) being removed from the following string?
With this code I am getting this wrong output:
['El coche ((VERB) es) rojo', ' bicicleta ((VERB)está) allí', ' monopatín ((VERB)ha sido pintado) de color rojo', ' camión también ((VERB)funciona) con cargas pesadas', ' hojas ((VERB)son) doradas y ((VERB)son) secas', ' juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos']
Curiously, the sentence "El árbol ((VERB es)) grande" disappeared from the final list altogether, although it should be there.
The expected output is this list of strings:
["El coche ((VERB) es) rojo", "la bicicleta ((VERB)está) allí", "el monopatín ((VERB)ha sido pintado) de color rojo", "el camión también ((VERB)funciona) con cargas pesadas", "El árbol ((VERB es)) grande", "las hojas ((VERB)son) doradas y ((VERB)son) secas", "los juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos"]
I'm taking a guess the splitter regex should be this:
(?:[,.;]?\s*y\s+|[,;]\s*)(?=[A-Z]|l(?:[ao]s|a)|[eé]l)
https://regex101.com/r/jpWfvq/1
(?: [,.;]? \s* y \s+ | [,;] \s* )   # consumed
(?=                                 # not consumed
     [A-Z]
  |  l
     (?: [ao] s | a )
  |  [eé] l
)
This splits on punctuation and an optional y ("and") at the boundaries, while keeping the qualifying text in a lookahead group that is not consumed. As a bonus, it trims the leading whitespace.
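A minimal sketch of how that splitter could be used with re.split in Python. The variable names are illustrative, and the ((VERB)...) filter from the question is left out on purpose: the first fragment of the second sentence is tagged as ((VERB es)) rather than ((VERB) es), so the question's verb pattern does not match it, which would explain why that fragment was dropped.

```python
import re

# Separator from the answer: the lookahead (?=...) checks the next word without consuming it.
splitter = re.compile(r'(?:[,.;]?\s*y\s+|[,;]\s*)(?=[A-Z]|l(?:[ao]s|a)|[eé]l)')

sentences_list = [
    "El coche ((VERB) es) rojo, la bicicleta ((VERB)está) allí; el monopatín ((VERB)ha sido pintado) de color rojo, y el camión también ((VERB)funciona) con cargas pesadas",
    "El árbol ((VERB es)) grande, las hojas ((VERB)son) doradas y ((VERB)son) secas, los juegos del parque ((VERB)estan) algo oxidados y ((VERB)es) peligroso subirse a ellos",
]

fragments = []
for sentence in sentences_list:
    # Only the punctuation / "y" part is removed; the article stays in the next fragment.
    fragments.extend(splitter.split(sentence))

for fragment in fragments:
    print(fragment)
```

Since the article is only inspected by the lookahead, each fragment starts with "la", "el", "las" or "los" instead of losing it to the separator.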

Make auto random phrases in same time

I need help generating a lot of random text to send on Twitter, but copy-pasting each case takes too long. Does anyone know how I can duplicate the if(rdmlol == 0): branch up to 500 times? Thanks.
Code
if(rdmlol == 0):
    api.update_status(status = 'coucou les toxic de ma tl feur ' + str(TweetCount) + ' je ne suis pas un robot lol allez voir feur ')
elif(rdmlol == 1):
    api.update_status(status = 'coucou les pierre de ma tl feur ' + str(TweetCount) + ' je ne suis pas un robot lol allez voir feur')
You're probably looking for a for-loop over a range?
for _ in range(500):
    msg = "coucou les toxic de ma tl feur " + str(TweetCount) + " je ne suis pas un robot lol allez voir feur"
    api.update_status(status = msg)
If you need to construct a different message each time, the solution depends on what exactly you need, e.g.
import random
def create_message(tweet_count, words):
    # f-strings were introduced in Python 3.6
    return f"coucou les {random.choice(words)} de ma tl feur {tweet_count} je ne suis pas un robot lol allez voir feur"
def update_status(api):
    words = ["toxic", "pierre"]
    for i in range(500):
        msg = create_message(i+1, words)
        api.update_status(status = msg)
I hope this helps.
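If the goal is to keep the random index rdmlol from the question, one possible sketch is to replace the if/elif chain with a list lookup, so 500 cases become 500 list entries rather than 500 branches. WORDS, build_status and the placeholder value of TweetCount are assumptions for illustration; the api call is left as a comment so the sketch runs without Twitter credentials.

```python
import random

# Hypothetical word list; extend it with as many entries as needed (e.g. 500).
WORDS = ["toxic", "pierre"]


def build_status(rdmlol, tweet_count, words=WORDS):
    """Return the status text that the matching if/elif branch would have built."""
    return ("coucou les " + words[rdmlol] + " de ma tl feur " + str(tweet_count)
            + " je ne suis pas un robot lol allez voir feur")


TweetCount = 1  # placeholder for the counter used in the question
rdmlol = random.randrange(len(WORDS))  # random index, one per entry in WORDS
status = build_status(rdmlol, TweetCount)
print(status)
# In the real script this would be followed by: api.update_status(status=status)
```

Adding a new variant then only means appending a word to the list.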

How to extract a substring using this regex pattern? It gives a ValueError: too many values to unpack (expected 1)

import re, random, os, datetime, time
from os import remove
from unicodedata import normalize
from glob import glob
def learn_in_real_time(input_text, text):
    # Strip accents and other diacritics, except the ñ
    input_text = re.sub(
        r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+", r"\1",
        normalize("NFD", input_text), 0, re.I
    )
    input_text = normalize( 'NFC', input_text) # -> NFC
    input_text_to_check = input_text.lower() # Lowercase everything
    words = []
    words_associations = []
    regex_what_who = r"(.*)\¿?(que sabes|que sabias|que sabrias|que te referis|que te refieres|que te referias|que te habias referido|que habias referido|a que|que|quienes|quien)\s*(con que|con lo que|con la que|con|acerca de que|acerca de quienes|acerca de quien|sobre de que|sobre que|sobre de quienes|sobre quienes|sobre de quien|sobre quien|)\s*(son|sean|es|serian|seria)\s*(iguales|igual|similares|similar|parecidos|parecido|comparables|comparable|asociables|asociable|distinguibles|distinguible|distintos|distinto|diferentes|diferente|diferenciables|diferenciable|)\s*(a |del |de |)\s*((?:\w+\s*)+)?"
    l = re.search(regex_what_who, input_text_to_check, re.IGNORECASE) # Run the regex to decide whether to enter the block below
    if l:
        #print("C")
        association, = l.groups()
        association = association.strip()
        association_check = association + "\n" # Used for comparisons; otherwise it would wrongly be treated as a word missing from the list just because it lacks the \n
        return text
    return text
I need it to extract the text matched by ((?:\w+\s*)+) and save it to a variable as a string, but the problem is that it gives me this error:
Traceback (most recent call last):
  File "answer_about_learned_in_txt.py", line 106, in <module>
    print(learn_in_real_time(input_t, text))
  File "answer_about_learned_in_txt.py", line 72, in learn_in_real_time
    association, = l.groups()
ValueError: too many values to unpack (expected 1)
How do I extract everything matched by ((?:\w+\s*)+) and save it in a variable?
While I'm asking, I would also like to know how to:
a) extract everything matched by ((?:\w+\s*)+), blank spaces included, without cutting anything off, for example: "Hello, how are you?"
b) extract what is matched by ((?:\w+\s*)+) but keep only the text up to the first white space, for example: "Hello"
If I write the following, position 6 of the tuple does not give me the text I want:
if l:
    #print("C")
    #association, = l.groups()
    print(l.groups())
    association, _temp = l.group(6)
And it gives me this error
  File "answer_about_learned_in_txt.py", line 74, in learn_in_real_time
    association, _temp = l.group(6)
ValueError: not enough values to unpack (expected 2, got 0)
In the end I was able to solve it as follows.
If you enter
Que son los cometas
then print(l.groups()) shows
('', 'que', '', 'son', '', '', 'los cometas')
I'm interested in the seventh position of the tuple, counting from 1:
association = l.group(7)
And this gives me:
'los cometas'
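To make the two errors concrete, here is a small sketch. It uses a shortened stand-in for regex_what_who (an assumption for readability: same seven capture groups, far fewer alternatives). l.groups() returns one item per capture group, so unpacking it into a single name fails; l.group(6) is the empty string here, so unpacking it into two names finds zero values; l.group(7) is the trailing text, and splitting it gives the part up to the first white space.

```python
import re

# Shortened stand-in for regex_what_who: same seven capture groups, fewer alternatives.
pattern = (
    r"(.*)¿?(que|quienes|quien)\s*(con|sobre|)\s*(son|sean|es)\s*"
    r"(iguales|igual|)\s*(a |del |de |)\s*((?:\w+\s*)+)?"
)

match = re.search(pattern, "que son los cometas", re.IGNORECASE)
groups = match.groups()          # one entry per capture group, seven in total
print(groups)

association = match.group(7)     # case a): the whole trailing text, spaces included
# Case b): keep only the text up to the first white space.
# group(7) is optional in the pattern, so it can be None when nothing follows.
first_word = association.split()[0] if association else ""
print(repr(association), repr(first_word))
```

If the trailing group may be missing, checking association for None before calling strip() avoids a separate AttributeError.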
Let's restructure the pattern string into a more logical form that keeps the main feature.
regex_what_who = r"(que sabes|que sabias|que sabrias|que te referis|que te refieres|que te referias|que te habias referido|que habias referido|a que|que|quienes|quien|con que|con lo que|con la que|con|acerca de que|acerca de quienes|acerca de quien|sobre de que|sobre que|sobre de quienes|sobre quienes|sobre de quien|sobre quien|son|sean|es|serian|seria|iguales|igual|similares|similar|parecidos|parecido|comparables|comparable|asociables|asociable|distinguibles|distinguible|distintos|distinto|diferentes|diferente|diferenciables|diferenciable).*(a|del|de)\s*((?:\w+\s*)+)?"
Then fix the first error, for the case where we get one result or many:
association, _temp = l.groups()
It works! :-)

Problems preserving the occurrence in a regex?

I have a very large string s. Each line of s is made up of word_1 followed by word_2, an id and a number:
word_1 word_2 id number
I would like to create a regex that collects in a list all the occurrences of a word whose id is RN_ _ _ followed by a word whose id is VA_ _ _ _ and a word whose id is VM_ _ _ _. The constraint for extracting the RN_ _ _ _ _, VA_ _ _ _ _ _ and VM_ _ _ _ pattern is that the three occurrences must appear one right after another. Here _ stands for the free characters of the id string, and there can be more than 3 of them, e.g.:
casa casa NCFS000 0.979058
mejor mejor AQ0CS0 0.873665
que que PR0CN000 0.562517
mejor mejor AQ0CS0 0.873665
no no RN
esta estar VASI1S0
lavando lavar VMP00SM
. . Fp 1
This is the pattern I would like to extract, since the three lines are placed one after another. This would be the desired output, as a list:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM')]
For example this will be wrong, since they are not one after another:
error error RN
error error VASI1S0
pues pues CS 0.998047
error error VMP00SM
So for the s string:
s = '''
No no RN 0.998045
sabía saber VMII3S0 0.592869
como como CS 0.999289
se se P00CN000 0.465639
ponía poner VMII3S0 0.65
una uno DI0FS0 0.951575
error error RN
actuar accion VMP00SM
lavadora lavadora NCFS000 0.414738
hasta hasta SPS00 0.957698
error error VMP00SM
que que PR0CN000 0.562517
conocí conocer VMIS1S0 1
esta este DD0FS0 0.986779
error error VA00SM
y y CC 0.999962
es ser VSIP3S0 1
que que CS 0.437483
es ser VSIP3S0 1
muy muy RG 1
sencilla sencillo AQ0FS0 1
de de SPS00 0.999984
utilizar utilizar VMN0000 1
! ! Fat 1
Todo todo DI0MS0 0.560961
un uno DI0MS0 0.987295
gustazo gustazo NCMS000 1
error error VA00SM
cuando cuando CS 0.985595
estamos estar VAIP1P0 1
error error VMP00RM
aprendiendo aprender VMG0000 1
para para SPS00 0.999103
emancipar emancipar VMN0000 1
nos nos PP1CP000 1
, , Fc 1
que que CS 0.437483
si si CS 0.99954
error error RN
nos nos PP1CP000 0.935743
ponen poner VMIP3P0 1
facilidad facilidad NCFS000 1
con con SPS00 1
las el DA0FP0 0.970954
error error VMP00RM
tareas tarea NCFP000 1
de de SPS00 0.999984
no no RN 0.998134
estás estar VAIP2S0 1
condicionado condicionar VMP00SM 0.491858
alla alla VASI1S0
la el DA0FS0 0.972269
casa casa NCFS000 0.979058
error error RN
error error VASI1S0
pues pues CS 0.998047
error error VMP00SM
mejor mejor AQ0CS0 0.873665
que que PR0CN000 0.562517
mejor mejor AQ0CS0 0.873665
no no RN 1
esta estar VASI1S0 0.908900
lavando lavar VMP00SM 0.9080972
. . Fp 1
'''
this is what I tried:
import re
weird_triple = re.findall(r'(?s)(\w+\s+RN)(?:(?!\s(?:RN|VA|VM)).)*?(\w+\s+VA\w+)(?:(?!\s(?:RN|VA|VM)).)*?(\w+\s+VM\w+)', s)
print "\n This is the weird triple\n"
print weird_triple
The problem with this approach is that it returns a list of RN_ _ _ _, VA_ _ _ _, VM_ _ _ matches that ignores the one-after-another order (some ids and words between the three lines are being matched). Any idea how to fix this in order to obtain:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM'),('estar VAIP2S0','condicionar VMP00SM', 'alla VASI1S0')]
Thanks in advance guys!
UPDATE
I tried the approaches that other users recommended, but the problem is that if I add another one-after-another pattern like:
no no RN 0.998134
estás estar VAIP2S0 1
condicionado condicionar VMP00SM 0.491858
To the s string, the regexes recommended in this question don't work. They only catch:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM')]
Instead of:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM'),('estar VAIP2S0','condicionar VMP00SM', 'alla VASI1S0')]
Which is the right result. Any idea how to get the one-after-another output:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM'),('estar VAIP2S0','condicionar VMP00SM', 'alla VASI1S0')]
Here you go:
^[\w]* (\w* RN) \d(?:\.\d*)?$\s^[^\s]* (\w* VA[^\s]*) \d(?:\.\d*)?$\s^[^\s]* (\w* VM[^\s]*) \d(?:\.\d*)?$
A little late, but mine is similar:
import re
print re.findall(r'\w+ (\w+ RN.*)\n\s*\w+ (\w+ VA.*)\n\s*\w+ (\w+ VM.*)',s)
Output:
[('no RN', 'estar VASI1S0', 'lavar VMP00SM')]
If you make the source string a Unicode string (u"xxxx") or use s.decode(encoding) to transform to a Unicode string, you can handle the accents added to your question update. Make sure to declare the source file encoding:
# coding: utf8
import re
s = u'''
(big string in question)
'''
print re.findall(ur'\w+ (\w+ RN.*)\n\s*\w+ (\w+ VA.*)\n\s*\w+ (\w+ VM.*)',s,re.UNICODE)
Output:
[(u'no RN 0.998134', u'estar VAIP2S0 1', u'condicionar VMP00SM 0.491858'), (u'no RN 1', u'estar VASI1S0 0.908900', u'lavar VMP00SM 0.9080972')]
(?:\s*\S+ (\S+ RN\S*)(?:\s*\S*))\n(?: *\S+ (\S+ VA\S*)(?:\s*\S*))\n(?: *\S+ (\S+ VM\S*)(?: *\S*))
It works for your example.
In [40]: s = '''
....: No no RN 0.998045
....: sabía saber VMII3S0 0.592869
....: . . Fp 1
....: '''
In [41]: import re
In [42]: p = re.compile(ur'(?:\s*\S+ (\S+ RN\S*)(?:\s*\S*))\n(?: *\S+ (\S+ VA\S*)(?:\s*\S*))\n(?: *\S+ (\S+ VM\S*)(?: *\S*))')
In [43]: re.findall(p, s)
Out[43]:
[('no RN', 'estar VAIP2S0', 'condicionar VMP00SM'),
('no RN', 'estar VASI1S0', 'lavar VMP00SM')]
You can play with the regex here
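As an alternative to a single large regex, the "one after another" constraint can also be checked line by line. This is a different technique from the answers above, sketched in Python 3 under the assumption that every non-empty line has at least three whitespace-separated fields (word, lemma, id, then an optional number). The function name consecutive_triples and the short sample are illustrative.

```python
def consecutive_triples(tagged_text, prefixes=("RN", "VA", "VM")):
    """Return (lemma id) triples for lines whose ids start with the prefixes, back to back."""
    rows = [line.split() for line in tagged_text.splitlines() if line.strip()]
    triples = []
    for start in range(len(rows) - len(prefixes) + 1):
        window = rows[start:start + len(prefixes)]
        # Each line in the window must carry the id prefix expected at that position.
        if all(len(row) >= 3 and row[2].startswith(prefix)
               for row, prefix in zip(window, prefixes)):
            triples.append(tuple(row[1] + " " + row[2] for row in window))
    return triples


# Short excerpt in the same format as the s string from the question.
sample = """\
casa casa NCFS000 0.979058
error error RN
error error VASI1S0
pues pues CS 0.998047
error error VMP00SM
no no RN 0.998134
estás estar VAIP2S0 1
condicionado condicionar VMP00SM 0.491858
no no RN 1
esta estar VASI1S0 0.908900
lavando lavar VMP00SM 0.9080972
. . Fp 1
"""

triples = consecutive_triples(sample)
print(triples)
```

Because the window only ever covers three adjacent lines, the "error" lines separated by "pues pues CS" can never be reported, and the trailing number is optional.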

How to understand regular expression with python?

I'm new to Python. Could anybody help me create a regular expression, given a list of strings like this:
test_string = """pero pero CC
tan tan RG
antigua antiguo AQ0FS0
que que CS
según según SPS00
mi mi DP1CSS
madre madre NCFS000"""
How to return a tuple like this:
> ([madre, NCFS00],[antigua, AQ0FS0])
I would like to return each word with its associated tag, given test_string. This is what I've done:
# -- coding: utf-8 --
import re
test_string = "pero pero CC " \
    "tan tan RG " \
    "antigua antiguo AQ0FS0" \
    "que que CS " \
    "según según SPS00 " \
    "mi mi DP1CSS " \
    "madre madre NCFS000"
tupla1 = re.findall(r'(\w+)\s\w+\s(AQ0FS0)', test_string)
print(tupla1)
tupla2 = re.findall(r'(\w+)\s\w+\s(NCFS00)', test_string)
print(tupla2)
The output is the following:
[('antigua', 'AQ0FS0')] [('madre', 'NCFS00')]
The problem with this output is that, when I run this on test_string, I need to preserve the "order" or "occurrence" of the tags (i.e. I can print a tuple if and only if the tags appear in the following order: AQ0FS0 and then NCFS000, in other words: feminine adjective, feminine noun).
^([a-zA-Z]+)\s+[a-zA-Z]+\s+([\w]+(?=\d$)\d)
I don't really know the basis for this selection, but you can still get it like this. Just grab the captures. Don't forget to set the g and m flags. See the demo:
http://regex101.com/r/nA6hN9/38
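A sketch of how the regex above might be applied in Python, where regex101's g flag corresponds to re.findall and the m flag to re.MULTILINE. The order check at the end (adjective tag before noun tag) is my reading of the requirement in the question, not part of the answer, and the variable names are illustrative. Note that [a-zA-Z]+ does not match accented words such as "según", so that line is skipped.

```python
import re

test_string = """pero pero CC
tan tan RG
antigua antiguo AQ0FS0
que que CS
según según SPS00
mi mi DP1CSS
madre madre NCFS000"""

# With re.MULTILINE, ^ and $ anchor at every line, so each line is tested on its own.
pairs = re.findall(r'^([a-zA-Z]+)\s+[a-zA-Z]+\s+([\w]+(?=\d$)\d)', test_string, re.MULTILINE)
print(pairs)

# Assumed requirement: report the pairs only if the adjective tag precedes the noun tag.
tags = [tag for _, tag in pairs]
in_order = (
    "AQ0FS0" in tags
    and "NCFS000" in tags
    and tags.index("AQ0FS0") < tags.index("NCFS000")
)
result = tuple(pairs) if in_order else ()
print(result)
```

Since findall returns matches in the order they occur in the text, comparing the positions of the two tags is enough to check the ordering.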
