Regex to match comma-separated strings containing comma-formatted decimals - python

I have comma-separated strings like this one:
"Assistência 24hs com Guincho s/limite de km, 2o. Guincho 100 km no mesmo evento, Pacote de Benefícios HDI, Táxi sem Franquia, Serviços Residenciais, 7 dias de Carro Reserva quando Terceiro (sem ar cond), 7 dias de Carro Reserva, Vidros com franquia de R$ 260,00."
I want to split the string by comma, but the problem is that there are numbers with a comma as the decimal separator in the string (for example: 260,00), for which I don't want a split to happen.

You could split by comma, followed by space:
>>> s.split(", ")
['Assist\xc3\xaancia 24hs com Guincho s/limite de km',
'2o. Guincho 100 km no mesmo evento',
'Pacote de Benef\xc3\xadcios HDI',
'T\xc3\xa1xi sem Franquia',
'Servi\xc3\xa7os Residenciais',
'7 dias de Carro Reserva quando Terceiro (sem ar cond)',
'7 dias de Carro Reserva',
'Vidros com franquia de R$ 260,00.']
Note that this will remove both the comma and the following space from the resulting strings.

You're walking on thin ice here. From your example, it seems that using ", " (comma-space) as the field separator would work. Most people would instead quote the strings or use a different delimiter (pipe, tab, \x1F, etc.).
This approach seems very fragile to me, and it could easily break later on. If you have any influence over the format of the data you are given, have that conversation first.
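To illustrate the two alternatives mentioned above, here is a minimal sketch. It assumes the producer of the data can be persuaded to quote each field or to use a pipe delimiter; the sample lines are hypothetical and shortened from the question's string.

```python
import csv
import io

# Hypothetical sample: the same kind of items, but each field is quoted by the producer.
quoted_line = '"Táxi sem Franquia","7 dias de Carro Reserva","Vidros com franquia de R$ 260,00."'

# csv.reader treats commas inside quoted fields as data, not as separators.
quoted_fields = next(csv.reader(io.StringIO(quoted_line)))

# Alternative: a delimiter that never occurs in the data, here a pipe.
piped_line = "Táxi sem Franquia|7 dias de Carro Reserva|Vidros com franquia de R$ 260,00."
piped_fields = piped_line.split("|")

print(quoted_fields)
print(piped_fields)
```

In both cases the decimal comma in 260,00 survives without any regex.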

The following avoids some of the fragility pointed out by @dsz.
txt = '''Assistência 24hs com Guincho s/limite de km, 2o. Guincho 100 km no mesmo evento, Pacote de Benefícios HDI, Táxi sem
Franquia, Serviços Residenciais, 7 dias de Carro Reserva quando Terceiro (sem ar cond), 7 dias de Carro
Reserva, Vidros com franquia de R$ 260,00.'''
import re
# Split on every comma that is not followed by a digit, and drop the whitespace after it.
re.split(r",(?!\d)\s*", txt)
Output:
['Assist\xc3\xaancia 24hs com Guincho s/limite de km',
'2o. Guincho 100 km no mesmo evento',
'Pacote de Benef\xc3\xadcios HDI',
'T\xc3\xa1xi sem Franquia',
'Servi\xc3\xa7os Residenciais',
'7 dias de Carro Reserva quando Terceiro (sem ar cond)',
'7 dias de Carro\nReserva',
'Vidros com franquia de R$ 260,00.']
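If the space after the separator is not guaranteed, one possible refinement is to treat a comma as a decimal separator only when it sits between two digits. The sketch below assumes that rule; the names `ITEM_SEPARATOR` and `split_items` and the sample string are made up for illustration.

```python
import re

# Split on a comma unless it sits between two digits (a decimal comma),
# and strip the whitespace around the separator.
ITEM_SEPARATOR = re.compile(r"\s*(?:(?<!\d),|,(?!\d))\s*")

def split_items(text):
    """Split a comma-separated string while keeping numbers such as 260,00 intact."""
    return ITEM_SEPARATOR.split(text)

# Hypothetical sample: the first separator has no space after it.
sample = "Táxi sem Franquia,7 dias de Carro Reserva, Vidros com franquia de R$ 260,00."
print(split_items(sample))
```

This still cannot split an item that ends in a digit from one that starts with a digit when no space separates them (for example "100,7 dias"), so changing the delimiter at the source remains the safer option.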

Related

How to build this regex so that it extracts a word that starts with a capital letter only if it appears after a previous pattern?

I need a regex that extracts all the names in a sentence. For this purpose, a name is any word that starts with a capital letter and is preceded by certain patterns. The regex must follow the pattern described below and also capture the text before and after each name, so that this text can be printed next to the extracted name.
This is the pseudo-regex pattern that I need:
the beginning of the input sentence or (,|;|.|y)
associated_sense_1: "some character string (alphanumeric)" or "nothing"
(con |juntos a |junto a |en compania de )
identified_person: "some word that starts with a capital letter (the name that I must extract)", which ends when the regex finds one or more spaces
associated_sense_2: "some character string (alphanumeric)" or "nothing"
the end of the input sentence or (,|;|.|y |con |juntos a |junto a |en compania de )
The (,|;|.|y) alternatives are only connectors between people. They are used to build the regex pattern but carry no information beyond marking which words belong together, so they can be removed afterwards with a .replace( , "")
With this regex I need to extract these 3 string groups:
associated_sense_1
identified_person
associated_sense_2
associated_sense = associated_sense_1 + " " + associated_sense_2
This is the proto-code:
import re
#Example 1
sense = "puede ser peligroso ir solas, quizas sea mejor ir con Adrian y seguro que luego podemos esperar por Melisa, Marcos y Lucy en la parada"
#Example 2
#sense = "Adrian ya esta en la parada; y alli probablemente esten Lucy y May en la parada esperandonos"
person_identify_pattern = r"\s*(con |por |, y |, |,y |y )\s*[A-Z][^A-Z]*"
#person_identify_pattern = r"\s*(con |por |, y |, |,y |y )\s*[^A-Z]*"
for identified_person in re.split(person_identify_pattern, sense):
    identified_person = identified_person.strip()
    if identified_person:
        try:
            print(f"Write '{associated_sense}' to {identified_person}.txt")
        except:
            associated_sense = identified_person
The wrong output I get...
Write 'puede ser peligroso ir solas, quizas sea mejor ir' to con.txt
Write 'puede ser peligroso ir solas, quizas sea mejor ir' to Melisa.txt
Write 'puede ser peligroso ir solas, quizas sea mejor ir' to ,.txt
Write 'puede ser peligroso ir solas, quizas sea mejor ir' to Lucy en la parada.txt
Correct output for example 1:
Write 'quizas sea mejor ir con' to Adrian.txt
Write 'y seguro que luego podemos esperar por en la parada' to Melisa.txt
Write 'y seguro que luego podemos esperar por en la parada' to Marcos.txt
Write 'y seguro que luego podemos esperar por en la parada' to Lucy.txt
Correct output for example 2:
Write 'ya esta en la parada' to Adrian.txt
Write 'alli probablemente esten en la parada esperandonos' to Lucy.txt
Write 'alli probablemente esten en la parada esperandonos' to May.txt
I was trying with this other regex but I still have problems with this code:
import re
sense = "puede ser peligroso ir solas, quizas sea mejor ir con Adrian y seguro que luego podemos esperar por Melisa, Marcos y Lucy en la parada"
person_identify_pattern = r"\s*(?:,|;|.|y |con |juntos a |junto a |en compania de |)\s*((?:\w\s*)+)\s*(?<=con|por|a, | y )\s*([A-Z].*?\b)\s*((?:\w\s*)+)\s*(?:,|;|.|y |con |juntos a |junto a |en compania de )\s*"
for m in re.split(person_identify_pattern, sense):
    m = m.strip()
    if m:
        try:
            print(f"Write '{content}' to {m}.txt")
        except:
            content = m
But I keep getting this wrong output
Write 'puede ser peligroso ir solas' to quizas sea mejor ir con Adrian y seguro que luego podemos esperar por.txt
Write 'puede ser peligroso ir solas' to Melisa,.txt
Write 'puede ser peligroso ir solas' to Marcos y Lucy en la parad.txt
import re
sense = "puede ser peligroso ir solas, quizas sea mejor ir con Adrian y seguro que luego podemos esperar por Melisa, Marcos y Lucy en la parada"
if match := re.findall(r"(?<=con|por|a, | y )\s*([A-Z].*?\b)", sense):
    print(match)
The result is ['Adrian', 'Melisa', 'Marcos', 'Lucy'], that is, only the names without their associated text.

How to convert a list of sentences into a single text?

I have a list of sentences like this:
['Circula hoje o caderno especial "Folha Rock". Ele traz todas as informações para quem vai ao
M2.000 Summer Concerts, que começa hoje, e ao Hollywood Rock, a partir do dia 14. Mais de 30
bandas se apresentam nos dois festivais. O reggae domina a primeira noite do M2.000.',
'O delegado Nélson Guimarães, que apura a morte do sindicalista Oswaldo Cruz Júnior, não descarta
"motivações políticas" para o crime. O enterro foi marcado pela disputa da sucessão. Um grupo
apoiou o irmão de Oswaldo. Outro quer Cícero Bezerra da Silva, ligado a José Benedito de Souza,
suspeito do crime que está foragido. Brasil']
And I need the list to look like this:
['Circula hoje o caderno especial "Folha Rock". Ele traz todas as informações para quem vai ao
M2.000 Summer Concerts, que começa hoje, e ao Hollywood Rock, a partir do dia 14. Mais de 30
bandas se apresentam nos dois festivais. O reggae domina a primeira noite do M2.000.
O delegado Nélson Guimarães, que apura a morte do sindicalista Oswaldo Cruz Júnior, não descarta
"motivações políticas" para o crime. O enterro foi marcado pela disputa da sucessão. Um grupo
apoiou o irmão de Oswaldo. Outro quer Cícero Bezerra da Silva, ligado a José Benedito de Souza,
suspeito do crime que está foragido. Brasil']
You want to join all elements of the list into a single string, right?
This might help; it gives you a single string variable:
yourlist = ['Circula hoje o caderno especial "Folha Rock". Ele traz todas as informações para quem vai ao M2.000 Summer Concerts, que começa hoje, e ao Hollywood Rock, a partir do dia 14. Mais de 30 bandas se apresentam nos dois festivais. O reggae domina a primeira noite do M2.000.',
'O delegado Nélson Guimarães, que apura a morte do sindicalista Oswaldo Cruz Júnior, não descarta "motivações políticas" para o crime. O enterro foi marcado pela disputa da sucessão. Um grupo apoiou o irmão de Oswaldo. Outro quer Cícero Bezerra da Silva, ligado a José Benedito de Souza, suspeito do crime que está foragido. Brasil']
# Use a name other than str so the built-in type is not shadowed.
text = ""
for x in yourlist:
    text = text + x
print(text)
You can do:
yourlistname = ['Circula hoje o caderno especial "Folha Rock". Ele traz todas as informações para quem vai ao M2.000 Summer Concerts, que começa hoje, e ao Hollywood Rock, a partir do dia 14. Mais de 30 bandas se apresentam nos dois festivais. O reggae domina a primeira noite do M2.000.',
'O delegado Nélson Guimarães, que apura a morte do sindicalista Oswaldo Cruz Júnior, não descarta "motivações políticas" para o crime. O enterro foi marcado pela disputa da sucessão. Um grupo apoiou o irmão de Oswaldo. Outro quer Cícero Bezerra da Silva, ligado a José Benedito de Souza, suspeito do crime que está foragido. Brasil']
output = '\n'.join(yourlistname)
This gives you a single string with the sentences separated by a newline; wrap it as [output] if you need a one-element list.
You can choose any separator other than \n.
I think this will help you:
>>> s = ['My name is', 'XYZ', 'I am from X']
>>> sen = ' '.join(s)
>>> sen
'My name is XYZ I am from X'

Transform characters to Portuguese special characters

I have this string:
>>> str(row['letra'][0])
'<p>[Baviera]<br/>Menina, me d\xc3\xa1 sua m\xc3\xa3o, pense bem antes de agir<br/>Se n\xc3\xa3o for agora, te espero l\xc3\xa1 fora, ent\xc3\xa3o deixe-me ir<br/>Um dia te encontro nessas suas voltas<br/>Minha mente \xc3\xa9 m\xc3\xb3 confus\xc3\xa3o<br/>Solta a minha m\xc3\xa3o, que eu sei que c\xc3\xaa volta<br/>O tempo mostra nossa dire\xc3\xa7\xc3\xa3o</p>'
And I want to display it with the Portuguese special characters, but when I try:
>>> unicode(str(row['letra'][0]).decode('utf-8')).encode('utf-8')
'<p>[Baviera]<br/>Menina, me d\xc3\xa1 sua m\xc3\xa3o, pense bem antes de agir<br/>Se n\xc3\xa3o for agora, te espero l\xc3\xa1 fora, ent\xc3\xa3o deixe-me ir<br/>Um dia te encontro nessas suas voltas<br/>Minha mente \xc3\xa9 m\xc3\xb3 confus\xc3\xa3o<br/>Solta a minha m\xc3\xa3o, que eu sei que c\xc3\xaa volta<br/>O tempo mostra nossa dire\xc3\xa7\xc3\xa3o</p>'
The characters don't come out as I want.
How can I transform 'd\xc3\xa1' to 'dá', for example?
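A minimal sketch of the conversion, written for Python 3 (the question uses Python 2, where the interactive prompt echoes the repr of a byte string with escapes, while print would normally show the characters on a UTF-8 terminal). The variable names and the shortened sample are assumptions made for illustration.

```python
# The bytes from the question, shortened to one word (UTF-8 encoded "dá").
raw = b"d\xc3\xa1"
decoded = raw.decode("utf-8")
print(decoded)

# If the data is already a str whose characters are really UTF-8 bytes ("dÃ¡"),
# going through Latin-1 recovers the original bytes before decoding.
mojibake = "d\xc3\xa1"
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired)
```

Which of the two cases applies depends on whether the value holds bytes or already-decoded text, so it is worth checking its type first.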

Classifying data from .arff files with scikit-learn?

In a previous post I learned about the process to follow for classifying text with scikit-learn. To organize my data better, I discovered .arff files. Let's say I have the following .arff file:
@relation lang_identification
@attribute opinion string
@attribute lang_identification {bos, pt, es, slov}
@data
"Pošto je EULEX obećao da će obaviti istragu o prošlosedmičnom izbijanju nasilja na sjeveru Kosova, taj incident predstavlja još jedan ispit kapaciteta misije da doprinese jačanju vladavine prava.",bos
"De todas as provações que teve de suplantar ao longo da vida, qual foi a mais difícil? O início. Qualquer começo apresenta dificuldades que parecem intransponíveis. Mas tive sempre a minha mãe do meu lado. Foi ela quem me ajudou a encontrar forças para enfrentar as situações mais decepcionantes, negativas, as que me punham mesmo furiosa.",pt
"Al parecer, Andrea Guasch pone que una relación a distancia es muy difícil de llevar como excusa. Algo con lo que, por lo visto, Alex Lequio no está nada de acuerdo. ¿O es que más bien ya ha conseguido la fama que andaba buscando?",es
"Vo väčšine golfových rezortov ide o veľký komplex niekoľkých ihrísk blízko pri sebe spojených s hotelmi a ďalšími možnosťami trávenia voľného času – nie vždy sú manželky či deti nadšenými golfistami, a tak potrebujú iný druh vyžitia. Zaujímavé kombinácie ponúkajú aj rakúske, švajčiarske či talianske Alpy, kde sa dá v zime lyžovať a v lete hrať golf pod vysokými alpskými končiarmi.",slov
I would like to experiment with scikit-learn and use a supervised approach to classify a completely new test string, for example:
test = "Por ello, ha insistido en que Europa tiene que darle un toque de atención porque Portugal esta incumpliendo la directiva del establecimiento del peaje"
SciPy provides an ARFF loader, so let's load the file with it:
from scipy.io.arff import loadarff
dataset = loadarff(open('/Users/user/Desktop/toy.arff','r'))
print(dataset)
This should return something like this: (array([]), ... How can I use NumPy record arrays to classify with scikit-learn?
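One way to get the texts and labels out of a file shaped like the one above is to parse the data section with the standard csv module. This is only a sketch: it assumes exactly two columns (a quoted string and a label), uses hypothetical shortened rows, and the helper name `read_text_label_rows` is made up. It sidesteps `loadarff`, which as far as I know has not supported string attributes, something worth checking against your installed SciPy version.

```python
import csv

# Hypothetical, shortened ARFF content with the same layout as the file in the question.
ARFF_TEXT = '''@relation lang_identification
@attribute opinion string
@attribute lang_identification {bos, pt, es, slov}
@data
"Al parecer, es muy difícil de llevar.",es
"Qual foi a mais difícil? O início.",pt
'''

def read_text_label_rows(arff_text):
    """Return (texts, labels) from the @data section of a two-column ARFF file."""
    lines = arff_text.splitlines()
    # Everything after the @data marker is comma-separated, with quoted strings.
    start = next(i for i, line in enumerate(lines) if line.strip().lower() == "@data") + 1
    rows = [row for row in csv.reader(lines[start:]) if row]
    texts = [row[0] for row in rows]
    labels = [row[1].strip() for row in rows]
    return texts, labels

texts, labels = read_text_label_rows(ARFF_TEXT)
print(texts)
print(labels)
```

The two lists can then be handed to a scikit-learn text pipeline, for example a TfidfVectorizer followed by a classifier fitted with fit(texts, labels), and the new test string passed to predict.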

Python re.match() not working on string with accented chars

The pattern works fine on test subjects containing no accented characters such as á é í ã õ ñ,
but it simply returns no matches when I try it on the actual Brazilian Portuguese accented text.
I tried changing encodings but got nowhere. Any help?
EDIT: Regex complete info here
HEX sample input: 50:72:6f:63:65:73:73:6f:20:31:30:35:36:39:32:32:2d:38:34:2e:32:30:31:33:2e:38:2e
:32:36:2e:30:31:30:30:20:2d:20:45:78:65:63:75:c3:a7:c3:a3:6f:20:64:65:20:54:c3:a
d:74:75:6c:6f:20:45:78:74:72:61:6a:75:64:69:63:69:61:6c:20:2d:20:45:73:70:c3:a9:
63:69:65:73:20:64:65:20:43:6f:6e:74:72:61:74:6f:73:20:2d:20:4d:4f:42:49:4c:49:4e
:53:20:46:4f:52:4d:41:c3:87:c3:83:4f:20:50:52:4f:46:49:53:53:49:4f:4e:41:4c:20:4
5:4d:20:42:45:4c:45:5a:41:20:4c:54:44:41:2e:20:2d:20:4a:55:4c:49:41:4e:41:20:4d:
41:52:41:4e:48:c3:83:4f:20:50:4f:52:54:4f:20:44:41:20:53:49:4c:56:45:49:52:41:20
:2d:20:56:69:73:74:6f:73:2e:20:44:65:66:69:72:6f:20:6f:20:70:65:64:69:64:6f:20:7
0:61:72:61:20:61:20:70:65:73:71:75:69:73:61:20:64:65:20:62:65:6e:73:20:64:61:20:
70:61:72:74:65:20:72:65:71:75:65:72:69:64:61:20:4a:55:4c:49:41:4e:41:20:4d:41:52
:41:4e:48:c3:83:4f:20:50:4f:52:54:4f:20:44:41:20:53:49:4c:56:45:49:52:41:2c:20:4
3:50:46:20:30:33:30:2e:37:39:37:2e:35:36:34:2d:39:35:20:28:64:65:63:6c:61:72:61:
c3:a7:c3:a3:6f:20:64:6f:73:20:63:69:6e:63:6f:20:c3:ba:6c:74:69:6d:6f:73:20:65:78
:65:72:63:c3:ad:63:69:6f:73:29:2c:20:6f:20:71:75:61:6c:20:c3:a9:20:72:65:61:6c:6
9:7a:61:64:6f:2c:20:6e:65:73:74:61:20:64:61:74:61:2c:20:70:6f:72:20:6d:65:69:6f:
20:64:65:20:6f:66:c3:ad:63:69:6f:20:65:6e:76:69:61:64:6f:20:c3:a0:20:52:65:63:65
:69:74:61:20:46:65:64:65:72:61:6c:2c:20:70:72:6f:74:6f:63:6f:6c:61:64:6f:20:65:6
c:65:74:72:6f:6e:69:63:61:6d:65:6e:74:65:2c:20:70:6f:72:20:69:6e:74:65:72:6d:c3:
a9:64:69:6f:20:64:6f:20:73:69:73:74:65:6d:61:20:49:4e:46:4f:4a:55:44:2e:20:49:6e
:74:69:6d:65:2d:73:65:2e:20:2d:20:41:44:56:3a:20:4d:41:54:48:45:55:53:20:44:45:2
0:4f:4c:49:56:45:49:52:41:20:54:41:56:41:52:45:53:20:28:4f:41:42:20:31:36:30:37:
31:31:2f:53:50:29:50:72:6f:63:65:73:73:6f:20:31:30:35:36:39:32:32:2d:38:34:2e:32
:30:31:33:2e:38:2e:32:36:2e:30:31:30:30:20:2d:20:45:78:65:63:75:c3:a7:c3:a3:6f:2
0:64:65:20:54:c3:ad:74:75:6c:6f:20:45:78:74:72:61:6a:75:64:69:63:69:61:6c:20:2d:
20:45:73:70:c3:a9:63:69:65:73:20:64:65:20:43:6f:6e:74:72:61:74:6f:73:20:2d:20:4d
:4f:42:49:4c:49:4e:53:20:46:4f:52:4d:41:c3:87:c3:83:4f:20:50:52:4f:46:49:53:53:4
9:4f:4e:41:4c:20:45:4d:20:42:45:4c:45:5a:41:20:4c:54:44:41:2e:20:2d:20:4a:55:4c:
49:41:4e:41:20:4d:41:52:41:4e:48:c3:83:4f:20:50:4f:52:54:4f:20:44:41:20:53:49:4c
:56:45:49:52:41:20:2d:20:56:69:73:74:6f:73:2e:20:31:29:20:43:69:c3:aa:6e:63:69:6
1:20:64:61:20:72:65:73:70:6f:73:74:61:20:64:6f:20:6f:66:c3:ad:63:69:6f:20:65:78:
70:65:64:69:64:6f:20:c3:a0:20:52:65:63:65:69:74:61:20:46:65:64:65:72:61:6c:2c:20
:66:69:63:61:6e:64:6f:20:6f:73:20:64:61:64:6f:73:20:73:69:67:69:6c:6f:73:6f:73:2
0:61:72:71:75:69:76:61:64:6f:73:20:65:6d:20:70:61:73:74:61:20:70:72:c3:b3:70:72:
69:61:2e:20:32:29:20:50:6f:72:20:63:6f:6e:73:65:67:75:69:6e:74:65:2c:20:61:20:70
:61:72:74:65:20:65:78:65:71:75:65:6e:74:65:20:64:65:76:65:20:6d:61:6e:69:66:65:7
3:74:61:72:2d:73:65:2c:20:65:6d:20:63:69:6e:63:6f:20:64:69:61:73:2e:20:4e:6f:20:
73:69:6c:c3:aa:6e:63:69:6f:2c:20:61:6f:20:61:72:71:75:69:76:6f:2e:20:49:6e:74:69
:6d:65:2d:73:65:2e:20:2d:20:41:44:56:3a:20:4d:41:54:48:45:55:53:20:44:45:20:4f:4
c:49:56:45:49:52:41:20:54:41:56:41:52:45:53:20:28:4f:41:42:20:31:36:30:37:31:31:
2f:53:50:29:50:72:6f:63:65:73:73:6f:20:31:30:35:37:32:38:30:2d:31:35:2e:32:30:31
:34:2e:38:2e:32:36:2e:30:31:30:30
This has nothing to do with accented characters. The answer provided to you doesn't work because:
In the new input the word Process was replaced with Processo.
The new input has several instances of the regular expression pattern, so re.findall should be invoked, rather than re.match (in fact, since the old input has several instances as well, that solution won't work perfectly there either).
Therefore, here is the correct solution:
>>> print input
Processo 1056922-84.2013.8.26.0100 - Execução de Título Extrajudicial - Espécies de Contratos - MOBILINS FORMAÇÃO PROFISSIONAL EM BELEZA LTDA. - JULIANA MARANHÃO PORTO DA SILVEIRA - Vistos. Defiro o pedido para a pesquisa de bens da parte requerida JULIANA MARANHÃO PORTO DA SILVEIRA, CPF 030.797.564-95 (declaração dos cinco últimos exercícios), o qual é realizado, nesta data, por meio de ofício enviado à Receita Federal, protocolado eletronicamente, por intermédio do sistema INFOJUD. Intime-se. - ADV: MATHEUS DE OLIVEIRA TAVARES (OAB 160711/SP)Processo 1056922-84.2013.8.26.0100 - Execução de Título Extrajudicial - Espécies de Contratos - MOBILINS FORMAÇÃO PROFISSIONAL EM BELEZA LTDA. - JULIANA MARANHÃO PORTO DA SILVEIRA - Vistos. 1) Ciência da resposta do ofício expedido à Receita Federal, ficando os dados sigilosos arquivados em pasta própria. 2) Por conseguinte, a parte exequente deve manifestar-se, em cinco dias. No silêncio, ao arquivo. Intime-se. - ADV: MATHEUS DE OLIVEIRA TAVARES (OAB 160711/SP)Processo 1057280-15.2014.8.26.0100
>>> regex = re.compile('(Processo \\d{7}\\-\\d{2}\\.\\d{4}\\.\\d+\\.\\d{2}\\.\\d{4}.*?)(?=Processo)|(Processo \\d{7}\\-\\d{2}\\.\\d{4}\\.\\d+\\.\\d{2}\\.\\d{4}.*)')
>>> regex.findall(input)
[('Processo 1056922-84.2013.8.26.0100 - Execu\xc3\xa7\xc3\xa3o de T\xc3\xadtulo Extrajudicial - Esp\xc3\xa9cies de Contratos - MOBILINS FORMA\xc3\x87\xc3\x83O PROFISSIONAL EM BELEZA LTDA. - JULIANA MARANH\xc3\x83O PORTO DA SILVEIRA - Vistos. Defiro o pedido para a pesquisa de bens da parte requerida JULIANA MARANH\xc3\x83O PORTO DA SILVEIRA, CPF 030.797.564-95 (declara\xc3\xa7\xc3\xa3o dos cinco \xc3\xbaltimos exerc\xc3\xadcios), o qual \xc3\xa9 realizado, nesta data, por meio de of\xc3\xadcio enviado \xc3\xa0 Receita Federal, protocolado eletronicamente, por interm\xc3\xa9dio do sistema INFOJUD. Intime-se. - ADV: MATHEUS DE OLIVEIRA TAVARES (OAB 160711/SP)', ''), ('Processo 1056922-84.2013.8.26.0100 - Execu\xc3\xa7\xc3\xa3o de T\xc3\xadtulo Extrajudicial - Esp\xc3\xa9cies de Contratos - MOBILINS FORMA\xc3\x87\xc3\x83O PROFISSIONAL EM BELEZA LTDA. - JULIANA MARANH\xc3\x83O PORTO DA SILVEIRA - Vistos. 1) Ci\xc3\xaancia da resposta do of\xc3\xadcio expedido \xc3\xa0 Receita Federal, ficando os dados sigilosos arquivados em pasta pr\xc3\xb3pria. 2) Por conseguinte, a parte exequente deve manifestar-se, em cinco dias. No sil\xc3\xaancio, ao arquivo. Intime-se. - ADV: MATHEUS DE OLIVEIRA TAVARES (OAB 160711/SP)', ''), ('', 'Processo 1057280-15.2014.8.26.0100')]
If both inputs are legal (i.e. the input may contain the word Process and may contain the word Processo), then this regular expression should be used:
>>> regex = re.compile('(Processo? \\d{7}\\-\\d{2}\\.\\d{4}\\.\\d+\\.\\d{2}\\.\\d{4}.*?)(?=Processo?)|(Processo? \\d{7}\\-\\d{2}\\.\\d{4}\\.\\d+\\.\\d{2}\\.\\d{4}.*)')
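As a variation on the expressions above, the two alternatives can be folded into one by letting each entry run up to the next case header or the end of the text. This is a sketch in Python 3, not the answer's original code; the names `HEADER` and `ENTRY` and the shortened sample text are assumptions for illustration.

```python
import re

# Case number such as 1056922-84.2013.8.26.0100, preceded by "Process" or "Processo".
HEADER = r"Processo? \d{7}-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}"

# Each entry runs from one header up to (not including) the next header or the end of the text.
ENTRY = re.compile(HEADER + r".*?(?=" + HEADER + r"|\Z)", re.DOTALL)

# Hypothetical, shortened version of the input above.
text = (
    "Processo 1056922-84.2013.8.26.0100 - Vistos. Intime-se. - ADV: MATHEUS DE OLIVEIRA TAVARES (OAB 160711/SP)"
    "Processo 1056922-84.2013.8.26.0100 - Vistos. No silêncio, ao arquivo. Intime-se. - ADV: MATHEUS DE OLIVEIRA TAVARES (OAB 160711/SP)"
    "Processo 1057280-15.2014.8.26.0100"
)

entries = ENTRY.findall(text)
for entry in entries:
    print(entry)
```

Because the pattern has no capturing groups, findall returns plain strings rather than tuples with an empty element, and looking ahead for a full header avoids cutting an entry at a stray occurrence of the word Processo in the body.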
