I have a CSV file that looks like the following:
Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;2842.020;2843.270;Unknown;; tecnici delle societ…
Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;2903.310;2906.360;Unknown;; pu• avere un profilo specifico
Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;2745.860;2749.060;Unknown;; Š quadruplicato rispetto al 1967.
Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;1023.580;1026.250;Unknown;; monitoraggio fosse completo e cosŤ via.
Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;708.870;711.290;Unknown;; Non solo un ponte, ma qualcosa di pi—.
Porta-a-Porta-d605218c-b8c5-4b3b-9086-b83e4c958bf5;4199.210;4200.540;Unknown;; piů straziante.
Porta-a-Porta-c28a23f4-d7b0-4624-8b49-72ba25be653e;4702.720;4703.900;Unknown;; tant'č che questo ragazzo
Presa-Diretta-Burocrazia-al-potere-ce58265f-da04-4b19-a1ad-2746830cac0a;4229.110;4232.130;Unknown;; a un testo di 13 pagine con 7/8.000 parole.<
Presa-Diretta-Burocrazia-al-potere-ce58265f-da04-4b19-a1ad-2746830cac0a;4541.560;4543.100;Unknown;; sei/otto ore al giorno.<
PresaDiretta-Il-capitale-naturale-8f39ea4f-a5fb-4c93-a504-a04d6482c086;1938.730;1941.830;Unknown;; abbattere i cervi.> Senza di loro, questa terra sarebbe
Quante-storie-15aef095-7ba8-4237-af6e-aded20d1d40a;19.920;22.630;Unknown;; questa puntata {an2}che ha come ospite una
Quante-storie-15aef095-7ba8-4237-af6e-aded20d1d40a;64.080;68.090;Unknown;; {an2}Sì, perché c'è come un ritegno a venire in una
Quante-storie-200b0694-7d54-4b5c-af5a-b54cae157ffd;446.730;447.790;Unknown;; della nostra Patria. {an2}[LA
Quante-storie-2583a3a2-2e8c-4589-bede-933736b65043;1781.910;1783.030;Unknown;; UDIBILI]
Porta-a-Porta-3b4b81d5-2f0f-4e51-9c29-00f9a2aa4444;4159.470;4160.890;Unknown;; bianca torneremo.#
Porta-a-Porta-3b4b81d5-2f0f-4e51-9c29-00f9a2aa4444;4196.930;4198.230;Unknown;; del sole#
I am trying to spot unwanted characters that do not belong in this file, such as <, {, {an2}, [ and so on.
This is the regex I have right now. It does the job well, except that it misses some cases such as {an2} or # shown above. I would like to find everything, including an2, and leave every Italian character as is.
[^a-zA-Z0-9;'"\.\- ,\?:£\]\[\/()%!èàéùòìíŕěúůňčÂŤŠÈÉôü&+<>##$%^…—‚–]
Let me know if there is any easier way to solve this problem.
My guess is that we could find those undesired parts and replace them with an empty string, using an expression similar to:
{.+?}|[\[\]<>]
Test
import re
regex = r"{.+?}|[\[\]<>]"
test_str = ("Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;2842.020;2843.270;Unknown;; tecnici delle societ…\n"
"Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;2903.310;2906.360;Unknown;; pu• avere un profilo specifico\n"
"Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;2745.860;2749.060;Unknown;; Š quadruplicato rispetto al 1967.\n"
"Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;1023.580;1026.250;Unknown;; monitoraggio fosse completo e cosŤ via.\n"
"Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;708.870;711.290;Unknown;; Non solo un ponte, ma qualcosa di pi—.\n"
"Porta-a-Porta-d605218c-b8c5-4b3b-9086-b83e4c958bf5;4199.210;4200.540;Unknown;; piů straziante.\n"
"Porta-a-Porta-c28a23f4-d7b0-4624-8b49-72ba25be653e;4702.720;4703.900;Unknown;; tant'č che questo ragazzo\n"
"Presa-Diretta-Burocrazia-al-potere-ce58265f-da04-4b19-a1ad-2746830cac0a;4229.110;4232.130;Unknown;; a un testo di 13 pagine con 7/8.000 parole.<\n"
"Presa-Diretta-Burocrazia-al-potere-ce58265f-da04-4b19-a1ad-2746830cac0a;4541.560;4543.100;Unknown;; sei/otto ore al giorno.<\n"
"PresaDiretta-Il-capitale-naturale-8f39ea4f-a5fb-4c93-a504-a04d6482c086;1938.730;1941.830;Unknown;; abbattere i cervi.> Senza di loro, questa terra sarebbe\n"
"Quante-storie-15aef095-7ba8-4237-af6e-aded20d1d40a;19.920;22.630;Unknown;; questa puntata {an2}che ha come ospite una\n"
"Quante-storie-15aef095-7ba8-4237-af6e-aded20d1d40a;64.080;68.090;Unknown;; {an2}Sì, perché c'è come un ritegno a venire in una\n"
"Quante-storie-200b0694-7d54-4b5c-af5a-b54cae157ffd;446.730;447.790;Unknown;; della nostra Patria. {an2}[LA\n"
"Quante-storie-2583a3a2-2e8c-4589-bede-933736b65043;1781.910;1783.030;Unknown;; UDIBILI]\n"
"Porta-a-Porta-3b4b81d5-2f0f-4e51-9c29-00f9a2aa4444;4159.470;4160.890;Unknown;; bianca torneremo.#\n"
"Porta-a-Porta-3b4b81d5-2f0f-4e51-9c29-00f9a2aa4444;4196.930;4198.230;Unknown;; del sole#")
subst = ""
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)
if result:
    print(result)
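The expression above does not cover the # marker mentioned in the question. A minimal sketch of a variant that also removes it, assuming # never occurs legitimately in the text field; the names UNWANTED and clean_line are made up for illustration:

```python
import re

# Hypothetical helper: strip {...} tags plus stray [, ], <, > and # markers.
# Assumption: none of these characters is ever legitimate in the subtitle text.
UNWANTED = re.compile(r"\{[^{}]*\}|[\[\]<>#]")

def clean_line(line):
    """Return the line with tags and stray markers removed."""
    return UNWANTED.sub("", line)

print(clean_line("della nostra Patria. {an2}[LA"))
print(clean_line("{an2}Sì, perché c'è come un ritegno"))
print(clean_line("del sole#"))
```

Accented Italian letters are untouched because the pattern only lists what to delete, rather than what to keep.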
import re
#example
input_text = 'Alrededor de las 00:16 am o las 23:30 pm , quizas cerca del 2022_-_02_-_18 llega el avion, pero no a las (2022_-_02_-_18 00:16 am), de esos hay dos (22)'
identify_time_regex = r"(?P<hh>\d{2}):(?P<mm>\d{2})[\s|]*(?P<am_or_pm>(?:am|pm))"
restructuring_structure_00 = r"(\g<hh>----\g<mm>----\g<am_or_pm>)"
#replacement
input_text = re.sub(identify_time_regex, restructuring_structure_00, input_text)
print(repr(input_text)) # --> output
I need to change identify_time_regex so that a time is rewritten only if it is not preceded by a date, as it is in (2022_-_02_-_18 00:16 am). The date-plus-time structure can be generalized as follows:
r"(\d*_-_\d{2}_-_\d{2}) " + identify_time_regex
This is the output I need; note that only the times with no date before them were modified:
input_text = 'Alrededor de las 00----16----am o las 23----30----pm , quizas cerca del 2022_-_02_-_18 llega el avion, pero no a las (2022_-_02_-_18 00:16 am), de esos hay dos (22)'
You can use
import re
input_text = 'Alrededor de las 00:16 am o las 23:30 pm , quizas cerca del 2022_-_02_-_18 llega el avion, pero no a las (2022_-_02_-_18 00:16 am), de esos hay dos (22)'
identify_time_regex = r"(\b\d{4}_-_\d{2}_-_\d{2}\s+)?(?P<hh>\d{2}):(?P<mm>\d{2})[\s|]*(?P<am_or_pm>[ap]m)"
restructuring_structure_00 = lambda x: x.group() if x.group(1) else fr"{x.group('hh')}----{x.group('mm')}----{x.group('am_or_pm')}"
input_text = re.sub(identify_time_regex, restructuring_structure_00, input_text)
print(input_text)
# Alrededor de las 00----16----am o las 23----30----pm , quizas cerca del 2022_-_02_-_18 llega el avion, pero no a las (2022_-_02_-_18 00:16 am), de esos hay dos (22)
See the Python demo.
The logic is the following: if the optional capturing group (\b\d{4}_-_\d{2}_-_\d{2}\s+)? matches, the replacement is the whole match (i.e. nothing changes); if it does not, your replacement takes place.
The restructuring_structure_00 argument must be a callable (here, a lambda) rather than a replacement string, since the match object has to be inspected before the replacement text can be chosen.
The \b\d{4}_-_\d{2}_-_\d{2}\s+ pattern matches a word boundary, four digits, _-_, two digits, _-_, two digits, and one or more whitespaces.
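The same idea can be written with a named function instead of a lambda, which some readers find easier to follow. This is only a sketch of the approach described above; the names DATE_THEN_TIME, restructure_time and rewrite_undated_times are invented for the example:

```python
import re

# Optional date prefix (group 1) followed by the time with named groups.
DATE_THEN_TIME = re.compile(
    r"(\b\d{4}_-_\d{2}_-_\d{2}\s+)?"
    r"(?P<hh>\d{2}):(?P<mm>\d{2})[\s|]*(?P<am_or_pm>[ap]m)"
)

def restructure_time(match):
    """Leave dated times alone; rewrite undated ones as hh----mm----am/pm."""
    if match.group(1):
        # A date precedes the time: return the match unchanged.
        return match.group()
    return f"{match.group('hh')}----{match.group('mm')}----{match.group('am_or_pm')}"

def rewrite_undated_times(text):
    return DATE_THEN_TIME.sub(restructure_time, text)

print(rewrite_undated_times("a las 00:16 am, pero no a las (2022_-_02_-_18 00:16 am)"))
```

re.sub calls restructure_time once per match and uses its return value as the replacement text.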
I'm using Scrapy on HTML like:
<td nowrap="" valign="top" align="right">
<br>
Text is here.
<br>
Other text is here
<br>
</td>
td[1]/text()[1] gives me:
(empty line)
Text is here.
I've tried normalize-space, i.e. normalize-space(td[1]/text()[1]), which works when I test it in my Firefox extension, but not in Scrapy. I think Scrapy is getting tripped up by the \n and skips over the node (or only takes the first line of the node, which is empty). I've also tried some "preceding" and "following" code, but I think it might be considered one element; my DOM says the nodeValue = "\nText is here". Any thoughts?
Extract every text node, then pick the desired one by index. For instance:
response.xpath("//table[@id='myid']/tr[1]/td[1]//text()")[1]
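Because whitespace-only text nodes shift the indices, it can help to strip and filter the extracted strings before indexing. A minimal sketch in plain Python, assuming extracted holds the list of strings that calling .extract() on the selector would return for the HTML in the question; both the sample list and the helper name non_empty_text are made up for illustration:

```python
def non_empty_text(nodes):
    """Strip each extracted string and drop the ones that are only whitespace."""
    return [s.strip() for s in nodes if s.strip()]

# Hypothetical sample of what the text nodes of the <td> might look like.
extracted = ["\n", "\nText is here.\n", "\nOther text is here\n", "\n"]

lines = non_empty_text(extracted)
print(lines[0])
```

After filtering, index 0 refers to the first visible line regardless of how many empty nodes precede it.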
Demo from the Scrapy Shell:
$ scrapy shell http://www.trobar.org/troubadours/coms_de_peiteu/guilhen_de_peiteu_01.php
In [1]: table = response.xpath("//table")[2]
In [2]: td = "".join(table.xpath(".//td[1]//text()").extract())
In [3]: print(td)
Companho, farai un vers qu'er covinen,
Et aura-i mais de foudatz no-y a de sen,
Et er totz mesclatz d'amor e de joy e de joven.
E tenguatz lo per vilan qui no-l enten,
O dins son cor voluntiers non l'apren:
Greu partir si fai d'amor qui la troba a talen.
Dos cavalhs ai a ma sselha, ben e gen,
Bon son et adreg per armas e valen,
E no-ls puesc amdos tener, que l'us l'autre non cossen.
Si-ls pogues adomesjar a mon talen,
Ja no volgr'alhors mudar mon garnimen,
Que meils for'encavalguatz de nuill ome viven.
Launs fon dels montaniers lo plus corren,
Mas aitan fer' estranhez'a longuamen
Et es tan fers e salvatges, que del bailar si defen.
L'autre fon noyritz sa jus part Cofolen
Ez anc no-n vis bellazor, mon escien:
Aquest non er ja camjatz ni per aur ni per argen.
Qu'ie-l donei a son senhor polin payssen,
Pero si-m retinc ieu tan de covenen
Que, s'ilh lo tenia un an, qu'ieu lo tengues mais de cen.
Cavalier, datz mi cosselh d'un pessamen:
-Anc mays no fuy issaratz de cauzimen- :
Res non sai ab qual me tengua, de n'Agnes o de n'Arsen.
De Gimel ai lo castel e-l mandamen,
E per Niol fauc ergueill a tota gen:
C'ambedui me son jurat e plevit per sagramen.
In a previous post I learned about the process for classifying text with scikit-learn. To organize my data better, I discovered .arff files. Let's say I have the following .arff file:
@relation lang_identification
@attribute opinion string
@attribute lang_identification {bos, pt, es, slov}
@data
"Pošto je EULEX obećao da će obaviti istragu o prošlosedmičnom izbijanju nasilja na sjeveru Kosova, taj incident predstavlja još jedan ispit kapaciteta misije da doprinese jačanju vladavine prava.",bos
"De todas as provações que teve de suplantar ao longo da vida, qual foi a mais difícil? O início. Qualquer começo apresenta dificuldades que parecem intransponíveis. Mas tive sempre a minha mãe do meu lado. Foi ela quem me ajudou a encontrar forças para enfrentar as situações mais decepcionantes, negativas, as que me punham mesmo furiosa.",pt
"Al parecer, Andrea Guasch pone que una relación a distancia es muy difícil de llevar como excusa. Algo con lo que, por lo visto, Alex Lequio no está nada de acuerdo. ¿O es que más bien ya ha conseguido la fama que andaba buscando?",es
"Vo väčšine golfových rezortov ide o veľký komplex niekoľkých ihrísk blízko pri sebe spojených s hotelmi a ďalšími možnosťami trávenia voľného času – nie vždy sú manželky či deti nadšenými golfistami, a tak potrebujú iný druh vyžitia. Zaujímavé kombinácie ponúkajú aj rakúske, švajčiarske či talianske Alpy, kde sa dá v zime lyžovať a v lete hrať golf pod vysokými alpskými končiarmi.",slov
I would like to experiment with scikit-learn and classify, with a supervised approach, a completely new test string, let's say:
test = "Por ello, ha insistido en que Europa tiene que darle un toque de atención porque Portugal esta incumpliendo la directiva del establecimiento del peaje"
SciPy provides an ARFF loader, so let's load the file with it:
from scipy.io.arff import loadarff
dataset = loadarff(open('/Users/user/Desktop/toy.arff','r'))
print dataset
This should return something like (array([]), so how can I use NumPy record arrays to classify with scikit-learn?
I have a text with some POS tags and some words. I wrote a regex to generate bigrams that look like this: [('word', 'POS-tag', 'word', 'POS-tag'), ('word', 'POS-tag', 'word', 'POS-tag')]
This is what I have done so far:
# -*- coding: utf-8 -*-
import re
test_string= '''
Es ser VSIP3S0 1
muy muy RG 1
fácil fácil AQ0CS0 1
de de SPS00 0.999984
Por por SPS00 1
decir decir VMN0000 0.997512
algo algo PI0CS000 0.900246
malo malo AQ0MS0 0.657087
de de SPS00 0.999984
ella él PP3FS000 1
, , Fc 1
sería ser VSIC1S0 0.5
que que CS 0.437483
cuando cuando CS 0.985595
centrifuga centrifugar VMIP3S0 0.994859
, , Fc 1
algo algo PI0CS000 0.900246
que que PR0CN000 0.562517
hace hacer VMIP3S0 1
muy muy RG 1
bien bien RG 0.902728
sitio sitio NCMS000 0.980769
'''
regex = re.findall(r'^(\w+)\s\w+\s(RG)\s[0-9.]+\n^(\w+)\s\w+\s(AQ0CS0)', test_string, re.M)
print "\n This is a bigram:"
print regex
The problem is that when I try to return all the words tagged RG and AQ0CS0 that appear consecutively, the result is empty. How can I solve this? The output should look like this:
This is a bigram:
[('muy', 'RG'),('fácil','AQ0CS0')]
If you need to match Unicode characters, as in your example data, you need to set the Unicode flag, re.U or re.UNICODE:
>>> re.findall(r'^(\w+)\s\w+\s(RG)\s[0-9.]+\n^(\w+)\s\w+\s(AQ0CS0)', test_string, re.M|re.U)
[('muy', 'RG', 'f\xe1cil', 'AQ0CS0')]
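The session above is Python 2. As a hedged illustration for Python 3, where str patterns are Unicode-aware by default, the sketch below runs the same pattern on a shortened, made-up sample and uses re.ASCII to reproduce the empty result from the question:

```python
import re

# Shortened two-line sample in the same layout as the question's data.
SAMPLE = "muy muy RG 1\nfácil fácil AQ0CS0 1\n"
PATTERN = r'^(\w+)\s\w+\s(RG)\s[0-9.]+\n^(\w+)\s\w+\s(AQ0CS0)'

# Default in Python 3: \w matches accented letters in str patterns.
unicode_matches = re.findall(PATTERN, SAMPLE, re.M)

# re.ASCII restricts \w to [a-zA-Z0-9_], so "fácil" no longer matches.
ascii_matches = re.findall(PATTERN, SAMPLE, re.M | re.ASCII)

print(unicode_matches)
print(ascii_matches)
```

The only difference between the two calls is whether \w is allowed to match the "á" in fácil.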
The problem is the "á" character in fácil. It is not an ASCII letter, so \w does not match it. You can use the regex below, which replaces \w+ with .+ where accented characters can occur:
re.findall(r'^(\w+)\s.+\s(RG)\s[0-9.]+\n^(.+)\s.+\s(AQ0CS0)', test_string, re.M)
The pattern works fine on test input that contains no accented characters such as á é í ã õ ñ,
but it simply returns no matches when I run it over the actual Brazilian Portuguese text with accents.
I tried changing encodings but got nowhere. Any help?
EDIT: Regex complete info here
HEX sample input: 50:72:6f:63:65:73:73:6f:20:31:30:35:36:39:32:32:2d:38:34:2e:32:30:31:33:2e:38:2e
:32:36:2e:30:31:30:30:20:2d:20:45:78:65:63:75:c3:a7:c3:a3:6f:20:64:65:20:54:c3:a
d:74:75:6c:6f:20:45:78:74:72:61:6a:75:64:69:63:69:61:6c:20:2d:20:45:73:70:c3:a9:
63:69:65:73:20:64:65:20:43:6f:6e:74:72:61:74:6f:73:20:2d:20:4d:4f:42:49:4c:49:4e
:53:20:46:4f:52:4d:41:c3:87:c3:83:4f:20:50:52:4f:46:49:53:53:49:4f:4e:41:4c:20:4
5:4d:20:42:45:4c:45:5a:41:20:4c:54:44:41:2e:20:2d:20:4a:55:4c:49:41:4e:41:20:4d:
41:52:41:4e:48:c3:83:4f:20:50:4f:52:54:4f:20:44:41:20:53:49:4c:56:45:49:52:41:20
:2d:20:56:69:73:74:6f:73:2e:20:44:65:66:69:72:6f:20:6f:20:70:65:64:69:64:6f:20:7
0:61:72:61:20:61:20:70:65:73:71:75:69:73:61:20:64:65:20:62:65:6e:73:20:64:61:20:
70:61:72:74:65:20:72:65:71:75:65:72:69:64:61:20:4a:55:4c:49:41:4e:41:20:4d:41:52
:41:4e:48:c3:83:4f:20:50:4f:52:54:4f:20:44:41:20:53:49:4c:56:45:49:52:41:2c:20:4
3:50:46:20:30:33:30:2e:37:39:37:2e:35:36:34:2d:39:35:20:28:64:65:63:6c:61:72:61:
c3:a7:c3:a3:6f:20:64:6f:73:20:63:69:6e:63:6f:20:c3:ba:6c:74:69:6d:6f:73:20:65:78
:65:72:63:c3:ad:63:69:6f:73:29:2c:20:6f:20:71:75:61:6c:20:c3:a9:20:72:65:61:6c:6
9:7a:61:64:6f:2c:20:6e:65:73:74:61:20:64:61:74:61:2c:20:70:6f:72:20:6d:65:69:6f:
20:64:65:20:6f:66:c3:ad:63:69:6f:20:65:6e:76:69:61:64:6f:20:c3:a0:20:52:65:63:65
:69:74:61:20:46:65:64:65:72:61:6c:2c:20:70:72:6f:74:6f:63:6f:6c:61:64:6f:20:65:6
c:65:74:72:6f:6e:69:63:61:6d:65:6e:74:65:2c:20:70:6f:72:20:69:6e:74:65:72:6d:c3:
a9:64:69:6f:20:64:6f:20:73:69:73:74:65:6d:61:20:49:4e:46:4f:4a:55:44:2e:20:49:6e
:74:69:6d:65:2d:73:65:2e:20:2d:20:41:44:56:3a:20:4d:41:54:48:45:55:53:20:44:45:2
0:4f:4c:49:56:45:49:52:41:20:54:41:56:41:52:45:53:20:28:4f:41:42:20:31:36:30:37:
31:31:2f:53:50:29:50:72:6f:63:65:73:73:6f:20:31:30:35:36:39:32:32:2d:38:34:2e:32
:30:31:33:2e:38:2e:32:36:2e:30:31:30:30:20:2d:20:45:78:65:63:75:c3:a7:c3:a3:6f:2
0:64:65:20:54:c3:ad:74:75:6c:6f:20:45:78:74:72:61:6a:75:64:69:63:69:61:6c:20:2d:
20:45:73:70:c3:a9:63:69:65:73:20:64:65:20:43:6f:6e:74:72:61:74:6f:73:20:2d:20:4d
:4f:42:49:4c:49:4e:53:20:46:4f:52:4d:41:c3:87:c3:83:4f:20:50:52:4f:46:49:53:53:4
9:4f:4e:41:4c:20:45:4d:20:42:45:4c:45:5a:41:20:4c:54:44:41:2e:20:2d:20:4a:55:4c:
49:41:4e:41:20:4d:41:52:41:4e:48:c3:83:4f:20:50:4f:52:54:4f:20:44:41:20:53:49:4c
:56:45:49:52:41:20:2d:20:56:69:73:74:6f:73:2e:20:31:29:20:43:69:c3:aa:6e:63:69:6
1:20:64:61:20:72:65:73:70:6f:73:74:61:20:64:6f:20:6f:66:c3:ad:63:69:6f:20:65:78:
70:65:64:69:64:6f:20:c3:a0:20:52:65:63:65:69:74:61:20:46:65:64:65:72:61:6c:2c:20
:66:69:63:61:6e:64:6f:20:6f:73:20:64:61:64:6f:73:20:73:69:67:69:6c:6f:73:6f:73:2
0:61:72:71:75:69:76:61:64:6f:73:20:65:6d:20:70:61:73:74:61:20:70:72:c3:b3:70:72:
69:61:2e:20:32:29:20:50:6f:72:20:63:6f:6e:73:65:67:75:69:6e:74:65:2c:20:61:20:70
:61:72:74:65:20:65:78:65:71:75:65:6e:74:65:20:64:65:76:65:20:6d:61:6e:69:66:65:7
3:74:61:72:2d:73:65:2c:20:65:6d:20:63:69:6e:63:6f:20:64:69:61:73:2e:20:4e:6f:20:
73:69:6c:c3:aa:6e:63:69:6f:2c:20:61:6f:20:61:72:71:75:69:76:6f:2e:20:49:6e:74:69
:6d:65:2d:73:65:2e:20:2d:20:41:44:56:3a:20:4d:41:54:48:45:55:53:20:44:45:20:4f:4
c:49:56:45:49:52:41:20:54:41:56:41:52:45:53:20:28:4f:41:42:20:31:36:30:37:31:31:
2f:53:50:29:50:72:6f:63:65:73:73:6f:20:31:30:35:37:32:38:30:2d:31:35:2e:32:30:31
:34:2e:38:2e:32:36:2e:30:31:30:30
This has nothing to do with accented characters. The answer provided to you doesn't work because:
In the new input the word Process was replaced with Processo.
The new input has several instances of the regular expression pattern, so re.findall should be invoked, rather than re.match (in fact, since the old input has several instances as well, that solution won't work perfectly there either).
Therefore, here is the correct solution:
>>> print input
Processo 1056922-84.2013.8.26.0100 - Execução de Título Extrajudicial - Espécies de Contratos - MOBILINS FORMAÇÃO PROFISSIONAL EM BELEZA LTDA. - JULIANA MARANHÃO PORTO DA SILVEIRA - Vistos. Defiro o pedido para a pesquisa de bens da parte requerida JULIANA MARANHÃO PORTO DA SILVEIRA, CPF 030.797.564-95 (declaração dos cinco últimos exercícios), o qual é realizado, nesta data, por meio de ofício enviado à Receita Federal, protocolado eletronicamente, por intermédio do sistema INFOJUD. Intime-se. - ADV: MATHEUS DE OLIVEIRA TAVARES (OAB 160711/SP)Processo 1056922-84.2013.8.26.0100 - Execução de Título Extrajudicial - Espécies de Contratos - MOBILINS FORMAÇÃO PROFISSIONAL EM BELEZA LTDA. - JULIANA MARANHÃO PORTO DA SILVEIRA - Vistos. 1) Ciência da resposta do ofício expedido à Receita Federal, ficando os dados sigilosos arquivados em pasta própria. 2) Por conseguinte, a parte exequente deve manifestar-se, em cinco dias. No silêncio, ao arquivo. Intime-se. - ADV: MATHEUS DE OLIVEIRA TAVARES (OAB 160711/SP)Processo 1057280-15.2014.8.26.0100
>>> regex = re.compile('(Processo \\d{7}\\-\\d{2}\\.\\d{4}\\.\\d+\\.\\d{2}\\.\\d{4}.*?)(?=Processo)|(Processo \\d{7}\\-\\d{2}\\.\\d{4}\\.\\d+\\.\\d{2}\\.\\d{4}.*)')
>>> regex.findall(input)
[('Processo 1056922-84.2013.8.26.0100 - Execu\xc3\xa7\xc3\xa3o de T\xc3\xadtulo Extrajudicial - Esp\xc3\xa9cies de Contratos - MOBILINS FORMA\xc3\x87\xc3\x83O PROFISSIONAL EM BELEZA LTDA. - JULIANA MARANH\xc3\x83O PORTO DA SILVEIRA - Vistos. Defiro o pedido para a pesquisa de bens da parte requerida JULIANA MARANH\xc3\x83O PORTO DA SILVEIRA, CPF 030.797.564-95 (declara\xc3\xa7\xc3\xa3o dos cinco \xc3\xbaltimos exerc\xc3\xadcios), o qual \xc3\xa9 realizado, nesta data, por meio de of\xc3\xadcio enviado \xc3\xa0 Receita Federal, protocolado eletronicamente, por interm\xc3\xa9dio do sistema INFOJUD. Intime-se. - ADV: MATHEUS DE OLIVEIRA TAVARES (OAB 160711/SP)', ''), ('Processo 1056922-84.2013.8.26.0100 - Execu\xc3\xa7\xc3\xa3o de T\xc3\xadtulo Extrajudicial - Esp\xc3\xa9cies de Contratos - MOBILINS FORMA\xc3\x87\xc3\x83O PROFISSIONAL EM BELEZA LTDA. - JULIANA MARANH\xc3\x83O PORTO DA SILVEIRA - Vistos. 1) Ci\xc3\xaancia da resposta do of\xc3\xadcio expedido \xc3\xa0 Receita Federal, ficando os dados sigilosos arquivados em pasta pr\xc3\xb3pria. 2) Por conseguinte, a parte exequente deve manifestar-se, em cinco dias. No sil\xc3\xaancio, ao arquivo. Intime-se. - ADV: MATHEUS DE OLIVEIRA TAVARES (OAB 160711/SP)', ''), ('', 'Processo 1057280-15.2014.8.26.0100')]
If both inputs are legal (i.e. the input may contain the word Process and may contain the word Processo), then this regular expression should be used:
>>> regex = re.compile('(Processo? \\d{7}\\-\\d{2}\\.\\d{4}\\.\\d+\\.\\d{2}\\.\\d{4}.*?)(?=Processo?)|(Processo? \\d{7}\\-\\d{2}\\.\\d{4}\\.\\d+\\.\\d{2}\\.\\d{4}.*)')
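As a variation on the expression above (not the same regex), the two alternatives can be folded into one by letting the lookahead also accept the end of the input. The sketch below is written for Python 3 with a shortened, made-up sample; the names RECORD and split_records are hypothetical:

```python
import re

# One record: "Process"/"Processo", the case number, then everything up to
# the next record header or the end of the text.
RECORD = re.compile(
    r"Processo? \d{7}-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}.*?(?=Processo? \d{7}-|\Z)",
    re.DOTALL,
)

def split_records(text):
    """Return each record as one string (the pattern has no capturing groups)."""
    return RECORD.findall(text)

sample = (
    "Processo 1056922-84.2013.8.26.0100 - Execução de Título - ADV: X (OAB 160711/SP)"
    "Processo 1057280-15.2014.8.26.0100"
)
records = split_records(sample)
for record in records:
    print(record)
```

Since there are no capturing groups, findall returns plain strings instead of tuples with empty slots.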