Problems fixing a python tuple regex?

I have a large .txt file that is made up of: word1, word2, id, number as follows:
s = '''
Vaya ir VMM03S0 0.427083
mañanita mañana RG 0.796611
, , Fc 1
buscando buscar VMG0000 1
una uno DI0FS0 0.951575
lavadora lavadora NCFS000 0.414738
con con SPS00 1
la el DA0FS0 0.972269
que que PR0CN000 0.562517
sorprender sorprender VMN0000 1
a a SPS00 0.996023
una uno DI0FS0 0.951575
persona persona NCFS000 0.98773
muy muy RG 1
especiales especial AQ0CS0 1
para para SPS00 0.999103
nosotros nosotros PP1MP000 1
, , Fc 1
y y CC 0.999962
la lo PP3FSA00 0.0277039
encontramos encontrar VMIP1P0 0.65
. . Fp 1
Pero pero CC 0.999764
vamos ir VMIP1P0 0.655914
a a SPS00 0.996023
lo el DA0NS0 0.457533
que que PR0CN000 0.562517
interesa interesar VMIP3S0 0.994868
LO_QUE_INTERESA_La lo_que_interesa_la NP00000 1
lavadora lavador AQ0FS0 0.585262
tiene tener VMIP3S0 1
una uno DI0FS0 0.951575
clasificación clasificación NCFS000 1
A+ a+ NP00000 1
, , Fc 1
de de SPS00 0.999984
las el DA0FP0 0.970954
que que PR0CN000 0.562517
ahorran ahorrar VMIP3P0 1
energía energía NCFS000 1
, , Fc 1
si si CS 0.99954
no no RN 0.998134
me me PP1CS000 0.89124
equivoco equivocar VMIP1S0 1
. . Fp 1
Lava lavar VMIP3S0 0.397388
hasta hasta SPS00 0.957698
7 7 Z 1
kg kilogramo NCMN000 1
, , Fc 1
no no RN 0.998134
está estar VAIP3S0 0.999201
nada nada RG 0.135196
mal mal RG 0.497537
, , Fc 1
se se P00CN000 0.465639
le le PP3CSD00 1
veía ver VMII3S0 0.62272
un uno DI0MS0 0.987295
gran gran AQ0CS0 1
tambor tambor NCMS000 1
( ( Fpa 1
de de SPS00 0.999984
acero acero NCMS000 0.973481
inoxidable inoxidable AQ0CS0 1
) ) Fpt 1
y y CC 0.999962
un uno DI0MS0 0.987295
consumo consumo NCMS000 0.948927
máximo máximo AQ0MS0 0.986111
de de SPS00 0.999984
49 49 Z 1
litros litro NCMP000 1
Mandos mandos NP00000 1
intuitivos intuitivo AQ0MP0 1
, , Fc 1
todo todo PI0MS000 0.43165
muy muy RG 1
bien bien RG 0.902728
explicado explicar VMP00SM 1
, , Fc 1
nada nada PI0CS000 0.850279
que que PR0CN000 0.562517
ver ver VMN0000 0.997382
con con SPS00 1
hola RG 0.90937838
como VMP00SM 1
estas AQ089FG 0.90839
la el DA0FS0 0.972269
lavadora lavadora NCFS000 0.414738
de de SPS00 0.999984
casa casa NCFS000 0.979058
de de SPS00 0.999984
mis mi DP1CPS 0.995868
padres padre NCMP000 1
Además además NP00000 1
también también RG 1
seca seco AQ0FS0 0.45723
preciadas preciar VMP00PF 1
. . Fp 1'''
For example, for the string s above (standing in for the file), I would like to extract every id that starts with RG and is immediately followed by an id that starts with AQ, together with their word2. The two rows must occur one right after the other. In the example above, these rows are consecutive:
muy muy RG 1
especiales especial AQ0CS0 1
These rows, on the other hand, are not consecutive, so I would not like to extract them as a tuple:
hola RG 0.90937838
como VMP00SM 1
estas AQ089FG 0.90839
I would like to write a regex that extracts, as a list of tuples, only the word2 followed by its id, like this: [('word2','id')], for the whole .txt file and for every pair of rows that satisfies the consecutive order. For the example above, the only valid matches are:
muy muy RG 1
especiales especial AQ0CS0 1
and
también también RG 1
seca seco AQ0FS0 0.45723
These should then be returned as tuples with their full ids, since they preserve the consecutive order:
[('muy', 'RG', 'especial', 'AQ0CS0'), ('también', 'RG', 'seco', 'AQ0FS0')]
I tried the following:
in:
t = re.findall(r'(\w+)\s*(RG)[^\n]*\n[^\n]*?(\w+)\s*(AQ\w*)', s)
print t
But my output is wrong: it drops the accented characters and everything before them:
out:
[('muy', 'RG', 'especial', 'AQ0CS0'), ('n', 'RG', 'seco', 'AQ0FS0')]
instead of the correct result:
[('muy', 'RG', 'especial', 'AQ0CS0'), ('también', 'RG', 'seco', 'AQ0FS0')]
Could someone help me understand what went wrong in my example above, and how to fix it so that it captures the word2 and id of rows that occur one right after the other? Thanks in advance.

If you wanted the full ID included, then add that to your regular expression:
re.findall(r'^(\w+)\s.+\s(RG)\s[0-9.]+\n(.+)\s.+\s(AQ[A-Z0-9]+)', s, re.M)
Note that in Python 2, \w won't match non-ASCII characters in a byte string. That is why también was cut down to n: the bytes of the é are not word characters, so \w+ only picks up the final letter. Decode s to unicode and use a Unicode regex:
re.findall(r'^(\w+)\s.+\s(RG)\s[0-9.]+\n(.+)\s.+\s(AQ[A-Z0-9]+)',
           s.decode('utf8'), re.M | re.UNICODE)
What codec to use for decoding depends on your input file; I picked UTF-8 here as an example, but that is not necessarily correct.
Demo:
>>> re.findall(r'^(\w+)\s.+\s(RG)\s[0-9.]+\n^(.+)\s.+\s(AQ[A-Z0-9]+)',
... s.decode('utf8'), re.M | re.UNICODE)
[(u'muy', u'RG', u'especiales', u'AQ0CS0'), (u'muy', u'RG', u'sencilla', u'AQ3948')]
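For readers on Python 3, where str patterns are Unicode-aware by default, the same idea can be sketched without a decoding step. The sample string below is a shortened stand-in built from rows in the question (the hola/como rows are given four columns here as an assumption), and the name PAIR is illustrative only. Unlike the regex above, this sketch captures word2 from both lines.

```python
import re

# Shortened sample in the question's format: word1 word2 id number
sample = """muy muy RG 1
especiales especial AQ0CS0 1
hola hola RG 0.90937838
como como VMP00SM 1
también también RG 1
seca seco AQ0FS0 0.45723
"""

# Each half of the pattern is tied to one line: [^\S\n]+ matches spaces
# or tabs but never a newline, so a match cannot skip over a row.
PAIR = re.compile(
    r"^\S+[^\S\n]+(\S+)[^\S\n]+(RG)[^\S\n]+[0-9.]+\n"
    r"\S+[^\S\n]+(\S+)[^\S\n]+(AQ\w*)",
    re.MULTILINE,
)

pairs = PAIR.findall(sample)
print(pairs)
# [('muy', 'RG', 'especial', 'AQ0CS0'), ('también', 'RG', 'seco', 'AQ0FS0')]
```

The hola/como rows are skipped because the row directly after the RG row does not carry an AQ id.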

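def code(aline):
    # Return the first two letters of the id column, or '' for malformed lines.
    try:
        a, b, c, d = aline.split()
        return c[:2]
    except ValueError:
        return ''

result = []
l2 = ''
with open('texte.txt') as fp:
    for l3 in fp:
        l1, l2 = l2, l3  # l1 is the previous line, l2 the current one
        if code(l1) == 'RG' and code(l2) == 'AQ':
            a, b, c, d = l1.split()
            e, g, h, j = l2.split()
            result.append((b, c, g, h))  # word2 and id of both lines
print(result)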

Related

Add a comma after two words in pandas

I have the following texts in a df column:
La Palma
La Palma Nueva
La Palma, Nueva Concepcion
El Estor
El Estor Nuevo
Nuevo Leon
San Jose
La Paz Colombia
Mexico Distrito Federal
El Estor, Nuevo Lugar
What I need is to add a comma at the end of each row, but only when the row consists of exactly two words. I found a partial solution, which adds a comma to every row:
df['Column3'] = df['Column3'].apply(lambda x: str(x)+',')
(solution found on Stack Overflow)

Given:
words
0 La Palma
1 La Palma Nueva
2 La Palma, Nueva Concepcion
3 El Estor
4 El Estor Nuevo
5 Nuevo Leon
6 San Jose
7 La Paz Colombia
8 Mexico Distrito Federal
9 El Estor, Nuevo Lugar
Doing:
df.words = df.words.apply(lambda x: x+',' if len(x.split(' ')) == 2 else x)
print(df)
Outputs:
words
0 La Palma,
1 La Palma Nueva
2 La Palma, Nueva Concepcion
3 El Estor,
4 El Estor Nuevo
5 Nuevo Leon,
6 San Jose,
7 La Paz Colombia
8 Mexico Distrito Federal
9 El Estor, Nuevo Lugar

How to append key and values to a dataframe without creating each time row of column name?

I am trying to append the keys and values of a dict to a dataframe. While looping over the dict, I create the columns by name. It works so far, but there is a problem: each pass through the loop produces a new row of column names, which I would like to avoid. How can I do that?
Input data in the dictionary:
{'all': ['463\tQuels problèmes ce concept résout-il ? Nous revenons à la fiche de fonction, elle résout ce problème de technologie, avec une grande facilité.\tM',
'2647\tCela signifie donc que pour le résoudre, nous devons mettre à jour tous les Magelis que nous avons............\tC',
"5391\tJe ne pense pas que cela changera la qualité du produit, donc cela ne changera rien pour moi. Je continuerai à l'acheter.\tM",
"1120\tSur les stations de pompage, c'est très intéressant car environnement pollué, corrosif ?\tM"],
"am": ['463\tQuels problèmes ce concept résout-il ? Nous revenons à la fiche de fonction, elle résout ce problème de technologie, avec une grande facilité.\tM',
'2647\tCela signifie donc que pour le résoudre, nous devons mettre à jour tous les Magelis que nous avons............\tC',
"5391\tJe ne pense pas que cela changera la qualité du produit, donc cela ne changera rien pour moi. Je continuerai à l'acheter.\tM",
"1120\tSur les stations de pompage, c'est très intéressant car environnement pollué, corrosif ?\tM"]}
for k, v in Liste_phrases_retraduit.items():
    v.pop(0)
    v = [i.split("\t") for i in v]
    df = pd.DataFrame(v, columns=['identifiant', 'verbatim', 'etiquette'])
    df['id_langue'] = k
    print(df.head())
Result
identifiant verbatim etiquette \
0 463 Quels problèmes ce concept résout-il ? Nous re... M
1 2647 Cela signifie donc que pour la solution, vous ... C
2 5391 Cela ne changera pas la qualité du produit, je... M
3 1120 C'est très intéressant, parce que c'est un env... M
4 4021 C'est bien pensé. La couleur, la présentation ... M
id_langue
0 aBr
1 aBr
2 aBr
3 aBr
4 aBr
identifiant verbatim etiquette \
0 463 Quels problèmes ce concept résout-il ? Nous re... M
1 2647 Cela signifie donc que pour le résoudre, nous ... C
2 5391 Je ne pense pas que cela changera la qualité d... M
3 1120 Sur les stations de pompage, c'est très intére... M
4 4021 C'est bien pensé. La représentation colorée et... M
id_langue
0 all
1 all
2 all
3 all
4 all
Expected result:
identifiant verbatim etiquette \
0 463 Quels problèmes ce concept résout-il ? Nous re... M
1 2647 Cela signifie donc que pour la solution, vous ... C
2 5391 Cela ne changera pas la qualité du produit, je... M
3 1120 C'est très intéressant, parce que c'est un env... M
4 4021 C'est bien pensé. La couleur, la présentation ... M
id_langue
0 aBr
1 aBr
2 aBr
3 aBr
4 aBr
0 463 Quels problèmes ce concept résout-il ? Nous re... M
1 2647 Cela signifie donc que pour le résoudre, nous ... C
2 5391 Je ne pense pas que cela changera la qualité d... M
3 1120 Sur les stations de pompage, c'est très intére... M
4 4021 C'est bien pensé. La représentation colorée et... M
0 all
1 all
2 all
3 all
4 all
0 463 Quels problèmes ce concept résout-il ? Nous re... M
1 2647 Cela signifie donc que pour le résoudre, nous ... C
2 5391 Je ne pense pas que cela changera la qualité d... M
3 1120 Sur les stations de pompage, c'est très intére... M
4 4021 C'est bien pensé. La représentation colorée et... M
0 german
1 german
2 german
3 german
4 german
How do I get rid of that line? I could save the data like this and then filter it out in the Excel file, but since I have to hand over the code, that would not be very professional.
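No answer was recorded for this question. If the goal is a single table with one header row, one possible approach is to collect every row first and build the dataframe only once, after the loop. The sketch below is an assumption about what is wanted, not something confirmed by the asker. The dictionary `phrases` and its shortened contents are hypothetical stand-ins for Liste_phrases_retraduit, and the first element of each list is treated as a header line to skip, mirroring the v.pop(0) in the question.

```python
# Hypothetical input shaped like the question's dict: language -> list of
# tab-separated lines; the first line of each list is skipped, mirroring
# the v.pop(0) in the question.
phrases = {
    "all": ["id\ttext\tlabel", "463\tQuels problèmes ?\tM", "2647\tCela signifie\tC"],
    "am": ["id\ttext\tlabel", "463\tQuels problèmes ?\tM"],
}

rows = []
for lang, lines in phrases.items():
    for line in lines[1:]:
        identifiant, verbatim, etiquette = line.split("\t")
        rows.append({
            "identifiant": identifiant,
            "verbatim": verbatim,
            "etiquette": etiquette,
            "id_langue": lang,
        })

for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` once, after the loop, would then give one dataframe with a single header instead of one printed frame per language.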

In pandas how to create a dataframe from a list of dictionaries?

In python3 and pandas I have a list of dictionaries in this format:
a = [{'texto27/2': 'SENADO: PLS 00143/2016, de autoria de Telmário Mota, fala sobre maternidade e sofreu alterações em sua tramitação. Tramitação: Comissão de Assuntos Sociais. Situação: PRONTA PARA A PAUTA NA COMISSÃO. http://legis.senado.leg.br/sdleg-getter/documento?dm=2914881'}, {'texto27/3': 'SENADO: PEC 00176/2019, de autoria de Randolfe Rodrigues, fala sobre maternidade e sofreu alterações em sua tramitação. Tramitação: Comissão de Constituição, Justiça e Cidadania. Situação: PRONTA PARA A PAUTA NA COMISSÃO. http://legis.senado.leg.br/sdleg-getter/documento?dm=8027142'}, {'texto6/4': 'SENADO: PL 05643/2019, de autoria de Câmara dos Deputados, fala sobre violência sexual e sofreu alterações em sua tramitação. Tramitação: Comissão de Direitos Humanos e Legislação Participativa. Situação: MATÉRIA COM A RELATORIA. http://legis.senado.leg.br/sdleg-getter/documento?dm=8015569'}]
I tried to transform it into a dataframe with these commands:
import pandas as pd
df_lista_sentencas = pd.DataFrame(a)
df_lista_sentencas.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
texto27/2 1 non-null object
texto27/3 1 non-null object
texto6/4 1 non-null object
dtypes: object(3)
memory usage: 100.0+ bytes
But the generated dataframe has blank lines:
df_lista_sentencas.reset_index()
index texto27/2 texto27/3 texto6/4
0 0 SENADO: PLS 00143/2016, de autoria de Telmário... NaN NaN
1 1 NaN SENADO: PEC 00176/2019, de autoria de Randolfe... NaN
2 2 NaN NaN SENADO: PL 05643/2019, de autoria de Câmara do...
I would like to generate something like this:
texto27/2 texto27/3 texto6/4
SENADO: PLS 00143/2016, de autoria de Telmário... SENADO: PEC 00176/2019, de autoria de Randolfe.. SENADO: PL 05643/2019, de autoria de Câmara do...
Please, does anyone know how I can create a dataframe without blank lines?

Maybe use bfill: back-fill each column so that the first row picks up every value, then keep only that row:
df = df_lista_sentencas.bfill().iloc[[0]]
print(df)
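An alternative, sketched here under the assumption that every dictionary in the list carries a different key, is to merge the dictionaries before building the dataframe, so that no NaN cells are created in the first place. The list `a` below is a shortened stand-in for the one in the question, and `merged` is an illustrative name.

```python
# Shortened stand-in for the question's list `a`: one key per dictionary.
a = [
    {"texto27/2": "SENADO: PLS 00143/2016 ..."},
    {"texto27/3": "SENADO: PEC 00176/2019 ..."},
    {"texto6/4": "SENADO: PL 05643/2019 ..."},
]

merged = {}
for d in a:
    merged.update(d)  # a later dictionary wins if a key repeats

print(merged)
```

`pd.DataFrame([merged])` would then produce a single row with one column per key.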

How to slow down the speed exec of a func

When I run my code, the "attaquedeplac" function runs too fast.
I used the after function, but "attaquedeplac" ran 8 times, stopped and waited 1000 ms, then ran 8 times again, and so on. My problem is in this part of the code:
for a,b in attaque.items():
    # a = variable name, b = Tkinter object
    x = liste[1]
    y = liste[2]
    ajoutx = listedeco[0]
    ajouty = listedeco[1]
    compteur = 0
    def attaquedeplac():
        global x,y,ajoutx,ajouty,compteur
        x =x + (compteur * ajoutx)
        y =y + (compteur * ajouty)
        Fond.coords(b, x , y , x+ajoutx, y+ajouty)
        compteur +=1
        print("Tout vas bien {}".format(compteur))
    while x>40 and x<980 and y > 40 and y < 680:
        attaquedeplac()
        fenetre.after(1000,attaquedeplac)
Output:
Tout vas bien 1
Tout vas bien 2
Tout vas bien 3
Tout vas bien 4
Tout vas bien 5
Tout vas bien 6
Tout vas bien 7
Tout vas bien 8
<here a step>
Tout vas bien 9
Tout vas bien 10
Tout vas bien 11
Tout vas bien 12
Tout vas bien 13
Tout vas bien 14
Tout vas bien 15
Tout vas bien 16
<the other step>
Tout vas bien 17
Tout vas bien 18
Tout vas bien 19
Tout vas bien 20
Tout vas bien 21
Tout vas bien 22
Tout vas bien 23
Tout vas bien 24
<the other step>
Tout vas bien 25
Tout vas bien 26
Tout vas bien 27
Tout vas bien 28
I am trying to reproduce the arrow from Zelda 1, which advances slowly enough to be visible to the human eye.

Try this:
import time
time.sleep(1)
where 1 means 1 second (1000 ms).
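One thing to keep in mind is that a call to time.sleep inside a Tkinter program blocks the event loop while it sleeps, so the window cannot redraw in the meantime. A common alternative is to drop the while loop, which calls the function back-to-back, and let the function reschedule itself with after. The sketch below shows that pattern under stated assumptions: make_stepper, fake_after, move and in_bounds are hypothetical names, and fake_after only records callbacks so the example runs without a window. In the real program, fenetre.after would be passed instead.

```python
def make_stepper(schedule, move, in_bounds, delay_ms=1000):
    """Return a function that moves once, then reschedules itself."""
    def step():
        if not in_bounds():
            return  # stop: nothing further is scheduled
        move()
        schedule(delay_ms, step)
    return step


# Stand-in for Tkinter: records callbacks instead of waiting.
pending = []

def fake_after(delay_ms, callback):
    pending.append((delay_ms, callback))

position = {"x": 40}

def move():
    position["x"] += 10

def in_bounds():
    return position["x"] < 70

step = make_stepper(fake_after, move, in_bounds)
step()
while pending:  # plays the role of the event loop
    _, callback = pending.pop(0)
    callback()

print(position["x"])  # 70
```

Each call performs exactly one move and schedules exactly one follow-up, so with a real event loop there would be one visible step per delay.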

How to write a regex that preserves the order of appearance with Python?

Given the following string, I would like to extract tuples that preserve the order of appearance of the associated ids (POS tags). The order is: NCFS000, AQ0CS0. They need to be consecutive: no other id/tag can appear between NCFS000 and AQ0CS0.
For example:
string= ''' Hola hola I 1
compis compis NCMS000 0.500006
! ! Fat 1
No no RN 0.998045
sabía saber VMII3S0 0.592869
como como CS 0.999289
se se P00CN000 0.465639
ponía poner VMII3S0 0.65
una uno DI0FS0 0.951575
lavadora lavadora NCFS000 0.414738
hasta hasta SPS00 0.957698
que que PR0CN000 0.562517
conocí conocer VMIS1S0 1
esta este DD0FS0 0.986779
y y CC 0.999962
es ser VSIP3S0 1
que que CS 0.437483
es ser VSIP3S0 1
muy muy RG 1
sencilla sencillo AQ0FS0 1
de de SPS00 0.999984
utilizar utilizar VMN0000 1
! ! Fat 1
Todo todo DI0MS0 0.560961
un uno DI0MS0 0.987295
gustazo gustazo NCMS000 1
cuando cuando CS 0.985595
estamos estar VAIP1P0 1
aprendiendo aprender VMG0000 1
para para SPS00 0.999103
emancipar emancipar VMN0000 1
nos nos PP1CP000 1
, , Fc 1
que que CS 0.437483
si si CS 0.99954
nos nos PP1CP000 0.935743
ponen poner VMIP3P0 1
facilidad facilidad NCFS000 1
con con SPS00 1
las el DA0FP0 0.970954
tareas tarea NCFP000 1
de de SPS00 0.999984
la el DA0FS0 0.972269
casa casa NCFS000 0.979058
pues pues CS 0.998047
mejor mejor AQ0CS0 0.873665
que que PR0CN000 0.562517
mejor mejor AQ0CS0 0.873665
. . Fp 1
Antes_de antes_de SPS00 1
esta este PD0FS000 0.0132212
teníamos tener VMII1P0 1
otra otro DI0FS0 0.803899
de de SPS00 0.999984
la el DA0FS0 0.972269
marca marca NCFS000 0.972603
Otsein otsein NP00000 1
, , Fc 1
de de SPS00 0.999984
estas este DD0FP0 0.97043
que que PR0CN000 0.562517
van ir VMIP3P0 1
incluidas incluir VMP00PF 1
en en SPS00 1
el el DA0MS0 1
mobiliario mobiliario NCMS000 0.476077
y y CC 0.999962
además además RG 1
era ser VSII1S0 0.491262
de de SPS00 0.999984
carga carga NCFS000 0.952569
superior superior AQ0CS0 0.992424
, , Fc 1
pero pero CC 0.999764
tan tan RG 1
antigua antiguo AQ0FS0 0.953488
que que CS 0.437483
según según SPS00 0.995943
mi mi DP1CSS 0.999101
madre madre NCFS000 1
, , Fc 1
nadie nadie PI0CS000 1
la lo PP3FSA00 0.0277039
podía poder VMII3S0 0.63125
tocar tocar VMN0000 1
porque porque CS 1
solo solo RG 0.0472103
la lo PP3FSA00 0.0277039
entendía entender VMII3S0 0.65
ella él PP3FS000 1
. . Fp 1
Esta este PD0FS000 0.0132212
es ser VSIP3S0 1
de de SPS00 0.999984
la el DA0FS0 0.972269
marca marca NCFS000 0.972603
Aeg aeg NP00000 1
y y CC 0.999962
dentro_de dentro_de SPS00 1
este este DD0MS0 0.960092
tipo tipo NCMS000 1
de de SPS00 0.999984
lavadoras lavadora NCFP000 0.411969
de de SPS00 0.999984
esta este DD0FS0 0.986779
marca marca NCFS000 0.972603
las lo PP3FPA00 0.0289466
había haber VAII1S0 0.353863
más más RG 1
caras caro AQ0FP0 0.417273
o o CC 0.999769
más más RG 1
baratas barato AQ0FP0 0.3573
y y CC 0.999962
está estar VAIP3S0 0.999201
digamos decir VMSP1P0 0.785925
que que CS 0.437483
está estar VAIP3S0 0.999201
en en SPS00 1
el el DA0MS0 1
punto punto NCMS000 1
medio medio AQ0MS0 0.314286
. . Fp 1
Es ser VSIP3S0 1
de de SPS00 0.999984
color color NCMS000 1
blanco blanco AQ0MS0 0.598684
y y CC 0.999962
tiene tener VMIP3S0 1
carga carga NCFS000 0.952569
frontal frontal AQ0CS0 0.657209
, , Fc 1
con con SPS00 1
una uno DI0FS0 0.951575
capacidad capacidad NCFS000 1
de de SPS00 0.999984
6kg 6kg Z 1
. . Fp 1
En en SPS00 1
casa casa NCFS000 0.979058
a_pesar_de a_pesar_de SPS00 1
ser ser VSN0000 0.940705
cuatro 4 Z 1
, , Fc 1
se se P00CN000 0.465639
ponen poner VMIP3P0 1
lavadoras lavadora NCFP000 0.411969
casi casi RG 1
todos todo DI0MP0 0.624221
o o CC 0.999769
todos todo DI0MP0 0.624221
los el DA0MP0 0.976481
días día NCMP000 1
. . Fp 1
En en SPS00 1
su su DP3CS0 1
parte parte NCFS000 0.499183
de de SPS00 0.999984
arriba arriba RG 0.986014
encontramos encontrar VMIP1P0 0.65
la el DA0FS0 0.972269
" " Fe 1
; ; Fx 1
zona zona NCFS000 1
de de SPS00 0.999984
mandos mando NCMP000 1
" " Fe 1
; ; Fx 1
, , Fc 1
donde donde PR000000 0.967437
se se P00CN000 0.465639
puede poder VMIP3S0 0.999117
echar echar VMN0000 1
el el DA0MS0 1
detergente detergente NCMS000 0.49273
, , Fc 1
aunque aunque CC 1
en en SPS00 1
nuestro nuestro DP1MSP 0.94402
caso caso NCMS000 0.99812
lo lo PP3CNA00 0.271177
a a SPS00 1
el el DA0MS0 1
ser ser VSN0000 0.940705
gel gel NCMS000 1
lo lo PP3CNA00 0.271177
ponemos poner VMIP1P0 1
directamente directamente RG 1
junto_con junto_con SPS00 1
la el DA0FS0 0.972269
ropa ropa NCFS000 1
. . Fp 1
Luego luego RG 0.996689
tiene tener VMIP3S0 1
la el DA0FS0 0.972269
rueda rueda NCFS000 0.72093
para para SPS00 0.999103
elegir elegir VMN0000 1
el el DA0MS0 1
programa programa NCMS000 0.953488
y y CC 0.999962
los el DA0MP0 0.976481
intermitentes intermitente NCMP000 0.342773
que que PR0CN000 0.562517
indican indicar VMIP3P0 1
en en SPS00 1
que que CS 0.437483
paso paso NCMS000 0.905738
de de SPS00 1
el el DA0MS0 1
programa programa NCMS000 0.953488
estaba estar VAII1S0 0.5
. . Fp 1
Como como CS 0.999289
todas todo PI0FP000 0.0490506
tiene tener VMIP3S0 1
programas programa NCMP000 0.97619
más más RG 1
cortos corto AQ0MP0 1
y y CC 0.999962
más más RG 1
largos largo AQ0MP0 0.97619
, , Fc 1
incluso incluso RG 0.996383
un uno DI0MS0 0.987295
programa programa NCMS000 0.953488
que que PR0CN000 0.562517
seria seriar VMIP3S0 0.151546
como como CS 0.999289
lavar lavar VMN0000 1
a a SPS00 0.996023
mano mano NCFS000 0.992095
y y CC 0.999962
otro otro DI0MS0 0.612994
ideal ideal NCMS000 0.5
para para SPS00 0.999103
estores estor NCMP000 1
, , Fc 1
que que PR0CN000 0.562517
salen salir VMIP3P0 0.972603
casi casi RG 1
secos seco AQ0MP0 1
y y CC 0.999962
planchaditos planchar VMP00PM 0.691767
para para SPS00 0.999103
colgar colgar VMN0000 1
y y CC 0.999962
ya ya RG 0.999395
está estar VAIP3S0 0.999201
. . Fp 1
Es ser VSIP3S0 1
muy muy RG 1
fácil fácil AQ0CS0 1
de de SPS00 0.999984
aprender aprender VMN0000 1
la lo PP3FSA00 1
y y CC 0.999962
además además RG 1
tiene tener VMIP3S0 1
indicador indicador NCMS000 0.64273
por por SPS00 1
sonido sonido NCMS000 1
de de SPS00 0.999984
cuando cuando CS 0.985595
acaba acabar VMIP3S0 0.992958
, , Fc 1
lista listar VMIP3S0 0.220088
para para SPS00 0.999103
abrir abrir VMN0000 1
y y CC 0.999962
tender tender VMN0000 1
. . Fp 1
Por por SPS00 1
decir decir VMN0000 0.997512
algo algo PI0CS000 0.900246
malo malo AQ0MS0 0.657087
de de SPS00 0.999984
ella él PP3FS000 1
, , Fc 1
sería ser VSIC1S0 0.5
que que CS 0.437483
cuando cuando CS 0.985595
centrifuga centrifugar VMIP3S0 0.994859
, , Fc 1
algo algo PI0CS000 0.900246
que que PR0CN000 0.562517
hace hacer VMIP3S0 1
muy muy RG 1
bien bien RG 0.902728
, , Fc 1
pues pues CS 0.998047
vibra vibrar VMIP3S0 0.994856
un uno DI0MS0 0.987295
poco poco RG 0.542435
y y CC 0.999962
se se P00CN000 0.465639
nota notar VMIP3S0 0.419995
el el DA0MS0 1
ruido ruido NCMS000 1
jeje jeje NCMS000 0.833445
, , Fc 1
pero pero CC 0.999764
no no RN 0.998134
se se P00CN000 0.465639
mueve mover VMIP3S0 0.994868
de de SPS00 0.999984
su su DP3CS0 1
sitio sitio NCMS000 0.980769
! ! Fat 1
! ! Fat 1
Saludillos saludillos NP00000 0.411768
! ! Fat 1
'''
What I have already done is extract the tuples with this regex:
import re
tuple = re.findall(r'(\w+)\s\w+\s(NCFS000|AQ0CS0)',string)
id1 = "NCFS000"
id2 = "AQ0CS0"
arr = []
for i in range(0, len(tuple)-1):
    if tuple[i][1]==id1 and tuple[i+1][1]==id2:
        arr.append(tuple[i])
        arr.append(tuple[i+1])
print arr
The output is the following:
[('casa', 'NCFS000'), ('mejor', 'AQ0CS0'), ('carga', 'NCFS000'), ('superior', 'AQ0CS0')]
The problem with this output is that, if you check the string, the first pair does not preserve the required order of appearance (NCFS000, AQ0CS0):
('casa', 'NCFS000'), ('mejor', 'AQ0CS0')
since these rows:
casa casa NCFS000
pues pues CS
mejor mejor AQ0CS0
are not consecutive. On the other hand, if you check the second pair:
('carga', 'NCFS000'), ('superior', 'AQ0CS0')
they are consecutive, since they follow the order (NCFS000, AQ0CS0), i.e. they have no other tag between them:
carga carga NCFS000
superior superior AQ0CS0
How can I fix the code so that it only returns the tuples that follow the NCFS000, AQ0CS0 order?

Add the re.M flag and use this regex:
^(\w+)\s\w+\s(NCFS000)\n^(\w+)\s\w+\s(AQ0CS0)
Or, with your example string:
>>> re.findall(r'^(\w+)\s\w+\s(NCFS000)\n^(\w+)\s\w+\s(AQ0CS0)',string, re.M)
[('carga', 'NCFS000', 'superior', 'AQ0CS0')]
With your updated string, you need to account for the trailing digits:
^(\w+)\s\w+\s(NCFS000)\s[0-9.]+\n^(\w+)\s\w+\s(AQ0CS0)
In Python:
>>> re.findall(r'^(\w+)\s\w+\s(NCFS000)\s[0-9.]+\n^(\w+)\s\w+\s(AQ0CS0)', txt, re.M)
[('carga', 'NCFS000', 'superior', 'AQ0CS0'), ('carga', 'NCFS000', 'frontal', 'AQ0CS0')]

While you could postprocess the tuples by fixing your code, or writing simpler code like this:
for entry, next_entry in zip(tup, tup[1:]):
    if entry[1] == 'NCFS000' and next_entry[1] == 'AQ0CS0':
        # good, do something with entry and next_entry
        pass
… that still isn't going to catch the case where there were other, unmatched, lines between the two, because you've already thrown that information away.
While you could instead match every line (replace the (NCFS000|AQ0CS0) in your regexp with (\w+)) and then check them this way, it seems like it would be a lot easier to just rewrite your regexp to capture only those cases in the first place:
matches = re.findall(r'(\w+)\s\w+\sNCFS000\n(\w+)\s\w+\sAQ0CS0', string)
for entry, next_entry in matches:
    # no need for a check
    pass
(As a side note, don't name your variables things like tuple and string. Besides not saying anything useful about the contents of those variables, that also shadows, respectively, a type and a module that you probably don't want to hide.)
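The first alternative mentioned above, matching every line and then checking neighbouring entries, can be sketched as follows in Python 3. The five-row text is a shortened excerpt from the question's string, and the names entries and pairs are illustrative only.

```python
import re

text = """casa casa NCFS000 0.979058
pues pues CS 0.998047
mejor mejor AQ0CS0 0.873665
carga carga NCFS000 0.952569
superior superior AQ0CS0 0.992424
"""

# One (word, tag) tuple per line, whatever the tag is, so that rows
# sitting between two candidates are not thrown away.
entries = re.findall(r"^(\S+)[ \t]+\S+[ \t]+(\S+)", text, re.M)

pairs = [
    (word, tag, next_word, next_tag)
    for (word, tag), (next_word, next_tag) in zip(entries, entries[1:])
    if tag == "NCFS000" and next_tag == "AQ0CS0"
]
print(pairs)
# [('carga', 'NCFS000', 'superior', 'AQ0CS0')]
```

Because the pues row is kept in entries, casa and mejor are no longer neighbours and are correctly left out.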

Just split string into rows and check successive rows:
rows = string.splitlines()
for i in range(0, len(rows)-1):
    if re.search('NCFS000', rows[i]) and re.search('AQ0CS0', rows[i+1]):
        arr.append(rows[i])
        arr.append(rows[i+1])
Notes:
This is not the most elegant (i.e., "pythonic") way to write your loop, but it's closest to what you were doing. Instead of working with indices, you could also use @abarnert's zip approach; just apply it to the complete list of rows.
Using re.search above is overkill for this example, since you could just say if "NCFS000" in rows[i]. But you might have more complex tests. You could also split each row into a triple and check the third element; it's nicer but unnecessary in this case since your IDs will not match any real words. Use whatever suits you.
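The last variant mentioned in the notes, splitting each row into fields and comparing the tag column exactly, might look like the sketch below. The four-row text is a shortened excerpt from the question's string. The length check is an added assumption to guard against rows with fewer than three fields.

```python
text = """marca marca NCFS000 0.972603
Otsein otsein NP00000 1
carga carga NCFS000 0.952569
frontal frontal AQ0CS0 0.657209
"""

# Split every non-empty row into its whitespace-separated fields.
rows = [line.split() for line in text.splitlines() if line.strip()]

arr = []
for row, next_row in zip(rows, rows[1:]):
    # Skip rows that are too short to have a tag column.
    if len(row) < 3 or len(next_row) < 3:
        continue
    # Compare the third field (the tag) exactly on both rows.
    if row[2] == "NCFS000" and next_row[2] == "AQ0CS0":
        arr.append((row[0], row[2], next_row[0], next_row[2]))

print(arr)
# [('carga', 'NCFS000', 'frontal', 'AQ0CS0')]
```

Comparing the field itself, rather than searching the whole row, also rules out an accidental hit if a tag ever appeared inside a word.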
