How to decode bit code to emoji python pandas

How can I decode the escaped UTF-8 byte codes into emoji in the text of each row of a pandas DataFrame? The use case is sentiment analysis.
Text | Sentimen
\xf0\x9f\x8e\xb6 la la la...hm hmm \xf0\x9f\x8e\xa7 "Semua diam ,semua bisu"\n"Kita coba tanya sama rumput yg bergoyang" \xe2\x99\xab\xe2\x99\xab\xe2\x99\xab\xe2\x99\xaa\xe2\x99\xaa\xe2\x99\xaa' | Positif
Cerita silat lae \xf0\x9f\x98\x80 semacam wejangan | Negatif
sewot..\xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x98\x82 dukung dia terus | Positif
kunyuk!!!!\xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x98\x82 kuy gaslah | Negatif
aku sudah mengalaminya \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x98\x82 tetiba muncul grub wa | Negatif
g\n\nlagi bosan huft \xf0\x9f\x98\xaa | Negatif
How I want it to look:
Text | Sentimen
🎶 la la la...hm hmm 🎧 "Semua diam ,semua bisu"\n"Kita coba tanya sama rumput yg bergoyang" ♫♫♫♪♪♪'' | Positif
Cerita silat lae 😀 semacam wejangan | Negatif
sewot...😂😂😂 dukung dia terus | Positif
kunyuk!!!!😂😂😂 kuy gaslah | Negatif
aku sudah mengalaminya 😂😂😂😂 tetiba muncul grub wa | Negatif
lagi bosan huft 😪 | Negatif
I tried to do this, but every value in the text column became NaN.
I'm out of ideas. Any help would be appreciated

Apply the encoding parameter while converting source to a data frame.
Example with hard-coded text:
import io
import pandas as pd
data_string='''
Text Sentimen
\xf0\x9f\x8e\xb6 la la la...hm hmm \xf0\x9f\x8e\xa7 "Semua diam ,semua bisu" "Kita coba tanya sama rumput yg bergoyang" \xe2\x99\xab\xe2\x99\xab\xe2\x99\xab\xe2\x99\xaa\xe2\x99\xaa\xe2\x99\xaa' Positif
Cerita silat lae \xf0\x9f\x98\x80 semacam wejangan Negatif
sewot..\xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x98\x82 dukung dia terus Positif
kunyuk!!!!\xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x98\x82 kuy gaslah Negatif
aku sudah mengalaminya \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x98\x82 tetiba muncul grub wa Negatif
g lagi bosan huft \xf0\x9f\x98\xaa Negatif
'''.encode('latin1').decode('utf-8')
df = pd.read_csv( io.StringIO(data_string), sep="\t", encoding='utf-8')
print(df)
Output: .\SO\67060643.py
Text Sentimen
0 🎶 la la la...hm hmm 🎧 "Semua diam ,semua bisu"... Positif
1 Cerita silat lae 😀 semacam wejangan Negatif
2 sewot..😂😂😂 dukung dia terus Positif
3 kunyuk!!!!😂😂😂 kuy gaslah Negatif
4 aku sudah mengalaminya 😂😂😂😂 tetiba muncul grub wa Negatif
5 g lagi bosan huft 😪 Negatif
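If the column holds the escapes as literal text (a backslash, an x and two hex digits) rather than as real bytes, the Latin-1 round trip above will not apply directly. The following is a minimal sketch under that assumption; the helper name decode_escaped_utf8 and the sample rows are made up for illustration.

```python
import re

# One or more consecutive literal "\xNN" escapes, e.g. the 16 characters \xf0\x9f\x98\x80
_ESCAPED_BYTES = re.compile(r"(?:\\x[0-9a-fA-F]{2})+")


def decode_escaped_utf8(text):
    """Replace runs of literal \\xNN escapes with the UTF-8 characters they encode."""
    def _decode(match):
        hex_pairs = re.findall(r"\\x([0-9a-fA-F]{2})", match.group(0))
        raw = bytes(int(pair, 16) for pair in hex_pairs)
        # errors="replace" keeps going if a run is not valid UTF-8
        return raw.decode("utf-8", errors="replace")
    return _ESCAPED_BYTES.sub(_decode, text)


# Raw strings, so the backslashes stay literal (the assumed shape of the data)
rows = [
    r"Cerita silat lae \xf0\x9f\x98\x80 semacam wejangan",
    r"lagi bosan huft \xf0\x9f\x98\xaa",
]
decoded = [decode_escaped_utf8(row) for row in rows]
print(decoded)
```

On a DataFrame the same helper could be applied with df['Text'].map(decode_escaped_utf8), provided the column contains only strings; missing values would need to be filled or skipped first.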

Related

Frequency of words in a text (pandas)

I have a column of words and would like to count the frequency of each word in a text and save that result in another column.
Data:
word frequency
0 l’iss
1 station
2 américaines
3 capsule
4 dernier
5 solaires
6 fusées
7 privé
Text:
états-unis : lancement réussi pour station space x dragon états-unis : lancement réussi pour space x dragon la fusée falcon 9, développée par une société privée : spacex, a décollé de la station sans problème ce matin à 7h44 utc. 22 mai 2012. - prévu initialement pour samedi dernier, le lancement a été reporté à la dernière seconde, suite à la défaillance d'une valve dans un des neuf moteurs du pre\xadmier étage du lan\xadceur. le lanceur a décollé du site de lancement du pas de tir 40 (slc-40) de la base de cape canaveral en floride, qui était autrefois utilisé pour les fusée titan iii et iv et qui a été reconverti pour ce lanceur.
I tried:
from collections import Counter
freq = df['word'].str.apply(Counter(text))
My output:
AttributeError: 'StringMethods' object has no attribute 'apply'
Expected output:
word frequency
0 cape 1
1 station 2
2 américaines 0
3 capsule 0
4 dernier 1
5 solaires 0
6 fusée 2

You can transform the text into a counter and then fetch the results from it, using a mix of value_counts and to_dict.
# Assuming the text split is on \s
text_counts = pd.Series(text.split(' ')).value_counts().to_dict()
df['Frequency'] = df.word.apply(lambda x: text_counts.get(x, 0)) # In case the word doesn't exist
word Frequency
0 l’iss 0
1 station 2
2 américaines 0
3 capsule 0
4 dernier 0
5 solaires 0
6 fusées 0
7 privé 0
Another approach is using Python's native Counter:
from collections import Counter
text_counter = Counter(text.split())
df['Frequency'] = df.word.apply(lambda x: text_counter.get(x, 0))

It would be easier the other way around. Start with the Counter object and from that build the dataframe
from collections import Counter
text = '''états-unis : lancement réussi pour station space x dragon états-unis : lancement réussi pour space x dragon la fusée falcon 9, développée par une société privée : spacex, a décollé de la station sans problème ce matin à 7h44 utc. 22 mai 2012. - prévu initialement pour samedi dernier, le lancement a été reporté à la dernière seconde, suite à la défaillance d'une valve dans un des neuf moteurs du pre\xadmier étage du lan\xadceur. le lanceur a décollé du site de lancement du pas de tir 40 (slc-40) de la base de cape canaveral en floride, qui était autrefois utilisé pour les fusée titan iii et iv et qui a été reconverti pour ce lanceur.'''
# naive splitting, it might be better to use regex with \b
c = Counter(text.split())
df = pd.DataFrame(list(c.items()), columns=['word', 'count'])
print(df.head())
Outputs
word count
0 états-unis 2
1 : 3
2 lancement 4
3 réussi 2
4 pour 5
You can then filter the dataframe for the words you want (or you can do the filtering while building the dataframe).
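Splitting on whitespace leaves punctuation attached to words (for example "dernier," with its comma), which is why some counts come out as 0. A rough sketch of regex-based tokenizing follows; the shortened text and the word list are illustrative only, and \w+ is just one possible definition of a word.

```python
import re
from collections import Counter

# Shortened, illustrative text (not the full article from the question)
text = ("la fusée falcon 9 a décollé de la station. "
        "prévu pour samedi dernier, la station")
words = ["station", "dernier", "capsule"]

# In Python 3, \w is Unicode-aware for str patterns, so accented letters are kept
tokens = re.findall(r"\w+", text.lower())
counts = Counter(tokens)

# Counter returns 0 for words that never occur
frequency = [counts[word] for word in words]
print(dict(zip(words, frequency)))
```

With pandas, the same Counter can fill a column via df['word'].map(lambda w: counts[w]).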

Replace your commas with space, then .split() and then use a dictionary comprehension and map that to your df.
import pandas as pd
text = "états-unis : lancement réussi pour station space x dragon états-unis : lancement réussi pour space x dragon la fusée falcon 9, développée par une société privée : spacex, a décollé de la station sans problème ce matin à 7h44 utc. 22 mai 2012. - prévu initialement pour samedi dernier, le lancement a été reporté à la dernière seconde, suite à la défaillance d'une valve dans un des neuf moteurs du pre\xadmier étage du lan\xadceur. le lanceur a décollé du site de lancement du pas de tir 40 (slc-40) de la base de cape canaveral en floride, qui était autrefois utilisé pour les fusée titan iii et iv et qui a été reconverti pour ce lanceur."
df = pd.DataFrame({'word': ["l’iss", 'station', "américaines", "capsule", "dernier", "solaires", "fusée", "privé"]})
text_list = text.replace(',', ' ').split()
word_counts = {i: text_list.count(i) for i in text_list}
df['frequency'] = df['word'].map(word_counts).fillna(0)

You could summarise the word counts in a dict then use map:
text_list = text.split()
word_counts = {word: text_list.count(word) for word in text_list}
df['frequency'] = df['word'].map(word_counts).fillna(0)

Parsing forum posts with python3 and beautiful soup

I need to get the text from the forum posts.
The site is this one:
http://forum.pcekspert.com/showthread.php?t=263544
I tried to do it like this:
import requests
from bs4 import BeautifulSoup
# Create a variable with the url
url = 'http://forum.pcekspert.com/showthread.php?t=263544'
# Use requests to get the contents
r = requests.get(url)
# Get the text of the contents
html_content = r.text
# Convert the html content into a beautiful soup object
soup = BeautifulSoup(html_content,"lxml")
rez = soup.find_all('id=\"__xclaimwords_wrapper\"')
print(rez)
From the page's HTML I found that each post message is wrapped in two tags:
TEXT
The xxx in the first id is a 7-digit number, and every post message has a different one.

You can use a CSS selector to find the elements with the __xclaimwords_wrapper id and pull the text from them; that is where the message text is:
soup = BeautifulSoup(html_content,"lxml")
rez = [d.text for d in soup.select("#__xclaimwords_wrapper")]
Output:
['Izgleda da novi update za 8 pod nazivom Threshold ce zapravo biti win9.. pa ako nekoga zanima više malo evo vam linka : http://www.pcgamer.com/2014/01/20/wi...ame=0&ns_fee=0\r\ni malo opsirnije o threshold-u http://winsupersite.com/windows-8/th...hip-april-2015', '\nCitiraj:\n\n\n\n\r\n\t\t\t\t\tAutor Sneaky\n\n\nIzgleda da novi update za 8 pod nazivom Threshold ce zapravo biti win9.. pa ako nekoga zanima više malo evo vam linka : http://www.pcgamer.com/2014/01/20/wi...ame=0&ns_fee=0\r\ni malo opsirnije o threshold-u http://winsupersite.com/windows-8/th...hip-april-2015\n\n\n\nIzgleda da su prihvatili logiku Appla i da stancaju verzije i malo po malo "cokaju", nekako mi se cini logicnije nego svako 5 g izbacivat op OS koji u danasnem svitu androida/iosa/wpa teze prolazi naplatu. Nije da su direktno usporedivi, ali ljudi su drski pa je tesko odvojit vece pare za OS.', 'Ja bi prije rekao da su shvatili da su malo zaj* za Win8, pa da isprave stetu - lakse je izbaciti novu verziju.\nSlicno kao i sa Vistom.', 'Tocno to. Win8 (8.1) je meni osobno isto kao i WinME i Vista. 
U biti MS napravi svaku drugu generaciju Winsa kvalitetno jer popravlja prethodnu koju za*ere maksimalno sa novim features-ima ', 'Nek oni meni samo refreshaju izgled desktopa sa novim ikonama, explorer elementima i animacijama i ja sam zadovoljan \n\nI da, još bolju cloud i WP9 integraciju.', 'Da, fakat je idiocki izbacivat Service packove, kad možeš pičiti nove OS-ove \n\nKako ono beše:\nWin 2000 - SP4\nWin XP - SP3\nVista - SP2\nWin 7 - SP1\nWin 8 - Win 8.1 ', 'meni više smeta to što još nisu uveli novi filesystem koji se mislim da još od viste obećava, a ne nekakve vizualne gluposti', 'nije se nista obecavalo od viste\r\nono sto je bio WinFS nije bio novi FS nego NTFS + "layer" baze podataka koja je vukla metadata iz fajlova\n\r\nmozes si to sranje i instalirat na xp-u (beta 1) pa vidjet sta je to\n\r\nu svakom slucaju ja preskacem i win9\r\nsumnjam da ce ista novog pridonjeti ako ne i jos gore cim pocne sve vise integracija sranja koje nitko normalan netreba\n\r\nako se nevaram win8 ima strgani nacin loginova (slagalica, face reckognition), nebi se cudio da 9-tka bude samo profinjenija 8-ica\r\nsa popravljenim tim stvarima\r\ncak bi i ReFS mogao vidjeti svjetlost posto ga samo nude na serverima kao test\n\r\nali to su samo minarne stvari, to mene ne zanima :P\r\nak ce jasit i dalje metro sranja i tviter i assbook integracije i bing pizd***e\r\nonda ostajem na 7-ici ', '\nCitiraj:\n\n\n\n\r\n\t\t\t\t\tAutor Drug Brko\n\n\nWin 2000 - SP4\n\n\n\nSa zadnjim rollupom, skoro 5, ali uzmi da je NT 4 imao 6+1 komada, tako da ti je tablica dobra. ', '\nCitiraj:\n\n\n\n\r\n\t\t\t\t\tAutor Bubba\n\n\nSa zadnjim rollupom, skoro 5, ali uzmi da je NT 4 imao 6+1 komada, tako da ti je tablica dobra. \n\n\n\nMa znam da je bio rollup za win2000, sjećam se da sam 2005 slipstreamao cd sa tim updateom, al nije SP. \nXP SP3 nakon instalacije OS-a ima još oko 1GB updateova sa neta, Win 7 isto. 
Vjerojatno i vista \nSad će bit za win 8.1 bit 100 MB i odma će doći win 9 ', '\nCitiraj:\n\n\n\n\r\n\t\t\t\t\tAutor Drug Brko\n\n\nMa znam da je bio rollup za win2000, sjećam se da sam 2005 slipstreamao cd sa tim updateom, al nije SP.\n\n\n\nnije SP jer oni nisu htjeli da bude SP\r\njer kad izbace SP onda moraju produljiti rok podrske OS-a\n\r\nzato XP i nece imati SP4 iako ima tonu patcheva nakon SP3-a\r\nkao sto i vista nece dobit SP3 a win "7" SP2\n\r\nsvaki SP ih kosta minimalno novih 3 godina podrske sto se neuklapa u njihovu ideju\r\nsvake 2 godine novi OS\n\r\nzasto bi davali besplatne SP-ove kad mogu u istom roku naplacivati novi OS', '\nCitiraj:\n\n\n\n\r\n\t\t\t\t\tAutor Baja 001\n\n\nIzgleda da su prihvatili logiku Appla i da stancaju verzije i malo po malo "cokaju", nekako mi se cini logicnije nego svako 5 g izbacivat op OS koji u danasnem svitu androida/iosa/wpa teze prolazi naplatu. Nije da su direktno usporedivi, ali ljudi su drski pa je tesko odvojit vece pare za OS.\n\n\n\nS druge strane, Appleovi updateovi se više ne naplaćuju. Prvo su smanjili cijenu, a 10.9 je besplatan. Dovoljno je kupit hardware.', '\nCitiraj:\n\n\n\n\r\n\t\t\t\t\tAutor IcyTexx\n\n\nS druge strane, Appleovi updateovi se više ne naplaćuju. Prvo su smanjili cijenu, a 10.9 je besplatan. Dovoljno je kupit hardware.\n\n\n\nNaplati se kroz hw, ali svakako idu na taj mobilni trend izbacivanja.', '\nCitiraj:\n\n\n\n\r\n\t\t\t\t\tAutor IcyTexx\n\n\nS druge strane, Appleovi updateovi se više ne naplaćuju. Prvo su smanjili cijenu, a 10.9 je besplatan. Dovoljno je kupit hardware.\n\n\n\nS druge strane appleovi updateovi su smeće. To ti kažem s punom odgovornošću kao dugogodišnji profesionalni korisnik (FCP7 itd.). Trenutno imamo predzadnju generaciju njihovih Mac Pro 6 jezgrenih đubradi i nakon upgrejda na Mavericse sve je ošlo u tri pm. 
Popis je predugačak...a imaš dovoljno o tome i na netu.\n\n\nCitiraj:\n\n\n\n\r\n\t\t\t\t\tAutor Baja 001\n\n\nNaplati se kroz hw, ali svakako idu na taj mobilni trend izbacivanja.\n\n\n\nTako je. Dovoljno je baciti oko na cjenik.', 'Zato kažem dovoljno je kupit hardware... U Americi su im laptopi pojeftinili. Kad ću ići sad u 5. mj u Ameriku opet ću kupit jednog jer sam im voljan dati taj price premium. Prije dvije godine sam im dao više radi kvalitete izrade, a kroz korištenje sad bi im dao više radi OS-a. Vise ne naplaćuju ni svoj "Office" software. Gledao sam i u PC svijetu ne mogu naći zamjenu njihovom 15" MBP Retina. Barem po mojim kriterijima. Ako netko nadje "retina" laptop od 15" koji je jednako debeo kao i MBPR, da je metalan, a ne plastican, sa mu jednako ili duže traje baterija, istih ili jačih specifikacija sa PCIE storageom za iste ili manje pare skidam mu kapu. A od Windowsovih DPI postavki mi bude zlo.\n\nS druge strane, Windows 8 kosta $120. A on je cvijeće, ha? Osnovni Office $130. Evo čekam u redu da im iskeširam pare...\n\nNo neću sad ulazit u to budući da je ovo MS tema, samo ce se krenut ljutit ljudi.\n\n#gnujko vjerujem ti, mene nije zahvatilo bas puno problema kod Mavericksa, a isto nisam "običan" korisnik. 
Budem pogledao o kakvim se problemima tu radi.', 'moguce je da windows 8.2 ne bude 8.2 nego windows 9\nali nadam se da ce poboljsat i destop i metro u isto vrijeme jer dosta stvari koje na destopu imas nema u metrou sto bi stvarno koristilo pogotovo samo za RT verziju kao npr explorer\n\nali budemo vidjeli, brat mi je bio na mvp summitu i nisu puno govorili o novim windowsima', '\nCitiraj:\n\n\n\n\r\n\t\t\t\t\tAutor Metlina\n\n\ni nisu puno govorili o novim windowsima\n\n\n\nzato sto ih jos nisu ni poceli "raditi"\r\nsad se jos uvjek fokusiraju na ovo: http://wzascok.livejournal.com/12774.html', '\nCitiraj:\n\n\n\n\r\n\t\t\t\t\tAutor Metlina\n\n\nmoguce je da windows 8.2 ne bude 8.2 nego windows 9\nali nadam se da ce poboljsat i destop i metro u isto vrijeme jer dosta stvari koje na destopu imas nema u metrou sto bi stvarno koristilo pogotovo samo za RT verziju kao npr explorer\n\nali budemo vidjeli, brat mi je bio na mvp summitu i nisu puno govorili o novim windowsima\n\n\n\n100% ne bude bilo Win8.2, ide se na Threshold (Windows 9).\n\nImaš Metro verziju File Managera, SkyDrive (OneDrive).\n\nMVP-ovi su pod NDA pa nit ne smiju pričati o takvim stvarima \n\nEDIT: I Threshold je počeo sa developmentom.', 'http://youtu.be/VF4Eva_4UNE', '\n\r\nnije 9 nego 10 \n\r\nslike i tekst http://thenextweb.com/microsoft/2014...es-windows-10/\n\r\nstižu sredinom 2015.', 'Jednostavno smo priznali da smo zajebali ', 'Tema bi se trebala preimenovat u Windows 10 ', '\n\n\n\n\nvolio bih zamijeniti sedmicu, vidjet ćemo sada hoću li ', 'Odlicno, taman dovoljno dugo dok ne izguram ove w7 sa zaobilazenjem w8 \n\n90-tih sam furo windows 3.1 pa nesto jako malo W95 a koristio W98, WNT tj. W2000 smo isto zaobisli pa WXP, SP1 i SP2 a SP3 i W Vistu isto debelo zaobisao. 
\n\nMislim da mi je windows vista najgore legla od svih ti windowsa.\nE sad ostaje nada da ce ovi w10 biti na razini w7...', '\nCitiraj:\n\n\n\n\r\n\t\t\t\t\tAutor mamutarka\n\n\nhttp://youtu.be/VF4Eva_4UNE\n\n\n\nNe kuzim, kao, ovo je neki "novi" super feature?\n\r\nOpen/CreateWindowsStation() i Create/SwitchDesktop() su funkcije podrzane u Windows API-ju jos od Windowsa 2000, dakle, cijelih 14 godina, rekli bi smo - old news.\n\r\nDrugim rijecima, svaki nemusti programer je mogao napraviti ovo bez nekih vecih problema i poteskoca, sto i jesu radili jer ovakvi programcici postoje "oduvijek".\n\r\nNadam se da ipak imaju u planu neki novi ozbiljniji feature, tipa WinFS ili stogod drugo, jer ovo je smijesno i za pokazivati, a kamo li stavljati kao novi ficur novih Windowsa... :\\', 'http://youtu.be/84NI5fjTfpQ', 'http://blogs.windows.com/bloggingwin...ng-windows-10/', 'mergao dvije teme, windows 9 i windows 10 u jednu...\nOvdje će izaći tech preview:\nhttp://windows.microsoft.com/hr-hr/w...ew-coming-soon', 'Čim vidim ono tablet sučelje, imam osjećaj da neću mjenjat sedmicu još ohoho', '\nCitiraj:\n\n\n\n\r\n\t\t\t\t\tAutor ivan77\n\n\nČim vidim ono tablet sučelje, imam osjećaj da neću mjenjat sedmicu još ohoho\n\n\n\nBezveze si skeptičan, meni se na kraju i svidio metro. Pogotovo što imaš aplikacije iz store-a koje su prilično zanimljive, od nogometnih rezultata preko tečajne liste do aplikacije koja streama svjetske televizije tipa BBC bez da ti treba proxy, hola i slično.']
Which matches what you see on the page.

Regex to match comma-separated strings containing comma-formatted decimals

I have comma-separated strings like this one:
"Assistência 24hs com Guincho s/limite de km, 2o. Guincho 100 km no mesmo evento, Pacote de Benefícios HDI, Táxi sem Franquia, Serviços Residenciais, 7 dias de Carro Reserva quando Terceiro (sem ar cond), 7 dias de Carro Reserva, Vidros com franquia de R$ 260,00."
I want to split the string by comma, but the problem is that there are numbers with a comma as the decimal separator in the string (for example: 260,00), for which I don't want a split to happen.

You could split by comma, followed by space:
>>> s.split(", ")
['Assist\xc3\xaancia 24hs com Guincho s/limite de km',
'2o. Guincho 100 km no mesmo evento',
'Pacote de Benef\xc3\xadcios HDI',
'T\xc3\xa1xi sem Franquia',
'Servi\xc3\xa7os Residenciais',
'7 dias de Carro Reserva quando Terceiro (sem ar cond)',
'7 dias de Carro Reserva',
'Vidros com franquia de R$ 260,00.']
Note that this will remove both the comma and the following space from the resulting strings.

You're walking on thin ice here. From your example, it seems like using ", " as the field separator (comma-space) would work. Most would opt to quote the strings or use a different delimiter (pipe, tab, \x1F, etc).
This seems very fragile to me, and it could easily break later on. If you have any influence over the format you are given, have that conversation first.

The following avoids the fragility that was pointed out by @dsz.
txt = '''Assistência 24hs com Guincho s/limite de km, 2o. Guincho 100 km no mesmo evento, Pacote de Benefícios HDI, Táxi sem
Franquia, Serviços Residenciais, 7 dias de Carro Reserva quando Terceiro (sem ar cond), 7 dias de Carro
Reserva, Vidros com franquia de R$ 260,00.'''
import re
# Split on a comma followed by any character that is not a digit, "+" or "."
re.split(r",[^\d+.]", txt)
output:
['Assist\xc3\xaancia 24hs com Guincho s/limite de km',
'2o. Guincho 100 km no mesmo evento',
'Pacote de Benef\xc3\xadcios HDI',
'T\xc3\xa1xi sem Franquia',
'Servi\xc3\xa7os Residenciais',
'7 dias de Carro Reserva quando Terceiro (sem ar cond)',
'7 dias de Carro\nReserva',
'Vidros com franquia de R$ 260,00.']
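An alternative that states the intent more directly is a negative lookahead: split on a comma only when the next character is not a digit, then swallow any whitespace after it. This is a sketch, and it assumes that a comma directly followed by a digit is always a decimal separator, which would not hold for unspaced lists such as "1,2,3".

```python
import re

s = "Táxi sem Franquia, 7 dias de Carro Reserva, Vidros com franquia de R$ 260,00."

# ",(?!\d)" matches a comma not directly followed by a digit;
# "\s*" then consumes the separator's trailing whitespace (including newlines)
parts = re.split(r",(?!\d)\s*", s)
print(parts)
```

Unlike the character-class version, the lookahead does not consume the character after the comma, so only the comma and the whitespace are removed.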

Match everything until certain pattern [closed]

So I have this piece of text..
200 g torskefilet
havsalt og friskkværnet peber
8 tynde skiver røget spæk
150 g rødbede (ca. 2 stk.)
½ dl blommeeddike
1 spsk. olivenolie
1 spsk. rørsukker
lidt rødbedeblade eller anden bitter salat
Vask rødbederne, kom dem i en gryde med vand, og bring dem i kog. Kog rødbederne i ca. 30 min., til de er møre, men stadig har lidt bid.
Kom koldt vand på rødbederne, så de køler lidt af, og ”gnid” skindet af efterfølgende, så du kommer ind til rødbedens silkebløde indre.
Skær rødbeder ud i grove tern, og kom dem i en gryde med eddike, sukker og olivenolie.
Lad eddiken koge ind i rødbederne, så de bliver glaserede, og lagen får konsistens som sirup.
Tjek torskefileterne for evt. vildfarne ben og for friskhed (de skal dufte af hav – ikke af havn!).
Skær torsken ud i helt små tern. Brug en skarp kniv, så kødet bliver skåret – ikke moset.
Kom torsketataren i en skål, krydr med salt og peber, og rør rundt.
Form tataren til 4 små bøffer, og sæt dem i køleskabet i 10 min., så tataren lige sætter sig inden den steges.
Varm en pande op, og steg de tynde skiver spæk, så de bliver sprøde og krøller lidt sammen. Tag spækket af panden, og læg det på et stykke køkkenrulle.
Hæld overskydende fedt af panden, og steg så tatarbøfferne i 30 sek. – bare på den ene side. På den måde får de en flot steges korpe, samtidig med at de stadig er rå.
Server bøfferne med de lune, glaserede rødbeder, det sprøde spæk, lidt salat og friskskrabet peberrods-sne (se næste opskrift).
Jeg elsker klassisk kogetorsk med hele svineriet: syltede rødbeder, spæk, høvlet peberrod, fiskesennep og smørsovs – og denne forret er min moderne variant af den evigtgode kombination.
It's a recipe text: the first half is the ingredients with their quantities and the second half is the steps. I'm trying to use a regex to match the ingredients and the steps separately.
For the ingredients I came up with this regex: ^(\d|[a-z]).*(?=[A-Z].*)?$ and for the steps I used this: ^[A-Z].*.
However, the piece of text is part of a document with many recipes, each with its own ingredients and steps, so I want to identify each recipe separately. To do that I want to match the ingredients up until the first occurrence of the steps pattern, and the steps until the first occurrence of the ingredients pattern.
How do I do that?

You don't need a regex:
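from itertools import groupby

with open("recipes.txt") as f:
    # Group consecutive lines by whether they are lowercase (ingredients) or not (steps).
    # The built-in map works on Python 2 and 3; itertools.imap exists only on Python 2.
    grps = groupby(map(str.rstrip, f), key=str.islower)
    for k, v in grps:
        if k:
            # An ingredient group is followed by its steps group (if there is one)
            ing, steps = list(v), list(next(grps, ("", ""))[1])
            print(ing)
            print(steps)
            print("\n\n")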
That will work regardless of spacing, as long as the ingredient lines are always lowercase.
Input:
200 g torskefilet
havsalt og friskkværnet peber
8 tynde skiver røget spæk
150 g rødbede (ca. 2 stk.)
½ dl blommeeddike
1 spsk. olivenolie
1 spsk. rørsukker
lidt rødbedeblade eller anden bitter salat
Vask rødbederne, kom dem i en gryde med vand, og bring dem i kog. Kog rødbederne i ca. 30 min., til de er møre, men stadig har lidt bid.
Kom koldt vand på rødbederne, så de køler lidt af, og ”gnid” skindet af efterfølgende, så du kommer ind til rødbedens silkebløde indre.
Skær rødbeder ud i grove tern, og kom dem i en gryde med eddike, sukker og olivenolie.
Lad eddiken koge ind i rødbederne, så de bliver glaserede, og lagen får konsistens som sirup.
Tjek torskefileterne for evt. vildfarne ben og for friskhed (de skal dufte af hav – ikke af havn!).
Skær torsken ud i helt små tern. Brug en skarp kniv, så kødet bliver skåret – ikke moset.
Kom torsketataren i en skål, krydr med salt og peber, og rør rundt.
Form tataren til 4 små bøffer, og sæt dem i køleskabet i 10 min., så tataren lige sætter sig inden den steges.
Varm en pande op, og steg de tynde skiver spæk, så de bliver sprøde og krøller lidt sammen. Tag spækket af panden, og læg det på et stykke køkkenrulle.
Hæld overskydende fedt af panden, og steg så tatarbøfferne i 30 sek. – bare på den ene side. På den måde får de en flot steges korpe, samtidig med at de stadig er rå.
Server bøfferne med de lune, glaserede rødbeder, det sprøde spæk, lidt salat og friskskrabet peberrods-sne (se næste opskrift).
Jeg elsker klassisk kogetorsk med hele svineriet: syltede rødbeder, spæk, høvlet peberrod, fiskesennep og smørsovs – og denne forret er min moderne variant af den evigtgode kombination.
200 blah
400 bar
foobar bar
Foobar
Bar
Output (as printed by Python 2, so non-ASCII characters appear as UTF-8 byte escapes):
['200 g torskefilet', 'havsalt og friskkv\xc3\xa6rnet peber', '8 tynde skiver r\xc3\xb8get sp\xc3\xa6k', '150 g r\xc3\xb8dbede (ca. 2 stk.)', '\xc2\xbd dl blommeeddike', '1 spsk. olivenolie', '1 spsk. r\xc3\xb8rsukker', 'lidt r\xc3\xb8dbedeblade eller anden bitter salat']
['', 'Vask r\xc3\xb8dbederne, kom dem i en gryde med vand, og bring dem i kog. Kog r\xc3\xb8dbederne i ca. 30 min., til de er m\xc3\xb8re, men stadig har lidt bid.', '', 'Kom koldt vand p\xc3\xa5 r\xc3\xb8dbederne, s\xc3\xa5 de k\xc3\xb8ler lidt af, og \xe2\x80\x9dgnid\xe2\x80\x9d skindet af efterf\xc3\xb8lgende, s\xc3\xa5 du kommer ind til r\xc3\xb8dbedens silkebl\xc3\xb8de indre.', '', 'Sk\xc3\xa6r r\xc3\xb8dbeder ud i grove tern, og kom dem i en gryde med eddike, sukker og olivenolie.', '', 'Lad eddiken koge ind i r\xc3\xb8dbederne, s\xc3\xa5 de bliver glaserede, og lagen f\xc3\xa5r konsistens som sirup.', '', 'Tjek torskefileterne for evt. vildfarne ben og for friskhed (de skal dufte af hav \xe2\x80\x93 ikke af havn!).', '', 'Sk\xc3\xa6r torsken ud i helt sm\xc3\xa5 tern. Brug en skarp kniv, s\xc3\xa5 k\xc3\xb8det bliver sk\xc3\xa5ret \xe2\x80\x93 ikke moset.', '', 'Kom torsketataren i en sk\xc3\xa5l, krydr med salt og peber, og r\xc3\xb8r rundt.', 'Form tataren til 4 sm\xc3\xa5 b\xc3\xb8ffer, og s\xc3\xa6t dem i k\xc3\xb8leskabet i 10 min., s\xc3\xa5 tataren lige s\xc3\xa6tter sig inden den steges.', '', 'Varm en pande op, og steg de tynde skiver sp\xc3\xa6k, s\xc3\xa5 de bliver spr\xc3\xb8de og kr\xc3\xb8ller lidt sammen. Tag sp\xc3\xa6kket af panden, og l\xc3\xa6g det p\xc3\xa5 et stykke k\xc3\xb8kkenrulle.', 'H\xc3\xa6ld overskydende fedt af panden, og steg s\xc3\xa5 tatarb\xc3\xb8fferne i 30 sek. \xe2\x80\x93 bare p\xc3\xa5 den ene side. P\xc3\xa5 den m\xc3\xa5de f\xc3\xa5r de en flot steges korpe, samtidig med at de stadig er r\xc3\xa5.', '', 'Server b\xc3\xb8fferne med de lune, glaserede r\xc3\xb8dbeder, det spr\xc3\xb8de sp\xc3\xa6k, lidt salat og friskskrabet peberrods-sne (se n\xc3\xa6ste opskrift).', '', 'Jeg elsker klassisk kogetorsk med hele svineriet: syltede r\xc3\xb8dbeder, sp\xc3\xa6k, h\xc3\xb8vlet peberrod, fiskesennep og sm\xc3\xb8rsovs \xe2\x80\x93 og denne forret er min moderne variant af den evigtgode kombination.']
['200 blah', '400 bar', 'foobar bar']
['', 'Foobar', 'Bar']
If you want to remove the empty lines:
grps = groupby((line for line in map(str.rstrip, f) if line), key=str.islower)
This removes any empty lines, e.g.:
['200 blah', '400 bar', 'foobar bar']
['Foobar', 'Bar']
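The same grouping idea can be tried without a file. The sketch below works on an in-memory string and pairs each ingredient block with the steps block that follows it; the function name split_recipes and the shortened sample are illustrative, and it relies on the same assumption that ingredient lines contain no uppercase letters.

```python
from itertools import groupby

sample = """200 g torskefilet
1 spsk. olivenolie

Vask rødbederne.
Kom koldt vand på rødbederne.
200 blah
foobar bar
Foobar
Bar"""


def split_recipes(text):
    """Return a list of (ingredients, steps) pairs, one per recipe."""
    # Drop blank lines so they cannot end up inside a steps block
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # groupby yields alternating runs: lowercase lines, then the rest
    groups = [(is_ingredient, list(block))
              for is_ingredient, block in groupby(lines, key=str.islower)]
    recipes = []
    for i, (is_ingredient, block) in enumerate(groups):
        if is_ingredient:
            steps = groups[i + 1][1] if i + 1 < len(groups) else []
            recipes.append((block, steps))
    return recipes


recipes = split_recipes(sample)
for ingredients, steps in recipes:
    print(ingredients)
    print(steps)
```

Materializing the groups in a list first avoids advancing the groupby iterator by hand, at the cost of holding the whole text in memory.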

It seems like the first empty line is between the ingredients and the steps, so you might not even need regex:
ingredients, steps = recipe_string.split("\n\n", 1)
This will split the string at the first instance of 2 consecutive newlines.

Python re.match() not working on string with accentuated chars

The pattern works fine on test input that contains no accented characters such as á é í ã õ ñ,
but it returns no matches when I run it against the actual Brazilian Portuguese text with accents.
I tried changing encodings, with no luck. Any help?
EDIT: Regex complete info here
HEX sample input: 50:72:6f:63:65:73:73:6f:20:31:30:35:36:39:32:32:2d:38:34:2e:32:30:31:33:2e:38:2e
:32:36:2e:30:31:30:30:20:2d:20:45:78:65:63:75:c3:a7:c3:a3:6f:20:64:65:20:54:c3:a
d:74:75:6c:6f:20:45:78:74:72:61:6a:75:64:69:63:69:61:6c:20:2d:20:45:73:70:c3:a9:
63:69:65:73:20:64:65:20:43:6f:6e:74:72:61:74:6f:73:20:2d:20:4d:4f:42:49:4c:49:4e
:53:20:46:4f:52:4d:41:c3:87:c3:83:4f:20:50:52:4f:46:49:53:53:49:4f:4e:41:4c:20:4
5:4d:20:42:45:4c:45:5a:41:20:4c:54:44:41:2e:20:2d:20:4a:55:4c:49:41:4e:41:20:4d:
41:52:41:4e:48:c3:83:4f:20:50:4f:52:54:4f:20:44:41:20:53:49:4c:56:45:49:52:41:20
:2d:20:56:69:73:74:6f:73:2e:20:44:65:66:69:72:6f:20:6f:20:70:65:64:69:64:6f:20:7
0:61:72:61:20:61:20:70:65:73:71:75:69:73:61:20:64:65:20:62:65:6e:73:20:64:61:20:
70:61:72:74:65:20:72:65:71:75:65:72:69:64:61:20:4a:55:4c:49:41:4e:41:20:4d:41:52
:41:4e:48:c3:83:4f:20:50:4f:52:54:4f:20:44:41:20:53:49:4c:56:45:49:52:41:2c:20:4
3:50:46:20:30:33:30:2e:37:39:37:2e:35:36:34:2d:39:35:20:28:64:65:63:6c:61:72:61:
c3:a7:c3:a3:6f:20:64:6f:73:20:63:69:6e:63:6f:20:c3:ba:6c:74:69:6d:6f:73:20:65:78
:65:72:63:c3:ad:63:69:6f:73:29:2c:20:6f:20:71:75:61:6c:20:c3:a9:20:72:65:61:6c:6
9:7a:61:64:6f:2c:20:6e:65:73:74:61:20:64:61:74:61:2c:20:70:6f:72:20:6d:65:69:6f:
20:64:65:20:6f:66:c3:ad:63:69:6f:20:65:6e:76:69:61:64:6f:20:c3:a0:20:52:65:63:65
:69:74:61:20:46:65:64:65:72:61:6c:2c:20:70:72:6f:74:6f:63:6f:6c:61:64:6f:20:65:6
c:65:74:72:6f:6e:69:63:61:6d:65:6e:74:65:2c:20:70:6f:72:20:69:6e:74:65:72:6d:c3:
a9:64:69:6f:20:64:6f:20:73:69:73:74:65:6d:61:20:49:4e:46:4f:4a:55:44:2e:20:49:6e
:74:69:6d:65:2d:73:65:2e:20:2d:20:41:44:56:3a:20:4d:41:54:48:45:55:53:20:44:45:2
0:4f:4c:49:56:45:49:52:41:20:54:41:56:41:52:45:53:20:28:4f:41:42:20:31:36:30:37:
31:31:2f:53:50:29:50:72:6f:63:65:73:73:6f:20:31:30:35:36:39:32:32:2d:38:34:2e:32
:30:31:33:2e:38:2e:32:36:2e:30:31:30:30:20:2d:20:45:78:65:63:75:c3:a7:c3:a3:6f:2
0:64:65:20:54:c3:ad:74:75:6c:6f:20:45:78:74:72:61:6a:75:64:69:63:69:61:6c:20:2d:
20:45:73:70:c3:a9:63:69:65:73:20:64:65:20:43:6f:6e:74:72:61:74:6f:73:20:2d:20:4d
:4f:42:49:4c:49:4e:53:20:46:4f:52:4d:41:c3:87:c3:83:4f:20:50:52:4f:46:49:53:53:4
9:4f:4e:41:4c:20:45:4d:20:42:45:4c:45:5a:41:20:4c:54:44:41:2e:20:2d:20:4a:55:4c:
49:41:4e:41:20:4d:41:52:41:4e:48:c3:83:4f:20:50:4f:52:54:4f:20:44:41:20:53:49:4c
:56:45:49:52:41:20:2d:20:56:69:73:74:6f:73:2e:20:31:29:20:43:69:c3:aa:6e:63:69:6
1:20:64:61:20:72:65:73:70:6f:73:74:61:20:64:6f:20:6f:66:c3:ad:63:69:6f:20:65:78:
70:65:64:69:64:6f:20:c3:a0:20:52:65:63:65:69:74:61:20:46:65:64:65:72:61:6c:2c:20
:66:69:63:61:6e:64:6f:20:6f:73:20:64:61:64:6f:73:20:73:69:67:69:6c:6f:73:6f:73:2
0:61:72:71:75:69:76:61:64:6f:73:20:65:6d:20:70:61:73:74:61:20:70:72:c3:b3:70:72:
69:61:2e:20:32:29:20:50:6f:72:20:63:6f:6e:73:65:67:75:69:6e:74:65:2c:20:61:20:70
:61:72:74:65:20:65:78:65:71:75:65:6e:74:65:20:64:65:76:65:20:6d:61:6e:69:66:65:7
3:74:61:72:2d:73:65:2c:20:65:6d:20:63:69:6e:63:6f:20:64:69:61:73:2e:20:4e:6f:20:
73:69:6c:c3:aa:6e:63:69:6f:2c:20:61:6f:20:61:72:71:75:69:76:6f:2e:20:49:6e:74:69
:6d:65:2d:73:65:2e:20:2d:20:41:44:56:3a:20:4d:41:54:48:45:55:53:20:44:45:20:4f:4
c:49:56:45:49:52:41:20:54:41:56:41:52:45:53:20:28:4f:41:42:20:31:36:30:37:31:31:
2f:53:50:29:50:72:6f:63:65:73:73:6f:20:31:30:35:37:32:38:30:2d:31:35:2e:32:30:31
:34:2e:38:2e:32:36:2e:30:31:30:30

This has nothing to do with accented characters. The answer provided to you doesn't work because:
In the new input the word Process was replaced with Processo.
The new input has several instances of the regular expression pattern, so re.findall should be invoked, rather than re.match (in fact, since the old input has several instances as well, that solution won't work perfectly there either).
Therefore, here is the correct solution:
>>> print input
Processo 1056922-84.2013.8.26.0100 - Execução de Título Extrajudicial - Espécies de Contratos - MOBILINS FORMAÇÃO PROFISSIONAL EM BELEZA LTDA. - JULIANA MARANHÃO PORTO DA SILVEIRA - Vistos. Defiro o pedido para a pesquisa de bens da parte requerida JULIANA MARANHÃO PORTO DA SILVEIRA, CPF 030.797.564-95 (declaração dos cinco últimos exercícios), o qual é realizado, nesta data, por meio de ofício enviado à Receita Federal, protocolado eletronicamente, por intermédio do sistema INFOJUD. Intime-se. - ADV: MATHEUS DE OLIVEIRA TAVARES (OAB 160711/SP)Processo 1056922-84.2013.8.26.0100 - Execução de Título Extrajudicial - Espécies de Contratos - MOBILINS FORMAÇÃO PROFISSIONAL EM BELEZA LTDA. - JULIANA MARANHÃO PORTO DA SILVEIRA - Vistos. 1) Ciência da resposta do ofício expedido à Receita Federal, ficando os dados sigilosos arquivados em pasta própria. 2) Por conseguinte, a parte exequente deve manifestar-se, em cinco dias. No silêncio, ao arquivo. Intime-se. - ADV: MATHEUS DE OLIVEIRA TAVARES (OAB 160711/SP)Processo 1057280-15.2014.8.26.0100
>>> regex = re.compile('(Processo \\d{7}\\-\\d{2}\\.\\d{4}\\.\\d+\\.\\d{2}\\.\\d{4}.*?)(?=Processo)|(Processo \\d{7}\\-\\d{2}\\.\\d{4}\\.\\d+\\.\\d{2}\\.\\d{4}.*)')
>>> regex.findall(input)
[('Processo 1056922-84.2013.8.26.0100 - Execu\xc3\xa7\xc3\xa3o de T\xc3\xadtulo Extrajudicial - Esp\xc3\xa9cies de Contratos - MOBILINS FORMA\xc3\x87\xc3\x83O PROFISSIONAL EM BELEZA LTDA. - JULIANA MARANH\xc3\x83O PORTO DA SILVEIRA - Vistos. Defiro o pedido para a pesquisa de bens da parte requerida JULIANA MARANH\xc3\x83O PORTO DA SILVEIRA, CPF 030.797.564-95 (declara\xc3\xa7\xc3\xa3o dos cinco \xc3\xbaltimos exerc\xc3\xadcios), o qual \xc3\xa9 realizado, nesta data, por meio de of\xc3\xadcio enviado \xc3\xa0 Receita Federal, protocolado eletronicamente, por interm\xc3\xa9dio do sistema INFOJUD. Intime-se. - ADV: MATHEUS DE OLIVEIRA TAVARES (OAB 160711/SP)', ''), ('Processo 1056922-84.2013.8.26.0100 - Execu\xc3\xa7\xc3\xa3o de T\xc3\xadtulo Extrajudicial - Esp\xc3\xa9cies de Contratos - MOBILINS FORMA\xc3\x87\xc3\x83O PROFISSIONAL EM BELEZA LTDA. - JULIANA MARANH\xc3\x83O PORTO DA SILVEIRA - Vistos. 1) Ci\xc3\xaancia da resposta do of\xc3\xadcio expedido \xc3\xa0 Receita Federal, ficando os dados sigilosos arquivados em pasta pr\xc3\xb3pria. 2) Por conseguinte, a parte exequente deve manifestar-se, em cinco dias. No sil\xc3\xaancio, ao arquivo. Intime-se. - ADV: MATHEUS DE OLIVEIRA TAVARES (OAB 160711/SP)', ''), ('', 'Processo 1057280-15.2014.8.26.0100')]
If both inputs are legal (i.e. the input may contain the word Process and may contain the word Processo), then this regular expression should be used:
>>> regex = re.compile('(Processo? \\d{7}\\-\\d{2}\\.\\d{4}\\.\\d+\\.\\d{2}\\.\\d{4}.*?)(?=Processo?)|(Processo? \\d{7}\\-\\d{2}\\.\\d{4}\\.\\d+\\.\\d{2}\\.\\d{4}.*)')
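As a rough Python 3 illustration of the same idea, the two alternatives can be folded into one pattern whose lookahead stops at either the next case number or the end of the text. The names CASE_NUMBER and RECORD and the shortened sample text are made up for this sketch; the case-number shape is taken from the pattern above.

```python
import re

# "Process" or "Processo", followed by a case number such as 1056922-84.2013.8.26.0100
CASE_NUMBER = r"Processo? \d{7}-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}"

# Lazily consume text until the next case number or the end of the string
RECORD = re.compile(CASE_NUMBER + r".*?(?=" + CASE_NUMBER + r"|\Z)", re.DOTALL)

text = (
    "Processo 1056922-84.2013.8.26.0100 - Execução de Título Extrajudicial - Intime-se."
    "Processo 1057280-15.2014.8.26.0100"
)

# findall scans the whole string; re.match would only try at position 0
records = RECORD.findall(text)
for record in records:
    print(record)
```

Because the pattern has no capturing groups, findall returns plain strings rather than tuples with empty entries.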
