XPath with Scrapy node begins with \n - python

I'm using scrapy on html like:
<td nowrap="" valign="top" align="right">
<br>
Text is here.
<br>
Other text is here
<br>
</td>
td[1]/text()[1] gives me:
(empty line)
Text is here.
I've tried normalize-space, i.e. normalize-space(td[1]/text()[1]), which works when I test in my firefox extension, but not in scrapy. I think scrapy is getting tripped up by the \n and it skips over (or only takes first line of node, which is nothing). I've also tried some "preceding" and "following" code, but I think it might be considered one element, my DOM says the nodeValue = "\nText is here" Any thoughts?,

Extract every text, get the desired one by index. For instance:
response.xpath("//table[#id='myid']/tr[1]/td[1]//text()")[1]
Demo from the Scrapy Shell:
$ scrapy shell http://www.trobar.org/troubadours/coms_de_peiteu/guilhen_de_peiteu_01.php
In [1]: table = response.xpath("//table")[2]
In [2]: td = "".join(table.xpath(".//td[1]//text()").extract())
In [3]: print(td)
Companho, farai un vers qu'er covinen,
Et aura-i mais de foudatz no-y a de sen,
Et er totz mesclatz d'amor e de joy e de joven.
E tenguatz lo per vilan qui no-l enten,
O dins son cor voluntiers non l'apren:
Greu partir si fai d'amor qui la troba a talen.
Dos cavalhs ai a ma sselha, ben e gen,
Bon son et adreg per armas e valen,
E no-ls puesc amdos tener, que l'us l'autre non cossen.
Si-ls pogues adomesjar a mon talen,
Ja no volgr'alhors mudar mon garnimen,
Que meils for'encavalguatz de nuill ome viven.
Launs fon dels montaniers lo plus corren,
Mas aitan fer' estranhez'a longuamen
Et es tan fers e salvatges, que del bailar si defen.
L'autre fon noyritz sa jus part Cofolen
Ez anc no-n vis bellazor, mon escien:
Aquest non er ja camjatz ni per aur ni per argen.
Qu'ie-l donei a son senhor polin payssen,
Pero si-m retinc ieu tan de covenen
Que, s'ilh lo tenia un an, qu'ieu lo tengues mais de cen.
Cavalier, datz mi cosselh d'un pessamen:
-Anc mays no fuy issaratz de cauzimen- :
Res non sai ab qual me tengua, de n'Agnes o de n'Arsen.
De Gimel ai lo castel e-l mandamen,
E per Niol fauc ergueill a tota gen:
C'ambedui me son jurat e plevit per sagramen.

Related

Identify the names of the interlocutors (most frequent words that follow a certain pattern) to separate the dialog lines of a chat using regex

import re
#To read input data file
with open("dm_chat_data.txt") as input_data_file:
print(input_data_file.read())
#To write corrections in a new text file
with open('dm_chat_data_fixed.txt', 'w') as file:
file.write('\n')
This is the text file extracted by webscraping, but the lines of the dialogs of each of its chat partners are not separated, so the program must identify when each user starts the dialog.
File dm_chat_data.txt
Desempleada_19: HolaaLucyGirl: hola como estas?Desempleada_19: Masomenos y vos LucyGirl?Desempleada_19: Q edad tenes LucyGirl: tengo 19LucyGirl: masomenos? que paso? (si se puede preguntar claro)Desempleada_19: Yo tmb 19 me llamo PriscilaDesempleada_19: Desempleada_19: Q hacías LucyGirl: Entre al chat para ver que onda, no frecuento mucho
Charge file [100%] (ddddfdfdfd)
LucyGirl: Desempleada_19: Gracias!
AndrewSC: HolaAndrewSC: Si quieres podemos hablar LyraStar: claro LyraStar: que cuentas amigaAndrewSC: Todo bien y tú?
Charge file [100%] (ddddfdfdfd)
LyraStar: LyraStar: que tal ese auto?AndrewSC: Creo que...Diria que... ya son las 19 : 00 hs AndrewSC: Muy bien la verdad
Bsco_Pra_Cap_: HolaBsco_Pra_Cap_: como vaBsco_Pra_Cap_: Jorge, 47, de Floresta, me presento a la entrevista, vos?Bsco_Pra_Cap_: es aqui, cierto?LucyFlame: holaaLucyFlame: estas?LucyFlame: soy una programadora de la ciudad de HudsonBsco_Pra_Cap_: de Hudson centro? o hudson alejado...?Bsco_Pra_Cap_: contame, Lu, que buscas en esta organizacion?
And this is the file that you must create separating the dialogues of each interlocutor in each of the chats. The file edited_dm_chat_data.txt need to be like this...
Desempleada_19: Holaa
LucyGirl: hola como estas?
Desempleada_19: Masomenos y vos LucyGirl?
Desempleada_19: Q edad tenes
LucyGirl: tengo 19
LucyGirl: masomenos? que paso? (si se puede preguntar claro)
Desempleada_19: Yo tmb 19 me llamo Priscila
Desempleada_19:
Desempleada_19: Q hacías
LucyGirl: Entre al chat para ver que onda, no frecuento mucho
Charge file [100%] (ddddfdfdfd)
LucyGirl:
Desempleada_19: Gracias!
AndrewSC: Hola
AndrewSC: Si quieres podemos hablar
LyraStar: claro
LyraStar: que cuentas amiga
AndrewSC: Todo bien y tú?
Charge file [100%] (ddddfdfdfd)
LyraStar: LyraStar: que tal ese auto?
AndrewSC: Creo que...Diria que... ya son las 19 : 00 hs
AndrewSC: Muy bien la verdad
Bsco_Pra_Cap_: Hola
Bsco_Pra_Cap_: como va
Bsco_Pra_Cap_: Jorge, 47, de Floresta, me presento a la entrevista, vos?Bsco_Pra_Cap_: es aqui, cierto?
LucyFlame: holaa
LucyFlame: estas?
LucyFlame: soy una programadora de la ciudad de Hudson
Bsco_Pra_Cap_: de Hudson centro? o hudson alejado...?
Bsco_Pra_Cap_: contame, Lu, que buscas en esta organizacion?
I have tried to use regex, where each interlocutor is represented by a "Word" that begins in uppercase immediately followed by ": "
But there are some lines that give some problems to this logic, for example "Bsco_Pra_Cap_: HolaBsco_Pra_Cap_: como va", where the substring "Hola" is a simply word that is not a name and is attached to the name with capital letters, then it would be confused and consider "HolaBsco_Pra_Cap_: " as a name, but it's incorrect because the correct users name is "Bsco_Pra_Cap_: "
This problem arises because we don't know what the nicknames of the interlocutor users will be, and... the only thing we know is the structure where they start with a capital letter and end in : and then an empty space, but one thing I've noticed is that in all chats the names of the conversation partners are the most repeated words, so I think I could use a regular expression pattern as a word frequency counter by setting a search criteria like this "[INITIAL CAPITAL LETTER] hjasahshjas: " , and put as line separators those substrings with these characteristics as long as they are the ones that are repeated the most throughout the file
input_data_file = open("dm_chat_data.txt", "r+")
#maybe you can use something like this to count the occurrences and thus identify the nicknames
input_data_file.count(r"[A-Z][^A-Z]*:\s")
I think it is quite hard. but you can build a rules as shown in below code:
import nltk
from collections import Counter
text = '''Desempleada_19: HolaaLucyGirl: hola como estas?Desempleada_19:
Masomenos y vos LucyGirl?Desempleada_19: Q edad tenes LucyGirl: tengo
19LucyGirl: masomenos? que paso? (si se puede preguntar claro)Desempleada_19: Yo
tmb 19 me llamo PriscilaDesempleada_19: Desempleada_19: Q hacías LucyGirl: Entre
al chat para ver que onda, no frecuento mucho
Charge file [100%] (ddddfdfdfd)
LucyGirl: Desempleada_19: Gracias!
AndrewSC: HolaAndrewSC: Si quieres podemos hablar LyraStar: claro LyraStar: que
cuentas amigaAndrewSC: Todo bien y tú?
Charge file [100%] (ddddfdfdfd)
LyraStar: LyraStar: que tal ese auto?AndrewSC: Creo que...Diria que... ya son
las 19 : 00 hs AndrewSC: Muy bien la verdad
Bsco_Pra_Cap_: HolaBsco_Pra_Cap_: como vaBsco_Pra_Cap_: Jorge, 47, de Floresta,
me presento a la entrevista, vos?Bsco_Pra_Cap_: es aqui, cierto?LucyFlame:
holaaLucyFlame: estas?LucyFlame: soy una programadora de la ciudad de
HudsonBsco_Pra_Cap_: de Hudson centro? o hudson alejado...?Bsco_Pra_Cap_:
contame, Lu, que buscas en esta organizacion?
'''
data = nltk.word_tokenize(text)
user_lst = []
for ind, val in enumerate(data):
if val == ':':
user_lst.append(data[ind - 1])
# printing duplicates assuming the users were speaking more than one time. if a
user has one dialog box it fails.
users = [k for k, v in Counter(user_lst).items() if v > 1]
# function to replace a string:
def replacer(string, lst):
for v in lst:
string = string.replace(v, f' {v}')
return string
# replace users in old text with single space in it.
refined_text = replacer(text, users)
refined_data = nltk.word_tokenize(refined_text)
correct_users = []
dialog = []
for ind, val in enumerate(refined_data):
if val == ':':
correct_users.append(refined_data[ind - 1])
if val not in users:
dialog.append(val)
correct_dialog = ' '.join(dialog).replace(':', '<:').split('<')
strip_dialog = [i.strip() for i in correct_dialog if i.strip()]
chat = []
for i in range(len(correct_users)):
chat.append(f'{correct_users[i]}{strip_dialog[i]}')
print(chat)
>>>> ['Desempleada_19: Holaa', 'LucyGirl: hola como estas ?', 'Desempleada_19: Masomenos y vos ?', 'Desempleada_19: Q edad tenes', 'LucyGirl: tengo 19', 'LucyGirl: masomenos ? que paso ? ( si se puede preguntar claro )', 'Desempleada_19: Yo tmb 19 me llamo Priscila', 'Desempleada_19:', 'Desempleada_19: Q hacías', 'LucyGirl: Entre al chat para ver que onda , no frecuento mucho Charge file [ 100 % ] ( ddddfdfdfd )', 'LucyGirl:', 'Desempleada_19: Gracias !', 'AndrewSC: Hola', 'AndrewSC: Si quieres podemos hablar', 'LyraStar: claro', 'LyraStar: que cuentas amiga', 'AndrewSC: Todo bien y tú ? Charge file [ 100 % ] ( ddddfdfdfd )', 'LyraStar:', 'LyraStar: que tal ese auto ?', 'AndrewSC: Creo que ... Diria que ... ya son las 19', '19: 00 hs', 'AndrewSC: Muy bien la verdad', 'Bsco_Pra_Cap_: Hola', 'Bsco_Pra_Cap_: como va', 'Bsco_Pra_Cap_: Jorge , 47 , de Floresta , me presento a la entrevista , vos ?', 'Bsco_Pra_Cap_: es aqui , cierto ?', 'LucyFlame: holaa', 'LucyFlame: estas ?', 'LucyFlame: soy una programadora de la ciudad de Hudson', 'Bsco_Pra_Cap_: de Hudson centro ? o hudson alejado ... ?', 'Bsco_Pra_Cap_: contame , Lu , que buscas en esta organizacion ?']

Python selenium can't extract text

I'm trying to scrape the text from a list, this is the URL:
https://www.eneba.com/es/lego-dimensions-starter-pack-playstation-4
This is my code:
1º I find de list (ul)
2º for each li in ul print the text
ul = driver.find_element_by_xpath('//h2[2]/following-sibling::ul')
li = ul.find_elements_by_tag_name('li')
#print(li.text)
for element in li:
print(element.text)
This code returns blank spaces instead of text, what am I doing wrong ?
There some answers of people that say things without checking anything.
1º The xpath is definetly there, check it
2º There are 29 H2 in the html of the page, not just one
I received a negative vote from people who say the 1º thing the think without even checking anything.
What I want to extract is this text:
• Rompecabezas - resolver varios acertijos es una de las mecánicas centrales del juego;
• Para todos los públicos – El juego es apropiado para jugadores de todas las edades;
• Arcade - los jugadores deben terminar con éxito los niveles que aumentan en dificultad a medida que avanzan en el juego;
• Acción - este título incluye desafíos que deben superarse utilizando habilidades como precisión, tiempo de respuesta rápido, etc.;
• Superhéroes - Los jugadores entran en un mundo peligroso, donde los únicos capaces de detener el crimen son los héroes bendecidos con poderes únicos;
• Un jugador - el juego presenta una campaña en solitario con una historia;
• Multijugador local - esta función permite que varias personas participen en los mismos partidos, ya sea a través de la pantalla dividida o la misma conexión de red.
1 I've managed to get the data from my region, but I had to spend some time on selecting Spain from the list of countries.
Please note that I avoided using xpath locators as they would be too long.
All explanations are in the comments to the code.
2 To get a text from an element .text is usually used, but for this case is does not work. So get_attribute("innerHTML") is what you really need.
I left all waits that I used. YOu can debug by yourself and remove ones that are not necessary.
Solution:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
driver.get('https://www.eneba.com/es/lego-dimensions-starter-pack-playstation-4')
wait = WebDriverWait(driver, 15)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "._1FArM6>.qGNWom.qGNWom"))).click() # accept cookies
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".EcNujK._2afX4x._1OhNBA"))).click() # click region button
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#region .css-1hwfws3"))).click() # change region
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#region .react-select__input>input")))
driver.find_element_by_css_selector("#region .react-select__input>input").send_keys("spa") # input country name
driver.find_element_by_css_selector("#react-select-2-option-5").click() # select found country
driver.find_element_by_css_selector("._3Fpvn5>button[type='submit']").click() # submit country
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div[itemprop=description]>ul>li"))) # wait for all li elements
cards = driver.find_elements_by_css_selector("div[itemprop=description]>ul>li")
for card in cards:
print(card.get_attribute("innerHTML"))
driver.close()
driver.quit()
Result:
• Rompecabezas - resolver varios acertijos es una de las mecánicas centrales del juego;
• Para todos los públicos – El juego es apropiado para jugadores de todas las edades;
• Arcade - los jugadores deben terminar con éxito los niveles que aumentan en dificultad a medida que avanzan en el juego;
• Acción - este título incluye desafíos que deben superarse utilizando habilidades como precisión, tiempo de respuesta rápido, etc.;
• Superhéroes - Los jugadores entran en un mundo peligroso, donde los únicos capaces de detener el crimen son los héroes bendecidos con poderes únicos;
• Un jugador - el juego presenta una campaña en solitario con una historia;
• Multijugador local - esta función permite que varias personas participen en los mismos partidos, ya sea a través de la pantalla dividida o la misma conexión de red.
There are no elements located by the xpath you defined '//h2[2]/following-sibling::ul.
This is why ul is actually a null and li is an empty list.

Python Regex to Find Special Characters and characters in between

I have a csv file that looks like the following
Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;2842.020;2843.270;Unknown;; tecnici delle societ…
Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;2903.310;2906.360;Unknown;; pu• avere un profilo specifico
Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;2745.860;2749.060;Unknown;; Š quadruplicato rispetto al 1967.
Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;1023.580;1026.250;Unknown;; monitoraggio fosse completo e cosŤ via.
Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;708.870;711.290;Unknown;; Non solo un ponte, ma qualcosa di pi—.
Porta-a-Porta-d605218c-b8c5-4b3b-9086-b83e4c958bf5;4199.210;4200.540;Unknown;; piů straziante.
Porta-a-Porta-c28a23f4-d7b0-4624-8b49-72ba25be653e;4702.720;4703.900;Unknown;; tant'č che questo ragazzo
Presa-Diretta-Burocrazia-al-potere-ce58265f-da04-4b19-a1ad-2746830cac0a;4229.110;4232.130;Unknown;; a un testo di 13 pagine con 7/8.000 parole.<
Presa-Diretta-Burocrazia-al-potere-ce58265f-da04-4b19-a1ad-2746830cac0a;4541.560;4543.100;Unknown;; sei/otto ore al giorno.<
PresaDiretta-Il-capitale-naturale-8f39ea4f-a5fb-4c93-a504-a04d6482c086;1938.730;1941.830;Unknown;; abbattere i cervi.> Senza di loro, questa terra sarebbe
Quante-storie-15aef095-7ba8-4237-af6e-aded20d1d40a;19.920;22.630;Unknown;; questa puntata {an2}che ha come ospite una
Quante-storie-15aef095-7ba8-4237-af6e-aded20d1d40a;64.080;68.090;Unknown;; {an2}Sì, perché c'è come un ritegno a venire in una
Quante-storie-200b0694-7d54-4b5c-af5a-b54cae157ffd;446.730;447.790;Unknown;; della nostra Patria. {an2}[LA
Quante-storie-2583a3a2-2e8c-4589-bede-933736b65043;1781.910;1783.030;Unknown;; UDIBILI]
Porta-a-Porta-3b4b81d5-2f0f-4e51-9c29-00f9a2aa4444;4159.470;4160.890;Unknown;; bianca torneremo.#
Porta-a-Porta-3b4b81d5-2f0f-4e51-9c29-00f9a2aa4444;4196.930;4198.230;Unknown;; del sole#
and I am trying to spot unnecessary characters that should not belong in this file such as < or { or {an2} or [ and so on.
This is the regex I have right now and does the job well except it does not catch some cases like {an2} or # as described above. I would like to find everything including an2 and leave every Italian characters as is.
[^a-zA-Z0-9;'"\.\- ,\?:£\]\[\/()%!èàéùòìíŕěúůňčÂŤŠÈÉôü&+<>##$%^…—‚–]
Let me know if there is any easier way to solve this problem.
My guess is that, maybe we would find those undesired parts, then replace with an empty string, with some expressions similar to:
{.+?}|[\[\]<>]
Test
import re
regex = r"{.+?}|[\[\]<>]"
test_str = ("Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;2842.020;2843.270;Unknown;; tecnici delle societ…\n"
"Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;2903.310;2906.360;Unknown;; pu• avere un profilo specifico\n"
"Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;2745.860;2749.060;Unknown;; Š quadruplicato rispetto al 1967.\n"
"Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;1023.580;1026.250;Unknown;; monitoraggio fosse completo e cosŤ via.\n"
"Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;708.870;711.290;Unknown;; Non solo un ponte, ma qualcosa di pi—.\n"
"Porta-a-Porta-d605218c-b8c5-4b3b-9086-b83e4c958bf5;4199.210;4200.540;Unknown;; piů straziante.\n"
"Porta-a-Porta-c28a23f4-d7b0-4624-8b49-72ba25be653e;4702.720;4703.900;Unknown;; tant'č che questo ragazzo\n"
"Presa-Diretta-Burocrazia-al-potere-ce58265f-da04-4b19-a1ad-2746830cac0a;4229.110;4232.130;Unknown;; a un testo di 13 pagine con 7/8.000 parole.<\n"
"Presa-Diretta-Burocrazia-al-potere-ce58265f-da04-4b19-a1ad-2746830cac0a;4541.560;4543.100;Unknown;; sei/otto ore al giorno.<\n"
"PresaDiretta-Il-capitale-naturale-8f39ea4f-a5fb-4c93-a504-a04d6482c086;1938.730;1941.830;Unknown;; abbattere i cervi.> Senza di loro, questa terra sarebbe\n"
"Quante-storie-15aef095-7ba8-4237-af6e-aded20d1d40a;19.920;22.630;Unknown;; questa puntata {an2}che ha come ospite una\n"
"Quante-storie-15aef095-7ba8-4237-af6e-aded20d1d40a;64.080;68.090;Unknown;; {an2}Sì, perché c'è come un ritegno a venire in una\n"
"Quante-storie-200b0694-7d54-4b5c-af5a-b54cae157ffd;446.730;447.790;Unknown;; della nostra Patria. {an2}[LA\n"
"Quante-storie-2583a3a2-2e8c-4589-bede-933736b65043;1781.910;1783.030;Unknown;; UDIBILI]\n"
"Porta-a-Porta-3b4b81d5-2f0f-4e51-9c29-00f9a2aa4444;4159.470;4160.890;Unknown;; bianca torneremo.#\n"
"Porta-a-Porta-3b4b81d5-2f0f-4e51-9c29-00f9a2aa4444;4196.930;4198.230;Unknown;; del sole#")
subst = ""
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)
if result:
print (result)
Demo

Classifying data from .arff files with scikit-learn?

In a previous post i learned about the process to follow for classifying text with scikit-learn. In order to organize my data in a better way i discover .arff files let's say i have the following .arff file:
#relation lang_identification
#attribute opinion string
#attribute lang_identification {bos, pt, es, slov}
#data
"Pošto je EULEX obećao da će obaviti istragu o prošlosedmičnom izbijanju nasilja na sjeveru Kosova, taj incident predstavlja još jedan ispit kapaciteta misije da doprinese jačanju vladavine prava.",bos
"De todas as provações que teve de suplantar ao longo da vida, qual foi a mais difícil? O início. Qualquer começo apresenta dificuldades que parecem intransponíveis. Mas tive sempre a minha mãe do meu lado. Foi ela quem me ajudou a encontrar forças para enfrentar as situações mais decepcionantes, negativas, as que me punham mesmo furiosa.",pt
"Al parecer, Andrea Guasch pone que una relación a distancia es muy difícil de llevar como excusa. Algo con lo que, por lo visto, Alex Lequio no está nada de acuerdo. ¿O es que más bien ya ha conseguido la fama que andaba buscando?",es
"Vo väčšine golfových rezortov ide o veľký komplex niekoľkých ihrísk blízko pri sebe spojených s hotelmi a ďalšími možnosťami trávenia voľného času – nie vždy sú manželky či deti nadšenými golfistami, a tak potrebujú iný druh vyžitia. Zaujímavé kombinácie ponúkajú aj rakúske, švajčiarske či talianske Alpy, kde sa dá v zime lyžovať a v lete hrať golf pod vysokými alpskými končiarmi.",slov
I would like to experiment with scikit-learn and classify with a supervised aproach a complete new test string let's say:
test = "Por ello, ha insistido en que Europa tiene que darle un toque de atención porque Portugal esta incumpliendo la directiva del establecimiento del peaje"
Scipy provide an arff loader, let's load an arff file with this:
from scipy.io.arff import loadarff
dataset = loadarff(open('/Users/user/Desktop/toy.arff','r'))
print dataset
This should return something like this: (array([]), how can use numpy record arrays to classify with scikit-learn?.

python ascii to unicode conversion

i have a file with data like this:
\r\n\tSoci\u00e9t\u00e9 implant\u00e9 dans l'internet recrute des t\u00e9l\u00e9conseillers en b to b pour effectuer de la prise de rendez-vous qualifi\u00e9 pour de la conception de site internet et du r\u00e9f\u00e9rencement google.
how can i print it as unicode, like this:
Société implanté dans l'internet recrute des téléconseillers en b to b pour effectuer de la prise de rendez-vous qualifié pour de la conception de site internet et du référencement google.
i know i have to use some unicode function but what?
That looks like a python unicode string literal; decode this from unicode_escape.
Demo:
>>> data = "\r\n\tSoci\u00e9t\u00e9 implant\u00e9 dans l'internet recrute des t\u00e9l\u00e9conseillers en b to b pour effectuer de la prise de rendez-vous qualifi\u00e9 pour de la conception de site internet et du r\u00e9f\u00e9rencement google."
>>> data.decode('unicode_escape')
u"\r\n\tSoci\xe9t\xe9 implant\xe9 dans l'internet recrute des t\xe9l\xe9conseillers en b to b pour effectuer de la prise de rendez-vous qualifi\xe9 pour de la conception de site internet et du r\xe9f\xe9rencement google."
>>> print data.decode('unicode_escape')
Société implanté dans l'internet recrute des téléconseillers en b to b pour effectuer de la prise de rendez-vous qualifié pour de la conception de site internet et du référencement google.
You can either decode the data as you read it from the file (using a binary mode), or you can use io.open() in Python 2, or regular open() in Python 3 to have data decoded on the fly:
from io import open
with open(filename, 'r', encoding="unicode_escape") as inputfile:
for line in inputfile:
print(inputfile)
Note that JSON strings use the same escape syntax; \uhhhh denotes a Unicode codepoint using just ASCII characters.

Categories

Resources