Trimming whitespace in Python bs4

I am trying to remove whitespace from the scraped data. I have referred to all the available solutions, but nothing seems to work.
Here is my code:
from bs4 import BeautifulSoup
import urllib2
url="http://www.sfap.org/klsfaprep_search?page=38&type=1&strname=&loc=&op=Lancer%20la%20recherche&form_build_id=form-72a297de309517ed5a2c28af7ed15208&form_id=klsfaprep_search_form"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
universities=soup.findAll('div',{'class':'field-item odd'})
for eachuniversity in universities:
    #print eachuniversity['href']+","+eachuniversity.string.encode('utf-8').strip()
    print eachuniversity.string if eachuniversity else ''
The output I am getting is
EMSP
None
None
BP J5
98880
NOUMEA
Nouvelle-Calédonie
Intra établissement
Dr Chantal Barbe
c.barbe#cht.nc
00 687 25 66 66 (standard)
emasp#cht.nc
1078 (poste Dr Barbe)
Accueil stagiaire
None
Régional
None
But I want it to be
EMSP,None,None, BP J5,98880,NOUMEA,Nouvelle-Calédonie,Intra établissement,Dr Chantal Barbe, c.barbe#cht.nc, 00 687 25 66 66 (standard), emasp#cht.nc, 1078 (poste Dr Barbe), Accueil stagiaire, None, Régional,None
When I tried other SO answers, I got a NoneType attribute error.
Update
I have improved my script as follows:
from bs4 import BeautifulSoup
import urllib2
url="http://www.sfap.org/klsfaprep_search?page=38&type=1&strname=&loc=&op=Lancer%20la%20recherche&form_build_id=form-72a297de309517ed5a2c28af7ed15208&form_id=klsfaprep_search_form"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
for eachuniversity in soup.findAll('div',{'class':'field-item odd'}):
    print ''.join(eachuniversity.findAll(text=True)).encode('utf-8').strip()
This gives me the following output
EMSP
Nom de la structure: 
EMASP
Hôpital Gaston Bourret
BP J5
98880
NOUMEA
Nouvelle-Calédonie
Intra établissement
Dr Chantal Barbe
c.barbe#cht.nc
00 687 25 66 66 (standard)
emasp#cht.nc
1078 (poste Dr Barbe)
Accueil stagiaire
7h30 17h
Régional
ouverture équipe mobile depuis le 1 aout 2011
Travail au quotidien avec le malade sur demande médecin référent
Activités de formation intra et extra hospitalières sur toute la Nouvelle Calédonie auprès de professionnels de la santé, des auxiliaires de vie, des bénévoles, des prêtres....
Information auprès du grand public
Travail de recherche : étude des problèmes ethniques; évaluation du ressenti des malades walisien et /ou kanak sur l' approche SP et propositions
But I want this to be on a single line, comma-separated.

To print on the same line, just add a , at the end of the print statement:
print ''.join(eachuniversity.findAll(text=True)).encode('utf-8').strip(),',',
You might want to remove newlines from the text (remember to import re first):
print re.sub(r'\s+',' ',''.join(eachuniversity.findAll(text=True)).encode('utf-8')),',',
It will replace all consecutive whitespace characters including newlines with a single space.
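If you prefer to build the whole line explicitly instead of relying on the trailing comma, a minimal sketch along the same lines (same Python 2 / urllib2 setup as above) could collect the cleaned fields into a list and join them once:
import re
import urllib2
from bs4 import BeautifulSoup
url = "http://www.sfap.org/klsfaprep_search?page=38&type=1&strname=&loc=&op=Lancer%20la%20recherche&form_build_id=form-72a297de309517ed5a2c28af7ed15208&form_id=klsfaprep_search_form"
soup = BeautifulSoup(urllib2.urlopen(url).read())
fields = []
for eachuniversity in soup.findAll('div', {'class': 'field-item odd'}):
    text = ''.join(eachuniversity.findAll(text=True))
    # collapse runs of whitespace (including newlines) into a single space
    fields.append(re.sub(r'\s+', ' ', text).strip())
print ','.join(fields).encode('utf-8')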

Related

Python Regex to Find Special Characters and characters in between

I have a csv file that looks like the following
Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;2842.020;2843.270;Unknown;; tecnici delle societ…
Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;2903.310;2906.360;Unknown;; pu• avere un profilo specifico
Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;2745.860;2749.060;Unknown;; Š quadruplicato rispetto al 1967.
Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;1023.580;1026.250;Unknown;; monitoraggio fosse completo e cosŤ via.
Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;708.870;711.290;Unknown;; Non solo un ponte, ma qualcosa di pi—.
Porta-a-Porta-d605218c-b8c5-4b3b-9086-b83e4c958bf5;4199.210;4200.540;Unknown;; piů straziante.
Porta-a-Porta-c28a23f4-d7b0-4624-8b49-72ba25be653e;4702.720;4703.900;Unknown;; tant'č che questo ragazzo
Presa-Diretta-Burocrazia-al-potere-ce58265f-da04-4b19-a1ad-2746830cac0a;4229.110;4232.130;Unknown;; a un testo di 13 pagine con 7/8.000 parole.<
Presa-Diretta-Burocrazia-al-potere-ce58265f-da04-4b19-a1ad-2746830cac0a;4541.560;4543.100;Unknown;; sei/otto ore al giorno.<
PresaDiretta-Il-capitale-naturale-8f39ea4f-a5fb-4c93-a504-a04d6482c086;1938.730;1941.830;Unknown;; abbattere i cervi.> Senza di loro, questa terra sarebbe
Quante-storie-15aef095-7ba8-4237-af6e-aded20d1d40a;19.920;22.630;Unknown;; questa puntata {an2}che ha come ospite una
Quante-storie-15aef095-7ba8-4237-af6e-aded20d1d40a;64.080;68.090;Unknown;; {an2}Sì, perché c'è come un ritegno a venire in una
Quante-storie-200b0694-7d54-4b5c-af5a-b54cae157ffd;446.730;447.790;Unknown;; della nostra Patria. {an2}[LA
Quante-storie-2583a3a2-2e8c-4589-bede-933736b65043;1781.910;1783.030;Unknown;; UDIBILI]
Porta-a-Porta-3b4b81d5-2f0f-4e51-9c29-00f9a2aa4444;4159.470;4160.890;Unknown;; bianca torneremo.#
Porta-a-Porta-3b4b81d5-2f0f-4e51-9c29-00f9a2aa4444;4196.930;4198.230;Unknown;; del sole#
and I am trying to spot unnecessary characters that do not belong in this file, such as <, {, {an2} or [, and so on.
This is the regex I have right now. It does the job well, except that it does not catch some cases like {an2} or # as described above. I would like to find everything, including an2, and leave all the Italian characters as they are.
[^a-zA-Z0-9;'"\.\- ,\?:£\]\[\/()%!èàéùòìíŕěúůňčÂŤŠÈÉôü&+<>##$%^…—‚–]
Let me know if there is any easier way to solve this problem.
My guess is that we could find those undesired parts and then replace them with an empty string, using an expression similar to:
{.+?}|[\[\]<>]
Test
import re
regex = r"{.+?}|[\[\]<>]"
test_str = ("Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;2842.020;2843.270;Unknown;; tecnici delle societ…\n"
"Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;2903.310;2906.360;Unknown;; pu• avere un profilo specifico\n"
"Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;2745.860;2749.060;Unknown;; Š quadruplicato rispetto al 1967.\n"
"Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;1023.580;1026.250;Unknown;; monitoraggio fosse completo e cosŤ via.\n"
"Porta-a-Porta-d87134d1-e2bd-426b-b1f6-90d8dca68855;708.870;711.290;Unknown;; Non solo un ponte, ma qualcosa di pi—.\n"
"Porta-a-Porta-d605218c-b8c5-4b3b-9086-b83e4c958bf5;4199.210;4200.540;Unknown;; piů straziante.\n"
"Porta-a-Porta-c28a23f4-d7b0-4624-8b49-72ba25be653e;4702.720;4703.900;Unknown;; tant'č che questo ragazzo\n"
"Presa-Diretta-Burocrazia-al-potere-ce58265f-da04-4b19-a1ad-2746830cac0a;4229.110;4232.130;Unknown;; a un testo di 13 pagine con 7/8.000 parole.<\n"
"Presa-Diretta-Burocrazia-al-potere-ce58265f-da04-4b19-a1ad-2746830cac0a;4541.560;4543.100;Unknown;; sei/otto ore al giorno.<\n"
"PresaDiretta-Il-capitale-naturale-8f39ea4f-a5fb-4c93-a504-a04d6482c086;1938.730;1941.830;Unknown;; abbattere i cervi.> Senza di loro, questa terra sarebbe\n"
"Quante-storie-15aef095-7ba8-4237-af6e-aded20d1d40a;19.920;22.630;Unknown;; questa puntata {an2}che ha come ospite una\n"
"Quante-storie-15aef095-7ba8-4237-af6e-aded20d1d40a;64.080;68.090;Unknown;; {an2}Sì, perché c'è come un ritegno a venire in una\n"
"Quante-storie-200b0694-7d54-4b5c-af5a-b54cae157ffd;446.730;447.790;Unknown;; della nostra Patria. {an2}[LA\n"
"Quante-storie-2583a3a2-2e8c-4589-bede-933736b65043;1781.910;1783.030;Unknown;; UDIBILI]\n"
"Porta-a-Porta-3b4b81d5-2f0f-4e51-9c29-00f9a2aa4444;4159.470;4160.890;Unknown;; bianca torneremo.#\n"
"Porta-a-Porta-3b4b81d5-2f0f-4e51-9c29-00f9a2aa4444;4196.930;4198.230;Unknown;; del sole#")
subst = ""
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)
if result:
    print (result)
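Note that the pattern above does not cover the trailing # characters mentioned in the question. A minimal sketch that also strips them (the extra # in the character class is my addition, not part of the original pattern):
import re
# also match '#' alongside the {...} groups and the stray brackets/angle brackets
regex = r"{.+?}|[\[\]<>#]"
line = "Porta-a-Porta-3b4b81d5-2f0f-4e51-9c29-00f9a2aa4444;4159.470;4160.890;Unknown;; bianca torneremo.#"
print(re.sub(regex, "", line))
# Porta-a-Porta-3b4b81d5-2f0f-4e51-9c29-00f9a2aa4444;4159.470;4160.890;Unknown;; bianca torneremo.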

Python: Split CSV with character count

I need help importing a CSV file into Python.
My CSV file:
0,Donc, 2 jours, je me suis rendu compte que Musikfest est le lendemain de voir dmb, quel problème. Signifie que je ne peux pas aller ...
0,Le son est définitivement gâché.Noooooo mon bb
0,Il est le mien! Haha il me suit: ') m'aime et me veut.haha.i wana vivre en Amérique annie
I want to split the above file into 2 columns
Column1 ---- Column2
0 ---- Donc, 2 jours, je me suis rendu compte que Musikfest est le
lendemain de voir dmb, quel problème. Signifie que je ne peux pas
aller ...
0 ---- Le son est définitivement gâché.Noooooo mon bb
0 ---- Il est le mien! Haha il me suit: ') m'aime et me veut.haha.i wana
vivre en Amérique annie
My text has embedded commas, and the value I want in the first column is always the first character. Is it possible to read my CSV file by splitting the first character from the rest of the text?
You can use str.split() and specify a max split of 1. By this I mean: if you just want to split the line on the first comma, do not read the file as a CSV. Instead, read it line by line and split each line using line.split(',', 1).
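A minimal sketch of that approach (the filename is illustrative):
rows = []
with open('test.csv') as f:
    for line in f:
        # assumes every line contains at least one comma
        label, text = line.rstrip('\n').split(',', 1)
        rows.append((label, text))
print(rows)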
You should use the csv library to work with csv files: https://docs.python.org/3/library/csv.html#csv.reader
import csv
result = []
with open('test.csv') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        # re-join the remainder with commas so the embedded commas are kept
        result.append((row[0], ','.join(row[1:])))
print(result)

Change unicode hard-written in csv to the corresponding character

I have a csv with one column containing hard-written unicode escape sequences:
["Investir dans un parc d'activit\u00e9s"]
["S\u00e9curiser, restaurer et g\u00e9rer 1 372 ha de milieux naturels impact\u00e9s par la construction de l'autoroute"]
["Am\u00e9liorer la consommation \u00e9nerg\u00e9tique de b\u00e2timents publics"]
["Favoriser la recherche, am\u00e9liorer la qualit\u00e9 des traitements et assurer un \u00e9gal acc\u00e8s des soins \u00e0 tous les patients de Franche-Comt\u00e9."]
I'm trying to fix/replace them with the corresponding character, but I can't seem to make it work. I tried:
df['Objectif(s)'] = df['Objectif(s)'].replace('\u00e9', 'é')
but the column doesn't change.
Seeing that the code below works, I tried to loop over the rows to fix it, with no success:
s = "d'activit\u00e9s"
print(s) # d'activités
print(s.replace('\u00e9', 'é' )) # d'activités
for case in df['Objectif(s)']:
    s = str(case)
    df['Objectif(s)'][case] = s # ["Investir dans un parc d'activit\u00e9s"]
If this '\u00e9' is actually written into the file as the literal characters \u00e9 by the source of the data, you need to do a string replace.
The trick here is that you need to escape the \ character in the first parameter of the replace function:
s.replace('\\u00e9', 'é' )
or use a "raw string literal" by prefixing r
s.replace(r'\u00e9', 'é' )
Try replacing
df['Objectif(s)'] = df['Objectif(s)'].replace('\u00e9', 'é')
with
df['Objectif(s)'] = df['Objectif(s)'].str.replace('\u00e9', 'é')
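If the goal is to decode every \uXXXX escape in the column at once rather than replacing characters one by one, a minimal sketch (assuming the cells really contain literal backslash escapes; the sample data below is illustrative, not from the original question) could be:
import pandas as pd
# hypothetical input mirroring the question's data
df = pd.DataFrame({'Objectif(s)': [r"Investir dans un parc d'activit\u00e9s"]})
# encode to bytes, then let the 'unicode_escape' codec interpret the \uXXXX sequences
df['Objectif(s)'] = df['Objectif(s)'].apply(lambda s: s.encode('latin-1').decode('unicode_escape'))
print(df['Objectif(s)'][0])  # Investir dans un parc d'activités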

XPath with Scrapy: node begins with \n

I'm using scrapy on html like:
<td nowrap="" valign="top" align="right">
<br>
Text is here.
<br>
Other text is here
<br>
</td>
td[1]/text()[1] gives me:
(empty line)
Text is here.
I've tried normalize-space, i.e. normalize-space(td[1]/text()[1]), which works when I test it in my Firefox extension, but not in Scrapy. I think Scrapy is getting tripped up by the \n and skips over it (or only takes the first line of the node, which is nothing). I've also tried some "preceding" and "following" code, but I think it might be considered one element; my DOM says the nodeValue is "\nText is here". Any thoughts?
Extract every text node and get the desired one by index. For instance:
response.xpath("//table[@id='myid']/tr[1]/td[1]//text()")[1]
Demo from the Scrapy Shell:
$ scrapy shell http://www.trobar.org/troubadours/coms_de_peiteu/guilhen_de_peiteu_01.php
In [1]: table = response.xpath("//table")[2]
In [2]: td = "".join(table.xpath(".//td[1]//text()").extract())
In [3]: print(td)
Companho, farai un vers qu'er covinen,
Et aura-i mais de foudatz no-y a de sen,
Et er totz mesclatz d'amor e de joy e de joven.
E tenguatz lo per vilan qui no-l enten,
O dins son cor voluntiers non l'apren:
Greu partir si fai d'amor qui la troba a talen.
Dos cavalhs ai a ma sselha, ben e gen,
Bon son et adreg per armas e valen,
E no-ls puesc amdos tener, que l'us l'autre non cossen.
Si-ls pogues adomesjar a mon talen,
Ja no volgr'alhors mudar mon garnimen,
Que meils for'encavalguatz de nuill ome viven.
Launs fon dels montaniers lo plus corren,
Mas aitan fer' estranhez'a longuamen
Et es tan fers e salvatges, que del bailar si defen.
L'autre fon noyritz sa jus part Cofolen
Ez anc no-n vis bellazor, mon escien:
Aquest non er ja camjatz ni per aur ni per argen.
Qu'ie-l donei a son senhor polin payssen,
Pero si-m retinc ieu tan de covenen
Que, s'ilh lo tenia un an, qu'ieu lo tengues mais de cen.
Cavalier, datz mi cosselh d'un pessamen:
-Anc mays no fuy issaratz de cauzimen- :
Res non sai ab qual me tengua, de n'Agnes o de n'Arsen.
De Gimel ai lo castel e-l mandamen,
E per Niol fauc ergueill a tota gen:
C'ambedui me son jurat e plevit per sagramen.
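Applied to the HTML in the question, the same idea plus a strip() removes the leading newlines. A minimal sketch, run in the Scrapy shell on the questioner's page (getall() in recent Scrapy versions, extract() in older ones; the td selector is illustrative):
# collect every text node under the td, strip whitespace, drop the empty pieces
texts = [t.strip() for t in response.xpath("//td[1]//text()").getall()]
texts = [t for t in texts if t]
print(texts[0])  # Text is here.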

Cleaning text files with regex in Python

I have a huge file with lines like this one:
"En g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
How do I remove these Sinographic characters from the lines of the file so that I get a new file where these lines contain Roman-alphabet characters only?
I was thinking of using regular expressions.
Is there a character class for all Roman-alphabet characters, e.g. Arabic numerals, a-zA-Z and other characters (punctuation)?
I find this regex cheat sheet to come in very handy for situations like these.
# -*- coding: utf-8 -*-
import re
import string
u = u"En.!?+ 123 g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
p = re.compile(r"[^\w\s\d{}]".format(re.escape(string.punctuation)))
for m in p.finditer(u):
    print m.group()
>>> 茅
>>> 茅
>>> 猫
>>> 猫
I'm also a huge fan of the unidecode module.
from unidecode import unidecode
u = u"En.!?+ 123 g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
print unidecode(u)
>>> En.!?+ 123 gMao nMao ral un trMao s bon hotel La terrasse du bar prMao s du lobby
You can use the string module.
>>> string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.digits
'0123456789'
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
>>>
And it seems the characters you want to replace are Chinese. If all your strings are unicode, you can use the simple range [\u4e00-\u9fa5] to replace them. This is not the whole range of Chinese characters, but it is enough.
>>> s = u"En g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
>>> s
u'En g\u8305n\u8305ral un tr\u732bs bon hotel La terrasse du bar pr\u732bs du lobby'
>>> import re
>>> re.sub(ur'[\u4e00-\u9fa5]', '', s)
u'En gnral un trs bon hotel La terrasse du bar prs du lobby'
>>>
You can do it without regexes.
To keep only ascii characters:
# -*- coding: utf-8 -*-
import unicodedata
unistr = u"En g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
unistr = unicodedata.normalize('NFD', unistr) # to preserve `e` in `é`
ascii_bytes = unistr.encode('ascii', 'ignore')
To remove everything except ascii letters, numbers, punctuation:
from string import ascii_letters, digits, punctuation, whitespace
to_keep = set(map(ord, ascii_letters + digits + punctuation + whitespace))
all_bytes = range(0x100)
to_remove = bytearray(b for b in all_bytes if b not in to_keep)
text = ascii_bytes.translate(None, to_remove).decode()
# -> En gnral un trs bon hotel La terrasse du bar prs du lobby
