Python web scrape url to dataframe

I want to web scrape a website (https://www.assemblee-nationale.fr/13/cri/2006-2007/20070152.asp) and create a dataframe.
This is the dataframe that I want:
name                  text
M. le président       La séance est...
M. le président       L'ordre du jour...
M. Jean-Marc Ayrault  Je demande la ...
Initially I thought that I should use BeautifulSoup, and I started to write the following code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.assemblee-nationale.fr/13/cri/2006-2007/20070152.asp"
r = requests.get(url)
soup_data = BeautifulSoup(r.text, 'html.parser')
first = soup_data.find_all('div')
name = first.b.text
But I obtained the error:
AttributeError: ResultSet object has no attribute 'b'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Because I could not go further, I then thought that the best idea was to get the HTML and work in a similar way as if I had an XML file:
import urllib.request
import xml.etree.ElementTree as ET
import pandas as pd
import lxml
from lxml import etree

urllib.request.urlretrieve("https://www.assemblee-nationale.fr/13/cri/2006-2007/20070152.asp", "file.txt")

d = {'head': ['title'],
     'body': ['b', 'p']}

tree = ET.parse("file.txt")
root = tree.getroot()

# initialize two lists: `cols` and `data`
cols, data = list(), list()

# loop through d.items
for k, v in d.items():
    # find the child element for key `k`
    child = root.find(f'{{*}}{k}')
    # use iter to check each descendant (`elem`)
    for elem in child.iter():
        # get `tag_end` for each descendant, e.g. `texte` in "{http://schemas.assemblee-nationale.fr/referentiel}texte"
        tag_end = elem.tag.split('}')[-1]
        # check if `tag_end` is in `v(alue)`
        if tag_end in v:
            # add `tag_end` and `elem.text` to the appropriate lists
            cols.append(tag_end)
            data.append(elem.text)

df = pd.DataFrame(data).T
But I obtain the error: "not well-formed (invalid token)".
Here is a summary of the html:
<html>
  <head>
    <title> Assemblée Nationale - Séance du mercredi ... </title>
  </head>
  <body>
    <div id="englobe">
      <p>
        <orateur>
          <b> M. le président </b>
        </orateur>
        La séance est...
      </p>
      <p>
        <orateur>
          <b> M. le président </b>
        </orateur>
        L'ordre du jour...
      </p>
    </div>
  </body>
</html>
How should I web scrape the website? I will want to do this for several similar websites.

So, your approach with BeautifulSoup is definitely the way to go. The error message already points you towards the problem: what you call first is really of type bs4.element.ResultSet, which -- as the name suggests -- is not a single element. The easiest way to access the actual results is to loop through it using a for loop.
I'm not sure you really need to go for the div's, since what you're actually looking for are the p's that include an orateur element (long story short: the first for loop is unnecessary and you could simplify this heavily; see the sketch after the code below). Anyway, here's how you can access the elements you want:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.assemblee-nationale.fr/13/cri/2006-2007/20070152.asp"
r = requests.get(url)
soup_data = BeautifulSoup(r.text, 'html.parser')
list_soup_div = soup_data.find_all('div')

# Edit: list of dicts to store each orateur's name and the text he/she spoke
list_dict_transcript = []

for item_soup_div in list_soup_div:
    list_soup_sub_orateur = item_soup_div.find_all('orateur')
    # Check whether the <div> contains an <orateur> element
    if len(list_soup_sub_orateur):
        for item_soup_p in item_soup_div.find_all('p'):
            list_orateur = item_soup_p.find_all('orateur')
            if len(list_orateur):
                # Edit: recording
                # print(item_soup_p)
                for item_b in item_soup_p.find_all('b'):
                    text_orateur = item_b.get_text()
                    text_speech = item_soup_p.find('orateur').next_sibling
                    list_dict_transcript.append({'orateur': text_orateur, 'speech': text_speech})

# Edit: conversion of the list into a dataframe
df_transcript = pd.DataFrame(data=list_dict_transcript)
After that, you only need to filter out the lines with links, append to a dictionary, convert to a dataframe and voilà, there's your desired dataframe. Hope that helps! If not, do let me know.
Edit:
I have added a couple of lines to a) initialize an empty list of dicts, b) fill these dicts with the respective texts (using the .next_sibling attribute, as per Extracting text outside of a tag BeautifulSoup, to get hold of the text the orateur was saying) and c) get this into a dataframe.
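For reference, a minimal sketch of the simplification mentioned above: go straight for the <p> elements that contain an <orateur> child and skip the outer <div> loop entirely (a sketch against the page structure shown in the question, not tested beyond it):
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.assemblee-nationale.fr/13/cri/2006-2007/20070152.asp"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

rows = []
for p in soup.find_all('p'):
    orateur = p.find('orateur')
    if orateur is None:
        continue
    b = orateur.find('b')
    if b is None:
        continue
    # the spoken text is the node immediately following the <orateur> element
    rows.append({'name': b.get_text(strip=True), 'text': orateur.next_sibling})

df = pd.DataFrame(rows)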

The find_all method is used to find all elements matching the filters you want, like the div tag in your example, and it returns a ResultSet, so you can't extract the text of all elements at once.
You just have to make a for loop, extract the text of each element and store it in a list, then add that list as a column of your dataframe, like this:
import requests
from bs4 import BeautifulSoup
import pandas as pd

df = pd.DataFrame()
names_list = []

url = "https://www.assemblee-nationale.fr/13/cri/2006-2007/20070152.asp"
r = requests.get(url)
soup_data = BeautifulSoup(r.text, 'html.parser')
names = soup_data.find_all('div')

for name in names:
    names_list.append(name.text)

df['name'] = names_list

Related

Extracting Embedded <span> in Python using BeautifulSoup

I am trying to extract a value in a span; however, the span is embedded into another. I was wondering how I can get the value of only one span rather than both.
from bs4 import BeautifulSoup
some_price = page_soup.find("div", {"class":"price_FHDfG large_3aP7Z"})
some_price.span
# that code returns this:
'''
<span>$289<span class="rightEndPrice_6y_hS">99</span></span>
'''
# BUT I only want the $289 part, not the 99 associated with it
After making this adjustment:
some_price.span.text
the interpreter returns
$28999
Would it be possible to somehow remove the '99' at the end? Or to only extract the first part of the span?
Any help/suggestions would be appreciated!
You can access the desired value via the tag's .contents attribute:
from bs4 import BeautifulSoup as soup
html = '''
<span>$289<span class="rightEndPrice_6y_hS">99</span></span>
'''
result = soup(html, 'html.parser').find('span').contents[0]
Output:
'$289'
Thus, in the context of your original div lookup:
result = page_soup.find("div", {"class":"price_FHDfG large_3aP7Z"}).span.contents[0]
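If you prefer not to index into contents, an equivalent lookup (a quick sketch against the same snippet) is to ask the outer tag for its first direct string child:
from bs4 import BeautifulSoup

html = '<span>$289<span class="rightEndPrice_6y_hS">99</span></span>'
outer = BeautifulSoup(html, 'html.parser').find('span')
# recursive=False restricts the search to direct children, so the nested
# <span> holding the cents is never visited
print(outer.find(string=True, recursive=False))  # $289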

Is there any convenient way to get the index of a sub-section in a page?

It is convenient to use "index-x" to quickly locate a sub-section in a page.
For instance,
https://docs.python.org/3/library/re.html#index-2
gives the 3rd sub-section in this page.
When I want to share the location of a sub-section with others, how can I get the index in a convenient way?
For instance, how can I get the index of the {m,n} sub-section without counting up from index-0?
With bs4 4.7.1 you can use :has and :contains to target a specific text string and return the index (note that select_one returns the first match; use a list comprehension with select if you want to return all matches, as sketched after the code below):
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://docs.python.org/3/library/re.html')
soup = bs(r.content, 'lxml')
index = soup.select_one('dl:has(.pre:contains("{m,n}"))')['id']
print(index)
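The list-comprehension variant mentioned above would look roughly like this (same bs4 4.7.1 selector, a sketch):
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://docs.python.org/3/library/re.html')
soup = bs(r.content, 'lxml')
# select returns every matching <dl>, not just the first one
indices = [tag['id'] for tag in soup.select('dl:has(.pre:contains("{m,n}"))')]
print(indices)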
Any version: if you want a dictionary that maps the special characters to indices (thanks to @zoe for spotting the error in my dictionary comprehension):
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://docs.python.org/3/library/re.html')
soup = bs(r.content, 'lxml')
mappings = dict([(item['id'], [i.text for i in item.select('dt .pre')]) for item in soup.select('[id^="index-"]')])
indices = {i: k for (k, v) in mappings.items() for i in v}
You're looking for index-7.
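With that dictionary in hand, the lookup is direct (assuming the .pre text for the quantifier is exactly "{m,n}"):
print(indices['{m,n}'])  # index-7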
You can download the HTML of the page and get all the possible values of index-something with the following code:
import re
import requests
from bs4 import BeautifulSoup

r = requests.get('https://docs.python.org/3/library/re.html')
soup = BeautifulSoup(r.content.decode(), 'html.parser')
# collect the id of every tag whose id matches "index-<number>"
result = [t['id'] for t in soup.find_all(id=re.compile(r'index-\d+'))]
print(result)
Output:
['index-0', 'index-1', 'index-2', 'index-3', 'index-4', 'index-5', 'index-6', 'index-7', 'index-8', 'index-9', 'index-10', 'index-11', 'index-12', 'index-13', 'index-14', 'index-15', 'index-16', 'index-17', 'index-18', 'index-19', 'index-20', 'index-21', 'index-22', 'index-23', 'index-24', 'index-25', 'index-26', 'index-27', 'index-28', 'index-29', 'index-30', 'index-31', 'index-32', 'index-33', 'index-34', 'index-35', 'index-36', 'index-37', 'index-38']
The t objects in the list comprehension are the tags whose id matches the regex.

How to scrape data from multiple wikipedia pages with python?

I want to grab the age, place of birth, and previous occupation of senators.
Information for each individual senator is available on Wikipedia, on their respective pages, and there is another page with a table that lists all senators by name.
How can I go through that list, follow links to the respective pages of each senator, and grab the information I want?
Here is what I've done so far.
1. (no Python) Found out that DBpedia exists and wrote a query to search for senators. Unfortunately, DBpedia hasn't categorized most (if any) of them:
SELECT ?senator ?country WHERE {
    ?senator rdf:type <http://dbpedia.org/ontology/Senator> .
    ?senator <http://dbpedia.org/ontology/nationality> ?country
}
Query results are unsatisfactory.
2. Found out that there is a Python module called wikipedia that allows me to search and retrieve information from individual wiki pages. Used it to get a list of senator names from the table by looking at the hyperlinks.
import wikipedia as w

w.set_lang('pt')
# Grab the page with the table of senator names.
s = w.page(w.search('Lista de Senadores do Brasil da 55 legislatura')[0])
# Get links to senator names by removing links of no interest.
# For each link in the page, check if it's a link to a senator page.
senators = [name for name in s.links if not
            # Senator names don't contain digits nor ','
            (any(char.isdigit() or char == ',' for char in name) or
             # And full names always contain spaces.
             ' ' not in name)]
At this point I'm a bit lost. Here the list senators contains all senator names, but also other names, e.g., party names. The wikipedia module (at least from what I could find in the API documentation) also doesn't implement functionality to follow links or search through tables.
I've seen two related entries here on StackOverflow that seem helpful, but they both (here and here) extract information from a single page.
Can anyone point me towards a solution?
Thanks!
Ok, so I figured it out (thanks to a comment pointing me to BeautifulSoup).
There is actually no big secret to achieving what I wanted. I just had to go through the list with BeautifulSoup and store all the links, then open each stored link with urllib2, call BeautifulSoup on the response, and... done. Here is the solution:
import urllib2 as url
import wikipedia as w
from bs4 import BeautifulSoup as bs
import re

# A dictionary to store the data we'll retrieve.
d = {}

# 1. Grab the list from wikipedia.
w.set_lang('pt')
s = w.page(w.search('Lista de Senadores do Brasil da 55 legislatura')[0])
html = url.urlopen(s.url).read()
soup = bs(html, 'html.parser')

# 2. Names and links are on the second column of the second table.
table2 = soup.findAll('table')[1]
for row in table2.findAll('tr'):
    for colnum, col in enumerate(row.find_all('td')):
        if (colnum+1) % 5 == 2:
            a = col.find('a')
            link = 'https://pt.wikipedia.org' + a.get('href')
            d[a.get('title')] = {}
            d[a.get('title')]['link'] = link

# 3. Now that we have the links, we can iterate through them,
# and grab the info from the table.
for senator, data in d.iteritems():
    page = bs(url.urlopen(data['link']).read(), 'html.parser')
    # (flatten list trick: [a for b in nested for a in b])
    rows = [item for table in
            [item.find_all('td') for item in page.find_all('table')[0:3]]
            for item in table]
    for rownumber, row in enumerate(rows):
        if row.get_text() == 'Nascimento':
            birthinfo = rows[rownumber+1].getText().split('\n')
            try:
                d[senator]['birthplace'] = birthinfo[1]
            except IndexError:
                d[senator]['birthplace'] = ''
            birth = re.search('(.*\d{4}).*\((\d{2}).*\)', birthinfo[0])
            d[senator]['birthdate'] = birth.group(1)
            d[senator]['age'] = birth.group(2)
        if row.get_text() == 'Partido':
            d[senator]['party'] = rows[rownumber + 1].getText()
        if 'Profiss' in row.get_text():
            d[senator]['profession'] = rows[rownumber + 1].getText()
Pretty simple. BeautifulSoup works wonders =)
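Note that the solution above is Python 2. Under Python 3, a sketch of the same setup would swap urllib2 for urllib.request and d.iteritems() for d.items():
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
import wikipedia as w

w.set_lang('pt')
s = w.page(w.search('Lista de Senadores do Brasil da 55 legislatura')[0])
soup = bs(urlopen(s.url).read(), 'html.parser')
# ...and later: for senator, data in d.items():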

Can't figure out how to scrape data in body tag using beautiful soup (Python)

from bs4 import BeautifulSoup
import urllib
from openpyxl import Workbook
from openpyxl.compat import range
from openpyxl.cell import get_column_letter
r = urllib.urlopen('https://www.vrbo.com/576329').read()
soup = BeautifulSoup(r)
rate = soup.find_all('body')
print rate
print type(soup)
I'm trying to capture values in containers such as data-bedrooms="3", specifically the values given in the quotation marks, but I have no idea what they are formally called or how to parse them.
Below is a sample of part of the printout for the "body", so I know the values are there; capturing the specific part is what I can't get:
data-ratemaximum="$260" data-rateminimum="$220" data-rateunits="night" data-rawlistingnumber="576329" data-requestuuid="73bcfaa3-9637-40a8-801c-ae86f93caf39" data-searchpdptab="C" data-serverday="18" data-showbookingphone="False"
To obtain the value of an attribute, use rate['attr'], for example:
from bs4 import BeautifulSoup
import urllib
from openpyxl import Workbook
from openpyxl.compat import range
from openpyxl.cell import get_column_letter
r = urllib.urlopen('https://www.vrbo.com/576329').read()
soup = BeautifulSoup(r, "html.parser")
rate = soup.find('body')
print rate['data-ratemaximum']
print rate['data-rateunits']
print rate['data-rawlistingnumber']
print rate['data-requestuuid']
print rate['data-searchpdptab']
print rate['data-serverday']
print rate['data-searchpdptab']
print rate['data-showbookingphone']
print rate
print type(soup)
You need to pick apart your result. It might be helpful to know that those things you seek are called attributes of a tag in HTML:
body_tag = rate[0]
data_bedrooms = body_tag.attrs['data-bedrooms']
The code above assumes you only have one <body> -- if you have more you will need to use a for loop on rate. You'll also possibly want to convert the value to an integer with int().
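For instance (a sketch, reusing the rate result from the question):
body_tag = rate[0]
# attribute values come back as strings, so cast when a number is expected
num_bedrooms = int(body_tag.attrs['data-bedrooms'])  # e.g. 3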
Not sure if you wanted only data-bedrooms from the soup object or not. I did some cursory checking of the produced output and was able to reason that the data-* items you mentioned were attributes rather than tags. If the doc structure is consistent, you could locate the respective tag associated with each attribute and make finding these more efficient:
import re

# regex pattern for attribs
data_tag_pattern = re.compile(r'^data-')
# Create list of attribs
attribs_wanted = ("data-bedrooms data-rateminimum data-rateunits data-rawlistingnumber "
                  "data-requestuuid data-searchpdptab data-serverday data-showbookingphone").split()

# Search entire tree
for item in soup.findAll():
    # Use descendants to recurse downwards
    for child in item.descendants:
        try:
            for attribute in child.attrs:
                if data_tag_pattern.match(attribute) and attribute in attribs_wanted:
                    print("{}: {}".format(attribute, child[attribute]))
        except AttributeError:
            pass
This will produce output as so:
data-showbookingphone: False
data-bedrooms: 3
data-requestuuid: 2b6f4d21-8b04-403d-9d25-0a660802fb46
data-serverday: 18
data-rawlistingnumber: 576329
data-searchpdptab: C
hth!

Python - Beautiful Soup: Extract "strings" from tag in right order

I'm searching for a Beautiful Soup command combination to extract the "strings" from an a-tag in the right order.
Source 1:
<a>a-string <img alt="img-alt"> <span>span-string</span></a>
Target 1:
"a-string img-alt span-string"
Source 2:
<a><span>span</span> string <img alt="alt"></a>
Target 2:
"span-string a-string img-alt"
It's easy to get the child elements via "find_all()" and the text via "get_text()".
How can I get the right order of the different "strings"? Or sequentially parse all the information in the a-tag?
For 1:
import bs4
a = bs4.BeautifulSoup("""<a>a-string <img alt="img-alt"> <span>span-string</span></a>""", 'html.parser')
print(" ".join((a.find(text=True), a.find("img").attrs["alt"], a.find("span").text)))
For 2 (same markup, different output order):
import bs4
a = bs4.BeautifulSoup("""<a>a-string <img alt="img-alt"> <span>span-string</span></a>""", 'html.parser')
print(" ".join((a.find("span").text, a.find(text=True), a.find("img").attrs["alt"])))
I don't think that there's a generic way to extract what you want, as you are mixing text content and attributes.
a.find(text=True) ## Get first element text
a.findAll(text=True) ## Get a list of text elements from string
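To illustrate what those two calls return on the first source (a quick sketch):
import bs4

a = bs4.BeautifulSoup('<a>a-string <img alt="img-alt"> <span>span-string</span></a>', 'html.parser')
print(a.find(text=True))     # 'a-string '
print(a.findAll(text=True))  # ['a-string ', ' ', 'span-string']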
I think this is the answer you are looking for... the contents method returns a list:
from BeautifulSoup import BeautifulSoup

s = """<a>a-string <img alt="img-alt"> <span>span-string</span></a>"""
soup = BeautifulSoup(s)
z = soup.find("a")
print z.contents
Using lxml,
import lxml.html as LH

content = '''
<a>a-string <img alt="img-alt"> <span>span-string</span></a>
<a><span>span</span> string <img alt="alt"></a>
'''
root = LH.fromstring(content)
for atag in root.xpath('//a'):
    print(' '.join(atag.xpath('''
        descendant-or-self::text()
        |
        descendant-or-self::*/@alt
        ''')))
yields
a-string img-alt span-string
span string alt
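For completeness, a rough bs4 equivalent of the lxml approach (a sketch, not from the answers above): walking .descendants visits every node in document order, so text nodes and alt attributes can be collected exactly as they appear:
from bs4 import BeautifulSoup, NavigableString

content = '<a>a-string <img alt="img-alt"> <span>span-string</span></a>'
soup = BeautifulSoup(content, 'html.parser')
for atag in soup.find_all('a'):
    parts = []
    for node in atag.descendants:
        if isinstance(node, NavigableString):
            parts.append(node.strip())
        elif node.has_attr('alt'):
            parts.append(node['alt'])
    print(' '.join(p for p in parts if p))  # a-string img-alt span-string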
