Beautiful Soup web page scraper - Python

I am trying to scrape the webpage at the following URL
https://www.bseindia.com/corporates/shpSecurities.aspx?scripcd=500209&qtrid=96.00
and I want to scrape the table with the HTML shown below. I have tried a few things but have not been able to get the table into a shape I can write to CSV. The <tr> tags for the data rows are not closed, so segregating the data into separate rows is the issue.
Thanks for the help
--J
<table border='0' width='900' align='center' cellspacing='1' cellpadding='4'>
<tr>
<td class='innertable_header1' rowspan='3'>Category of shareholder</td>
<td class='innertable_header1' rowspan='3'>Nos. of shareholders</td>
<td class='innertable_header1' rowspan='3'>No. of fully paid up equity shares held</td>
<td class='innertable_header1' rowspan='3'>No. of shares underlying Depository Receipts</td>
<td class='innertable_header1' rowspan='3'>Total nos. shares held</td>
<td class='innertable_header1' rowspan='3'>Shareholding as a % of total no. of shares (calculated as per SCRR, 1957)As a % of (A+B+C2)</td>
<td class='innertable_header1' rowspan='3'> Number of equity shares held in dematerialized form</td>
</tr>
<tr></tr>
<tr></tr>
<tr>
<td class='TTRow_left'>(A) Promoter & Promoter Group</td>
<td class='TTRow_right'>19</td>
<td class='TTRow_right'>28,17,02,889</td>
<td class='TTRow_right'></td>
<td class='TTRow_right'>28,17,02,889</td>
<td class='TTRow_right'>12.90</td>
<td class='TTRow_right'>28,17,02,889</td>
<tr>
<td class='TTRow_left'>(B) Public</td>
<td class='TTRow_right'>9,16,058</td>
<td class='TTRow_right'>1,87,81,45,362</td>
<td class='TTRow_right'>1,32,95,642</td>
<td class='TTRow_right'>1,89,14,41,004</td>
<td class='TTRow_right'>86.61</td>
<td class='TTRow_right'>1,88,74,40,959</td>
<tr>
<td class='TTRow_left'>(C1) Shares underlying DRs</td>
<td class='TTRow_right'></td>
<td class='TTRow_right'></td>
<td class='TTRow_right'></td>
<td class='TTRow_right'></td>
<td class='TTRow_right'>0.00</td>
<td class='TTRow_right'></td>
<tr>
<td class='TTRow_left'>(C2) Shares held by Employee Trust</td>
<td class='TTRow_right'>1</td>
<td class='TTRow_right'>1,08,05,896</td>
<td class='TTRow_right'></td>
<td class='TTRow_right'>1,08,05,896</td>
<td class='TTRow_right'>0.49</td>
<td class='TTRow_right'>1,08,05,896</td>
<tr>
<td class='TTRow_left'>(C) Non Promoter-Non Public</td>
<td class='TTRow_right'>1</td>
<td class='TTRow_right'>1,08,05,896</td>
<td class='TTRow_right'></td>
<td class='TTRow_right'>1,08,05,896</td>
<td class='TTRow_right'>0.49</td>
<td class='TTRow_right'>1,08,05,896</td>
<tr>
<td class='TTRow_left'>Grand Total</td>
<td class='TTRow_right'>9,16,078</td>
<td class='TTRow_right'>2,17,06,54,147</td>
<td class='TTRow_right'>1,32,95,642</td>
<td class='TTRow_right'>2,18,39,49,789</td>
<td class='TTRow_right'>100.00</td>
<td class='TTRow_right'>2,17,99,49,744</td>
</tr>
</table>

You can try this (Python 3):
from bs4 import BeautifulSoup as soup
import urllib.request
import re
s = soup(urllib.request.urlopen('https://www.bseindia.com/corporates/shpSecurities.aspx?scripcd=500209&qtrid=96.00').read(), 'lxml')
results = list(filter(None, [re.sub(r'[\n\r]+|\s{2,}', '', i.text) for i in s.find_all('td', {'class': re.compile('TTRow_right|TTRow_left')})]))
Output:
['(A) Promoter & Promoter Group', '19', '28,17,02,889', '28,17,02,889', '12.90', '28,17,02,889', '(B) Public', '9,16,058', '1,87,81,45,362', '1,32,95,642', '1,89,14,41,004', '86.61', '1,88,74,40,959', '(C1) Shares underlying DRs', '0.00', '(C2) Shares held by Employee Trust', '1', '1,08,05,896', '1,08,05,896', '0.49', '1,08,05,896', '(C) Non Promoter-Non Public', '1', '1,08,05,896', '1,08,05,896', '0.49', '1,08,05,896', 'Grand Total', '9,16,078', '2,17,06,54,147', '1,32,95,642', '2,18,39,49,789', '100.00', '2,17,99,49,744']
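To turn that flat cell list into CSV rows, note that every logical row of the table is exactly seven cells wide, so it is better to keep the empty cells (i.e. skip the filter(None, …) step, which otherwise misaligns rows such as C1) and then chunk the list. A minimal stdlib sketch on a shortened sample of the scraped cells:

```python
import csv

# Shortened sample of the flat cell list, with empty cells kept so
# each logical table row stays exactly 7 cells wide
cells = ['(A) Promoter & Promoter Group', '19', '28,17,02,889', '',
         '28,17,02,889', '12.90', '28,17,02,889',
         '(B) Public', '9,16,058', '1,87,81,45,362', '1,32,95,642',
         '1,89,14,41,004', '86.61', '1,88,74,40,959']

# Chunk every 7 consecutive cells into one CSV row
rows = [cells[i:i + 7] for i in range(0, len(cells), 7)]

with open('shareholding.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)
```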

Related

Extracting data from a table getting only the last row

I have the following table from this website:
<table id="sample">
<tbody>
<tr class="toprow">
<td></td>
<td colspan="5">Number of Jurisdictions</td>
</tr>
<tr class="toprow">
<td>Region</td>
<td>Jurisdictions in the region</td>
<td>Jurisdictions that require IFRS Standards <br>
for all or most domestic publicly accountable entities</td>
<td>Jurisdictions that require IFRS Standards as % of total jurisdictions in the region</td>
<td>Jurisdictions that permit or require IFRS Standards for at least some (but not all or most) domestic publicly accountable entities</td>
<td>Jurisdictions that neither require nor permit IFRS Standards for any domestic publicly accountable entities</td>
</tr>
<tr>
<td class="leftcol">Europe</td>
<td class="data">44</td>
<td class="data">43</td>
<td class="data">98%</td>
<td class="data">1</td>
<td class="data">0</td>
</tr>
<tr>
<td class="leftcol">Africa</td>
<td class="data">23</td>
<td class="data">19</td>
<td class="data">83%</td>
<td class="data">1</td>
<td class="data">3</td>
</tr>
<tr>
<td class="leftcol">Middle East</td>
<td class="data">13</td>
<td class="data">13</td>
<td class="data">100%</td>
<td class="data">0</td>
<td class="data">0</td>
</tr>
<tr>
<td class="leftcol">Asia-Oceania</td>
<td class="data">33</td>
<td class="data">24</td>
<td class="data">73%</td>
<td class="data">3</td>
<td class="data">6</td>
</tr>
<tr>
<td class="leftcol">Americas</td>
<td class="data">37</td>
<td class="data">27</td>
<td class="data">73%</td>
<td class="data">8</td>
<td class="data">2</td>
</tr>
<tr>
<td class="leftcol" style="border-top:2px solid #000000"><strong>Totals</strong></td>
<td class="data" style="border-top:2px solid #000000"><strong>150</strong></td>
<td class="data" style="border-top:2px solid #000000"><strong>126</strong></td>
<td class="data" style="border-top:2px solid #000000"><strong>84%</strong></td>
<td class="data" style="border-top:2px solid #000000"><strong>13</strong></td>
<td class="data" style="border-top:2px solid #000000"><strong>11</strong></td>
</tr>
<tr>
<td class="leftcol"><strong>As % <br>
of 150</strong></td>
<td class="data"><strong>100%</strong></td>
<td class="data"><strong>84%</strong></td>
<td class="data"><strong> </strong></td>
<td class="data"><strong>9%</strong></td>
<td class="data"><strong>7%</strong></td>
</tr>
</tbody>
</table>
This is my attempt:
from bs4 import BeautifulSoup
import requests
import pandas as pd
# Site URL
url = "http://archive.ifrs.org/Use-around-the-world/Pages/Analysis-of-the-IFRS-jurisdictional-profiles.aspx"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse HTML code for the entire site
soup = BeautifulSoup(html_content, "lxml")
# print(soup.prettify()) # print the parsed data of html
# Select the table with id="sample"
gdp = soup.select("table#sample")[0]
rows = []
cols = []
for g in gdp.select('tr.toprow'):
    for c in g.select('td'):
        cols.append(c.text)
for g in gdp.select('tr:not(.toprow)'):
    row = []
    for item in g.select('td'):
        row.append(item.text)
    rows.append(row)
pd.DataFrame(rows, columns=cols)
cols is getting the following result:
['', 'Number of Jurisdictions', 'Region', 'Jurisdictions in the region', 'Jurisdictions that require IFRS\xa0Standards\xa0\r\nfor all or most domestic publicly accountable entities', 'Jurisdictions that require IFRS Standards\xa0as % of total jurisdictions in the region', 'Jurisdictions that permit or require IFRS\xa0Standards for at least some (but not all or most) domestic publicly accountable entities', 'Jurisdictions that neither require nor permit IFRS Standards for any domestic publicly accountable entities']
The problem is with the rows: cols has eight entries while each data row has only six cells, so building the DataFrame fails. The last row, for example, is:
['As % \r\n of 150', '100%', '84%', '\xa0', '9%', '7%']
I am getting this error:
ValueError: 8 columns passed, passed data had 6 columns
There are two tr elements with the class .toprow; skip the first one:
for g in gdp.select('tr.toprow')[1:]:
Your solution will look like:
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "http://archive.ifrs.org/Use-around-the-world/Pages/Analysis-of-the-IFRS-jurisdictional-profiles.aspx"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
gdp = soup.select("table#sample")[0]
rows = []
cols = []
for g in gdp.select('tr.toprow')[1:]:
    for c in g.select('td'):
        cols.append(c.text)
for g in gdp.select('tr:not(.toprow)'):
    row = []
    for item in g.select('td'):
        row.append(item.text)
    rows.append(row)
pd.DataFrame(rows, columns=cols)
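As an aside, pandas can often parse a table like this in one step with pandas.read_html (assuming lxml or html5lib is installed), which handles the colspan header cells and spares the manual row loop. A sketch on a trimmed copy of the table:

```python
import pandas as pd
from io import StringIO

# Trimmed copy of the sample table; read_html expands the colspan cell itself
html = """<table id="sample">
<tr class="toprow"><td></td><td colspan="5">Number of Jurisdictions</td></tr>
<tr class="toprow"><td>Region</td><td>In region</td><td>Require IFRS</td>
<td>% of total</td><td>Permit some</td><td>Neither</td></tr>
<tr><td>Europe</td><td>44</td><td>43</td><td>98%</td><td>1</td><td>0</td></tr>
<tr><td>Africa</td><td>23</td><td>19</td><td>83%</td><td>1</td><td>3</td></tr>
</table>"""

# With no <th>/<thead>, all four rows come back as data in a 6-column frame
df = pd.read_html(StringIO(html))[0]
print(df.shape)
```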

How do we select the child element tbody after extracting the entire HTML?

I'm still a Python noob trying to learn BeautifulSoup. I looked at solutions on Stack but was unsuccessful. Please help me to understand this better.
I have extracted the HTML, which is shown below.
<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>
I tried find_all('tbody') but was unsuccessful.
#table = bs.find("table", {"id": "ContentPlaceHolder1_dlDetails"})
html = browser.page_source
soup = bs(html, "lxml")
table = soup.find_all('table', {'id':'ContentPlaceHolder1_dlDetails'})
table_body = table.find('tbody')
rows = table.select('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
I'm trying to save the values in the "listmaintext" class.
Error message:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
Another way to do this is using next_sibling:
from bs4 import BeautifulSoup as bs
html ='''
<html>
<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>
</html>'''
soup = bs(html, 'lxml')
data = [' '.join((item.text, item.next_sibling.next_sibling.text)) for item in soup.select('#ContentPlaceHolder1_dlDetails tr .listmaintext:first-child') if item.text !='']
print(data)
from bs4 import BeautifulSoup
data = '''<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
style="width:100%;border-collapse:collapse;">
<tbody><tr>
<td>
<table border="0" cellpadding="5" cellspacing="0" width="70%">
<tbody><tr>
<td> </td>
<td> </td>
</tr>
<tr>
<td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
<td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
</tr>
<tr>
<td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td>
</tr>
<tr>
<td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td>
</tr>
<tr>
<td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td>
</tr>'''
soup = BeautifulSoup(data, 'lxml')
s = soup.select('.listmaintext')
for td1, td2 in zip(s[::2], s[1::2]):
    print('{} [{}]'.format(td1.text.strip(), td2.text.strip()))
Prints:
ATM ID: [DAGR00401111111]
ATM Centre: []
Site Location: [ADA Building - Agra]
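If you would rather have the values keyed by their labels, the same even/odd pairing can feed a dict comprehension. A small sketch on a trimmed copy of the rows (html.parser is used here so no extra parser package is needed):

```python
from bs4 import BeautifulSoup

html = '''<tr><td class="listmaintext">ATM ID: </td>
<td class="listmaintext">DAGR00401111111</td></tr>
<tr><td class="listmaintext">ATM Centre:</td>
<td class="listmaintext"></td></tr>
<tr><td class="listmaintext">Site Location: </td>
<td class="listmaintext">ADA Building - Agra</td></tr>'''

soup = BeautifulSoup(html, 'html.parser')
cells = [td.text.strip() for td in soup.select('.listmaintext')]
# Pair each label cell (even index) with the value cell that follows it,
# dropping the trailing colon from the label
record = {label.rstrip(':'): value for label, value in zip(cells[::2], cells[1::2])}
print(record)
```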

How to add a 2nd Y-axis on a grouped bar chart using Altair, and sort the bars using the value of one of the columns from the data?

I'm trying to add a 3rd axis, i.e. a 2nd Y-axis, to the grouped chart. I'm not sure if it is possible.
Ideally, I want to:
1) add a line to this chart, which represents the percentage of arrests made for the given year and crime type;
2) sort the bars within each group using the value of the "rank" column from the data.
Here is my code and the current visualization. Your valuable feedback is much appreciated. Thank you.
import altair as alt
base = alt.Chart().encode(
    x=alt.X('primary_type', scale=alt.Scale(rangeStep=12), title=None,
            sort=alt.EncodingSortField(op='sum', field='rank')),
    color=alt.Color('primary_type:N')
)
bar = base.mark_bar().encode(
    alt.Y('sum(Number_of_Incidents):Q', title='Total Number of Incidents')
)
line = base.mark_line(color='red').encode(
    alt.Y('percent_arrest',
          axis=alt.Axis(title=None))
)
combined = alt.layer(bar, line, data=q13a)
combined.facet(
    column=alt.Column('year')
).resolve_scale(
    x='independent'
).configure_view(
    stroke='transparent'
)
Sample Data -
<table class="table table-bordered table-hover table-condensed">
<thead><tr><th title="Field #1">year</th>
<th title="Field #2">primary_type</th>
<th title="Field #3">Number_of_Incidents</th>
<th title="Field #4">number_of_arrests</th>
<th title="Field #5">percent_arrest</th>
<th title="Field #6">rank</th>
</tr></thead>
<tbody><tr>
<td align="right">2018</td>
<td>THEFT</td>
<td align="right">57330</td>
<td align="right">5503</td>
<td align="right">9.6</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2018</td>
<td>BATTERY</td>
<td align="right">44667</td>
<td align="right">8886</td>
<td align="right">19.89</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2018</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">24889</td>
<td align="right">1498</td>
<td align="right">6.02</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2018</td>
<td>ASSAULT</td>
<td align="right">18229</td>
<td align="right">2931</td>
<td align="right">16.08</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2018</td>
<td>DECEPTIVE PRACTICE</td>
<td align="right">15879</td>
<td align="right">713</td>
<td align="right">4.49</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2017</td>
<td>THEFT</td>
<td align="right">64334</td>
<td align="right">6459</td>
<td align="right">10.04</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2017</td>
<td>BATTERY</td>
<td align="right">49213</td>
<td align="right">10060</td>
<td align="right">20.44</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2017</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">29040</td>
<td align="right">1747</td>
<td align="right">6.02</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2017</td>
<td>ASSAULT</td>
<td align="right">19298</td>
<td align="right">3455</td>
<td align="right">17.9</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2017</td>
<td>DECEPTIVE PRACTICE</td>
<td align="right">18816</td>
<td align="right">805</td>
<td align="right">4.28</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2016</td>
<td>THEFT</td>
<td align="right">61600</td>
<td align="right">6518</td>
<td align="right">10.58</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2016</td>
<td>BATTERY</td>
<td align="right">50292</td>
<td align="right">10328</td>
<td align="right">20.54</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2016</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">31018</td>
<td align="right">1668</td>
<td align="right">5.38</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2016</td>
<td>ASSAULT</td>
<td align="right">18738</td>
<td align="right">3490</td>
<td align="right">18.63</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2016</td>
<td>DECEPTIVE PRACTICE</td>
<td align="right">18733</td>
<td align="right">815</td>
<td align="right">4.35</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2015</td>
<td>THEFT</td>
<td align="right">57335</td>
<td align="right">6771</td>
<td align="right">11.81</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2015</td>
<td>BATTERY</td>
<td align="right">48918</td>
<td align="right">11558</td>
<td align="right">23.63</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2015</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">28675</td>
<td align="right">1835</td>
<td align="right">6.4</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2015</td>
<td>NARCOTICS</td>
<td align="right">23883</td>
<td align="right">23875</td>
<td align="right">99.97</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2015</td>
<td>OTHER OFFENSE</td>
<td align="right">17552</td>
<td align="right">4795</td>
<td align="right">27.32</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2014</td>
<td>THEFT</td>
<td align="right">61561</td>
<td align="right">7415</td>
<td align="right">12.04</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2014</td>
<td>BATTERY</td>
<td align="right">49447</td>
<td align="right">12517</td>
<td align="right">25.31</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2014</td>
<td>NARCOTICS</td>
<td align="right">29116</td>
<td align="right">29000</td>
<td align="right">99.6</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2014</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">27798</td>
<td align="right">2095</td>
<td align="right">7.54</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2014</td>
<td>OTHER OFFENSE</td>
<td align="right">16979</td>
<td align="right">4159</td>
<td align="right">24.49</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2013</td>
<td>THEFT</td>
<td align="right">71530</td>
<td align="right">7727</td>
<td align="right">10.8</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2013</td>
<td>BATTERY</td>
<td align="right">54002</td>
<td align="right">12927</td>
<td align="right">23.94</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2013</td>
<td>NARCOTICS</td>
<td align="right">34127</td>
<td align="right">33819</td>
<td align="right">99.1</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2013</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">30853</td>
<td align="right">2107</td>
<td align="right">6.83</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2013</td>
<td>OTHER OFFENSE</td>
<td align="right">17993</td>
<td align="right">3400</td>
<td align="right">18.9</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2012</td>
<td>THEFT</td>
<td align="right">75460</td>
<td align="right">8249</td>
<td align="right">10.93</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2012</td>
<td>BATTERY</td>
<td align="right">59135</td>
<td align="right">13061</td>
<td align="right">22.09</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2012</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">35854</td>
<td align="right">2462</td>
<td align="right">6.87</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2012</td>
<td>NARCOTICS</td>
<td align="right">35488</td>
<td align="right">35226</td>
<td align="right">99.26</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2012</td>
<td>BURGLARY</td>
<td align="right">22843</td>
<td align="right">1285</td>
<td align="right">5.63</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2011</td>
<td>THEFT</td>
<td align="right">75148</td>
<td align="right">8468</td>
<td align="right">11.27</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2011</td>
<td>BATTERY</td>
<td align="right">60458</td>
<td align="right">14139</td>
<td align="right">23.39</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2011</td>
<td>NARCOTICS</td>
<td align="right">38605</td>
<td align="right">38544</td>
<td align="right">99.84</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2011</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">37332</td>
<td align="right">2583</td>
<td align="right">6.92</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2011</td>
<td>BURGLARY</td>
<td align="right">26619</td>
<td align="right">1272</td>
<td align="right">4.78</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2010</td>
<td>THEFT</td>
<td align="right">76754</td>
<td align="right">7844</td>
<td align="right">10.22</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2010</td>
<td>BATTERY</td>
<td align="right">65403</td>
<td align="right">14277</td>
<td align="right">21.83</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2010</td>
<td>NARCOTICS</td>
<td align="right">43393</td>
<td align="right">43294</td>
<td align="right">99.77</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2010</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">40653</td>
<td align="right">2641</td>
<td align="right">6.5</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2010</td>
<td>BURGLARY</td>
<td align="right">26422</td>
<td align="right">1382</td>
<td align="right">5.23</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2009</td>
<td>THEFT</td>
<td align="right">80973</td>
<td align="right">9900</td>
<td align="right">12.23</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2009</td>
<td>BATTERY</td>
<td align="right">68462</td>
<td align="right">16325</td>
<td align="right">23.85</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2009</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">47724</td>
<td align="right">3270</td>
<td align="right">6.85</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2009</td>
<td>NARCOTICS</td>
<td align="right">43543</td>
<td align="right">43193</td>
<td align="right">99.2</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2009</td>
<td>BURGLARY</td>
<td align="right">26766</td>
<td align="right">1412</td>
<td align="right">5.28</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2008</td>
<td>THEFT</td>
<td align="right">88433</td>
<td align="right">9291</td>
<td align="right">10.51</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2008</td>
<td>BATTERY</td>
<td align="right">75922</td>
<td align="right">15520</td>
<td align="right">20.44</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2008</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">52841</td>
<td align="right">3403</td>
<td align="right">6.44</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2008</td>
<td>NARCOTICS</td>
<td align="right">46507</td>
<td align="right">45459</td>
<td align="right">97.75</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2008</td>
<td>OTHER OFFENSE</td>
<td align="right">26533</td>
<td align="right">3496</td>
<td align="right">13.18</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2007</td>
<td>THEFT</td>
<td align="right">85156</td>
<td align="right">9783</td>
<td align="right">11.49</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2007</td>
<td>BATTERY</td>
<td align="right">79591</td>
<td align="right">19386</td>
<td align="right">24.36</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2007</td>
<td>NARCOTICS</td>
<td align="right">54454</td>
<td align="right">53251</td>
<td align="right">97.79</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2007</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">53749</td>
<td align="right">3994</td>
<td align="right">7.43</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2007</td>
<td>OTHER OFFENSE</td>
<td align="right">26863</td>
<td align="right">4230</td>
<td align="right">15.75</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2006</td>
<td>THEFT</td>
<td align="right">86240</td>
<td align="right">10108</td>
<td align="right">11.72</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2006</td>
<td>BATTERY</td>
<td align="right">80666</td>
<td align="right">18892</td>
<td align="right">23.42</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2006</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">57124</td>
<td align="right">4135</td>
<td align="right">7.24</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2006</td>
<td>NARCOTICS</td>
<td align="right">55813</td>
<td align="right">55236</td>
<td align="right">98.97</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2006</td>
<td>OTHER OFFENSE</td>
<td align="right">27100</td>
<td align="right">4010</td>
<td align="right">14.8</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2005</td>
<td>THEFT</td>
<td align="right">85685</td>
<td align="right">11338</td>
<td align="right">13.23</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2005</td>
<td>BATTERY</td>
<td align="right">83965</td>
<td align="right">19994</td>
<td align="right">23.81</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2005</td>
<td>NARCOTICS</td>
<td align="right">56234</td>
<td align="right">56121</td>
<td align="right">99.8</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2005</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">54548</td>
<td align="right">4083</td>
<td align="right">7.49</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2005</td>
<td>OTHER OFFENSE</td>
<td align="right">28028</td>
<td align="right">4726</td>
<td align="right">16.86</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2004</td>
<td>THEFT</td>
<td align="right">95463</td>
<td align="right">12068</td>
<td align="right">12.64</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2004</td>
<td>BATTERY</td>
<td align="right">87136</td>
<td align="right">20718</td>
<td align="right">23.78</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2004</td>
<td>NARCOTICS</td>
<td align="right">57060</td>
<td align="right">57034</td>
<td align="right">99.95</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2004</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">53164</td>
<td align="right">3965</td>
<td align="right">7.46</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2004</td>
<td>OTHER OFFENSE</td>
<td align="right">29532</td>
<td align="right">5386</td>
<td align="right">18.24</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2003</td>
<td>THEFT</td>
<td align="right">98875</td>
<td align="right">12889</td>
<td align="right">13.04</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2003</td>
<td>BATTERY</td>
<td align="right">88378</td>
<td align="right">20459</td>
<td align="right">23.15</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2003</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">55011</td>
<td align="right">4060</td>
<td align="right">7.38</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2003</td>
<td>NARCOTICS</td>
<td align="right">54288</td>
<td align="right">54283</td>
<td align="right">99.99</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2003</td>
<td>OTHER OFFENSE</td>
<td align="right">31147</td>
<td align="right">5856</td>
<td align="right">18.8</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2002</td>
<td>THEFT</td>
<td align="right">98327</td>
<td align="right">13697</td>
<td align="right">13.93</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2002</td>
<td>BATTERY</td>
<td align="right">94153</td>
<td align="right">21331</td>
<td align="right">22.66</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2002</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">55940</td>
<td align="right">4403</td>
<td align="right">7.87</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2002</td>
<td>NARCOTICS</td>
<td align="right">51789</td>
<td align="right">51781</td>
<td align="right">99.98</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2002</td>
<td>OTHER OFFENSE</td>
<td align="right">32599</td>
<td align="right">5701</td>
<td align="right">17.49</td>
<td align="right">5</td>
</tr>
<tr>
<td align="right">2001</td>
<td>THEFT</td>
<td align="right">99264</td>
<td align="right">15543</td>
<td align="right">15.66</td>
<td align="right">1</td>
</tr>
<tr>
<td align="right">2001</td>
<td>BATTERY</td>
<td align="right">93447</td>
<td align="right">20463</td>
<td align="right">21.9</td>
<td align="right">2</td>
</tr>
<tr>
<td align="right">2001</td>
<td>CRIMINAL DAMAGE</td>
<td align="right">55851</td>
<td align="right">4548</td>
<td align="right">8.14</td>
<td align="right">3</td>
</tr>
<tr>
<td align="right">2001</td>
<td>NARCOTICS</td>
<td align="right">50567</td>
<td align="right">50559</td>
<td align="right">99.98</td>
<td align="right">4</td>
</tr>
<tr>
<td align="right">2001</td>
<td>ASSAULT</td>
<td align="right">31384</td>
<td align="right">7150</td>
<td align="right">22.78</td>
<td align="right">5</td>
</tr>
</tbody></table>
The trouble is that, as far as I know, you cannot draw lines across charts. When creating a grouped bar chart, you have to facet across a column of your data. In effect, this produces several charts that are horizontally concatenated. So, for each chart you have only one point (for each color). If you want to have a line across years, you have to define your x axis to be years, and not facet it, and plot it separately. I would suggest vertical concatenation, to have the lines below the bars.
Note that I have taken the data from your previous question (How to create a nested Grouped Bar Chart using Altair? - Added sample data) because the way you provided it is not practical and I already had this one.
import altair as alt
import pandas as pd
from io import StringIO
q13a = pd.read_table(StringIO("""year primary_type Number_of_Incidents number_of_arrests percent_arrest rank
2018 THEFT 57330 5503 9.6 1
2018 BATTERY 44667 8886 19.89 2
2018 CRIMINAL DAMAGE 24889 1498 6.02 3
2018 ASSAULT 18229 2931 16.08 4
2018 DECEPTIVE PRACTICE 15879 713 4.49 5
2017 THEFT 64334 6459 10.04 1
2017 BATTERY 49213 10060 20.44 2
2017 CRIMINAL DAMAGE 29040 1747 6.02 3
2017 ASSAULT 19298 3455 17.9 4
2017 DECEPTIVE PRACTICE 18816 805 4.28 5
2016 THEFT 61600 6518 10.58 1
2016 BATTERY 50292 10328 20.54 2
2016 CRIMINAL DAMAGE 31018 1668 5.38 3
2016 ASSAULT 18738 3490 18.63 4
2016 DECEPTIVE PRACTICE 18733 815 4.35 5
2015 THEFT 57335 6771 11.81 1
2015 BATTERY 48918 11558 23.63 2
2015 CRIMINAL DAMAGE 28675 1835 6.4 3
2015 NARCOTICS 23883 23875 99.97 4
2015 OTHER OFFENSE 17552 4795 27.32 5
2014 THEFT 61561 7415 12.04 1
2014 BATTERY 49447 12517 25.31 2
2014 NARCOTICS 29116 29000 99.6 3
2014 CRIMINAL DAMAGE 27798 2095 7.54 4
2014 OTHER OFFENSE 16979 4159 24.49 5
2013 THEFT 71530 7727 10.8 1
2013 BATTERY 54002 12927 23.94 2
2013 NARCOTICS 34127 33819 99.1 3
2013 CRIMINAL DAMAGE 30853 2107 6.83 4
2013 OTHER OFFENSE 17993 3400 18.9 5"""))
bar = alt.Chart(height=200, width=100).mark_bar().encode(
    x=alt.X('primary_type:N',
            axis=None,
            title=None,
            sort=alt.EncodingSortField(op='sum', field='rank')),
    y=alt.Y('sum(Number_of_Incidents):Q',
            title='Total Number of Incidents'),
    color=alt.Color('primary_type:N')
).facet(
    column=alt.Column('year:O')
).resolve_scale(
    x='independent'
)
line = alt.Chart().mark_line(point=True, color='red').encode(
    x=alt.X('year:O', axis=alt.Axis(labelAngle=0)),
    y=alt.Y('percent_arrest:Q'),
    color=alt.Color('primary_type:N', legend=None)
).properties(height=80, width=680)
alt.vconcat(bar, line, data=q13a).configure_view(stroke='transparent')
Created on 2018-11-29 by the reprexpy package

Python XPath: extract only a few items from tables

I want to extract only a few items from the HTML, which is a table.
<table cellspacing="0" cellpadding="2" width="100%" border="0" class="TableBorderBottom">
<tr>
<td class="tblBursaSummHeader">No.</td>
<td class="tblBursaSummHeader">Name</td>
<td class="tblBursaSummHeader">Stock<br>Code</td>
<td class="tblBursaSummHeader">Rem</td>
<td class="tblBursaSummHeader">Last<br>Done</td>
<td class="tblBursaSummHeader" width="55">Chg</td>
<td class="tblBursaSummHeader">% Chg</td>
<td class="tblBursaSummHeader">Vol<br>('00)</td>
<td class="tblBursaSummHeader">Buy Vol<br>('00)</td>
<td class="tblBursaSummHeader">Buy</td>
<td class="tblBursaSummHeader">Sell</td>
<td class="tblBursaSummHeader">Sell Vol<br>('00)</td>
<td class="tblBursaSummHeader">High</td>
<td class="tblBursaSummHeaderRect">Low</td>
</tr>
<tr>
<td class="tblBursaSEvenRow">1</td>
<td class="tblBursaSEvenRow">LBI CAPITAL BHD-WARRANT A 08/8 (LBICAP-WA)</td>
<td class="tblBursaSEvenRow Right">8494WA</td>
<td class="tblBursaSEvenRow Right">s</td>
<td class="tblBursaSEvenRow Right">0.160</td>
<td class="tblBursaSEvenRow Right"><img src="/images/upArrow.gif" border=0> <span class=tblUp>+0.120</span></td>
<td class="tblBursaSEvenRow Right">300.0</td>
<td class="tblBursaSEvenRow Right">341,238</td>
<td class="tblBursaSEvenRow Right">745</td>
<td class="tblBursaSEvenRow Right">0.160</td>
<td class="tblBursaSEvenRow Right">0.160</td>
<td class="tblBursaSEvenRow Right">1,049</td>
<td class="tblBursaSEvenRow Right">0.185</td>
<td class="tblBursaSEvenRowRight Right">0.040</td>
</tr>
<tr>
<td class="tblBursaSOddRow">2</td>
<td class="tblBursaSOddRow">UNIMECH GROUP BHD-WA13/18 (UNIMECH-WA)</td>
<td class="tblBursaSOddRow Right">7091WA</td>
<td class="tblBursaSOddRow Right">s</td>
<td class="tblBursaSOddRow Right">0.070</td>
<td class="tblBursaSOddRow Right"><img src="/images/upArrow.gif" border=0> <span class=tblUp>+0.040</span></td>
<td class="tblBursaSOddRow Right">133.3</td>
<td class="tblBursaSOddRow Right">261,521</td>
<td class="tblBursaSOddRow Right">8,468</td>
<td class="tblBursaSOddRow Right">0.065</td>
<td class="tblBursaSOddRow Right">0.070</td>
<td class="tblBursaSOddRow Right">5,008</td>
<td class="tblBursaSOddRow Right">0.080</td>
<td class="tblBursaSOddRowRight Right">0.040</td>
</tr>
<tr>
My desired output is the Stock Code, Last Done, and Chg columns, so the desired output is
8494WA
0.160
+0.120
7091WA
0.070
+0.040
I am able to extract the data, but it takes three separate lines of code; I would prefer a single expression that does the same work.
import requests
from lxml import html

page_gain = requests.get('url')
gain = html.fromstring(page_gain.content)
stock = gain.xpath('//table[@class="TableBorderBottom"]/tr/td[3]/text()')
>>> ['Stock', 'Code', '8494WA', '7091WA']
gain.xpath('//table[@class="TableBorderBottom"]/tr/td[5]/text()')
>>>['Last', 'Done', '0.145', '0.075']
gain.xpath('//td/span/text()')
>>>['+0.120', '+0.070']
Note that I also want to eliminate the header strings 'Stock', 'Code', 'Last', and 'Done' from the results.
You need to process each row in a loop and pull the information you want out of it:
data = []
for data_row in gain.xpath('//table[@class="TableBorderBottom"]/tr[position() > 1]'):
    stock = data_row.xpath('./td[3]/text()')[0]
    last_done = data_row.xpath('./td[5]/text()')[0]
    change = data_row.xpath('./td[6]/span/text()')[0]
    data.append({"Stock": stock, "Last Done": last_done, "Change": change})
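The loop above can be run offline against the table fragment from the question, with no network request. This is a minimal sketch, assuming lxml is installed; the snippet below is a trimmed six-column version of the page's table (the real page has more columns, but the td positions used here match the originals), and lxml's lenient HTML parser repairs the unclosed tags:

```python
from lxml import html

# Trimmed-down stand-in for the page's table (hypothetical test data
# copied from the question; column positions 3, 5, 6 match the real page).
snippet = """
<table class="TableBorderBottom">
<tr><td>No.</td><td>Name</td><td>Stock<br>Code</td><td>Rem</td>
    <td>Last<br>Done</td><td>Chg</td></tr>
<tr><td>1</td><td>LBI CAPITAL BHD-WARRANT A 08/8 (LBICAP-WA)</td>
    <td>8494WA</td><td>s</td><td>0.160</td>
    <td><span class="tblUp">+0.120</span></td></tr>
<tr><td>2</td><td>UNIMECH GROUP BHD-WA13/18 (UNIMECH-WA)</td>
    <td>7091WA</td><td>s</td><td>0.070</td>
    <td><span class="tblUp">+0.040</span></td></tr>
</table>
"""

gain = html.fromstring(snippet)
data = []
# Skip the header row, then read each data row cell by cell.
for data_row in gain.xpath('//table[@class="TableBorderBottom"]/tr[position() > 1]'):
    stock = data_row.xpath('./td[3]/text()')[0]
    last_done = data_row.xpath('./td[5]/text()')[0]
    change = data_row.xpath('./td[6]/span/text()')[0]
    data.append({"Stock": stock, "Last Done": last_done, "Change": change})

print(data)
```

Because the header cells live in the first row, the `tr[position() > 1]` predicate drops them, which also removes the stray 'Stock', 'Code', 'Last', 'Done' strings from the results.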

Python BeautifulSoup how to get the index or of the HTML table

<TABLE WIDTH="100%"> <TR> <TH scope="row" VALIGN="TOP" ALIGN="LEFT" WIDTH="10%">Inventors:</TH> <TD ALIGN="LEFT" WIDTH="90%">
<B>Shimada; Masahiro</B> (Shiga, <B>JP</B>) </TD> </TR>
<TR><TH scope="row" VALIGN="TOP" ALIGN="LEFT" WIDTH="10%">Applicant: </TH><TD ALIGN="LEFT" WIDTH="90%"> <TABLE> <TR> <TH scope="column" ALIGN="center">Name</TH> <TH scope="column" ALIGN="center">City</TH> <TH scope="column" ALIGN="center">State</TH> <TH
scope="column" ALIGN="center">Country</TH> <TH scope="column" ALIGN="center">Type</TH> </TR> <TR> <TD> <B><br>Shimada; Masahiro</B> </TD><TD> <br>Shiga </TD><TD ALIGN="center"> <br>N/A </TD><TD ALIGN="center"> <br>JP </TD> </TD><TD ALIGN="left"> </TD>
</TR> </TABLE> </TD></TR>
<TR> <TH scope="row" VALIGN="TOP" ALIGN="LEFT" WIDTH="10%">Assignee:</TH>
<TD ALIGN="LEFT" WIDTH="90%">
<B>Ishida Co., Ltd.</B>
(Kyoto,
<B>JP</B>)
<BR>
</TD>
</TR>
<TR><TH scope="row" VALIGN="TOP" ALIGN="LEFT" WIDTH="10%" NOWRAP>Appl. No.:
</TH><TD ALIGN="LEFT" WIDTH="90%">
<B>12/791,478</B></TD></TR>
<TR><TH scope="row" VALIGN="TOP" ALIGN="LEFT" WIDTH="10%">Filed:
</TH><TD ALIGN="LEFT" WIDTH="90%">
<B>June 1, 2010</B></TD></TR>
</TABLE>
which is taken from this US Patent Office URL.
Above is the HTML table I need to extract the data from.
But when I use:
trtemp=souptemp.findAll('tr')
PattentInventors=trtemp[7].text.strip()
PattentCompany=trtemp[11].text.strip()
PattentFiledtime=trtemp[13].text.strip()
The tr indices 7, 11, and 13 are not constant across pages.
So I switched to the re module like this:
souptemp.findAll(text=re.compile("Assi"))[0]
This is to get the data for Assignee: Ishida Co., Ltd. (Kyoto, JP)
but I could not get the index of the tr list.
How can I find the right tr index for Assignee: Ishida Co., Ltd. (Kyoto, JP)?
Thank you!
In [77]: anchor = soup.findAll(text=re.compile("Assi"))[0]
In [78]: ' '.join(anchor.find_next('td').stripped_strings)
Out[78]: u'Ishida Co., Ltd. (Kyoto, JP )'
import bs4 as bs
import urllib2
import re
url = 'http://patft.uspto.gov//netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=2&f=G&l=50&co1=AND&d=PTXT&s1=%22X+ray%22.ABTX.&s2=detect.ABTX.&OS=ABST/%22X+ray%22+AND+ABST/detect&RS=ABST/%22X+ray%22+AND+ABST/detect'
soup = bs.BeautifulSoup(urllib2.urlopen(url).read())
anchor = soup.findAll(text=re.compile("Assi"))[0]
assignee = ' '.join(anchor.find_next('td').stripped_strings)
print(assignee)
yields
Ishida Co., Ltd. (Kyoto, JP )
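The same anchor-and-walk technique can be tried offline on the table fragment from the question, without fetching the USPTO page. This is a minimal Python 3 sketch, assuming BeautifulSoup 4 is installed; string= is the modern spelling of the answer's text= keyword, and the snippet is a trimmed-down version of the patent table:

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the patent page's table (two rows kept).
snippet = """
<TABLE WIDTH="100%">
<TR><TH scope="row">Assignee:</TH>
    <TD><B>Ishida Co., Ltd.</B> (Kyoto, <B>JP</B>) <BR></TD></TR>
<TR><TH scope="row">Filed:</TH><TD><B>June 1, 2010</B></TD></TR>
</TABLE>
"""

soup = BeautifulSoup(snippet, "html.parser")
# Anchor on the label text, then walk forward to the matching data cell --
# this works no matter which tr index the label lands on.
anchor = soup.find(string=re.compile("Assi"))
assignee = ' '.join(anchor.find_next('td').stripped_strings)
print(assignee)
```

Because the anchor is the label itself rather than a positional index, this keeps working when the page inserts or drops rows (e.g. an optional Applicant row) above the Assignee entry.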
