How do i change this code to scrape a table - python

I am struggling to data scrape this website:
https://wix-visual-data.appspot.com/app/widget?pageId=cu7nt&compId=comp-kesofw00&viewerCompId=comp-kesofw00&siteRevision=947&viewMode=site&deviceType=desktop&locale=en&tz=Europe%2FLondon&width=980&height=890&instance=k983l1LiiUeOz5_3Pd_CLXbjfadc08q1fEu54xfh9aA.eyJpbnN0YW5jZUlkIjoiYjQ0MWIxMGUtNTRmNy00YzdhLTgwY2QtNmU0ZjkwYzljMzA3IiwiYXBwRGVmSWQiOiIxMzQxMzlmMy1mMmEwLTJjMmMtNjkzYy1lZDIyMTY1Y2ZkODQiLCJtZXRhU2l0ZUlkIjoiM2M3ZmE5OWItY2I3Yy00MTg0LTk1OTEtNWY0MDhmYWYwZmRhIiwic2lnbkRhdGUiOiIyMDIxLTAxLTMwVDAxOjIzOjAyLjU1MVoiLCJ1aWQiOiIzYWMyNDI3YS04NGVhLTQ0ZGUtYjYxMS02MTNiZTVlOWJiZGQiLCJkZW1vTW9kZSI6ZmFsc2UsImFpZCI6IjczYWE3ZWNjLTQyODUtNDY2My1iNjMxLTMzMjE0MWJiZDhhMiIsImJpVG9rZW4iOiI4ODNlMTg5NS05ZjhiLTBkZmUtMTU1Yy0zMTBmMWY2NmNjZGQiLCJzaXRlT3duZXJJZCI6ImVhYWU1MDEzLTMxZjgtNDQzNC04MDFhLTE3NDQ2N2EwZjE5YSIsImV4cGlyYXRpb25EYXRlIjoiMjAyMS0wMS0zMFQwNToyMzowMi41NTFaIiwiaGFzVXNlclJvbGUiOmZhbHNlfQ&currency=GBP&currentCurrency=GBP&vsi=795183b4-8f30-4854-bd85-77678dbe4cf8&consent-policy=%7B%22func%22%3A0%2C%22anl%22%3A0%2C%22adv%22%3A0%2C%22dt3%22%3A1%2C%22ess%22%3A1%7D&commonConfig=%7B%22brand%22%3A%22wix%22%2C%22bsi%22%3Anull%2C%22BSI%22%3Anull%7D
This URL has a table but for some reason I am not able to scrape this into an excel file. This is my current code in Python and this is what I have tried. Any help is much appreciated thank you legends!
import urllib
import urllib.request
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://wix-visual-data.appspot.com/app/widget?pageId=cu7nt&compId=comp-kesofw00&viewerCompId=comp-kesofw00&siteRevision=947&viewMode=site&deviceType=desktop&locale=en&tz=Europe%2FLondon&width=980&height=890&instance=dxGyx3zK9ULK0A8UtGOrLw-__FTD9EBEfzQojJ7Bz00.eyJpbnN0YW5jZUlkIjoiYjQ0MWIxMGUtNTRmNy00YzdhLTgwY2QtNmU0ZjkwYzljMzA3IiwiYXBwRGVmSWQiOiIxMzQxMzlmMy1mMmEwLTJjMmMtNjkzYy1lZDIyMTY1Y2ZkODQiLCJtZXRhU2l0ZUlkIjoiM2M3ZmE5OWItY2I3Yy00MTg0LTk1OTEtNWY0MDhmYWYwZmRhIiwic2lnbkRhdGUiOiIyMDIxLTAxLTI5VDE4OjM0OjQwLjgwM1oiLCJ1aWQiOiIzYWMyNDI3YS04NGVhLTQ0ZGUtYjYxMS02MTNiZTVlOWJiZGQiLCJkZW1vTW9kZSI6ZmFsc2UsImFpZCI6IjczYWE3ZWNjLTQyODUtNDY2My1iNjMxLTMzMjE0MWJiZDhhMiIsImJpVG9rZW4iOiI4ODNlMTg5NS05ZjhiLTBkZmUtMTU1Yy0zMTBmMWY2NmNjZGQiLCJzaXRlT3duZXJJZCI6ImVhYWU1MDEzLTMxZjgtNDQzNC04MDFhLTE3NDQ2N2EwZjE5YSIsImV4cGlyYXRpb25EYXRlIjoiMjAyMS0wMS0yOVQyMjozNDo0MC44MDNaIiwiaGFzVXNlclJvbGUiOmZhbHNlfQ&currency=GBP&currentCurrency=GBP&vsi=57130cda-8191-488e-8089-f472928266e3&consent-policy=%7B%22func%22%3A0%2C%22anl%22%3A0%2C%22adv%22%3A0%2C%22dt3%22%3A1%2C%22ess%22%3A1%7D&commonConfig=%7B%22brand%22%3A%22wix%22%2C%22bsi%22%3Anull%2C%22BSI%22%3Anull%7D"
table_id = "theTable"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find('table', attrs={"id" : theTable})
df = pd.read_html(str(table))

The page is loading the data using JavaScript. You can find the URL using the Network tab of Firefox. Even better news, the data is in the CSV format so you don't even need an HTML parser to parse it.
You can find the CSV here.

Related

scraping table from a website result as empty

I am trying to scrape the main table with tag :
<table _ngcontent-jna-c4="" class="rayanDynamicStatement">
from following website using 'BeautifulSoup' library, but the code returns empty [] while printing soup returns html string and request status is 200. I found out that when i use browser 'inspect element' tool i can see the table tag but in "view page source" the table tag which is part of "app-root" tag is not shown. (you see <app-root></app-root> which is empty). Besides there is no "json" file in the webpage's components to extract data from it. Please help me how can I scrape the table data.
import urllib.request
import pandas as pd
from urllib.parse import unquote
from bs4 import BeautifulSoup
yurl='https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0'
req=urllib.request.urlopen(yurl)
print(req.status)
#get response
response = req.read()
html = response.decode("utf-8")
#make html readable
soup = BeautifulSoup(html, features="html")
table_body=soup.find_all("table")
print(table_body)
The table is in the source HTML but kinda hidden and then rendered by JavaScript. It's in one of the <script> tags. This can be located with bs4 and then parsed with regex. Finally, the table data can be dumped to json.loads then to a pandas and to a .csv file, but since I don't know any Persian, you'd have to see if it's of any use.
Just by looking at some values, I think it is.
Oh, and this can be done without selenium.
Here's how:
import pandas as pd
import json
import re
import requests
from bs4 import BeautifulSoup
url = "https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0"
scripts = BeautifulSoup(
requests.get(url, verify=False).content,
"lxml",
).find_all("script", {"type": "text/javascript"})
table_data = json.loads(
re.search(r"var datasource = ({.*})", scripts[-5].string).group(1),
)
pd.DataFrame(
table_data["sheets"][0]["tables"][0]["cells"],
).to_csv("huge_table.csv", index=False)
This outputs a huge file that looks like this:
Might not the best solution, but with webdriver in headless mode you can get all what you want:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
option = Options()
option.add_argument('--headless')
url = 'https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0'
driver = webdriver.Chrome(options=option)
driver.get(url)
bs = BeautifulSoup(driver.page_source, 'html.parser')
print(bs.find('table'))
driver.quit()
It looks like the elements your're trying to get are rendered by some JavaScript code. You will need to use something like Selenium instead in order to get the fully rendered HTML.

Python Error: 'NoneType' object has no attribute 'find_all' using Beautiful Soup

I'm having a problem with some webscraping code that I'm trying to run. To scrape information from a series of links like the following:
http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument
I am trying to scrape certain elements from the table, but I received the following error:
Python Error: 'NoneType' object has no attribute 'find_all'
I know this has to do with the fact that it's not actually finding the table because when I run the following simplified code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import time
url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('table', {'bordercolor' : '#6583A0'})
print(table)
It returns a 'None' for the printed table, meaning the code cannot scrape any of the features of the table. I've been running similar code for similar pages and I am able to find the table just fine so I'm not sure why this is not working? I'm new to webscraping but I'd appreciate any help!
I think the html contains some flaws that made the html parser fails to properlly parse your html, you can verify that by printing page.text and then print soup, you will find that the document has some parts removed by parser.
However lxml parser successfully parsed it with its flaw as lxml is better on ill-formatted html documents:
rom bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import time
url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table = soup.find('table', {'bordercolor' : '#6583A0'})
print(table)
that should catch the table tag correctly
So the soup doesn't parse the website content correctly, because one tag is incorrect and break the structure. You have to fix it before parse it:
url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument'
page = requests.get(url)
soup = BeautifulSoup(page.text.replace("</script\n", "</script>"), 'html.parser')
table = soup.find('table', {'bordercolor' : '#6583A0'})
print(table)
import pandas as pd
df = pd.read_html(
"http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument")[0]
print(df)
df.to_csv("Data.csv", index=False, header=None)
Output: view online

Parsing a table on webpage using BeautifulSoup

Trying get a table from the website SGX.
The page is saved to local drive and I am using BeautifulSoup to parse it:
soup = BeautifulSoup(open(pages), "lxml")
soup.prettify()
list_0 = soup.find_all('table')[0]
print list_0
What it returned, is not the first row on the page:
[<tr><td>Zhongmin Baihui</td><td>5SR</td><td class="nowrap">09:44 AM</td><td class="nowrap">09:49 AM</td><td>0.615</td><td>0.675</td><td>0.555</td></tr>]
What's the right way to retrieve this table?
Thank you.
Data are being fetched after page loads using AJAX request, by inspecting the page you can find the API URL (the Url below), and then you can use something like that:
import pandas as pd
import requests
import json
response = requests.get('https://api.sgx.com/securities/v1.1?excludetypes=bonds&params=nc%2Cadjusted-vwap%2Cb%2Cbv%2Cp%2Cc%2Cchange_vs_pc%2Cchange_vs_pc_percentage%2Ccx%2Ccn%2Cdp%2Cdpc%2Cdu%2Ced%2Cfn%2Ch%2Ciiv%2Ciopv%2Clt%2Cl%2Co%2Cp_%2Cpv%2Cptd%2Cs%2Csv%2Ctrading_time%2Cv_%2Cv%2Cvl%2Cvwap%2Cvwap-currency')
data = json.loads(response.content)["data"]["prices"]
df = pd.DataFrame(data)
print(df)
If your requirement are complex and your crawling done in regular basis I recommend using scrapy.

How to scrape a specific table form website using python (beautifulsoup4 and requests or any other library)?

https://en.wikipedia.org/wiki/Economy_of_the_European_Union
Above is the link to website and I want to scrape table: Fortune top 10 E.U. corporations by revenue (2016).
Please, share the code for the same:
import requests
from bs4 import BeautifulSoup
def web_crawler(url):
page = requests.get(url)
plain_text = page.text
soup = BeautifulSoup(plain_text,"html.parser")
tables = soup.findAll("tbody")[1]
print(tables)
soup = web_crawler("https://en.wikipedia.org/wiki/Economy_of_the_European_Union")
following what #FanMan said , this is simple code to help you get started, keep in mind that you will need to clean it and also perform the rest of the work on your own.
import requests
from bs4 import BeautifulSoup
url='https://en.wikipedia.org/wiki/Economy_of_the_European_Union'
r=requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
temp_datastore=list()
for text in soup.findAll('p'):
w=text.findAll(text=True)
if(len(w)>0):
temp_datastore.append(w)
Some documentation
beautiful soup:https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests: http://docs.python-requests.org/en/master/user/intro/
urllib: https://docs.python.org/2/library/urllib.html
You're first issue is that your url is not properly defined. After that you need to find the table to extract and it's class. In this case the class was "wikitable" and it was a the first table. I have started your code for you so it gives you the extracted data from the table. Web-scraping is good to learn but if your are just starting to program, practice with some simpler stuff first.
import requests
from bs4 import BeautifulSoup
def webcrawler():
url = "https://en.wikipedia.org/wiki/Economy_of_the_European_Union"
page = requests.get(url)
soup = BeautifulSoup(page.text,"html.parser")
tables = soup.findAll("table", class_='wikitable')[0]
print(tables)
webcrawler()

Python BeautifulSoup cannot find table ID

I am running into some trouble scraping a table using BeautifulSoup. Here is my code
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page,"html.parser")
stats = soup.find('table', id = 'totals')
In [78]: print(stats)
None
When I right click on the table to inspect the element the HTML looks as I'd expect, however when I view the source the only element with id = 'totals' is commented out. Is there a way to scrape a table from the commented source code?
I have referenced this post but can't seem to replicate their solution.
Here is a link to the webpage I am interested in. I'd like to scrape the table labeled "Totals" and store it as a data frame.
I am relatively new to Python, HTML, and web scraping. Any help would be greatly appreciated.
Thanks in advance.
Michael
Comments are string instances in BeautifulSoup. You can use BeautifulSoup's find method with a regular expression to find the particular string that you're after. Once you have the string, have BeautifulSoup parse that and there you go.
In other words,
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page,"html.parser")
stats_html = soup.find(string=re.compile('id="totals"'))
stats_soup = BeautifulSoup(stats_html, "html.parser")
print(stats_soup.table.caption.text)
You can do this:
from urllib2 import *
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page,"lxml")
stats = soup.findAll('div', id = 'all_totals')
print stats
Please inform me if I helped!

Categories

Resources