I'm scraping a table from a page, but the table's caption is 'blind' (hidden). Is there any way to extract the table from the site?
Using BeautifulSoup like:
import urllib.request
from bs4 import BeautifulSoup
Take a look at this:
import bs4 as bs
import urllib.request
link = 'http://companyinfo.stock.naver.com/v1/company/c1010001.aspx?cn=&cmp_cd=005930&menuType=block'
source = urllib.request.urlopen(link)
soup = bs.BeautifulSoup(source, 'html.parser')
# grab the table by its id, then walk the rows and cells
table = soup.find('table', attrs={'id': 'cTB24'})
for tr in table.find_all('tr'):
    for td in tr.find_all('td'):
        print(td.text)
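If you only need the cell values, pandas can also parse that table directly (a side sketch, not from the original answer; it assumes pandas and lxml are installed and that the page is reachable without extra headers):
import pandas as pd
# read_html returns a list of DataFrames; attrs narrows it to the table with id cTB24
df = pd.read_html(link, attrs={'id': 'cTB24'})[0]
print(df)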
I'm trying to get the tables (and then the tr and td contents) with requests and BeautifulSoup from this link: https://www.basketball-reference.com/teams/PHI/2022/lineups/ , but I get no results.
I tried with:
import requests
from bs4 import BeautifulSoup
url = "https://www.basketball-reference.com/teams/PHI/2022/lineups/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
tables = soup.find_all('table')
However the result of tables is [].
It looks like the tables are placed inside HTML comments, so you have to strip the comment markers from the response text:
page = page.text.replace("<!--","").replace("-->","")
soup = BeautifulSoup(page, 'html.parser')
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.basketball-reference.com/teams/PHI/2022/lineups/"
page = requests.get(url)
page = page.text.replace("<!--","").replace("-->","")
soup = BeautifulSoup(page, 'html.parser')
tables = soup.find_all('table')
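With pandas already imported, the cleaned markup can also go straight to pd.read_html (a small extra step, not in the original example; it assumes lxml or html5lib is installed):
# parse every table in the uncommented HTML into DataFrames
dfs = pd.read_html(page)
print(len(dfs))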
Just in addition, as also mentioned by @chitown88, there is an option to find all comments in the HTML with BeautifulSoup's Comment class. Be aware that you have to parse the extracted strings with bs4 again:
soup.find_all(string=lambda text: isinstance(text, Comment) and '<table' in text)
Example
import requests
from bs4 import BeautifulSoup
from bs4 import Comment
url = "https://www.basketball-reference.com/teams/PHI/2022/lineups/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
soupTables = BeautifulSoup(''.join(soup.find_all(string=lambda text: isinstance(text, Comment) and '<table' in text)), 'html.parser')
soupTables.find_all('table')
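The resulting tables can be turned into DataFrames as well (an extra step on top of the original answer, assuming pandas is installed):
import pandas as pd
# each commented-out table becomes one DataFrame
dfs = pd.read_html(str(soupTables))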
On the website https://www.shanghairanking.com/rankings/arwu/2020 the URL doesn't change when I hit "next". Any ideas on how to scrape the tables on the next pages? Using bs4 in Python, I am able to scrape only the table on the first page.
What I did so far:
from bs4 import BeautifulSoup
import requests
import pandas as pd
html_text = requests.get('https://www.shanghairanking.com/rankings/arwu/2020').text
soup = BeautifulSoup(html_text,'lxml')
data = soup.find('table', class_= "rk-table").text.replace(' ','')
print(data)
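Since that ranking table is rendered and paginated client-side, one generic way to reach the later pages (a sketch, not from this thread; the webdriver setup and the li.ant-pagination-next selector are assumptions about the page's markup) is to drive a browser and click "next" between parses:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
driver = webdriver.Chrome()
driver.get('https://www.shanghairanking.com/rankings/arwu/2020')
time.sleep(2)  # crude wait for the client-side render
for _ in range(3):  # first three pages as an example
    soup = BeautifulSoup(driver.page_source, 'lxml')
    print(soup.find('table', class_='rk-table').text.replace(' ', ''))
    # assumed Ant Design pagination control; inspect the page to confirm
    driver.find_element(By.CSS_SELECTOR, 'li.ant-pagination-next').click()
    time.sleep(2)
driver.quit()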
import requests
from bs4 import BeautifulSoup
req = requests.get("https://en.wikipedia.org/wiki/Harvard_University")
Add this and you will find the class of the table on that page:
soup = BeautifulSoup(req.content, 'html.parser')
soup.table["class"]
Result:
['infobox', 'vcard']
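Those class names can then be fed back into a search to grab the table itself (a short follow-up, not part of the original answer):
# select the infobox table by one of the classes found above
infobox = soup.find('table', class_='infobox')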
I tried to get price ($50.56) info from this link: https://www.google.com/search?biw=1600&bih=758&output=search&tbm=shop&q=McKleinUSA+17564+N+Series+ARIA+%28Khaki%29&oq=McKleinUSA+17564+N+Series+ARIA+%28Khaki%29&gs_l=products-cc.12...26051.26051.0.27263.1.1.0.0.0.0.71.71.1.1.0....0...1ac.2.64.products-cc..0.0.0....0.AxAVt6XRExI#spd=15005512733707849930
but I keep getting nothing.
I get the link info from an Excel file using openpyxl, then use requests and bs4 to scrape the page.
from openpyxl import load_workbook
from bs4 import BeautifulSoup
import requests
wb = load_workbook(filename = "test.xlsx")
ws = wb.active  # get_active_sheet() is deprecated in openpyxl
html = ws['A2'].hyperlink.target
source = requests.get(html).text
soup = BeautifulSoup(source,'lxml')
test = soup.find('span', attrs = {'class': "O8U6h"})
Classes are dynamic, but there is a constant id, and you can use the relationship of the b tag to it:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.google.com/search?biw=1600&bih=758&output=search&tbm=shop&q=McKleinUSA+17564+N+Series+ARIA+%28Khaki%29&oq=McKleinUSA+17564+N+Series+ARIA+%28Khaki%29&gs_l=products-cc.12...26051.26051.0.27263.1.1.0.0.0.0.71.71.1.1.0....0...1ac.2.64.products-cc..0.0.0....0.AxAVt6XRExI#spd=15005512733707849930')
soup = bs(r.content, 'lxml')
print(soup.select_one('#ires b').text)
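Because Google's markup shifts often (see the next answer), a slightly defensive variant avoids an AttributeError when the element is missing (my addition, not from the original answer):
price = soup.select_one('#ires b')
print(price.text if price else 'price element not found')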
Looks like the class names change. I opened the link myself and couldn't find an element with class O8U6h.
I want to collect the link /hmarchhak/102217 from the site https://www.vanglaini.org/ and print it as https://www.vanglaini.org/hmarchhak/102217. Please help.
import requests
import pandas as pd
from bs4 import BeautifulSoup
source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')
for article in soup.find_all('article'):
    headline = article.a.text
    summary = article.p.text
    link = article.a.href
    print(headline)
    print(summary)
    print(link)
    print()
This is my code.
Unless I am missing something, headline and summary appear to be the same text. You can use :has with bs4 4.7.1+ to ensure your article has a child with an href, and this seems to strip out the article tag elements that are not part of the main body, which I suspect is actually your aim.
from bs4 import BeautifulSoup as bs
import requests
import re
base = 'https://www.vanglaini.org'
r = requests.get(base)
soup = bs(r.content, 'lxml')
for article in soup.select('article:has([href])'):
    headline = article.h5.text.strip()
    summary = re.sub(r'\n+|\r+', ' ', article.p.text.strip())
    link = f"{base}{article.a['href']}"
    print(headline)
    print(summary)
    print(link)
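For joining the relative path onto the base URL, urllib.parse.urljoin is a slightly more robust alternative to the f-string (a side note, not from the original answer):
from urllib.parse import urljoin
# handles relative paths and already-absolute hrefs correctly
link = urljoin(base, article.a['href'])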