I'm scraping a table from a page, but the table's caption is 'blind' (hidden). Is there any way to extract the table from the site?
Using BeautifulSoup like:
import urllib.request
from bs4 import BeautifulSoup
Take a look at this:
import bs4 as bs
import urllib.request
link = 'http://companyinfo.stock.naver.com/v1/company/c1010001.aspx?cn=&cmp_cd=005930&menuType=block'
source = urllib.request.urlopen(link)
soup = bs.BeautifulSoup(source, 'html.parser')
# grab the table by its id, then walk the rows and cells
table = soup.find('table', attrs={'id': 'cTB24'})
for tr in table.find_all('tr'):
    for td in tr.find_all('td'):
        print(td.text)
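If you only need the cell values, pandas can also parse that table directly (a side sketch, not from the original answer; it assumes pandas and lxml are installed and that the page is reachable without extra headers):
import pandas as pd
# read_html returns a list of DataFrames; attrs narrows it to the table with id cTB24
df = pd.read_html(link, attrs={'id': 'cTB24'})[0]
print(df)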
I'm trying to get the tables (and then the tr and td contents) with requests and BeautifulSoup from this link: https://www.basketball-reference.com/teams/PHI/2022/lineups/ , but I get no results.
I tried with:
import requests
from bs4 import BeautifulSoup
url = "https://www.basketball-reference.com/teams/PHI/2022/lineups/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
tables = soup.find_all('table')
However the result of tables is [].
It looks like the tables are placed inside HTML comments, so you have to strip the comment markers from the response text:
page = page.text.replace("<!--","").replace("-->","")
soup = BeautifulSoup(page, 'html.parser')
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.basketball-reference.com/teams/PHI/2022/lineups/"
page = requests.get(url)
page = page.text.replace("<!--","").replace("-->","")
soup = BeautifulSoup(page, 'html.parser')
tables = soup.find_all('table')
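With pandas already imported, the cleaned markup can also go straight to pd.read_html (a small extra step, not in the original example; it assumes lxml or html5lib is installed):
# parse every table in the uncommented HTML into DataFrames
dfs = pd.read_html(page)
print(len(dfs))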
Just in addition, as also mentioned by @chitown88, there is an option to find all comments in the HTML with BeautifulSoup's Comment class. Be aware that you have to parse the extracted strings with bs4 again:
soup.find_all(string=lambda text: isinstance(text, Comment) and '<table' in text)
Example
import requests
from bs4 import BeautifulSoup
from bs4 import Comment
url = "https://www.basketball-reference.com/teams/PHI/2022/lineups/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
soupTables = BeautifulSoup(''.join(soup.find_all(string=lambda text: isinstance(text, Comment) and '<table' in text)), 'html.parser')
soupTables.find_all('table')
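The resulting tables can be turned into DataFrames as well (an extra step on top of the original answer, assuming pandas is installed):
import pandas as pd
# each commented-out table becomes one DataFrame
dfs = pd.read_html(str(soupTables))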
On the website https://www.shanghairanking.com/rankings/arwu/2020 the URL doesn't change when I hit "next". Any ideas on how to scrape the tables on the next pages? Using bs4 in Python, I am able to scrape only the table on the first page.
What I did so far:
from bs4 import BeautifulSoup
import requests
import pandas as pd
html_text = requests.get('https://www.shanghairanking.com/rankings/arwu/2020').text
soup = BeautifulSoup(html_text,'lxml')
data = soup.find('table', class_= "rk-table").text.replace(' ','')
print(data)
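Since that ranking table is rendered and paginated client-side, one generic way to reach the later pages (a sketch, not from this thread; the webdriver setup and the li.ant-pagination-next selector are assumptions about the page's markup) is to drive a browser and click "next" between parses:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
driver = webdriver.Chrome()
driver.get('https://www.shanghairanking.com/rankings/arwu/2020')
time.sleep(2)  # crude wait for the client-side render
for _ in range(3):  # first three pages as an example
    soup = BeautifulSoup(driver.page_source, 'lxml')
    print(soup.find('table', class_='rk-table').text.replace(' ', ''))
    # assumed Ant Design pagination control; inspect the page to confirm
    driver.find_element(By.CSS_SELECTOR, 'li.ant-pagination-next').click()
    time.sleep(2)
driver.quit()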
import requests
from bs4 import BeautifulSoup
req = requests.get("https://en.wikipedia.org/wiki/Harvard_University")
Add this and you will find the class of the table on that page:
soup = BeautifulSoup(req.content, 'html.parser')
soup.table["class"]
Result:
['infobox', 'vcard']
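Those class names can then be fed back into a search to grab the table itself (a short follow-up, not part of the original answer):
# select the infobox table by one of the classes found above
infobox = soup.find('table', class_='infobox')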
I tried to get price ($50.56) info from this link: https://www.google.com/search?biw=1600&bih=758&output=search&tbm=shop&q=McKleinUSA+17564+N+Series+ARIA+%28Khaki%29&oq=McKleinUSA+17564+N+Series+ARIA+%28Khaki%29&gs_l=products-cc.12...26051.26051.0.27263.1.1.0.0.0.0.71.71.1.1.0....0...1ac.2.64.products-cc..0.0.0....0.AxAVt6XRExI#spd=15005512733707849930
but I keep getting nothing.
I get the link info from an Excel file using openpyxl, then use requests and bs4 to scrape the page.
from openpyxl import load_workbook
from bs4 import BeautifulSoup
import requests
wb = load_workbook(filename = "test.xlsx")
ws = wb.active  # get_active_sheet() is deprecated in openpyxl
html = ws['A2'].hyperlink.target
source = requests.get(html).text
soup = BeautifulSoup(source,'lxml')
test = soup.find('span', attrs = {'class': "O8U6h"})
Classes are dynamic, but there is a constant id, and you can use the relationship of the b tag to it:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.google.com/search?biw=1600&bih=758&output=search&tbm=shop&q=McKleinUSA+17564+N+Series+ARIA+%28Khaki%29&oq=McKleinUSA+17564+N+Series+ARIA+%28Khaki%29&gs_l=products-cc.12...26051.26051.0.27263.1.1.0.0.0.0.71.71.1.1.0....0...1ac.2.64.products-cc..0.0.0....0.AxAVt6XRExI#spd=15005512733707849930')
soup = bs(r.content, 'lxml')
print(soup.select_one('#ires b').text)
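Because Google's markup shifts often (see the next answer), a slightly defensive variant avoids an AttributeError when the element is missing (my addition, not from the original answer):
price = soup.select_one('#ires b')
print(price.text if price else 'price element not found')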
Looks like the class names change. I opened the link myself and couldn't find an element with class O8U6h.
I want to collect the link /hmarchhak/102217 from the site https://www.vanglaini.org/ and print it as https://www.vanglaini.org/hmarchhak/102217. Please help.
import requests
import pandas as pd
from bs4 import BeautifulSoup
source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')
for article in soup.find_all('article'):
    headline = article.a.text
    summary = article.p.text
    link = article.a.href
    print(headline)
    print(summary)
    print(link)
    print()
This is my code.
Unless I am missing something, headline and summary appear to be the same text. You can use :has with bs4 4.7.1+ to ensure your article has a child with an href, and this seems to strip out the article tag elements that are not part of the main body, which I suspect is actually your aim.
from bs4 import BeautifulSoup as bs
import requests
import re
base = 'https://www.vanglaini.org'
r = requests.get(base)
soup = bs(r.content, 'lxml')
for article in soup.select('article:has([href])'):
    headline = article.h5.text.strip()
    summary = re.sub(r'\n+|\r+', ' ', article.p.text.strip())
    link = f"{base}{article.a['href']}"
    print(headline)
    print(summary)
    print(link)
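For joining the relative path onto the base URL, urllib.parse.urljoin is a slightly more robust alternative to the f-string (a side note, not from the original answer):
from urllib.parse import urljoin
# handles relative paths and already-absolute hrefs correctly
link = urljoin(base, article.a['href'])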