As the title suggests, I am trying to scrape a table using both Beautifulsoup and Selenium. I'm aware I most likely do not need both libraries, however I wanted to try if using Selenium's xpathselectors would help, which unfortunately they did not.
The website can be found here:
https://app.capitoltrades.com/politician/491
What I'm trying to do is scrape the table under 'Trades' at the bottom
Here is a screenshot
Once I can grab the table, I will collect the td data inside the table rows.
So for example, I would want '29/Dec/2021'under 'Publication Date'. Unfortunately, I haven't been able to get this far because I can't grab the table.
Here is my code:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
url = 'https://app.capitoltrades.com/politician/491'
resp = requests.get(url)
#soup = BeautifulSoup(resp.text, "html5lib")
soup = BeautifulSoup(resp.text, 'lxml')
table = soup.find("table", {"class": "p-datatable-table ng-star-
inserted"}).findAll('tr')
print(table)
This yields the error message "AttributeError: 'NoneType' object has no attribute 'findAll'
Using 'soup.findAll' also does not work.
If I try the xpathselector route using Selenium ...
DRIVER_PATH = '/Users/myname/Downloads/capitol-trades/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.implicitly_wait(1000)
driver.get('https://app.capitoltrades.com/politician/491')
table = driver.find_element_by_xpath("//*[#id='pr_id_2']/div/table").text
print(table)
Chrome continues to open up and nothing gets printed inside my Jupyter notebook (probably because there is no text directly inside the table element[?])
I'd prefer to be able to grab the table element using Beautifulsoup, but all answers are welcomed. I appreciate any help you can provide me with.
That site has a backend api that can be hit very easily:
import requests
import pandas as pd
url = 'https://api.capitoltrades.com/senators/trades/491/false?pageSize=20&pageNumber=1'
resp = requests.get(url).json()
df = pd.DataFrame(resp)
df.to_csv('naughty_nancy_trades.csv',index=False)
print('Saved to naughty_nancy_trades.csv ')
to see where all the data comes from open your browser's Developer Tools - Network - fetch/XHR and reload the page you'll see them fire. I've scraped one of those network calls, there are others for all the data on that page
As per your Selenium code: you are missing a wait.
This
driver.find_element_by_xpath("//*[#id='pr_id_2']/div/table")
command returns you the web element when it is just created i.e already existing, but still not fully rendered.
This should work better:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
DRIVER_PATH = '/Users/myname/Downloads/capitol-trades/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
wait = WebDriverWait(driver, 20)
driver.get('https://app.capitoltrades.com/politician/491')
table = wait.until(EC.visibility_of_element_located((By.XPATH, "//*[#id='pr_id_2']//tr[#class='p-selectable-row ng-star-inserted']"))).text
print(table)
As per you BS4 code, looks like you are using a wrong locator.
This:
table = soup.find("table", {"class": "p-datatable-table ng-star-inserted"})
looks to be better (you have extra spaces in your class name).
The line above returns 5 elements.
So this supposed to work:
table = soup.find("table", {"class": "p-datatable-table ng-star-inserted"}).findAll('tr')
Related
I am trying to scrape data from AGMARKNET website. The tables are split into 11 pages but all of the pages use the same url. I am very new to webscraping (or python in general), but AGMARKNET does not have a public API so scraping the page seems to be my only option. I am currently using BeautifulSoup to parse the HTML code and I am able to scrape the initial table, but that only contains the first 500 data points, but I want the entire 11 page data. I am stuck and frustrated. Link and my current code are below. Any direction would be helpful, thank you .
#αԋɱҽԃ αмєяιcαη
https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--
import requests
import pandas as pd
url = 'https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--'
response = requests.get(url)
# Use BeautifulSoup to parse the HTML code
soup = BeautifulSoup(response.content, 'html.parser')
# changes stat_table from ResultSet to a Tag
stat_table = stat_table[0]
# Convert html table to list
rows = []
for tr in stat_table.find_all('tr')[1:]:
cells = []
tds = tr.find_all('td')
if len(tds) == 0:
ths = tr.find_all('th')
for th in ths:
cells.append(th.text.strip())
else:
for td in tds:
cells.append(td.text.strip())
rows.append(cells)
# convert table to df
table = pd.DataFrame(rows)
The website you linked to seems to be using JavaScript to navigate to the next page. The requests and BeautifulSoup libraries are only for parsing static HTML pages, so they can't run JavaScript.
Instead of using them, you should try something like Selenium that actually simulates a full browser environment (including HTML, CSS, etc.). In fact, Selenium can even open a full browser window so you can see it in action as it navigates!
Here is a quick sample code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
# If you prefer Chrome to Firefox, there is a driver available
# for that as well
# Set the URL
url = 'https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--'
# Start the browser
opts = Options()
driver = webdriver.Firefox(options=opts)
driver.get(url)
Now you can use functions like driver.find_element(...) and driver.find_elements(...) to extract the data you want from this page, the same way you did with BeautifulSoup.
For your given link, the page number navigators seem to be running a function of the form,
__doPostBack('ctl00$cphBody$GridViewBoth','Page$2')
...replacing Page$2 with Page$3, Page$4, etc. depending on which page you want. So you can use Selenium to run that JavaScript function when you're ready to navigate.
driver.execute_script("__doPostBack('ctl00$cphBody$GridViewBoth','Page$2')")
A more generic solution is to just select which button you want and then run that button's click() function. General example (not necessarily for the current website):
btn = driver.find_element('id', 'next-button')
btn.click()
A final note: after the button is clicked, you might want to time.sleep(...) for a little while to make sure the page is fully loaded before you start processing the next set of data.
I want to scrape the duration of tiktok videos for an upcoming project but my code isn't working
import requests; from bs4 import BeautifulSoup
content = requests.get('https://vm.tiktok.com/ZMFFKmx3K/').text
soup = BeautifulSoup(content, 'lxml')
data = soup.find('div', class_="tiktok-1g3unbt-DivSeekBarTimeContainer e123m2eu1")
print(data)
Using an example tiktok
I would think this would work could anyone help
If you turn off JavaScript then check out the element selection in chrome devtools then you will see that the value is like 00/000 but when you will turn JS and the video is on play mode then the duration is increasing uoto finishig.So the real duration value of that element depends on Js. So you have to use an automation tool something like selenium to grab that dynamic value. And How much duration will scrape that depend on time.sleep() if you are on selenium. If time.sleep is more than the video length then it will show None typEerror.
Example:
import time
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
url ='https://vm.tiktok.com/ZMFFKmx3K/'
driver.get(url)
driver.maximize_window()
time.sleep(25)
soup = BeautifulSoup(driver.page_source,"lxml")
data = soup.find('div', class_="tiktok-1g3unbt-DivSeekBarTimeContainer e123m2eu1")
print(data.text)
Output:
00:25/00:28
the ID associated is likely randomized. Try using regex to get element by class ending in 'TimeContainer' + some other id
import requests
from bs4 import BeautifulSoup
import re
content = requests.get('https://vm.tiktok.com/ZMFFKmx3K/').text
soup = BeautifulSoup(content, 'lxml')
data = soup.find('div', {'class': re.compile(r'TimeContainer.*$')})
print(data)
you next issue is that the page loads before the video, so you'll get 0/0 for the time. try selenium instead so you can add timer waits for loading
I am trying to webscrape form the website https://roll20.net/compendium/dnd5e/Monsters%20List#content and having some issues.
My first script I tried kept returning an empty list when finding by div and class name, which I believe is do to the site using Javascript? But a little uncertain if that is the case or not.
Here was my first attempt:
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://roll20.net/compendium/dnd5e/Monsters%20List#content')
soup = BeautifulSoup(page.text, 'html.parser')
card = soup.find_all("div", class_='card')
print(card)
This one returns an empty list so then I tired to use Selenium and scrape with that. Here is that script:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
url='https://roll20.net/compendium/dnd5e/Monsters%20List#content'
driver = webdriver.Firefox(executable_path = 'C:\Windows\System32\geckodriver')
driver.get(url)
page = driver.page_source
page_soup = soup(page,'html.parser')
Starting the script with that I then tried all 3 of these different options (individually ran these, just listed them here together for simplicity sake):
for card in body.find('div', {"class":"card"}):
print(card.text)
print(card)
for card in body.find_all('div', {"class":"card"}):
print(card.text)
print(card)
card = body.find_all('div', {"class":"card"})
print(card)
All of them return the same error message:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Where am I going wrong here?
Edit:
Fazul thank you for your input on this I guess I should be more specific. I was more looking to get the contents of each card. For example, the card has a "body" class and within that body class there are many fields that is the data I am looking to extract. Maybe I am misunderstanding your script and what you stated. Here is a screen shot to maybe help specify my question a bit more to what content I am looking to extract.
So everything that would be under the body i.e. name, title, subtitle, etc.. Those were the texts I was trying to extract.
That page is being loaded by JavaScript. So beautifulsoup will not work in this case. You have to use Selenium.
And the element that you are looking for - <div> with class name as card show up only when you click on the drop-down arrow. Since you are not doing any click event in your code, you get an empty result.
Use selenium to click the <div> with class name as dropdown-toggle. That click event loads the <div class='card'>
Then you can scrape the data you need.
Since the page is loaded by JavaScript, you have to use Selenium.
My solution As follows:
Every card has a link as shown:
Use BeautifulSoup to get the link for each card, Then open each link using Selenium in headless mode because it is also loaded by JavaScript.
Then you get the data you need for each card using BeautifulSoup
Working code:
from selenium import webdriver
from webdriver_manager import driver
from webdriver_manager.chrome import ChromeDriver, ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from time import sleep
from bs4 import BeautifulSoup
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://roll20.net/compendium/dnd5e/Monsters%20List#content")
page = driver.page_source
# Wait until the page is loaded
sleep(13)
soup = BeautifulSoup(page, "html.parser")
# Get all card urls
urls = soup.find_all("div", {"class":"header inapp-hide single-hide"})
cards_list = []
for url in urls:
card = {}
card_url = url.a["href"]
card_link = f'https://roll20.net{card_url}#content'
# Open chrom in headless mode
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)
driver.get(card_link )
page = driver.page_source
soup = BeautifulSoup(page, "html.parser")
card["name"] = soup.select_one(".name").text.split("\n")[1]
card["subtitle"] = soup.select_one(".subtitle").text
# Do the same to get all card detealis
cards_list.append(card)
driver.quit()
Library you need to install
pip install webdriver_manager
This library open chrome driver without the need to geckodriver and get up to date driver for you.
https://roll20.net/compendium/compendium/getList?bookName=dnd5e&pageName=Monsters%20List&_=1630747796747
u can get the json data from this link instead just parse it and get the data you need
I have tried to parse the data on this table:
https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62733
you will notice that this is a dynamically generated table (apparently javascript). It seems somehow that when I open the url using selenium or beautiful soup, it it's not possible to recognize / parse the table, although the table is there (if you right click on the table you and check frame source / page source you will see that they do not seem to be related).
please let me know if you are able to parse the table in python.
You can do it using selenium or any other library, once you look at the source, you'll find out that the table is being loaded inside an iframe, and the frame url, which is being set from javascript, is :
urlFrame = "https://www.rad.cvm.gov.br/enetconsulta/frmDemonstracaoFinanceiraITR.aspx?Informacao=2&Demonstracao=4&Periodo=0&Grupo=DFs+Consolidadas&Quadro=Demonstra%C3%A7%C3%A3o+do+Resultado&NomeTipoDocumento=DFP&Titulo=Demonstra%C3%A7%C3%A3o%20do%20Resultado&Empresa=VALE%20S.A.&DataReferencia=31/12/2016&Versao=1&CodTipoDocumento=4&NumeroSequencialDocumento=62733&NumeroSequencialRegistroCvm=1789&CodigoTipoInstituicao=1"
but looks like this url needs some cookies which the browser automatically sends, so we'll first load our original url, then simply go to the frame url and extract the data from the table.
Solution using selenium :
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62733"
urlFrame = "https://www.rad.cvm.gov.br/enetconsulta/frmDemonstracaoFinanceiraITR.aspx?Informacao=2&Demonstracao=4&Periodo=0&Grupo=DFs+Consolidadas&Quadro=Demonstra%C3%A7%C3%A3o+do+Resultado&NomeTipoDocumento=DFP&Titulo=Demonstra%C3%A7%C3%A3o%20do%20Resultado&Empresa=VALE%20S.A.&DataReferencia=31/12/2016&Versao=1&CodTipoDocumento=4&NumeroSequencialDocumento=62733&NumeroSequencialRegistroCvm=1789&CodigoTipoInstituicao=1"
driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)
driver.get(urlFrame)
print(driver.page_source)
soup = BeautifulSoup(driver.page_source, "html.parser")
table_data = soup.findAll("table", {"id": "ctl00_cphPopUp_tbDados"})
# do something with table_data/ parse it further
print(table_data)
so I'm trying to do web scraping for the first time using BeautifulSoup and Python. The page that I am trying to scrape is at: http://vesselregister.dnvgl.com/VesselRegister/vesseldetails.html?vesselid=34172
client = request('http://vesselregister.dnvgl.com/VesselRegister/vesseldetails.html?vesselid=34172')
page_html = client.read()
client.close()
page_soup = soup(page_html)
identification = page_soup.find('div', {'data-bind':'text: name'})
print(identification.text)
When I do this I simply get an empty string. If I print out simply the identification variable I get:
<div class="col-xs-7" data-bind="text: name"></div>
This is the line of html that I am trying to get the value of, as you can see there is a value A LEBLANC there in the tag
You can try this code :
from selenium import webdriver
driver=webdriver.Chrome()
browser=driver.get('http://vesselregister.dnvgl.com/VesselRegister/vesseldetails.html?vesselid=34172')
find=driver.find_element_by_xpath('//*[#id="identificationCollapse"]/div/div/div/div[1]/div[1]/div[2]')
print(find.text)
output:
A LEBLANC
There are several ways you can achieve the same goal. However, I've used selector in my script which is easy to understand and has got less chance to break unless the html structure of that website is heavily changed. Try this out as well.
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('http://vesselregister.dnvgl.com/VesselRegister/vesseldetails.html?vesselid=34172')
soup = BeautifulSoup(driver.page_source,"lxml")
driver.quit()
item_name = soup.select("[data-bind$='name']")[0].text
print(item_name)
Result:
A LEBLANC
Btw, the way you started will also work:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('http://vesselregister.dnvgl.com/VesselRegister/vesseldetails.html?vesselid=34172')
soup = BeautifulSoup(driver.page_source,"lxml")
driver.quit()
item_name = soup.find('div', {'data-bind':'text: name'}).text
print(item_name)