I am trying to scrape data from the AGMARKNET website. The table is split into 11 pages, but all of the pages use the same URL. I am very new to web scraping (and Python in general), but AGMARKNET does not have a public API, so scraping the page seems to be my only option. I am currently using BeautifulSoup to parse the HTML, and I am able to scrape the initial table, but that only contains the first 500 rows; I want the data from all 11 pages. I am stuck and frustrated. The link and my current code are below. Any direction would be helpful, thank you.
https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--'
response = requests.get(url)
# Use BeautifulSoup to parse the HTML code
soup = BeautifulSoup(response.content, 'html.parser')
# Find the tables on the page; the results grid is assumed to be the first one
stat_table = soup.find_all('table')
# changes stat_table from ResultSet to a Tag
stat_table = stat_table[0]
# Convert html table to list
rows = []
for tr in stat_table.find_all('tr')[1:]:
    cells = []
    tds = tr.find_all('td')
    if len(tds) == 0:
        ths = tr.find_all('th')
        for th in ths:
            cells.append(th.text.strip())
    else:
        for td in tds:
            cells.append(td.text.strip())
    rows.append(cells)
# convert table to df
table = pd.DataFrame(rows)
The website you linked to uses JavaScript to navigate to the next page. requests only fetches the raw HTML and BeautifulSoup only parses it, so neither can run JavaScript.
Instead, you should try something like Selenium, which drives a full browser environment (HTML, CSS, JavaScript and all). In fact, Selenium can even open a visible browser window so you can see it in action as it navigates!
Here is a quick sample code:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
# If you prefer Chrome to Firefox, there is a driver available
# for that as well
# Set the URL
url = 'https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--'
# Start the browser
opts = Options()
driver = webdriver.Firefox(options=opts)
driver.get(url)
Now you can use functions like driver.find_element(...) and driver.find_elements(...) to extract the data you want from this page, the same way you did with BeautifulSoup.
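For example, to dump the text of every table row on the rendered page (a quick sketch; the tag-name locator is an assumption about the page structure):
from selenium.webdriver.common.by import By

# print the text of every <tr> Selenium can find on the rendered page
for row in driver.find_elements(By.TAG_NAME, 'tr'):
    print(row.text)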
For your given link, the page number navigators seem to be running a function of the form,
__doPostBack('ctl00$cphBody$GridViewBoth','Page$2')
...replacing Page$2 with Page$3, Page$4, etc. depending on which page you want. So you can use Selenium to run that JavaScript function when you're ready to navigate.
driver.execute_script("__doPostBack('ctl00$cphBody$GridViewBoth','Page$2')")
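Putting it together, a paging loop might look like the sketch below. It assumes the grid keeps the 'ctl00$cphBody$GridViewBoth' control ID on every page and that the results table is the first <table> pandas finds, so adjust both as needed:
import time
import pandas as pd

frames = [pd.read_html(driver.page_source)[0]]  # page 1 is already loaded
for page in range(2, 12):  # pages 2 through 11
    # trigger the ASP.NET postback that loads the next page of the grid
    driver.execute_script(
        "__doPostBack('ctl00$cphBody$GridViewBoth','Page$%d')" % page)
    time.sleep(3)  # crude wait for the postback; see the note below
    frames.append(pd.read_html(driver.page_source)[0])

table = pd.concat(frames, ignore_index=True)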
A more generic solution is to just select which button you want and then run that button's click() function. General example (not necessarily for the current website):
btn = driver.find_element('id', 'next-button')
btn.click()
A final note: after the button is clicked, you might want to time.sleep(...) for a little while to make sure the page is fully loaded before you start processing the next set of data.
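If you'd rather not guess at sleep durations, Selenium's explicit waits can block until a specific element appears. A sketch (the element ID here is an assumption; substitute the actual ID of the results grid):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the results grid to reappear after the click
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'cphBody_GridViewBoth')))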
I am working on scraping numbers from the Powerball website with the code below.
However, numbers keeps coming back empty. Why is this?
import requests
from bs4 import BeautifulSoup
url = 'https://www.powerball.com/games/home'
page = requests.get(url).text
bsPage = BeautifulSoup(page, 'html.parser')
numbers = bsPage.find_all("div", class_="field_numbers")
numbers
Can confirm @Teprr is absolutely correct. You'll need to download Chrome and add chromedriver.exe to your system PATH for this to work, but the following code gets what you are looking for. You can use other browsers too; you just need their respective driver.
from bs4 import BeautifulSoup
from selenium import webdriver
import time
url = 'https://www.powerball.com/games/home'
options = webdriver.ChromeOptions()
options.add_argument('headless')
browser = webdriver.Chrome(options=options)
browser.get(url)
time.sleep(3) # wait three seconds for all the js to happen
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
draws = soup.findAll("div", {"class":"number-card"})
print(draws)
for d in draws:
    info = d.find("div", {"class": "field_draw_date"}).getText()
    balls = d.find("div", {"class": "field_numbers"}).findAll("div", {"class": "numbers-ball"})
    numbers = [ball.getText() for ball in balls]
    print(info)
    print(numbers)
If you download that file and inspect it locally, you can see that there is no <div> with that class. That means it is likely generated dynamically by your browser using JavaScript, so you would need to use something like Selenium to get the full, generated HTML content.
Anyway, in this specific case, this piece of HTML seems to be the container for the data you are looking for:
<div data-url="/api/v1/numbers/powerball/recent?_format=json" class="recent-winning-numbers"
data-numbers-powerball="Power Play" data-numbers="All Star Bonus">
Now, if you check that custom data-url, you can find the information you want in JSON format.
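That endpoint may let you skip the browser entirely. A sketch (the response structure is an assumption, so print it and inspect before relying on any fields):
import requests

# endpoint taken from the data-url attribute above
api_url = 'https://www.powerball.com/api/v1/numbers/powerball/recent?_format=json'
data = requests.get(api_url).json()
print(data)  # inspect the structure, then pick out the draw dates and numbers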
I'm trying to scrape automobile information from a dynamic webpage. However, when I inspect elements in the Selenium-driven Chrome browser, they are not shown as they are in the original page source. Instead of the HTML of the car details (the informative area near the product image), a "::after" element appears in the HTML source.
You can see my scraping code below:
import requests
from requests import get
from bs4 import BeautifulSoup
from selenium import webdriver
driver_path = ("C:\\Desktop\\chromedriver.exe")
driver = webdriver.Chrome(driver_path)
driver.get('https://www.arabam.com/ilan/galeriden-satilik-citroen-c-elysee-1-6-hdi-attraction/fiat-onkol-oto-dan-c-elysee-1-6-attraction-92-hp-beyaz/14046287')
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.table
table_rows = table.find_all('li')
print(table_rows)
When I use the given code to get the relevant information from the webpage, I cannot see any of the HTML attributes that are necessary for further scraping loops.
What can be the reason for this problem, and how can I solve it?
Thanks,
Edit: screenshots were attached here comparing the HTML element content in the Selenium browser with the element content in normal Google Chrome that I am trying to reach.
There is no <table> in the HTML page you provided, so try a different selector. Note that find_elements_by_class_name cannot match a compound class name (one containing spaces), so use the equivalent CSS selector instead:
driver.find_elements_by_css_selector(".w100.semi-bold.lh18")
This should give you an ordered list of the span elements
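From there you can pull the text out of each match (a sketch, assuming those spans hold the detail values you're after):
for el in driver.find_elements_by_css_selector(".w100.semi-bold.lh18"):
    print(el.text)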
I'm trying to scrape the "get CNSs" drop-down menu from the following page.
Just to walk you through: I start off with a main link that links to all the sequences (the link above is a URL to one sequence).
I go to that link and try to grab each item from the drop-down menu that takes you to a different page (this is the main issue that I'm trying to solve).
Once on the page that the drop down menu takes you, I want to grab the link that directs you to get all CNSs alignments and scrape the information that the link provides you. I have to do this for 10000 alignments.
I'm currently struggling with the drop down menu everything else I should be able to figure out.
I've tried implementing Selenium and BeautifulSoup as you can tell from the code I've written so far. I'm open to suggestions and modification.
This is Python 2.7.
Thank you
#importing libraries
import urllib
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
#parsing the html
url = ("http://pipeline.lbl.gov/cgi-bin/textBrowser2?act=mvista&run=u233-9GR6Sl35")
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html,'html.parser')
#saving the links to a list so I can access those links and scrape them
sequenceurl=[]
for link in soup.find_all('a', string="VISTA-Point"):
    sequenceurl.append(link.get('href'))

for item in sequenceurl:
    print item
    print
#open the webpage and go to the web browser
driver = webdriver.Firefox()
driver.get(sequenceurl[0])
driver.maximize_window()
# note '@id' (not '#id') in the XPath; select_by_index() already performs the selection
Select(driver.find_element_by_xpath('//*[@id="x-auto-131"]/tbody/tr/td[2]/select')).select_by_index(1)
Edit: The main link is the link inside the code that says url =. Here it is again for reference http://pipeline.lbl.gov/cgi-bin/textBrowser2?act=mvista&run=u233-9GR6Sl35
I have tried to parse the data on this table:
https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62733
You will notice that this is a dynamically generated table (apparently JavaScript). Somehow, when I open the URL using Selenium or Beautiful Soup, it is not possible to recognize/parse the table, even though the table is there (if you right-click on the table and check the frame source against the page source, you will see that they do not seem to be related).
Please let me know if you are able to parse the table in Python.
You can do it using Selenium or any other library. Once you look at the source, you'll find that the table is loaded inside an iframe, and the frame URL, which is set from JavaScript, is:
urlFrame = "https://www.rad.cvm.gov.br/enetconsulta/frmDemonstracaoFinanceiraITR.aspx?Informacao=2&Demonstracao=4&Periodo=0&Grupo=DFs+Consolidadas&Quadro=Demonstra%C3%A7%C3%A3o+do+Resultado&NomeTipoDocumento=DFP&Titulo=Demonstra%C3%A7%C3%A3o%20do%20Resultado&Empresa=VALE%20S.A.&DataReferencia=31/12/2016&Versao=1&CodTipoDocumento=4&NumeroSequencialDocumento=62733&NumeroSequencialRegistroCvm=1789&CodigoTipoInstituicao=1"
But it looks like this URL needs some cookies, which the browser sends automatically, so we'll first load the original URL, then simply go to the frame URL and extract the data from the table.
Solution using Selenium:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62733"
urlFrame = "https://www.rad.cvm.gov.br/enetconsulta/frmDemonstracaoFinanceiraITR.aspx?Informacao=2&Demonstracao=4&Periodo=0&Grupo=DFs+Consolidadas&Quadro=Demonstra%C3%A7%C3%A3o+do+Resultado&NomeTipoDocumento=DFP&Titulo=Demonstra%C3%A7%C3%A3o%20do%20Resultado&Empresa=VALE%20S.A.&DataReferencia=31/12/2016&Versao=1&CodTipoDocumento=4&NumeroSequencialDocumento=62733&NumeroSequencialRegistroCvm=1789&CodigoTipoInstituicao=1"
driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)
driver.get(urlFrame)
print(driver.page_source)
soup = BeautifulSoup(driver.page_source, "html.parser")
table_data = soup.findAll("table", {"id": "ctl00_cphPopUp_tbDados"})
# do something with table_data/ parse it further
print(table_data)
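From here, one option is to hand the extracted markup straight to pandas instead of walking the cells by hand (a sketch, assuming the id matched and table_data is non-empty):
import pandas as pd

# read_html parses the <table> markup into a DataFrame
df = pd.read_html(str(table_data[0]))[0]
print(df.head())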
Usually I'm able to write a script that works for scraping, but I've been having some difficulty scraping this site for the table needed for the research project I'm working on. I'm planning to verify that the script works on one state before entering the URLs of my targeted states.
import requests
import bs4 as bs
url = ("http://programs.dsireusa.org/system/program/detail/284")
dsire_get = requests.get(url)
soup = bs.BeautifulSoup(dsire_get.text,'lxml')
table = soup.findAll('div', {'data-ng-controller': 'DetailsPageCtrl'})
print(table)
# I'm printing table just to ensure that the table information I'm looking for is within this section
I'm not sure if the site is attempting to block people from scraping, but all the info that I'm looking to grab is in there if you look at what table outputs.
The text is rendered with JavaScript.
First render the page with dryscrape.
(If you don't want to use dryscrape, see Web-scraping JavaScript page with Python.)
Then, after the page has been rendered, the text can be extracted from a different position on the page, i.e. the place it has been rendered to.
As an example this code will extract HTML from the summary.
import bs4 as bs
import dryscrape
url = ("http://programs.dsireusa.org/system/program/detail/284")
session = dryscrape.Session()
session.visit(url)
dsire_get = session.body()
soup = bs.BeautifulSoup(dsire_get,'html.parser')
table = soup.findAll('div', {'class': 'programSummary ng-binding'})
print(table[0])
Outputs:
<div class="programSummary ng-binding" data-ng-bind-html="program.summary"><p>
<strong>Eligibility and Availability</strong></p>
<p>
Net metering is available to all "qualifying facilities" (QFs), as defined by the federal <i>Public Utility Regulatory Policies Act of 1978</i> (PURPA), which pertains to renewable energy systems and combined heat and power systems up to 80 megawatts (MW) in capacity. There is no statewide cap on the aggregate capacity of net-metered systems.</p>
<p>
All utilities subject to Public ...
So I finally managed to solve the issue and successfully grab the data from the JavaScript page. The code below worked for me, in case anyone encounters the same issue when trying to use Python to scrape a JavaScript webpage on Windows (where dryscrape is incompatible).
import bs4 as bs
from selenium import webdriver
browser = webdriver.Chrome()
url = ("http://programs.dsireusa.org/system/program/detail/284")
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(html_source, "html.parser")
table = soup.find('div', {'class': 'programOverview'})
data = []
for n in table.findAll("div", {"class": "ng-binding"}):
    trip = str(n.text)
    data.append(trip)