Web scraping a drop-down menu in Python

I'm trying to scrape the "get CNSs" drop-down menu from the following page.
Just to walk you through: I start off with a main link that links to all the sequences (the link mentioned above is a URL to a sequence).
I go to that link and try to grab each item from the drop-down menu; each item takes you to a different page. (This is the main issue I'm trying to solve.)
Once on the page the drop-down menu takes you to, I want to grab the link that directs you to "get all CNSs" alignments and scrape the information that link provides. I have to do this for 10,000 alignments.
I'm currently struggling with the drop-down menu; everything else I should be able to figure out.
I've tried implementing Selenium and BeautifulSoup, as you can tell from the code I've written so far. I'm open to suggestions and modifications.
This is Python 2.7.
Thank you.
#importing libraries
import urllib
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select

#parsing the html
url = "http://pipeline.lbl.gov/cgi-bin/textBrowser2?act=mvista&run=u233-9GR6Sl35"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

#saving the links to a list so I can access those links and scrape them
sequenceurl = []
for link in soup.find_all('a', string="VISTA-Point"):
    sequenceurl.append(link.get('href'))
for item in sequenceurl:
    print item
print

#open the webpage in the web browser
driver = webdriver.Firefox()
driver.get(sequenceurl[0])
driver.maximize_window()
# select the second drop-down option; Select.select_by_index() performs
# the selection itself and returns None, so there is nothing to .click()
Select(driver.find_element_by_xpath('//*[@id="x-auto-131"]/tbody/tr/td[2]/select')).select_by_index(1)
Edit: The main link is the one assigned to url in the code. Here it is again for reference: http://pipeline.lbl.gov/cgi-bin/textBrowser2?act=mvista&run=u233-9GR6Sl35
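For reference, a minimal sketch of one way to walk the drop-down with Selenium's Select wrapper (the XPath is copied from the code above; the option count and the sleep-based wait are assumptions):

import time
from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Firefox()
driver.get(sequenceurl[0])  # first VISTA-Point link collected above
driver.maximize_window()

xpath = '//*[@id="x-auto-131"]/tbody/tr/td[2]/select'
n_options = len(Select(driver.find_element_by_xpath(xpath)).options)
for i in range(n_options):
    # Re-locate the <select> on every pass: choosing an option changes
    # the page, which leaves the previous element reference stale.
    Select(driver.find_element_by_xpath(xpath)).select_by_index(i)
    time.sleep(5)  # crude wait; assumes the new content loads within 5s
    # ...find the "get all CNSs" link in driver.page_source here...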

Related

How can I fetch the source code from a website that blocks requests made with Python (bs4, Selenium)?

I want to scrape the data on this website: https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105. I tried using Beautiful Soup and Selenium.
The first approach:
import requests
from bs4 import BeautifulSoup
url="https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'lxml')
print(soup)
This did not give the output I expected; the fetched page contains something like this: "Sorry, something about your browser or browsing activity made us think you were a robot."
The second approach:
from selenium import webdriver

url = "https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105"
PATH = r"C:\Users\Vinay Edula\Desktop\xxxxxxxx\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get(url)
This approach works fine for one or two pages on that site, but after that the website blocks the requests.
The third approach:
import time
import webbrowser

chrome_path = 'C:/Program Files (x86)/Google/Chrome/Application/chrome.exe %s'
for i in range(10):
    url = "https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105&cursor=" + str(i*10) + "&limit=10"
    webbrowser.get(chrome_path).open(url)
    time.sleep(10)
This code works fine: it opens the site in Chrome without any error or blocking, but I don't know how to fetch the source code from it.
When the Python code tries to fetch the page, or when Selenium accesses it from its own browser, I get the error. When I open this webpage manually, or via the webbrowser module, I can see the contents. So how can I solve this problem? My final aim is to fetch the contents of this paginated site: https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105
Any solution to this problem will be highly appreciated.
You can use the page_source property of Selenium as follows:
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("http://example.com")
html = browser.page_source  # the rendered HTML of the current page
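Combined with the cursor parameter you already discovered in the third approach, a rough sketch of the whole fetch (untested, and it assumes the site will serve pages to Selenium at this rate):

from selenium import webdriver

driver = webdriver.Firefox()
pages = []
for i in range(10):
    url = ("https://www.findhelp.org/care/support-network--san-francisco-ca"
           "?postal=94105&cursor=" + str(i * 10) + "&limit=10")
    driver.get(url)
    pages.append(driver.page_source)  # rendered HTML, ready for BeautifulSoup
driver.quit()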

Scraping Data from Table with Multiple Pages

I am trying to scrape data from the AGMARKNET website. The table is split into 11 pages, but all of the pages use the same URL. I am very new to web scraping (and to Python in general), but AGMARKNET does not have a public API, so scraping the page seems to be my only option. I am currently using BeautifulSoup to parse the HTML and I am able to scrape the initial table, but that only contains the first 500 data points; I want the data from all 11 pages. I am stuck and frustrated. The link and my current code are below. Any direction would be helpful, thank you.
https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--'
response = requests.get(url)

# Use BeautifulSoup to parse the HTML code
soup = BeautifulSoup(response.content, 'html.parser')

# locate the results table (the line defining stat_table was missing from
# the post; a plain find_all('table') is an assumption)
stat_table = soup.find_all('table')

# changes stat_table from ResultSet to a Tag
stat_table = stat_table[0]

# Convert html table to list
rows = []
for tr in stat_table.find_all('tr')[1:]:
    cells = []
    tds = tr.find_all('td')
    if len(tds) == 0:
        ths = tr.find_all('th')
        for th in ths:
            cells.append(th.text.strip())
    else:
        for td in tds:
            cells.append(td.text.strip())
    rows.append(cells)

# convert table to df
table = pd.DataFrame(rows)
The website you linked to seems to be using JavaScript to navigate to the next page. The requests and BeautifulSoup libraries are only for parsing static HTML pages, so they can't run JavaScript.
Instead of using them, you should try something like Selenium that actually simulates a full browser environment (including HTML, CSS, etc.). In fact, Selenium can even open a full browser window so you can see it in action as it navigates!
Here is a quick code sample:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
# If you prefer Chrome to Firefox, there is a driver available
# for that as well
# Set the URL
url = 'https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--'
# Start the browser
opts = Options()
driver = webdriver.Firefox(options=opts)
driver.get(url)
Now you can use functions like driver.find_element(...) and driver.find_elements(...) to extract the data you want from this page, the same way you did with BeautifulSoup.
For your given link, the page number navigators seem to be running a function of the form,
__doPostBack('ctl00$cphBody$GridViewBoth','Page$2')
...replacing Page$2 with Page$3, Page$4, etc. depending on which page you want. So you can use Selenium to run that JavaScript function when you're ready to navigate.
driver.execute_script("__doPostBack('ctl00$cphBody$GridViewBoth','Page$2')")
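Putting this together, a sketch of the full pagination loop (the 11-page count comes from the question; the fixed sleep is a crude stand-in for a proper wait):

import time
from bs4 import BeautifulSoup

for page in range(2, 12):  # pages 2 through 11
    driver.execute_script(
        "__doPostBack('ctl00$cphBody$GridViewBoth','Page$%d')" % page)
    time.sleep(5)  # give the postback time to finish
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # ...extract this page's table rows from soup, as in the question's code...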
A more generic solution is to just select which button you want and then run that button's click() function. General example (not necessarily for the current website):
btn = driver.find_element('id', 'next-button')
btn.click()
A final note: after the button is clicked, you might want to time.sleep(...) for a little while to make sure the page is fully loaded before you start processing the next set of data.
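If you would rather not guess at sleep durations, Selenium's explicit waits can block until an element actually appears. A sketch (the element ID cphBody_GridViewBoth is my guess at how the GridView renders client-side; check the real ID in the page source):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the results table to reappear after the postback.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "cphBody_GridViewBoth"))
)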

Web scraping using Beautifulsoup/Selenium unable to pull div class using find or find_all

I am trying to web-scrape the website https://roll20.net/compendium/dnd5e/Monsters%20List#content and having some issues.
My first script kept returning an empty list when finding by div and class name, which I believe is due to the site using JavaScript, but I'm a little uncertain whether that is the case or not.
Here was my first attempt:
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://roll20.net/compendium/dnd5e/Monsters%20List#content')
soup = BeautifulSoup(page.text, 'html.parser')
card = soup.find_all("div", class_='card')
print(card)
This one returns an empty list, so then I tried to use Selenium and scrape with that. Here is that script:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver

url = 'https://roll20.net/compendium/dnd5e/Monsters%20List#content'
driver = webdriver.Firefox(executable_path=r'C:\Windows\System32\geckodriver')
driver.get(url)
page = driver.page_source
page_soup = BeautifulSoup(page, 'html.parser')
Starting the script with that, I then tried all three of these options (I ran them individually; they are listed together here for simplicity's sake):
for card in body.find('div', {"class":"card"}):
    print(card.text)
    print(card)

for card in body.find_all('div', {"class":"card"}):
    print(card.text)
    print(card)

card = body.find_all('div', {"class":"card"})
print(card)
All of them return the same error message:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Where am I going wrong here?
Edit:
Fazul, thank you for your input. I guess I should be more specific: I was looking to get the contents of each card. For example, each card has a "body" class, and within that body class there are many fields; that is the data I am looking to extract. Maybe I am misunderstanding your script and what you stated. Here is a screenshot to help specify what content I am looking to extract.
So everything that would be under the body, i.e. name, title, subtitle, etc. Those were the texts I was trying to extract.
That page is being loaded by JavaScript, so BeautifulSoup alone will not work in this case. You have to use Selenium.
And the element that you are looking for, a <div> with class name card, shows up only when you click on the drop-down arrow. Since you are not doing any click event in your code, you get an empty result.
Use Selenium to click the <div> with class name dropdown-toggle. That click event loads the <div class='card'>.
Then you can scrape the data you need.
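A minimal sketch of that idea (the dropdown-toggle class name comes from the answer above; the sleep lengths are assumptions):

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("https://roll20.net/compendium/dnd5e/Monsters%20List#content")
time.sleep(10)  # let the JavaScript render the list

# Expand each drop-down so its card content is added to the DOM.
for toggle in driver.find_elements_by_class_name("dropdown-toggle"):
    toggle.click()
    time.sleep(1)

soup = BeautifulSoup(driver.page_source, "html.parser")
cards = soup.find_all("div", class_="card")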
Since the page is loaded by JavaScript, you have to use Selenium.
My solution is as follows:
Every card has a link, as shown in the screenshot.
Use BeautifulSoup to get the link for each card, then open each link using Selenium in headless mode, because the card pages are also loaded by JavaScript.
Then you can get the data you need for each card using BeautifulSoup.
Working code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from time import sleep
from bs4 import BeautifulSoup

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://roll20.net/compendium/dnd5e/Monsters%20List#content")
# Wait until the page is loaded before grabbing its source
sleep(13)
page = driver.page_source
soup = BeautifulSoup(page, "html.parser")

# Get all card urls
urls = soup.find_all("div", {"class": "header inapp-hide single-hide"})
cards_list = []
for url in urls:
    card = {}
    card_url = url.a["href"]
    card_link = f'https://roll20.net{card_url}#content'
    # Open Chrome in headless mode
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)
    driver.get(card_link)
    page = driver.page_source
    soup = BeautifulSoup(page, "html.parser")
    card["name"] = soup.select_one(".name").text.split("\n")[1]
    card["subtitle"] = soup.select_one(".subtitle").text
    # Do the same to get all the card details
    cards_list.append(card)
    driver.quit()
Library you need to install:
pip install webdriver_manager
This library downloads an up-to-date Chrome driver for you, so there is no need to manage a chromedriver/geckodriver executable yourself.
Alternatively, you can get the JSON data from this link and just parse it for the data you need:
https://roll20.net/compendium/compendium/getList?bookName=dnd5e&pageName=Monsters%20List&_=1630747796747
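For example, with requests (this assumes the endpoint answers without authentication; the trailing _= parameter is just a cache-buster and can be dropped):

import requests

api = ("https://roll20.net/compendium/compendium/getList"
       "?bookName=dnd5e&pageName=Monsters%20List")
data = requests.get(api).json()
print(data)  # inspect the structure, then pull out the fields you need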

Web scraping with Python and beautifulsoup: What is saved by the BeautifulSoup function?

This question follows this previous question. I want to scrape data from a betting site using Python. I first tried to follow this tutorial, but the problem is that the site tipico is not available from Switzerland. I thus chose another betting site: Winamax. In the tutorial, the tipico webpage is first inspected in order to find where the betting rates are located in the HTML file. On the tipico webpage, they were stored in buttons of class "c_but_base c_but". By writing the following lines, the rates could therefore be saved and printed using the Beautiful Soup module:
from bs4 import BeautifulSoup
import urllib.request
import re

url = "https://www.tipico.de/de/live-wetten/"
try:
    page = urllib.request.urlopen(url)
except:
    print("An error occurred.")
soup = BeautifulSoup(page, 'html.parser')
regex = re.compile('c_but_base c_but')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)
I thus tried to do the same with the webpage Winamax. I inspected the page and found that the betting rates were stored in buttons of class "ui-touchlink-needsclick price odd-price". See the code below:
from bs4 import BeautifulSoup
import urllib.request
import re

url = "https://www.winamax.fr/paris-sportifs/sports/1/7/4"
try:
    page = urllib.request.urlopen(url)
except Exception as e:
    print(f"An error occurred: {e}")
soup = BeautifulSoup(page, 'html.parser')
regex = re.compile('ui-touchlink-needsclick price odd-price')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)
The problem is that it prints nothing: Python does not find elements of that class (right?). I thus tried to print the soup object in order to see what the BeautifulSoup function was doing exactly. I added this line:
print(soup)
When printing it (I do not show the printout of soup because it is too long), I noticed that it is not the same text as what appears when I right-click and "inspect" the Winamax webpage. So what is the BeautifulSoup function doing exactly? How can I store the betting rates from the Winamax website using BeautifulSoup?
EDIT: I have never coded in html and I'm a beginner in Python, so some terminology might be wrong, that's why some parts are in italics.
That's because the website is using JavaScript to display these details, and BeautifulSoup does not interact with JS on its own.
First, find out whether the element you want to scrape is present in the page source; if it is, you can scrape pretty much everything! In your case, the button/span tags are not in the page source (meaning they are hidden or pulled in through a script).
No <button> tag in the page source (screenshot):
So I suggest using Selenium as the solution, and I tried a basic scrape of the website.
Here is the code I used:
from selenium import webdriver
option = webdriver.ChromeOptions()
option.add_argument('--headless')
option.binary_location = r'Your chrome.exe file path'
browser = webdriver.Chrome(executable_path=r'Your chromedriver.exe file path', options=option)
browser.get(r"https://www.winamax.fr/paris-sportifs/sports/1/7/4")
span_tags = browser.find_elements_by_tag_name('span')
for span_tag in span_tags:
print(span_tag.text)
browser.quit()
This is the output:
There is some junk data present in this output, but that's for you to figure out what you need and what you don't!

Python, Selenium "::after" problem while scraping

I'm trying to scrape automobile information from a dynamic webpage. However, after running the Selenium Chrome browser, the inspected elements are not shown as they are in the original source page: instead of the HTML code of the car details (the informative area near the product image), an "::after" element appears in the HTML source code.
You can see my scraping code below:
from bs4 import BeautifulSoup
from selenium import webdriver

driver_path = "C:\\Desktop\\chromedriver.exe"
driver = webdriver.Chrome(driver_path)
driver.get('https://www.arabam.com/ilan/galeriden-satilik-citroen-c-elysee-1-6-hdi-attraction/fiat-onkol-oto-dan-c-elysee-1-6-attraction-92-hp-beyaz/14046287')
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.table
table_rows = table.find_all('li')
print(table_rows)
When I used the given code to pull the relevant information from the webpage, I could not see any of the HTML attributes needed for further scraping loops.
What could be the reason for this problem, and how can I solve it?
Thanks,
Edit:
HTML element content in the Selenium browser (screenshot),
Normal Google Chrome HTML element content that I am trying to reach (screenshot).
There is no table in the HTML page you provided; try using a different selector. You could try selecting by:
driver.find_elements_by_class_name("w100 semi-bold lh18")
This should give you an ordered list of the span elements
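One caveat: "w100 semi-bold lh18" is three separate classes, and find_elements_by_class_name expects a single class name, so a CSS selector may be more reliable. A sketch:

# Compound class names need a CSS selector rather than a single class name.
spans = driver.find_elements_by_css_selector("span.w100.semi-bold.lh18")
for span in spans:
    print(span.text)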
