unable to Webscrape dropdown item [Python][beautifulsoup]

I am new to web scraping and am scraping this website: https://www.valueresearchonline.com/funds/22/uti-mastershare-fund-regular-plan/
On this page, I want to scrape the text "Regular Plan".
I built the selector from what Inspect Element shows, using this code:
import requests
from bs4 import BeautifulSoup
import csv
import sys
url = 'https://www.valueresearchonline.com/funds/newsnapshot.asp?schemecode=22'
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
regular_direct = soup.find('span',class_="filter-option pull-left").text
print(regular_direct)
I get None when printing, and I don't know why. The HTML in Inspect Element and in View Page Source is also different: in View Page Source, this span and class are not there.
Why am I getting None? Can anyone tell me how to get that text, and why the Inspect Element code and the View Page Source code are different?

You need to change the selector, because the HTML that gets downloaded is different from what Inspect Element shows.
import requests
from bs4 import BeautifulSoup
import csv
import sys
url = 'https://www.valueresearchonline.com/funds/newsnapshot.asp?schemecode=22'
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
regular_direct = soup.find("select", {"id":"select-plan"}).find("option",{"selected":"selected"}).get_text(strip=True)
print(regular_direct)
Output:
Regular plan
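The same lookup can be tried offline against a saved copy of the HTML. The following is only a minimal sketch: SAMPLE_HTML is made-up markup shaped like the select-plan dropdown, and selected_option_text is a hypothetical helper name, not something from the site or from BeautifulSoup. It matches any option that carries a selected attribute, whatever its value, and returns None instead of raising when the element is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assumed to resemble the dropdown in the downloaded HTML.
SAMPLE_HTML = """
<select id="select-plan">
  <option value="a" selected>Regular Plan</option>
  <option value="b">Direct Plan</option>
</select>
"""


def selected_option_text(html, select_id):
    """Return the text of the selected <option>, or None if it is not found."""
    soup = BeautifulSoup(html, "html.parser")
    select = soup.find("select", id=select_id)
    if select is None:
        return None
    # selected=True matches any <option> that has a "selected" attribute.
    option = select.find("option", selected=True)
    if option is None:
        return None
    return option.get_text(strip=True)


print(selected_option_text(SAMPLE_HTML, "select-plan"))
```

Checking for None at each step avoids the AttributeError you would otherwise get from calling .text on a failed find().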

Related

Can't scrape <h3> tag from page

It seems I can scrape any tag and class except h3 on this page. It keeps returning None or an empty list. I'm trying to get this h3 tag:
...on the following webpage:
https://www.empireonline.com/movies/features/best-movies-2/
And this is the code I use:
import requests
from pprint import pprint
from bs4 import BeautifulSoup
URL = "https://www.empireonline.com/movies/features/best-movies-2/"
response = requests.get(URL)
web_html = response.text
soup = BeautifulSoup(web_html, "html.parser")
movies = soup.find_all(name="h3", class_="jsx-4245974604")
movies_text = []
for item in movies:
    result = item.getText()
    movies_text.append(result)
print(movies_text)
Can you please help with the solution for this problem?

As other people mentioned, this is dynamic content that is generated only when the page runs in a browser. Therefore you can't find the class "jsx-4245974604" with BS4.
If you print your soup variable, you can see that the class is not in it. But if you simply want the names of the movies, you can use another part of the HTML in this case.
The movie name is in the alt attribute of the picture (and also in many other parts of the HTML).
import requests
from pprint import pprint
from bs4 import BeautifulSoup
URL = "https://www.empireonline.com/movies/features/best-movies-2/"
response = requests.get(URL)
web_html = response.text
soup = BeautifulSoup(web_html, "html.parser")
movies = soup.find_all("img", class_="jsx-952983560")
movies_text = []
for item in movies:
    # The movie name is stored in the image's alt attribute
    result = item.get('alt')
    movies_text.append(result)
print(movies_text)
If you run into this issue in the future, print the initial HTML that soup holds and check by eye whether the information you need is there.
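Checking by eye can also be done in code. The sketch below is an illustration, not part of the original answer: RAW_HTML is made-up markup standing in for what requests returns, and missing_classes is a hypothetical helper that reports which tag and class pairs are absent from it.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML returned by requests (before any JavaScript runs).
RAW_HTML = """
<div id="root">
  <img class="jsx-952983560" alt="Example Movie">
</div>
<script src="/bundle.js"></script>
"""


def missing_classes(html, tag_classes):
    """Return the (tag, class) pairs that do not occur in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (tag, cls)
        for tag, cls in tag_classes
        if soup.find(tag, class_=cls) is None
    ]


wanted = [("h3", "jsx-4245974604"), ("img", "jsx-952983560")]
print(missing_classes(RAW_HTML, wanted))
```

Anything the helper reports as missing is either generated later by scripts or named differently in the raw HTML, so the selector has to target something else.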

How do I get the inspect element code instead of the page source when they are both different?

I was trying to get all the links from the inspect element code of this website with the following code.
import requests
from bs4 import BeautifulSoup
url = 'https://chromedriver.storage.googleapis.com/index.html?path=97.0.4692.71/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
for link in soup.find_all('a'):
    print(link)
However, I got no links. I then printed soup and compared it with what I saw on the actual website under Inspect Element and View Page Source. The output of print(soup) matched View Page Source, but it did not match Inspect Element. Firstly, how do I get the Inspect Element code instead of the page source code? Secondly, why are the two different?

Just use the other URL mentioned in the comments and parse the XML with BeautifulSoup.
For example:
import requests
from bs4 import BeautifulSoup
url = "https://chromedriver.storage.googleapis.com/?delimiter=/&prefix=97.0.4692.71/"
soup = BeautifulSoup(requests.get(url).text, features="xml").find_all("Key")
keys = [f"https://chromedriver.storage.googleapis.com/{k.getText()}" for k in soup]
print("\n".join(keys))
Output:
https://chromedriver.storage.googleapis.com/97.0.4692.71/chromedriver_linux64.zip
https://chromedriver.storage.googleapis.com/97.0.4692.71/chromedriver_mac64.zip
https://chromedriver.storage.googleapis.com/97.0.4692.71/chromedriver_mac64_m1.zip
https://chromedriver.storage.googleapis.com/97.0.4692.71/chromedriver_win32.zip
https://chromedriver.storage.googleapis.com/97.0.4692.71/notes.txt
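BeautifulSoup's features="xml" mode needs lxml installed. If that is not available, the same Key extraction can be sketched with the standard library instead. This is an alternative technique, not the answer's method: SAMPLE_LISTING is a made-up, shortened listing whose element names and namespace are assumptions about the real response, and listing_urls is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

BASE_URL = "https://chromedriver.storage.googleapis.com/"

# Hypothetical, shortened bucket listing; the real response is assumed to look similar.
SAMPLE_LISTING = """<ListBucketResult xmlns="http://doc.s3.amazonaws.com/2006-03-01">
  <Contents><Key>97.0.4692.71/chromedriver_linux64.zip</Key></Contents>
  <Contents><Key>97.0.4692.71/notes.txt</Key></Contents>
</ListBucketResult>
"""


def listing_urls(xml_text, base_url=BASE_URL):
    """Return a download URL for every <Key> element in the listing."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        # Namespaced tags look like "{namespace}Key"; keep only the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "Key" and element.text:
            urls.append(base_url + element.text)
    return urls


print("\n".join(listing_urls(SAMPLE_LISTING)))
```

Comparing local names keeps the helper working whether or not the listing declares a default namespace.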

Trying to scrape Aliexpress

So I am trying to scrape the price of a product on Aliexpress. I tried inspecting the element which looks like
<span class="product-price-value" itemprop="price" data-spm-anchor-id="a2g0o.detail.1000016.i3.fe3c2b54yAsLRn">US $14.43</span>
I'm trying to run the following code
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
url = 'https://www.aliexpress.com/item/32981494236.html?spm=a2g0o.productlist.0.0.44ba26f6M32wxY&algo_pvid=520e41c9-ba26-4aa6-b382-4aa63d014b4b&algo_expid=520e41c9-ba26-4aa6-b382-4aa63d014b4b-22&btsid=0bb0623b16170222520893504e9ae8&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_'
source = urlopen(url).read()
soup = BeautifulSoup(source, 'lxml')
soup.find('span', class_='product-price-value')
but I keep getting a blank output. I must be doing something wrong but these methods seem to work in the tutorials I've seen.

Here is what I found. As I understand it, the data on the page you linked is loaded by scripts; the original HTML does not contain the price element, only script tags, so I used split to pull the value out of a script. Here is my code:
from bs4 import BeautifulSoup
import requests
url = 'https://aliexpress.ru/item/1005002281350811.html?spm=a2g0o.productlist.0.0.42d53b59T5ddTM&algo_pvid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5&algo_expid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5-1&btsid=0b8b035c16170960366785062e33c0&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_&sku_id=12000019900010138'
data = requests.get(url)
soup = BeautifulSoup(data.content, features="lxml")
res = soup.findAll("script")
total_value = str(res[-3]).split("totalValue:")[1].split("}")[0].replace("\"", "").replace(".", "").strip()
print(total_value)
It works fine; I tried it on a few AliExpress pages.
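A chain of split calls breaks as soon as the script layout shifts, so a regular expression is one possible alternative. The sketch below is only an illustration: SAMPLE_SCRIPT is made-up script text assumed to contain a totalValue field like the one above, and extract_total_value is a hypothetical helper. Unlike the snippet above, it returns the value unchanged and leaves any cleanup to the caller.

```python
import re

# Hypothetical script text; the real page is assumed to embed a similar field.
SAMPLE_SCRIPT = 'var pageData = {data: {price: {totalValue:"US $14.43"}}};'


def extract_total_value(script_text):
    """Return the quoted value that follows totalValue:, or None if absent."""
    match = re.search(r'totalValue:\s*"([^"]*)"', script_text)
    return match.group(1) if match else None


print(extract_total_value(SAMPLE_SCRIPT))
```

Returning None when the field is missing makes it easy to loop over every script tag and keep the first hit, instead of relying on a fixed index such as res[-3].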

Python BeautifulSoup cannot find table ID

I am running into some trouble scraping a table using BeautifulSoup. Here is my code
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page,"html.parser")
stats = soup.find('table', id = 'totals')
print(stats)
Output:
None
When I right click on the table to inspect the element the HTML looks as I'd expect, however when I view the source the only element with id = 'totals' is commented out. Is there a way to scrape a table from the commented source code?
I have referenced this post but can't seem to replicate their solution.
Here is a link to the webpage I am interested in. I'd like to scrape the table labeled "Totals" and store it as a data frame.
I am relatively new to Python, HTML, and web scraping. Any help would be greatly appreciated.
Thanks in advance.
Michael

Comments are string instances in BeautifulSoup. You can use BeautifulSoup's find method with a regular expression to find the particular string that you're after. Once you have the string, have BeautifulSoup parse that and there you go.
In other words,
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page,"html.parser")
stats_html = soup.find(string=re.compile('id="totals"'))
stats_soup = BeautifulSoup(stats_html, "html.parser")
print(stats_soup.table.caption.text)
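To go from the commented-out table to rows of data, one option is to collect every Comment node and re-parse the one that holds the table. This is a sketch under assumptions: SAMPLE_HTML is made-up markup imitating a table wrapped in a comment, and table_rows_from_comment is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup imitating a table that is wrapped in an HTML comment.
SAMPLE_HTML = """
<div id="all_totals">
  <!--
  <table id="totals">
    <caption>Totals</caption>
    <tr><th>Player</th><th>G</th></tr>
    <tr><td>Player A</td><td>36</td></tr>
  </table>
  -->
</div>
"""


def table_rows_from_comment(html, table_id):
    """Return the rows of a commented-out table as lists of cell text, or None."""
    soup = BeautifulSoup(html, "html.parser")
    comments = soup.find_all(string=lambda s: isinstance(s, Comment))
    for comment in comments:
        # A comment is just a string, so it has to be parsed again as HTML.
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return [
                [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
                for row in table.find_all("tr")
            ]
    return None


print(table_rows_from_comment(SAMPLE_HTML, "totals"))
```

The first row holds the headers, so the result can be passed to pandas as pd.DataFrame(rows[1:], columns=rows[0]) if a data frame is wanted.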

You can do this:
from urllib.request import urlopen  # Python 3; the original used Python 2's urllib2
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page, "lxml")
stats = soup.find_all('div', id='all_totals')
print(stats)
Please inform me if I helped!

Scraping using Inspect element

I am trying to get some information from Instagram by scraping it. I tried this code on Twitter and it worked fine, but it shows no result on Instagram. Both versions of the code are below.
Twitter code:
from bs4 import BeautifulSoup
from urllib.request import urlopen  # urllib2.urlopen in Python 2
theurl = "https://twitter.com/realmadrid"
thepage = urlopen(theurl)
soup = BeautifulSoup(thepage,"html.parser")
print(soup.find('div',{"class":"ProfileHeaderCard"}))
Result: Perfectly given.
Instagram Code:
from bs4 import BeautifulSoup
from urllib.request import urlopen  # urllib2.urlopen in Python 2
theurl = "https://www.instagram.com/barackobama/"
thepage = urlopen(theurl)
soup = BeautifulSoup(thepage,"html.parser")
print(soup.find('div',{"class":"_bugdy"}))
Result: None

If you look at the source, you will see the content is dynamically loaded, so there is no div._bugdy in what your request returns. Depending on what you want, you may be able to pull it from the JSON embedded in a script tag:
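import json
import re
from pprint import pprint as pp
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.instagram.com/barackobama/")
soup = BeautifulSoup(r.content, "html.parser")
# Find the <script> whose content contains the window._sharedData assignment
js = soup.find("script", string=re.compile("window._sharedData")).string
# Keep only the JSON object between the first "{" and the last "}"
_json = json.loads(js[js.find("{"):js.rfind("}") + 1])
pp(_json)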
That gives you everything you see in the <script type="text/javascript">window._sharedData = ..... in the source returned.
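The JSON extraction step can be tried without a network request. The sketch below is an illustration only: SAMPLE_SCRIPT is made-up script text with invented keys, and parse_shared_data is a hypothetical helper that uses a regular expression in place of the find/rfind slicing.

```python
import json
import re

# Hypothetical script content; the keys are invented for the example.
SAMPLE_SCRIPT = 'window._sharedData = {"config": {"viewer": null}, "country_code": "US"};'


def parse_shared_data(script_text):
    """Return the object assigned to window._sharedData as a dict, or None."""
    match = re.search(
        r"window\._sharedData\s*=\s*(\{.*\})\s*;?\s*$",
        script_text,
        re.DOTALL,
    )
    if match is None:
        return None
    return json.loads(match.group(1))


data = parse_shared_data(SAMPLE_SCRIPT)
print(data["country_code"])
```

Anchoring the pattern on the assignment makes the helper return None for unrelated scripts rather than failing inside json.loads.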
If you want to get the followers, you will need to use something like Selenium, because the site is almost entirely dynamically loaded content. To get the followers you need to click a link that is only visible when you are logged in. This will get you closer to what you want:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
login = "https://www.instagram.com"
dr = webdriver.Chrome()
dr.get(login)
# find_element(By...) is the Selenium 4 form; the find_element_by_* helpers were removed in recent releases
# XPath attribute tests use "@", not "#"
dr.find_element(By.XPATH, "//a[@class='_k6cv7']").click()
dr.find_element(By.XPATH, "//input[@name='username']").send_keys("youruname")
dr.find_element(By.XPATH, "//input[@name='password']").send_keys("yourpass")
dr.find_element(By.CSS_SELECTOR, "button._aj7mu._taytv._ki5uo._o0442").click()
time.sleep(5)
dr.get("https://www.instagram.com/barackobama")
dr.find_element(By.CSS_SELECTOR, 'a[href="/barackobama/followers/"]').click()
time.sleep(3)
popup = dr.find_element(By.CSS_SELECTOR, "div._n3cp9._qjr85")
# The leading "." keeps the search inside the popup; a bare "//ul/li" would search the whole page
for li in popup.find_elements(By.XPATH, ".//ul/li"):
    print(li.text)
That pulls some text from the li tags that appear in the popup after you click the link; you can pull whatever you want from the unordered list.

First of all there seems to be a typo in the address on line 3.
from bs4 import BeautifulSoup
from urllib.request import urlopen  # urllib2.urlopen in Python 2
theurl = "https://www.instagram.com/barackobama/"
thepage = urlopen(theurl)
soup = BeautifulSoup(thepage,"html.parser")
print(soup.find('div',{"class":"_bugdy"}))
Secondly, since you are working with dynamically loaded content, Python might not be able to see all the content you see when browsing the page in your browser.
To solve that, there are tools such as Selenium WebDriver (http://www.seleniumhq.org/projects/webdriver/) and PhantomJS (http://phantomjs.org/), which emulate the browser and can wait for JavaScript to generate or display data before looking it up.
