Using urllib2 and BeautifulSoup not receiving the data I view in browser - python

I am trying to scrape a website:
http://www.gabar.org/membersearchresults.cfm?start=26&id=E640EC74-9C8E-9913-79DB5D9C376528C0
I know the link above will show that there are no search results, but when I do the search manually there are results.
The problem I am having: when I open this link in my browser I see the page I expect, but when I fetch it and parse it with Beautiful Soup, the output says something along the lines of "this search is not available."
I am new to this, so I'm not quite sure how this works. Do websites have things built in that make tools like urllib2/BeautifulSoup not work?
import urllib2
from bs4 import BeautifulSoup

File = urllib2.urlopen("http://www.gabar.org/membersearchresults.cfm?start=26&id=E640EC74-9C8E-9913-79DB5D9C376528C0")
Html = File.read()
File.close()

soup = BeautifulSoup(Html)
AllLinks = soup.find_all("a")

lawyerlinks = []
for link in soup.find_all("a"):
    lawyerlinks.append(link.get('href'))

lawyerlinks = lawyerlinks[76:100]
print lawyerlinks

That's fascinating. Going to the first page of results works, and then clicking "Next" works, and all it does is take you to the URL you posted. But if I visit that URL directly, I get no results.
Note that urllib2.urlopen is indeed behaving exactly like a browser here. If you open a browser directly to that page, you get no results - which is exactly what you get with urlopen.
What you want to do is mimic a browser, visit the first page of results, and then mimic clicking 'next' just like a browser would. The best library I know of for this is mechanize.
import mechanize

br = mechanize.Browser()
br.open("http://www.gabar.org/membersearchresults.cfm?id=ED162783-9C8E-9913-79DBE86CBE9FB115")
# Follow the "Next" link, just like clicking it in a browser.
response1 = br.follow_link(text_regex=r"Next", nr=0)
Html = response1.read()
# rest is the same

Related

How do I fix getting "None" as a response when web scraping?

So I am trying to write a small script that gets the view count from a YouTube video and prints it. However, when printing the text variable with the code below, I just get "None". Is there a way to get the actual view count using these libraries?
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
soup = BeautifulSoup(url.text, 'html.parser')
text = soup.find('span', {'class': "view-count style-scope ytd-video-view-count-renderer"})
print(text)
To see why, you should use wget or curl to fetch a copy of that page and look at it, or use "view source" from your browser. That's what requests sees. None of those classes appear in the HTML you get back. That's why you get None -- because there ARE none.
YouTube builds all of its pages dynamically, through Javascript. requests doesn't interpret Javascript. If you need to do this, you'll need to use something like Selenium to run a real browser with a Javascript interpreter built in.
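As a rough sketch of that approach, assuming chromedriver is installed and that the view count still lives in a span with a view-count class (YouTube's markup changes, so the selector may need adjusting):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes chromedriver is installed and on PATH
try:
    driver.get("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
    # Wait until the dynamically rendered view-count element exists.
    # "span.view-count" is an assumption about YouTube's current markup.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "span.view-count"))
    )
    print(element.text)
finally:
    driver.quit()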

python crawling text from <em></em>

Hi, I want to get the text (the number 18, a like count) from an em tag on the blog page below.
When I ran my code, it did not work and only gave me an empty list. Can anyone help me? Thank you~
Here is my code.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://blog.naver.com/kwoohyun761/221945923725'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
When you disable javascript you'll see that the like count is loaded dynamically, so you have to use a service that renders the website and then you can parse the content.
You can use an API: https://www.scraperapi.com/
Or run your own for example: https://github.com/scrapinghub/splash
EDIT:
First of all, I missed that you were using urlopen incorrectly; the correct way is described here: https://docs.python.org/3/howto/urllib2.html (assuming you are using Python 3, which seems to be the case judging by the print call).
Furthermore, looking at the issue again, it is a bit more complicated. When you look at the source code of the page, you'll see it actually loads an iframe, and that iframe holds the actual content. Hit Ctrl+U to see the source code of the original URL, since the site seems to block the browser context menu.
So in order to achieve your crawling objective you have to first grab the initial page and then grab the page you are interested in:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# original url
url = "https://blog.naver.com/kwoohyun761/221945923725"

with urlopen(url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'lxml')
iframe = soup.find('iframe')

# iframe grabbed, construct real url
print(iframe['src'])
real_url = "https://blog.naver.com" + iframe['src']

# do your crawling
with urlopen(real_url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
You might be able to avoid one round trip by analyzing the original url and the URL in the iframe. At first glance it looked like the iframe url can be constructed from the original url.
You'll still need a rendered version of the iframe url to grab your desired value.
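As a rough sketch of that last step, assuming you run a local Splash instance (linked above) on its default port 8050, you could render the iframe URL and parse the result:
import requests
from bs4 import BeautifulSoup

# Grab the outer blog page and build the iframe URL, as in the code above
# (using requests here just for brevity).
url = "https://blog.naver.com/kwoohyun761/221945923725"
soup = BeautifulSoup(requests.get(url).text, 'lxml')
real_url = "https://blog.naver.com" + soup.find('iframe')['src']

# Ask the local Splash instance to render the iframe page (JavaScript and all)
# and hand back the resulting HTML. Endpoint and parameters assume a default
# Splash setup, e.g. started with: docker run -p 8050:8050 scrapinghub/splash
rendered = requests.get(
    "http://localhost:8050/render.html",
    params={"url": real_url, "wait": 2},
)

soup = BeautifulSoup(rendered.text, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)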
I don't know what this site is about, but it seems they do not want to be crawled, so maybe you should respect that.

Why can't I scrape using beautiful soup?

I need to scrape the only table from this website: https://core.ap.gov.in/CMDashBoard/UserInterface/eAgriculture/eAgricultureVillagewise.aspx?mandal=Agali&district=Anantapuramu
I used Beautiful Soup and requests but was not successful. Can you suggest where I am going wrong?
import requests
import bs4
import pandas as pd

mandal_url = "https://core.ap.gov.in/CMDashBoard/UserInterface/eAgriculture/eAgricultureVillagewise.aspx?mandal=Agali&district=Anantapuramu"
r = requests.get(mandal_url, verify=False).content
soup = bs4.BeautifulSoup(r, 'lxml')
df = pd.read_html(str(soup.find('table', {"id": "gvAgricultureVillage"})))
I am getting 'Page not found' in the data frame. I don't know where I am going wrong!
The page likely requires some sort of sign-in. Viewing it myself by clicking on the link, I get the same 'Page not found' page.
You would need to add the cookies / some other headers to the request to appear "signed in".
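A minimal sketch of that idea, assuming you copy the Cookie (and any other required headers) out of a signed-in browser session via the developer tools; the header values below are placeholders, not real credentials:
import requests
import bs4
import pandas as pd

mandal_url = "https://core.ap.gov.in/CMDashBoard/UserInterface/eAgriculture/eAgricultureVillagewise.aspx?mandal=Agali&district=Anantapuramu"

# Placeholder values -- copy the real ones from a signed-in browser session
# (developer tools -> Network tab -> request headers).
headers = {
    "User-Agent": "Mozilla/5.0",
    "Cookie": "ASP.NET_SessionId=<your session id>",
}

r = requests.get(mandal_url, headers=headers, verify=False)
soup = bs4.BeautifulSoup(r.content, 'lxml')
df = pd.read_html(str(soup.find('table', {"id": "gvAgricultureVillage"})))[0]
print(df.head())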
Try clicking the link you're trying to scrape: it is an invalid link. When I click the link you provide, or the link you store in mandal_url, both return a 'Page not found' page. So you are scraping in the correct way, but the URL you provide to the scraper is invalid / not up anymore.
I couldn't get access to the website, but you can read table(s) on a webpage directly by using:
dfs = pd.read_html(your_url, header=0)
In the case that the url requires authentication, you can get the page by:
r = requests.get(url_need_authentication, auth=('myuser', 'mypasswd'))
pd.read_html(r.text, header=0)[1]
This will simplify your code. Hope it helps!

request.get url of second page of a search result

I am trying to use requests.get(url) to get the response of a url from a server.
The following code works for the url of the first page of a search result:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.epocacosmeticos.com.br/perfumes")
soup = BeautifulSoup(r.text)
However, when I try to use the same code for the url of the second page of the search result, which is "https://www.epocacosmeticos.com.br/perfumes#2",
r = requests.get("https://www.epocacosmeticos.com.br/perfumes#2")
soup = BeautifulSoup(r.text)
it returns the response of the first page. It ignores the '#2' at the end of the URL. How can I get the response of the second page of a search result?
You can use a web proxy like BurpSuite to view the requests made by the page. When you click on the "Page 2" button, this is what is being sent in the background:
GET /buscapagina?fq=C%3a%2f1000001%2f&PS=16&sl=f804bbc5-5fa8-4b8b-b93a-641c059b35b3&cc=4&sm=0&PageNumber=2 HTTP/1.1
Therefore, this is the url you will need to query if you want to properly scrape the website.
BurpSuite also allows you to play with the requests, so you can try changing the request (like changing the 2 for a 3) and see if you get the expected result.
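As a rough sketch, assuming the query-string values captured above are still valid for this particular search, you could reproduce that background request with requests and just vary PageNumber:
import requests
from bs4 import BeautifulSoup

# Query-string values copied from the captured /buscapagina request above;
# fq, PS, sl, cc and sm are assumed to stay constant for this search.
params = {
    "fq": "C:/1000001/",
    "PS": "16",
    "sl": "f804bbc5-5fa8-4b8b-b93a-641c059b35b3",
    "cc": "4",
    "sm": "0",
    "PageNumber": "2",  # bump this to fetch later result pages
}

r = requests.get("https://www.epocacosmeticos.com.br/buscapagina", params=params)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify()[:500])  # quick look at the HTML fragment for page 2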
It seems this website uses dynamic HTML. Because of this, the second results page is not a "new page", but the same page with the search content reloaded.
You probably won't be able to scrape it only using requests. This probably requires a browser. Selenium with PhantomJS or headless Chrome are good choices for this job, and after that you can use BeautifulSoup to parse.
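A minimal sketch of that approach with headless Chrome (PhantomJS is no longer maintained), assuming chromedriver is available; the URL and the sanity check are only illustrative:
from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without opening a window

driver = webdriver.Chrome(options=options)  # assumes chromedriver is installed
try:
    driver.get("https://www.epocacosmeticos.com.br/perfumes#2")
    # The browser runs the site's JavaScript, so page_source contains the
    # rendered results rather than the bare template requests would get.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    print(len(soup.find_all("a")))  # rough sanity check that content rendered
finally:
    driver.quit()
Whether a direct visit to the #2 URL actually renders the second page is something you'd need to verify; if it doesn't, you can have Selenium click the page-2 control before reading page_source.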

Scrape multiple pages with BeautifulSoup and Python

My code successfully scrapes the tr align=center tags from [ http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY ] and writes the td elements to a text file.
However, there are multiple pages available at the site above which I would like to scrape as well.
For example, with the url above, when I click the link to "page 2" the overall url does NOT change. I looked at the page source and saw JavaScript code to advance to the next page.
How can my code be changed to scrape data from all the available listed pages?
My code that works for page 1 only:
import bs4
import requests

response = requests.get('http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY')
soup = bs4.BeautifulSoup(response.text)
soup.prettify()

acct = open("/Users/it/Desktop/accounting.txt", "w")
for tr in soup.find_all('tr', align='center'):
    stack = []
    for td in tr.findAll('td'):
        stack.append(td.text.replace('\n', '').replace('\t', '').strip())
    acct.write(", ".join(stack) + '\n')
The trick here is to check the requests that are coming in and out of the page-change action when you click on the link to view the other pages. The way to check this is to use Chrome's inspection tool (via pressing F12) or installing the Firebug extension in Firefox. I will be using Chrome's inspection tool in this answer.
Now, what we want to see is either a GET request to another page or a POST request that changes the page. While the tool is open, click on a page number. For a really brief moment, there will only be one request that appears, and it's a POST method. All the other elements will quickly follow and fill the page. That POST request is what we're looking for.
Click on that POST request. It should bring up a sub-window of sorts that has tabs. Click on the Headers tab. This page lists the request headers, pretty much the identification stuff that the other side (the site, in this case) needs from you to be able to connect (someone else can explain this much better than I can).
Whenever the URL has variables like page numbers, location markers, or categories, more often than not, the site uses query strings. Long story short, a query string works much like a database query in that it tells the site which information to pull. If this is the case, you can check the request headers for the query string parameters; scroll down a bit and you should find them.
You'll see that the query string parameters match the variables in our URL. A little bit below that, there is a Form Data section with pageNum: 2 beneath it. This is the key.
POST requests are more commonly known as form requests because these are the kind of requests made when you submit forms, log in to websites, etc. Basically, pretty much anything where you have to submit information. What most people don't see is that POST requests have a URL that they follow. A good example of this is when you log in to a website and, very briefly, see your address bar morph into some sort of gibberish URL before settling on /index.html or some such.
What the above paragraph basically means is that you can (but not always) append the form data to your URL and the site will carry out the request for you on execution. To know the exact string you have to append, click on view source next to the form data.
Test if it works by appending it to the URL, e.g. http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=2.
Et voila, it works. Now, the real challenge: getting the last page automatically and scraping all of the pages. Your code is pretty much there. The only things remaining to be done are getting the number of pages, constructing a list of URLs to scrape, and iterating over them.
Modified code is below:
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

base_url = 'http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY'
r = rq.get(base_url)
soup = bsoup(r.text)

# Use regex to isolate only the links of the page numbers, the one you click on.
page_count_links = soup.find_all("a", href=re.compile(r".*javascript:goToPage.*"))
try:  # Make sure there is more than one page, otherwise, set to 1.
    num_pages = int(page_count_links[-1].get_text())
except IndexError:
    num_pages = 1

# Add 1 because Python range.
url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]

# Open the text file. Use with to save self from grief.
with open("results.txt", "wb") as acct:
    for url_ in url_list:
        print "Processing {}...".format(url_)
        r_new = rq.get(url_)
        soup_new = bsoup(r_new.text)
        for tr in soup_new.find_all('tr', align='center'):
            stack = []
            for td in tr.findAll('td'):
                stack.append(td.text.replace('\n', '').replace('\t', '').strip())
            acct.write(", ".join(stack) + '\n')
We use a regular expression to get the proper links. Then, using a list comprehension, we build a list of URL strings. Finally, we iterate over them.
Results:
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=1...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=2...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=3...
[Finished in 6.8s]
Hope that helps.
EDIT:
Out of sheer boredom, I think I just created a scraper for the entire class directory. Also, I updated both the code above and the code below so they don't error out when there is only a single page available.
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

spring_2015 = "http://my.gwu.edu/mod/pws/subjects.cfm?campId=1&termId=201501"
r = rq.get(spring_2015)
soup = bsoup(r.text)

classes_url_list = [c["href"] for c in soup.find_all("a", href=re.compile(r".*courses.cfm\?campId=1&termId=201501&subjId=.*"))]
print classes_url_list

# Open the text file. Use with to save self from grief.
with open("results.txt", "wb") as acct:
    for class_url in classes_url_list:
        base_url = "http://my.gwu.edu/mod/pws/{}".format(class_url)
        r = rq.get(base_url)
        soup = bsoup(r.text)
        # Use regex to isolate only the links of the page numbers, the one you click on.
        page_count_links = soup.find_all("a", href=re.compile(r".*javascript:goToPage.*"))
        try:
            num_pages = int(page_count_links[-1].get_text())
        except IndexError:
            num_pages = 1
        # Add 1 because Python range.
        url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]
        # Process each page of results for this class.
        for url_ in url_list:
            print "Processing {}...".format(url_)
            r_new = rq.get(url_)
            soup_new = bsoup(r_new.text)
            for tr in soup_new.find_all('tr', align='center'):
                stack = []
                for td in tr.findAll('td'):
                    stack.append(td.text.replace('\n', '').replace('\t', '').strip())
                acct.write(", ".join(stack) + '\n')
