I would like to use the Mechanize Python library to follow certain links on a website, but the only links I'm interested in are the ones inside a <div> tag. This question is related, but they achieve it using the lxml parser, which I'm not familiar with; I'm more comfortable using BeautifulSoup.
I've located the relevant links using BeautifulSoup already, but I don't know how to use Mechanize (or something else) to follow these links. Is there a way to pass a string to Mechanize so that it will follow it?
The simple open() on a Browser instance should be enough:
import mechanize

br = mechanize.Browser()
br.open('http://google.com')
import mechanize
response = mechanize.urlopen("http://example.com/")
content = response.read()  # the content is the HTML source of the page
Or, if you want to add things like headers:
import mechanize
request = mechanize.Request("http://example.com/")
response = mechanize.urlopen(request)
content = response.read()  # the content is the HTML source of the page
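To connect this back to the question: you can extract the hrefs from the <div> you care about with BeautifulSoup and hand each one, as a plain URL string, to br.open(). A minimal sketch; the div id "content" and the start URL are assumptions, substitute your own selector:

```python
from bs4 import BeautifulSoup

def links_in_div(html, div_id):
    """Return the href of every <a> inside the <div> with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", {"id": div_id})
    if div is None:
        return []
    return [a["href"] for a in div.find_all("a", href=True)]

# mechanize's br.open() accepts a plain URL string, so the hrefs found
# by BeautifulSoup can be followed directly (start page hypothetical):
#
#   import mechanize
#   br = mechanize.Browser()
#   html = br.open("http://example.com/").read()
#   for url in links_in_div(html, "content"):
#       br.open(url)
```

In practice you would resolve relative hrefs with urllib.parse.urljoin against the page URL before opening them.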
I'm trying to scrape a website that has a table in it using bs4, but the content I'm getting back is not as complete as what I see in the browser's inspector. I cannot find the <tr> and <td> tags in it. How can I get the full content of that site, especially the tags for the table?
Here's my code:
from bs4 import BeautifulSoup
import requests
link = requests.get("https://pemilu2019.kpu.go.id/#/ppwp/hitung-suara/", verify=False)
src = link.content
soup = BeautifulSoup(src, "html.parser")
print(soup)
I expect the content to have the <tr> and <td> tags in it because they do exist when I inspect it, but I found none in the output.
Here's an image of the page showing the <tr> and <td> tags:
You should dump the contents of the text you're trying to parse to a file and look at it. This will tell you for sure what is and isn't there. Like this:
from bs4 import BeautifulSoup
import requests
link = requests.get("https://pemilu2019.kpu.go.id/#/ppwp/hitung-suara/", verify=False)
src = link.content  # bytes, so open the file in binary mode
with open("/tmp/content.html", "wb") as f:
    f.write(src)
soup = BeautifulSoup(src, "html.parser")
print(soup)
Run this code, and then look at the file "/tmp/content.html" (use a different path, obviously, if you're on Windows) to see what is actually in the file. You could probably do this with your browser, but this is the way to be most sure you know what you are getting. You could, of course, also just add print(src), but if it were me, I'd dump it to a file.
If the HTML you're looking for is not in the initial HTML that you're getting back, then that HTML is coming from somewhere else. The table may be built dynamically by JavaScript, or come from another URL, possibly one that calls an HTTP API to grab the table's HTML via parameters passed to the API endpoint.
You will have to reverse engineer the site's design to find where that HTML comes from. If it comes from JavaScript, you may be stuck short of scripting the execution of a browser so you can gain access programmatically to the DOM in the browser's memory.
I would recommend running a debugging proxy that will show you each HTTP request being made by your browser. You'll be able to see the contents of each request and response. If you can do this, you can find the URL that actually returns the content you're looking for, if such a URL exists. You'll have to deal with SSL certificates and such because this is an https endpoint; debugging proxies usually make that pretty easy. We use Charles. The standard browser developer tools might do this too, letting you see each request and response generated by a particular page load.
If you can discover the URL that actually returns the table HTML, then you can use that URL to grab it and parse it with BS.
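Once you have found the real URL in the proxy (or the browser's network tab), what you do next depends on whether it returns JSON or HTML. A hedged sketch; the Content-Type handling here is generic, not specific to this site:

```python
import json
from bs4 import BeautifulSoup

def extract_rows(body, content_type):
    """Turn a response body into structured data: parsed JSON if the
    endpoint returns JSON, otherwise the <tr> elements of the HTML."""
    if "json" in content_type:
        return json.loads(body)
    soup = BeautifulSoup(body, "html.parser")
    return soup.find_all("tr")

# Usage with requests, once the endpoint URL has been discovered:
#   resp = requests.get(discovered_url)
#   rows = extract_rows(resp.text, resp.headers.get("Content-Type", ""))
```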
How can I extract the source code of this page using Python (https://mobile.twitter.com/i/bookmarks)? The problem is that the actual page code does not appear.
import mechanicalsoup as ms
Browser = ms.StatefulBrowser()
Browser.open("https://mobile.twitter.com/login")
Browser.select_form('form[action="/sessions"]')
Browser["session[username_or_email]"] = 'email'
Browser["session[password]"] = 'password'
Browser.submit_selected()
Browser.open("https://mobile.twitter.com/i/bookmarks")
html = Browser.get_current_page()
print(html)
Use BeautifulSoup.
from urllib import request
from bs4 import BeautifulSoup
url_1 = "http://www.google.com"
page = request.urlopen(url_1)
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())
From this answer:
https://stackoverflow.com/a/43290890/11034096
Edit:
It looks like the issue is that Twitter is trying to use a JS redirect to load the next page. JS isn't supported by mechanicalsoup, so you'll need to try something like selenium.
The html variable that you are returning is actually a BeautifulSoup object, not the HTML text. I would try using:
print(html.text)
(note that .text is a property, not a method, and it returns only the text content with the tags stripped) to see if that will print directly.
Alternatively, from the BeautifulSoup documentation you should be able to use the non-pretty printing of:
str(html)
or, on Python 2 only,
unicode(html.a)
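A small self-contained illustration of the difference (Python 3; the markup snippet is made up):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <a href='/x'>world</a></p>", "html.parser")

markup = str(soup)  # the HTML markup itself, non-pretty serialized
text = soup.text    # only the text content, tags stripped
```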
I am currently trying to practice with the requests and BeautifulSoup Modules in Python 3.6 and have run into an issue that I can't seem to find any info on in other questions and answers.
It seems that at some point in the page, Beautiful Soup stops recognizing tags and IDs. I am trying to pull play-by-play data from a page like this:
http://www.pro-football-reference.com/boxscores/201609080den.htm
import requests, bs4
source_url = 'http://www.pro-football-reference.com/boxscores/201609080den.htm'
res = requests.get(source_url)
if '404' in res.url:
    raise Exception('No data found for this link: ' + source_url)
soup = bs4.BeautifulSoup(res.text,'html.parser')
#this works
all_pbp = soup.findAll('div', {'id' : 'all_pbp'})
print(len(all_pbp))
#this doesn't
table = soup.findAll('table', {'id' : 'pbp'})
print(len(table))
Using the inspector in Chrome, I can see that the table definitely exists. I have also tried it on <div>s and <tr>s in the latter half of the HTML, and it doesn't seem to work. I have tried the standard 'html.parser' as well as lxml and html5lib, but nothing seems to work.
Am I doing something wrong here, or is there something in the HTML or its formatting that prevents BeautifulSoup from correctly finding the later tags? I have run into issues with similar pages run by this company (hockey-reference.com, basketball-reference.com), but have been able to use these tools properly on other sites.
If it is something with the HTML, is there any better tool/library for helping to extract this info out there?
Thank you for your help,
BF
BS4 won't execute the JavaScript of a web page after doing the GET request for a URL. I think the table of concern is loaded asynchronously by client-side JavaScript.
As a result, the client-side JavaScript will need to run first before scraping the HTML. This post describes how to do so!
OK, I found the problem. You're trying to parse a comment, not an ordinary HTML element.
For such cases you should use Comment from BeautifulSoup, like this:
import requests
from bs4 import BeautifulSoup, Comment

source_url = 'http://www.pro-football-reference.com/boxscores/201609080den.htm'
res = requests.get(source_url)
if '404' in res.url:
    raise Exception('No data found for this link: ' + source_url)
soup = BeautifulSoup(res.content, 'html.parser')
# Comments are NavigableString subclasses, so filter on their type
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment = BeautifulSoup(str(comment), 'html.parser')
    search_play = comment.find('table', {'id': 'pbp'})
    if search_play:
        play_to_play = search_play
I am playing around and want to send myself an email when a new post appears in a forum thread, but when I open the URL with urllib.urlopen I get back the webpage without a page body. Can someone please tell me why this is the case, and how I can get the body?
def loadUrl(adress):
    adress = urllib.unquote(adress)
    print("Loading " + adress)
    socket = urllib.urlopen(adress)
    html = socket.read()
    socket.close()
    soup = BeautifulSoup(html)
    return soup
soup = loadUrl("http://de.pokerstrategy.com/forum/thread.php?threadid=498111")
In addition, I would recommend using PyQuery.
from pyquery import PyQuery
d = PyQuery("http://de.pokerstrategy.com/forum/thread.php?threadid=498111")
print(d("body").html())
EDIT: sorry, I didn't realise that you had posted the URL you were trying to retrieve. I get the same response as you, and am not sure why. I can't see anything in the JavaScript, as I had suggested below.
I tested your code and it seems to work fine. Perhaps the page you are trying to retrieve generates the body element via javascript or something similar. In this case I believe you can use something like selenium to emulate the browser.
I've had success using BeautifulSoup with urllib2, for example:
from urllib2 import urlopen
...
html = urlopen(...)
soup = BeautifulSoup(html)
I am trying to scrape the website here: ftp://ftp.sec.gov/edgar/daily-index/. Using the code as shown below:
from bs4 import BeautifulSoup
import urllib.request
html = urllib.request.urlopen("ftp://ftp.sec.gov/edgar/daily-index/")
soup = BeautifulSoup(html, "lxml")
soup.a  # or soup.find_all('a'); neither of them works,
# both come back empty.
Please help, I am really frustrated by this. My suspicion is that the tag is causing the problem. The site's HTML looks well formatted (matched tags), so I am lost as to why BeautifulSoup doesn't find anything. Thanks
The ftp://ftp.sec.gov/edgar/daily-index/ URL leads to a FTP directory, not an HTML page.
Your browser could generate HTML based on the FTP directory contents, but the server does not send you HTML when you load that resource with urllib.request.
You probably want to use the ftplib module directly instead to read the directory listing, or inspect the return value of urlopen(...).read() first.
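A sketch with ftplib, assuming the server allows anonymous login (note that the SEC has since retired this FTP endpoint, so treat the host and path as illustrative):

```python
from ftplib import FTP

def ftp_dir_names(host, path):
    """List the entries in an FTP directory; this is the data the
    browser renders as an HTML listing for ftp:// URLs."""
    with FTP(host) as ftp:
        ftp.login()        # anonymous login
        ftp.cwd(path)
        return ftp.nlst()  # bare entry names, one per directory item

# e.g. ftp_dir_names("ftp.sec.gov", "/edgar/daily-index/")
```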