I am doing a project on web crawling for which I need to find all links within a given web page. Till now I was using urljoin in urllib.parse. But now I found that some links are not properly joined using the urljoin function.
For example, the <a> tag might be something like <a href="a.xml?value=basketball">A</a>. The complete address, however, might be http://www.example.org/main/test/a.xml?value=basketball, but the urljoin function gives a wrong result (something like http://www.example.com/a.xml?value=basketball).
Code which I am using:
parentUrl = urlQueue.get()
html = get_page_source(parentUrl)
bSoup = BeautifulSoup(html, 'html.parser')
aTags = bSoup.find_all('a', href=True)
for aTag in aTags:
    childUrl = aTag.get('href')
    # just to check if the url is complete or not (for .com only)
    if '.com' not in childUrl:
        # this urljoin is giving invalid results as mentioned above
        childUrl = urljoin(parentUrl, childUrl)
Is there any way to correctly join two URLs, including these cases?
Just a small tweak gets this working: in your case, pass the base URI with a trailing slash. Everything you need for this is in the urlparse docs (in Python 3 the module is urllib.parse).
>>> import urlparse
>>> urlparse.urljoin('http://www.example.org/main/test','a.xml?value=basketball')
'http://www.example.org/main/a.xml?value=basketball'
>>> urlparse.urljoin('http://www.example.org/main/test/','a.xml?value=basketball')
'http://www.example.org/main/test/a.xml?value=basketball'
BTW: this is a perfect use case for factoring the URL-building code out into a separate function. Then write some unit tests to verify that it works as expected, including your edge cases, and afterwards use it in your web crawler code.
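A minimal sketch of that suggestion for Python 3, where urljoin lives in urllib.parse. The helper name build_child_url is hypothetical, and the checks reuse the example.org URLs from the session above; treat it as a starting point rather than a finished crawler component.

```python
from urllib.parse import urljoin


def build_child_url(parent_url, href):
    """Resolve an href found on parent_url into an absolute URL.

    Hypothetical helper: it only wraps urljoin so that the joining
    rule lives in one place and can be unit tested.
    """
    return urljoin(parent_url, href)


# Base ends with a slash: the last segment is kept as a directory.
assert (build_child_url("http://www.example.org/main/test/", "a.xml?value=basketball")
        == "http://www.example.org/main/test/a.xml?value=basketball")

# No trailing slash: "test" is treated as a file name and replaced.
assert (build_child_url("http://www.example.org/main/test", "a.xml?value=basketball")
        == "http://www.example.org/main/a.xml?value=basketball")

# An absolute href comes back unchanged, so no '.com' check is needed.
assert (build_child_url("http://www.example.org/main/test/", "http://www.example.com/b.html")
        == "http://www.example.com/b.html")
```

Because urljoin already leaves absolute hrefs alone, the helper can be called on every link without first testing whether the URL looks complete.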
I am currently learning web scraping with python. I'm reading Web scraping with Python by Ryan Mitchell.
I am stuck at Crawling Sites Through Search. For example, reuters search given in the book works perfectly but when I try to find it by myself, as I will do in the future, I get this link.
Whilst the second link works for a human, I cannot figure out how to scrape it because of weird class names like class="media-story-card__body__3tRWy".
The first link gives me simple names, like this class="search-result-content" that I can scrape.
I've encountered the same problem on other sites too. How would I go about scraping it or finding a link with normal names in the future?
Here's my code example:
from bs4 import BeautifulSoup
import requests
from rich.pretty import pprint
text = "hello"
url = f"https://www.reuters.com/site-search/?query={text}"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
results = soup.select("div.media-story-card__body__3tRWy")
for result in results:
    pprint(result)
    pprint("###############")
You might resort to a prefix attribute value selector, like
div[class^="media-story-card__body__"]
This assumes that the class is the only one (or at least the first one written in the attribute). However, the idea can be extended to checking for a substring by using *= instead of ^=.
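As a rough illustration of the difference between the prefix and the substring form, here is a small sketch run against made-up HTML. The second hashed suffix and the extra "card" class are assumptions for the example, not taken from the Reuters page.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the 9xYz1 suffix and the extra "card" class are illustrative.
html = """
<div class="media-story-card__body__3tRWy">first</div>
<div class="card media-story-card__body__9xYz1">second</div>
<div class="search-result-content">other</div>
"""
soup = BeautifulSoup(html, "html.parser")

# ^= only matches when the class attribute starts with the prefix.
prefix_hits = soup.select('div[class^="media-story-card__body__"]')

# *= matches the text anywhere inside the class attribute.
substring_hits = soup.select('div[class*="media-story-card__body__"]')

print([d.get_text() for d in prefix_hits])     # ['first']
print([d.get_text() for d in substring_hits])  # ['first', 'second']
```

Note that a selector like this only helps if the elements are present in the HTML that requests downloads; content rendered by JavaScript will not be there.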
I'm working on parsing this web page.
I've got table = soup.find("div",{"class","accordions"}) to get just the fixtures list (and nothing else) however now I'm trying to loop through each match one at a time. It looks like each match starts with an article element tag <article role="article" about="/fixture/arsenal/2018-apr-01/stoke-city">
However for some reason when I try to use matches = table.findAll("article",{"role","article"})
and then print the length of matches, I get 0.
I've also tried to say matches = table.findAll("article",{"about","/fixture/arsenal"}) but get the same issue.
Is BeautifulSoup unable to parse tags, or am I just using it wrong?
Try this:
matches = table.findAll('article', attrs={'role': 'article'})
The reason is that {"role","article"} is a set literal, not a dictionary, so findAll does not read it as role="article". Refer to the bs4 docs.
You need to pass the attributes as a dictionary. There are three ways in which you can get the data you want.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.arsenal.com/fixtures')
soup = BeautifulSoup(r.text, 'lxml')
matches = soup.find_all('article', {'role': 'article'})
print(len(matches))
# 16
Or, this is also the same:
matches = soup.find_all('article', role='article')
But both of these methods also return some extra article tags that are not Arsenal fixtures. So, if you want to find them using /fixture/arsenal, you can use CSS selectors. (find_all with a plain string won't work here, as you need a partial match.)
matches = soup.select('article[about^="/fixture/arsenal"]')
print(len(matches))
# 13
Also, have a look at the keyword arguments. It'll help you get what you want.
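The three approaches can be compared offline with a small sketch. The HTML below is made up, modelled on the <article> tag quoted in the question; the /news/some-story entry is an assumption added to show the filtering.

```python
from bs4 import BeautifulSoup

# Made-up markup modelled on the <article> tag quoted in the question.
html = """
<div class="accordions">
  <article role="article" about="/fixture/arsenal/2018-apr-01/stoke-city"></article>
  <article role="article" about="/news/some-story"></article>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# {'role': 'article'} is a dict (key: value); {"role", "article"} would be a set.
by_dict = soup.find_all("article", {"role": "article"})
by_keyword = soup.find_all("article", role="article")

# Quoting the attribute value keeps the selector valid even though it contains slashes.
by_prefix = soup.select('article[about^="/fixture/arsenal"]')

print(len(by_dict), len(by_keyword), len(by_prefix))  # 2 2 1
```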
I have the following code, whose purpose is to parse specific information from each of multiple pages. The URLs of these pages follow a fixed pattern, so I use that pattern to collect all the links at once for further parsing.
import urllib
import urlparse
import re
from bs4 import BeautifulSoup
Links = ["http://www.newyorksocialdiary.com/party-pictures?page=" + str(i) for i in range(2,27)]
This command gives me a list of http links. I go further to read in and make soups.
Rs = [urllib.urlopen(Link).read() for Link in Links]
soups = [BeautifulSoup(R) for R in Rs]
These produce the soups I want, but I cannot achieve the final goal of parsing the link structure. For instance,
<a href="/party-pictures/2007/something-for-everyone">Something for Everyone</a>
I am specifically interested in obtaining things like this: '/party-pictures/2007/something-for-everyone'. However, the code below cannot serve this purpose.
As = [soup.find_all('a', attr = {"href"}) for soup in soups]
Could someone tell me where I went wrong? I would highly appreciate your assistance. Thank you.
I am specifically interested in obtaining things like this: '/party-pictures/2007/something-for-everyone'.
The next would be going for regular expression!!
You don't necessarily need to use regular expressions, and, from what I understand, you can filter out the desired links with BeautifulSoup:
[[a["href"] for a in soup.select('a[href*=party-pictures]')]
for soup in soups]
This, for example, would give you the list of links having party-pictures inside the href. *= means "contains", select() is a CSS selector search.
You can also use find_all() and apply the regular expression filter, for example:
pattern = re.compile(r"/party-pictures/2007/")
[[a["href"] for a in soup.find_all('a', href=pattern)]
for soup in soups]
This should work :
As = [soup.find_all(href=True) for soup in soups]
This should give you all tags that have an href attribute.
If you only want <a> tags that have an href, then the following would work:
As = [soup.find_all('a',href=True) for soup in soups]
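To go from the matched tags to the href strings themselves, something like the following sketch could be used. The HTML fragment is made up; only the first href comes from the question.

```python
import re
from bs4 import BeautifulSoup

# Made-up page fragment; only the first href comes from the question.
html = """
<a href="/party-pictures/2007/something-for-everyone">Something for Everyone</a>
<a href="/about">About</a>
<a name="top">no href here</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
all_hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# A compiled pattern is searched for within each href value.
party_hrefs = [a["href"] for a in soup.find_all("a", href=re.compile(r"^/party-pictures/"))]

print(all_hrefs)    # ['/party-pictures/2007/something-for-everyone', '/about']
print(party_hrefs)  # ['/party-pictures/2007/something-for-everyone']
```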
Hello there stack community!
I'm having an issue that I can't seem to resolve since it looks like most of the help out there is for Python 2.7.
I want to pull a table from a webpage and then just get the linktext and not the whole anchor.
Here is the code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url = 'http://www.craftcount.com/category.php?cat=5'
html = urlopen(url).read()
soup = BeautifulSoup(html)
alltables = soup.findAll("table")
## This bit captures the input from the previous sequence
results=[]
for link in alltables:
    rows = link.findAll('a')
    ## Find just the names
    top100 = re.findall(r">(.*?)<\/a>",rows)
    print(top100)
When I run it, I get: "TypeError: expected string or buffer"
Up to the second to the last line, it does everything correctly (when I swap out 'print(top100)' for 'print(rows)').
As an example of the response I get:
michellechangjewelry
And I just need to get:
michellechangjewelry
According to pythex.org, my (ir)regular expression should work, so I wanted to see if anyone out there knew how to do that. As an additional issue, it looks like most people like to go the other way, that is, from having the full text and only wanting the URL part.
Finally, I'm using BeautifulSoup out of "convenience", but I'm not beholden to it if you can suggest a better package to narrow down the parsing to the linktext.
Many thanks in advance!!
BeautifulSoup results are not strings; they are Tag objects, mostly.
To get the text of the <a> tags, use the .string attribute:
for table in alltables:
    link = table.find('a')
    top100 = link.string
    print(top100)
This finds the first <a> link in a table. To find all text of all links:
for table in alltables:
    links = table.find_all('a')
    top100 = [link.string for link in links]
    print(top100)
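One caveat worth sketching: .string is None when a tag has more than one child node, in which case get_text() may be the safer choice. The table below is made up; the shop name is the one from the question and the example.com URLs are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up rows; the example.com URLs are placeholders, not the real links.
html = """
<table>
  <tr><td><a href="http://example.com/shop1">michellechangjewelry</a></td></tr>
  <tr><td><a href="http://example.com/shop2"><b>bold</b> shop</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

# .string is None when a tag has more than one child node.
print([link.string for link in links])      # ['michellechangjewelry', None]

# get_text() concatenates all nested text instead.
print([link.get_text() for link in links])  # ['michellechangjewelry', 'bold shop']
```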
I'm working on something that requires me to get all the URLs on a page. It seems to work on most websites I've tested, for example microsoft.com, but it only returns three from google.com. Here is the relevant source code:
import urllib
import time
import re
fwcURL = "http://www.microsoft.com" #URL to read
mylines = urllib.urlopen(fwcURL).readlines()
print "Found URLs:"
time.sleep(1) #Pause execution for a bit
for item in mylines:
    if "http://" in item.lower(): #For http
        print item[item.index("http://"):].split("'")[0].split('"')[0] # Remove ' and " from the end, for example in href=
    if "https://" in item.lower(): #For https
        print item[item.index("https://"):].split("'")[0].split('"')[0] # Ditto
If my code can be improved, or if there is a better way to do this, please respond. Thanks in advance!
Try using Mechanize, BeautifulSoup or lxml.
With BeautifulSoup, you can get at all the HTML/XML content very easily.
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("some_url")
soup = BeautifulSoup(page.read())
links = soup.findAll("a")
for link in links:
    print link["href"]
BeautifulSoup is very easy to learn and understand.
First off, HTML is not a regular language, and no amount of simple string manipulation like that is going to work on all pages. You need a real HTML parser. I'd recommend lxml. Then it's just a matter of recursing through the tree and finding the elements you want.
Second, some pages may be dynamic, so you won't find all of the contents in the html source. Google makes heavy use of javascript and AJAX (notice how it displays results without reloading the page).
I would use lxml and do:
import lxml.html
page = lxml.html.parse('http://www.microsoft.com').getroot()
anchors = page.findall('.//a')
It's worth noting that if links are dynamically generated (via JS or similar), then you won't get those short of automating a browser in some fashion.
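If installing a third-party parser is not an option, the same idea can be sketched with the standard library alone. This swaps in Python 3's html.parser for lxml or BeautifulSoup; the LinkCollector class name and the example.org URLs are hypothetical, and like the other approaches it only sees links present in the static HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href=...> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag and attribute names in lower case.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


# Inline HTML stands in for a downloaded page so the sketch runs offline.
html = """
<a href="http://www.example.org/a">absolute</a>
<a href='/b'>root-relative, single quotes</a>
<A HREF="c.html">upper-case tag</A>
<a name="top">no href</a>
"""
collector = LinkCollector("http://www.example.org/dir/")
collector.feed(html)
print(collector.links)
# ['http://www.example.org/a', 'http://www.example.org/b', 'http://www.example.org/dir/c.html']
```

Unlike splitting lines on "http://", this handles single or double quotes, upper-case tags and relative links, because the parser tokenizes the markup before any link is extracted.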