I have a problem screen scraping with bs4. Here is my code:
from bs4 import BeautifulSoup
import urllib2
url="http://www.99acres.com/property-in-velachery-chennai-south-ffid?"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
properties=soup.findAll('a',{'title':'Bedroom'})
for eachproperty in properties:
    print eachproperty['href']+",", eachproperty.string
When I inspected the website, the actual title attribute looks like this for all the anchor links: 1 Bedroom, Residential Apartment in Velachery. But I get no output for this, and no error either. So how do I tell the program to scrape all the data whose title contains the word "Bedroom"?
Hope I made it clear.
You'll need a regular expression here, since you want to match anchor links that merely contain Bedroom in the title; passing a plain string only matches when it equals the whole title:
import re
properties = soup.find_all('a', title=re.compile('Bedroom'))
This gives 47 matches for the URL you've given.
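For reference, here's a minimal end-to-end sketch of that fix; it swaps in requests and Python 3 print, so treat it as an illustration rather than a drop-in for the Python 2 code above:
import re
import requests
from bs4 import BeautifulSoup

url = "http://www.99acres.com/property-in-velachery-chennai-south-ffid?"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
# title=re.compile(...) matches any anchor whose title *contains* "Bedroom"
for anchor in soup.find_all("a", title=re.compile("Bedroom")):
    print(anchor["href"] + ",", anchor.string)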
I am trying to scrape data to get the text I need. I want to find the line that says Aberdeen and all the lines after it, which contain the airport info.
I am trying to locate the text elements inside the class "i1" with this code:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.airportcodes.org/')
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('div',attrs={"class":"i1"})
print(table.text)
But I am not getting the values I expect at all. I am new to scraping, obviously.
The problem is your BeautifulSoup parser:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.airportcodes.org/')
soup = BeautifulSoup(page.text, 'lxml')  # 'lxml' handles this page's malformed HTML better than 'html.parser'
table = soup.find('div',attrs={"class":"i1"})
print(table.text)
If what you want is the text elements, you can use:
soup.get_text()
Note: this will give you all the text elements.
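If you want the text of every airport row rather than the whole page, a hedged variation of the code above (reusing the same soup) is to loop over all matching divs, assuming the page uses one i1 div per row:
# find_all returns every matching div, where find returns only the first
for div in soup.find_all('div', attrs={"class": "i1"}):
    print(div.get_text(strip=True))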
Why are people suggesting Selenium? This page doesn't load its data dynamically... requests + re is all you need; you don't even need BeautifulSoup:
import re
import requests

data = requests.get('http://www.airportcodes.org/').text  # .text gives a str, which re.findall needs
cities_and_codes = re.findall(r"([A-Za-z, ]+)\(([A-Z]{3})\)", data)
It just looks for any run of letters, commas and spaces, followed by exactly three uppercase letters in parentheses.
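Each match comes back as a (city, code) tuple, so usage looks something like this (the exact strings depend on the live page):
# print the first few city/code pairs
for city, code in cities_and_codes[:5]:
    print(city.strip(), '->', code)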
I would like to know how to get data from a website. I found a tutorial and ended up with this:
import os
import csv
import requests
from bs4 import BeautifulSoup
requete = requests.get("https://www.palabrasaleatorias.com/mots-aleatoires.php")
page = requete.content
soup = BeautifulSoup(page, "html.parser")  # an explicit parser avoids bs4's "no parser specified" warning
The tutorial says I should use something like this to get the string of a tag:
h1 = soup.find("h1", {"class": "ico-after ico-tutorials"})
print(h1.string)
But I have a problem: the tag whose text content I want doesn't have a class... what should I do?
I tried passing {}, but it doesn't work,
and neither does {"class": ""}.
In fact, it returns None.
I want to get the text content of this part of the website:
<div style="font-size:3em; color:#6200C5;">
Orchard</div>
Where Orchard is the random word
Thanks for any kind of help.
Unfortunately, BeautifulSoup doesn't give you many ways to point at an element here, and the page you are trying to scrape is terribly ill-suited for your task (no IDs, classes, or other useful HTML features to hook onto).
Hence, you should change the way you point at the html element and use an XPath expression, which you can't do with BeautifulSoup. To do that, just use the html module from the lxml package to parse the page. Below is a code snippet (based on the answers to this question) which extracts the random word in your example:
import requests
from lxml import html
requete = requests.get("https://www.palabrasaleatorias.com/mots-aleatoires.php")
tree = html.fromstring(requete.content)
rand_w = tree.xpath('/html/body/center/center/table[1]/tr/td/div/text()')
print(rand_w)
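Note that xpath() returns a list of matching text nodes, so you'll probably want the first element, stripped of surrounding whitespace:
# take the first matching text node, if any
word = rand_w[0].strip() if rand_w else None
print(word)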
I wrote this test code which uses BeautifulSoup.
url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html,"lxml")
for n in soup.find_all('p'):
print(n.get_text())
It works fine, but it also retrieves text that is not part of the news article, such as the time it was posted, the number of comments, copyright notices, etc.
I would like it to retrieve only text from the news article itself; how would one go about this?
You might have much better luck with the newspaper library, which is focused on scraping articles.
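A minimal sketch of that approach (assuming the Python 3 fork newspaper3k, installed via pip install newspaper3k):
from newspaper import Article

url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
article = Article(url)
article.download()
article.parse()
print(article.text)  # the article body, with most boilerplate stripped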
If we talk about BeautifulSoup only, one option to get closer to the desired result and have more relevant paragraphs is to find them in the context of the div element with the itemprop="articleBody" attribute:
article_body = soup.find(itemprop="articleBody")
for p in article_body.find_all("p"):
    print(p.get_text())
You'll need to target more specifically than just the p tag. Try looking for something like a div with class="article" or similar, then grab paragraphs only from there.
Be more specific: you need to catch the div carrying the itemprop="articleBody" attribute, so:
import urllib.request
from bs4 import BeautifulSoup
url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html,"lxml")
for n in soup.find_all('div', attrs={'itemprop':"articleBody"}):
    print(n.get_text())
Responses on SO are not just for you, but also for people coming from Google searches and such. As you can see, attrs is a dict, so it is possible to pass more attributes/values if needed.
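For example, a sketch passing two attributes at once (the class value here is hypothetical, just to show the shape of the dict):
# any number of attribute/value pairs can go in the attrs dict
divs = soup.find_all('div', attrs={'itemprop': 'articleBody', 'class': 'article-text'})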
I am learning to use both the re module and the urllib module in Python and am attempting to write a simple web scraper. Here's the code I've written to scrape just the titles of websites:
#!/usr/bin/python
import urllib
import re

urls = ["http://google.com", "https://facebook.com", "http://reddit.com"]
i = 0
these_regex = "<title>(.+?)</title>"
pattern = re.compile(these_regex)
while i < len(urls):
    htmlfile = urllib.urlopen(urls[i])
    htmltext = htmlfile.read()
    titles = re.findall(pattern, htmltext)
    print titles
    i += 1
This gives the correct output for Google and Reddit but not for Facebook - like so:
['Google']
[]
['reddit: the front page of the internet']
This is because, I found, on Facebook's page the title tag is as follows: <title id="pageTitle">. To accommodate the additional id=, I modified the these_regex variable as follows: these_regex="<title.+?>(.+?)</title>". But this gives the following output:
[]
['Welcome to Facebook \xe2\x80\x94 Log in, sign up or learn more']
[]
How would I combine both so that I can take into account any additional parameters passed within the title tag?
It is recommended that you use BeautifulSoup or another parser to parse HTML, but if you really want a regex, the following will do the job.
The regex:
<title.*?>(.+?)</title>
How it works: .*? lazily matches zero or more characters before the closing >, so the pattern matches a plain <title> as well as one carrying attributes like <title id="pageTitle">. The question's <title.+?> required at least one character after "title", which is why the plain <title> tags stopped matching.
Produces:
['Google']
['Welcome to Facebook - Log In, Sign Up or Learn More']
['reddit: the front page of the internet']
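In the question's script, only the pattern needs to change, e.g.:
these_regex = "<title.*?>(.+?)</title>"  # .*? tolerates any attributes inside the tag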
You are using a regular expression, and matching HTML with such expressions gets too complicated, too fast.
Use an HTML parser instead; Python has several to choose from. I recommend BeautifulSoup, a popular third-party library.
BeautifulSoup example:
import urllib2
from bs4 import BeautifulSoup

response = urllib2.urlopen(url)
# honour the charset declared in the response's Content-Type header
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
title = soup.find('title').text
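If you're on Python 3, a roughly equivalent sketch (urllib2 and getparam() are Python 2-only; get_content_charset() is the Python 3 counterpart):
from urllib.request import urlopen
from bs4 import BeautifulSoup

response = urlopen(url)
charset = response.info().get_content_charset()  # may be None; bs4 then detects the encoding itself
soup = BeautifulSoup(response.read(), "html.parser", from_encoding=charset)
title = soup.find("title").text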
Since a title tag itself doesn't contain other tags, you can get away with a regular expression here, but as soon as you try to parse nested tags, you will run into hugely complex issues.
Your specific problem can be solved by matching additional characters within the title tag, optionally:
r'<title[^>]*>([^<]+)</title>'
This matches 0 or more characters that are not the closing > bracket. The '0 or more' here lets you match both extra attributes and the plain <title> tag.
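A quick check that this pattern copes with both forms of the tag:
import re

pattern = r'<title[^>]*>([^<]+)</title>'
# one bare tag and one with an attribute, as on Facebook
for snippet in ['<title>Google</title>', '<title id="pageTitle">Welcome to Facebook</title>']:
    print(re.findall(pattern, snippet))
# ['Google']
# ['Welcome to Facebook']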
If you wish to identify all the html tags, you can use this:
import re

batRegex = re.compile(r'(<[a-z]*>)')  # matches simple lowercase tags with no attributes
print batRegex.findall(yourstring)
You could scrape a bunch of titles with a couple lines of gazpacho:
from gazpacho import Soup
urls = ["http://google.com", "https://facebook.com", "http://reddit.com"]
titles = []
for url in urls:
    soup = Soup.get(url)
    title = soup.find("title", mode="first").text
    titles.append(title)
After this runs, titles will contain:
['Google',
 'Facebook - Log In or Sign Up',
 'reddit: the front page of the internet']
I have the following HTML pattern I want to scrape using BeautifulSoup:
<a href="link">TITLE</a>
I want to grab TITLE and the information that is displayed at the link. That is, if you clicked the link there is a description of the TITLE; I want that description.
I started with just trying to grab the title with the following code:
import urllib
from bs4 import BeautifulSoup
import re

webpage = urllib.urlopen("http://urlofinterest")
title = re.compile('<a>(.*)</a>')
findTitle = re.findall(title, webpage)
print findTitle
My output is:
% python beta2.py
[]
So this is obviously not even finding the title. I even tried <a href>(.*)</a> and that didn't work. Based on my reading of the documentation, I thought BeautifulSoup would grab whatever text is between the symbols I give it, in this case <a> and </a>, so what am I doing wrong?
How come you're importing BeautifulSoup and then not using it at all?
webpage = urllib.urlopen("http://urlofinterest")
You'll want to read the data from this, so that:
webpage = urllib.urlopen("http://urlofinterest").read()
Something like this should get you to a point where you can go further:
>>> blah = '<a href="link">TITLE</a>'
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(blah)  # change blah to webpage later
>>> for tag in soup('a', href=True):
...     print tag['href'], tag.string
link TITLE