Scraping data with Python

I'm working on a program to scrape company address data from a website. For this, I have already built a list of the links where this data can be found. On each of those links, the page source of the data I need looks like this:
http://imgur.com/a/V0kBK
For some reason though, I cannot seem to fetch the first line of the address, in this case "'t walletje 104 101".
All the other information comes through fine, as you can see here:
http://imgur.com/a/aUmSI
This is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import string
import urllib

urllist = []
url1 = "http://www.bisy.be/"
fname = "test2.txt"
fh = open(fname)

for line in fh:
    line = line.rstrip()
    if line.startswith(" <tr class="):
        words = line.split()
        url2 = words[6]
        url3 = url1 + url2[1:48]
        urllist.append(url3)

for link in urllist:
    document = urllib.request.urlopen(link)
    html = document.read()
    soup = BeautifulSoup(html, "html.parser")
    name = soup.find("br")
    name2 = name.text.strip()
    print(name2)
This code is doing the basics so far. Once everything works I will clean it up and fine-tune it a bit. Is there anyone who can help me?
An example link for the people who want to check out the page source: http://www.bisy.be/?q=nl/bedrijf_fiche&id=KNO01C2015&nr=2027420160
Is there anyone who can point me in the right direction?

This is a workaround that finds the p tag with a specific font size:
elements = soup.find_all("p")
for tag in elements:
    try:
        if "9pt" in tag["style"]:
            details = tag.text
            print(details)
    except KeyError:
        # p tag without a style attribute
        pass

You can't select text in a br tag; it is a line break, not a container like p, span, or div, so it has no text of its own. You can try the BeautifulSoup approach above, or use the regex below, which is faster. (A BeautifulSoup sketch that reads the text next to the br directly follows the regex version.)
import re

for link in urllist:
    document = urllib.request.urlopen(link)
    html = document.read().decode('utf-8')
    name = re.compile(r"9pt; font-family: Arial;'>(.*?)<a", re.DOTALL).findall(html)
    # clean html tags
    name = re.sub('<[^>]*>', '', name[0])
    print(name)
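If you prefer to stay with BeautifulSoup, here is a minimal sketch of reading the text node that sits next to the br tag. It assumes the address paragraph is the one styled with 9pt, as in the workaround above, and it reuses urllist and the imports from the question; depending on the markup, the first address line may be the node before the br rather than after it.

for link in urllist:
    soup = BeautifulSoup(urlopen(link).read(), "html.parser")
    for p in soup.find_all("p"):
        # Only look at the address paragraph (assumed to be the 9pt one).
        if "9pt" in p.get("style", ""):
            br = p.find("br")
            if br is not None and br.next_sibling:
                # The address line is a plain text node next to the <br>,
                # not inside it; use previous_sibling if it comes before.
                print(str(br.next_sibling).strip())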

Related

Python Web Scraping Error without Any Warning

I am trying to scrape some text from a webpage and save it to a text file using the following code (I am opening links from a text file called links.txt):
import requests
import csv
import random
import string
import re
from bs4 import BeautifulSoup

# Create a random string of a specific length
def randStr(chars=string.ascii_uppercase + string.digits, N=10):
    return ''.join(random.choice(chars) for _ in range(N))

with open("links.txt", "r") as a_file:
    for line in a_file:
        stripped_line = line.strip()
        endpoint = stripped_line
        response = requests.get(endpoint)
        data = response.text
        soup = BeautifulSoup(data, "html.parser")
        for pictags in soup.find_all('col-md-2'):
            lastfilename = randStr()
            file = open(lastfilename + ".txt", "w")
            file.write(pictags.txt)
            file.close()
        print(stripped_line)
The webpage has the following attribute:
<div class="col-md-2">
The problem is that after running the code nothing happens, and I am not receiving any error.
To get all keyword text from the page into a file, you can do:
import requests
from bs4 import BeautifulSoup
url = "http://www.mykeyworder.com/keywords?tags=dog&exclude=&language=en"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
with open("data.txt", "w") as f_out:
    for inp in soup.select('input[type="checkbox"]'):
        print(inp["value"], file=f_out)
This creates data.txt with content:
dog
animal
canine
pet
cute
puppy
happy
young
adorable
...and so on.
From the BeautifulSoup documentation, you can see that your line for pictags in soup.find_all('col-md-2') searches for any element with the tag name 'col-md-2', not elements with that class name. In other words, your code looks for elements like <col-md-2></col-md-2>.
Fix your code and try again with for pictags in soup.find_all(class_='col-md-2').
You can also match the elements by their attributes: pass a dictionary to the attrs parameter of find_all with the desired attributes of the elements you're looking for.
pictags = soup.find_all(attrs={'class': 'col-md-2'})
This will find all elements with class 'col-md-2'. A corrected version of the question's loop is sketched below.
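As a minimal sketch, here is the question's inner loop with the class lookup corrected and the tag's text (not .txt) written out; soup, randStr, and the file-naming scheme are taken from the question as given:

for pictags in soup.find_all(class_='col-md-2'):
    lastfilename = randStr()
    with open(lastfilename + ".txt", "w") as f:
        # .text holds the tag's text content; .txt is not a Tag attribute
        f.write(pictags.text)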

Web Scraping - Extract list of text from multiple pages

I want to extract a list of names from multiple pages of a website.
The website has over 200 pages and I want to save all the names to a text file. I have written some code, but it's giving me an index error.
CODE:
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://hamariweb.com/names/muslim/boy/page-'

#for page in range(1, 203):
page = 1
req = requests.get(URL + str(page))
soup = bs(req.text, 'html.parser')
row = soup.find('div', attrs={'class', 'row'})
books = row.find_all('a')
for book in books:
    data = book.find_all('b')[0].get_text()
    print(data)
OUTPUT:
Aabbaz
Aabid
Aabideen
Aabinus
Aadam
Aadeel
Aadil
Aadroop
Aafandi
Aafaq
Aaki
Aakif
Aalah
Aalam
Aalamgeer
Aalif
Traceback (most recent call last):
File "C:\Users\Mujtaba\Documents\names.py", line 15, in <module>
data = book.find_all('b')[0].get_text()
IndexError: list index out of range
>>>
The reason for the error is that some of the links don't contain a <b> tag.
Try this code to request each page and save the data to a file:
import requests
from bs4 import BeautifulSoup as bs

MAIN_URL = "https://hamariweb.com/names/muslim/boy/"
URL = "https://hamariweb.com/names/muslim/boy/page-{}"

with open("output.txt", "a", encoding="utf-8") as f:
    for page in range(203):
        if page == 0:
            req = requests.get(MAIN_URL.format(page))
        else:
            req = requests.get(URL.format(page))
        soup = bs(req.text, "html.parser")
        print(f"page # {page}, Getting: {req.url}")
        book_name = (
            tag.get_text(strip=True)
            for tag in soup.select(
                "tr.bottom-divider:nth-of-type(n+2) td:nth-of-type(1)"
            )
        )
        f.seek(0)
        f.write("\n".join(book_name) + "\n")
I suggest changing your parser to html5lib (pip install html5lib); I just think it's better. Second, it's better not to call .find() directly on your soup object, because tag names and classes can have duplicates, so you might end up reading data from an HTML block that doesn't even contain what you want. It's better to inspect the page first, find the tags you want, and see which block of code they live in; that makes the scraping easier and avoids more errors.
What I did was inspect the elements first and find the block of code that holds your data. It is a div whose class is mb-40 content-box, and that is where all the names you are trying to get are. Luckily the class is unique and there are no other elements with the same tag and class, so we can just .find() it directly.
The value of trs is then simply the tr tags inside that block. (Note that those <tr> tags are inside a <table> tag, but conveniently they are the only <tr> tags on the page, so there is no problem of the kind you would get if another <table> with the same class value existed.) Those <tr> tags contain the names you want. The [1:] slice starts at index 1 so that the table header on the website is not included.
Then just loop through those tr tags and get the text. As for your error: an IndexError means you tried to access an item of a .find_all() result list that is out of bounds. That happens when no matching data is found, and it can also happen when you call .find() directly on your soup variable, because there can be tags with the same class value but different content inside them. You expect to scrape one particular part of the website, but you are actually scraping a different part, which is why you get no data and wonder what is happening.
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://hamariweb.com/names/muslim/boy/page-'

#for page in range(1, 203):
page = 1
req = requests.get(URL + str(page))
soup = bs(req.content, 'html5lib')

div_container = soup.find('div', class_='mb-40 content-box')
trs = div_container.find_all("tr", class_="bottom-divider")[1:]
for tr in trs:
    text = tr.find("td").find("a").text
    print(text)
The IndexError means that in this case the link doesn't contain a b tag with the information you are looking for.
You can simply wrap that piece of code in a try-except clause.
for book in books:
    try:
        data = book.find_all('b')[0].get_text()
        print(data)
        # Add data to the all_titles list
        all_titles.append(data)
    except IndexError:
        pass  # There was no element available
This will catch the error and move on without breaking the code.
Below I have also added some extra lines to save your titles to a text file.
Take a look at the inline comments.
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://hamariweb.com/names/muslim/boy/page-'

# This is where your titles will be saved. Change as needed.
PATH = '/tmp/title_file.txt'

page = 1
req = requests.get(URL + str(page))
soup = bs(req.text, 'html.parser')
row = soup.find('div', attrs={'class', 'row'})
books = row.find_all('a')

# Here your titles will be stored before writing to file
all_titles = []

for book in books:
    try:
        # Add strip() to clean up the input
        data = book.find_all('b')[0].get_text().strip()
        print(data)
        # Add data to the all_titles list
        all_titles.append(data)
    except IndexError:
        pass  # There was no element available

# Open path to write
with open(PATH, 'w') as f:
    # Write all titles on a new line
    f.write('\n'.join(all_titles))

Python moving average on web data

Relatively new to Python, so apologies if I ask a stupid question.
I just want to check if this is possible and, if it is, how complex it would be.
I would like to calculate the moving average from share data on this web page:
https://uk.finance.yahoo.com/q/hp?a=&b=&c=&d=11&e=16&f=2015&g=d&s=LLOY.L%2C+&ql=1
You can use this sample code.
import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# Retrieve all of the span tags
tags = soup('span')
for tag in tags:
    # Look at the parts of each tag
    # and calculate whatever you want to
    print tag
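The moving-average calculation itself is independent of the scraping. As a minimal sketch (the closing_prices list here is hypothetical; in practice you would fill it with the values scraped from the page), a simple n-day moving average looks like this:

# Simple moving average over a list of closing prices.
# closing_prices is a made-up example; replace it with the scraped values.
closing_prices = [72.1, 72.9, 73.4, 72.8, 73.9, 74.2, 73.5]

def moving_average(values, window):
    # Average each consecutive slice of `window` values.
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

print(moving_average(closing_prices, 3))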
Try it:
from urllib import request
from bs4 import BeautifulSoup
url = "https://uk.finance.yahoo.com/q/hp?a=&b=&c=&d=11&e=16&f=2015&g=d&s=LLOY.L%2C+&ql=1"
html_contents = request.urlopen(url).read()
page = BeautifulSoup(html_contents, "html.parser")
el_list = page.find_all("span", {"id": "yfs_p43_lloy.l"})
print(el_list[0])

How to collect a continuous set of webpages using python?

https://example.net/users/x
Here, x is a number that ranges from 1 to 200000. I want to run a loop to get all the URLs and extract the contents of every URL using Beautiful Soup.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
content = urlopen(re.compile(r"https://example.net/users/[0-9]//"))
soup = BeautifulSoup(content)
Is this the right approach? I have to perform two things.
Get a continuous set of URLs
Extract & store retrieved contents from every page/URL.
UPDATE:
I have to get only one particular value from each of the webpages.
soup = BeautifulSoup(content)
divTag = soup.find_all("div", {"class": "classname"})
for tag in divTag:
    ulTags = tag.find_all("ul", {"class": "classname"})
    for tag in ulTags:
        aTags = tag.find_all("a", {"class": "classname"})
        for tag in aTags:
            name = tag.find('img')['alt']
            print(name)
You could try this:
import urllib2
import shutil

urls = []
for i in range(10):
    urls.append('https://www.example.org/users/' + str(i))

def getUrl(urls):
    for url in urls:
        # Build a file name based only on the url string
        file_name = url.replace('https://', '').replace('.', '_').replace('/', '_')
        response = urllib2.urlopen(url)
        with open(file_name, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)

getUrl(urls)
If you just need the contents of a web page, you could use lxml to parse it. Something like:
import requests
from lxml import etree

r = requests.get('https://example.net/users/x')
dom = etree.HTML(r.text)
# parse something
title = dom.xpath('//h1[@class="title"]')[0].text
Additionally, if you are scraping tens or hundreds of thousands of pages, you might want to look into something like grequests, which lets you make multiple asynchronous HTTP requests, as in the sketch below.
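A minimal sketch of that idea, assuming grequests is installed (pip install grequests) and using the URL pattern from the question:

import grequests

# Build the request objects lazily, then fire them concurrently.
urls = ('https://example.net/users/{}'.format(i) for i in range(1, 201))
reqs = (grequests.get(u) for u in urls)

# map() sends the requests asynchronously; size limits concurrency.
for response in grequests.map(reqs, size=20):
    if response is not None:
        print(response.url, len(response.text))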

Web Crawler To get Links From New Website

I am trying to get the links from a news website page (from one of its archives). I wrote the following lines of code in Python:
main.py contains :
import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"
br = mechanize.Browser()
htmltext = br.open(url).read()

articletext = ""
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('li', attrs={"data-section": "Business"}):
    articletext += tag.contents[0]
print articletext
An example of the object in tag.contents[0] :
ITC to issue 1:1 bonus
But on running it I am getting the following error :
File "C:\Python27\crawler\main.py", line 4, in <module>
text = articletext.getArticle(url)
File "C:\Python27\crawler\articletext.py", line 23, in getArticle
return getArticleText(htmltext)
File "C:\Python27\crawler\articletext.py", line 18, in getArticleText
articletext += tag.contents[0]
TypeError: cannot concatenate 'str' and 'Tag' objects
Can someone help me sort it out? I am new to Python programming. Thanks and regards.
You are using link_dictionary vaguely. If you are not using it for reading purposes, then try the following code:
br = mechanize.Browser()
htmltext = br.open(url).read()
soup = BeautifulSoup(htmltext)

articletext = ""
for tag_li in soup.findAll('li', attrs={"data-section": "Op-Ed"}):
    for link in tag_li.findAll('a'):
        urlnew = link.get('href')
        brnew = mechanize.Browser()
        htmltextnew = brnew.open(urlnew).read()
        articletext = ""
        soupnew = BeautifulSoup(htmltextnew)
        for tag in soupnew.findAll('p'):
            articletext += tag.text
        print re.sub(r'\s+', ' ', articletext, flags=re.M)
Note: re is for regular expressions; for this you need to import the re module.
I believe you may want to try accessing the text inside the list item like so:
for tag in soup.findAll('li', attrs={"data-section": "Business"}):
    articletext += tag.string
Edited: General Comments on getting links from a page
Probably the easiest data type to use to gather a bunch of links and retrieve them later is a dictionary.
To get links from a page using BeautifulSoup, you could do something like the following:
link_dictionary = {}
with urlopen(url_source) as f:
    soup = BeautifulSoup(f)
    for link in soup.findAll('a'):
        link_dictionary[link.string] = link.get('href')
This will provide you with a dictionary named link_dictionary, where every key is a string that is simply the text content between the <a> </a> tags and every value is the value of the href attribute. A small usage example follows below.
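For example, a quick way to look at what was collected (just a usage sketch; keys can be None for anchors without text, hence the check):

# Print every collected link title and its URL.
for title, href in link_dictionary.items():
    if title and href:
        print '{} -> {}'.format(title.strip(), href)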
How to combine this with your previous attempt
Now, if we combine this with the problem you were having before, we could try something like the following:
link_dictionary = {}
for tag in soup.findAll('li', attrs={"data-section": "Business"}):
    for link in tag.findAll('a'):
        link_dictionary[link.string] = link.get('href')
If this doesn't make sense, or you have a lot more questions, you will need to experiment first and try to come up with a solution before asking another new, clearer question.
You might want to use the powerful XPath query language with the faster lxml module. As simple as that:
import urllib2
from lxml import etree

url = 'http://www.thehindu.com/archive/web/2010/06/19/'
html = etree.HTML(urllib2.urlopen(url).read())
for link in html.xpath("//li[@data-section='Business']/a"):
    print '{} ({})'.format(link.text, link.attrib['href'])
Update for data-section='Chennai':
#!/usr/bin/python
import urllib2
from lxml import etree

url = 'http://www.thehindu.com/template/1-0-1/widget/archive/archiveWebDayRest.jsp?d=2010-06-19'
html = etree.HTML(urllib2.urlopen(url).read())
for link in html.xpath("//li[@data-section='Chennai']/a"):
    print '{} => {}'.format(link.text, link.attrib['href'])
