Python scraping webpages - python

I am trying to pull only the links and their text from a webpage line by line and insert text and link into a dictionary. Without using beautiful soup or a regex.
I keep getting this error:
Traceback (most recent call last):
File "F:/Homework7-2.py", line 13, in <module>
link2 = link1.split("href=")[1]
IndexError: list index out of range
code:
import urllib.request

url = "http://www.facebook.com"
page = urllib.request.urlopen(url)
mylinks = {}
links = page.readline().decode('utf-8')
for items in links:
    links = page.readline().decode('utf-8')
    if "a href=" in links:
        links = page.readline().decode('utf-8')
        link1 = links.split(">")[0]
        link2 = link1.split("href=")[1]
        mylinks = link2
        print(mylinks)

import requests
from bs4 import BeautifulSoup

r = requests.get("http://stackoverflow.com/questions/29336915/python-scraping-webpages")
# find all a tags with href attributes
for a in BeautifulSoup(r.content).find_all("a", href=True):
    # print each href
    print(a["href"])
Obviously that is a very broad example, but it will get you started. If you want specific URLs you can narrow your search to certain elements, though that will be different for every webpage. You won't find easier tools for parsing than requests and BeautifulSoup.
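Since the original question asked to avoid both BeautifulSoup and regex, the standard library's html.parser can collect the text and href of each link into a dictionary. A minimal sketch (the HTML fed in at the end is made up for illustration):

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects {link text: href} for every <a> tag that has an href."""
    def __init__(self):
        super().__init__()
        self.links = {}
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links["".join(self._text).strip()] = self._href
            self._href = None

parser = LinkParser()
parser.feed('<p><a href="/about">About us</a> and <a href="/contact">Contact</a></p>')
print(parser.links)  # {'About us': '/about', 'Contact': '/contact'}
```

Because the parser is event-driven, it also avoids the line-by-line `readline()` problem in the question: a tag split across lines is still handled correctly if you `feed()` the whole response body.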

Related

Error while trying to scrape two pages at the same time - Python, bs4

I am trying to scrape the links of some movies from a (main) website and after that, to scrape the contents from those links.
In the code below, I have tried to do it with only one link, but eventually, I will use a loop for all of them.
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import csv
def make_soup(url):
    # opening up connection, grabbing the page
    source = urlopen(url).read()
    page_soup = soup(source, "lxml")
    return page_soup
soup = make_soup('https://letterboxd.com/top10ner/list/2020-edition-top10ners-1001-greatest-movies/')
#### code for grabbing the links
#### link = first_link
my_url = str(link)
new_soup = make_soup(my_url)
new_cont = new_soup.find('div', {'id':'content'})
And I get an error:
Traceback (most recent call last):
File "/Users/calinap/PycharmProjects/WebScraping/letterboxd_scrape.py", line 34, in
new_cont = new_soup.find('div', {'id':'content'})
File "/Users/calinap/PycharmProjects/WebScraping/venv/lib/python3.8/site-packages/bs4/element.py", line 2127, in __getattr__
    raise AttributeError(
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
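That traceback is the generic "treating a list of elements like a single element" mistake: find_all() returns a ResultSet, which behaves like a list of tags, not like one tag. The same failure can be reproduced with a plain list (illustration only, no bs4 needed):

```python
# A ResultSet is list-like; calling a Tag method such as .find() on it
# fails the same way calling it on a plain list would:
results = ["<div>one</div>", "<div>two</div>"]
try:
    results.find("div")  # lists have no .find() method
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'find'

# The fix is to pick a single element first (results[0]) or iterate.
```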
This probably isn't necessary:
my_url = "'" + str(link) + "'"
Just do my_url = str(link).
OK, I think I know what's going on: you are trying to open a page that is rendered with JS. Some pages, such as Facebook, use this method extensively. Instead of sending you a result page, which you can parse and extract data from, they send you JavaScript code, which needs to be executed to generate the page with the results. If you definitely need to have this logic, you would need to use a headless browser, such as Chromium.
You would need to replace this:
source = urlopen(url).read()
with something like this (note that driver.get() returns None, so the page source has to be read from driver.page_source afterwards):
from selenium import webdriver

driver = webdriver.Chrome("./chromedriver")
driver.get(url)
source = driver.page_source
have a look here:
https://selenium-python.readthedocs.io/getting-started.html

Trying to extract some data from a webpage (scraping beginner)

I'm trying to extract some data from a webpage using Requests and then BeautifulSoup. I started by getting the HTML code with Requests and then "putting it" in BeautifulSoup:
from bs4 import BeautifulSoup
import requests
result = requests.get("https://XXXXX")
#print(result.status_code)
#print(result.headers)
src = result.content
soup = BeautifulSoup(src, 'lxml')
Then I singled out some pieces of code:
tags = soup.findAll('ol',{'class':'activity-popup-users'})
print(tags)
Here is a part of what I got:
<div class="account js-actionable-user js-profile-popup-actionable " data-emojified-name="" data-feedback-token="" data-impression-id="" data-name="The UN Times" data-screen-name="TheUNTimes" data-user-id="3787869561">
What I want now is to extract the data after data-user-id=, which consists of the numbers between the quotation marks. Then I would like that data to be entered into some kind of calc sheet.
I am an absolute beginner and I'm mostly pasting code I found elsewhere in tutorials or documentation.
Thanks a lot for your time...
EDIT:
So here's what I tried:
from bs4 import BeautifulSoup
import requests
result = requests.get("https://XXXX")
src = result.content
soup = BeautifulSoup(src, 'html.parser')
tags = soup.findAll('ol',{'class':'activity-popup-users'})
print(tags['data-user-id'])
And here's what I got:
TypeError: list indices must be integers or slices, not str
So I tried that:
from bs4 import BeautifulSoup
import requests
result = requests.get("https://XXXX")
src = result.content
soup = BeautifulSoup(src, 'html.parser')
#tags = soup.findAll('a',{'class':'account-group js-user-profile-link'})
tags = soup.findAll('ol',{'class':'activity-popup-users'})
tags.attrs
#print(tags['data-user-id'])
And got:
File "C:\Users\XXXX\element.py", line 1884, in __getattr__
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'attrs'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
You can get any attribute value of a tag by treating the tag like an attribute-value dictionary.
Read the BeautifulSoup documentation on attributes.
tag['data-user-id']
For example
html="""
<div class="account js-actionable-user js-profile-popup-actionable " data-emojified-name="" data-feedback-token="" data-impression-id="" data-name="The UN Times" data-screen-name="TheUNTimes" data-user-id="3787869561">
"""
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'html.parser')
tag=soup.find('div')
print(tag['data-user-id'])
Output
3787869561
Edit to include OP's question change:
from bs4 import BeautifulSoup
import requests

result = requests.get("http://twitter.com/RussiaUN/media")
src = result.content
soup = BeautifulSoup(src, 'html.parser')
divs = soup.find_all('div', class_='account')

# just print
for div in divs:
    print(div['data-user-id'])

# write to a file
with open('file.txt', 'w') as f:
    for div in divs:
        f.write(div['data-user-id'] + '\n')
Output:
255471924
2154112404
408696260
1267887043
475954041
3787869561
796979978
261711504
398068796
1174451010
...
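Since the stated goal was to get the ids into "some kind of calc sheet", the stdlib csv module is a natural next step after the file-writing loop above. A sketch, assuming the ids have already been collected into a list (the sample values are taken from the output above):

```python
import csv

# ids collected from div['data-user-id'] in the loop above (sample values)
user_ids = ["255471924", "2154112404", "408696260"]

with open("user_ids.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["data-user-id"])  # header row
    for uid in user_ids:
        writer.writerow([uid])

# user_ids.csv opens directly in Excel or LibreOffice Calc.
```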

TypeError: 'ResultSet' object is not callable - Python with BeautifulSoup

New to python here and keep running into an error when trying to set up some code to scrape data off a list of web pages.
The link to one of those pages is - https://rspo.org/members/2.htm
and I am trying to grab the information on there like 'Membership Number', 'Category', 'Sector', 'Country', etc and export it all into a spreadsheet.
Code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen
import requests
pages = []
for i in range(1, 10):
    url = 'https://rspo.org/members/' + str(i)
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = soup(page.text, 'html.parser')
    member = soup.find_all("span", {"class":"current"})
And I get the following error:
Traceback (most recent call last):
File "", line 3, in
soup = soup(page.text, 'html.parser')
TypeError: 'ResultSet' object is not callable
Not sure why I am getting this error. I tried looking at other pages on Stack Overflow but nothing seemed to have a similar error to the one I get above.
The problem is a name conflict: you import BeautifulSoup under the name soup, then rebind soup to the object it returns on the first pass through the loop. On later iterations, soup no longer refers to anything callable, so soup(page.text, 'html.parser') fails.
Try this instead:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
pages = []
for i in range(1, 10):
    url = 'https://rspo.org/members/' + str(i)
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = BeautifulSoup(page.text, 'html.parser')
    member = soup.find_all("span", {"class":"current"})
Note that I just removed the alias from BeautifulSoup. The reason I took this approach is simple: the standard convention in Python is that class names use proper case, e.g. ClassOne and BeautifulSoup, while instances of classes are lower-case, e.g. soup. This helps avoid name conflicts, and it also makes your code more intuitive. Once you learn this convention, it becomes much easier to read code and to write clean code.
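The shadowing problem is not specific to BeautifulSoup; any class name rebound to its own return value fails the same way on the next loop iteration. A minimal illustration with a made-up stand-in class:

```python
class Parser:
    """Stand-in for BeautifulSoup, to show the name-conflict bug."""
    def __init__(self, text):
        self.text = text

parser = Parser  # alias, like `from bs4 import BeautifulSoup as soup`
for text in ["page one", "page two"]:
    try:
        parser = parser(text)  # first pass calls the class, then rebinds the name
    except TypeError as e:
        print(e)  # second pass: 'Parser' object is not callable
```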

Getting href using beautiful soup with different methods

I'm trying to scrape a website. I learned to scrape from two resources: one used tag.get('href') to get the href from an a tag, and one used tag['href'] to get the same. As far as I understand it, they both do the same thing. But when I tried this code:
link_list = [l.get('href') for l in soup.find_all('a')]
it worked with the .get method, but not with the dictionary access way.
link_list = [l['href'] for l in soup.find_all('a')]
This throws a KeyError. I'm very new to scraping, so please pardon if this is a silly one.
Edit - Both of the methods worked for the find method instead of find_all.
You may let BeautifulSoup find the links with existing href attributes only.
You can do it in two common ways, via find_all():
link_list = [a['href'] for a in soup.find_all('a', href=True)]
Or, with a CSS selector:
link_list = [a['href'] for a in soup.select('a[href]')]
Maybe the HTML string does not have an href?
For example:
from bs4 import BeautifulSoup
doc_html = """<a class="vote-up-off" title="This question shows research effort; it is useful and clear">up vote</a>"""
soup = BeautifulSoup(doc_html, 'html.parser')
ahref = soup.find('a')
ahref.get('href')
Nothing will happen (the call just returns None), but
ahref['href']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/sergey/.virtualenvs/soup_example/lib/python3.5/site-packages/bs4/element.py", line 1011, in __getitem__
    return self.attrs[key]
KeyError: 'href'
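The two access styles mirror plain dictionaries, which is exactly how BeautifulSoup exposes tag attributes: `[]` raises KeyError on a missing key, while `.get` returns None or an explicit default. A plain-dict illustration of the same distinction:

```python
# The attributes of the <a> tag above, as a plain dict:
attrs = {"class": "vote-up-off", "title": "This question shows research effort"}

print(attrs.get("href"))       # None: missing key is tolerated
print(attrs.get("href", "#"))  # '#': an explicit default instead of None
try:
    attrs["href"]              # raises, exactly like tag['href']
except KeyError as e:
    print("KeyError:", e)
```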

how to scrape deeply embeded links with python beautifulSoup

I'm trying to build a spider/web crawler for academic purposes to grab text from academic publications and append related links to a URL stack. I'm trying to crawl 1 website called 'PubMed'. I can't seem to grab the links I need though. Here is my code with an example page, this page should be representative of others in their database:
website = 'http://www.ncbi.nlm.nih.gov/pubmed/?term=mtap+prmt'
from bs4 import BeautifulSoup
import requests
r = requests.get(website)
soup = BeautifulSoup(r.content)
I have broken the html tree down into several variables just for readability so that it can all fit on 1 screen width.
key_text = soup.find('div', {'class':'grid'}).find('div',{'class':'col twelve_col nomargin shadow'}).find('form',{'id':'EntrezForm'})
side_column = key_text.find('div', {'xmlns:xi':'http://www.w3.org/2001/XInclude'}).find('div', {'class':'supplemental col three_col last'})
side_links = side_column.find('div').findAll('div')[1].find('div', {'id':'disc_col'}).findAll('div')[1]
for link in side_links:
    print link
If you look at the HTML source code using Chrome's Inspect Element, there should be several other nested divs with links within side_links. However, the above code produces the following error:
Traceback (most recent call last):
File "C:/Users/ballbag/Copy/web_scraping/google_search.py", line 22, in <module>
side_links = side_column.find('div').findAll('div')[1].find('div', {'id':'disc_col'}).findAll('div')[1]
IndexError: list index out of range
If you go to the URL, there is a column on the right called 'related links' containing the URLs that I wish to scrape, but I can't seem to get to them. There is a statement under the div I am trying to get into, and I suspect this has something to do with it. Can anyone help grab these links? I'd really appreciate any pointers.
The problem is that the side bar is loaded with an additional asynchronous request.
The idea here would be to:
maintain a web-scraping session using requests.Session
parse the url that is used for getting the side bar
follow that link and get the links from the div with class="portlet_content"
Code:
from urlparse import urljoin
from bs4 import BeautifulSoup
import requests
base_url = 'http://www.ncbi.nlm.nih.gov'
website = 'http://www.ncbi.nlm.nih.gov/pubmed/?term=mtap+prmt'
# parse the main page and grab the link to the side bar
session = requests.Session()
soup = BeautifulSoup(session.get(website).content)
url = urljoin(base_url, soup.select('div#disc_col a.disc_col_ph')[0]['href'])
# parsing the side bar
soup = BeautifulSoup(session.get(url).content)
for a in soup.select('div.portlet_content ul li.brieflinkpopper a'):
    print a.text, urljoin(base_url, a.get('href'))
Prints:
The metabolite 5'-methylthioadenosine signals through the adenosine receptor A2B in melanoma. http://www.ncbi.nlm.nih.gov/pubmed/25087184
Down-regulation of methylthioadenosine phosphorylase (MTAP) induces progression of hepatocellular carcinoma via accumulation of 5'-deoxy-5'-methylthioadenosine (MTA). http://www.ncbi.nlm.nih.gov/pubmed/21356366
Quantitative analysis of 5'-deoxy-5'-methylthioadenosine in melanoma cells by liquid chromatography-stable isotope ratio tandem mass spectrometry. http://www.ncbi.nlm.nih.gov/pubmed/18996776
...
Cited in PMC http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/23265702/citedby/?tool=pubmed
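The urljoin step is what turns the relative hrefs from the side bar into absolute URLs; in Python 3 the same function lives in urllib.parse rather than urlparse. A small illustration using a path from the output above:

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base_url = 'http://www.ncbi.nlm.nih.gov'

# Relative hrefs resolve against the base:
print(urljoin(base_url, '/pubmed/25087184'))
# http://www.ncbi.nlm.nih.gov/pubmed/25087184

# Absolute hrefs pass through unchanged:
print(urljoin(base_url, 'http://example.com/x'))
# http://example.com/x
```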
