I'm trying to do some web scraping: I want to make a function that spits out the population for each country. I'm scraping the US Census Bureau site, but I can't get back the right information.
https://www.census.gov/popclock/world/af
<div id ="basic-facts" class = "data-cell">
<div class = "data-contianer">
<div class="data-cell" style = "background-image: url.....">
<p>population</p>
<h2 data-population="">35.8M</h2>"
This is basically what the code looks like that I'm trying to scrape. What I want is that "35.8M".
I have tried a few methods, and all I can get back is the heading itself, "data-population", and none of the data.
Someone mentioned to me that maybe the website serves it in some format so that it can't be scraped. In my experience, when scraping is blocked, the markup looks different: the value is in an image or some dynamic element that makes it harder to scrape. Does anyone have any thoughts on this?
# -*- coding: utf-8 -*-
# Tells python what encoding the string is stored in
# Import required libraries
import requests
from bs4 import BeautifulSoup
### Country naming issue: in the URLs on the website, the countries are coded
### with a two-letter code: "au" = Australia, "in" = India. If we were to search
### for a country name or something like that, we would need something to relate
### the country name to the two-letter code so it can search for it.
country = 'albania'
### Map country names to their two-letter codes. It would take a long time to
### write this all out; maybe it's possible to scrape these names?
codes = {'albania': 'al', 'afghanistan': 'af'}
countrycode = codes[country]
# Create url for the requested location through string concatenation
url = 'https://www.census.gov/popclock/world/'+countrycode
# Send request to retrieve the web-page using the
# get() function from the requests library
# The page variable stores the response from the web-page
page = requests.get(url)
# Create a BeautifulSoup object with the response from the URL
# Access contents of the web-page using .content
# "html.parser" is used since our page is in HTML format
soup = BeautifulSoup(page.content, "html.parser")
################################################################## Start of what I'm not sure about
# Locate element on page to be scraped
# find() locates the element in the BeautifulSoup object
# First method:
population = soup.find(id="basic-facts", class_="data-cell")
# I tried some methods like this and got only errors. (Note: "class" is a
# reserved word in Python, so find() takes the parameter as class_.)
# Second method:
population = soup.find_all("h2", {"data-population": ""})
for i in population:
    print(i)
# This returns the headings for the table but no data.
### Here we need to take out the population data;
### it is listed as <h2 data-population="">35.8M</h2>
################################################################## end
# Extract text from the first matched element using .text
population = population[0].text
#Final Output
#Return Scraped info
print('The population of ' + country + ' is ' + population)
I marked off the part I'm unsure about with #######. I tried a few methods; I listed two above.
I am pretty new to coding in general, so excuse me if I didn't describe this all right. Thanks for any advice anyone can give.
It is dynamically retrieved from an API call, which you can find in the network tab. Since you are not using a browser, where this call would have been made for you, you will need to make the request directly yourself.
import requests
r = requests.get('https://www.census.gov/popclock/apiData_pop.php?get=POP,MPOP0_4,MPOP5_9,MPOP10_14,MPOP15_19,MPOP20_24,MPOP25_29,MPOP30_34,MPOP35_39,MPOP40_44,MPOP45_49,MPOP50_54,MPOP55_59,MPOP60_64,MPOP65_69,MPOP70_74,MPOP75_79,MPOP80_84,MPOP85_89,MPOP90_94,MPOP95_99,MPOP100_,FPOP0_4,FPOP5_9,FPOP10_14,FPOP15_19,FPOP20_24,FPOP25_29,FPOP30_34,FPOP35_39,FPOP40_44,FPOP45_49,FPOP50_54,FPOP55_59,FPOP60_64,FPOP65_69,FPOP70_74,FPOP75_79,FPOP80_84,FPOP85_89,FPOP90_94,FPOP95_99,FPOP100_&key=&YR=2019&FIPS=af').json()
data = list(zip(r[0], r[1]))  # pair each column name with its value
print(round(int(data[0][1]) / 1_000_000, 1))  # total POP in millions, e.g. 35.8
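Since the goal was a function that works per country: the two-letter code travels in the FIPS query parameter, so the call generalizes. A sketch, assuming the endpoint also accepts a get list trimmed down to just POP; if it doesn't, keep the full field list from the call above:
import requests

def population_millions(fips, year=2019):
    # fips is the two-letter code from the popclock URLs, e.g. 'af', 'al', 'in'.
    # Assumes the API accepts a trimmed 'get' list; otherwise use the full one.
    r = requests.get(
        'https://www.census.gov/popclock/apiData_pop.php',
        params={'get': 'POP', 'key': '', 'YR': year, 'FIPS': fips},
    ).json()
    header, values = r[0], r[1]  # row 0 holds column names, row 1 the data
    return round(int(values[header.index('POP')]) / 1_000_000, 1)

print(population_millions('af'))  # e.g. 35.8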
I am very new to Python and programming in general, so please forgive my lack of insight. I have managed to web-scrape some data with XPath.
#Dependencies
from lxml import html
import requests
#URL
url = 'https://web.archive.org/web/20171004082203/https://www.yellowpages.com/houston-tx/air-conditioning-service-repair'
#Use Requests to retrieve html
resp = requests.get(url)
#Create Tree from Request Response
tree = html.fromstring(resp.content)
#Create Tree element
elements = tree.xpath('//*[starts-with(@id,"lid-")]/div/div/div[2]/div[2]/div[2]/a[1]')
# Scrape for URL and split for just business url
websites = elements[0].attrib['href'].split("http://")[1]
The output of this code returns a single website url. However, I would like to print all the business urls to eventually put to a Pandas data frame.
How can I retrieve elements[0],elements[1],elements[2]... in one variable or expression?
I am sure there is an iterative function or list comprehension for this, but I cannot wrap my brain around it. Can I create a function to iterate through elements[0], elements[1], elements[2], ... and return all my values?
Any help is greatly appreciated, Thanks!
Here is a quick fix, working from your code, that will get the websites from this particular site; it stores them all in the 'websites' list. That said, if you're working on a web scraper, you'd probably be better served working with Beautiful Soup.
#Create Tree element
elements = tree.xpath('//*[starts-with(@id,"lid-")]/div/div/div[2]/div[2]/div[2]/a[1]')
websites = []
for element in elements:
    try:
        # the href embeds the business site after a second "http"
        websites.append("http" + element.attrib['href'].split("http")[2])
    except (KeyError, IndexError):
        # skip anchors without an href or without an embedded url
        continue

for website in websites:
    print(website, '\n')
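Since the question mentioned list comprehensions: the loop above collapses into one, with the try/except replaced by a filtering condition (comprehensions can't contain try/except directly):
# Same logic as the loop: keep only hrefs that carry an embedded second
# "http" URL, then rebuild and collect it.
websites = ["http" + e.attrib['href'].split("http")[2]
            for e in elements
            if len(e.attrib.get('href', '').split("http")) > 2]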
Hello, I am trying to collect event data from my school's event web page and save it to a list based on locations. However, there is no current event information on the page, since school is closed. So I tried to get the tags that take the user-specified dates and change those values, so that I can make a URL request with the new tags and get event data from the year before. However, doing this does not give me any new information. How do I replace an old input tag with a new one, and how do I update the HTML page with these new tags? Attached below is example code of what I am doing.
import requests
from bs4 import BeautifulSoup as bs

response = requests.get(url)  # url is the events page address (not shown here)
#start and end dates I want to use
st_date = "04/01/2019"
ed_date = "04/14/2019"
soup = bs(response.text, 'html.parser')
input_list = soup.findAll('input')
#the first and second values in the list are the input tags
start_date = input_list[0]
end_date = input_list[1]
#replace the value attribute with date strings
start_date['value'] = st_date
end_date['value'] = ed_date
#insert the new tags
soup.insert(1,start_date)
You're manipulating a static copy of the web page. Instead you should choose one of two options to achieve your goal:
Use Selenium or something equivalent that will execute any JavaScript code that observes form field changes, or will handle injected submit/update button clicks.
Open the Developer Tools in your browser (usually F12) and watch the Network tab for outgoing requests when you change the dates. Then call these endpoints with your desired dates and get the events data from the response.
Good luck.
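For the second option, the replay usually ends up as a plain requests call. A minimal sketch, assuming the page sends the dates as GET parameters; the endpoint path and parameter names below are placeholders, so copy the real ones from the Network tab:
import requests

# HYPOTHETICAL endpoint and parameter names; read the real ones off the
# request your browser makes when you change the dates on the page.
resp = requests.get(
    'https://events.example.edu/calendar/search',
    params={'start_date': '04/01/2019', 'end_date': '04/14/2019'},
)
print(resp.text)  # events data in whatever format the endpoint returns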
If you really want to change the content of HTML/XML using BeautifulSoup, you can do that too.
from bs4 import BeautifulSoup
html = BeautifulSoup(some_html, 'html.parser')
html.find('tag').string = 'some new value'
The last line will change the content of the parsed page.
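A self-contained example of that pattern:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><b>old value</b></p>', 'html.parser')
soup.find('b').string = 'new value'  # replace the tag's text content
print(soup)  # <p><b>new value</b></p>
Keep in mind this only edits your local parse tree; it never changes what the server returns, which is why replaying the real request, as in the first answer, is what actually gets you different dates.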
I'm learning BeautifulSoup and I came across one problem: scraping dd tags in HTML. Check out the picture below; I want to get the parameters that are in the red zone. The problem is I do not know how to access them. I have tried this:
kvadratura = float(nek_html.find('span', class_='d-inline-block mt-auto').text.split(' ')[0])
jedinica_mere = nek_html.find('span', class_='d-inline-block mt-auto').text.split(' ')[1].strip()
...
But the problem is that different pages sometimes have different parameters, or a different order of parameters, so I can't access them by index. Check out the links:
https://www.nekretnine.rs/stambeni-objekti/stanovi/centar-zmaj-jovina-salonac-id1003/NkmUEzjEFo0/
https://www.nekretnine.rs/stambeni-objekti/stanovi/prodajemo-stan-milica-od-macve-mirijevo-46m2-nov/NkNruPymNHy/
How can I be sure that I will always scrape the parameter that I want?
Each parameter goes into a list afterwards, so if some parameter does not exist, it should add '' to the list.
In such cases, this is something you might want to do instead of using an index, as the latter may lead you to the wrong dd. With the following approach, all you need to do is replace the text within :contains('') to get the corresponding dd, as in Transakcija, Vrsta stana and so on:
import requests
from bs4 import BeautifulSoup
url = "https://www.nekretnine.rs/stambeni-objekti/stanovi/zemun-krajiska-41m-bela-fasadna-cila-odlican/NkiRX4sq4Cy/"
res = requests.get(url)
soup = BeautifulSoup(res.text,"lxml")
Kategorija = soup.select_one(".base-inf .dl-horozontal:has(:contains('Kategorija:')) > dd")
Kategorija = Kategorija.get_text(strip=True) if Kategorija else ""
print(Kategorija)
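To also cover the requirement that a missing parameter contributes '' to the list, the same selector can be wrapped in a small helper and looped over whichever labels you care about (a sketch; it assumes the other labels follow the same "Label:" format seen for Kategorija, and the tuple should be extended as needed):
def grab(label):
    # Return the dd paired with the given label, or '' when the page
    # does not list that parameter at all.
    node = soup.select_one(f".base-inf .dl-horozontal:has(:contains('{label}')) > dd")
    return node.get_text(strip=True) if node else ""

params = [grab(label) for label in ("Kategorija:", "Transakcija:", "Vrsta stana:")]
print(params)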
I am trying to scrape the data on some pages with BeautifulSoup, but I cannot seem to get the data that I want. I am having trouble splitting the data. I'll post my code below, but what I am trying to do is grab each address and split it. For instance, if you try the code below, I can get the data that I want, but I can't seem to figure out how to split it on the tags. The output I am attempting is address = ['2 Warriston's Close', 'High Street, Edinburgh EH1 1PG', 'United Kingdom'].
from bs4 import BeautifulSoup as bs
import requests
url = 'https://www.hauntedplaces.org/item/mary-kings-close/'
page = requests.get(url)
soup = bs(page.text, 'lxml')
region = soup.select('dd.data')[0]
# Need something here to split the region variable so I can separate for csv file.
# Trying to use soup.select('dd.data')[0].split() but to no avail.
print(region)
Instead of the HTML, then, you want the text inside the tags. BeautifulSoup elements have a text attribute, so in this case, to get what you want, you can just add the line:
print(region.text.split('\n')[:3])
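If the address parts are separated by <br> tags rather than literal newlines in the source, splitting on '\n' can come back as a single string. In that case BeautifulSoup's stripped_strings generator, which yields each text fragment separately, is a safer bet:
address = list(region.stripped_strings)[:3]
print(address)
# ["2 Warriston's Close", 'High Street, Edinburgh EH1 1PG', 'United Kingdom']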
Trying to parse a weather page and select the weekly forecasted highs.
Normally I would search with tags = soup.find_all("span", id="hi"), but this tag doesn't use an id; it uses a class.
Full code:
import mechanize
from bs4 import BeautifulSoup
my_browser = mechanize.Browser()
html_page = my_browser.open("http://www.wunderground.com/weather-forecast/45056")
html_text = html_page.get_data()
my_soup = BeautifulSoup(html_text, "html.parser")
tags = my_soup.find_all("span", class_="hi")
temp = tags[0].string
print(temp)
When I run this, nothing prints
The piece of HTML is buried inside a bunch of other tags, however the specific tag for today's high is as follows:
<span class="hi">63</span>
Just use class_ as the parameter name. See the docs.
The problem arises because class is a Python keyword, so you can't use it directly.
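The docs also show an equivalent spelling that sidesteps the keyword clash entirely, passing the class through the attrs dictionary:
tags = my_soup.find_all("span", attrs={"class": "hi"})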
As an alternative to scraping the web page, you could always check out Weather Underground's API. It's free for developers (limited number of calls per day, etc.), but if you're going to be doing a number of lookups, this might be easier in the end.