Python DataFrame NameError: name 'df' is not defined

I scraped the link and address of each property on page 1 of a real estate website into a list. I then convert this list of lists, listing_details, into a pandas DataFrame by appending the info for each property as a row (20 rows in total). My code is as follows:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0'}
url = "https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&pm=1&scat=1%2C7%2C2%2C4%2C6%2C5%2C3%2C50%2C53"
response = requests.get(url, headers=headers)
listings = BeautifulSoup(response.content, "lxml")

listing_details = []
details = listings.findAll('div', attrs={"data-test": "tile"})
for detail in details:
    # get property links
    links = detail.findAll('a', href=True)
    for link in links:
        link = "https://www.realestate.co.nz" + link['href']
    # get addresses
    addresses = detail.findAll('h3')
    for address in addresses:
        address = address.text.strip()

df = df.append(pd.DataFrame(listing_details, columns=['Link','Location']), ignore_index=True)
print(df)
However, I got the following error: NameError: name 'df' is not defined.
I changed the last two lines to print(listing_details) to see whether something was wrong with the list, but found that I got 20 empty lists.
Yet when I write print(link) and print(address) inside the loops, I can see that I did scrape the correct information, as follows:
https://www.realestate.co.nz/4016546/residential/sale/3436-westgate-drive-westgate
34-36 Westgate Drive, Westgate
https://www.realestate.co.nz/4016545/residential/sale/3436-westgate-drive-westgate
34-36 Westgate Drive, Westgate
https://www.realestate.co.nz/4016519/residential/sale/7-ckaitiaki-drive-clarks-beach
7C Kaitiaki Drive, Clarks Beach
https://www.realestate.co.nz/4016178/residential/sale/6423427-beach-road-mairangi-bay
6/423-427 Beach Road, Mairangi Bay
https://www.realestate.co.nz/4016177/residential/sale/4423427-beach-road-mairangi-bay
4/423-427 Beach Road, Mairangi Bay
https://www.realestate.co.nz/4016176/residential/sale/2423427-beach-road-mairangi-bay
2/423-427 Beach Road, Mairangi Bay
https://www.realestate.co.nz/4016163/residential/sale/303428-dominion-road-mount-eden
303/428 Dominion Road, Mount Eden
https://www.realestate.co.nz/4016162/residential/sale/316428-dominion-road-mount-eden
316/428 Dominion Road, Mount Eden
https://www.realestate.co.nz/4016127/residential/sale/50910-kingdon-street-newmarket
509/10 Kingdon Street, Newmarket
https://www.realestate.co.nz/4016057/residential/sale/3-s99-customs-street-west-auckland-central
3S/99 Customs Street West, Auckland Central
https://www.realestate.co.nz/4016005/residential/sale/80270-daldy-street-wynyard-quarter
802/70 Daldy Street, Wynyard Quarter
https://www.realestate.co.nz/4015970/residential/sale/20-crown-lynn-place-new-lynn
20 Crown Lynn Place, New Lynn
https://www.realestate.co.nz/4015916/residential/sale/3-s15-nelson-street-auckland-central
3S/15 Nelson Street, Auckland Central
https://www.realestate.co.nz/4015773/residential/sale/lot7280-fred-taylor-drive-westgate
Lot 72, 80 Fred Taylor Drive, Westgate
https://www.realestate.co.nz/4015774/residential/sale/lot4280-fred-taylor-drive-westgate
Lot 42, 80 Fred Taylor Drive, Westgate
https://www.realestate.co.nz/4015772/residential/sale/lot4580-fred-taylor-drive-massey
Lot 45, 80 Fred Taylor Drive, Massey
https://www.realestate.co.nz/4015771/residential/sale/lot6680-fred-taylor-drive-massey
Lot 66, 80 Fred Taylor Drive, Massey
https://www.realestate.co.nz/4015759/residential/sale/lot7280-fred-taylor-drive-massey
Lot 72, 80 Fred Taylor Drive, Massey
https://www.realestate.co.nz/4015757/residential/sale/lot4780-fred-taylor-drive-westgate
Lot 47, 80 Fred Taylor Drive, Westgate
https://www.realestate.co.nz/4015758/residential/sale/lot4580-fred-taylor-drive-westgate
Lot 45, 80 Fred Taylor Drive, Westgate
Any ideas on where I went wrong? Thanks a lot!

Currently, you are not appending anything to listing_details. Note also that list.append takes a single argument, so the two fields must be passed as one list. Your for loop should look something like this:
for detail in details:
    # get property links
    links = detail.findAll('a', href=True)
    for link in links:
        link = "https://www.realestate.co.nz" + link['href']
    # get addresses
    addresses = detail.findAll('h3')
    for address in addresses:
        address = address.text.strip()
    listing_details.append([link, address])  # you can decide the order of link and address
Once listing_details is populated, build the DataFrame directly with pd.DataFrame(listing_details, columns=['Link', 'Location']) instead of calling .append on a df that was never created; that undefined df is what raises the NameError.
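To see the append fix in isolation, here is a stdlib-only sketch using two link/address pairs copied from the output above as stand-ins for the scraped values:

```python
listing_details = []

# Stand-in values copied from the scraped output above
rows = [
    ("https://www.realestate.co.nz/4016546/residential/sale/3436-westgate-drive-westgate",
     "34-36 Westgate Drive, Westgate"),
    ("https://www.realestate.co.nz/4016519/residential/sale/7-ckaitiaki-drive-clarks-beach",
     "7C Kaitiaki Drive, Clarks Beach"),
]
for link, address in rows:
    listing_details.append([link, address])  # one [link, address] list per row

# listing_details is now row-shaped data, ready for
# pd.DataFrame(listing_details, columns=['Link', 'Location'])
print(listing_details)
```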


Applying a custom function to each row in a column in a dataframe

I have a bit of code which pulls the latitude and longitude for a location. It is here:
address = 'New York University'
url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(address) +'?format=json'
response = requests.get(url).json()
print(response[0]["lat"])
print(response[0]["lon"])
I'm wanting to apply this as a function to a long column of "address".
I've seen loads of questions about 'apply' and 'map', but they're almost all simple math examples.
Here is what I tried last night:
def locate(address):
    response = requests.get(url).json()
    print(response[0]["lat"])
    print(response[0]["lon"])
    return

df['lat'] = df['lat'].map(locate)
df['lon'] = df['lon'].map(locate)
This ended up just applying the first row lat / lon to the entire csv.
What is the best method to turn the code into a custom function and apply it to each row?
Thanks in advance.
EDIT: Thank you @PacketLoss for your assistance. I'm getting an IndexError: list index out of range, but it does work on his sample dataframe.
Here is the read_csv I used to pull in the data:
df = pd.read_csv('C:\\Users\\CIHAnalyst1\\Desktop\\InstitutionLocations.csv', sep=',', error_bad_lines=False, index_col=False, dtype='unicode', encoding = "utf-8", warn_bad_lines=False)
Here is a text copy of the rows from the dataframe:
address
0 GRAND CANYON UNIVERSITY
1 SOUTHERN NEW HAMPSHIRE UNIVERSITY
2 WESTERN GOVERNORS UNIVERSITY
3 FLORIDA INTERNATIONAL UNIVERSITY - UNIVERSITY ...
4 PENN STATE UNIVERSITY UNIVERSITY PARK
... ...
4292 THE ART INSTITUTES INTERNATIONAL LLC
4293 INTERCOAST - ONLINE
4294 CAROLINAS COLLEGE OF HEALTH SCIENCES
4295 DYERSBURG STATE COMMUNITY COLLEGE COVINGTON
4296 ULTIMATE MEDICAL ACADEMY - NY
You need to return your values from your function, or nothing will happen.
We can use apply here and pass the address from the df as well.
data = {'address': ['New York University', 'Sydney Opera House', 'Paris', 'SupeRduperFakeAddress']}
df = pd.DataFrame(data)

def locate(row):
    url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(row['address']) + '?format=json'
    response = requests.get(url).json()
    if response:
        row['lat'] = response[0]['lat']
        row['lon'] = response[0]['lon']
    return row

df = df.apply(locate, axis=1)
Outputs
address lat lon
0 New York University 40.72925325 -73.99625393609625
1 Sydney Opera House -33.85719805 151.21512338473752
2 Paris 48.8566969 2.3514616
3 SupeRduperFakeAddress NaN NaN
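The IndexError mentioned in the question's edit comes from indexing response[0] on an empty result: Nominatim returns [] for an address it cannot match, which is exactly what the if response: guard above prevents. A stdlib-only sketch of that guard, with hard-coded lists standing in for requests.get(url).json() (extract_latlon is a made-up helper name):

```python
def extract_latlon(response):
    """Return (lat, lon) from a Nominatim-style result list, or (None, None) when empty."""
    if response:  # an unmatched address yields [], so response[0] would raise IndexError
        return response[0]['lat'], response[0]['lon']
    return None, None

# Hard-coded stand-ins for requests.get(url).json()
hit = [{'lat': '40.72925325', 'lon': '-73.99625393609625'}]   # e.g. New York University
miss = []                                                     # e.g. SupeRduperFakeAddress

print(extract_latlon(hit))   # ('40.72925325', '-73.99625393609625')
print(extract_latlon(miss))  # (None, None)
```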

Getting all links from a page, receiving javascript.void() error?

I am trying to get all the links from this page to the incident reports, in CSV format. However, they don't seem to be "real links" (if you open one in a new tab you get "about:blank"). They do have their own links, visible in Inspect Element. I'm pretty confused. I did find some code online to do this, but it just gave "javascript.void()" as every link.
Surely there must be a way to do this?
https://www.avalanche.state.co.us/accidents/us/
To load all the links into a DataFrame and save it to CSV, you can use this example:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.avalanche.state.co.us/caic/acc/acc_us.php?'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

r = re.compile(r"window\.open\('(.*?)'")
data = []
for link in soup.select('a[onclick*="window"]'):
    data.append({'text': link.get_text(strip=True),
                 'link': r.search(link['onclick']).group(1)})

df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv', index=False)
Prints:
text link
0 Mount Emmons, west of Crested Butte https://www.avalanche.state.co.us/caic/acc/acc...
1 Point 12885 near Red Peak, west of Silverthorne https://www.avalanche.state.co.us/caic/acc/acc...
2 Austin Canyon, Snake River Range https://www.avalanche.state.co.us/caic/acc/acc...
3 Taylor Mountain, northwest of Teton Pass https://www.avalanche.state.co.us/caic/acc/acc...
4 North of Skyline Peak https://www.avalanche.state.co.us/caic/acc/acc...
.. ... ...
238 Battle Mountain - outside Vail Mountain ski area https://www.avalanche.state.co.us/caic/acc/acc...
239 Scotch Bonnet Mountain, near Lulu Pass https://www.avalanche.state.co.us/caic/acc/acc...
240 Near Paulina Peak https://www.avalanche.state.co.us/caic/acc/acc...
241 Rock Lake, Cascade, Idaho https://www.avalanche.state.co.us/caic/acc/acc...
242 Hyalite Drainage, northern Gallatins, Bozeman https://www.avalanche.state.co.us/caic/acc/acc...
[243 rows x 2 columns]
And saves data.csv.
Look at the onclick property of each link and get the "real" address from it.
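The regex in the first answer is doing exactly what the second answer describes: it captures the first single-quoted argument of window.open from the onclick attribute. A small sketch against a made-up onclick string of the same shape (the URL here is a hypothetical example, not a real report link):

```python
import re

# Hypothetical onclick value shaped like the ones on the accidents page
onclick = "window.open('https://www.avalanche.state.co.us/caic/acc/acc_report.php?acc_id=123','mywin')"

# Capture everything between window.open(' and the next single quote
r = re.compile(r"window\.open\('(.*?)'")
link = r.search(onclick).group(1)
print(link)  # https://www.avalanche.state.co.us/caic/acc/acc_report.php?acc_id=123
```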

scraping data from wikipedia table

I'm just trying to scrape data from a Wikipedia table into a pandas DataFrame.
I need to reproduce three columns: "Postcode, Borough, Neighbourhood".
import requests
from bs4 import BeautifulSoup
import pandas as pd

website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_url, 'xml')
print(soup.prettify())

My_table = soup.find('table', {'class': 'wikitable sortable'})
links = My_table.findAll('a')

Neighbourhood = []
for link in links:
    Neighbourhood.append(link.get('title'))
print(Neighbourhood)

df = pd.DataFrame([])
df['PostalCode', 'Borough', 'Neighbourhood'] = pd.Series(Neighbourhood)
df
And it returns only the borough...
Thanks
You may be overthinking the problem, if you only want the script to pull one table from the page. One import, one line, no loops:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url, header=0)[0]
df.head()
Postcode Borough Neighbourhood
0 M1A Not assigned Not assigned
1 M2A Not assigned Not assigned
2 M3A North York Parkwoods
3 M4A North York Victoria Village
4 M5A Downtown Toronto Harbourfront
You need to iterate over each row in the table and store the data row by row, not just in one giant list. Try something like this:
import pandas
import requests
from bs4 import BeautifulSoup

website_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_text, 'xml')

table = soup.find('table', {'class': 'wikitable sortable'})
table_rows = table.find_all('tr')

data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('td')])

df = pandas.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighbourhood'])
df = df[~df['PostalCode'].isnull()]  # to filter out bad rows
Then:
>>> df.head()
PostalCode Borough Neighbourhood
1 M1A Not assigned Not assigned
2 M2A Not assigned Not assigned
3 M3A North York Parkwoods
4 M4A North York Victoria Village
5 M5A Downtown Toronto Harbourfront
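The same td-per-tr idea can be shown with only the standard library; the toy table below stands in for the Wikipedia page, and it also makes clear why the header row drops out of data: it contains <th> cells, not <td>:

```python
from html.parser import HTMLParser

class RowExtractor(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []          # start a fresh row
        elif tag == 'td':
            self._in_td = True
            self._row.append('')    # start a fresh cell

    def handle_endtag(self, tag):
        if tag == 'tr' and self._row:   # header rows have no <td>, so _row stays empty
            self.rows.append(self._row)
        elif tag == 'td':
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._row[-1] += data.strip()

# Toy table standing in for the Wikipedia page
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
</table>
"""

p = RowExtractor()
p.feed(html)
print(p.rows)  # [['M3A', 'North York', 'Parkwoods'], ['M5A', 'Downtown Toronto', 'Harbourfront']]
```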
Basedig provides a platform to download Wikipedia tables as Excel, CSV or JSON files directly. Here is a link to the Wikipedia source: https://www.basedig.com/wikipedia/
If you do not find the dataset you are looking for on Basedig, send them the link to your article and they'll parse it for you.
Hope this helps

Finding the Metropolitan Area or County for City, State

I have following data in a dataframe:
...  location           amount
...  Jacksonville, FL   40
...  Provo, UT          20
...  Montgomery, AL     22
...  Los Angeles, CA    34
My dataset only contains U.S. cities in the form of [city name, state code] and I have no ZIP codes.
I want to determine either the metropolitan area or the county of each city, in order to visualize my data with ggcounty (like here).
I looked on the website of the U.S. Census Bureau but couldn't really find a table of city,state,county, or similar.
I would prefer to solve the problem in R only; does anyone have an idea how to approach this?
You can get a ZIP code and more detailed info doing this:
library(ggmap)
revgeocode(as.numeric(geocode('Jacksonville, FL ')))
Hope it helps

writing and saving CSV file from scraping data using python and Beautifulsoup4

I am trying to scrape data from the PGA.com website to get a table of all of the golf courses in the United States. In my CSV table I want to include the name of the golf course, address, ownership, website, and phone number. With this data I would like to geocode it, place it onto a map, and keep a local copy on my computer.
I used Python and BeautifulSoup 4 to extract my data. I have managed to extract the data from the website, but I am having difficulty writing the script to export the data into a CSV file with the fields I need.
Attached below is my script. I need help creating code that will transfer my extracted data into a CSV file and save it to my desktop.
Here is my script below:
import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
r = requests.get(url)
soup = BeautifulSoup(r.content)

g_data1 = soup.find_all("div", {"class": "views-field-nothing-1"})
g_data2 = soup.find_all("div", {"class": "views-field-nothing"})

for item in g_data1:
    try:
        print item.contents[1].find_all("div", {"class": "views-field-counter"})[0].text
    except:
        pass
    try:
        print item.contents[1].find_all("div", {"class": "views-field-course-type"})[0].text
    except:
        pass

for item in g_data2:
    try:
        print item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
    except:
        pass
    try:
        print item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
    except:
        pass
    try:
        print item.contents[1].find_all("div", {"class": "views-field-city-state-zip"})[0].text
    except:
        pass
This is what I currently get when I run the script. I want to take this data and make into a CSV table for geocoding later.
1801 Merrimac Trl
Williamsburg, Virginia 23185-5905
12551 Glades Rd
Boca Raton, Florida 33498-6830
Preserve Golf Club
13601 SW 115th Ave
Dunnellon, Florida 34432-5621
1000 Acres Ranch Resort
465 Warrensburg Rd
Stony Creek, New York 12878-1613
1757 Golf Club
45120 Waxpool Rd
Dulles, Virginia 20166-6923
27 Pines Golf Course
5611 Silverdale Rd
Sturgeon Bay, Wisconsin 54235-8308
3 Creek Ranch Golf Club
2625 S Park Loop Rd
Jackson, Wyoming 83001-9473
3 Lakes Golf Course
6700 Saltsburg Rd
Pittsburgh, Pennsylvania 15235-2130
3 Par At Four Points
8110 Aero Dr
San Diego, California 92123-1715
3 Parks Fairways
3841 N Florence Blvd
Florence, Arizona 85132
3-30 Golf & Country Club
101 Country Club Lane
Lowden, Iowa 52255
401 Par Golf
5715 Fayetteville Rd
Raleigh, North Carolina 27603-4525
93 Golf Ranch
406 E 200 S
Jerome, Idaho 83338-6731
A 1 Golf Center
1805 East Highway 30
Rockwall, Texas 75087
A H Blank Municipal Course
808 County Line Rd
Des Moines, Iowa 50320-6706
A-Bar-A Ranch Golf Course
Highway 230
Encampment, Wyoming 82325
A-Ga-Ming Golf Resort, Sundance
627 Ag A Ming Dr
Kewadin, Michigan 49648-9397
A-Ga-Ming Golf Resort, Torch
627 Ag A Ming Dr
Kewadin, Michigan 49648-9397
A. C. Read Golf Club, Bayou
Bldg 3495, Nas Pensacola
Pensacola, Florida 32508
A. C. Read Golf Club, Bayview
Bldg 3495, Nas Pensacola
Pensacola, Florida 32508
All you really need to do here is put your output in a list and then use the csv library to export it. I'm not entirely clear on what you are getting out of views-field-nothing-1, but to just focus on views-field-nothing, you could do something like:
courses_list = []
for item in g_data2:
    try:
        name = item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
    except:
        name = ''
    try:
        address1 = item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
    except:
        address1 = ''
    try:
        address2 = item.contents[1].find_all("div", {"class": "views-field-city-state-zip"})[0].text
    except:
        address2 = ''
    course = [name, address1, address2]
    courses_list.append(course)
This will put the courses in a list; next you can write them to a CSV like so:
import csv

with open('filename.csv', 'wb') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow(row)
First of all, you want to put all of your items in a list and then write to a file later, in case there is an error while you are scraping. Instead of printing, just append to a list.
Then you can write to a csv file:
f = open('filename', 'wb')
csv_writer = csv.writer(f)
for i in main_list:
    csv_writer.writerow(i)
f.close()
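Both answers above target Python 2 (opening the file in 'wb' mode, print statements in the question). In Python 3, csv files are opened in text mode with newline='' instead. A minimal sketch of the same export, using two rows taken from the question's output in the [name, address1, address2] shape built above (the filename courses.csv is arbitrary):

```python
import csv

# Example rows in the [name, address1, address2] shape built by the scraper
courses_list = [
    ["Preserve Golf Club", "13601 SW 115th Ave", "Dunnellon, Florida 34432-5621"],
    ["3 Lakes Golf Course", "6700 Saltsburg Rd", "Pittsburgh, Pennsylvania 15235-2130"],
]

# Python 3: text mode plus newline='' (prevents blank lines between rows on Windows)
with open('courses.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Address", "CityStateZip"])  # header row
    writer.writerows(courses_list)                        # all data rows at once
```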
