Writing and saving a CSV file from scraped data using Python and BeautifulSoup4

I am trying to scrape data from the PGA.com website to get a table of all of the golf courses in the United States. In my CSV table I want to include the name of the golf course, address, ownership, website, and phone number. With this data I would like to geocode the courses, place them on a map, and keep a local copy on my computer.
I used Python and BeautifulSoup4 to extract my data. I have gotten as far as extracting the data from the website, but I am having difficulty writing the part of the script that exports the data to a CSV file with the fields I need.
Attached below is my script. I need help creating code that will transfer my extracted data into a CSV file, and help saving that file to my desktop.
Here is my script below:
import csv
import requests
from bs4 import BeautifulSoup
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
r = requests.get(url)
soup = BeautifulSoup(r.content)
g_data1=soup.find_all("div",{"class":"views-field-nothing-1"})
g_data2=soup.find_all("div",{"class":"views-field-nothing"})
for item in g_data1:
    try:
        print item.contents[1].find_all("div",{"class":"views-field-counter"})[0].text
    except:
        pass
    try:
        print item.contents[1].find_all("div",{"class":"views-field-course-type"})[0].text
    except:
        pass

for item in g_data2:
    try:
        print item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
    except:
        pass
    try:
        print item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
    except:
        pass
    try:
        print item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
    except:
        pass
This is what I currently get when I run the script. I want to take this data and make it into a CSV table for geocoding later.
1801 Merrimac Trl
Williamsburg, Virginia 23185-5905
12551 Glades Rd
Boca Raton, Florida 33498-6830
Preserve Golf Club
13601 SW 115th Ave
Dunnellon, Florida 34432-5621
1000 Acres Ranch Resort
465 Warrensburg Rd
Stony Creek, New York 12878-1613
1757 Golf Club
45120 Waxpool Rd
Dulles, Virginia 20166-6923
27 Pines Golf Course
5611 Silverdale Rd
Sturgeon Bay, Wisconsin 54235-8308
3 Creek Ranch Golf Club
2625 S Park Loop Rd
Jackson, Wyoming 83001-9473
3 Lakes Golf Course
6700 Saltsburg Rd
Pittsburgh, Pennsylvania 15235-2130
3 Par At Four Points
8110 Aero Dr
San Diego, California 92123-1715
3 Parks Fairways
3841 N Florence Blvd
Florence, Arizona 85132
3-30 Golf & Country Club
101 Country Club Lane
Lowden, Iowa 52255
401 Par Golf
5715 Fayetteville Rd
Raleigh, North Carolina 27603-4525
93 Golf Ranch
406 E 200 S
Jerome, Idaho 83338-6731
A 1 Golf Center
1805 East Highway 30
Rockwall, Texas 75087
A H Blank Municipal Course
808 County Line Rd
Des Moines, Iowa 50320-6706
A-Bar-A Ranch Golf Course
Highway 230
Encampment, Wyoming 82325
A-Ga-Ming Golf Resort, Sundance
627 Ag A Ming Dr
Kewadin, Michigan 49648-9397
A-Ga-Ming Golf Resort, Torch
627 Ag A Ming Dr
Kewadin, Michigan 49648-9397
A. C. Read Golf Club, Bayou
Bldg 3495, Nas Pensacola
Pensacola, Florida 32508
A. C. Read Golf Club, Bayview
Bldg 3495, Nas Pensacola
Pensacola, Florida 32508

All you really need to do here is put your output in a list and then use the csv library to export it. I'm not entirely clear on what you are getting out of views-field-nothing-1, but to focus just on views-field-nothing, you could do something like:
courses_list = []
for item in g_data2:
    try:
        name = item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
    except:
        name = ''
    try:
        address1 = item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
    except:
        address1 = ''
    try:
        address2 = item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
    except:
        address2 = ''
    course = [name, address1, address2]
    courses_list.append(course)
This will put the courses in a list; next you can write them to a csv like so:
import csv
with open('filename.csv', 'wb') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow(row)
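Note that 'wb' is the Python 2 idiom for csv files; on Python 3 the csv module wants a text-mode file opened with newline='' instead. A minimal Python 3 sketch, assuming the same courses_list and adding a header row matching the three fields scraped above:
import csv

# Python 3: text mode with newline='' stops csv from inserting blank rows on Windows
with open('filename.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'Address', 'City/State/ZIP'])  # header row
    writer.writerows(courses_list)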

First of all, you want to put all of your items in a list and only write them to a file afterwards, in case there is an error while you are scraping. So instead of printing, just append to a list.
Then you can write to a csv file:
f = open('filename', 'wb')
csv_writer = csv.writer(f)
for i in main_list:
    csv_writer.writerow(i)
f.close()
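You also asked how to save the file to your desktop: neither snippet above pins down the location, so one way is to build an absolute path first. A short sketch, assuming a standard Desktop folder under your home directory (golf_courses.csv is a hypothetical filename):
import os

# expand '~' to the user's home directory and point at the Desktop folder
desktop_file = os.path.join(os.path.expanduser('~'), 'Desktop', 'golf_courses.csv')
f = open(desktop_file, 'wb')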

Related

Getting all links from page, receiving javascript.void() error?

I am trying to get all the links from this page to the incident reports, in a CSV format. However, they don't seem to be "real links" (if you open one in a new tab you get an "about:blank" page). They do have their own links - visible in inspect element. I'm pretty confused. I did find some code online to do this, but it just returned "javascript.void()" for every link.
Surely there must be a way to do this?
https://www.avalanche.state.co.us/accidents/us/
To load all the links into a DataFrame and save it to CSV, you can use this example:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.avalanche.state.co.us/caic/acc/acc_us.php?'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
r = re.compile(r"window\.open\('(.*?)'")
data = []
for link in soup.select('a[onclick*="window"]'):
    data.append({'text': link.get_text(strip=True),
                 'link': r.search(link['onclick']).group(1)})
df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv', index=False)
Prints:
text link
0 Mount Emmons, west of Crested Butte https://www.avalanche.state.co.us/caic/acc/acc...
1 Point 12885 near Red Peak, west of Silverthorne https://www.avalanche.state.co.us/caic/acc/acc...
2 Austin Canyon, Snake River Range https://www.avalanche.state.co.us/caic/acc/acc...
3 Taylor Mountain, northwest of Teton Pass https://www.avalanche.state.co.us/caic/acc/acc...
4 North of Skyline Peak https://www.avalanche.state.co.us/caic/acc/acc...
.. ... ...
238 Battle Mountain - outside Vail Mountain ski area https://www.avalanche.state.co.us/caic/acc/acc...
239 Scotch Bonnet Mountain, near Lulu Pass https://www.avalanche.state.co.us/caic/acc/acc...
240 Near Paulina Peak https://www.avalanche.state.co.us/caic/acc/acc...
241 Rock Lake, Cascade, Idaho https://www.avalanche.state.co.us/caic/acc/acc...
242 Hyalite Drainage, northern Gallatins, Bozeman https://www.avalanche.state.co.us/caic/acc/acc...
[243 rows x 2 columns]
And saves data.csv.
Look at the onclick property of these links and pull the "real" address out of it.
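For a single link that boils down to reading the attribute and extracting the URL from the window.open('...') call, as the regex in the answer above does. A small sketch (real_href is a hypothetical helper, and it assumes the onclick value really has that shape):
import re

def real_href(link):
    # link['onclick'] looks like: window.open('https://...', ...)
    m = re.search(r"window\.open\('(.*?)'", link.get('onclick', ''))
    return m.group(1) if m else None

# usage: real_href(soup.select_one('a[onclick*="window"]'))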

How to parse string as a pandas dataframe

I'm trying to build a self-contained Jupyter notebook that parses a long address string into a pandas dataframe for demonstration purposes. Currently I'm having to highlight the entire string and use pd.read_clipboard:
data = pd.read_clipboard(f,
                         comment='#',
                         header=None,
                         names=['address']).values.reshape(-1, 2)
matched_address = pd.DataFrame(data, columns=['addr_zagat', 'addr_fodor'])
I'm wondering if there is an easier way to read the string in directly instead of relying on having something copied to the clipboard. Here are the first few lines of the string for reference:
f = """###################################################################################################
#
# There are 112 matches between the tuples. The Zagat tuple is listed first,
# and then its Fodors pair.
#
###################################################################################################
Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 90048 310-246-1501 Steakhouses
Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 90048 310/246-1501 American
########################
Art's Deli 12224 Ventura Blvd. Studio City 91604 818-762-1221 Delis
Art's Delicatessen 12224 Ventura Blvd. Studio City 91604 818/762-1221 American
########################
Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 90077 310-472-1211 Californian
Hotel Bel-Air 701 Stone Canyon Rd. Bel Air 90077 310/472-1211 Californian
########################
Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 91423 818-788-3536 French Bistro
Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 91423 818/788-3536 French
########################
h Bistro
Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 91423 818/788-3536 French
########################"""
Does anybody have any tips on how to parse this string directly into a pandas dataframe?
I realise there is another question that addresses this here: Create Pandas DataFrame from a string, but the string there is delimited by a semicolon and totally different from the format used in my example.
You should add an example of what your output should look like, but generally I would suggest something like this:
import pandas as pd
import numpy as np

# read file, split into lines
f = open("./your_file.txt", "r").read().split('\n')
accumulator = []
# loop through lines
for line in f:
    # define criteria for selecting lines
    if len(line) > 1 and line[0].isupper():
        # define criteria for splitting the line
        # get name: everything before the first digit
        first_num_char = [c for c in line if c.isdigit()][0]
        name = line.split(first_num_char, 1)[0]
        line = line.replace(name, '')
        # get restaurant type: last whitespace-separated token
        rest_type = line.split()[-1]
        line = line.replace(rest_type, '')
        # get phone number: now the last token
        number = line.split()[-1]
        line = line.replace(number, '')
        # remainder should be the address
        address = line
        accumulator.append([name, rest_type, number, address])
# turn accumulator into numpy array, pass with column index to DataFrame constructor
df = pd.DataFrame(np.asarray(accumulator), columns=['name', 'restaurant_type', 'phone_number', 'address'])
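Since the goal is to avoid the clipboard entirely, it is worth noting that nothing above needs a file either: the same loop runs over f.splitlines() on the in-memory string from the question. And if all you want is the two-column pairing that the read_clipboard call was producing, a minimal sketch (assuming the f string from the question and an even number of data lines):
import pandas as pd

# drop the comment/separator lines, then pair consecutive rows as (zagat, fodors),
# mirroring read_clipboard(comment='#', ...).values.reshape(-1, 2)
rows = [line for line in f.splitlines() if line and not line.startswith('#')]
data = pd.Series(rows).values.reshape(-1, 2)
matched_address = pd.DataFrame(data, columns=['addr_zagat', 'addr_fodor'])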

Pandas : Create a new dataframe from 2 different dataframes using fuzzy matching [duplicate]

I have two data frames, each with a different number of rows. Below are a couple of rows from each data set:
df1 =
Company City State ZIP
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102
LACKEY SHEET METAL St. Louis MO 63102
and
df2 =
FDA Company FDA City FDA State FDA ZIP
LACKEY SHEET METAL St. Louis MO 63102
PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
HELGET GAS PRODUCTS INC Omaha NE 68127
ORTHOQUEST LLC La Vista NE 68128
I joined them side by side using combined_data = pandas.concat([df1, df2], axis = 1). My next goal is to compare each string under df1['Company'] to each string under df2['FDA Company'] using several different matching functions from the fuzzywuzzy module, and to return the value of the best match and its name, stored in a new column. For example, if I ran fuzz.ratio and fuzz.token_sort_ratio on LACKEY SHEET METAL in df1['Company'] against df2['FDA Company'], it would return that the best match was LACKEY SHEET METAL with a score of 100, and this would then be saved under a new column in combined_data. The result would look like
combined_data =
Company City State ZIP FDA Company FDA City FDA State FDA ZIP fuzzy.token_sort_ratio match fuzzy.ratio match
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101 LACKEY SHEET METAL St. Louis MO 63102 LACKEY SHEET METAL 100 LACKEY SHEET METAL 100
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102 PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102 HELGET GAS PRODUCTS INC Omaha NE 68127
LACKEY SHEET METAL St. Louis MO 63102 ORTHOQUEST LLC La Vista NE 68128
I tried doing
combined_data['name_ratio'] = combined_data.apply(lambda x: fuzz.ratio(x['Company'], x['FDA Company']), axis = 1)
But I got an error because the lengths of the columns are different.
I am stumped. How can I accomplish this?
I couldn't tell what you were doing. This is how I would do it.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
Create a series of tuples to compare:
compare = pd.MultiIndex.from_product([df1['Company'],
                                      df2['FDA Company']]).to_series()
Create a special function to calculate fuzzy metrics and return a series.
def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])
Apply metrics to the compare series
compare.apply(metrics)
There are a bunch of ways to do this next part:
Get closest matches to each row of df1
compare.apply(metrics).unstack().idxmax().unstack(0)
Get closest matches to each row of df2
compare.apply(metrics).unstack(0).idxmax().unstack(0)
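To get the best match and its score back onto df1 as new columns, as asked, you can group the scores by the first index level. A minimal sketch for fuzz.ratio, assuming the compare series and metrics function above (token_sort_ratio works the same way via the 'token' column):
scores = compare.apply(metrics)                    # 'ratio' and 'token' per pair
best = scores['ratio'].groupby(level=0).idxmax()   # (Company, FDA Company) of best pair

df1['fuzz.ratio match'] = df1['Company'].map(lambda c: best[c][1])
df1['fuzz.ratio score'] = df1['Company'].map(scores['ratio'].groupby(level=0).max())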

Finding the Metropolitan Area or County for City, State

I have following data in a dataframe:
... location amount
... Jacksonville, FL 40
... Provo, UT 20
... Montgomery, AL 22
... Los Angeles, CA 34
My dataset only contains U.S. cities in the form of [city name, state code] and I have no ZIP codes.
I want to determine either the metropolitan area or the county for each city, in order to visualize my data with ggcounty (like here).
I looked on the website of the U.S. Census Bureau but couldn't really find a table of city, state, county, or similar.
Given that I would prefer to solve the problem in R only, does anyone have an idea how to approach this?
You can get a ZIP code and more detailed info by doing this:
library(ggmap)
revgeocode(as.numeric(geocode('Jacksonville, FL ')))
Hope it helps

Python: Read Content of Hidden HTML Table

On this webpage there is a "Show Study location" tab; when I click the tab, it shows the entire location list and changes the web address to the one I included in this program. But when I run the program to print out the entire location list, I get this result:
soup = BeautifulSoup(urllib2.urlopen('https://clinicaltrials.gov/ct2/show/study/NCT01718158?term=NCT01718158&rank=1&show_locs=Y#locn').read())
for row in soup('table')[5].findAll('tr'):
    tds = row('td')
    if len(tds) < 2:
        continue
    print tds[0].string, tds[1].string  #, '\n'.join(filter(unicode.strip, tds[1].strings))
Local Institution None
Local Institution None
Local Institution None
Local Institution None
Local Institution None
and so on, leaving the rest of the information out. I feel I am missing something here. My result should be:
United States, California
Va Long Beach Healthcare System
Long Beach, California, United States, 90822
United States, Georgia
Gastrointestinal Specialists Of Georgia Pc
Marietta, Georgia, United States, 30060
United States, New York
Weill Cornell Medical College
and so forth. I want to print out the entire location list.
The local institutes are in rows with just one table cell, but you are skipping those.
Perhaps you need to extract the data from all cells and only skip rows without <td> cells here:
for row in soup('table')[5].findAll('tr'):
    tds = row('td')
    if not tds:
        continue
    print u' '.join([cell.string for cell in tds if cell.string])
This produces
United States, California
Va Long Beach Healthcare System
Long Beach, California, United States, 90822
United States, Georgia
Gastrointestinal Specialists Of Georgia Pc
Marietta, Georgia, United States, 30060
# ....
Local Institution
Taipei, Taiwan, 100
Local Institution
Taoyuan, Taiwan, 333
United Kingdom
Local Institution
London, Greater London, United Kingdom, SE5 9RS
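If you are on Python 3, urllib2 and the print statement are gone; a minimal equivalent of the loop above using requests (assuming the page still serves the same markup, with the locations in the sixth table):
import requests
from bs4 import BeautifulSoup

url = ('https://clinicaltrials.gov/ct2/show/study/NCT01718158'
       '?term=NCT01718158&rank=1&show_locs=Y#locn')
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for row in soup('table')[5].find_all('tr'):
    tds = row('td')
    if not tds:  # skip rows without <td> cells, as above
        continue
    print(' '.join(cell.get_text(strip=True) for cell in tds))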
