I'm working with PySpark, using the withColumn() command to do a basic transformation on a DataFrame, namely to update the value of a column. I'm looking for some debugging assistance while I also study the problem.
PySpark raises an AnalysisException and Py4JJavaError when I call withColumn.
The column _c49 holds 'EVENT_NARRATIVE', which is the column my withColumn('EVENT_NARRATIVE') call is meant to reference inside the Spark DataFrame (df).
from pyspark.sql.functions import *
from pyspark.sql.types import *
df = df.withColumn('EVENT_NARRATIVE', lower(col('EVENT_NARRATIVE')))
Py4JJavaError: An error occurred while calling o100.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve '`EVENT_NARRATIVE`' given input columns: [_c3, _c17, _c40, _c21, _c48, _c12, _c39, _c18, _c31, _c10, _c45, _c26, _c5, _c43, _c24, _c33, _c9, _c14, _c1, _c16, _c47, _c20, _c46, _c32, _c22, _c7, _c2, _c42, _c37, _c36, _c30, _c8, _c38, _c23, _c25, _c13, _c29, _c41, _c19, _c44, _c11, _c28, _c6, _c50, _c49, _c0, _c15, _c4, _c34, _c27, _c35];;
'Project [_c0#604, _c1#605, _c2#606, _c3#607, _c4#608, _c5#609, _c6#610, _c7#611, _c8#612, _c9#613, _c10#614, _c11#615, _c12#616, _c13#617, _c14#618, _c15#619, _c16#620, _c17#621, _c18#622, _c19#623, _c20#624, _c21#625, _c22#626, _c23#627, ... 28 more fields]
+- Relation[_c0#604,_c1#605,_c2#606,_c3#607,_c4#608,_c5#609,_c6#610,_c7#611,_c8#612,_c9#613,_c10#614,_c11#615,_c12#616,_c13#617,_c14#618,_c15#619,_c16#620,_c17#621,_c18#622,_c19#623,_c20#624,_c21#625,_c22#626,_c23#627,... 27 more fields] csv
Sample data from df.head():
[Row(_c0='BEGIN_YEARMONTH', _c1='BEGIN_DAY', _c2='BEGIN_TIME', _c3='END_YEARMONTH', _c4='END_DAY', _c5='END_TIME', _c6='EPISODE_ID', _c7='EVENT_ID', _c8='STATE', _c9='STATE_FIPS', _c10='YEAR', _c11='MONTH_NAME', _c12='EVENT_TYPE', _c13='CZ_TYPE', _c14='CZ_FIPS', _c15='CZ_NAME', _c16='WFO', _c17='BEGIN_DATE_TIME', _c18='CZ_TIMEZONE', _c19='END_DATE_TIME', _c20='INJURIES_DIRECT', _c21='INJURIES_INDIRECT', _c22='DEATHS_DIRECT', _c23='DEATHS_INDIRECT', _c24='DAMAGE_PROPERTY', _c25='DAMAGE_CROPS', _c26='SOURCE', _c27='MAGNITUDE', _c28='MAGNITUDE_TYPE', _c29='FLOOD_CAUSE', _c30='CATEGORY', _c31='TOR_F_SCALE', _c32='TOR_LENGTH', _c33='TOR_WIDTH', _c34='TOR_OTHER_WFO', _c35='TOR_OTHER_CZ_STATE', _c36='TOR_OTHER_CZ_FIPS', _c37='TOR_OTHER_CZ_NAME', _c38='BEGIN_RANGE', _c39='BEGIN_AZIMUTH', _c40='BEGIN_LOCATION', _c41='END_RANGE', _c42='END_AZIMUTH', _c43='END_LOCATION', _c44='BEGIN_LAT', _c45='BEGIN_LON', _c46='END_LAT', _c47='END_LON', _c48='EPISODE_NARRATIVE', _c49='EVENT_NARRATIVE', _c50='DATA_SOURCE'),
Row(_c0='201210', _c1='29', _c2='1600', _c3='201210', _c4='29', _c5='1922', _c6='68680', _c7='416744', _c8='NEW HAMPSHIRE', _c9='33', _c10='2012', _c11='October', _c12='High Wind', _c13='Z', _c14='12', _c15='EASTERN HILLSBOROUGH', _c16='BOX', _c17='29-OCT-12 16:00:00', _c18='EST-5', _c19='29-OCT-12 19:22:00', _c20='0', _c21='0', _c22='0', _c23='0', _c24='109.60K', _c25='0.00K', _c26='ASOS', _c27='55.00', _c28='MG', _c29=None, _c30=None, _c31=None, _c32=None, _c33=None, _c34=None, _c35=None, _c36=None, _c37=None, _c38=None, _c39=None, _c40=None, _c41=None, _c42=None, _c43=None, _c44=None, _c45=None, _c46=None, _c47=None, _c48='Sandy, a hybrid storm with both tropical and extra-tropical characteristics, brought high winds and coastal flooding to southern New England. Easterly winds gusted to 50 to 60 mph for interior southern New England; 55 to 65 mph along the eastern Massachusetts coast and along the I-95 corridor in southeast Massachusetts and Rhode Island; and 70 to 80 mph along the southeast Massachusetts and Rhode Island coasts. A few higher higher gusts occurred along the Rhode Island coast. A severe thunderstorm embedded in an outer band associated with Sandy produced wind gusts to 90 mph and concentrated damage in Wareham early Tuesday evening, |a day after the center of Sandy had moved into New Jersey. In general, moderate coastal flooding occurred along the Massachusetts coastline, and major coastal flooding impacted the Rhode Island coastline. The storm surge was generally 2.5 to 4.5 feet along the east coast of Massachusetts, but peaked late Monday afternoon in between high tide cycles. Seas built to between 20 and 25 feet Monday afternoon and evening just off the Massachusetts east coast. Along the south coast, the storm surge was 4 to 6 feet and seas from 30 to a little over 35 feet were observed in the outer coastal waters. The very large waves on top of the storm surge caused destructive coastal flooding along stretches of the Rhode Island exposed south coast. ||Sandy grew into a hurricane over the southwest Caribbean and then headed north across Jamaica, Cuba, and the Bahamas. As Sandy headed north of the Bahamas, the storm interacted with a vigorous weather system moving west to east across the United States and began to take on a hybrid structure. Strong high pressure over southeast Canada helped with the expansion of the strong winds well north of the center of Sandy. In essence, Sandy retained the structure of a hurricane near its center (until shortly before landfall) while taking on more of an extra-tropical cyclone configuration well away from the center. Sandy’s track was unusual. The storm headed northeast and then north across the western Atlantic and then sharply turned to the west to make landfall near Atlantic City, NJ during Monday evening. Sandy subsequently weakened and moved west across southern Pennsylvania on Tuesday before turning north and heading across western New York state into Quebec during Tuesday night and Wednesday.', _c49='The Automated Surface Observing System at Manchester-Boston Regional Airport (KMHT) recorded sustained wind speeds of 38 mph and gusts to 63 mph. In Manchester, a tree was downed on Harrison Street. In Hudson, a tree was downed on Lawrence Road, bringing down wires that sparked a fire that damaged a house. In Merrimack, a tree was downed, taking down wires and closing Amherst Road from Meetinghouse Road to Riverside Drive. In Nashua, a tree was downed onto a house on Broad Street, near the Hollils line.
No structural damage was found. Numerous trees were downed, blocking roads.', _c50='CSV')
The column names are in the form of _c followed by a number because, presumably, you did not specify header=True while reading the input file. You can do
df = spark.read.csv('filepath', header=True)
That way, the column names will be BEGIN_YEARMONTH, BEGIN_DAY, etc. instead of _c0, _c1, ..., and your withColumn code should then work.
You can also consider adding inferSchema=True to ensure that the data types are suitable.
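For example, a minimal sketch combining both options ('filepath' is the same placeholder as above):
from pyspark.sql.functions import col, lower

df = spark.read.csv('filepath', header=True, inferSchema=True)
df = df.withColumn('EVENT_NARRATIVE', lower(col('EVENT_NARRATIVE')))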
Of course, you can also stick with your current code, and do
df2 = df.withColumn('_c49', lower(col('_c49')))
But that's not a good long-term solution. Column names should be sensible, and you also don't want the header to be one of the rows in your dataframe.
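If you do keep header=False, you would also have to drop the header row from the data yourself before transforming, something like this sketch (assuming 'BEGIN_YEARMONTH' only ever appears in _c0 on the header row):
from pyspark.sql.functions import col, lower

# remove the header row that was read in as data
df = df.filter(col('_c0') != 'BEGIN_YEARMONTH')
df2 = df.withColumn('_c49', lower(col('_c49')))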
I am trying to get all the links from this page to the incident reports, in CSV format. However, they don't seem to be "real links" (if you open one in a new tab, you get an "about:blank" page). They do have their own URLs, visible via Inspect Element. I'm pretty confused. I did find some code online to do this, but it just returned "javascript:void()" for every link.
Surely there must be a way to do this?
https://www.avalanche.state.co.us/accidents/us/
To load all the links into a DataFrame and save it to CSV, you can use this example:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.avalanche.state.co.us/caic/acc/acc_us.php?'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# the real URL is embedded in each link's onclick="window.open('...')" handler
r = re.compile(r"window\.open\('(.*?)'")

data = []
for link in soup.select('a[onclick*="window"]'):
    data.append({'text': link.get_text(strip=True),
                 'link': r.search(link['onclick']).group(1)})

df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv', index=False)
Prints:
text link
0 Mount Emmons, west of Crested Butte https://www.avalanche.state.co.us/caic/acc/acc...
1 Point 12885 near Red Peak, west of Silverthorne https://www.avalanche.state.co.us/caic/acc/acc...
2 Austin Canyon, Snake River Range https://www.avalanche.state.co.us/caic/acc/acc...
3 Taylor Mountain, northwest of Teton Pass https://www.avalanche.state.co.us/caic/acc/acc...
4 North of Skyline Peak https://www.avalanche.state.co.us/caic/acc/acc...
.. ... ...
238 Battle Mountain - outside Vail Mountain ski area https://www.avalanche.state.co.us/caic/acc/acc...
239 Scotch Bonnet Mountain, near Lulu Pass https://www.avalanche.state.co.us/caic/acc/acc...
240 Near Paulina Peak https://www.avalanche.state.co.us/caic/acc/acc...
241 Rock Lake, Cascade, Idaho https://www.avalanche.state.co.us/caic/acc/acc...
242 Hyalite Drainage, northern Gallatins, Bozeman https://www.avalanche.state.co.us/caic/acc/acc...
[243 rows x 2 columns]
And saves data.csv.
Look at the onclick property of these links and extract the "real" address from it.
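A minimal sketch of that idea, using a hypothetical onclick value of the same shape as the ones on the page:
import re

# hypothetical onclick value copied from Inspect Element
onclick = "window.open('https://www.avalanche.state.co.us/caic/acc/acc_report.php?acc_id=735','_blank')"
match = re.search(r"window\.open\('(.*?)'", onclick)
print(match.group(1))  # prints the "real" report URL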
I am trying to scrape data from the PGA.com website to get a table of all of the golf courses in the United States. In my CSV table I want to include the name of the golf course, address, ownership, website, and phone number. With this data I would like to geocode it, place it on a map, and keep a local copy on my computer.
I used Python and BeautifulSoup4 to extract my data. I have gotten as far as extracting the data from the website, but I am having difficulty writing the script to export the data into a CSV file with the fields I need.
I need help creating code that will transfer my extracted data into a CSV file and save it to my desktop.
Here is my script:
import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
r = requests.get(url)
soup = BeautifulSoup(r.content)

g_data1 = soup.find_all("div", {"class": "views-field-nothing-1"})
g_data2 = soup.find_all("div", {"class": "views-field-nothing"})

for item in g_data1:
    try:
        print item.contents[1].find_all("div", {"class": "views-field-counter"})[0].text
    except:
        pass
    try:
        print item.contents[1].find_all("div", {"class": "views-field-course-type"})[0].text
    except:
        pass

for item in g_data2:
    try:
        print item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
    except:
        pass
    try:
        print item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
    except:
        pass
    try:
        print item.contents[1].find_all("div", {"class": "views-field-city-state-zip"})[0].text
    except:
        pass
This is what I currently get when I run the script. I want to take this data and make it into a CSV table for geocoding later.
1801 Merrimac Trl
Williamsburg, Virginia 23185-5905
12551 Glades Rd
Boca Raton, Florida 33498-6830
Preserve Golf Club
13601 SW 115th Ave
Dunnellon, Florida 34432-5621
1000 Acres Ranch Resort
465 Warrensburg Rd
Stony Creek, New York 12878-1613
1757 Golf Club
45120 Waxpool Rd
Dulles, Virginia 20166-6923
27 Pines Golf Course
5611 Silverdale Rd
Sturgeon Bay, Wisconsin 54235-8308
3 Creek Ranch Golf Club
2625 S Park Loop Rd
Jackson, Wyoming 83001-9473
3 Lakes Golf Course
6700 Saltsburg Rd
Pittsburgh, Pennsylvania 15235-2130
3 Par At Four Points
8110 Aero Dr
San Diego, California 92123-1715
3 Parks Fairways
3841 N Florence Blvd
Florence, Arizona 85132
3-30 Golf & Country Club
101 Country Club Lane
Lowden, Iowa 52255
401 Par Golf
5715 Fayetteville Rd
Raleigh, North Carolina 27603-4525
93 Golf Ranch
406 E 200 S
Jerome, Idaho 83338-6731
A 1 Golf Center
1805 East Highway 30
Rockwall, Texas 75087
A H Blank Municipal Course
808 County Line Rd
Des Moines, Iowa 50320-6706
A-Bar-A Ranch Golf Course
Highway 230
Encampment, Wyoming 82325
A-Ga-Ming Golf Resort, Sundance
627 Ag A Ming Dr
Kewadin, Michigan 49648-9397
A-Ga-Ming Golf Resort, Torch
627 Ag A Ming Dr
Kewadin, Michigan 49648-9397
A. C. Read Golf Club, Bayou
Bldg 3495, Nas Pensacola
Pensacola, Florida 32508
A. C. Read Golf Club, Bayview
Bldg 3495, Nas Pensacola
Pensacola, Florida 32508
All you really need to do here is put your output in a list and then use the csv library to export it. I'm not entirely clear on what you are getting out of views-field-nothing-1, but to just focus on views-field-nothing, you could do something like:
courses_list = []
for item in g_data2:
    try:
        name = item.contents[1].find_all("div", {"class": "views-field-title"})[0].text
    except:
        name = ''
    try:
        address1 = item.contents[1].find_all("div", {"class": "views-field-address"})[0].text
    except:
        address1 = ''
    try:
        address2 = item.contents[1].find_all("div", {"class": "views-field-city-state-zip"})[0].text
    except:
        address2 = ''
    course = [name, address1, address2]
    courses_list.append(course)
This will put the courses in a list; next you can write them to a csv like so:
import csv

with open('filename.csv', 'wb') as f:  # 'wb' for Python 2's csv module
    writer = csv.writer(f)
    for row in courses_list:
        writer.writerow(row)
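If you are on Python 3, note that the csv module wants a text-mode file opened with newline='' instead of 'wb':
with open('filename.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(courses_list)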
First of all, you want to put all of your items in a list and then write them to a file later, in case there is an error while you are scraping. Instead of printing, just append to a list.
Then you can write to a csv file:
f = open('filename', 'wb')
csv_writer = csv.writer(f)
for i in main_list:
    csv_writer.writerow(i)
f.close()
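Putting both pieces together, a sketch of the full pattern (Python 2, reusing g_data2 and the class names from the question's script; main_list and the output filename are placeholders):
import csv

main_list = []
for item in g_data2:
    row = []
    # collect name, street address, and city/state/zip for each course
    for cls in ("views-field-title", "views-field-address", "views-field-city-state-zip"):
        try:
            row.append(item.contents[1].find_all("div", {"class": cls})[0].text)
        except:
            row.append('')
    main_list.append(row)

with open('courses.csv', 'wb') as f:  # use 'w', newline='' on Python 3
    csv_writer = csv.writer(f)
    for i in main_list:
        csv_writer.writerow(i)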