Python - Extracting only the first item when using an open API

I'm extracting movie data from an open API.
I want to get only the first director and actor, but everyone is printed out.
This is my code:
url = "http://www.kobis.or.kr/kobisopenapi/webservice/rest/movie/searchMovieInfo.json?key='KeyValue'&movieCd=20177478"
res = requests.get(url)
test = res.text
d = json.loads(test)
movieinfo = d['movieInfoResult']['movieInfo']
moviename = movieinfo['movieNm']
print("movie_name = " + moviename)
moviedt = movieinfo['openDt']
print("movie_dt = " + moviedt)
for b in d["movieInfoResult"]["movieInfo"]["directors"]:
print("director_name = " + b["peopleNm"])
When I run this code, the result looks like this:
movie_name = avengers
movie_dt = 20180425
director_name = Anthony Russo
director_name = Joe Russo
How can I get only one person, like this? I need just the first person:
movie_name = avengers
movie_dt = 20180425
director_name = Anthony Russo
Open API site (Korean): https://www.kobis.or.kr/kobisopenapi/homepg/apiservice/searchServiceInfo.do

You can break out of the for loop after printing, or you can directly access the first value (if you are sure the directors array is not empty):
url = "http://www.kobis.or.kr/kobisopenapi/webservice/rest/movie/searchMovieInfo.json?key='KeyValue'&movieCd=20177478"
res = requests.get(url)
test = res.text
d = json.loads(test)
movieinfo = d['movieInfoResult']['movieInfo']
moviename = movieinfo['movieNm']
print("movie_name = " + moviename)
moviedt = movieinfo['openDt']
print("movie_dt = " + moviedt)
for b in d["movieInfoResult"]["movieInfo"]["directors"]:
print("director_name = " + b["peopleNm"])
break
or
url = "http://www.kobis.or.kr/kobisopenapi/webservice/rest/movie/searchMovieInfo.json?key='KeyValue'&movieCd=20177478"
res = requests.get(url)
test = res.text
d = json.loads(test)
movieinfo = d['movieInfoResult']['movieInfo']
moviename = movieinfo['movieNm']
print("movie_name = " + moviename)
moviedt = movieinfo['openDt']
print("movie_dt = " + moviedt)
print("director_name = " + d["movieInfoResult"]["movieInfo"]["directors"][0]["peopleNm"])

Related

Extracting Nested List-Dictionaries to Pandas Series in a DataFrame

I have a pandas DataFrame that I have extracted from a JSON file for breweries I'm interested in. Most of these columns are nested lists of dictionaries. However, two columns, 'hours' and 'memberships', are being problematic.
I'd like to extract the 'hours' column into 7 columns: "Mon_Hours", "Tue_Hours", ... "Sun_Hours".
I have tried and tried to figure this out, but these two columns are proving challenging.
Here is a link to the initial data: https://www.coloradobrewerylist.com/wp-json/cbl_api/v1/locations/?location-type%5Bnin%5D=404,405&page_size=1000&page_token=1
and here is my code:
import requests
import re
import pandas as pd
import numpy as np
import csv
import json
from datetime import datetime
### get the data from the Colorado Brewery list
url = "https://www.coloradobrewerylist.com/wp-json/cbl_api/v1/locations/?location-type%5Bnin%5D=404,405&page_size=1000&page_token=1"
payload={}
headers = {}
response = requests.request("GET", url, headers=headers, data=payload)
data=response.json()
### convert results to table
pd.set_option('display.max_columns', None)
brewdf = pd.DataFrame.from_dict(data['results'])
#brewdf
############################################
#### CLEAN UP NESTED LIST-DICT COLUMNS #####
############################################
## cleanup dogs column
dogs = pd.json_normalize(brewdf['dogs'])
dogs2 = dogs.squeeze()
dogsdf = pd.json_normalize(dogs2)
dogsdf = dogsdf.drop(columns =['id','slug'])
dogsdf = dogsdf.rename(columns={'name':'dogs_allowed'})
#dogsdf
## cleanup parking column
parking = pd.json_normalize(brewdf['parking'])
parking = parking.rename(columns = {0:'Parking1',1:'Parking2',2:'Parking3'})
a = pd.json_normalize(parking['Parking1'])
b = pd.json_normalize(parking['Parking2'])
c = pd.json_normalize(parking['Parking3'])
parkcombo = pd.concat([a,b,c],ignore_index=True, axis=1)
parkcombo = parkcombo.rename(columns = {2:'P1',5:'P2',8:'P3'})
parkcombo['parking_type'] = parkcombo['P1'].map(str) + ',' + parkcombo['P2'].map(str) + ',' + parkcombo['P3'].map(str)
parkcombo['parking_type'] = parkcombo['parking_type'].str.replace(",nan",'')
parkdf = parkcombo['parking_type'].to_frame()
#parkdf
## cleanup food type column
food = pd.json_normalize(brewdf['food_type'])
food
food = food.rename(columns = {0:'Food1',1:'Food2',2:'Food3',3:'Food4',4:'Food5',5:'Food6'})
a = pd.json_normalize(food['Food1'])
b = pd.json_normalize(food['Food2'])
c = pd.json_normalize(food['Food3'])
d = pd.json_normalize(food['Food4'])
e = pd.json_normalize(food['Food5'])
f = pd.json_normalize(food['Food6'])
foodcombo = pd.concat([a,b,c,d,e,f],ignore_index=True, axis =1)
foodcombo
foodcombo = foodcombo.rename(columns = {2:'F1',5:'F2',8:'F3',11:'F4',14:'F5',17:'F6'})
foodcombo['food_type'] = foodcombo['F1'].map(str) + ',' + foodcombo['F2'].map(str) + ',' + foodcombo['F3'].map(str) + ',' + foodcombo['F4'].map(str)+ ',' + foodcombo['F5'].map(str) + ',' + foodcombo['F6'].map(str)
foodcombo['food_type'] = foodcombo['food_type'].str.replace(",nan",'')
fooddf = foodcombo['food_type'].to_frame()
#fooddf
## cleanup patio column
patio = pd.json_normalize(brewdf['patio'])
patio = patio.rename(columns = {0:'P1',1:'P2',2:'P3'})
a = pd.json_normalize(patio['P1'])
b = pd.json_normalize(patio['P2'])
c = pd.json_normalize(patio['P3'])
patiocombo = pd.concat([a,b,c],ignore_index=True, axis =1)
patiocombo
patiocombo = patiocombo.rename(columns = {2:'P1',5:'P2',8:'P3'})
patiocombo['patio_type'] = patiocombo['P1'].map(str) + ',' + patiocombo['P2'].map(str) + ',' + patiocombo['P3'].map(str)
patiocombo['patio_type'] = patiocombo['patio_type'].str.replace(",nan",'')
patiodf = patiocombo['patio_type'].to_frame()
#patiodf
## clean visitor type column
visitor = pd.json_normalize(brewdf['visitors'])
visitor
visitor = visitor.rename(columns = {0:'V1',1:'V2',2:'V3'})
a = pd.json_normalize(visitor['V1'])
b = pd.json_normalize(visitor['V2'])
c = pd.json_normalize(visitor['V3'])
visitorcombo = pd.concat([a,b,c],ignore_index=True, axis =1)
visitorcombo
visitorcombo = visitorcombo.rename(columns = {2:'V1',5:'V2',8:'V3'})
visitorcombo['visitor_type'] = visitorcombo['V1'].map(str) + ',' + visitorcombo['V2'].map(str) + ',' + visitorcombo['V3'].map(str)
visitorcombo['visitor_type'] = visitorcombo['visitor_type'].str.replace(",nan",'')
visitordf = visitorcombo['visitor_type'].to_frame()
#visitordf
## clean tour type column
tour = pd.json_normalize(brewdf['tour_type'])
tour
tour = tour.rename(columns = {0:'T1',1:'T2',2:'T3',3:'T4'})
a = pd.json_normalize(tour['T1'])
b = pd.json_normalize(tour['T2'])
c = pd.json_normalize(tour['T3'])
d = pd.json_normalize(tour['T4'])
tourcombo = pd.concat([a,b,c,d],ignore_index=True, axis =1)
tourcombo
tourcombo = tourcombo.rename(columns = {2:'T1',5:'T2',8:'T3',11:'T4'})
tourcombo['tour_type'] = tourcombo['T1'].map(str) + ',' + tourcombo['T2'].map(str) + ',' + tourcombo['T3'].map(str) + ','+ tourcombo['T4'].map(str)
tourcombo['tour_type'] = tourcombo['tour_type'].str.replace(",nan",'')
tourdf = tourcombo['tour_type'].to_frame()
#tourdf
## clean other drinks column
odrink = pd.json_normalize(brewdf['otherdrinks_type'])
odrink
odrink = odrink.rename(columns = {0:'O1',1:'O2',2:'O3',3:'O4',4:'O5',5:'O6',6:'O7',7:'O8',8:'O9'})
a = pd.json_normalize(odrink['O1'])
b = pd.json_normalize(odrink['O2'])
c = pd.json_normalize(odrink['O3'])
d = pd.json_normalize(odrink['O4'])
e = pd.json_normalize(odrink['O5'])
f = pd.json_normalize(odrink['O6'])
g = pd.json_normalize(odrink['O7'])
h = pd.json_normalize(odrink['O8'])
i = pd.json_normalize(odrink['O9'])
odrinkcombo = pd.concat([a,b,c,d,e,f,g,h,i],ignore_index=True, axis =1)
odrinkcombo
odrinkcombo = odrinkcombo.rename(columns = {2:'O1',5:'O2',8:'O3',11:'O4',14:'O5',17:'O6',20:'O7',23:'O8',26:'O9'})
odrinkcombo['odrink_type'] = odrinkcombo['O1'].map(str) + ',' + odrinkcombo['O2'].map(str) + ',' + odrinkcombo['O3'].map(str) + ','+ odrinkcombo['O4'].map(str) + ','+ odrinkcombo['O5'].map(str)+ ','+ odrinkcombo['O6'].map(str)+ ','+ odrinkcombo['O7'].map(str)+','+ odrinkcombo['O8'].map(str)+','+ odrinkcombo['O9'].map(str)
odrinkcombo['odrink_type'] = odrinkcombo['odrink_type'].str.replace(",nan",'')
odrinkdf = odrinkcombo['odrink_type'].to_frame()
#odrinkdf
## clean to-go column
togo = pd.json_normalize(brewdf['togo_type'])
togo
togo = togo.rename(columns = {0:'TG1',1:'TG2',2:'TG3',3:'TG4',4:'TG5'})
a = pd.json_normalize(togo['TG1'])
b = pd.json_normalize(togo['TG2'])
c = pd.json_normalize(togo['TG3'])
d = pd.json_normalize(togo['TG4'])
e = pd.json_normalize(togo['TG5'])
togocombo = pd.concat([a,b,c,d,e],ignore_index=True, axis =1)
togocombo
togocombo = togocombo.rename(columns = {2:'TG1',5:'TG2',8:'TG3',11:'TG4',14:'TG5'})
togocombo['togo_type'] = togocombo['TG1'].map(str) + ',' + togocombo['TG2'].map(str) + ',' + togocombo['TG3'].map(str) + ','+ togocombo['TG4'].map(str) + ','+ togocombo['TG5'].map(str)
togocombo['togo_type'] = togocombo['togo_type'].str.replace(",nan",'')
togodf = togocombo['togo_type'].to_frame()
#togodf
## clean merch column
merch = pd.json_normalize(brewdf['merch_type'])
merch
merch = merch.rename(columns = {0:'M1',1:'M2',2:'M3',3:'M4',4:'M5',5:'M6',6:'M7',7:'M8',8:'M9',9:'M10',10:'M11',11:'M12'})
a = pd.json_normalize(merch['M1'])
b = pd.json_normalize(merch['M2'])
c = pd.json_normalize(merch['M3'])
d = pd.json_normalize(merch['M4'])
e = pd.json_normalize(merch['M5'])
f = pd.json_normalize(merch['M6'])
g = pd.json_normalize(merch['M7'])
h = pd.json_normalize(merch['M8'])
i = pd.json_normalize(merch['M9'])
j = pd.json_normalize(merch['M10'])
k = pd.json_normalize(merch['M11'])
l = pd.json_normalize(merch['M12'])
merchcombo = pd.concat([a,b,c,d,e,f,g,h,i,j,k,l],ignore_index=True, axis =1)
merchcombo
merchcombo = merchcombo.rename(columns = {2:'M1',5:'M2',8:'M3',11:'M4',14:'M5',17:'M6',20:'M7',23:'M8',26:'M9',29:'M10',32:'M11',35:'M12'})
merchcombo['merch_type'] = (merchcombo['M1'].map(str) + ',' + merchcombo['M2'].map(str) + ',' + merchcombo['M3'].map(str) + ','+ merchcombo['M4'].map(str) + ','
+ merchcombo['M5'].map(str) + ',' + merchcombo['M6'].map(str)+ ',' + merchcombo['M7'].map(str) + ',' + merchcombo['M8'].map(str)
+ ',' + merchcombo['M9'].map(str)+ ',' + merchcombo['M10'].map(str)+ ',' + merchcombo['M11'].map(str)+ ',' + merchcombo['M12'].map(str))
merchcombo['merch_type'] = merchcombo['merch_type'].str.replace(",nan",'')
merchdf = merchcombo['merch_type'].to_frame()
#merchdf
### clean description column
brewdf['description'] = brewdf['description'].str.replace(r'<[^<>]*>', '', regex=True)
#brewdf
### replace nan with null
brewdf = brewdf.replace('nan',np.nan)
brewdf = brewdf.replace('None',np.nan)
brewdf
cleanedbrewdf = brewdf.drop(columns = {'food_type','tour_type','otherdrinks_type','articles','merch_type','togo_type','patio','visitors','parking','dogs'})
mergedbrewdf = pd.concat([cleanedbrewdf,dogsdf,parkdf,fooddf,patiodf,
visitordf,tourdf,odrinkdf,togodf,merchdf,],ignore_index=False,axis=1)
mergedbrewdf
### remove non-existing
finalbrewdf = mergedbrewdf.loc[(mergedbrewdf['lon'].notnull())].copy()
finalbrewdf['lon'] = finalbrewdf['lon'].astype(float)
finalbrewdf['lat'] = finalbrewdf['lat'].astype(float)
finalbrewdf
Can someone please point me in the right direction for the hours and memberships columns? Also, is there a more efficient way to loop through these different columns? They have different nested list-dict lengths, which I thought might prevent me from writing a function.
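No answer was recorded here, but on the "more efficient way" part: the repeated normalize/rename/concat pattern above could be collapsed into one helper that joins the 'name' field of each list-of-dicts cell (the same field the dogs/parking/food cleanups keep). This is only a sketch under that assumption; the 'hours' lines additionally assume each cell is a flat dict keyed by day, which may not match the real feed.

import pandas as pd

def join_names(cell, key="name"):
    # join the `key` field of a list-of-dicts cell into one comma-separated string
    if isinstance(cell, list) and cell:
        return ",".join(str(d.get(key, "")) for d in cell if isinstance(d, dict))
    return None

# one loop instead of one hand-written block per column
for col in ["dogs", "parking", "food_type", "patio", "visitors",
            "tour_type", "otherdrinks_type", "togo_type", "merch_type"]:
    brewdf[col + "_clean"] = brewdf[col].apply(join_names)

# 'hours': if each cell is a flat dict like {"Mon": "11-9", ..., "Sun": "12-8"},
# json_normalize expands it straight into one column per day
hours_df = pd.json_normalize(brewdf["hours"].tolist()).add_suffix("_Hours")
brewdf = pd.concat([brewdf.drop(columns=["hours"]), hours_df], axis=1)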

Python Scrapy Spider Not Following Correct Link

I am trying to scrape the data off of this post, but I am having an issue scraping the comments. The pagination of the comments is determined by the "page=1" at the end of the URL. I noticed that if "page=0" is used, it loads all the comments on one page, which is really nice. However, my Scrapy script will only scrape the comments from the first page, no matter what. Even if I change the link to "page=2", it still only scrapes the comments from the first page. I cannot figure out why this is happening.
import scrapy
from scrapy.crawler import CrawlerProcess


class IdeaSpider(scrapy.Spider):
    name = "IdeaSpider"

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.games2gether.com/amplitude-studios/endless-space-2/ideas/1850-force-infinite-actions-to"
                "-the-bottom-of-the-queue?page=0", callback=self.parse_idea)

    # parses title, post, status, author, date
    def parse_idea(self, response):
        post_author = response.xpath('//span[@class = "username-content"]/text()')
        temp_list.append(post_author.extract_first())

        post_categories = response.xpath('//a[@class = "list-tags-item ng-star-inserted"]/text()')
        post_categories_ext = post_categories.extract()
        if len(post_categories_ext) > 1:
            post_categories_combined = ""
            for category in post_categories_ext:
                post_categories_combined = post_categories_combined + category + ", "
            temp_list.append(post_categories_combined)
        else:
            temp_list.append(post_categories_ext[0])

        post_date = response.xpath('//div[@class = "time-date"]/text()')
        temp_list.append(post_date.extract_first())

        post_title = response.xpath('//h1[@class = "title"]/text()')
        temp_list.append(post_title.extract()[0])

        post_body = response.xpath('//article[@class = "post-list-item clearfix ng-star-inserted"]//div[@class = '
                                   '"post-list-item-message-content post-content ng-star-inserted"]//text()')
        post_body_ext = post_body.extract()
        if len(post_body_ext) > 1:
            post_body_combined = ""
            for text in post_body_ext:
                post_body_combined = post_body_combined + " " + text
            temp_list.append(post_body_combined)
        else:
            temp_list.append(post_body_ext[0])

        post_status = response.xpath('//p[@class = "status-title"][1]/text()')
        if len(post_status.extract()) != 0:
            temp_list.append(post_status.extract()[0])
        else:
            temp_list.append("no status")

        dev_name = response.xpath('//div[@class = "ideas-details-status-comment user-role u-bdcolor-2 dev"]//p[@class '
                                  '= "username user-role-username"]/text()')
        temp_list.append(dev_name.extract_first())

        dev_comment = response.xpath('//div[@class = "message post-content ng-star-inserted"]/p/text()')
        temp_list.append(dev_comment.extract_first())

        c_author_index = 0
        c_body_index = 0
        c_author_path = response.xpath('//article[@class = "post-list-item clearfix two-columns '
                                       'ng-star-inserted"]//span[@class = "username-content"]/text()')
        while c_author_index < len(c_author_path):
            comment_author = c_author_path[c_author_index]
            temp_list.append(comment_author.extract())
            c_author_index += 1

            c_body_combined = ""
            c_body_path = '//div[@class = "post-list-comments"]/g2g-comments-item[1]/article[@class = ' \
                          '"post-list-item clearfix two-columns ng-star-inserted"]/div/div//div[@class ' \
                          '="post-list-item-message-content post-content ng-star-inserted"]//text() '
            c_body = response.xpath(c_body_path.replace("1", str(c_body_index + 1)))
            c_body_list = c_body.extract()
            if len(c_body_list) > 1:
                for word in c_body_list:
                    c_body_combined = c_body_combined + " " + word
                temp_list.append(c_body_combined)
                c_body_index += 1
            elif len(c_body_list) != 0:
                temp_list.append(c_body_list[0])
                c_body_index += 1
            elif len(c_body_list) == 0:
                c_body_index += 1
                c_body = response.xpath(c_body_path.replace("1", str(c_body_index + 1)))
                c_body_list = c_body.extract()
                if len(c_body_list) > 1:
                    for word in c_body_list:
                        c_body_combined = c_body_combined + " " + word
                    temp_list.append(c_body_combined)
                    c_body_index += 1


temp_list = list()
all_post_data = list()

process = CrawlerProcess()
process.crawl(IdeaSpider)
process.start()
print(temp_list)
This is because the comment pages are loaded using JavaScript, and Scrapy does not render JavaScript. You could use Splash.
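For illustration only, here is a rough sketch of how that could look with the scrapy-splash plugin, reusing the parse_idea callback from the spider above. It assumes a Splash instance is running and the project settings are configured for scrapy-splash as described in its README; the wait value is an arbitrary placeholder.

import scrapy
from scrapy_splash import SplashRequest

class IdeaSpider(scrapy.Spider):
    name = "IdeaSpider"

    def start_requests(self):
        url = ("https://www.games2gether.com/amplitude-studios/endless-space-2/ideas/"
               "1850-force-infinite-actions-to-the-bottom-of-the-queue?page=0")
        # render the page in Splash first so the JavaScript-loaded comments are in the response
        yield SplashRequest(url, callback=self.parse_idea, args={"wait": 2})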

KeyError: 'main' when trying an OpenWeatherMap Python API tutorial

I'm currently trying to run through a tutorial on how to set up OpenWeatherMap via Python, but I'm getting a KeyError and I was wondering if someone could help me out.
The error I am getting is KeyError: 'main'.
In the actual code I have put in my API key but have taken it out here for obvious reasons.
api_key = ""
base_url = "http://api.openweathermap.org/data/2.5/weather?"
city_name = input("Enter city name : ")
complete_url = base_url + "appid=" + api_key + "&q=" + city_name
response = requests.get(complete_url)
x = response.json()
if x["cod"] != "404":
y = x["main"]
current_temperature = y["temp"]
current_pressure = y["pressure"]
current_humidiy = y["humidity"]
z = x["weather"]
weather_description = z[0]["description"]
print(" Temperature (in kelvin unit) = " +
str(current_temperature) +
"\n atmospheric pressure (in hPa unit) = " +
str(current_pressure) +
"\n humidity (in percentage) = " +
str(current_humidiy) +
"\n description = " +
str(weather_description))
else:
print(" City Not Found ")
The following works:
import requests

OPEN_WEATHER_MAP_APIKEY = '<your key>'

def get_weather_data_by_location(lat, long):
    url = f'https://api.openweathermap.org/data/2.5/onecall?lat={lat}&lon={long}&appid={OPEN_WEATHER_MAP_APIKEY}&units=metric'
    print(f"Getting data via {url}")
    r = requests.get(url)
    if r.status_code == 200:
        return r.json()
    else:
        return None

if __name__ == '__main__':
    print("Getting Weather Data")
    print(get_weather_data_by_location('22.300910042194783', '114.17070449064359'))
I have a beginners' guide to OpenWeatherMap which you can follow here: https://pythonhowtoprogram.com/get-weather-forecasts-and-show-it-on-a-chart-using-python-3/
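As a side note on the original KeyError (a sketch building on the question's snippet, not part of the answer above): when the request itself fails, for example because of a missing or invalid API key, the response body typically has no "main" entry at all, only an error code and message, so it is worth checking before indexing:

response = requests.get(complete_url)
x = response.json()
if response.status_code == 200 and "main" in x:
    print("Temperature (in kelvin unit) = " + str(x["main"]["temp"]))
else:
    # error payloads usually carry just "cod" and "message"
    print("Request failed: " + str(x.get("cod")) + " " + str(x.get("message")))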

Craigslist multi-city search Python script. Adding a GUI and more cities

So I am very new to Python programming and am just trying to figure out a good project to get me started. I wanted to attempt searching Craigslist in multiple cities. I found a dated example online and used it as a starting point. The script below currently only has cities in Ohio, but I plan on adding all US cities. The "homecity" is currently set to Dayton. It asks for a search radius, search term, min price, and max price. Based on the lat/lon of the cities, it only searches cities within the radius. I also have it searching all pages if there is more than one page of results. At the end it creates an HTML file of the results and opens it in a browser. It seems to be working fine, but I was hoping to get feedback on whether I am doing everything efficiently. I would also like to add a GUI to capture the user inputs, but I'm not even sure where to start. Any advice there? Thanks!
#Craigslist Search
"""
Created on Thu Mar 27 11:56:54 2014
used http://cal.freeshell.org/2010/05/python-craigslist-search-script-version-2/ as
starting point.
"""
import re
import os
import os.path
import time
import urllib2
import webbrowser
from math import *
results = re.compile('<p.+</p>', re.DOTALL) #Find pattern for search results.
prices = re.compile('<span class="price".*?</span>', re.DOTALL) #Find pattern for prices.
pages = re.compile('button pagenum">.*?</span>')
delay = 10
def search_all():
    for city in list(set(searchcities)): #add another for loop for all pages
        #Setup headers to spoof Mozilla
        dat = None
        ua = "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.9.1.4) Gecko/20091007 Firefox/3.5.4"
        head = {'User-agent': ua}
        errorcount = 0
        #Do a quick search to see how many pages of results
        url = "http://" + city + ".craigslist.org/search/" + "sss?s=" + "0" + "&catAbb=sss&query=" + query.replace(' ', '+') + "&minAsk=" + pricemin + "&maxAsk=" + pricemax
        req = urllib2.Request(url, dat, head)
        try:
            response = urllib2.urlopen(req)
        except urllib2.HTTPError:
            if errorcount < 1:
                errorcount = 1
                print "Request failed, retrying in " + str(delay) + " seconds"
                time.sleep(int(delay))
                response = urllib2.urlopen(req)
        msg = response.read()
        errorcount = 0
        pglist = pages.findall(msg)
        pg = pglist.pop(0)
        if pg.find('of') == -1:
            pg = 100
        else:
            pg = pg[int((pg.find('of'))+3) : int((pg.find('</span>')))]
        if int(pg)/100 == 0:
            pg = 100
        numpages = range(int(pg)/100)
        for page in numpages:
            print "searching...."
            page = page*100
            url = "http://" + city + ".craigslist.org/search/" + "sss?s=" + str(page) + "&catAbb=sss&query=" + query.replace(' ', '+') + "&minAsk=" + pricemin + "&maxAsk=" + pricemax
            cityurl = "http://" + city + ".craigslist.org"
            errorcount = 0
            #Get page
            req = urllib2.Request(url, dat, head)
            try:
                response = urllib2.urlopen(req)
            except urllib2.HTTPError:
                if errorcount < 1:
                    errorcount = 1
                    print "Request failed, retrying in " + str(delay) + " seconds"
                    time.sleep(int(delay))
                    response = urllib2.urlopen(req)
            msg = response.read()
            errorcount = 0
            res = results.findall(msg)
            res = str(res)
            res = res.replace('[', '')
            res = res.replace(']', '')
            res = res.replace('<a href="', '<a href="' + cityurl)
            #res = re.sub(prices,'',res)
            res = "<BLOCKQUOTE>"*6 + res + "</BLOCKQUOTE>"*6
            outp = open("craigresults.html", "a")
            outp.write(city)
            outp.write(str(res))
            outp.close()
def calcDist(lat_A, long_A, lat_B, long_B): #This was found at zip code database project
    distance = (sin(radians(lat_A)) *
                sin(radians(lat_B)) +
                cos(radians(lat_A)) *
                cos(radians(lat_B)) *
                cos(radians(long_A - long_B)))
    distance = (degrees(acos(distance))) * 69.09
    return distance
cities = """akroncanton:41.043955,-81.51919
ashtabula:41.871212,-80.79178
athensohio:39.322847,-82.09728
cincinnati:39.104410,-84.50774
cleveland:41.473451,-81.73580
columbus:39.990764,-83.00117
dayton:39.757758,-84.18848
limaohio:40.759451,-84.08458
mansfield:40.759156,-82.51118
sandusky:41.426460,-82.71083
toledo:41.646649,-83.54935
tuscarawas:40.397916,-81.40527
youngstown:41.086279,-80.64563
zanesville:39.9461,-82.0122
"""
if os.path.exists("craigresults.html") == True:
    os.remove("craigresults.html")
homecity = "dayton"
radius = raw_input("Search Distance from Home in Miles: ")
query = raw_input("Search Term: ")
pricemin = raw_input("Min Price: ")
pricemax = raw_input("Max Price: ")
citylist = cities.split()
#create dictionary
citdict = {}
for city in citylist:
    items = city.split(":")
    citdict[items[0]] = items[1]
homecord = str(citdict.get(homecity)).split(",")
homelat = float(homecord[0])
homelong = float(homecord[1])
searchcities = []
for key, value in citdict.items():
    distcity = key
    distcord = str(value).split(",")
    distlat = float(distcord[0])
    distlong = float(distcord[1])
    dist = calcDist(homelat, homelong, distlat, distlong)
    if dist < int(radius):
        searchcities.append(key)
print searchcities
search_all()
webbrowser.open_new('craigresults.html')
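On the GUI question, no answer was posted here, but as a rough starting point only: a minimal Tkinter form (Python 2's Tkinter module, to match the script above) could collect the same four inputs and then hand them to the existing search code. The widget names and layout are just placeholders, not a recommendation.

import Tkinter as tk

def run_search():
    # read the four inputs, then reuse the existing radius/query/pricemin/pricemax logic
    print "radius:", radius_entry.get()
    print "query:", query_entry.get()
    print "min price:", minprice_entry.get()
    print "max price:", maxprice_entry.get()
    root.destroy()

root = tk.Tk()
root.title("Craigslist Search")

tk.Label(root, text="Search Distance from Home in Miles:").grid(row=0, column=0, sticky="w")
radius_entry = tk.Entry(root)
radius_entry.grid(row=0, column=1)

tk.Label(root, text="Search Term:").grid(row=1, column=0, sticky="w")
query_entry = tk.Entry(root)
query_entry.grid(row=1, column=1)

tk.Label(root, text="Min Price:").grid(row=2, column=0, sticky="w")
minprice_entry = tk.Entry(root)
minprice_entry.grid(row=2, column=1)

tk.Label(root, text="Max Price:").grid(row=3, column=0, sticky="w")
maxprice_entry = tk.Entry(root)
maxprice_entry.grid(row=3, column=1)

tk.Button(root, text="Search", command=run_search).grid(row=4, column=1)
root.mainloop()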

What is the best approach to parse a non-ordered HTML page with Python?

I'm trying to parse the following HTML pages using BeautifulSoup (I'm going to parse a bulk of pages).
I need to save all of the fields on every page, but they can change dynamically (on different pages).
Here is an example of a page - Page 1
and a page with a different field order - Page 2
I've written the following code to parse the page.
import requests
from bs4 import BeautifulSoup
PTiD = 7680560
url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/PTO/srchnum.htm&r=1&f=G&l=50&s1=" + str(PTiD) + ".PN.&OS=PN/" + str(PTiD) + "&RS=PN/" + str(PTiD)
res = requests.get(url, prefetch = True)
raw_html = res.content
print "Parser Started.. "
bs_html = BeautifulSoup(raw_html, "lxml")
#Initialize all the Search Lists
fonts = bs_html.find_all('font')
para = bs_html.find_all('p')
bs_text = bs_html.find_all(text=True)
onlytext = [x for x in bs_text if x != '\n' and x != ' ']
#Initialize the Indexes
AppNumIndex = onlytext.index('Appl. No.:\n')
FiledIndex = onlytext.index('Filed:\n ')
InventorsIndex = onlytext.index('Inventors: ')
AssigneeIndex = onlytext.index('Assignee:')
ClaimsIndex = onlytext.index('Claims')
DescriptionIndex = onlytext.index(' Description')
CurrentUSClassIndex = onlytext.index('Current U.S. Class:')
CurrentIntClassIndex = onlytext.index('Current International Class: ')
PrimaryExaminerIndex = onlytext.index('Primary Examiner:')
AttorneyOrAgentIndex = onlytext.index('Attorney, Agent or Firm:')
RefByIndex = onlytext.index('[Referenced By]')
#~~Title~~
for a in fonts:
    if a.has_key('size') and a['size'] == '+1':
        d_title = a.string
print "title: " + d_title
#~~Abstract~~~
d_abstract = para[0].string
print "abstract: " + d_abstract
#~~Assignee Name~~
d_assigneeName = onlytext[AssigneeIndex +1]
print "as name: " + d_assigneeName
#~~Application number~~
d_appNum = onlytext[AppNumIndex + 1]
print "ap num: " + d_appNum
#~~Application date~~
d_appDate = onlytext[FiledIndex + 1]
print "ap date: " + d_appDate
#~~ Patent Number~~
d_PatNum = onlytext[0].split(':')[1].strip()
print "patnum: " + d_PatNum
#~~Issue Date~~
d_IssueDate = onlytext[10].strip('\n')
print "issue date: " + d_IssueDate
#~~Inventors Name~~
d_InventorsName = ''
for x in range(InventorsIndex+1, AssigneeIndex, 2):
    d_InventorsName += onlytext[x]
print "inv name: " + d_InventorsName
#~~Inventors City~~
d_InventorsCity = ''
for x in range(InventorsIndex+2, AssigneeIndex, 2):
    d_InventorsCity += onlytext[x].split(',')[0].strip().strip('(')
d_InventorsCity = d_InventorsCity.strip(',').strip().strip(')')
print "inv city: " + d_InventorsCity
#~~Inventors State~~
d_InventorsState = ''
for x in range(InventorsIndex+2, AssigneeIndex, 2):
    d_InventorsState += onlytext[x].split(',')[1].strip(')').strip() + ','
d_InventorsState = d_InventorsState.strip(',').strip()
print "inv state: " + d_InventorsState
#~~ Asignee City ~~
d_AssigneeCity = onlytext[AssigneeIndex + 2].split(',')[1].strip().strip('\n').strip(')')
print "asign city: " + d_AssigneeCity
#~~ Assignee State~~
d_AssigneeState = onlytext[AssigneeIndex + 2].split(',')[0].strip('\n').strip().strip('(')
print "asign state: " + d_AssigneeState
#~~Current US Class~~
d_CurrentUSClass = ''
for x in range(CurrentUSClassIndex + 1, CurrentIntClassIndex):
    d_CurrentUSClass += onlytext[x]
print "cur us class: " + d_CurrentUSClass
#~~ Current Int Class~~
d_CurrentIntlClass = onlytext[CurrentIntClassIndex +1]
print "cur intl class: " + d_CurrentIntlClass
#~~~Primary Examiner~~~
d_PrimaryExaminer = onlytext[PrimaryExaminerIndex +1]
print "prim ex: " + d_PrimaryExaminer
#~~d_AttorneyOrAgent~~
d_AttorneyOrAgent = onlytext[AttorneyOrAgentIndex +1]
print "agent: " + d_AttorneyOrAgent
#~~ Referenced by ~~
d_ReferencedBy = ''
for x in range(RefByIndex + 2, RefByIndex + 400):
    if (('Foreign' in onlytext[x]) or ('Primary' in onlytext[x])):
        break
    else:
        d_ReferencedBy += onlytext[x]
print "ref by: " + d_ReferencedBy
#~~Claims~~
d_Claims = ''
for x in range(ClaimsIndex, DescriptionIndex):
    d_Claims += onlytext[x]
print "claims: " + d_Claims
I insert all the text from the page into a list (using BeautifulSoup's find_all(text=True)). Then I try to find the indexes of the field names, go over the list from that location, and save the members to a string until I reach the next field index.
When I tried the code on several different pages, I noticed that the structure of the members changes, and I can't find their indexes in the list.
For example, I search for the index of '123' and on some pages it shows up in the list as '12', '3'.
Can you think of any other way to parse the page that would be generic?
Thanks.
I think the easiest solution is to use the pyquery library:
http://packages.python.org/pyquery/api.html
You can select the elements of the page using jQuery selectors.
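For illustration (a minimal sketch, not from the original answer), pyquery usage looks roughly like this, reusing raw_html from the question's code:

from pyquery import PyQuery as pq

doc = pq(raw_html)
abstract = doc('p').eq(0).text()   # first <p>, analogous to para[0] above
print "abstract: " + abstract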
If you are using BeautifulSoup and have the DOM <p>123</p>, find_all(text=True) will give you ['123'].
However, if you have the DOM <p>12<b>3</b></p>, which has the same semantics as the previous one, BeautifulSoup will give you ['12', '3'].
Maybe you could just find exactly which tag stops you from getting the complete ['123'], and ignore/eliminate that tag first.
Some example code on how to eliminate the <b> tag:
import re
html='<p>12<b>3</b></p>'
reExp='<[\/\!]?b[^<>]*?>'
print re.sub(reExp,'',html)
For patterns, you could use this:
import re
patterns = '<TD align=center>(?P<VALUES_TO_FIND>.*?)<\/TD>'
print re.findall(patterns, your_html)
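Another option worth noting (again a sketch, not part of the original answer): instead of stripping tags with a regex, BeautifulSoup's get_text() joins the text of nested tags, so <p>12<b>3</b></p> comes back as a single '123':

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>12<b>3</b></p>', 'lxml')
print soup.p.get_text()   # prints: 123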
