G'day guys, I'm working on a Python project that pulls weather data from BOM (https://bom.gov.au).
The script works correctly; however, I would like it to be able to use part of the URL within the POST request, i.e. when the user navigates to https://example.com/taf/ymml, the script runs and uses YMML within the POST.
The script I am using is below. I would like to swap out 'YSSY' in myobj for something that pulls it from the URL that the user navigates to.
import requests
import re

url = 'http://www.bom.gov.au/aviation/php/process.php'
myobj = {'keyword': 'YSSY', 'type': 'search', 'page': 'TAF'}
headers = {'User-Agent': 'Chrome/102.0.0.0'}

x = requests.post(url, data=myobj, headers=headers)
content = x.text

stripped = re.sub('<[^<]+?>', ' ', content)    # strip HTML tags
split_string = stripped.split("METAR", 1)      # keep everything before the METAR section
substring = split_string[0]
print(substring)
Any ideas?
OK, so I've managed to get this working using FastAPI. When a user navigates to example.com/taf/ymml, the site returns the TAF for YMML in plain text; the ICAO code can be substituted for any Australian aerodrome. One thing I haven't figured out is how to remove the square brackets around the TAF, but that is a problem for another time.
from fastapi import FastAPI
import requests
from bs4 import BeautifulSoup

app = FastAPI()

@app.get("/taf/{icao}")
async def read_icao(icao):
    url = 'http://www.bom.gov.au/aviation/php/process.php'
    myobj = {'keyword': icao, 'type': 'search', 'page': 'TAF'}
    headers = {'User-Agent': 'Chrome/102.0.0.0'}
    x = requests.post(url, data=myobj, headers=headers)
    content = x.text
    split_string = content.split("METAR", 1)
    substring = split_string[0]
    soup = BeautifulSoup(substring, 'html.parser')
    for br in soup('br'):
        br.replace_with(' ')
    # Create TAFs list.
    tafs = []
    for taf in soup.find_all('p', class_="product"):
        full_taf = taf.get_text()
        tafs.append(full_taf.rstrip())
    return {tuple(tafs)}
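The square brackets most likely come from returning {tuple(tafs)}, which FastAPI serialises as nested JSON arrays. A minimal sketch of one way to drop them, assuming plain text is all that's wanted, is to return a PlainTextResponse with the TAFs joined into a single string:

from fastapi import FastAPI
from fastapi.responses import PlainTextResponse
import requests
from bs4 import BeautifulSoup

app = FastAPI()

@app.get("/taf/{icao}", response_class=PlainTextResponse)
async def read_icao(icao: str):
    url = 'http://www.bom.gov.au/aviation/php/process.php'
    myobj = {'keyword': icao, 'type': 'search', 'page': 'TAF'}
    headers = {'User-Agent': 'Chrome/102.0.0.0'}
    x = requests.post(url, data=myobj, headers=headers)
    soup = BeautifulSoup(x.text.split("METAR", 1)[0], 'html.parser')
    for br in soup('br'):
        br.replace_with(' ')
    tafs = [p.get_text().rstrip() for p in soup.find_all('p', class_="product")]
    # Joining the TAFs avoids the JSON list (square bracket) wrapping entirely.
    return "\n".join(tafs)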
I'm using requests and regex to scrape data from an entire website and then save it to a JSON file, hosted on GitHub, so I and anyone else can access the data from other devices.
The first thing I tried was just opening every single page on the website and getting all the data I want, but I found that to be unnecessary, so I decided to make two scripts: the first one finds the URL of every page on the site, and the second one will be the one called, which will then scrape the called URL. What I'm having trouble with right now is getting my data formatted correctly for the JSON file. Currently this is a sample of what the output looks like:
{
"Console":"/neo-geo-aes",
"Call ID":"62815",
"URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle"
}{
"Console":"/neo-geo-cd",
"Call ID":"62817",
"URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle-2"
}{
"Console":"/neo-geo-pocket-color",
"Call ID":"62578",
"URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/batman"
}{
"Console":"/playstation",
"Call ID":"62580",
"URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/batman-forever"
}
I've looked into this a lot and can't find a solution. Here's the code in question:
import re
import requests
import json

## The base URL
URL = "https://www.pricecharting.com/"
r = requests.get(URL)
htmltext = r.text

## Find all system URLs
dataUrl = re.findall('(?<=<li><a href="\/console).*(?=">)', htmltext)
print(dataUrl)

## For each item (number of consoles) find games
for i in range(len(dataUrl)):
    ## Make console URL
    newUrl = ("https://www.pricecharting.com/console" + dataUrl[i])
    req = requests.get(newUrl)
    newHtml = req.text
    ## Get item URLs
    urlOne = re.findall('(?<=<a href="\/game).*(?=">)', newHtml)
    itemId = re.findall('(?<=tr id="product-).*(?=" data)', newHtml)
    ## For every item in list (items per console)
    out_list = []
    for i in range(len(urlOne)):
        ## Make item URL
        itemUrl = ("https://www.pricecharting.com/game" + urlOne[i])
        callId = (itemId[i])
        ## Format for JSON
        json_file_content = {}
        json_file_content['Console'] = dataUrl[i]
        json_file_content['Call ID'] = callId
        json_file_content['URL'] = itemUrl
        out_list.append(json_file_content)
    data_json_filename = 'docs/result.json'
    with open(data_json_filename, 'a') as data_json_file:
        json.dump(out_list, data_json_file, indent=4)
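For what it's worth, the }{ pattern appears when separate JSON documents are written one after another to the same file (append mode plus repeated dumps). A minimal sketch of one way to get a single valid JSON array is to collect everything into one list and dump it once; the records below are just copied from the sample output above:

import json

# Stand-in records copied from the sample output; in the real script these
# would be appended inside the scraping loops instead.
all_items = [
    {'Console': '/neo-geo-aes', 'Call ID': '62815',
     'URL': 'https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle'},
    {'Console': '/neo-geo-cd', 'Call ID': '62817',
     'URL': 'https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle-2'},
]

# Open once in 'w' mode and dump once: the result is one well-formed JSON array.
with open('docs/result.json', 'w') as data_json_file:
    json.dump(all_items, data_json_file, indent=4)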
When I search for books with a single name (e.g. bluets) my code works fine, but when I search for books that have two names or spaces (e.g. white whale) I get an error (jinja2 syntax). How do I solve this error?
#app.route("/book", methods["GET", "POST"])
def get_books():
api_key =
os.environ.get("API_KEY")
if request.method == "POST":
book = request.form.get("book")
url =f"https://www.googleapis.com/books/v1/volumes?q={book}:keyes&key={api_key}"
response =urllib.request.urlopen(url)
data = response.read()
jsondata = json.loads(data)
return render_template ("book.html", books=jsondata["items"]
I tried to search for similar cases and only found one solution, but I didn't understand it.
Here is my error message:
http.client.InvalidURL
http.client.InvalidURL: URL can't contain control characters. '/books/v1/volumes?q=white whale:keyes&key=AIzaSyDtjvhKOniHFwkIcz7-720bgtnubagFxS8' (found at least ' ')
Some characters in a URL need to be encoded; in your situation you have to use + or %20 instead of the space.
This URL has %20 instead of the space and it works for me. If I use + instead, it also works:
import urllib.request
import json
url = 'https://www.googleapis.com/books/v1/volumes?q=white%20whale:keyes&key=AIzaSyDtjvhKOniHFwkIcz7-720bgtnubagFxS8'
#url = 'https://www.googleapis.com/books/v1/volumes?q=white+whale:keyes&key=AIzaSyDtjvhKOniHFwkIcz7-720bgtnubagFxS8'
response = urllib.request.urlopen(url)
text = response.read()
data = json.loads(text)
print(data)
With requests you don't even have to do it manually, because it encodes the URL automatically:
import requests
url = 'https://www.googleapis.com/books/v1/volumes?q=white whale:keyes&key=AIzaSyDtjvhKOniHFwkIcz7-720bgtnubagFxS8'
r = requests.get(url)
data = r.json()
print(data)
You may use urllib.parse.urlencode() to make sure all characters are correctly encoded.
import urllib.parse
import urllib.request
import json
payload = {
'q': 'white whale:keyes',
'key': 'AIzaSyDtjvhKOniHFwkIcz7-720bgtnubagFxS8',
}
query = urllib.parse.urlencode(payload)
url = 'https://www.googleapis.com/books/v1/volumes?' + query
response = urllib.request.urlopen(url)
text = response.read()
data = json.loads(text)
print(data)
And the same with requests: passing the parameters via params means it doesn't need manual encoding either.
import requests
payload = {
'q': 'white whale:keyes',
'key': 'AIzaSyDtjvhKOniHFwkIcz7-720bgtnubagFxS8',
}
url = 'https://www.googleapis.com/books/v1/volumes'
r = requests.get(url, params=payload)
data = r.json()
print(data)
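Applied back to the original Flask route, a hedged sketch (keeping your :keyes qualifier and API_KEY environment variable, and showing only the POST branch as in the original) could look like this:

import os
import json
import urllib.parse
import urllib.request

from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/book", methods=["GET", "POST"])
def get_books():
    api_key = os.environ.get("API_KEY")
    if request.method == "POST":
        book = request.form.get("book")
        # urlencode() percent-encodes the space in e.g. "white whale"
        query = urllib.parse.urlencode({"q": f"{book}:keyes", "key": api_key})
        url = "https://www.googleapis.com/books/v1/volumes?" + query
        response = urllib.request.urlopen(url)
        jsondata = json.loads(response.read())
        return render_template("book.html", books=jsondata["items"])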
I've written a script in Python to scrape the result populated upon filling in the two input boxes, zipcode and distance, with 66109 and 10000. When I try the inputs manually the site does display results, but when I try the same using the script I get nothing. The script throws no error either. What might be the issue here?
Website link
I've tried with:
import requests
from bs4 import BeautifulSoup

url = 'https://www.sart.org/clinic-pages/find-a-clinic/'

payload = {
    'zip': '66109',
    'strdistance': '10000',
    'SelectedState': 'Select State or Region'
}

def get_clinics(link):
    session = requests.Session()
    response = session.post(link, data=payload, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(response.text, "lxml")
    item = soup.select_one(".clinics__search-meta").text
    print(item)

if __name__ == '__main__':
    get_clinics(url)
I'm only after the line "Within 10000 miles of 66109 there are 383 clinics." that is generated when the search is made.
I changed the URL and the request method to GET and it worked for me:
def get_clinics(link):
    session = requests.Session()
    response = session.get(link, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(response.text, "lxml")
    item = soup.select_one(".clinics__search-meta").text
    print(item)

url = 'https://www.sart.org/clinic-pages/find-a-clinic?zip=66109&strdistance=10000&SelectedState=Select+State+or+Region'
get_clinics(url)
Including cookies is one of the main concerns here. If you do it the right way, you can get a valid response following the approach you started with. Here is the working code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.sart.org/clinic-pages/find-a-clinic/'

payload = {
    'zip': '66109',
    'strdistance': '10000',
    'SelectedState': 'Select State or Region'
}

def get_clinics(link):
    with requests.Session() as s:
        res = s.get(link)
        req = s.post(link, data=payload, cookies=res.cookies.get_dict())
        soup = BeautifulSoup(req.text, "lxml")
        item = soup.select_one(".clinics__search-meta").get_text(strip=True)
        print(item)

if __name__ == '__main__':
    get_clinics(url)
In my code, a user inputs a search term and the get_all_links function parses the HTML response and extracts the links that start with 'http'. When req is replaced with a hard-coded URL such as:
content = urllib.request.urlopen("http://www.ox.ac.uk")
the program returns a list of properly formatted links correctly. However, when passing in req, no links are returned. I suspect this may be a formatting blip.
Here is my code:
import urllib.request

def get_all_links(s):                    # function to get all the links
    d = 0
    links = []                           # getting all links into a list
    while d != -1:                       # until d is -1, i.e. no links left in that page
        d = s.find('<a href=', d)        # if <a href is found
        start = s.find('"', d)           # start will be the next character
        end = s.find('"', start + 1)     # end will be up to the closing "
        if d != -1:                      # d is not -1
            d += 1
            if s[start + 1] == 'h':      # add the link which starts with http only
                links.append(s[start + 1:end])  # to link list
    return links                         # return list

def main():
    term = input('Enter a search term: ')
    url = 'http://www.google.com/search'
    value = {'q': term}
    user_agent = 'Mozilla/5.0'
    headers = {'User-Agent': user_agent}
    data = urllib.parse.urlencode(value)
    print(data)
    url = url + '?' + data
    print(url)
    req = urllib.request.Request(url, None, headers)
    content = urllib.request.urlopen(req)
    s = content.read()
    print(s)
    links = get_all_links(s.decode('utf-8'))
    for i in links:                      # print the returned list
        print(i)

main()
You should use an HTML parser, as suggested in the comments; a library like BeautifulSoup is perfect for this.
I have adapted your code to use BeautifulSoup:
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup

def get_all_links(s):
    soup = BeautifulSoup(s, "html.parser")
    return soup.select('a[href^="http"]')  # select all anchor tags whose href attribute starts with 'http'

def main():
    term = input('Enter a search term: ')
    url = 'http://www.google.com/search'
    value = {'q': term}
    user_agent = 'Mozilla/5.0'
    headers = {'User-Agent': user_agent}
    data = urllib.parse.urlencode(value)
    print(data)
    url = url + '?' + data
    print(url)
    req = urllib.request.Request(url, None, headers)
    content = urllib.request.urlopen(req)
    s = content.read()
    print(s)
    links = get_all_links(s.decode('utf-8'))
    for i in links:  # print the returned list
        print(i)

main()
It uses the select method of the BeautifulSoup library and returns a list of selected elements (in your case anchor tags).
Using a library like BeautifulSoup not only makes this easier, it also lets you build much more complex selections. Imagine how you would have to change your code if you wanted to select all links whose href attribute contains the word "google" or "code"; with CSS selectors that is a one-line change, as sketched below.
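For instance, a minimal sketch using the CSS "contains" attribute operator (the sample HTML here is my own illustration):

from bs4 import BeautifulSoup

html = '<a href="https://code.google.com">x</a> <a href="https://example.com">y</a>'
soup = BeautifulSoup(html, "html.parser")

# *= is the CSS "contains" operator; the comma combines the two selectors
links = soup.select('a[href*="google"], a[href*="code"]')
print(links)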
You can read the BeautifulSoup documentation here.
I am requesting an Ajax website with a Python script, fetching the cities and branch offices of http://www.yurticikargo.com/bilgi-servisleri/Sayfalar/en-yakin-sube.aspx
I completed the first step by posting
{cityID: 34} to this URL and fetching the JSON output:
http://www.yurticikargo.com/_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-sswservices.aspx/GetTownByCity
But I cannot retrieve the JSON output with Python, although I get it successfully with the Chrome Advanced REST Client extension when posting {cityID:54,townID:5416,unitOnDutyFlag:null,closestFlag:2} to
http://www.yurticikargo.com/_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-unitservices.aspx/GetUnit
All of the source code is here:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import json

class Yurtici(object):
    baseUrl = 'http://www.yurticikargo.com/'
    ajaxRoot = '_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-sswservices.aspx/'
    getTown = 'GetTownByCity'
    getUnit = 'GetUnit'
    urlGetTown = baseUrl + ajaxRoot + getTown
    urlGetUnit = baseUrl + ajaxRoot + getUnit
    headers = {'content-type': 'application/json', 'encoding': 'utf-8'}

    def __init__(self):
        pass

    def ilceler(self, plaka=34):  # Default testing value
        payload = {'cityId': plaka}
        url = self.urlGetTown
        r = requests.post(url, data=json.dumps(payload), headers=self.headers)
        return r.json()  # OK

    def subeler(self, ilceNo=5902):  # Default testing value
        # 5902 Çerkezköy
        payload = {'cityID': 59, 'townID': 5902, 'unitOnDutyFlag': 'null', 'closestFlag': 0}
        url = self.urlGetUnit
        headers = {'content-type': 'application/json', 'encoding': 'utf-8'}
        r = requests.post(url, data=json.dumps(payload), headers=headers)
        print r.status_code, r.raw.read()

if __name__ == '__main__':
    a = Yurtici()
    print a.ilceler(37)  # OK
    print a.subeler()    # NOT OK !!!
Your code isn't posting to the same URL you're using in your text example.
Let's walk through this backwards. First, let's look at the failing POST.
url = self.urlGetUnit
headers = {'content-type': 'application/json','encoding':'utf-8'}
r = requests.post(url, data=json.dumps(payload), headers=headers)
So we're posting to a URL that is equal to self.urlGetUnit. Ok, let's look at how that's defined:
baseUrl = 'http://www.yurticikargo.com/'
ajaxRoot = '_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-sswservices.aspx/'
getUnit = 'GetUnit'
urlGetUnit = baseUrl + ajaxRoot + getUnit
If you expand urlGetUnit, you get that the URL will be http://www.yurticikargo.com/_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-sswservices.aspx/GetUnit. Let's put this alongside the URL you used in Chrome to compare the differences:
http://www.yurticikargo.com/_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-sswservices.aspx/GetUnit
http://www.yurticikargo.com/_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-unitservices.aspx/GetUnit
See the difference? ajaxRoot is not the same for both URLs. Sort that out and you'll get back a JSON response.
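As a hedged sketch of the fix, you could keep a second root for the unit services (the attribute name ajaxUnitRoot is my own; the path itself is the one you used in Chrome):

class Yurtici(object):
    baseUrl = 'http://www.yurticikargo.com/'
    ajaxRoot = '_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-sswservices.aspx/'
    ajaxUnitRoot = '_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-unitservices.aspx/'
    urlGetTown = baseUrl + ajaxRoot + 'GetTownByCity'
    urlGetUnit = baseUrl + ajaxUnitRoot + 'GetUnit'  # now points at the unit services proxy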