How to use Hacker News API in Python?

Hacker News has released an API. How do I use it in Python?
I want to get all the top posts. I tried using urllib, but I don't think I'm doing it right.
Here's my code:
import urllib2
response = urllib2.urlopen('https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty')
html = response.read()
print response.read()
It just prints an empty string:
''
Edit: I had missed a line; I've updated my code above.

As @jonrsharpe explained, read() is a one-time operation. So if you print html, you will get the list of all the IDs. To get the story behind each ID, you then have to make a separate request for each one.
First convert the received data to a Python list, then go through it:
import json

base_url = 'https://hacker-news.firebaseio.com/v0/item/{}.json?print=pretty'
top_story_ids = json.loads(html)
for story in top_story_ids:
    response = urllib2.urlopen(base_url.format(story))
    print response.read()
Instead of all this, you could use haxor, a Python wrapper for the Hacker News API. The following code will fetch the IDs of all top stories:
from hackernews import HackerNews
hn = HackerNews()
top_story_ids = hn.top_stories()
# >>> top_story_ids
# [8432709, 8432616, 8433237, ...]
Then you can loop through them and print them all, for example:
for story in top_story_ids:
    print hn.get_item(story)
Disclaimer: I wrote haxor.

You should
print html
instead of
print response.read()
Why? Because read is a one-time operation; after you've done it, you can't repeat it:
>>> import urllib2
>>> response = urllib2.urlopen('https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty')
>>> response.read()
'[ 8445087, 8444739, 8444603, 8443981, 8444976, 8443902, 8444252, 8444634, 8444931, 8444272, 8444025, 8441939, 8444510, 8444640, 8443830, 8445076, 8443470, 8444785, 8443028, 8444077, 8444832, 8443841, 8443467, 8443309, 8443187, 8443896, 8444971, 8443360, 8444601, 8443287, 8441095, 8441681, 8441055, 8442712, 8444909, 8443621, 8442596, 8443836, 8442266, 8443298, 8445122, 8443096, 8441699, 8442119, 8442965, 8440486, 8442093, 8443393, 8442067, 8444989, 8440985, 8444622, 8438728, 8442555, 8444880, 8442004, 8443185, 8444370, 8436210, 8437671, 8439641, 8443727, 8441702, 8436309, 8441041, 8437367, 8422087, 8441711, 8438063, 8444212, 8439408, 8442049, 8440989, 8439367, 8438515, 8437403, 8435278, 8442486, 8442730, 8428522, 8438904, 8443450, 8432703, 8430412, 8422928, 8443635, 8439267, 8440191, 8439560, 8437230, 8442556, 8439977, 8444140, 8441682, 8443776, 8441209, 8428632, 8441388, 8422599, 8439547 ]\n'
>>> response.read()
''
In your case, though, you've assigned the string from read to the name html, so you can still access it.
Once you have the story IDs, you can access each one via '.../v0/item/{item number}.json?print=pretty':
>>> response = urllib2.urlopen('https://hacker-news.firebaseio.com/v0/item/8445087.json?print=pretty')
>>> print response.read()
{
"by" : "lalmachado",
"id" : 8445087,
"kids" : [ 8445205, 8445195, 8445173, 8445103 ],
"score" : 21,
"text" : "",
"time" : 1413116430,
"title" : "Show HN: Powerful ASCII art editor designed for the Mac",
"type" : "story",
"url" : "http://monodraw.helftone.com/"
}
You should read through the API documentation before continuing. It's also worth getting to grips with the json module.
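For reference, here is a minimal end-to-end sketch of the same two-step flow (fetch the ID list, then fetch each item), assuming Python 3, where urllib2 became urllib.request:
import json
from urllib.request import urlopen

top_url = 'https://hacker-news.firebaseio.com/v0/topstories.json'
item_url = 'https://hacker-news.firebaseio.com/v0/item/{}.json'

def fetch_json(url):
    # read() is a one-time operation, so read once and decode
    with urlopen(url) as response:
        return json.loads(response.read().decode('utf-8'))

top_story_ids = fetch_json(top_url)
for story_id in top_story_ids[:5]:  # just the first five, to keep output short
    item = fetch_json(item_url.format(story_id))
    print(item.get('title'), item.get('url'))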

Related

Output non-JSON data from regex web scraping to a JSON file

I'm using requests and regex to scrape data from an entire website and then save it to a JSON file, hosted on GitHub so that I and anyone else can access the data from other devices.
The first thing I tried was just to open every single page on the website and get all the data I want, but I found that to be unnecessary, so I decided to make two scripts: the first one finds the URL of every page on the site, and the second one is called with a URL and scrapes it. What I'm having trouble with right now is getting my data formatted correctly for the JSON file. Currently this is a sample of what the output looks like:
{
"Console":"/neo-geo-aes",
"Call ID":"62815",
"URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle"
}{
"Console":"/neo-geo-cd",
"Call ID":"62817",
"URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle-2"
}{
"Console":"/neo-geo-pocket-color",
"Call ID":"62578",
"URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/batman"
}{
"Console":"/playstation",
"Call ID":"62580",
"URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/batman-forever"
}
I've looked into this a lot and can't find a solution. Here's the code in question:
import re
import requests
import json

## The base URL
URL = "https://www.pricecharting.com/"
r = requests.get(URL)
htmltext = r.text

## Find all system URLs
dataUrl = re.findall('(?<=<li><a href="\/console).*(?=">)', htmltext)
print(dataUrl)

## For each item (number of consoles), find games
for i in range(len(dataUrl)):
    ## Make console URL
    newUrl = ("https://www.pricecharting.com/console" + dataUrl[i])
    req = requests.get(newUrl)
    newHtml = req.text
    ## Get item URLs
    urlOne = re.findall('(?<=<a href="\/game).*(?=">)', newHtml)
    itemId = re.findall('(?<=tr id="product-).*(?=" data)', newHtml)
    ## For every item in list (items per console)
    out_list = []
    for i in range(len(urlOne)):
        ## Make item URL
        itemUrl = ("https://www.pricecharting.com/game" + urlOne[i])
        callId = (itemId[i])
        ## Format for JSON
        json_file_content = {}
        json_file_content['Console'] = dataUrl[i]
        json_file_content['Call ID'] = callId
        json_file_content['URL'] = itemUrl
        out_list.append(json_file_content)
    data_json_filename = 'docs/result.json'
    with open(data_json_filename, 'a') as data_json_file:
        json.dump(out_list, data_json_file, indent=4)
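Not from the original thread, but one way out of the concatenated objects shown above, as a sketch: collect every record into a single list across all loops and write the file once, in 'w' mode, with a single json.dump, so the output is one valid JSON array. The records below are hypothetical stand-ins for the scraped data:
import json

# Hypothetical records standing in for the json_file_content dicts built above
records = [
    {"Console": "/neo-geo-aes", "Call ID": "62815", "URL": "https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle"},
    {"Console": "/neo-geo-cd", "Call ID": "62817", "URL": "https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle-2"},
]

# Appending a fresh json.dump per loop iteration is what produced the
# back-to-back }{ objects; dumping one list once yields valid JSON.
with open('docs/result.json', 'w') as data_json_file:
    json.dump(records, data_json_file, indent=4)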

Web scraping using Beautiful Soup and executing multiple functions to add to a list

I'm fairly new to Python and I'm trying to scrape Facebook.
I have created a function for each section to extract, i.e. the poster name, captions, etc.
Here is the main part of the code :
FacebookPosts = []
source_data = driver.page_source
bs_data = bs(source_data, 'html.parser')
NumberofPosts = bs_data.find_all('h2', {"id": re.compile('^jsc_c')})

def _extract_post_name(bs_data):
    postername = ""
    actualPosts = bs_data.find_all('h2', {"id": re.compile('^jsc_c')})
    for posts in actualPosts:
        postername = posts.find('strong').text
        #postername.append(paragraphs)
    return postername

def _extract_post_caption(bs_data):
    captionblocks = bs_data.find_all('div', {"class": re.compile('^ii04i59q')})
    captions = ""
    for captiondivs in captionblocks:
        caption = captiondivs.find('div', attrs={'style': 'text-align: start;'}).text
        #captions.append(caption)
    return caption

for posts in NumberofPosts:
    post = {
        'Original Poster:': _extract_post_name(bs_data),
        'Caption:': _extract_post_caption(bs_data),
    }
    FacebookPosts.append(post)
print(FacebookPosts)
I have other functions for further extraction, but I'll keep it small for simplicity.
The issue at the moment is that with this method, only one entry in the dictionary is shown, and it's always the same one. When I run the code from inside the function directly, without the function, it prints multiple times. I know I can append to the list, but there would be another issue.
Ultimately, what I would like to extract is:
FacebookPosts{
Post1{
Poster Name : Steve
Caption : Text inside Caption
}
Post2: {
Poster Name : Bob
Caption : Please Help me
what's being extracted now is:
FacebookPosts{
Poster Name : Steve
Caption : Text inside Caption
}
Poster Name : Steve
Caption : Text inside Caption
}
For every element found in NumberofPosts
Any help is greatly appreciated; I've been stuck on this problem for days.
I believe my problem is a lack of knowledge about functions and dictionaries/lists: how do you add to a dictionary from multiple sources, such as functions, and keep the entries in the same set?
Oh I think this might be a simple fix brother.
for posts in NumberofPosts:
    post = {
        'Original Poster:': _extract_post_name(bs_data),
        'Caption:': _extract_post_caption(bs_data),
    }
FacebookPosts.append(post)
print(FacebookPosts)
There is an issue here: you need to put the FacebookPosts.append(post) inside the for block, otherwise you're only appending the last post:
for posts in NumberofPosts:
    post = {
        'Original Poster:': _extract_post_name(bs_data),
        'Caption:': _extract_post_caption(bs_data),
    }
    FacebookPosts.append(post)
print(FacebookPosts)
^That should fix it if I'm not mistaken.
I solved the issue. Basically, I had to change NumberofPosts = bs_data.find_all('h2', {"id": re.compile('^jsc_c')}); that selector was matching the h2 headers, which only contained the name of the poster. It has now been changed to bs_data.find_all('div', {"class": 'du4w35lb k4urcfbm l9j0dhe7 sjgh65i0'}), which matches the wrapper of the whole post. I'll leave the post here in case someone needs the code. Thanks for the help.
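To make that fix concrete, here is a hedged sketch of the per-post version: each post wrapper (rather than the whole page) is searched, so every dictionary gets that post's own name and caption. The class names are the ones from this thread and tend to change on Facebook; driver is the Selenium driver from the question:
from bs4 import BeautifulSoup as bs

bs_data = bs(driver.page_source, 'html.parser')
# One wrapper div per post (class string taken from this thread; it rotates often)
post_wrappers = bs_data.find_all('div', {"class": 'du4w35lb k4urcfbm l9j0dhe7 sjgh65i0'})

FacebookPosts = []
for wrapper in post_wrappers:
    name_tag = wrapper.find('strong')
    caption_tag = wrapper.find('div', attrs={'style': 'text-align: start;'})
    FacebookPosts.append({
        'Original Poster:': name_tag.text if name_tag else '',
        'Caption:': caption_tag.text if caption_tag else '',
    })
print(FacebookPosts)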

Parsing HTML using lxml in Python

I'm trying to parse Oxford Dictionary in order to obtain the etymology of a given word.
import lxml.html
from urllib.request import urlopen

class SkipException(Exception):
    def __init__(self, value):
        self.value = value

try:
    doc = lxml.html.parse(urlopen('https://en.oxforddictionaries.com/definition/%s' % "good"))
except SkipException:
    doc = ''

if doc:
    table = []
    trs = doc.xpath("//div[1]/div[2]/div/div/div/div[1]/section[5]/div/p")
I cannot seem to work out how to obtain the string of text I need. I know I'm missing some lines in what I've copied, but I don't fully understand how HTML or lxml work. I would much appreciate it if someone could show me the correct way to solve this.
You don't want to do web scraping here, especially when almost every dictionary has an API. In the case of Oxford, create an account at https://developer.oxforddictionaries.com/, get the API credentials from your account, and do something like this:
import requests
import json

api_base = 'https://od-api.oxforddictionaries.com:443/api/v1/entries/{}/{}'
language = 'en'
word = 'parachute'
headers = {
    'app_id': '',
    'app_key': ''
}
url = api_base.format(language, word)
reply = requests.get(url, headers=headers)
if reply.ok:
    reply_dict = json.loads(reply.text)
    results = reply_dict.get('results')
    if results:
        headword = results[0]
        entries = headword.get('lexicalEntries')[0].get('entries')
        if entries:
            entry = entries[0]
            senses = entry.get('senses')
            if senses:
                sense = senses[0]
                print(sense.get('short_definitions'))
Here's a sample to get you started scraping Oxford dictionary pages:
import lxml.html as lh
from urllib.request import urlopen

url = 'https://en.oxforddictionaries.com/definition/parachute'
html = urlopen(url)
root = lh.parse(html)
body = root.find("body")
elements = body.xpath("//span[@class='ind']")
for element in elements:
    print(element.text)
To find the correct search string, you need to format the HTML so you can see the structure. I used the HTML formatter at https://www.freeformatter.com/html-formatter.html. Looking at the formatted HTML, I could see the definitions were in the span elements with the 'ind' class attribute.
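As an alternative to the online formatter, lxml can pretty-print the page itself; a small sketch, reusing the same parse as the answer above:
import lxml.etree
import lxml.html as lh
from urllib.request import urlopen

root = lh.parse(urlopen('https://en.oxforddictionaries.com/definition/parachute'))
# pretty_print re-indents the document so the nesting is easy to read
print(lxml.etree.tostring(root, pretty_print=True).decode())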

How can I get some data from a YouTube video? (Python)

I want to know how I can get some data from a YouTube video, like the views, thumbnails, or comments it has. I have been looking at Google's API but I can't understand it.
Thank you!
A different approach would be to use urllib2 to get the HTML code from the page and then filter it.
import urllib2
source = 'https://www.youtube.com/watch?v=wDjeBNv6ip0'
response = urllib2.urlopen(source)
html = response.read() #Done, you have the whole HTML file in a gigantic string.
After that all you have to do is to filter it as you would do to a string.
Getting the number of views for instance:
wordBreak = ['<', '>']
html = list(html)
i = 0
while i < len(html):
    if html[i] in wordBreak:
        html[i] = ' '
    i += 1
# The block above is just to make the html.split() easier.
html = ''.join(html)
html = html.split()

dataSwitch = False
numOfViews = ''
for element in html:
    if element == '/div':
        dataSwitch = False
    if dataSwitch:
        numOfViews += str(element)
    if element == 'class="watch-view-count"':
        dataSwitch = True
print(numOfViews)
>>> 45.608.212 views
This was a simple example of getting the number of views but you can do that to everything on the page, including number of comments, likes, the content of the comment itself, etc.
I think this is the part you are looking for (source):
def get_video_localization(youtube, video_id, language):
    results = youtube.videos().list(
        part="snippet",
        id=video_id,
        hl=language
    ).execute()
    localized = results["items"][0]["snippet"]["localized"]
    return localized
localized will now contain title, description, etc.
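If it is view and comment counts you are after, the same videos().list call can request the statistics part instead; a minimal sketch using google-api-python-client, where 'YOUR_API_KEY' is a placeholder you would create in the Google API Console:
from googleapiclient.discovery import build

# 'YOUR_API_KEY' is a placeholder, not a real credential
youtube = build('youtube', 'v3', developerKey='YOUR_API_KEY')

results = youtube.videos().list(
    part='statistics',
    id='wDjeBNv6ip0'  # the video used in the urllib2 example above
).execute()

stats = results['items'][0]['statistics']
print(stats.get('viewCount'), stats.get('commentCount'))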

Trying to Parse SOAP Response in Python

I'm struggling to find a way to parse the data I'm getting back from a SOAP response. I'm only familiar with Python (v3.4), and relatively new to it. I'm using suds-jurko to pull the data from a third-party SOAP server. The response comes back in the form of an "ArrayOfXmlNode". I've tried using ElementTree in different ways to parse the data, but I either get no information or a "TypeError: invalid file: (ArrayOfXmlNode)" error. Googling how to handle the ArrayOfXmlNode response type has gotten me nowhere.
The first part of the SOAP response is:
(ArrayOfXmlNode){
   XmlNode[] =
      (XmlNode){
         Hl =
            (Hl){
               ID = "22437790"
               Name = "Cameron"
               SpeciesID = "1"
               Sex = "Male"
               PrimaryBreed = "German Shepherd"
               SecondaryBreed = "Mix"
               SN = ""
               Age = "35"
               OnHold = "No"
               Location = "Foster Home"
               BehaviorResult = ""
               Photo = "http://sms.petpoint.com/sms/photos/615/123.jpg"
            }
      },
I've tried iterating through the data with code similar to:
from suds.client import Client

url = 'http://qag.petpoint.com/webservices/AdoptableSearch.asmx?WSDL'
client = Client(url)
result = client.service.adoptableSearchExtended('nunya', 0, 'A', 'All', 'N')
tree = result[0]
for node in tree:
    pet_info = []
    pet_info.extend(node)
print(pet_info)
The code above gives me the entire response in result[0]. Below that, I try to create a list from the data, but I only get the very last node (a node being one set of information from ID to Photo). Attempts to modify this approach give me either everything, nothing, or only the last node.
So then I tried to make use of ElementTree with simple code to test it out, but I only get the "invalid file" errors.
import xml.etree.ElementTree as ET
from suds.client import Client
url = 'http://qag.petpoint.com/webservices/AdoptableSearch.asmx?WSDL'
client = Client(url)
result = client.service.adoptableSearchExtended('nunya', 0, 'A', 'All', 'N')
pet_info = ET.parse(result)
print(pet_info)
The result:
Traceback (most recent call last):
  File "D:\Python\Eclipse Workspace\KivyTest\src\root\nested\Parse.py", line 11, in <module>
    pet_info = ET.parse(result)
  File "D:\Programs\Python34\lib\xml\etree\ElementTree.py", line 1186, in parse
    tree.parse(source, parser)
  File "D:\Programs\Python34\lib\xml\etree\ElementTree.py", line 587, in parse
    source = open(source, "rb")
TypeError: invalid file: (ArrayOfXmlNode){
   XmlNode[] =
      (XmlNode){
         Hl =
            (Hl){
               ID = "20840097"
               Name = "Daisy"
               SpeciesID = "1"
               Sex = "Female"
               PrimaryBreed = "Terrier, Pit Bull"
               SecondaryBreed = ""
               SN = ""
               Age = "42"
               OnHold = "No"
               Location = "Dog Adoption"
               BehaviorResult = ""
               Photo = "http://sms.petpoint.com/sms/photos/615/40f428de-c015-4334-9101-89c707383817.jpg"
            }
      },
Can someone get me pointed in the right direction?
I had a similar problem parsing data from a web service using Python 3.4 and suds-jurko. I was able to solve it using the code in this post: https://stackoverflow.com/a/34844428/5874347. I used the fastest_object_to_dict function to convert the web service response into a dictionary. From there you can parse the data ...
1. Add the fastest_object_to_dict function to the top of your file.
2. Make your web service call.
3. Create a new variable to save the dictionary response to:
result = client.service.adoptableSearchExtended('nunya', 0, 'A', 'All', 'N')
ParsedResponse = fastest_object_to_dict(result)
Your data will now be in the form of a dictionary. You can parse the dictionary on the Python side as needed, or send it back to your AJAX call as JSON and parse it with JavaScript.
To send it back as JSON:
import json
import sys

sys.stdout.write("content-type: text/json\r\n\r\n")
sys.stdout.write(json.dumps(ParsedResponse))
Please try this:
result[0][0]
which will give you the first element of the array (ArrayOfXmlNode).
Similarly, try this:
result[0][1][2]
which will give you the third element of element result[0][1].
Hopefully, this offers an alternative solution.
If you are using Python, you can parse a JSON result out of the XML result.
But your SOAP result needs to be raw XML output, which you can get by using retxml=True with the suds library.
I needed the result as JSON output as well, and I ended up solving it this way:
import json
import xmltodict

# Parse the XML result into a dict
data_dict = xmltodict.parse(soap_response)
# Dump the dict result into a JSON string
json_data = json.dumps(data_dict)
# Load the JSON string back into Python objects
# (named parsed_json rather than json, to avoid shadowing the json module)
parsed_json = json.loads(json_data)
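For completeness, a sketch of how the soap_response string above might be obtained; retxml=True is the suds option mentioned, and it makes the client return the raw SOAP reply instead of parsed objects:
from suds.client import Client

url = 'http://qag.petpoint.com/webservices/AdoptableSearch.asmx?WSDL'
# retxml=True returns the raw XML string instead of suds objects
client = Client(url, retxml=True)
soap_response = client.service.adoptableSearchExtended('nunya', 0, 'A', 'All', 'N')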
