A little while ago, I posted on here for help using the API to download data from Tumblr blogs. birryree (https://stackoverflow.com/users/297696/birryree) was kind enough to help me correct my script and figure out where I had been going wrong, and I have been using his script with no problems since (Print more than 20 posts from Tumblr API).
This script requires that I manually input the blog name that I want to download each time. However, I need to download hundreds of blogs, so this has led to me working with hundreds of versions of the same script and is very time-consuming. I did some googling and found that it is possible to write Python scripts that take arguments from the command line and then process them (if that's the right terminology) one by one.
I tried to write a script which would let me run a single command from the command prompt and which would then download the three blogs I name as arguments (in this case, prettythingsicantafford.tumblr.com, theficrecfairy.tumblr.com, and staff.tumblr.com).
So my script that I'm trying to run is:
import pytumblr
import sys

def get_all_posts(client, blog):
    offset = 0
    while True:
        response = client.posts(blog, limit=20, offset=offset, reblog_info=True, notes_info=True)
        # Get the 'posts' field of the response
        posts = response['posts']
        if not posts: return
        for post in posts:
            yield post
        # move to the next offset
        offset += 20

client = pytumblr.TumblrRestClient('SECRET')
blog = (sys.argv[1], sys.argv[2], sys.argv[3])

# use our function
with open('{}-posts.txt'.format(blog), 'w') as out_file:
    for post in get_all_posts(client, blog):
        print >>out_file, post
I am running the following command from the command prompt:
tumblr_test2.py theficrecfairy prettythingsicantafford staff
However, I get the following error message:
Traceback (most recent call last):
  File "C:\Users\izzy\test\tumblr_test2.py", line 29, in <module>
    for post in get_all_posts(client, blog):
  File "C:\Users\izzy\test\tumblr_test2.py", line 8, in get_all_posts
    response = client.posts(blog, limit=20, offset=offset, reblog_info=True, notes_info=True)
  File "C:\Python27\lib\site-packages\pytumblr\helpers.py", line 46, in add_dot_tumblr
    args[1] += ".tumblr.com"
TypeError: can only concatenate tuple (not "str") to tuple
I have been trying to modify my script for about two weeks now in response to this error, but I have been unable to correct my no doubt very obvious mistake and would be very grateful for any help or advice.
EDIT FOLLOWING vishes_shell's ADVICE:
I am now working with the following script:
import pytumblr
import sys

def get_all_posts(client, blogs):
    for blog in blogs:
        offset = 0
        while True:
            response = client.posts(blog, limit=20, offset=offset, reblog_info=True, notes_info=True, filter='raw')
            # Get the 'posts' field of the response
            posts = response['posts']
            if not posts: return
            for post in posts:
                yield post
            # move to the next offset
            offset += 20

client = pytumblr.TumblrRestClient('SECRET')
blog = sys.argv

# use our function
with open('{}-postsredux.txt'.format(blog), 'w') as out_file:
    for post in get_all_posts(client, blog):
        print >>out_file, post
However, I now get the following error message:
Traceback (most recent call last):
  File "C:\Users\izzy\test\tumblr_test2.py", line 27, in <module>
    with open('{}-postsredux.txt'.format(blog), 'w') as out_file:
IOError: [Errno 22] invalid mode ('w') or filename: "['C:\\Users\\izzy\\test\\tumblr_test2.py', 'prettythingsicantafford', 'theficrecfairy']-postsredux.txt"
The problem is that you are calling client.posts(blog, ...) when blog is a tuple object, declared as:
blog = (sys.argv[1], sys.argv[2], sys.argv[3])
You need to refactor your method to go over each blog separately.
def get_all_posts(client, blogs):
    for blog in blogs:
        offset = 0
        ...
        while True:
            response = client.posts(blog, ...)
            ...
        ...

blog = sys.argv
...
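Put together, a minimal version could look like the following. It loops over every blog name given on the command line (skipping sys.argv[0], which is the script name) and, as an assumption not in the original, writes a separate output file per blog, which also avoids the IOError from the edit above (the original built one filename out of the whole argument list):

import sys
import pytumblr

def get_all_posts(client, blog):
    offset = 0
    while True:
        response = client.posts(blog, limit=20, offset=offset,
                                reblog_info=True, notes_info=True)
        posts = response['posts']
        if not posts:
            return
        for post in posts:
            yield post
        offset += 20

client = pytumblr.TumblrRestClient('SECRET')

# sys.argv[0] is the script name, so the blog names start at index 1
for blog in sys.argv[1:]:
    # one output file per blog, e.g. "theficrecfairy-posts.txt"
    with open('{}-posts.txt'.format(blog), 'w') as out_file:
        for post in get_all_posts(client, blog):
            print >>out_file, post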
Related
My requirement: read the contents of an input type="file" element with ID "rtfile1" and write it to a textarea with ID "rt1".
Based on the documentation on https://brython.info/ I tried to read a file, but it fails with this error:
Access to XMLHttpRequest at 'file:///C:/fakepath/requirements.txt' from origin 'http://example.com:8000' has been blocked by CORS policy: Cross origin requests are only supported for protocol schemes: http, data, chrome, chrome-extension, https.
I tried the following two Brython snippets; both failed with the same error.
Code 1:
from browser import document as doc

def file_read(ev):
    doc['rt1'].value = open(doc['rtfile1'].value).read()

doc["rtfile1"].bind("input", file_read)
Code 2:
from browser import ajax, document as doc

def file_read(ev):

    def on_complete(req):
        if req.status == 200 or req.status == 0:
            doc['rt1'].value = req.text
        else:
            doc['rt1'].value = "error " + req.text

    def err_msg():
        doc['rt1'].value = "server didn't reply after %s seconds" % timeout

    timeout = 4

    def go(url):
        req = ajax.ajax()
        req.bind("complete", on_complete)
        req.set_timeout(timeout, err_msg)
        req.open('GET', url, True)
        req.send()
        print('Triggered')

    go(doc['rtfile1'].value)

doc["rtfile1"].bind("input", file_read)
Any help would be greatly appreciated. Thanks!!! :)
It's not related to Brython (you would have the same result with the equivalent Javascript), but to the way you tell the browser which file you want to upload.
If you select the file by an HTML tag such as
<input type="file" id="rtfile1">
the object referenced by doc['rtfile1'] in the Brython code has an attribute value, but it is not the file path or URL: it is a "fakepath" built by the browser (as you can see in the error message), and you can't use it as an argument to the Brython function open() or as a URL to send an Ajax request to. If you want to use the file URL, you should enter it in a basic input tag (without type="file").
It is better to select the file with type="file", but in this case the files attribute of doc['rtfile1'] is a FileList object, described in the DOM Web API, whose first element is a File object. Reading its content is unfortunately not as simple as with open(), but here is a working example:
from browser import window, document as doc

def file_read(ev):

    def onload(event):
        """Triggered when the file is read. The FileReader instance is
        event.target.
        The file content, as text, is the FileReader instance's "result"
        attribute."""
        doc['rt1'].value = event.target.result

    # Get the selected file as a DOM File object
    file = doc['rtfile1'].files[0]
    # Create a new DOM FileReader instance
    reader = window.FileReader.new()
    # Read the file content as text
    reader.readAsText(file)
    reader.bind("load", onload)

doc["rtfile1"].bind("input", file_read)
I'm programming in school and will soon need to program my final piece. The following program (written in Python) is one I'm writing simply to practice accessing APIs.
I'm attempting to access the API for a site based on a game. The idea of the program is to check this API every 30 seconds for changes in the data: it stores two values ('baseRank' and 'basePP') as soon as it starts running, then compares them with new data taken 30 seconds later.
Here is my program:
import time

apiKey = '###'
rankDifferences = []
ppDifferences = []
const = True
username = '- Legacy'
url = "https://osu.ppy.sh/api/get_user?u={1}&k={0}".format(apiKey,username)

import urllib.request, json
with urllib.request.urlopen(url) as url:
    stats = json.loads(url.read().decode())
stats = stats[0]
basePP = stats['pp_raw']
print(basePP)
baseRank = stats['pp_rank']
print(baseRank)

while const == True:
    time.sleep(30)
    import urllib.request, json
    with urllib.request.urlopen(url) as url:
        check = json.loads(url.read().decode())
    check = check[0]
    rankDifference = baseRank + check['pp_rank']
    ppDifference = basePP + check['pp_raw']
    baseRank = check['pp_raw']
    basePP = check['pp_raw']
    if rankDifference != 0:
        print(rankDifference)
    if ppDifference != 0:
        print(ppDifference)
Please note, where I have written 'apiKey = '###'', I am in fact using a real, working API key, but I've hidden it as the site asks you not to share your api key with others.
Here is the state of the shell after running:
5206.55
12045
Traceback (most recent call last):
  File "C:/Users/ethan/Documents/osu API Accessor.py", line 23, in <module>
    with urllib.request.urlopen(url) as url:
  File "C:\Users\ethan\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\ethan\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 518, in open
    protocol = req.type
AttributeError: 'HTTPResponse' object has no attribute 'type'
As you can see, it does print both 'basePP' and 'baseRank', proving that I can access this API. The problem seems to be when I try to access it a second time. To be completely honest, I'm not entirely sure what this error means, so if you wouldn't mind taking the time to explain and/or help fix it, it would be greatly appreciated.
Side note: This is my first time using this forum so if I'm doing anything wrong, I'm very sorry!
The problem seems to be when you do:
with urllib.request.urlopen(url) as url:
    stats = json.loads(url.read().decode())
Your use of url as the with target rebinds the name: after this block, url refers to the HTTPResponse object instead of the URL string, so when you try to use it as a URL later it doesn't work.
Try something like:
with urllib.request.urlopen(url) as page:
    stats = json.loads(page.read().decode())
and it should be okay.
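The same rebinding happens at the second request site inside the while loop, so it needs the same treatment. A minimal sketch of the question's code with the response bound to a different name (the difference-tracking logic is left out here, and the key/username placeholders are the ones from the question):

import json
import time
import urllib.request

apiKey = '###'
username = '- Legacy'
url = "https://osu.ppy.sh/api/get_user?u={1}&k={0}".format(apiKey, username)

# Bind the response to `response`, so `url` keeps holding the URL string
with urllib.request.urlopen(url) as response:
    stats = json.loads(response.read().decode())[0]
basePP = stats['pp_raw']
baseRank = stats['pp_rank']

while True:
    time.sleep(30)
    # `url` is still the original string, so this call now succeeds
    with urllib.request.urlopen(url) as response:
        check = json.loads(response.read().decode())[0]
    print(check['pp_rank'], check['pp_raw'])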
I am unable to download an .xls file from a URL. I have tried both urlopen and urlretrieve, but I receive a really long error message starting with:
Traceback (most recent call last):
  File "C:/Users/Henrik/Documents/Development/Python/Projects/ImportFromWeb.py", line 6, in <module>
    f = ur.urlopen(dls)
  File "C:\Users\Henrik\AppData\Local\Programs\Python\Python35\lib\urllib\request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
and ending with:
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
Unfortunately I can't provide the URL I am using since the data is sensitive. However, I will give you the URL with some parts removed:
https://xxxx.xxxx.com/xxxxlogistics/w/functions/transportinvoicelist?0-8.IBehaviorListener.2-ListPageForm-table-TableForm-exportToolbar-xlsExport&antiCache=1477160491504
As you can see, the URL doesn't end with a "/file.xls", for example. I don't know if that matters, but most of the threads regarding this issue have had those types of links.
If I enter the URL in my address bar, the file download window appears:
[Image of download window]
The code I have written looks like this:
import urllib.request as ur
import openpyxl as pyxl
dls = 'https://xxxx.xxxx.com/xxxxlogistics/w/functions/transportinvoicelist?0-8.IBehaviorListener.2-ListPageForm-table-TableForm-exportToolbar-xlsExport&antiCache=1477160491504'
f = ur.urlopen(dls)
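For reference, my urlretrieve attempt looked roughly like this (the local filename is just an example); it fails with the same redirect error:

import urllib.request as ur

dls = 'https://xxxx.xxxx.com/xxxxlogistics/w/functions/transportinvoicelist?0-8.IBehaviorListener.2-ListPageForm-table-TableForm-exportToolbar-xlsExport&antiCache=1477160491504'
# urlretrieve saves the response body directly to a local file
ur.urlretrieve(dls, 'transportinvoicelist.xls')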
I am grateful for any help you can provide!
I would like to find out the ten Instagram users that posted most pictures with a certain hashtag.
I am using Python 2.7, and I wrote this:
import urllib, json
from collections import Counter

def GetNumberPics():
    urlInstagram = "https://api.instagram.com/v1/tags/HASHTAG?access_token=ACCESSTOKEN"
    response = urllib.urlopen(urlInstagram)
    return json.loads(response.read())['data']['media_count']

def GetPics(url):
    urlInstagram = url
    response = urllib.urlopen(urlInstagram)
    pics = json.loads(response.read())
    return pics
In this next piece I find out how many pictures are on Instagram with that hashtag and divide that number by 20. This is because, as far as I understood, 20 is the number of pictures whose data I receive on each API call, so this tells me how many times I have to call the API to get the data for all the pictures.
nPics = GetNumberPics()
print nPics
times = nPics / 20
print times
FirstUrl = 'https://api.instagram.com/v1/tags/HASHTAG/media/recent?client_id=CLIENTID'
pics = GetPics(FirstUrl)
making a list of all users:
users = []
for i in range(20):
    users.append(pics['data'][i]['user']['username'])
getting the next url, as received in the first api call:
nextUrl = pics['pagination']['next_url']
making the API call the number of times calculated before - I'm printing i just to see how many times I call the API:
for i in range(times):
    print i
    pics = GetPics(nextUrl)
    for l in range(len(pics['data'])):
        users.append(pics['data'][l]['user']['username'])
    nextUrl = pics['pagination']['next_url']
counting the users and printing out the ten users that used that hashtag the most:
counts = Counter(users)
print(counts).most_common(10)
I get an error which I can't understand when I reach the 89th call, using the hashtag "inerasmus":
Traceback (most recent call last):
  File "C:\Users\Michele\Desktop\programming\EIE\tweetNumber.py", line 55, in <module>
    nextUrl = pics['pagination']['next_url']
KeyError: 'next_url'
I hope it is a useful question also for someone else. Thank you very much!
Why don't you use their Python API? From what I see, you can do all of this with the Python library. Also, there are already some people on GitHub who have messed around with the API. Here is one and another.
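As for the KeyError itself: on the last page of results the pagination block typically no longer contains next_url, so indexing it directly raises. A rough sketch of the question's loop (reusing its helper and placeholder URL) that simply stops when the key is absent:

import urllib, json
from collections import Counter

def GetPics(url):
    response = urllib.urlopen(url)
    return json.loads(response.read())

FirstUrl = 'https://api.instagram.com/v1/tags/HASHTAG/media/recent?client_id=CLIENTID'

users = []
nextUrl = FirstUrl
while nextUrl:
    pics = GetPics(nextUrl)
    for item in pics['data']:
        users.append(item['user']['username'])
    # the last page has no 'next_url', so .get() returns None and the loop ends
    nextUrl = pics.get('pagination', {}).get('next_url')

print Counter(users).most_common(10)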
Getting the following error:
Traceback (most recent call last):
  File "stack.py", line 31, in ?
    print >> out, "%s" % escape(p)
  File "/usr/lib/python2.4/cgi.py", line 1039, in escape
    s = s.replace("&", "&amp;") # Must be done first!
TypeError: 'NoneType' object is not callable
For the following code:
import urllib2
from cgi import escape  # Important!
from BeautifulSoup import BeautifulSoup

def is_talk_anchor(tag):
    return tag.name == "a" and tag.findParent("dt", "thumbnail")

def talk_description(tag):
    return tag.name == "p" and tag.findParent("h3")

links = []
desc = []
for pagenum in xrange(1, 5):
    soup = BeautifulSoup(urllib2.urlopen("http://www.ted.com/talks?page=%d" % pagenum))
    links.extend(soup.findAll(is_talk_anchor))
    page = BeautifulSoup(urllib2.urlopen("http://www.ted.com/talks/arvind_gupta_turning_trash_into_toys_for_learning.html"))
    desc.extend(soup.findAll(talk_description))

out = open("test.html", "w")

print >>out, """<html><head><title>TED Talks Index</title></head>
<body>
<table>
<tr><th>#</th><th>Name</th><th>URL</th><th>Description</th></tr>"""

for x, a in enumerate(links):
    print >> out, "<tr><td>%d</td><td>%s</td><td>http://www.ted.com%s</td>" % (x + 1, escape(a["title"]), escape(a["href"]))
    for y, p in enumerate(page):
        print >> out, "<td>%s</td>" % escape(p)

print >>out, "</tr></table>"
I think the issue is with % escape(p). I'm trying to take the contents of that <p> out. Am I not supposed to use escape?
Also having an issue with the line:
page = BeautifulSoup(urllib2.urlopen("%s") % a["href"])
That's what I want to do, but again running into errors and wondering if there's an alternate way of doing it. Just trying to collect the links I found from previous lines and run it through BeautifulSoup again.
You have to investigate (using pdb) why one of your links is returned as a None instance.
In particular: the traceback is self-explanatory. escape() is being called with None, so you have to investigate which argument is None. It's one of your items in 'links' - so why is one of your items None?
Likely because one of your calls to
def is_talk_anchor(tag):
    return tag.name == "a" and tag.findParent("dt", "thumbnail")
returns None because tag.findParent("dt", "thumbnail") returns None (due to your given HTML input).
So you have to check or filter your items in 'links' for None (or adjust your parser code above) in order to pick up only existing links, according to your needs.
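Concretely, that filtering could look something like this (a sketch reusing the names from the question):

# drop any items that came back as None before they reach escape()
links = [a for a in links if a is not None]

# or guard at the point of use
for x, a in enumerate(links):
    if a is None:
        continue
    print >> out, "<tr><td>%d</td><td>%s</td><td>http://www.ted.com%s</td>" % (x + 1, escape(a["title"]), escape(a["href"]))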
And please read your tracebacks carefully and think about what the problem might be - tracebacks are very helpful and provide you with valuable information about your problem.