I am trying to make a Python script that downloads just the metadata of a torrent given an infohash. This infohash is loaded from a JSON file with contents like this:
{"infohash":"someinfohash"}
If I manually put the infohash into the link string, or build the link myself from a dictionary, like this:
link = 'magnet:?xt=urn:btih:someinfohash'
or like this:
foo = {}
foo['infohash'] = 'someinfohash'
link = 'magnet:?xt=urn:btih:' + foo['infohash']
I can always download the metadata, no problem. But for some reason, when I load it from the JSON file, it consistently times out.
import sys
import json
import time
import libtorrent as lt

thedata = open(sys.argv[1]).read()
thedata = json.loads(thedata)
ses = lt.session()
ses.listen_on(6881, 6891)
params = {
'save_path': '.', # doesn't matter because we're only downloading metadata
}
link = 'magnet:?xt=urn:btih:' + thedata['infohash']
handle = lt.add_magnet_uri(ses, link, params)
ses.add_dht_router('dht.transmissionbt.com', 6881)
ses.add_dht_router('router.bittorrent.com', 6881)
ses.add_dht_router('router.utorrent.com', 6881)
ses.start_dht()
sys.stdout.write('Downloading metadata...')
sys.stdout.flush()
timeout = time.time()
while (not handle.has_metadata()):
if (time.time() >= 300 + timeout):
print 'timed out'
sys.exit(1)
time.sleep(1)
print 'done'
ses.pause()
If I check if the strings are equal, like so:
link = 'magnet:?xt=urn:btih:' + thedata['infohash']
link2 = 'magnet:?xt=urn:btih:someinfohash'
print link == link2
It prints true.
Does anybody have any idea what is going on?
I finally found what was wrong with it.
I got the idea to examine the printed representation of the JSON-generated dictionary as opposed to the one I made manually.
thedata = open(sys.argv[1]).read()
thedata = json.loads(thedata)
print thedata
thedata = {}
thedata['infohash'] = 'someinfohash'
print thedata
What came up was this:
{u'infohash': u'someinfohash'}
{'infohash': 'someinfohash'}
Apparently the JSON-made dict was encoded in unicode, and this somehow prevented libtorrent from connecting to seeders. After converting all the keys and values from unicode back to byte strings, the script runs perfectly.
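For reference, a minimal sketch of that conversion (assuming Python 2 and an ASCII-only infohash):
# convert the unicode keys and values produced by json.loads back to byte strings
thedata = dict((k.encode('utf-8'), v.encode('utf-8')) for k, v in thedata.items())
link = 'magnet:?xt=urn:btih:' + thedata['infohash']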
I've recently been trying to scrape a site that contains chemistry exam tests in PDF form using Python. I used requests and everything was going well, until some of the downloads were cut short at a very small size, i.e. 2KB. What's curious, though, is that it happens completely at random: with every run of the script, the files that get cut short are different. I've been scratching my head for a while now and decided to ask here. Downloading them manually would probably have proved faster by now, but I want to know why the script isn't working, for future reference.
I've written the script to be asynchronous, so it occurred to me that I could have been DoSing the server. However, I've replaced every Pool with a synchronous for loop, even adding time.sleep() here and there (see the sketch after the script below), and it didn't help. Using that approach, none of the files were fully downloaded; practically every single one stopped at 2KB.
Please forgive me if the question is naive or my mistake is foolish as I am only a hobby programmer. I'll be grateful for any help.
P.S. I've intercepted the headers using Postman from Chrome; without them the response was 500. However, I won't include them here, as they contain session IDs that would let you log in to my account.
The script is as follows:
from shutil import copyfileobj
from multiprocessing.dummy import Pool as ThreadPool
from requests import get
from time import sleep
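# NOTE: the script also uses a 'headers' dict (the intercepted session headers mentioned above); it is deliberately omitted from this post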
titles = {
"95": "Budowa atomu. Układ okresowy pierwiastków chemicznych",
"96": "Wiązania chemiczne",
"97": "Systematyka związków nieorganicznych",
"98": "Stechiometria",
"99": "Reakcje utleniania-redukcji. Elektrochemia",
"100": "Roztwory",
"101": "Kinetyka chemiczna",
"102": "Reakcje w wodnych roztworach elektrolitów",
"103": "Charakterystyka pierwiastków i związków chemicznych",
"104": "Chemia organiczna jako chemia związków węgla",
"105": "Węglowodory",
"106": "Jednofunkcyjne pochodne węglowodorów",
"107": "Wielofunkcyjne pochodne węglowodorów",
"108": "Arkusz maturalny"
}
#collection = {"120235": "Chemia nieorganiczna", "120586": "Chemia organiczna"}
url = "https://e-testy.terazmatura.pl/print/%s/quiz_%s/%s"
def downloadTest(id):
with ThreadPool(2) as tp:
tp.starmap(downloadActualTest, [(id, "blank"), (id, "key")])
def downloadActualTest(id, dataType):
name = titles[str(id)]
if id in range(95, 104):
collectionId = 120235
else:
collectionId = 120586
if dataType == "blank":
with open("Pulled Data/%s - pusty.pdf" % name, "wb") as test:
print("Downloading: " + url % (collectionId, id, "blank") + '\n')
r = get(url % (collectionId, id, "blank"),
stream=True,
headers=headers)
r.raw.decode_content = True
copyfileobj(r.raw, test)
elif dataType == "key":
with open("Pulled Data/%s - klucz.pdf" % name, "wb") as test:
print("Downloading: " + url % (collectionId, id, "key") + '\n')
r = get(url % (collectionId, id, "key"),
stream=True,
headers=headers)
r.raw.decode_content = True
copyfileobj(r.raw, test)
with ThreadPool(3) as p:
p.map(downloadTest, range(95, 109))
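For reference, the synchronous variant mentioned above might look roughly like this (the sleep length is an arbitrary choice, not taken from the original script):
# synchronous variant: download each test one at a time, pausing between requests
for test_id in range(95, 109):
    downloadActualTest(test_id, "blank")
    sleep(2)  # arbitrary pause to avoid hammering the server
    downloadActualTest(test_id, "key")
    sleep(2)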
I was trying to bulk edit the signature of my personal macros on ZenDesk, and the only way to do that is via the API. So I wrote this quick Python script to try to do it:
import sys
import time
import logging
import requests
import re
start_time = time.time()
# Set up logging
logger = logging.getLogger()
log_handler = logging.StreamHandler(sys.stdout)
log_handler.setFormatter(logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s - %(funcName)s - line %(lineno)d"))
log_handler.setLevel(logging.DEBUG)
logger.addHandler(log_handler)
logger.setLevel(logging.DEBUG)
def doTheGet(url, user, pwd):
response = requests.get(url, auth=(user + "/token", pwd))
if response.status_code != 200:
logger.error("Status: %s (%s) Problem with the request. Exiting. %f seconds elapsed" % (response.status_code, response.reason, time.time() - start_time))
exit()
data = response.json()
return data
def doThePut(url, updated_data, user, pwd):
response = requests.put(url, json="{'macro': {'actions': %r}}" % updated_data, headers={"Content-Type": "application/json"}, auth=(user + "/token", pwd))
if response.status_code != 200:
logger.error("Status: %s (%s) Problem with the request. Exiting. %f seconds elapsed" % (response.status_code, response.reason, time.time() - start_time))
exit()
data = response.json()
return data
def getMacros():
macros = {}
data = doTheGet("https://mydomain.zendesk.com/api/v2/macros.json", "me@mydomain.com", "111tokenZZZ")
def getMacros(macro_list, page, page_count):
if not page:
for macro in macro_list:
if macro["restriction"] and macro["active"]:
if macro["restriction"]["type"] == "User":
macros[macro["id"]] = macro["actions"]
else:
for macro in macro_list:
if macro["restriction"] and macro["active"]:
if macro["restriction"]["type"] == "User":
macros[macro["id"]] = macro["actions"]
page_count += 1
new_data = doTheGet(page, "me@mydomain.com", "111tokenZZZ")
new_macs = new_data["macros"]
new_next_page = new_data["next_page"]
getMacros(new_macs, new_next_page, page_count)
macs = data["macros"]
current_page = 1
next_page = data["next_page"]
getMacros(macs, next_page, current_page)
return macros
def updateMacros():
macros = getMacros()
regular = "RegEx to match signature to be replaced$" #since some macros already have the updated signature
for macro in macros:
for action in macros[macro]:
if action["field"] == "comment_value":
if re.search(regular, action["value"][1]):
ind = action["value"][1].rfind("\n")
action["value"][1] = action["value"][1][:ind] + "\nNew signature"
return macros
macs = updateMacros()
for mac in macs:
doThePut("https://mydomain.zendesk.com/api/v2/macros/%d.json" % (mac), macs[mac], "me#mydomain.com", "111tokenZZZ")
Now, everything's running as expected, and I get no errors. When I go to my macros on ZenDesk and sort them by last updated, I do see that the script did something, since they show as being last updated today. However, nothing changes on them. I made sure the data I'm sending over is edited (updateMacros is doing its job). I made sure the requests send back an OK response. So I'm sending updated data, getting back a 200 response, but the response sent back shows me the macros as they were before, with zero changes.
The only thing that occurs to me as possibly being wrong is the format of the data I'm sending over, or something of the sort. But even then, I'd expect the response not to be a 200...
What am I missing here?
Looks like you're double-encoding the JSON data in your PUT request:
response = requests.put(url, json="{'macro': {'actions': %r}}" % updated_data, headers={"Content-Type": "application/json"}, auth=(user + "/token", pwd))
The json parameter expects an object, which it then dutifully encodes as JSON and sends as the body of the request. This is merely a convenience; the implementation is simply:
if not data and json is not None:
# urllib3 requires a bytes-like body. Python 2's json.dumps
# provides this natively, but Python 3 gives a Unicode string.
content_type = 'application/json'
body = complexjson.dumps(json)
if not isinstance(body, bytes):
body = body.encode('utf-8')
(source: https://github.com/kennethreitz/requests/blob/master/requests/models.py#L424)
Since the value is always passed through json.dumps(), if you pass a string representing already-encoded JSON it will itself be encoded:
"{\'macro\': {\'actions\': [{\'field\': \'comment_value\', \'value\': [\'channel:all\', \'Spiffy New Sig that will Never Be Saved\']}]}}"
ZenDesk, upon being given JSON it doesn't expect, updates the updated_at field and... does nothing else. You can verify this by passing an empty string: same result.
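A quick way to see the difference (a small illustration, not from the original post):
import json
payload = {'macro': {'actions': []}}
json.dumps(payload)       # a JSON object: {"macro": {"actions": []}}
json.dumps(str(payload))  # a JSON string containing the text "{'macro': {'actions': []}}", not an object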
Note that you're also relying on Python's repr formatting to fill in your JSON; that's probably a bad idea too. Instead, let's just reconstruct our macro object and let requests encode it:
response = requests.put(url, json={'macro': {'actions': updated_data}}, headers={"Content-Type": "application/json"}, auth=(user + "/token", pwd))
This should do what you expect.
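Putting that together, the corrected doThePut from the question would look something like this (the explicit Content-Type header can be dropped, since requests sets it automatically when the json parameter is used):
def doThePut(url, updated_data, user, pwd):
    # pass a real dict; requests serializes it to JSON exactly once
    response = requests.put(url, json={'macro': {'actions': updated_data}}, auth=(user + "/token", pwd))
    if response.status_code != 200:
        logger.error("Status: %s (%s) Problem with the request. Exiting. %f seconds elapsed" % (response.status_code, response.reason, time.time() - start_time))
        exit()
    return response.json()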
I'm new to Stack Overflow, but I have found a LOT of answers on this site. I'm also a programming newbie, so I figured I'd join and finally become part of this community, starting with a question about a problem that's been plaguing me for hours.
I log in to a website and scrape a big body of text within the <b> tag, to be converted into a proper table. The layout of the resulting Output.txt looks like this:
BIN STATUS
8FHA9D8H 82HG9F RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
INVENTORY CODE: FPBC *SOUP CANS LENTILS
BIN STATUS
HA8DHW2H HD0138 RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8SHDNADU 00A123 #2956- INVALID STOCK COUPON CODE (MISSING).
93827548 096DBR RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
There are a bunch of pages with the exact same blocks, but I need them to be combined into an ACTUAL table that looks like this:
BIN INV CODE STATUS
HA8DHW2HHD0138 FPBC-*SOUP CANS LENTILS RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8SHDNADU00A123 FPBC-*SOUP CANS LENTILS #2956- INVALID STOCK COUPON CODE (MISSING).
93827548096DBR FPBC-*SOUP CANS LENTILS RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8FHA9D8H82HG9F SSXR-98-20LM NM CORN CREAM RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
Essentially, all separate text blocks in this example would become part of this table, with the inventory code repeated alongside its BIN values. I would post my attempts at parsing this data (I have tried Pandas/bs/openpyxl/csv writer), but I'll admit they are a little embarrassing, as I cannot find any information on this specific problem. Is there any benevolent soul out there who can help me out? :)
(Also, I am using Python 2.7.)
A simple custom parser like the following should do the trick.
from __future__ import print_function
def parse_body(s):
line_sep = '\n'
getting_bins = False
inv_code = ''
for l in s.split(line_sep):
if l.startswith('INVENTORY CODE:') and not getting_bins:
inv_data = l.split()
inv_code = inv_data[2] + '-' + ' '.join(inv_data[3:])
elif l.startswith('INVENTORY CODE:') and getting_bins:
print("unexpected inventory code while reading bins:", l)
elif l.startswith('BIN') and l.endswith('STATUS'):  # the header line in the sample data is "BIN STATUS"
getting_bins = True
elif getting_bins == True and l:
bin_data = l.split()
# need to add exception handling here to make sure:
# 1) we have an inv_code
# 2) bin_data is at least 3 items big (assuming two for
# bin_id and at least one for message)
# 3) maybe some constraint checking to ensure that we have
# a valid instance of an inventory code and bin id
bin_id = ''.join(bin_data[0:2])
message = ' '.join(bin_data[2:])
# we now have a bin, an inv_code, and a message to add to our table
print(bin_id.ljust(20), inv_code.ljust(30), message, sep='\t')
elif getting_bins == True and not l:
# done getting bins for current inventory code
getting_bins = False
inv_code = ''
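A minimal usage sketch, assuming the scraped text has been saved to Output.txt as described in the question:
with open('Output.txt') as f:
    parse_body(f.read())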
A rather complex one, but this might get you started:
import re, pandas as pd
from pandas import DataFrame
rx = re.compile(r'''
(?:INVENTORY\ CODE:)\s*
(?P<inv>.+\S)
[\s\S]+?
^BIN.+[\n\r]
(?P<bin_msg>(?:(?!^\ ).+[\n\r])+)
''', re.MULTILINE | re.VERBOSE)
string = your_string_here
# set up the dataframe
df = DataFrame(columns = ['BIN', 'INV', 'MESSAGE'])
for match in rx.finditer(string):
inv = match.group('inv')
bin_msg_raw = match.group('bin_msg').split("\n")
rxbinmsg = re.compile(r'^(?P<bin>(?:(?!\ {2}).)+)\s+(?P<message>.+\S)\s*$', re.MULTILINE)
for item in bin_msg_raw:
for m in rxbinmsg.finditer(item):
# append it to the dataframe
df.loc[len(df.index)] = [m.group('bin'), inv, m.group('message')]
print(df)
Explanation
It looks for INVENTORY CODE and sets up the groups (inv and bin_msg) for further processing afterwards (note: it would be easier if you had only one line of bin/msg, as you need to split the group afterwards).
Afterwards, it splits the bin and msg part and appends all to the df object.
I have some code written for website scraping which may help you.
Basically, what you need to do is right-click on the web page, go to the HTML, and find the tag for the table you are looking for; then extract the information using a parsing module (I am using BeautifulSoup). I am creating a JSON document because I need to store it in MongoDB; you can create a table instead.
#! /usr/bin/python
import sys
import requests
import re
from BeautifulSoup import BeautifulSoup
import pymongo
def req_and_parsing():
url2 = 'http://businfo.dimts.in/businfo/Bus_info/EtaByRoute.aspx?ID='
list1 = ['534UP','534DOWN']
for Route in list1:
final_url = url2 + Route
#r = requests.get(final_url)
#parsing_file(r.text,Route)
outdict = []
outdict = [parsing_file( requests.get(url2+Route).text,Route) for Route in list1 ]
print outdict
conn = f_connection()
for i in range(len(outdict)):
insert_records(conn,outdict[i])
def parsing_file(txt,Route):
soup = BeautifulSoup(txt)
table = soup.findAll("table",{"id" : "ctl00_ContentPlaceHolder1_GridView2"})
#trtags = table[0].findAll('tr')
tdlist = []
trtddict = {}
"""
for trtag in trtags:
print 'print trtag- ' , trtag.text
tdtags = trtag.findAll('td')
for tdtag in tdtags:
print tdtag.text
"""
divtags = soup.findAll("span",{"id":"ctl00_ContentPlaceHolder1_ErrorLabel"})
for divtag in divtags:
    print "div tag - ", divtag.text
    if divtag.text in ("Currently no bus is running on this route",
                       "This is not a cluster (orange bus) route"):
        print "Page not displayed. Errored with the below message for Route-", Route, ",", divtag.text
        sys.exit()
trtags = table[0].findAll('tr')
for trtag in trtags:
tdtags = trtag.findAll('td')
if len(tdtags) == 2:
trtddict[tdtags[0].text] = sub_colon(tdtags[1].text)
return trtddict
def sub_colon(tag_str):
return re.sub(';',',',tag_str)
def f_connection():
try:
conn=pymongo.MongoClient()
print "Connected successfully!!!"
except pymongo.errors.ConnectionFailure, e:
print "Could not connect to MongoDB: %s" % e
return conn
def insert_records(conn,stop_dict):
db = conn.test
print db.collection_names()
mycoll = db.stopsETA
mycoll.insert(stop_dict)
if __name__ == "__main__":
req_and_parsing()
I'm trying to grab the most recently uploaded videos. There's a standard feed for that - it's called most_recent. I don't have any problems grabbing the feed, but when I look at the entries inside, they're all half a year old, which is hardly recent.
Here's the code I'm using:
import requests
import os.path as P
import sys
from lxml import etree
import datetime
namespaces = {"a": "http://www.w3.org/2005/Atom", "yt": "http://gdata.youtube.com/schemas/2007"}
fmt = "%Y-%m-%dT%H:%M:%S.000Z"
class VideoEntry:
"""Data holder for the video."""
def __init__(self, node):
self.entry_id = node.find("./a:id", namespaces=namespaces).text
published = node.find("./a:published", namespaces=namespaces).text
self.published = datetime.datetime.strptime(published, fmt)
def __str__(self):
return "VideoEntry[id='%s']" % self.entry_id
def paginate(xml):
root = etree.fromstring(xml)
next_page = root.find("./a:link[@rel='next']", namespaces=namespaces)
if next_page == None:
next_link = None
else:
next_link = next_page.get("href")
entries = [VideoEntry(e) for e in root.xpath("/a:feed/a:entry", namespaces=namespaces)]
return entries, next_link
prefix = "https://gdata.youtube.com/feeds/api/standardfeeds/"
standard_feeds = set("top_rated top_favorites most_shared most_popular most_recent most_discussed most_responded recently_featured on_the_web most_viewed".split(" "))
feed_name = sys.argv[1]
assert feed_name in standard_feeds
feed_url = prefix + feed_name
all_video_ids = []
while feed_url is not None:
r = requests.get(feed_url)
if r.status_code != 200:
break
text = r.text.encode("utf-8")
video_ids, feed_url = paginate(text)
all_video_ids += video_ids
all_upload_times = [e.published for e in all_video_ids]
print min(all_upload_times), max(all_upload_times)
As you can see, it prints the min and max timestamps for the entire feed.
misha@misha-antec$ python get_standard_feed.py most_recent
2013-02-02 14:40:02 2013-02-02 14:54:00
misha@misha-antec$ python get_standard_feed.py top_rated
2006-04-06 21:30:53 2013-07-28 22:22:38
I've glanced through the downloaded XML and it appears to match the output. Am I doing something wrong?
Also, on an unrelated note, the feeds I'm getting are all about 100 entries (I'm paginating through them 25 at a time). Is this normal? I expected the feeds to be a bit bigger.
Regarding the "Most-Recent-Feed"-Topic: There is a ticket for this one here. Unfortunately, the YouTube-API-Teams doesn't respond or solved the problem so far.
Regarding the number of entries: That depends on the type of standardfeed, but for the most-recent-Feed it´s usually around 100.
Note: You could try using the "orderby=published" parameter to get recents videos, although I don´t know how "recent" they are.
https://gdata.youtube.com/feeds/api/videos?orderby=published&prettyprint=True
You can combine this query with the "category" parameter or other ones (region-specific queries, like for the standard feeds, are not possible, AFAIK).
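A rough sketch of that query, reusing the paginate() helper from the question (and assuming the feed has the same Atom structure as the standard feeds):
r = requests.get("https://gdata.youtube.com/feeds/api/videos",
                 params={"orderby": "published", "max-results": 25})
entries, next_link = paginate(r.text.encode("utf-8"))
print min(e.published for e in entries), max(e.published for e in entries)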
I'm trying to get some results from UniProt, which is a protein database (details are not important). I'm trying to use some script that translates from one kind of ID to another. I was able to do this manually on the browser, but could not do it in Python.
In http://www.uniprot.org/faq/28 there are some sample scripts. I tried the Perl one and it seems to work, so the problem is my Python attempts. The (working) script is:
## tool_example.pl ##
use strict;
use warnings;
use LWP::UserAgent;
my $base = 'http://www.uniprot.org';
my $tool = 'mapping';
my $params = {
from => 'ACC', to => 'P_REFSEQ_AC', format => 'tab',
query => 'P13368 P20806 Q9UM73 P97793 Q17192'
};
my $agent = LWP::UserAgent->new;
push @{$agent->requests_redirectable}, 'POST';
print STDERR "Submitting...\n";
my $response = $agent->post("$base/$tool/", $params);
while (my $wait = $response->header('Retry-After')) {
print STDERR "Waiting ($wait)...\n";
sleep $wait;
print STDERR "Checking...\n";
$response = $agent->get($response->base);
}
$response->is_success ?
print $response->content :
die 'Failed, got ' . $response->status_line .
' for ' . $response->request->uri . "\n";
My questions are:
1) How would you do that in Python?
2) Will I be able to massively "scale" that (i.e., use a lot of entries in the query field)?
Question 1:
This can be done using Python's urllib modules:
import urllib, urllib2
import time
import sys
query = ' '.join(sys.argv[1:])  # skip the script name itself
# encode params as a list of 2-tuples
params = ( ('from','ACC'), ('to', 'P_REFSEQ_AC'), ('format','tab'), ('query', query))
# url encode them
data = urllib.urlencode(params)
url = 'http://www.uniprot.org/mapping/'
# fetch the data
try:
foo = urllib2.urlopen(url, data)
except urllib2.HTTPError, e:
if e.code == 503:
# blah blah get the value of the header...
wait_time = int(e.hdrs.get('Retry-after', 0))
print 'Sleeping %i seconds...' % (wait_time,)
time.sleep(wait_time)
foo = urllib2.urlopen(url, data)
# foo is a file-like object, do with it what you will.
foo.read()
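A usage sketch from the command line (the script name is hypothetical; the IDs are the ones from the question):
python uniprot_map.py P13368 P20806 Q9UM73 P97793 Q17192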
You're probably better off using the Protein Identifier Cross Reference service from the EBI to convert one set of IDs to another. It has a very good REST interface.
http://www.ebi.ac.uk/Tools/picr/
I should also mention that UniProt has very good web services available. Though if you are tied to using simple HTTP requests for some reason, then it's probably not useful.
Let's assume that you are using Python 2.5.
We can use httplib to directly call the web site:
import httplib, urllib
querystring = {}
#Build the query string here from the following keys (query, format, columns, compress, limit, offset)
querystring["query"] = ""
querystring["format"] = "" # one of html | tab | fasta | gff | txt | xml | rdf | rss | list
querystring["columns"] = "" # the columns you want comma seperated
querystring["compress"] = "" # yes or no
## These may be optional
querystring["limit"] = "" # I guess if you only want a few rows
querystring["offset"] = "" # bring on paging
##From the examples - query=organism:9606+AND+antigen&format=xml&compress=no
##Delete the following and replace with your query
querystring = {}
querystring["query"] = "organism:9606 AND antigen"
querystring["format"] = "xml" #make it human readable
querystring["compress"] = "no" #I don't want to have to unzip
conn = httplib.HTTPConnection("www.uniprot.org")
conn.request("GET", "/uniprot/?"+ urllib.urlencode(querystring))
r1 = conn.getresponse()
if r1.status == 200:
data1 = r1.read()
print data1 #or do something with it
You could then make a function around creating the query string and you should be away.
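A rough sketch of that wrapper (the parameter names simply mirror the query-string keys used above):
def query_uniprot(query, format="tab", columns="", compress="no"):
    # build the query string and issue the GET request, as above
    qs = urllib.urlencode({"query": query, "format": format,
                           "columns": columns, "compress": compress})
    conn = httplib.HTTPConnection("www.uniprot.org")
    conn.request("GET", "/uniprot/?" + qs)
    resp = conn.getresponse()
    if resp.status == 200:
        return resp.read()
    return None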
Check out bioservices; it interfaces with a lot of databases through Python.
https://pythonhosted.org/bioservices/_modules/bioservices/uniprot.html
conda install bioservices --yes
In complement to O.rka's answer:
Question 1:
from bioservices import UniProt
u = UniProt()
res = u.get_df("P13368 P20806 Q9UM73 P97793 Q17192".split())
This returns a dataframe with all information about each entry.
Question 2: same answer. This should scale up.
Disclaimer: I'm the author of bioservices
There is a Python package on PyPI which does exactly what you want:
pip install uniprot-mapper