I have a snippet of code that makes a request to an API and retrieves XML data. The API limits requests to 50 results per page. I can loop through the pages, but each page comes back as its own complete document with its own closing root element (interactions):
<interactions>
    <interaction>
        <media-type>Phone</media-type>
        <channel-obj-id>12125000870</channel-obj-id>
    </interaction>
</interactions>
<interactions>
    <interaction>
        <media-type>Phone</media-type>
        <channel-obj-id>523452345</channel-obj-id>
    </interaction>
</interactions>
I have looped through the record count to get all the data, but the result is not a single continuous XML document.
The API docs do not provide any relevant information, nor do they offer a count or total-number-of-pages parameter.
import requests
import xml.etree.ElementTree as et
from datetime import datetime, timedelta

dt = datetime.today() - timedelta(days=1)
dt = dt.strftime('%Y%m%d')

n = 0
rootLength = 50
while rootLength == 50:
    n += 50
    response = requests.get(
        'https://APIURL/api/stats/allinteractions?d=20190101,' + dt + '&n=' + str(n),
        auth=('user', 'pass'))
    xtree = et.ElementTree(et.fromstring(response.content))
    rootLength = len(xtree.getroot())  # number of <interaction> elements returned on this page
Is there a way to retrieve a single continuous XML page?
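If the API itself cannot return everything in one response, one workaround (a minimal sketch, reusing dt and the URL pattern from the snippet above and assuming each page parses into an interactions root as shown) is to append every interaction element from each page to one combined root:

import requests
import xml.etree.ElementTree as et

combined = et.Element('interactions')          # single root that will hold all pages

n = 0
while True:
    n += 50
    response = requests.get(
        'https://APIURL/api/stats/allinteractions?d=20190101,' + dt + '&n=' + str(n),
        auth=('user', 'pass'))
    page_root = et.fromstring(response.content)
    combined.extend(page_root)                 # append this page's <interaction> elements to the combined root
    if len(page_root) < 50:                    # a short page means there are no more results
        break

single_xml = et.tostring(combined, encoding='unicode')  # one continuous XML document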
Related
I am attempting to retrieve data from an API that has a maximum offset of 200000, and the records I need to pull exceed that maximum. Below is a sample of the code I am using; it works until I reach the offset limit of 200000, at which point it breaks. (The API doesn't return anything helpful about how many pages or requests are needed, which is why I keep requesting until there are no more results.) I need to find a way to loop through and pull all the data. Thanks
def pull_api_data():
    offset_limit = 0
    teltel_data = []
    # Loop through the results and add them if present
    while True:
        print("Skip", offset_limit, "rows before beginning to return results")
        querystring = {"offset": str(offset_limit),
                       "filter": "starttime>={}".format(date_filter),
                       "limit": "5000"}
        response = session.get(url=url, headers=the_headers, params=querystring)
        data = response.json()['data']
        # Do we have more data from teltel?
        if len(data) == 0:
            break
        # If yes, then add the data to the main list, teltel_data
        teltel_data.extend(data)
        # Increase offset_limit to skip the already added data
        offset_limit = offset_limit + 5000
    # Transform the raw data by converting it to a dataframe and do the necessary cleaning
    return teltel_data

teltel_data = pull_api_data()
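Since the offset itself cannot go past 200000, one possible workaround (a sketch, not documented API behaviour: it assumes the results are ordered by starttime and that each returned record exposes a starttime field) is to stop paging by offset as the cap approaches and instead advance the starttime filter to the latest value already collected, resetting the offset to zero:

def pull_api_data_windowed(date_filter):
    MAX_OFFSET = 200000                     # documented cap on the offset parameter
    teltel_data = []
    offset_limit = 0
    while True:
        querystring = {"offset": str(offset_limit),
                       "filter": "starttime>={}".format(date_filter),
                       "limit": "5000"}
        data = session.get(url=url, headers=the_headers, params=querystring).json()['data']
        if not data:
            break
        teltel_data.extend(data)
        offset_limit += 5000
        if offset_limit >= MAX_OFFSET:
            # Assumed field name: slide the window forward to the newest starttime seen so far
            date_filter = max(row['starttime'] for row in data)
            offset_limit = 0
    return teltel_data

Records that share the boundary starttime may be fetched twice after the window moves, so de-duplicating the combined list afterwards is probably necessary.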
I need to obtain properties from a web service for a large list of products (~25,000), and this is a very time-sensitive operation (ideally it needs to execute in just a few seconds). I coded this first with a for loop as a proof of concept, but it takes 1.25 hours. I'd like to vectorize this code and execute the HTTP requests in parallel using a GPU on Google Colab. I've removed many of the unnecessary details, but it's important to note that the products and their web service URLs are stored in a DataFrame.
Will this be faster to execute on a GPU? Or should I just use multiple threads on a CPU?
What is the best way to implement this? And how can I save the results from the parallel processes to the results DataFrame (all_product_properties) without running into concurrency problems?
Each product has multiple properties (key-value pairs) that I'm obtaining from the JSON response, but the product_id is not included in the JSON response so I need to add the product_id to the DataFrame.
# DataFrame containing a string column of urls
urls = pd.DataFrame(["www.url1.com", "www.url2.com", ..., "www.url3.com"], columns=["url"])

# Initialize an empty dataframe to store properties for all products
all_product_properties = pd.DataFrame(columns=["product_id", "property_name", "property_value"])

for i in range(1, len(urls)):
    curr_url = urls.loc[i, "url"]
    try:
        http_response = requests.request("GET", curr_url)
        if http_response is not None:
            http_response_json = json.loads(http_response.text)
            # Extract product properties from the JSON response
            product_properties_json = http_response_json['product_properties']
            curr_product_properties_df = pd.json_normalize(product_properties_json)
            # Add the product id since it's not returned in the JSON
            curr_product_properties_df["product_id"] = i
            # Save the current product's properties to the DataFrame containing all product properties
            all_product_properties = pd.concat([all_product_properties, curr_product_properties_df])
    except Exception as e:
        print(e)
GPUs probably will not help here, since they are meant for accelerating numerical operations. However, since you are trying to parallelize HTTP requests, which are I/O bound, you can use Python multithreading (part of the standard library) to reduce the time required.
In addition, concatenating pandas dataframes in a loop is a very slow operation (see: Why does concatenation of DataFrames get exponentially slower?). You can instead append your output to a list, and run just a single concat after the loop has concluded.
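In isolation, the append-then-concat pattern looks like this (a minimal sketch with made-up sample payloads standing in for the per-request JSON):

import pandas as pd

# Hypothetical stand-ins for the per-request JSON payloads
chunks = [{"property_name": "color", "property_value": "red"},
          {"property_name": "size", "property_value": "XL"}]

frames = []
for chunk in chunks:
    frames.append(pd.json_normalize(chunk))      # build each small frame; no concat inside the loop
all_product_properties = pd.concat(frames, ignore_index=True)  # a single concat after the loop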
Here's how I would implement your code with multithreading:
import concurrent.futures
import json
import threading
import time

import pandas as pd
import requests

thread_local = threading.local()

def get_session():
    # One requests.Session per thread
    if not hasattr(thread_local, "session"):
        thread_local.session = requests.Session()
    return thread_local.session

def download_site(enumerated_url):
    product_id, url = enumerated_url  # carry the index along so the rows can be tagged
    session = get_session()
    try:
        with session.get(url) as response:
            if response is not None:
                print(f"Read {len(response.content)} from {url}")
                http_response_json = json.loads(response.text)
                product_properties_json = http_response_json['product_properties']
                curr_product_properties_df = pd.json_normalize(product_properties_json)
                # Add the product id since it's not returned in the JSON
                curr_product_properties_df["product_id"] = product_id
                # Return this product's properties; they are concatenated once all downloads finish
                return curr_product_properties_df
    except Exception as e:
        print(e)

def download_all_sites(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # enumerate() supplies the product_id alongside each URL
        all_product_properties = executor.map(download_site, enumerate(sites))
    return all_product_properties

if __name__ == "__main__":
    # Store URLs as a list; placeholder URLs below (real product URLs would return the JSON parsed above)
    urls = ["https://www.jython.org", "http://olympus.realpython.org/dice"] * 10
    start_time = time.time()
    all_product_properties = download_all_sites(urls)
    all_product_properties = pd.concat(all_product_properties)
    duration = time.time() - start_time
    print(f"Downloaded {len(urls)} in {duration} seconds")
Reference: this RealPython article on multithreading and multiprocessing in Python: https://realpython.com/python-concurrency/
I am collecting data from a web API using a Python script. The web API returns a maximum of 50 results per request ("size": 50), but I need to collect all the results. Please let me know how I can do this. My initial code is below. Thank you in advance.
def getData():
    headers = {
        'Content-type': 'application/json',
    }
    data = '{"size":50,"sites.recruitment_status":"ACTIVE", "sites.org_state_or_province":"VA"}'
    response = requests.post('https://clinicaltrialsapi.cancer.gov/v1/clinical-trials', headers=headers, data=data)
    print(response.json())
To add to the answer already given: you can get the total number of results from the initial JSON response, and then use a loop to increment through the batches:
import requests

url = "https://clinicaltrialsapi.cancer.gov/v1/clinical-trials"

r = requests.get(url).json()
num_results = int(r['total'])    # total number of matching trials

results_per_request = 50
total = 0
while total < num_results:
    # request the next batch here, passing total as the "from" offset (see the next answer)
    total += results_per_request
    print(total)
Everything is in the docs:
https://clinicaltrialsapi.cancer.gov/#!/Clinical45trials/searchTrialsByGet
GET clinical-trials
Filters all clinical trials based upon supplied filter params. Filter
params may be any of the fields in the schema as well as any of the
following params...
size: limit the amount of results a supplied amount (default is 10,
max is 50)
from: start the results from a supplied starting point (default is 0)
...
So you just have to specify a "from" value, and increment it 50 by 50.
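Putting the two answers together, a paging loop could look like the sketch below. It assumes the matching trials come back under a trials key next to total, so check the actual response shape before relying on it.

import requests

url = "https://clinicaltrialsapi.cancer.gov/v1/clinical-trials"
results_per_request = 50                     # documented maximum for "size"

first_page = requests.get(url, params={"size": results_per_request}).json()
num_results = int(first_page['total'])
all_trials = list(first_page.get('trials', []))   # assumed response key

start = results_per_request
while start < num_results:
    page = requests.get(url, params={"size": results_per_request, "from": start}).json()
    all_trials.extend(page.get('trials', []))     # assumed response key
    start += results_per_request

print(len(all_trials))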
I'm trying to find the textual differences between two revisions of a given Wikipedia page using mwclient. I have the following code:
import mwclient
import difflib

site = mwclient.Site('en.wikipedia.org')
page = site.pages['Bowdoin College']
texts = [rev for rev in page.revisions(prop='content')]
if not (texts[-1][u'*'] == texts[0][u'*']):
    pass  # show me the differences between the pages
Thank you!
It's not clear whether you want a difflib-generated diff or a MediaWiki-generated diff obtained through mwclient.
In the first case, you have two strings (the text of two revisions) and you want to get the diff using difflib:
...
t1 = texts[-1][u'*']
t2 = texts[0][u'*']
print('\n'.join(difflib.unified_diff(t1.splitlines(), t2.splitlines())))
(difflib can also generate an HTML diff, refer to the documentation for more info.)
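For example, a minimal HtmlDiff sketch (continuing with t1 and t2 from above, and writing a file you can open in a browser):

html = difflib.HtmlDiff().make_file(t1.splitlines(), t2.splitlines(),
                                    fromdesc='oldest revision', todesc='newest revision')
with open('diff.html', 'w', encoding='utf-8') as f:
    f.write(html)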
But if you want the MediaWiki-generated HTML diff using mwclient you'll need revision ids:
# TODO: Loading all revisions is slow,
# try to load only as many as required.
revisions = list(page.revisions(prop='ids'))
last_revision_id = revisions[-1]['revid']
first_revision_id = revisions[0]['revid']
Then use the compare action to compare the revision ids:
compare_result = site.get('compare', fromrev=last_revision_id, torev=first_revision_id)
html_diff = compare_result['compare']['*']
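The compare API returns the body of the diff table rather than a full page, so a quick way to eyeball the result is to wrap it in a minimal table and save it to a file:

with open('mw_diff.html', 'w', encoding='utf-8') as f:
    f.write('<table>' + html_diff + '</table>')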
I'm using the search feature with Tweepy for Twitter and for some reason the search results are limited to 15. Here is my code
results = api.search(q="Football", rpp=1000)
for result in results:
    print("%s" % clNormalizeString(result.text))
print(len(results))
and only 15 results are returned. Does it have something to do with different pages of results or something?
The question is more about the Twitter API than about tweepy itself.
According to the documentation, the count parameter is defined as:
The number of tweets to return per page, up to a maximum of 100.
Defaults to 15. This was formerly the "rpp" parameter in the old
Search API.
FYI, you can use tweepy.Cursor to get paginated results, like this:
import tweepy

auth = tweepy.OAuthHandler(..., ...)
auth.set_access_token(..., ...)

api = tweepy.API(auth)

for tweet in tweepy.Cursor(api.search,
                           q="google",
                           count=100,
                           result_type="recent",
                           include_entities=True,
                           lang="en").items():
    print(tweet.created_at, tweet.text)
See also: https://github.com/tweepy/tweepy/issues/197.
Hope that helps.
Here's a minimal working example (once you replace the fake keys with real ones).
import tweepy

def get_authorization():
    info = {"consumer_key": "A7055154EEFAKE31BD4E4F3B01F679",
            "consumer_secret": "C8578274816FAEBEB3B5054447B6046F34B41F52",
            "access_token": "15225728-3TtzidHIj6HCLBsaKX7fNpuEUGWHHmQJGeF",
            "access_secret": "61E3D5BD2E1341FFD235DF58B9E2FC2C22BADAD0"}
    auth = tweepy.OAuthHandler(info['consumer_key'], info['consumer_secret'])
    auth.set_access_token(info['access_token'], info['access_secret'])
    return auth

def get_tweets(query, n):
    _max_queries = 100  # arbitrarily chosen value
    api = tweepy.API(get_authorization())
    tweets = tweet_batch = api.search(q=query, count=n)
    ct = 1
    while len(tweets) < n and ct < _max_queries:
        print(len(tweets))
        tweet_batch = api.search(q=query,
                                 count=n - len(tweets),
                                 max_id=tweet_batch.max_id)
        tweets.extend(tweet_batch)
        ct += 1
    return tweets
Note: I did try using a for loop, but the Twitter API sometimes returns fewer than 100 results (despite being asked for 100, with 100 available). I'm not sure why that is, but it's also why I didn't include a check to break the loop when tweet_batch comes back empty -- you may want to add such a check yourself, since there is a query rate limit.
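If you do want that safeguard, a one-line sketch of the check inside the while loop could be:

        if not tweet_batch:  # empty batch: stop rather than keep querying against the rate limit
            break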
Another note: you can avoid hitting the rate limit by passing wait_on_rate_limit=True, like so
api = tweepy.API(get_authorization(), wait_on_rate_limit=True)