google-cloud-storage python list_blobs performance

I have a very simple Python function:

from google.cloud import storage

def list_blobs(bucket, project):
    storage_client = storage.Client(project=project)
    bucket = storage_client.get_bucket(bucket)
    blobs = bucket.list_blobs(prefix='basepath/', max_results=999999,
                              fields='items(name,md5Hash),nextPageToken')
    r = [(b.name, b.md5_hash) for b in blobs]
The blobs list contains 14599 items, and this code takes 7 seconds to run.
When profiling, most of the time is spent reading from the server (there are 16 calls to page_iterator._next_page).
So, how can I improve here? The iteration code is deep in the library, and the pointer to each page comes from the previous page, so I see no way to fetch the 16 pages in parallel and cut down those 7 seconds.
I am on Python 3.6.8, with:
google-api-core==1.7.0
google-auth==1.6.2
google-cloud-core==0.29.1
google-cloud-storage==1.14.0
google-resumable-media==0.3.2
googleapis-common-protos==1.5.6
protobuf==3.6.1

Your max_results=999999 is larger than 14599 (the number of objects), forcing all results into a single page. From Bucket.list_blobs():
Parameters:
max_results (int) – (Optional) The maximum number of blobs in each page of results from this request. Non-positive values are ignored.
Defaults to a sensible value set by the API.
My guess is that the code spends a lot of time blocked waiting for the server to provide the info needed to iterate through the results.
So the first thing I'd try is to actually iterate through multiple pages, using a max_results smaller than the number of blobs. Maybe 1000 or 2000, and see the impact on overall duration.
Maybe even try using the multiple pages explicitly, via blobs.pages, as suggested in the deprecated page_token property doc (emphasis mine):
page_token (str) – (Optional) If present, return the next batch of blobs, using the value, which must correspond to the nextPageToken value returned in the previous response. Deprecated: use the pages property of the returned iterator instead of manually passing the token.
But I'm not quite sure how to force the multiple pages to be simultaneously pulled. Maybe something like this?
[(b.name, b.md5_hash) for page in blobs.pages for b in page]
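To make the suggestion concrete, here is a minimal sketch (untested; it follows the per-page reading of max_results from the docstring quoted above, so double-check that against your installed version):

from google.cloud import storage

def list_blob_hashes(bucket_name, project):
    storage_client = storage.Client(project=project)
    bucket = storage_client.get_bucket(bucket_name)
    # ask for pages well below the 14599-object total, per the docstring above
    blobs = bucket.list_blobs(prefix='basepath/', max_results=2000,
                              fields='items(name,md5Hash),nextPageToken')
    # walk the pages property instead of passing page_token manually
    return [(b.name, b.md5_hash) for page in blobs.pages for b in page]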

Related

Getting all review requests from Review Board Python Web API

I would like to get the information about all reviews from my server. This is the code I used to try to achieve that:
from rbtools.api.client import RBClient
client = RBClient('http://my-server.net/')
root = client.get_root()
reviews = root.get_review_requests()
The variable reviews contains just 25 review requests (I expected much, much more). What's even stranger, I tried something a bit different:
count = root.get_review_requests(counts_only=True)
Now count.count is equal to 17164. How can I extract the rest of my reviews? I checked the official documentation but haven't found anything related to my problem.
According to the documentation (https://www.reviewboard.org/docs/manual/dev/webapi/2.0/resources/review-request-list/#webapi2.0-review-request-list-resource), counts_only is just a Boolean flag that indicates the following:
If specified, a single count field is returned with the number of results, instead of the results themselves.
But what you could do is provide it with status, so:
count = root.get_review_requests(counts_only=True, status='all')
should return you all the requests.
Keep in mind that I didn't test this part of the code locally. I referred to their repo test example -> https://github.com/reviewboard/rbtools/blob/master/rbtools/utils/tests/test_review_request.py#L643 and the documentation link posted above.
You have to use pagination (unfortunately I can't provide exact code without being able to reproduce your setup):
The maximum number of results to return in this list. By default, this is 25. There is a hard limit of 200; if you need more than 200 results, you will need to make more than one request, using the “next” pagination link.
It looks like a pagination helper class is also available.
If you want to get 200 results you may set max_results:
requests = root.get_review_requests(max_results=200)
Anyway, HERE is a good example of how to iterate over the results.
Also, I don't recommend fetching all 17164 results in one request even if it were possible, because the total response would be huge (say each result is 10 KB; the total would then be more than 171 MB).
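For reference, a rough sketch of that pagination loop (untested; it assumes the list resource is iterable and exposes a get_next() method that raises StopIteration on the last page, as in the example linked above):

from rbtools.api.client import RBClient

client = RBClient('http://my-server.net/')
root = client.get_root()

all_requests = []
# fetch the first page; 200 is the documented hard limit per request
page = root.get_review_requests(status='all', max_results=200)
while True:
    all_requests.extend(page)
    try:
        # follow the "next" pagination link to the following batch
        page = page.get_next(max_results=200)
    except StopIteration:
        # no "next" link means we reached the last page
        break

print(len(all_requests))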

Returning all Data from a CKAN API Request? (Python)

This is my first time using a CKAN Data API. I am trying to download public road accident data from a government website. It only shows the first 100 rows. The CKAN documentation says the default limit of rows it requests is 100. I am pretty sure you can append a CKAN expression to the end of the URL to give you the maximum number of rows, but I am not sure how to write it. Please see the Python code below for what I have so far. Is it possible? Thanks.
Is there any way I can write code similar to the pseudo-CKAN request below?
url='https://data.gov.au/data/api/3/action/datastore_search?resource_id=d54f7465-74b8-4fff-8653-37e724d0ebbb&limit=MAX_ROWS'
CKAN Documentation reference: http://docs.ckan.org/en/latest/maintaining/datastore.html
There are several interesting fields in the documentation for ckanext.datastore.logic.action.datastore_search(), but the ones that pop out are limit and offset.
limit seems to have an absolute maximum of 32000 so depending on the amount of data you might still hit this limit.
offset seems to be the way to go. You keep calling the API with the offset increasing by a set amount until you have all the data. See the code below.
But actually calling the API revealed something interesting: it generates a next URL which you can call, and it automagically updates the offset based on the limit used (while maintaining the limit set on the initial call).
You can call this URL to get the next batch of results.
Some testing showed that it will go past the maximum though, so you need to check whether the number of returned records is lower than the limit you used.
import requests

BASE_URL = "https://data.gov.au/data"
INITIAL_URL = "/api/3/action/datastore_search?resource_id=d54f7465-74b8-4fff-8653-37e724d0ebbb"
LIMIT = 10000

def get_all() -> list:
    result = []
    resp = requests.get(f"{BASE_URL}{INITIAL_URL}&limit={LIMIT}")
    js = resp.json()["result"]
    result.extend(js["records"])
    while "_links" in js and "next" in js["_links"]:
        resp = requests.get(BASE_URL + js["_links"]["next"])
        js = resp.json()["result"]
        result.extend(js["records"])
        print(js["_links"]["next"])  # just so you know it's actually doing stuff
        if len(js["records"]) < LIMIT:
            # if it returned fewer records than the limit, the end has been reached
            break
    return result

print(len(get_all()))
Note: when exploring an API, it helps to check exactly what is returned. I used the simple code below for that, which made exploring the API a lot easier. Also, reading the docs helps, like the ones linked above.
from pprint import pprint
pprint(requests.get(BASE_URL+INITIAL_URL+"&limit=1").json()["result"])

Google Forms API Record Limit?

When I use the service.forms().responses().list() snippet I am getting 5,000 rows in my df. When I look at the front end sheet I see more than 5,000 rows. Is there a limit to the number of rows that can be pulled?
You are hitting the default page size, as per this documentation.
When you don't specify a page size, it will return up to 5000 results, and provide you a page token if there are more results. Use this page token to call the API again, and you get the rest, up to 5000 again, with a new page token if you hit the page size again.
Note that you can set the page size to a large number. Do not do this; it is not the method you should use, and I'm including it only for completeness' sake. You will still run into the same issue if you hit the larger page size, and there is a hard limit on Google's side (which is not specified/documented).
You can create a simple recursive function that handles this for you.
def get_all(service, formId, responses=None, pageToken=None) -> list:
    # use a fresh list per call (a mutable default argument would be shared between calls)
    if responses is None:
        responses = []
    # call the API
    res = service.forms().responses(formId=formId, pageToken=pageToken).list().execute()
    # append responses
    responses.extend(res["responses"])
    # check for nextPageToken
    if "nextPageToken" in res:
        # recursively call the API again
        return get_all(service, formId, responses=responses, pageToken=res["nextPageToken"])
    else:
        return responses
Please note that I didn't test this one; you might need to move pageToken=pageToken from the .responses() call to the .list() parameters. The implementation of Google's Python libraries is not exactly consistent.
Let me know if this is the case so I can update the answer accordingly...
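For comparison, here is an iterative sketch written against the call shape in the current API reference, where formId and pageToken are parameters of .list(); again untested, so treat it as a starting point:

def get_all_responses(service, form_id) -> list:
    responses = []
    page_token = None
    while True:
        # formId and pageToken passed to list(), per the current API reference
        res = service.forms().responses().list(
            formId=form_id, pageToken=page_token).execute()
        responses.extend(res.get("responses", []))
        page_token = res.get("nextPageToken")
        if not page_token:
            return responses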

Twitter scraping in Python

I have to scrape tweets from Twitter for a specific user (#salvinimi), from January 2018. The issue is that there are a lot of tweets in this range of time, and so I am not able to scrape all the ones I need!
I tried multiple solutions:
1)
pip install twitterscraper
from twitterscraper import query_tweets_from_user as qtfu
tweets = qtfu(user='matteosalvinimi')
With this method, I get only a few tweets (500~600, more or less) instead of all the tweets... Do you know why?
2)
!pip install twitter_scraper
from twitter_scraper import get_tweets
tweets = []
for i in get_tweets('matteosalvinimi', pages=100):
    tweets.append(i)
With this method I get an error -> "ParserError: Document is empty"...
If I set pages=40, I get tweets without errors, but not all of them. Do you know why?
Three things for the first issue you encounter:
first of all, every API has its limits, and one like Twitter's would be expected to monitor its use and eventually stop a user from retrieving data if the user asks for more than the limits allow. Trying to overcome the limitations of the API might not be the best idea and might result in being banned from accessing the site or other consequences (I'm guessing here, as I don't know Twitter's policy on the matter). That said, the documentation of the library you're using states:
With Twitter's Search API you can only send 180 requests every 15 minutes. With a maximum number of 100 tweets per request, this means you can mine for 4 x 180 x 100 = 72,000 tweets per hour.
By using TwitterScraper you are not limited by this number but by your internet speed/bandwidth and the number of instances of TwitterScraper you are willing to start.
then, the function you're using, query_tweets_from_user(), has a limit argument which you can set to an integer. One thing you can try is raising that argument and seeing whether you get what you want (see the sketch after this list);
finally, if the above does not work, you could split your time range into two, three or more subsets if needed, collect the data separately and merge it together afterwards.
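As a small illustration of the second point (a sketch; limit is the only argument shown because it's the one documented for this function, and I haven't run this against Twitter):

from twitterscraper import query_tweets_from_user as qtfu

# ask explicitly for more tweets than the ~500-600 you got by default
tweets = qtfu(user='matteosalvinimi', limit=5000)
print(len(tweets))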
The second issue you mention might be due to many different things, so I'll just take a broad guess here. For me, either pages=100 is too high and, one way or another, the program or the API is unable to retrieve the data, or you're asking for a hundred pages when fewer than a hundred actually exist, which results in the program trying to parse an empty document.

Page through S3 objects matching specific filename using boto3

I have an AWS S3 bucket with a Prefix (or "folder") called /photos. That "contains" a bunch of image files and even fewer EVENT.json files. A naive representation might look like this:
my-awesome-events-bucket
    photos
        image1.jpg
        image2.jpg
        1_EVENT.json
        image3.jpg
        2_EVENT.json
        ...
The EVENT.json files have an object that contains a path reference to an arbitrary amount of image files, which group images into a specific event. Using the example above, image1.jpg and image2.jpg could appear in 1_EVENT.json, and image3.jpg may belong to 2_EVENT.json.
As the bucket gets larger, I have an interest in paging through the results. I only want to request a page at a time from S3 as I need them. The problem I'm running into is that I want to page specifically by keys that contain the word "EVENT". I'm finding this difficult to accomplish without bringing back ALL the objects and then filtering or iterating the results.
Using an S3 Paginator, I'm able to get paging working. Assuming my PageSize and MaxItems are set to 6, this is what I might get back for my first page:
/photos/
/photos/image1.jpg
/photos/image2.jpg
/photos/1_EVENT.json
/photos/image3.jpg
/photos/2_EVENT.json
S3's flat structure means that it's paging through all objects in the bucket according to the Prefix, and limiting and paging according to the pagination parameters. This means that I could easily get multiple EVENT.json files, or none at all, depending on the page.
So I'm looking for something more along the lines of this:
/photos/1_EVENT.json
/photos/2_EVENT.json
/photos/3_EVENT.json
/photos/4_EVENT.json
/photos/5_EVENT.json
/photos/6_EVENT.json
without first having to request all objects and then slice the results set in some way; which is exactly what I'm doing currently:
import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(
    Bucket=app.config.get('S3_BUCKET'),
    Prefix="photos/")  # Left PaginationConfig MaxItems & PageSize off intentionally

filtered_iterator = page_iterator.search(
    "Contents[?contains(Key, `EVENT`)][]")

for page in filtered_iterator:
    # Do stuff.
    pass
The above is really expensive, with no paging, but it does give me a list of all files containing my "EVENT" search string.
I specifically want to page results of only EVENT.json objects through S3 using boto3 without the overhead of returning and filtering all objects every request. Is that possible?
EDIT: I'm already narrowing requests down to just objects with the photos/ Prefix. This is because there are other "folders" in my bucket that also may contain EVENT files. That prevents me from using EVENT or EVENT.json as my Prefix, because the response may be polluted by files from other folders.
The simplest way would be to rehash your filename structure to have the EVENT files follow the pattern photos/EVENT_*.json instead of photos/*_EVENT.json. Then you could use a common prefix of photos/EVENT.
Short of that, I think that the expensive method you are using is actually the only way to go about it.
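If renaming is an option, the existing paginator code only needs the narrower prefix; a sketch reusing the bucket name and page size from the question's example:

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')

# with photos/EVENT_*.json naming, the server-side Prefix does all the filtering,
# so each page contains only EVENT files
page_iterator = paginator.paginate(
    Bucket='my-awesome-events-bucket',
    Prefix='photos/EVENT_',
    PaginationConfig={'PageSize': 6})

for page in page_iterator:
    for obj in page.get('Contents', []):
        print(obj['Key'])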
There is a prefix option you can pass to one of the search/listing functions in boto. This will dramatically reduce the number of files it has to scan. However, if you have to search for strings with wildcards in the middle of the key then, last I knew, it had to scan all the objects in the bucket and you would have to wildcard-search through those objects yourself.
ex:
bucket.search_function(prefix="string")
I can't recall the boto function off the top of my head though.
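For what it's worth, the closest boto3 equivalent of that older boto call is a Prefix filter on the resource API; a sketch (the substring match on "EVENT" still has to happen client-side, since S3 cannot match wildcards in the middle of a key):

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-awesome-events-bucket')

# Prefix narrows the listing server-side; the "EVENT" check is client-side
event_keys = [obj.key
              for obj in bucket.objects.filter(Prefix='photos/')
              if 'EVENT' in obj.key]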
