I'm trying to create a pandas dataframe with bucket object data (list_objects_v2) using boto3.
Without pagination, I can easily create a dataframe by iterating over the response and appending rows to the dataframe.
import boto3
import pandas
s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket=bucket_name) #this creates a nested json
print(response)
{'ResponseMetadata': {'RequestId': 'PGMCTZNAPV677CWE', 'HostId': '/8qweqweEfpdegFSNU/hfqweqweqweSHtM=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': '/8yacqweqwe/hfjuSwKXDv3qweqweqweHtM=', 'x-amz-request-id': 'PqweqweqweE', 'date': 'Fri, 09 Sep 2022 09:25:04 GMT', 'x-amz-bucket-region': 'eu-central-1', 'content-type': 'application/xml', 'transfer-encoding': 'chunked', 'server': 'AmazonS3'}, 'RetryAttempts': 0}, 'IsTruncated': False, 'Contents': [{'Key': 'qweqweIntraday.csv', 'LastModified': datetime.datetime(2022, 7, 12, 8, 32, 10, tzinfo=tzutc()), 'ETag': '"qweqweqwe4"', 'Size': 1165, 'StorageClass': 'STANDARD'}], 'Name': 'test-bucket', 'Prefix': '', 'MaxKeys': 1000, 'EncodingType': 'url', 'KeyCount': 1}
object_df = pandas.DataFrame()
for elem in response:
    if 'Contents' in elem:
        object_df = pandas.json_normalize(response['Contents'])
Because of the 1000-key limit of list_objects_v2, I'm trying to get the same result using pagination. I attempted it with the following code, but I don't get the desired output (it loops indefinitely on larger buckets).
object_df = pandas.DataFrame()
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name):
    for elem in page:
        if 'Contents' in elem:
            object_df = pandas.json_normalize(page['Contents'])
I managed to find a solution by adding another dataframe and appending each page to it.
appended_object_df = pandas.DataFrame()
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name):
    object_df = pandas.DataFrame()
    object_df = pandas.json_normalize(page['Contents'])
    appended_object_df = appended_object_df.append(object_df, ignore_index=True)
I'm still curious if it's possible to skip the appending part and have the code directly generate the complete df.
Per the pandas documentation:
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
So, you could do:
df_list = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name):
    page_df = pandas.json_normalize(page['Contents'])
    df_list.append(page_df)
object_df = pandas.concat(df_list, ignore_index=True)
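One caveat worth noting (not part of the original answer): a page for an empty bucket or prefix comes back without a 'Contents' key, so page['Contents'] would raise a KeyError. A slightly defensive sketch of the same idea, assuming bucket_name is defined as in the question, could look like this:
import boto3
import pandas

s3 = boto3.client('s3')

df_list = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name):
    # pages for an empty bucket/prefix have no 'Contents' key, so default to []
    contents = page.get('Contents', [])
    if contents:
        df_list.append(pandas.json_normalize(contents))

# concat raises on an empty list, so fall back to an empty DataFrame
object_df = pandas.concat(df_list, ignore_index=True) if df_list else pandas.DataFrame()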
It seems there is something going on with the way the itertools.product loop has been written here. I have the below method in a class, which more or less does what I need; however, there is an issue with the way the data is being created.
def replace_multiple_npi_lists(self):
    conn = self.establish_connection()
    list_ids = []  # list of unique list ids
    new_npis = []  # list of new npis
    df = pd.read_csv("./uploads/new_file.csv")[["NAME", "NPI_ID"]]
    df = df.dropna()
    if get_npi_lists := self.get_all_account_npi_lists(561939):
        for name in get_npi_lists:
            npi_list = {
                "id": name["id"],
                "name": name["name"],
            }
            list_ids.extend(
                npi_list["id"]
                for target in df["NAME"].unique()
                if target == npi_list["name"]
            )
    for target in df["NAME"].unique():
        new_list = df[df["NAME"] == target].reset_index(drop=True)
        new_list = new_list["NPI_ID"].apply(lambda x: str(int(x))).to_list()
        final_list = dict(npis=new_list)
        new_npis.append(final_list)
    try:
        # loop through new_npis and send each list to its npi-list id
        for (i, npis), (i, id) in itertools.product(
            enumerate(new_npis), enumerate(list_ids)
        ):
            if id == list_ids[i]:
                r = conn.put(
                    f"https://xxx.xxx.com/RestApi/v1/npi/npi-list/{id}",
                    json=npis,
                )
                r.raise_for_status()
                if r.status_code == requests.codes.ok:
                    print(f"List #{i + 1} has been uploaded")
                    new_list = {"id": id, "npis": npis}
                    print(new_list)
                    time.sleep(5)
                    self.establish_connection()
    except requests.exceptions.HTTPError:
        print(r.json())
The endpoint accepts an id in the URL and a JSON body like so:
{
    'npis': [
        122,
        123,
        ...
    ]
}
I have formatted the output below via print statements so I can see whether the correct numbers are being passed to the right IDs.
The method works; it does what I want in terms of functionality, but it doesn't create the dicts the way I need them to be sent.
list_ids: a list of unique IDs
new_npis: a list of dicts that hold NPI numbers
list_ids
[4151, 8785, 8786]
new_npis
[{'npis': ['3099994294', '1430739187', '5968165218']}, {'npis': ['3559958528', '2502671659', '7646439044']}, {'npis': ['8065327496', '3487201540', '4693760324']}]
I believe my issue lies with how I have written the loop in the try clause; the output from the loop consists of the following:
List #1 has been uploaded
{'id': 4151, 'npis': {'npis': ['3099994294', '1430739187', '5968165218']}}
List #2 has been uploaded
{'id': 8785, 'npis': {'npis': ['3099994294', '1430739187', '5968165218']}}
List #3 has been uploaded
{'id': 8786, 'npis': {'npis': ['3099994294', '1430739187', '5968165218']}}
List #1 has been uploaded
{'id': 4151, 'npis': {'npis': ['3559958528', '2502671659', '7646439044']}}
List #2 has been uploaded
{'id': 8785, 'npis': {'npis': ['3559958528', '2502671659', '7646439044']}}
List #3 has been uploaded
{'id': 8786, 'npis': {'npis': ['3559958528', '2502671659', '7646439044']}}
List #1 has been uploaded
{'id': 4151, 'npis': {'npis': ['8065327496', '3487201540', '4693760324']}}
List #2 has been uploaded
{'id': 8785, 'npis': {'npis': ['8065327496', '3487201540', '4693760324']}}
List #3 has been uploaded
{'id': 8786, 'npis': {'npis': ['8065327496', '3487201540', '4693760324']}}
So currently it does what I want to a certain degree; however, it seems to be adding the first dict from new_npis to all the ids in list_ids, then looping over them again to add the second dict from new_npis, and so on.
The desired end result is to be the following:
{'id': 4151, 'npis': {'npis': ['3099994294', '1430739187','5968165218']}}
{'id': 8785, 'npis': {'npis': ['3559958528', '2502671659', '7646439044']}}
{'id': 8786, 'npis': {'npis': ['8065327496', '3487201540', '4693760324']}}
Excuse the formatting of the dicts here; it's the only way I could get them to print.
I have resolved it by doing the following:
for i, npis in enumerate(new_npis):
    r = conn.put(
        f"https://xxx.xxx.com/RestApi/v1/npi/npi-list/{list_ids[i]}",
        json=npis,
    )
    r.raise_for_status()
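A slightly more compact variant of the same fix (a sketch only, relying on list_ids and new_npis being built in the same order, as they are above) pairs the two lists with zip instead of indexing:
for list_id, npis in zip(list_ids, new_npis):
    # each npis dict is already in the {'npis': [...]} shape the endpoint expects
    r = conn.put(
        f"https://xxx.xxx.com/RestApi/v1/npi/npi-list/{list_id}",
        json=npis,
    )
    r.raise_for_status()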
Input:
I send multiple GET requests, with two parameters passed as lists (Year and Month), to an API to import data.
My sample code:
import grequests
import json
import requests
import io
import pandas as pd
from pandas import json_normalize

api_key = <api_key>
url = <url>

headersAuth = {
    'Authorization': 'Bearer ' + api_key,
}

years_list = ['2020', '2021']
month_list = ['1', '2']

# Create an array of urls
urls = [url + "Month=" + str(i) + "&Year=" + str(j) for i in month_list for j in years_list]

# Prepare requests
rs = (grequests.get(u, headers={'Authorization': 'Bearer ' + api_key}) for u in urls)
response = grequests.map(rs)
json_ = [r.json() for r in response]
But after the json_ step I'm stuck, because I'm not sure how to parse this further in the right way.
After running my script I get a format that looks like this
[
    {'ID': 3473, 'Month': 1, 'Year': 2020, 'Sold': 1234},
    {'ID': 3488, 'Month': 1, 'Year': 2020, 'Sold': 1789}
]
... etc.
Output:
I would like to export it to a pandas DataFrame and then to an xlsx or csv file as a normal column view.
I'm probably missing some basics here, but I can't process it further.
I tried several things after that:
jsonStr = json.dumps(json_)
listOfDictionaries = json.loads(jsonStr)
data_tuples = list(zip(listOfDictionaries))
df = pd.DataFrame(data_tuples)
But I couldn't get the desired output.
I would appreciate your help on this.
pandas.DataFrame.from_records
import pandas as pd

data = [
    {'ID': 3473, 'Month': 1, 'Year': 2020, 'Sold': 1234},
    {'ID': 3488, 'Month': 1, 'Year': 2020, 'Sold': 1789},
]

pd.DataFrame.from_records(data)
     ID  Month  Year  Sold
0  3473      1  2020  1234
1  3488      1  2020  1789
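Since json_ in the question is a list of per-request responses (each assumed to be a list of record dicts like the sample above), a minimal sketch for getting from there to a csv or xlsx file could flatten it first and then export; the output file names here are just placeholders:
import itertools

import pandas as pd

# json_ = [r.json() for r in response]  -- as built in the question
records = list(itertools.chain.from_iterable(json_))  # flatten the list of lists

df = pd.DataFrame.from_records(records)
df.to_csv("sales.csv", index=False)        # placeholder file name
# df.to_excel("sales.xlsx", index=False)   # requires an Excel writer such as openpyxl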
I'm trying to get the last data from the bitmex API
Base URI: https://www.bitmex.com/api/v1
I don't really understand how to get the most recent data (from today) using filters: https://www.bitmex.com/app/restAPI
here is my code:
from datetime import date
import requests
import json
import pandas as pd
today = date.today()
d1 = today.strftime("%Y-%m-%d")
#print("d1 =", d1)
def parser():
    today = date.today()
    # yyyy-mm-dd
    d1 = today.strftime("%Y-%m-%d")
    # print("d1 =", d1)
    return f'https://www.bitmex.com/api/v1/trade?symbol=.BVOL24H&startTime={d1}&timestamp.time=12:00:00.000&columns=price'

# Making a GET request
response = requests.get(parser()).json()
# print(response)

for elem in response:
    print(elem)
and the response is:
...
{'symbol': '.BVOL24H', 'timestamp': '2021-12-27T08:05:00.000Z', 'price': 2.02}
{'symbol': '.BVOL24H', 'timestamp': '2021-12-27T08:10:00.000Z', 'price': 2.02}
{'symbol': '.BVOL24H', 'timestamp': '2021-12-27T08:15:00.000Z', 'price': 2.02}
It's missing a few hours. I tried using endTime, startTime and count without success.
I think I need to pass another filter, like endTime = now and timestamp.time = now, but I don't know how to send such a payload or how to URL-encode it.
As the Filtering section of the API docs explains:
Many table endpoints take a filter parameter. This is expected to be JSON
These parameters are not keys in the query string, but keys in a dictionary given in the filter key:
import json
from datetime import date

import requests

url = "https://www.bitmex.com/api/v1/trade"
filters = {
    'startTime': date(2021, 12, 20).strftime("%Y-%m-%d"),
    'timestamp.time': '12:00:00.000',
}
params = {
    'symbol': '.BVOL24H',
    'filter': json.dumps(filters),
}
response = requests.get(url, params=params)

for elem in response.json():
    print(elem)
Example
/trade?symbol=.BVOL24H&filter={%22startTime%22:%222021-12-20%22,%22timestamp.time%22:%2212:00:00.000%22}
You can add additional parameters to the url with & like below.
'https://www.bitmex.com/api/v1/trade?symbol=.BVOL24H&startTime={d1}&timestamp.time=12:00:00.000&columns=price&endTime={date.today()}&timestamp.time={date.today()}'
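For the original goal (today's data up to the current time), a sketch along the same lines passes the dates as query parameters and lets requests do the URL-encoding; the exact combination of parameters BitMEX accepts should be double-checked against their API explorer:
from datetime import date, datetime, timezone

import requests

params = {
    'symbol': '.BVOL24H',
    'columns': 'price',
    'startTime': date.today().strftime("%Y-%m-%d"),
    # endTime as an ISO timestamp for "now" in UTC
    'endTime': datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z"),
    'count': 500,
}
response = requests.get("https://www.bitmex.com/api/v1/trade", params=params)
for elem in response.json():
    print(elem)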
I'm getting this message when trying to test my Python 3.8 Lambda function:
Logs are:
soc-connect
contacts.csv
{'ResponseMetadata': {'RequestId': '9D7D7F0C5CB79984', 'HostId': 'wOd6HvIm+BpLOMKF2beRvqLiW0NQt5mK/kzjCjYxQ2kHQZY0MRCtGs3l/rqo4o0r4xAPuV1QpGM=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': 'wOd6HvIm+BpLOMKF2beRvqLiW0NQt5mK/kzjCjYxQ2kHQZY0MRCtGs3l/rqo4o0r4xAPuV1QpGM=', 'x-amz-request-id': '9D7D7F0C5CB79984', 'date': 'Thu, 26 Mar 2020 11:21:35 GMT', 'last-modified': 'Tue, 24 Mar 2020 16:07:30 GMT', 'etag': '"8a3785e750475af3ca25fa7eab159dab"', 'accept-ranges': 'bytes', 'content-type': 'text/csv', 'content-length': '52522', 'server': 'AmazonS3'}, 'RetryAttempts': 0}, 'AcceptRanges': 'bytes', 'LastModified': datetime.datetime(2020, 3, 24, 16, 7, 30, tzinfo=tzutc()), 'ContentLength': 52522, 'ETag': '"8a3785e750475af3ca25fa7eab159dab"', 'ContentType': 'text/csv', 'Metadata': {}, 'Body': <botocore.response.StreamingBody object at 0x7f858dc1e6d0>}
1153
<_csv.reader object at 0x7f858ea76970>
[ERROR] Error: iterator should return strings, not bytes (did you open the file in text mode?)
The code snippet is:
import boto3
import csv

def digest_csv(bucket_name, key_name):
    # Let's use Amazon S3
    s3 = boto3.client('s3')
    print(bucket_name)
    print(key_name)

    s3_object = s3.get_object(Bucket=bucket_name, Key=key_name)
    print(s3_object)

    # read the contents of the file and split it into a list of lines
    lines = s3_object['Body'].read().splitlines(True)
    print(len(lines))

    contacts = csv.reader(lines, delimiter=';')
    print(contacts)

    # now iterate over those contacts
    for contact in contacts:
        # do whatever you want with each line here
        print('-*-'.join(contact))
I think the problem is with csv.reader.
I'm passing a list of lines as the first parameter... Should it be modified?
Any ideas?
Instead of using csv.reader, the following worked for me (adjusted for your delimiter and variables):
for line in lines:
    contact = ''.join(line.decode().split(';'))
    print(contact)
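If you would rather keep csv.reader, the error in the question comes from handing it bytes, so a minimal sketch (assuming the file is UTF-8 encoded, and reusing s3_object from the question) is to decode the body to text before splitting it into lines:
import csv

# decode the whole object body to text so csv.reader receives strings, not bytes
body = s3_object['Body'].read().decode('utf-8')
contacts = csv.reader(body.splitlines(), delimiter=';')
for contact in contacts:
    print('-*-'.join(contact))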
I am using this function to get data from S3:
s3 = boto3.resource('s3')
s3client = boto3.client('s3')

Bucket = s3.Bucket('ais-django')
obj = s3.Object('ais-django', 'Event/')

list = s3client.list_objects_v2(Bucket='ais-django', Prefix='Event/')
for s3_key in list:
    filename = s3_key['Key']
When I use a prefix for the Event folder (the path is like 'ais-django/Event/'), it gives abnormal output like this:
{
    'IsTruncated': False,
    'Prefix': 'Event/',
    'ResponseMetadata': {
        'HTTPHeaders': {
            'date': 'Mon, 11 Jun 2018 12:42:35 GMT',
            'content-type': 'application/xml',
            'transfer-encoding': 'chunked',
            'x-amz-bucket-region': 'us-east-1',
            'x-amz-request-id': '94ADDB21361252F3',
            'server': 'AmazonS3',
            'x-amz-id-2': 'IVuVQuB2V7nClm5FaX4FRbt6brS3gAiuwpERnZxknIWoZLH65LerURwmoynKW5sv37VP6FdbYho='
        },
        'RequestId': '94ADDB21361252F3',
        'RetryAttempts': 0,
        'HostId': 'IVuVQuB2V7nClm5FaX4FRbt6brS3gAiuwpERnZxknIWoZLH65LerURwmoynKW5sv37VP6FdbYho=',
        'HTTPStatusCode': 200
    },
    'MaxKeys': 1000,
    'Name': 'ais-django',
    'KeyCount': 0
}
while without the prefix, when I call it like this:
list = s3client.list_objects_v2(Bucket='ais-django')['Contents']
it gives a list of all objects.
So how can I get all objects in a specific folder?
this is the way you should do it :)
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('ais-django')
for o in bucket.objects.filter(Prefix='Event/test-event'):
    print(o.key)
This is the result you will get.
The result contains Event/test-event/ because there is no folder system in AWS S3; everything is an object, so Event/test-event/ as well as Event/test-event/image.jpg are both considered objects.
If you want only the contents, i.e. the images, you can do it like this:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('ais-django')
for o in bucket.objects.filter(Prefix='Event/test-event'):
    filename = o.key
    if filename.endswith(".jpeg") or filename.endswith(".jpg") or filename.endswith(".png"):
        print(o.key)
Now in this case we get Event/test-event/18342087_1323920084341024_7613721308394107132_n.jpg as the result, because we are filtering the results and this is the only image object in the bucket right now.
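If you prefer to stay with the client API from the question, a similar sketch uses list_objects_v2 with a Prefix and reads the 'Contents' list, paginating in case the prefix holds more than 1000 keys:
import boto3

s3client = boto3.client('s3')

keys = []
paginator = s3client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='ais-django', Prefix='Event/'):
    # 'Contents' is absent when nothing matches the prefix (KeyCount == 0)
    for obj in page.get('Contents', []):
        keys.append(obj['Key'])

print(keys)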