I often use an API call to pull some customer data.
However, whenever I try to pull more than 20 customer ids, the API stops working.
When this happens, I run multiple API calls, transform each JSON output into a df and append all the dataframes together.
That is fine when I need just a couple of API calls, but it becomes inefficient when I have many customer ids to pull, as I sometimes have to run 5-10 separate API calls.
I thought a loop could help here. Given I have little experience with Python, I had a look at other questions on looping APIs, but I couldn't find a solution.
Below is the code I use. How can I loop through all of my customer ids (keeping in mind the limit of roughly 20 ids per call) and end up with a single dataframe?
Thanks!
import requests
import pandas as pd

#list of customer ids
customer_id = [
    "1004rca402itas8470der874",
    "1004rca402itas8470der875",
    "1004rca402itas8470der876",
    "1004rca402itas8470der877",
    "1004rca402itas8470der878",
    "1004rca402itas8470der879"
]
#API call
payload = {'customer': ",".join(customer_id), 'countries': 'DE', 'granularity': 'daily', 'start_date': '2021-01-01', 'end_date': '2022-03-31'}
response = requests.get('https://api.xxxxxxjxjx.com/t3/customers/xxxxxxxxxxxx?auth_token=xxxxxxxxxxxx', params=payload)
response.status_code
#convert to dataframe
api = response.json()
df = pd.DataFrame(api)
df['sales'] = df['domestic_sales'] + df['international_sales']
df = df[['customer_id','country','date','sales']]
df.head()
Here is the general idea:
# List of dataframes
dfs = []
# List of lists of 20 customer ids each
ids = [customer_id[i:i+20] for i in range(0, len(customer_id), 20)]
# Iterate on 'ids' to call api and store new df in list called 'dfs'
for chunk in ids:
    payload = {
        "customer": ",".join(chunk),
        "countries": "DE",
        "granularity": "daily",
        "start_date": "2021-01-01",
        "end_date": "2022-03-31",
    }
    response = requests.get(
        "https://api.xxxxxxjxjx.com/t3/customers/xxxxxxxxxxxx?auth_token=xxxxxxxxxxxx",
        params=payload,
    )
    dfs.append(pd.DataFrame(response.json()))
# Concat all dataframes
df = pd.concat(dfs)
# Additional work
df['sales'] = df['domestic_sales'] + df['international_sales']
df = df[['customer_id','country','date','sales']]
I'm having a hard time trying to make this code show more than one page of orders.
I already tried different methods, such as loops, and also the workaround below, where I tried to fetch page 2 directly.
I just need it to bring me all the orders created on a specific day, but I got completely stuck.
import requests
import pandas as pd
from datetime import datetime, timedelta
# Set the API token for the Shopify API
api_token = 'MYTOKEN'
# Get the current date and subtract one day
today = datetime.now()
yesterday = today - timedelta(days=1)
# Format the date strings for the API request
start_date = yesterday.strftime('%Y-%m-%dT00:00:00Z')
end_date = yesterday.strftime('%Y-%m-%dT23:59:59Z')
# Set the initial limit to 1
limit = 1
page_info = 2
# Set the initial URL for the API endpoint you want to access, including the limit and date range parameters
url = f'https://MYSTORE.myshopify.com/admin/api/2020-04/orders.json?page_info={page_info}&limit={limit}&created_at_min={start_date}&created_at_max={end_date}&'
# Set the API token as a header for the request
headers = {'X-Shopify-Access-Token': api_token}
# Make the GET request
response = requests.get(url, headers=headers)
# Check the status code of the response
if response.status_code == 200:
    # Parse the JSON response directly
    orders = response.json()['orders']
    # Flatten the JSON response into a Pandas DataFrame, including the 'name' column (order number) and renaming the 'id' column to 'order_id'
    df = pd.json_normalize(orders, sep='_', record_path='line_items', meta=['name', 'id'], meta_prefix='meta_')
    # Flatten the line_items data into a separate DataFrame
    line_items_df = pd.json_normalize(orders, 'line_items', ['id'], meta_prefix='line_item_')
    # Flatten the 'orders' data into a separate DataFrame | Added in Dec.26-2022
    orders_df = pd.json_normalize(orders, sep='_', record_path='line_items', meta=['created_at', 'id'], meta_prefix='ordersDTbs_')
    # Merge the 'df' and 'orders_df' DataFrames | Added in Dec.26-2022
    df = pd.merge(df, orders_df[['id', 'ordersDTbs_created_at']], on='id')
    # Convert created_at to a date only | Added in Dec.26-2022
    df['ordersDTbs_created_at'] = df['ordersDTbs_created_at'].apply(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%S%z').date())
    # Merge in the sku and quantity columns from line_items_df
    df = pd.merge(df, line_items_df[['id', 'sku', 'quantity']], on='id')
    # Calculate the amount paid after the discount and add it as a new column in the dataframe
    df['price_set_shop_money_amount'] = pd.to_numeric(df['price_set_shop_money_amount'])
    df['total_discount_set_shop_money_amount'] = pd.to_numeric(df['total_discount_set_shop_money_amount'])
    df = df.assign(paid_afterdiscount=df['price_set_shop_money_amount'] - df['total_discount_set_shop_money_amount'])
    # Print the DataFrame
    print(df[['meta_name','ordersDTbs_created_at','sku_y','title','fulfillable_quantity','quantity_x','quantity_y','paid_afterdiscount']])
# Checking if the API call ran smoothly
else:
    print('Something went wrong.')
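A rough, hedged sketch of the usual approach for this case: Shopify's REST Admin API paginates with a cursor carried in the response's Link header rather than numeric pages, and the requests library exposes that header as response.links, so following the rel="next" URL until it disappears collects every page. The store URL, token, and date filters below come from the question above; the page size of 250 and the exact pagination behaviour are assumptions to verify against Shopify's documentation.
import requests
import pandas as pd
from datetime import datetime, timedelta

api_token = 'MYTOKEN'
headers = {'X-Shopify-Access-Token': api_token}

yesterday = datetime.now() - timedelta(days=1)
start_date = yesterday.strftime('%Y-%m-%dT00:00:00Z')
end_date = yesterday.strftime('%Y-%m-%dT23:59:59Z')

# Only the first request carries the filters; later pages reuse the cursor URL from the Link header
url = (f'https://MYSTORE.myshopify.com/admin/api/2020-04/orders.json'
       f'?limit=250&created_at_min={start_date}&created_at_max={end_date}')

all_orders = []
while url:
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    all_orders.extend(response.json()['orders'])
    # requests parses the Link header into response.links; 'next' is absent on the last page
    url = response.links.get('next', {}).get('url')

# Flatten every collected order, along the same lines as the original code
df = pd.json_normalize(all_orders, sep='_', record_path='line_items', meta=['name', 'id', 'created_at'], meta_prefix='meta_')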
I am converting json_response to a dataframe by using the following commands:
df = pd.DataFrame(columns=["created_at", "username", "description", "tweet_id"]) #an empty dataframe to save data
data_nested = pd.json_normalize(json_response['data'])
df_temp = data_nested[["created_at", "username", "description"]].copy()
df = pd.concat([df, df_temp], ignore_index=True)
df.reset_index(inplace=True, drop=True)
Following is my sample json_response:
{
    "data": [
        {
            "created_at": "2020-01-01T12:24:45.000Z",
            "description": "This is a sample description",
            "id": "12345678",
            "name": "Sample Name",
            "username": "sample_name"
        }
    ],
    "meta": {
        "next_token": "sample_token",
        "result_count": 1
    }
}
This response is the result of querying the "Retweeted_by" endpoint of Twitter API v2. I am trying to save "tweet_id" against each response in the loop (to understand which resulting row corresponds to which requesting tweet_id) by doing df['tweet_id'] = tweet_id. I understand that this way the last tweet_id will overwrite everything else in the column.
I tried to do the following as well using index:
idx = df["username"].last_valid_index()
if pd.isnull(idx) or idx is None:
    df.loc[0, "tweet_id"] = tweet_id
else:
    df.loc[idx + 1, "tweet_id"] = tweet_id
But this fails as well, because if result_count in the json_response is greater than 1, it saves tweet_id only in that next row, leaving the previous ones as NaN.
Can someone please suggest a solution? Thank you.
Based on our exchange in the comments here is my proposed solution:
import json
import requests
import pandas as pd

tweet_id_list = [1, 2, 3]  # a list of all of your tweet ids

# here you will start looping through each id, and getting retweets.
# You could make this async but I would be careful since token limits are very
# strict on twitter. They can disable it if you go over the limit a lot.
all_dfs = []
for tweet_id in tweet_id_list:
    response = requests.post("url/tweet_id")  # placeholder URL for the retweeted_by call
    json_response = json.loads(response.text)
    temp_df = pd.DataFrame.from_records(json_response['data'])
    temp_df['tweet_id'] = tweet_id
    all_dfs.append(temp_df)

# if you want to then have one big table with all the retweets and tweet_ids
# simply do:
df = pd.concat(all_dfs)
Just a bit of explanation.
You are creating a dataframe of retweets for each tweet_id (temp_df). You are also creating an extra column in that dataframe called tweet_id. When you assign a scalar value to a DataFrame column, pandas propagates it to every row of that df.
You then collect the dataframes for each tweet_id into the list all_dfs.
After you exit the loop you are left with a list of dataframes. If you want one big table, you concatenate them as I have shown in the code.
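A tiny, self-contained illustration of that column-broadcasting behaviour (the values here are made up for the example):
import pandas as pd

demo = pd.DataFrame({"username": ["a", "b", "c"]})
demo["tweet_id"] = 12345  # the scalar is repeated for every existing row
print(demo)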
I need to write a Python script to:
Read a csv file with the columns (person_id, name, flag). The file has 3000 rows.
Based on the person_id from the csv file, I need to call a URL passing the person_id to do a GET
http://api.myendpoint.intranet/get-data/1234
The URL will return some information for that person_id, like the example further below. I need to get all the rents objects and save them to my csv; the output I need is shown after the code. Here is what I have so far:
import pandas as pd
import requests

ids = pd.read_csv(f"{path}/data.csv", delimiter=';')
person_rents = pd.DataFrame([], columns=['person_id', 'carId', 'price', 'rentStatus'])
for id in ids:
    response = requests.get(f'endpoint/{id["person_id"]}')
    json = response.json()
    person_rents.append([person_id, rent['carId'], rent['price'], rent['rentStatus']])
The output csv needs to look like this:
person_id;name;flag;cardId;price;rentStatus
1000;Joseph;1;6638;1000;active
1000;Joseph;1;5566;2000;active
Response example
{
    "active": false,
    "ctodx": false,
    "rents": [{
        "carId": 6638,
        "price": 1000,
        "rentStatus": "active"
    }, {
        "carId": 5566,
        "price": 2000,
        "rentStatus": "active"
    }],
    "responseCode": "OK",
    "status": [{
        "request": 345,
        "requestStatus": "F"
    }, {
        "requestId": 678,
        "requestStatus": "P"
    }],
    "transaction": false
}
After saving the additional data from the response to the csv, I need to get data from another endpoint, using the carId in the URL. The mileage result must be saved in the same csv.
http://api.myendpoint.intranet/get-mileage/6638
http://api.myendpoint.intranet/get-mileage/5566
The response for each call will look like this:
{"mileage":1000.0000}
{"mileage":550.0000}
The final output must be
person_id;name;flag;cardId;price;rentStatus;mileage
1000;Joseph;1;6638;1000;active;1000.0000
1000;Joseph;1;5566;2000;active;550.0000
Can someone help me with this script?
It could be with pandas or any Python 3 lib.
Code Explanation
Create dataframe df with pd.read_csv.
It is expected that all of the values in 'person_id' are unique.
Use .apply on 'person_id' to call prepare_data.
prepare_data expects 'person_id' to be a str or int, as indicated by the type annotation, Union[int, str].
Inside prepare_data, call the API, which returns a dict.
Convert the 'rents' key of the dict into a dataframe with pd.json_normalize.
Use .apply on 'carId' to call the mileage API and extract 'mileage', which is added to the dataframe data as a column.
Add 'person_id' to data, which can be used to merge df with s.
Convert the pd.Series s into a dataframe with pd.concat, and then merge df and s on person_id.
Save to a csv with .to_csv in the desired form.
Potential Issues
If there's an issue, it's most likely to occur in the call_api function.
As long as call_api returns a dict, like the response shown in the question, the remainder of the code will work correctly to produce the desired output.
import pandas as pd
import requests
import json
from typing import Union

def call_api(url: str) -> dict:
    r = requests.get(url)
    return r.json()

def prepare_data(uid: Union[int, str]) -> pd.DataFrame:
    d_url = f'http://api.myendpoint.intranet/get-data/{uid}'
    m_url = 'http://api.myendpoint.intranet/get-mileage/'
    # get the rent data from the api call
    rents = call_api(d_url)['rents']
    # normalize rents into a dataframe
    data = pd.json_normalize(rents)
    # get the mileage data from the api call and add it to data as a column
    data['mileage'] = data.carId.apply(lambda cid: call_api(f'{m_url}{cid}')['mileage'])
    # add person_id as a column to data, which will be used to merge data to df
    data['person_id'] = uid
    return data

# read data from file
df = pd.read_csv('file.csv', sep=';')
# call prepare_data
s = df.person_id.apply(prepare_data)
# s is a Series of DataFrames, which can be combined with pd.concat
s = pd.concat([v for v in s])
# join df with s, on person_id
df = df.merge(s, on='person_id')
# save to csv
df.to_csv('output.csv', sep=';', index=False)
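If the API turns out to be unreliable, a slightly more defensive call_api is sketched below; raise_for_status() and the timeout argument are standard requests features, while the retry count is an arbitrary choice for illustration.
import requests

def call_api(url: str, retries: int = 3, timeout: float = 10.0) -> dict:
    # Retry a few times and turn 4xx/5xx responses into exceptions instead of failing silently
    for attempt in range(retries):
        try:
            r = requests.get(url, timeout=timeout)
            r.raise_for_status()
            return r.json()
        except requests.RequestException:
            if attempt == retries - 1:  # out of retries: let the error propagate
                raise
    return {}  # not reached; keeps the return type consistent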
If there are any errors when running this code:
Leave a comment to let me know.
Edit your question and paste the entire traceback, as text, into a code block.
Example
# given the following start dataframe
person_id name flag
0 1000 Joseph 1
1 400 Sam 1
# resulting dataframe using the same data for both id 1000 and 400
person_id name flag carId price rentStatus mileage
0 1000 Joseph 1 6638 1000 active 1000.0
1 1000 Joseph 1 5566 2000 active 1000.0
2 400 Sam 1 6638 1000 active 1000.0
3 400 Sam 1 5566 2000 active 1000.0
There are many different ways to implement this. One of them would be, like you started in your comment:
read the CSV file with pandas
for each line take the person_id and build a call
the rents can then be read from the delivered JSON response
the carId is then extracted for each individual rental
finally this is collected in a row_list
the row_list is then converted back to csv via pandas
A very simple solution without any error handling could look something like this:
from types import SimpleNamespace
import pandas as pd
import requests
import json

path = '/some/path/'
df = pd.read_csv(f'{path}/data.csv', delimiter=';')
rows_list = []
for _, row in df.iterrows():
    rentCall = f'http://api.myendpoint.intranet/get-data/{row.person_id}'
    print(rentCall)
    response = requests.get(rentCall)
    r = json.loads(response.text, object_hook=lambda d: SimpleNamespace(**d))
    for rent in r.rents:
        mileageCall = f'http://api.myendpoint.intranet/get-mileage/{rent.carId}'
        print(mileageCall)
        response2 = requests.get(mileageCall)
        m = json.loads(response2.text, object_hook=lambda d: SimpleNamespace(**d))
        state = "active" if r.active else "inactive"
        rows_list.append((row['person_id'], row['name'], row['flag'], rent.carId, rent.price, state, m.mileage))

df = pd.DataFrame(rows_list, columns=('person_id', 'name', 'flag', 'carId', 'price', 'rentStatus', 'mileage'))
print(df.to_csv(index=False, sep=';'))
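To write that result to disk instead of printing it, the last line of the snippet above could be swapped for a plain to_csv call (the output file name here is just an example):
df.to_csv(f'{path}/output.csv', sep=';', index=False)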
Speeding up with multiprocessing
You mention that you have 3000 rows, which means that you'll have to make a lot of API calls. Depending on the connection, every one of these calls might take a while. As a result, performing this in a sequential way might be too slow. The majority of the time, your program will just be waiting on a response from the server without doing anything else.
We can improve this performance by using multiprocessing.
I use all the code from Trenton's answer, but replace the following sequential call:
# call prepare_data
s = df.person_id.apply(prepare_data)
With a parallel alternative:
from multiprocessing import Pool

n_processes = 20  # Experiment with this to see what works well
with Pool(n_processes) as p:
    s = p.map(prepare_data, df.person_id)
Alternatively, a threadpool might be faster, but you'll have to test that by replacing the import with
from multiprocessing.pool import ThreadPool as Pool.
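For illustration, the thread-based variant would look like the sketch below. Threads are usually sufficient for I/O-bound API calls; note that p.map returns a list rather than a Series, which the later s = pd.concat([v for v in s]) handles just as well.
from multiprocessing.pool import ThreadPool as Pool

n_threads = 20  # experiment with this, and respect the API's rate limits
with Pool(n_threads) as p:
    s = p.map(prepare_data, df.person_id)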
I'm trying to aggregate multiple years' worth of data from the EPA's air quality API. The API returns a JSON file for each year, which I would like to convert to a dataframe, ultimately appending each subsequent year to the same dataframe. Here's my code:
import requests
import pandas as pd

pd.set_option('display.max_columns', 60)

i = 1999
for i in range(1999, 2020):
    parameters = {
        "email": "patrick.debiasse@gmail.com",
        "key": "khakihawk63",
        "param": "81104,44201,42602,42101,42401",
        "bdate": str(i) + "1201",
        "edate": str(i) + "1202",
        "state": "49",
        "county": "035",
        "site": "3006"
    }
    #requesting the JSON data
    json_data = requests.get("https://aqs.epa.gov/data/api/annualData/bySite?email=test@aqs.api&key=test&param=44201&bdate=20170618&edate=20170618&state=37&county=183&site=0014", params=parameters).json()
    #converting to dataframe
    df = pd.DataFrame((json_data['Data']))
    #appending the converted data to a separate dataframe which will ultimately contain all the years' data
    df2 = df.append(df)
    i + 1

df2
When I run the above code I only see data for the last year (2019) in the "df2" dataframe, and it seems to be included twice (2019 data appended to 2019 data). Am I making some novice for loop mistake here? Not appending the data correctly? Something else I'm not considering? Any help is much appreciated.
You currently append the dataframe to itself in every loop and assign it to the df2 variable.
Instead try this:
for i in range(1999, 2020):
    ...
    #converting to dataframe
    df = pd.DataFrame((json_data['Data']))
    #appending the converted data to a separate dataframe which will ultimately contain all the years' data
    if i == 1999:
        result = df
    else:
        result = result.append(df)
I also don't understand why you have the statement i + 1 at the end of the for loop. The for loop itself takes care of increasing the counter, so you don't need to do that. If you wanted to skip every other year, you could use range(1999, 2020, 2) instead.
Also, the way you wrote it, i doesn't actually get increased; you would need to write i += 1 for that.
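As a side note beyond the original answer: collecting each year's frame in a list and concatenating once at the end avoids the special case for the first year, and keeps working on newer pandas versions where DataFrame.append has been removed. A sketch reusing the request from the question, with placeholder credentials and assuming the query parameters are passed only via params:
import requests
import pandas as pd

frames = []
for year in range(1999, 2020):
    parameters = {
        "email": "your_email",  # placeholder credentials
        "key": "your_key",
        "param": "81104,44201,42602,42101,42401",
        "bdate": f"{year}1201",
        "edate": f"{year}1202",
        "state": "49",
        "county": "035",
        "site": "3006",
    }
    json_data = requests.get("https://aqs.epa.gov/data/api/annualData/bySite", params=parameters).json()
    frames.append(pd.DataFrame(json_data["Data"]))

# one concatenation at the end instead of appending inside the loop
df2 = pd.concat(frames, ignore_index=True)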
I have to make several calls to an API and merge the results from each one into a single dataframe. The responses have the same keys, and appending works once, but when I try to merge in the third one, nothing happens. I'm probably also not doing this the most efficient way.
I originally tried a for-loop to do this, but for experimental purposes I am trying to just do it manually (changing the offset parameter by 5000 each time). The call limit is 5000, so I can only get that many records at a time. I know some of my variable names are probably inaccurate descriptions of what they represent ("JSONString", etc.), but bear with me.
I won't include the url in my calls below, but it is accurate.
#First call, gets the necessary values out of the API and successfully turns them into a dataframe
params = urllib.parse.urlencode({
    # Request parameters
    'limit': 5000,
})
categoriesJSON = s.get(url, headers=headers)
categoriesJSONString = categoriesJSON.json()
categoriesDf = pandas.DataFrame(categoriesJSONString['value'])

#Second call, gets the necessary values out of the API and successfully turns them into a dataframe and then appends that dataframe to the original dataframe successfully
params = urllib.parse.urlencode({
    # Request parameters
    'limit': 5000,
    'offset': 5000
})
categoriesJSON = s.get(url, headers=headers)
categoriesJSONString = categoriesJSON.json()
newCategoriesDf = pandas.DataFrame(categoriesJSONString['value'])
categoriesDf.append(newCategoriesDf, ignore_index = True)

#Third, gets the necessary values out of the API and turns them into a dataframe and then appends that dataframe to the original dataframe unsuccessfully
params = urllib.parse.urlencode({
    # Request parameters
    'limit': 5000,
    'offset': 10000
})
categoriesJSON = s.get(url, headers=headers)
categoriesJSONString = categoriesJSON.json()
newCategoriesDf = pandas.DataFrame(categoriesJSONString['value'])
categoriesDf.append(newCategoriesDf, ignore_index = True)
After the second call, my dataframe is 10000 rows long, but after the third call, my dataframe is still 10000 rows long. What is preventing it from being 15000 long? I know that I have more than 10000 rows of data to get.
DataFrame.append returns a new dataframe; the existing one is not updated in place.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html
Just update the target, like this:
download = pd.DataFrame(data = 1, index = [1], columns = ['Data'])
x = download
x = x.append(download, ignore_index=True)
x = x.append(download, ignore_index=True)
x = x.append(download, ignore_index=True)
df.append returns a new, appended DataFrame; you need to change the last line to:
categoriesDf = categoriesDf.append(newCategoriesDf, ignore_index = True)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html
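Beyond the one-line fix, here is a hedged sketch of the loop the question mentions trying. It assumes the endpoint accepts limit and offset as query parameters and returns records under the 'value' key, and it reuses the session s, url, and headers from the question (note that params must actually be passed to s.get for the offset to take effect):
import pandas as pd

page_size = 5000  # the API's per-call limit
frames = []
offset = 0
while True:
    response = s.get(url, headers=headers, params={'limit': page_size, 'offset': offset})
    chunk = pd.DataFrame(response.json()['value'])
    if chunk.empty:  # stop once the API runs out of records
        break
    frames.append(chunk)
    offset += page_size

categoriesDf = pd.concat(frames, ignore_index=True)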