I've just started working with the multiprocessing Python library. I would like to make many API calls (get) using requests. I have a Pandas dataframe in which each row has the arguments I will be using to process the requests.get.
Here is an example of the dataframe I want to starmap to.
import pandas as pd
d = {
"companyId": ['1000','1005'],
"headers": [{'Authorization': 'Bearer token1'},{'Authorization': 'Bearer token1'}],
"employeeId": ['1500','1500'],
"date": ['2022-01-01','2022-01-02']
}
df = pd.DataFrame(d)
df.head()
Code to make request:
import multiprocessing as mp
import requests
def get_data(df: pd.DataFrame):
    query: dict = {
        'companyId': df['companyId'].astype(str),
        'driverId': df['employeeId'].astype(str),
        'day': df['date'].astype(str)
    }
    resp = requests.get(url=df['url'], headers=df['headers'], params=query)
    return resp
if __name__ == "__main__":
    with mp.Pool(mp.cpu_count()) as p:
        res = list(p.starmap(get_data, zip(df.itertuples())))
        print(res)
        p.close()
        p.join()
However, I receive some errors I am trying to understand. Ultimately, I want to map the api function to each row of my pandas dataframe in a parallel fashion. I would prefer to just use the multiprocessing library but do not necessarily need to use Pandas here if there is a simpler and more native solution.
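One way to sidestep these errors is to hand each worker a plain dict per row instead of a DataFrame. The following is a minimal sketch, not a definitive fix. It assumes each row also carries a `url` key (the example DataFrame has no such column), it uses a thread pool because the work is network-bound, and it takes the HTTP function as a parameter so that `requests.get` can be passed in. The names `build_call` and `fetch_all` are made up for illustration.

```python
from multiprocessing.pool import ThreadPool


def build_call(row):
    """Turn one row (a plain dict) into keyword arguments for an HTTP GET."""
    return {
        "url": row["url"],  # assumption: every row provides its own URL
        "headers": row["headers"],
        "params": {
            "companyId": str(row["companyId"]),
            "driverId": str(row["employeeId"]),
            "day": str(row["date"]),
        },
    }


def fetch_all(rows, get, workers=8):
    """Call get(**build_call(row)) for every row; results keep the row order."""
    with ThreadPool(workers) as pool:
        return pool.map(lambda row: get(**build_call(row)), rows)
```

With the DataFrame from the question, the call would look like `fetch_all(df.to_dict('records'), requests.get)`, provided a `url` column has been added first.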
I often use an API call to pull some customer data.
However, whenever I try to pull more than 20 customer ids, the API stops working.
When this happens, I run multiple API calls, transform each JSON output into a df and append all the dataframes together.
That is fine when I need just a couple of API calls, but becomes inefficient when I have several customer ids to pull, as sometimes I have to run 5/10 separate API calls.
I thought a loop could help here. Given I have little experience with Python, I had a look at other questions on looping APIs, but I couldn't find a solution.
Below is the code I use. How can I make a single API call that loops through several customer ids (keeping in mind that there's a limit of circa 20 ids per call) and returns a single dataframe?
Thanks!
#list of customer ids
customer_id = [
"1004rca402itas8470der874",
"1004rca402itas8470der875",
"1004rca402itas8470der876",
"1004rca402itas8470der877",
"1004rca402itas8470der878",
"1004rca402itas8470der879"
]
#API call
payload = {'customer':",".join(customer_id), 'countries':'DE', 'granularity':'daily', 'start_date':'2021-01-01', 'end_date':'2022-03-31'}
response = requests.get('https://api.xxxxxxjxjx.com/t3/customers/xxxxxxxxxxxx?auth_token=xxxxxxxxxxxx', params=payload)
response.status_code
#convert to dataframe
api = response.json()
df = pd.DataFrame(api)
df['sales'] = df['domestic_sales'] + df['international_sales']
df = df[['customer_id','country','date','sales']]
df.head()
Here is the general idea:
# List of dataframes
dfs = []
# List of lists of 20 customer ids each
ids = [customer_id[i:i+20] for i in range(0, len(customer_id), 20)]
# Iterate on 'ids' to call api and store new df in list called 'dfs'
for chunk in ids:
    payload = {
        "customer": ",".join(chunk),
        "countries": "DE",
        "granularity": "daily",
        "start_date": "2021-01-01",
        "end_date": "2022-03-31",
    }
    response = requests.get(
        "https://api.xxxxxxjxjx.com/t3/customers/xxxxxxxxxxxx?auth_token=xxxxxxxxxxxx",
        params=payload,
    )
    dfs.append(pd.DataFrame(response.json()))
# Concat all dataframes in one call
df = pd.concat(dfs)
# Additional work
df['sales'] = df['domestic_sales'] + df['international_sales']
df = df[['customer_id','country','date','sales']]
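The chunk-then-combine idea can also be separated from the HTTP call, which makes it easier to try out without hitting the API. This is a rough sketch under two assumptions: the API returns a list of records per call, and `fetch_chunk` is a hypothetical function you supply that takes the comma-joined ids and returns that list.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_chunks(customer_ids, fetch_chunk, size=20):
    """Call fetch_chunk once per chunk and merge the returned record lists."""
    records = []
    for chunk in chunked(customer_ids, size):
        # fetch_chunk is expected to return a list of dicts, one per record
        records.extend(fetch_chunk(",".join(chunk)))
    return records
```

The merged list can then be turned into a single dataframe with `pd.DataFrame(fetch_in_chunks(customer_id, fetch_chunk))`.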
I need to do a python script to
Read a csv file with the columns (person_id, name, flag). The file has 3000 rows.
Based on the person_id from the csv file, I need to call a URL passing the person_id to do a GET
http://api.myendpoint.intranet/get-data/1234
The URL returns information about the person_id, as in the response example below. I need to extract all the rents objects and save them to my csv. My output needs to look like this:
import pandas as pd
import requests
ids = pd.read_csv(f"{path}/data.csv", delimiter=';')
rows = []
for _, row in ids.iterrows():
    response = requests.get(f'endpoint/{row["person_id"]}')
    data = response.json()
    # one output row per rent object in the response
    for rent in data['rents']:
        rows.append([row['person_id'], rent['carId'], rent['price'], rent['rentStatus']])
person_rents = pd.DataFrame(rows, columns=['person_id', 'carId', 'price', 'rentStatus'])
pd.read_csv(f"{path}/data.csv", delimiter=';' )
person_id;name;flag;cardId;price;rentStatus
1000;Joseph;1;6638;1000;active
1000;Joseph;1;5566;2000;active
Response example
{
"active": false,
"ctodx": false,
"rents": [{
"carId": 6638,
"price": 1000,
"rentStatus": "active"
}, {
"carId": 5566,
"price": 2000,
"rentStatus": "active"
}
],
"responseCode": "OK",
"status": [{
"request": 345,
"requestStatus": "F"
}, {
"requestId": 678,
"requestStatus": "P"
}
],
"transaction": false
}
After saving the additional data from the response to the csv, I need to get data from another endpoint using the carId in the URL. The mileage result must be saved in the same csv.
http://api.myendpoint.intranet/get-mileage/6638
http://api.myendpoint.intranet/get-mileage/5566
The return for each call will be like this
{"mileage":1000.0000}
{"mileage":550.0000}
The final output must be
person_id;name;flag;cardId;price;rentStatus;mileage
1000;Joseph;1;6638;1000;active;1000.0000
1000;Joseph;1;5566;2000;active;550.0000
Can someone help me with this script?
It could be with pandas or any Python 3 library.
Code Explanation
Create dataframe, df, with pd.read_csv.
It is expected that all of the values in 'person_id', are unique.
Use .apply on 'person_id', to call prepare_data.
prepare_data expects 'person_id' to be a str or int, as indicated by the type annotation, Union[int, str]
Call the API, which will return a dict, to the prepare_data function.
Convert the 'rents' key, of the dict, into a dataframe, with pd.json_normalize.
Use .apply on 'carId', to call the API, and extract the 'mileage', which is added to dataframe data, as a column.
Add 'person_id' to data, which can be used to merge df with s.
Convert pd.Series, s to a dataframe, with pd.concat, and then merge df and s, on person_id.
Save to a csv with pd.to_csv in the desired form.
Potential Issues
If there's an issue, it's most likely to occur in the call_api function.
As long as call_api returns a dict, like the response shown in the question, the remainder of the code will work correctly to produce the desired output.
import pandas as pd
import requests
import json
from typing import Union
def call_api(url: str) -> dict:
    r = requests.get(url)
    return r.json()
def prepare_data(uid: Union[int, str]) -> pd.DataFrame:
    d_url = f'http://api.myendpoint.intranet/get-data/{uid}'
    m_url = 'http://api.myendpoint.intranet/get-mileage/'
    # get the rent data from the api call
    rents = call_api(d_url)['rents']
    # normalize rents into a dataframe
    data = pd.json_normalize(rents)
    # get the mileage data from the api call and add it to data as a column
    data['mileage'] = data.carId.apply(lambda cid: call_api(f'{m_url}{cid}')['mileage'])
    # add person_id as a column to data, which will be used to merge data to df
    data['person_id'] = uid
    return data
# read data from file
df = pd.read_csv('file.csv', sep=';')
# call prepare_data
s = df.person_id.apply(prepare_data)
# s is a Series of DataFrames, which can be combined with pd.concat
s = pd.concat([v for v in s])
# join df with s, on person_id
df = df.merge(s, on='person_id')
# save to csv
df.to_csv('output.csv', sep=';', index=False)
If there are any errors when running this code:
Leave a comment to let me know.
Edit your question and paste the entire traceback, as text, into a code block.
Example
# given the following start dataframe
person_id name flag
0 1000 Joseph 1
1 400 Sam 1
# resulting dataframe using the same data for both id 1000 and 400
person_id name flag carId price rentStatus mileage
0 1000 Joseph 1 6638 1000 active 1000.0
1 1000 Joseph 1 5566 2000 active 1000.0
2 400 Sam 1 6638 1000 active 1000.0
3 400 Sam 1 5566 2000 active 1000.0
There are many different ways to implement this. One of them would be, like you started in your comment:
read the CSV file with pandas
for each line take the person_id and build a call
the delivered JSON response can then be taken from the rents
the carId is then extracted for each individual rental
finally this is collected in a row_list
the row_list is then converted back to csv via pandas
A very simple solution without any error handling could look something like this:
from types import SimpleNamespace
import pandas as pd
import requests
import json
path = '/some/path/'
df = pd.read_csv(f'{path}/data.csv', delimiter=';')
rows_list = []
for _, row in df.iterrows():
    rentCall = f'http://api.myendpoint.intranet/get-data/{row.person_id}'
    print(rentCall)
    response = requests.get(rentCall)
    r = json.loads(response.text, object_hook=lambda d: SimpleNamespace(**d))
    for rent in r.rents:
        mileageCall = f'http://api.myendpoint.intranet/get-mileage/{rent.carId}'
        print(mileageCall)
        response2 = requests.get(mileageCall)
        m = json.loads(response2.text, object_hook=lambda d: SimpleNamespace(**d))
        # take the status from the rent itself, not from the top-level "active" flag
        state = rent.rentStatus
        rows_list.append((row['person_id'], row['name'], row['flag'], rent.carId, rent.price, state, m.mileage))
df = pd.DataFrame(rows_list, columns=('person_id', 'name', 'flag', 'carId', 'price', 'rentStatus', 'mileage'))
print(df.to_csv(index=False, sep=';'))
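The same flow can also be sketched with only the standard library, which makes the row-building logic easy to check against a fake API. This is an illustrative sketch rather than a drop-in replacement: `rent_rows` and `to_csv_text` are made-up names, and `call_api` is assumed to be any function that takes a URL and returns the decoded JSON as a dict.

```python
import csv
import io


def rent_rows(person, call_api, base="http://api.myendpoint.intranet"):
    """Yield one output row per rent of `person`, with the car's mileage added."""
    data = call_api(f"{base}/get-data/{person['person_id']}")
    for rent in data["rents"]:
        mileage = call_api(f"{base}/get-mileage/{rent['carId']}")["mileage"]
        yield {
            **person,
            "carId": rent["carId"],
            "price": rent["price"],
            "rentStatus": rent["rentStatus"],
            "mileage": mileage,
        }


def to_csv_text(rows, fieldnames):
    """Serialize the rows to semicolon-separated CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter=";",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The `person` dicts could come from `csv.DictReader(f, delimiter=';')`, and `call_api` could be a small wrapper around `requests.get(url).json()`.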
Speeding up with multiprocessing
You mention that you have 3000 rows, which means that you'll have to make a lot of API calls. Depending on the connection, every one of these calls might take a while. As a result, performing this in a sequential way might be too slow. The majority of the time, your program will just be waiting on a response from the server without doing anything else.
We can improve this performance by using multiprocessing.
I use all the code from Trenton's answer, but I replace the following sequential call:
# call prepare_data
s = df.person_id.apply(prepare_data)
With a parallel alternative:
from multiprocessing import Pool
n_processes = 20  # Experiment with this to see what works well
with Pool(n_processes) as p:
    s = p.map(prepare_data, df.person_id)
Alternatively, a threadpool might be faster, but you'll have to test that by replacing the import with
from multiprocessing.pool import ThreadPool as Pool.
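To get a feel for the thread-pool variant without touching the real API, here is a small sketch in which `slow_lookup` is a made-up stand-in for `prepare_data` that only sleeps to mimic waiting on the server. The worker count and the sleep time are arbitrary assumptions.

```python
import time
from multiprocessing.pool import ThreadPool as Pool


def slow_lookup(person_id):
    """Stand-in for prepare_data: sleeps briefly to mimic waiting on the server."""
    time.sleep(0.05)
    return {"person_id": person_id}


def run_all(person_ids, n_workers=20):
    """Run slow_lookup for every id concurrently; map() preserves input order."""
    with Pool(n_workers) as p:
        return p.map(slow_lookup, person_ids)
```

Because the threads spend almost all of their time waiting, 40 lookups with 20 workers should take roughly two sleep intervals rather than forty, though the real gain depends on the server and the connection.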
I'm looking to implement multithreading or multiprocessing of request objects.
My code is below:
def validate(testurl):
    json_d = {"task_id": "user_uid","data": {"document1":testurl}}
    response = requests.post("https://example.net.com/document",headers=headers,json=json_d)
    my_data1 = response.text
    with open("testurl.txt","a+") as file:
        file.write(my_data1)
    my_data = json.loads(my_data1)
    result = {'bool_value':my_data['data']}
    return result
Is there a way to multithread or multiprocess a Pandas apply() function for more than 5000 urls? For example:
df['res'] = df['testurl'].apply(validate)
Should I be using this below?
from joblib import Parallel, delayed
You can use swifter or dask for this; see https://gdcoder.com/speed-up-pandas-apply-function-using-dask-or-swifter-tutorial/ for a tutorial.
import swifter  # registers the .swifter accessor on pandas objects
df['res'] = df['testurl'].swifter.apply(validate)
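Since each call mostly waits on the network, a plain thread pool from the standard library is another option worth testing. The sketch below is only an illustration: `threaded_apply` is a made-up helper name and the default worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def threaded_apply(values, func, max_workers=16):
    """Apply func to every value using a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(func, values))
```

With the code from the question this would be used as `df['res'] = threaded_apply(df['testurl'], validate)`. Note that `validate` appends to the same file from every call, so with several threads the written output may interleave.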
I have output from a REST call that I've converted to JSON.
It's a highly nested collection of dicts and lists, but I'm eventually able to convert it to dataframe as follows:
import pandas as pd
from requests import get
url = 'http://stats.oecd.org/SDMX-JSON/data/MEI_FIN/IR3TIB.GBR+USA.M/all'
params = {
'startTime' : '2008-06',
'dimensionAtObservation' : 'TimeDimension'
}
r = get(url, params = params)
x = r.json()
d = x['dataSets'][0]['series']
a = pd.DataFrame(d['0:0:0']['observations'])
b = pd.DataFrame(d['0:1:0']['observations'])
This works absent some manipulation to make it easier to work with, and as there are multiple time series, I can do a version of the same for each, but it goes without saying it's kind of clunky.
Is there a better/cleaner way to do this?
The pandasdmx library makes this super-simple:
import pandasdmx as sdmx
df = sdmx.Request('OECD').data(
resource_id='MEI_FIN',
key='IR3TIB.GBR+USA.M',
params={'startTime': '2008-06', 'dimensionAtObservation': 'TimeDimension'},
).write()
Absent any responses, here's the solution I came up with. I added a list comprehension to deal with getting each series into a dataframe, and then a transpose as this source resulted in the series being aligned across rows instead of down columns.
import pandas as pd
from requests import get
url = 'http://stats.oecd.org/SDMX-JSON/data/MEI_FIN/IR3TIB.GBR+USA.M/all'
params = {
'startTime' : '2008-06',
'dimensionAtObservation' : 'TimeDimension'
}
r = get(url, params = params)
x = r.json()
d = x['dataSets'][0]['series']
df = [pd.DataFrame(d[i]['observations']).loc[0] for i in d]
df = pd.DataFrame(df).T
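The extraction step can also be written as a plain dict comprehension that keeps the series keys as labels. This is a sketch under the assumption, taken from how the code above reads the response, that each series entry has an `observations` mapping from a time index to a list whose first element is the value. `first_values` is a made-up name, and the sample numbers in the usage note are invented.

```python
def first_values(series):
    """Map each series key to {observation index: first value of that observation}."""
    return {
        key: {t: obs[0] for t, obs in entry["observations"].items()}
        for key, entry in series.items()
    }
```

For example, `first_values({"0:0:0": {"observations": {"0": [5.9, 0], "1": [5.8, 0]}}})` gives `{"0:0:0": {"0": 5.9, "1": 5.8}}`, and passing the result to `pd.DataFrame` yields one column per series without the extra transpose.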
I need to get a lot of data from Elasticsearch (es), so I'm using the scan helper, which is a wrapper around the native es scroll command.
As a result I get the following generator object: <generator object scan at 0x000001BF5A25E518>. Furthermore, I'd like to insert all the data into a Pandas DataFrame so I can easily process it.
Code goes as follows:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan as escan
import pandas as pd
es = Elasticsearch(dpl_server, verify_certs=False)
body = {
"size": 1000,
"query": {
"match_all": {}
}
}
response = escan(client=es,
index="index-*",
query=body, request_timeout=30, size=1000)
print(response)
#<generator object scan at 0x000001BF5A25E518>
What I want to do is put all the results into a Pandas DataFrame. If I print each element in the generator as follows:
for res in response:
print(res['_source'])
# { .... }
# { .... }
# { .... }
I will get a lot of dictionaries. A naive solution of mine so far is to add them 1 by 1 like so:
df = None
for res in response:
    if (df is None):
        df = pd.DataFrame([res['_source']])
    else:
        df = pd.concat([df, pd.DataFrame([res['_source']])], sort=True)
I wish to know if there's a better way in doing so (first, in terms of speed, second, in terms of clean code). For instance, would it be better to accumulate all the results from the generator into a list and then build a complete DataFrame ?
You can use pandas' json_normalize.
from collections import deque
from pandas import json_normalize
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan as escan
import pandas as pd
es = Elasticsearch(dpl_server, verify_certs=False)
body = {
"size": 1000,
"query": {
"match_all": {}
}
}
response = escan(client=es,
index="index",
query=body, request_timeout=30, size=1000)
# Initialize a double ended queue
output_all = deque()
# Extend deque with iterator
output_all.extend(response)
# Convert deque to DataFrame
output_df = json_normalize(output_all)
Here you can find more info on the double ended queue.
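If only the `_source` documents are needed, accumulating them in a list and building the DataFrame once is a simple alternative to concatenating per hit. The sketch below is a minimal illustration; `collect_sources` is a made-up name and the `keep_id` option is an added convenience, not something from the question.

```python
def collect_sources(hits, keep_id=False):
    """Return the '_source' dict of every hit as a list, consuming the iterator once."""
    if keep_id:
        # put the document id next to the source fields
        return [{"_id": hit["_id"], **hit["_source"]} for hit in hits]
    return [hit["_source"] for hit in hits]
```

With the generator from the question, `df = pd.DataFrame(collect_sources(response))` builds the frame in a single step.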