How can I merge these dataframes with pandas? - python

I have to make several calls to an API and merge the results from each one into a single dataframe. They have the same keys, and merging the second set of results into the first works, but when I try to merge the third one, nothing happens. I'm probably also not doing it the most efficient way.
I originally tried a for-loop to do this, but for experimental purposes I am trying to just do it manually (changing the offset parameter by 5000 each time). The call limit is 5000, so I can only fetch that many records at a time. I know some of my variable names are probably inaccurate descriptions of what they represent ("JSONString", etc.), but bear with me.
I won't include the url in my calls below, but it is accurate.
#First call, gets the necessary values out of the API and successfully turns them into a dataframe
params = urllib.parse.urlencode({
    # Request parameters
    'limit': 5000,
})
categoriesJSON = s.get(url, headers=headers)
categoriesJSONString = categoriesJSON.json()
categoriesDf = pandas.DataFrame(categoriesJSONString['value'])
#Second call: gets the necessary values out of the API, turns them into a dataframe, and successfully appends that dataframe to the original dataframe
params = urllib.parse.urlencode({
    # Request parameters
    'limit': 5000,
    'offset': 5000
})
categoriesJSON = s.get(url, headers=headers)
categoriesJSONString = categoriesJSON.json()
newCategoriesDf = pandas.DataFrame(categoriesJSONString['value'])
categoriesDf.append(newCategoriesDf, ignore_index = True)
#Third call: gets the necessary values out of the API, turns them into a dataframe, and unsuccessfully appends that dataframe to the original dataframe
params = urllib.parse.urlencode({
    # Request parameters
    'limit': 5000,
    'offset': 10000
})
categoriesJSON = s.get(url, headers=headers)
categoriesJSONString = categoriesJSON.json()
newCategoriesDf = pandas.DataFrame(categoriesJSONString['value'])
categoriesDf.append(newCategoriesDf, ignore_index = True)
After the second call, my dataframe is 10000 rows long, but after the third call, my dataframe is still 10000 rows long. What is preventing it from being 15000 rows long? I know that I have more than 10000 rows of data to get.

Append returns a new dataframe; the existing one is not updated.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html
Just reassign the result to the target, like this:
download = pd.DataFrame(data = 1, index = [1], columns = ['Data'])
x = download
x = x.append(download, ignore_index=True)
x = x.append(download, ignore_index=True)
x = x.append(download, ignore_index=True)

df.append returns a new, appended DataFrame; you need to change the last line to:
categoriesDf = categoriesDf.append(newCategoriesDf, ignore_index = True)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html
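Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer versions the same fix is written with concat. A minimal sketch of the equivalent line:
# equivalent on newer pandas versions, where DataFrame.append no longer exists
categoriesDf = pandas.concat([categoriesDf, newCategoriesDf], ignore_index=True)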

Related

Make multiple API calls - Python

I often use an API call to pull some customer data.
However, whenever I try to pull more than 20 customer ids, the API stops working.
When this happens, I run multiple API calls, transform each JSON output into a df and append all the dataframes together.
That is fine when I need just a couple of API calls, but it becomes inefficient when I have many customer ids to pull, as sometimes I have to run 5-10 separate API calls.
I thought a loop could help here. Given that I have little experience with Python, I had a look at other questions on looping over APIs, but I couldn't find a solution.
Below is the code I use. How can I loop through several customer ids (keeping in mind that there's a limit of circa 20 ids per call) and return a single dataframe?
Thanks!
#list of customer ids
customer_id = [
    "1004rca402itas8470der874",
    "1004rca402itas8470der875",
    "1004rca402itas8470der876",
    "1004rca402itas8470der877",
    "1004rca402itas8470der878",
    "1004rca402itas8470der879"
]
#API call
payload = {'customer':",".join(customer_id), 'countries':'DE', 'granularity':'daily', 'start_date':'2021-01-01', 'end_date':'2022-03-31'}
response = requests.get('https://api.xxxxxxjxjx.com/t3/customers/xxxxxxxxxxxx?auth_token=xxxxxxxxxxxx', params=payload)
response.status_code
#convert to dataframe
api = response.json()
df = pd.DataFrame(api)
df['sales'] = df['domestic_sales'] + df['international_sales']
df = df[['customer_id','country','date','sales']]
df.head()
Here is the general idea:
# List of dataframes
dfs = []
# List of lists of 20 customer ids each
ids = [customer_id[i:i+20] for i in range(0, len(customer_id), 20)]
# Iterate on 'ids' to call api and store new df in list called 'dfs'
for chunk in ids:
    payload = {
        "customer": ",".join(chunk),
        "countries": "DE",
        "granularity": "daily",
        "start_date": "2021-01-01",
        "end_date": "2022-03-31",
    }
    response = requests.get(
        "https://api.xxxxxxjxjx.com/t3/customers/xxxxxxxxxxxx?auth_token=xxxxxxxxxxxx",
        params=payload,
    )
    dfs.append(pd.DataFrame(response.json()))
# Concat all dataframes
df = dfs[0]
for other_df in dfs[1:]:
    df = pd.concat([df, other_df])
# Additional work
df['sales'] = df['domestic_sales'] + df['international_sales']
df = df[['customer_id','country','date','sales']]
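As a side note, the pairwise concat loop above can be collapsed into a single call, which avoids repeatedly copying the growing frame; ignore_index=True also gives the combined frame a clean 0..n row index:
# single concatenation of all collected frames
df = pd.concat(dfs, ignore_index=True)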

KeyError on the 2nd loop on pandas df

On the first iteration of the loop everything works fine, but on the second iteration it causes a KeyError on the column names of my df. I don't understand why this is happening, since every iteration triggers the same set of functions.
Part of the code that creates the error
def market_data(crypto, ts_float):
    # request to kraken for pricing data
    r = requests.get('https://futures.kraken.com/api/charts/v1/trade/' + crypto + '/15m?from=' + ts_float)
    # set JSON response to data
    data = r.json()
    # normalize data into dataframe
    df = pd.json_normalize(data, record_path=['candles'])
    # convert unix time back into readable time
    df['time'] = pd.to_datetime(df['time'], unit='ms')
    # set time as index
    df = df.set_index('time')
    # convert into integer for calculations
    df['open'] = df['open'].astype(float).astype(int)
    df['high'] = df['high'].astype(float).astype(int)
    df['low'] = df['low'].astype(float).astype(int)
    df['close'] = df['close'].astype(float).astype(int)
    df['volume'] = df['volume'].astype(float).astype(int)
    return df
crypto_pairs = [
    {"crypto": "pf_ethusd", "size": 0.05},
    {"crypto": "pf_btcusd", "size": 0.0003},
    {"crypto": "pf_avaxusd", "size": 3},
    {"crypto": "pf_dotusd", "size": 10},
    {"crypto": "pf_ltcusd", "size": 1.5}
]
# getting the timestamp to get the data from
ts = (datetime.now() - timedelta(hours = 48)).timestamp()
ts_float = str(int(ts))
for cryptos in enumerate(crypto_pairs):
    data = market_data(cryptos[1]['crypto'], ts_float)
KeyError: time
I have a set of functions inside my enumerate loop, and market_data, which is the first one, generates the mentioned error on the 2nd iteration. The errors always happen when accessing column names such as "time" and "open".
I don't have much experience with requests, but this worked for me. Try the following: in the market_data function, after building the dataframe, add a check; if len(df) <= 0, return early.
Where the dataframe turns out to be empty, the request still returns 200, i.e. everything looks fine. I printed out crypto: the empty dataframe is produced for 'pf_btcusd'. I tried changing the order of the pairs, and the empty dataframe is again 'pf_btcusd'. Something is wrong with that symbol.
def market_data(crypto, ts_float):
    # request to kraken for pricing data
    r = requests.get('https://futures.kraken.com/api/charts/v1/trade/' + crypto + '/15m?from=' + ts_float)
    # print(r.status_code)
    # set JSON response to data
    data = r.json()
    # normalize data into dataframe
    df = pd.json_normalize(data, record_path=['candles'])
    if len(df) <= 0:
        print(r.status_code)
        print(crypto)
        return
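On the calling side, one way to use that guard (a sketch, based on the loop from the question) is to skip any pair for which market_data returned nothing:
for cryptos in enumerate(crypto_pairs):
    data = market_data(cryptos[1]['crypto'], ts_float)
    # market_data returns None on an empty response, so skip that pair
    if data is None or data.empty:
        continue
    # ... rest of the per-pair processing ...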

how can I make two values in the same column into separate columns in python

With these commands:
import requests
import pandas as pd
url = "https://api.binance.com/api/v3/depth?symbol=BNBUSDT&limit=100"
payload = {}
headers = {
    'Content-Type': 'application/json'
}
response = requests.request("GET", url, headers=headers, data=payload).json()
depth = pd.DataFrame(response, columns=["bids","asks"])
print(depth)
outputs:
                           bids                           asks
0   [382.40000000, 86.84800000]   [382.50000000, 196.24600000]
1  [382.30000000, 174.26400000]   [382.60000000, 116.10300000]
and first I need to change the table to this format:
      bidsValue  bidsQuantity     asksValue  asksQuantity  rangeTotalbidsQuantity  rangeTotalasksQuantity
0  382.40000000   86.84800000  382.50000000  196.24600000
1  382.30000000  174.26400000  382.60000000  116.10300000
and then convert the column values to float so that I can calculate the total quantity over a specific price range (e.g. from bidsValue 400.00 at row 0 down to bidsValue 380.00 at row "?"; "?" because I don't know the row number. First I must find the row number of bidsValue 380.00, for example row 59, then sum bidsQuantity from row 0 to row 59), and finally write the results into the rangeTotalbidsQuantity column.
I've been trying for hours; I tried many methods from pandas.DataFrame, but I couldn't do it.
Thank you!
You can use this solution.
In your case this would look something like this:
depth['bidsValue'], depth['bidsQuantity'] = zip(*list(depth['bids'].values))
depth['asksValue'], depth['asksQuantity'] = zip(*list(depth['asks'].values))
depth = depth.drop(columns=['bids', 'asks'])
For the second part look at this tutorial.
You can use for example:
pd.DataFrame(df['bids'].tolist())
Then concatenate that to the original dataframe.
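A sketch of what that could look like, using the column names from the desired output above:
# split each [price, quantity] list into two columns, then join them back onto depth
bids_split = pd.DataFrame(depth['bids'].tolist(), columns=['bidsValue', 'bidsQuantity'])
asks_split = pd.DataFrame(depth['asks'].tolist(), columns=['asksValue', 'asksQuantity'])
depth = pd.concat([depth.drop(columns=['bids', 'asks']), bids_split, asks_split], axis=1)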
You can apply() pandas.Series and assign the result to new columns
depth[ ['bidsValue', 'bidsQuantity'] ] = depth['bids'].apply(pd.Series)
depth[ ['asksValue', 'asksQuantity'] ] = depth['asks'].apply(pd.Series)
and later you have to remove the original columns bids, asks
depth = depth.drop(columns=['bids', 'asks'])
Full working code with other changes
import requests
import pandas as pd
url = "https://api.binance.com/api/v3/depth"
payload = {
    'symbol': 'BNBUSDT',
    'limit': 100,
}
response = requests.get(url, params=payload)
#print(response.status_code)
data = response.json()
depth = pd.DataFrame(data, columns=["bids","asks"])
#print(depth)
depth[ ['bidsValue', 'bidsQuantity'] ] = depth['bids'].apply(pd.Series)
depth[ ['asksValue', 'asksQuantity'] ] = depth['asks'].apply(pd.Series)
depth = depth.drop(columns=['bids', 'asks'])
print(depth.to_string())
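For the range-total part of the question, a minimal sketch, reusing the split columns from the code above; the 380.00/400.00 bounds are just the example values from the question:
# convert the string values to float so they can be compared and summed
for col in ['bidsValue', 'bidsQuantity', 'asksValue', 'asksQuantity']:
    depth[col] = depth[col].astype(float)
# sum the bid quantities for all rows whose price falls inside the example range
mask = depth['bidsValue'].between(380.00, 400.00)
range_total_bids_quantity = depth.loc[mask, 'bidsQuantity'].sum()
print(range_total_bids_quantity)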

How to update a pandas dataframe, from multiple API calls

I need to write a Python script to:
Read a csv file with the columns (person_id, name, flag). The file has 3000 rows.
Based on the person_id from the csv file, I need to call a URL passing the person_id to do a GET
http://api.myendpoint.intranet/get-data/1234
The URL will return some information about the person_id, like the example below. I need to get all the rents objects and save them to my csv. My output needs to be like this:
import pandas as pd
import requests
ids = pd.read_csv(f"{path}/data.csv", delimiter=';')
person_rents = df = pd.DataFrame([], columns=list('person_id','carId','price','rentStatus'))
for id in ids:
    response = request.get(f'endpoint/{id["person_id"]}')
    json = response.json()
    person_rents.append([person_id, rent['carId'], rent['price'], rent['rentStatus']])
pd.read_csv(f"{path}/data.csv", delimiter=';' )
person_id;name;flag;cardId;price;rentStatus
1000;Joseph;1;6638;1000;active
1000;Joseph;1;5566;2000;active
Response example
{
    "active": false,
    "ctodx": false,
    "rents": [{
        "carId": 6638,
        "price": 1000,
        "rentStatus": "active"
    }, {
        "carId": 5566,
        "price": 2000,
        "rentStatus": "active"
    }],
    "responseCode": "OK",
    "status": [{
        "request": 345,
        "requestStatus": "F"
    }, {
        "requestId": 678,
        "requestStatus": "P"
    }],
    "transaction": false
}
After saving the additional data from the response to the csv, I need to get data from another endpoint, using the carId in the URL. The mileage result must be saved in the same csv.
http://api.myendpoint.intranet/get-mileage/6638
http://api.myendpoint.intranet/get-mileage/5566
The return for each call will be like this
{"mileage":1000.0000}
{"mileage":550.0000}
The final output must be
person_id;name;flag;cardId;price;rentStatus;mileage
1000;Joseph;1;6638;1000;active;1000.0000
1000;Joseph;1;5566;2000;active;550.0000
Can someone help me with this script?
It could be with pandas or any Python 3 lib.
Code Explanation
Create dataframe, df, with pd.read_csv.
It is expected that all of the values in 'person_id', are unique.
Use .apply on 'person_id', to call prepare_data.
prepare_data expects 'person_id' to be a str or int, as indicated by the type annotation, Union[int, str]
Call the API, which will return a dict, to the prepare_data function.
Convert the 'rents' key, of the dict, into a dataframe, with pd.json_normalize.
Use .apply on 'carId', to call the API, and extract the 'mileage', which is added to dataframe data, as a column.
Add 'person_id' to data, which can be used to merge df with s.
Convert pd.Series, s to a dataframe, with pd.concat, and then merge df and s, on person_id.
Save to a csv with pd.to_csv in the desired form.
Potential Issues
If there's an issue, it's most likely to occur in the call_api function.
As long as call_api returns a dict, like the response shown in the question, the remainder of the code will work correctly to produce the desired output.
import pandas as pd
import requests
import json
from typing import Union

def call_api(url: str) -> dict:
    r = requests.get(url)
    return r.json()

def prepare_data(uid: Union[int, str]) -> pd.DataFrame:
    d_url = f'http://api.myendpoint.intranet/get-data/{uid}'
    m_url = 'http://api.myendpoint.intranet/get-mileage/'
    # get the rent data from the api call
    rents = call_api(d_url)['rents']
    # normalize rents into a dataframe
    data = pd.json_normalize(rents)
    # get the mileage data from the api call and add it to data as a column
    data['mileage'] = data.carId.apply(lambda cid: call_api(f'{m_url}{cid}')['mileage'])
    # add person_id as a column to data, which will be used to merge data to df
    data['person_id'] = uid
    return data

# read data from file
df = pd.read_csv('file.csv', sep=';')
# call prepare_data
s = df.person_id.apply(prepare_data)
# s is a Series of DataFrames, which can be combined with pd.concat
s = pd.concat([v for v in s])
# join df with s, on person_id
df = df.merge(s, on='person_id')
# save to csv
df.to_csv('output.csv', sep=';', index=False)
If there are any errors when running this code:
Leave a comment to let me know.
Edit your question, and paste the entire traceback, as text, into a code block.
Example
# given the following start dataframe
   person_id    name  flag
0       1000  Joseph     1
1        400     Sam     1

# resulting dataframe using the same data for both id 1000 and 400
   person_id    name  flag  carId  price rentStatus  mileage
0       1000  Joseph     1   6638   1000     active   1000.0
1       1000  Joseph     1   5566   2000     active   1000.0
2        400     Sam     1   6638   1000     active   1000.0
3        400     Sam     1   5566   2000     active   1000.0
There are many different ways to implement this. One of them would be, like you started in your comment:
read the CSV file with pandas
for each line take the person_id and build a call
the rents can then be taken from the delivered JSON response
the carId is then extracted for each individual rental
finally this is collected in a row_list
the row_list is then converted back to csv via pandas
A very simple solution without any error handling could look something like this:
from types import SimpleNamespace
import pandas as pd
import requests
import json
path = '/some/path/'
df = pd.read_csv(f'{path}/data.csv', delimiter=';')
rows_list = []
for _, row in df.iterrows():
    rentCall = f'http://api.myendpoint.intranet/get-data/{row.person_id}'
    print(rentCall)
    response = requests.get(rentCall)
    r = json.loads(response.text, object_hook=lambda d: SimpleNamespace(**d))
    for rent in r.rents:
        mileageCall = f'http://api.myendpoint.intranet/get-mileage/{rent.carId}'
        print(mileageCall)
        response2 = requests.get(mileageCall)
        m = json.loads(response2.text, object_hook=lambda d: SimpleNamespace(**d))
        state = "active" if r.active else "inactive"
        rows_list.append((row['person_id'], row['name'], row['flag'], rent.carId, rent.price, state, m.mileage))
df = pd.DataFrame(rows_list, columns=('person_id', 'name', 'flag', 'carId', 'price', 'rentStatus', 'mileage'))
print(df.to_csv(index=False, sep=';'))
Speeding up with multiprocessing
You mention that you have 3000 rows, which means that you'll have to make a lot of API calls. Depending on the connection, every one of these calls might take a while. As a result, performing this in a sequential way might be too slow. The majority of the time, your program will just be waiting on a response from the server without doing anything else.
We can improve this performance by using multiprocessing.
I use all the code from Trenton's answer, but I replace the following sequential call:
# call prepare_data
s = df.person_id.apply(prepare_data)
With a parallel alternative:
from multiprocessing import Pool
n_processes = 20  # Experiment with this to see what works well
with Pool(n_processes) as p:
    s = p.map(prepare_data, df.person_id)
Alternatively, a threadpool might be faster, but you'll have to test that by replacing the import with
from multiprocessing.pool import ThreadPool as Pool.
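For reference, the thread-based variant could look like this (a sketch; n_threads is just an illustrative value). Since p.map returns a plain list of dataframes, the later pd.concat step works unchanged:
from multiprocessing.pool import ThreadPool

n_threads = 20  # Experiment with this to see what works well
with ThreadPool(n_threads) as p:
    s = p.map(prepare_data, df.person_id)
# s is now a list of DataFrames, so pd.concat([v for v in s]) still combines them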

Checking HTTP Status (Python)

Is there a way to check the HTTP status code in the code below, as I have not used the requests or urllib libraries, which would allow for this?
from pandas.io.excel import read_excel
import pandas as pd  # needed for pd.concat below
url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
#check the sheet number, spot: 9/9, short end 7/9
spot_curve = read_excel(url, sheetname=8) #Creates the dataframes
short_end_spot_curve = read_excel(url, sheetname=6)
# do some cleaning, keep NaN for now, as forward fill NaN is not recommended for yield curve
spot_curve.columns = spot_curve.loc['years:']
valid_index = spot_curve.index[4:]
spot_curve = spot_curve.loc[valid_index]
# remove all maturities within 5 years as those are duplicated in short-end file
col_mask = spot_curve.columns.values > 5
spot_curve = spot_curve.iloc[:, col_mask]
#Providing correct names
short_end_spot_curve.columns = short_end_spot_curve.loc['years:']
valid_index = short_end_spot_curve.index[4:]
short_end_spot_curve = short_end_spot_curve.loc[valid_index]
# merge these two, time index are identical
# ==============================================
combined_data = pd.concat([short_end_spot_curve, spot_curve], axis=1, join='outer')
# sort the maturity from short end to long end
combined_data.sort_index(axis=1, inplace=True)
def filter_func(group):
    return group.isnull().sum(axis=1) <= 50
combined_data = combined_data.groupby(level=0).filter(filter_func)
In pandas:
read_excel tries to use urllib2.urlopen (urllib.request.urlopen in py3x) to open the url and calls .read() on the response immediately, without storing the http request, like:
data = urlopen(url).read()
Even though you need only part of the excel, pandas will download the whole excel each time. So, I upvoted #jonnybazookatone.
It's better to store the excel locally first; then you can check the status code and the md5 of the file to verify data integrity, among other things.
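A minimal sketch of that approach, assuming the requests library is available (newer pandas versions also spell the sheet argument sheet_name rather than sheetname); it downloads the file once, checks the status code, and only then hands the bytes to pandas:
import io
import requests
import pandas as pd

url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
r = requests.get(url)
# check the HTTP status before parsing
if r.status_code == 200:
    spot_curve = pd.read_excel(io.BytesIO(r.content), sheet_name=8)
    short_end_spot_curve = pd.read_excel(io.BytesIO(r.content), sheet_name=6)
else:
    raise RuntimeError(f'Download failed with status {r.status_code}')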
