This script should be printing an order every second for a two-minute duration, but the csv file just has the same row repeated. Sample data from the csv is below.
import cbpro
import time
import pandas as pd
import os
import json

public_client = cbpro.PublicClient()
res = json.dumps(public_client.get_product_ticker(product_id='BTC-USD'))
csv_file = "cbpro-test-1.csv"
df = pd.DataFrame()
timeout = time.time() + 60*2
while True:
    converted = json.loads(res)
    df = df.append(pd.DataFrame.from_dict(pd.json_normalize(converted), orient='columns'))
    if time.time() > timeout:
        break
df.to_csv(csv_file, index=False, encoding='utf-8')
Here is some sample output of the csv:
trade_id,price,size,time,bid,ask,volume
127344793,32750.24,0.00113286,2021-01-29T06:18:58.637859Z,32750.24,32755.06,41795.68551358
127344793,32750.24,0.00113286,2021-01-29T06:18:58.637859Z,32750.24,32755.06,41795.68551358
127344793,32750.24,0.00113286,2021-01-29T06:18:58.637859Z,32750.24,32755.06,41795.68551358
127344793,32750.24,0.00113286,2021-01-29T06:18:58.637859Z,32750.24,32755.06,41795.68551358
Edit: I moved the public client and the res variable inside the loop and it works somewhat, but it skips a second; the data looks like this now:
127347670,32620.2,0.00307689,2021-01-29T06:33:50.16111Z,32610,32620.12,41966.5764529
127347670,32620.2,0.00307689,2021-01-29T06:33:50.16111Z,32610,32620.12,41966.5764529
127347671,32614.11,0.00146359,2021-01-29T06:33:52.491186Z,32610,32610.01,41966.5764529
127347671,32614.11,0.00146359,2021-01-29T06:33:52.491186Z,32610,32610.01,41966.5764529
It goes from 06:33:50 to 06:33:52, and the rest of the file follows the same format.
I tried with this while loop:
while True:
    public_client = cbpro.PublicClient()
    res = json.dumps(public_client.get_product_ticker(product_id='BTC-USD'))
    converted = json.loads(res)
    df = df.append(pd.DataFrame.from_dict(pd.json_normalize(converted), orient='columns'))
    if time.time() > timeout:
        break
You fetch only one quote, before you enter the loop, and then repeatedly process that same data. You never change res; you simply keep appending the same values to your DataFrame, one iteration after another. You need to call get_product_ticker inside the loop so you fetch repeatedly.
After OP update:
Yes, that's how to get the quotations rapidly. You can do it better if you move that first line above the loop: you don't need to re-create the client object on every iteration.
Several lines are identical because you're fetching real-time quotations: if nobody changes the current best bid or ask price, the quotation stays the same. If you want only the changes, use pandas' drop_duplicates method to remove the duplicate rows.
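Putting those pieces together, here is a minimal sketch of the loop (the one-second pause and the drop_duplicates call are additions of mine, not in the original script, and I skip the json.dumps/json.loads round trip since get_product_ticker already returns a dict):

import time
import cbpro
import pandas as pd

public_client = cbpro.PublicClient()   # create the client once, outside the loop
frames = []
timeout = time.time() + 60 * 2         # run for two minutes

while time.time() <= timeout:
    tick = public_client.get_product_ticker(product_id='BTC-USD')  # fetch a fresh quote each pass
    frames.append(pd.json_normalize(tick))
    time.sleep(1)                      # roughly one request per second

df = pd.concat(frames, ignore_index=True)
df = df.drop_duplicates()              # optional: keep only rows where the quote actually changed
df.to_csv("cbpro-test-1.csv", index=False, encoding='utf-8')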
This problem likely has a very easy fix, but unfortunately I'm new to this.
The problem is: My generated csv file includes data from only one URL, while I want them all.
I've made a list of many contract numbers and I'm trying to access them all and return their data into one csv file (a long list). The API's URL consists of a baseURL and a contract number plus some parameters, so my URLs look like this (showing 2 of 150)
https://api.nfz.gov.pl/app-umw-api/agreements/edc47a7d-a3b8-d354-79d5-a0518f8ba6d4?format=json&api-version=1.2&limit=25&page={}
https://api.nfz.gov.pl/app-umw-api/agreements/9a6d9313-c9cc-c0db-9c86-b7b4be0e11c1?format=json&api-version=1.2&limit=25&page={}
The publisher has imposed a limit of 25 records per page, therefore I've got some pagination going on here.
It seems like the program is making calls to each URL in turn, given that it printed the number of pages from each call. But the csv only has 4 rows instead of hundreds, and I'm wondering where I'm going wrong. I tried to fix it by removing the indent on the last 3 lines (no change) and other trial and error.
Another small question: the 4 rows are actually 2 rows, each duplicated. I think my code duplicates the first page of results somewhere, but I can't figure out where.
And another: how can I make the first column of the csv file show the 'contract' (from my list 'contracts') that the output relates to? I need some way of identifying which rows in the csv came from which contract, but the API keeps that info in a separate branch of the data 'tree' that I don't really know how to return efficiently.
import requests
import pandas as pd
import math
from contracts_list1 import contracts

baseurl = 'https://api.nfz.gov.pl/app-umw-api/agreements/'

for contract in contracts:
    api_url = ''.join([baseurl, contract])

    def main_request(api_url):
        r = requests.get(api_url)
        return r.json()

    def get_pages(response):
        return math.ceil(response['meta']['count'] / 25)

    p_number = main_request(api_url)

    all_data = []
    for page in range(0, get_pages(p_number)+1): # <-- increase page numbers here
        data = requests.get(api_url.format(page)).json()
        for a in data["data"]["plans"]:
            all_data.append({**a["attributes"]})

    df = pd.DataFrame(all_data)
    df.to_csv('file1.csv', encoding='utf-8-sig', index=False)
    print(get_pages(p_number))
Your accumulator all_data is inside the contracts loop, so every iteration resets it and overwrites the previous iteration's result. That's why you're only seeing the result of the last iteration instead of all of them.
Try putting your accumulator all_data = [] outside of your outer for loop:
import requests
import pandas as pd
import math
from contracts_list1 import contracts

baseurl = 'https://api.nfz.gov.pl/app-umw-api/agreements/'

all_data = []
for contract in contracts:
    api_url = ''.join([baseurl, contract])

    def main_request(api_url):
        r = requests.get(api_url)
        return r.json()

    def get_pages(response):
        return math.ceil(response['meta']['count'] / 25)

    p_number = main_request(api_url)

    for page in range(0, get_pages(p_number)+1): # <-- increase page numbers here
        data = requests.get(api_url.format(page)).json()
        for a in data["data"]["plans"]:
            all_data.append({**a["attributes"]})

    print(get_pages(p_number))

df = pd.DataFrame(all_data)
df.to_csv('file1.csv', encoding='utf-8-sig', index=False)
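As for your last sub-question (identifying which contract each row came from), one option, not part of the fix above, is to tag each record as you append it, assuming the attributes dicts don't already contain a 'contract' key:

    for a in data["data"]["plans"]:
        # prepend the contract identifier so every CSV row can be traced back to its source
        all_data.append({"contract": contract, **a["attributes"]})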
I have a script that I use to fire orders from a csv file to an exchange, using a for loop.
data = pd.read_csv('orderparameters.csv')
df = pd.DataFrame(data)
for i in range(len(df)):
    order = Client.new_order(...
                             ...)
    file = open('orderData.txt', 'a')
    original_stdout = sys.stdout
    with file as f:
        sys.stdout = f
        print(order)
    file.close()
    sys.stdout = original_stdout
I put the response from the exchange in a txt file like this...
I want to turn the multiple responses into 1 single dataframe. I would hope it would look something like...
(I did that manually).
I tried:
data = pd.read_csv('orderData.txt', header=None)
dfData = pd.DataFrame(data)
print(dfData)
but I got:
I have also tried
data = pd.read_csv('orderData.txt', header=None)
organised = data.apply(pd.Series)
print(organised)
but I got the same output.
I can print order['symbol'] within the loop etc.
I'm not certain whether I should be populating this dataframe within the loop, or by capturing and writing the response and processing it afterwards. Appreciate your advice.
It looks like you are getting JSON strings back; you could read the JSON objects into dictionaries and then create a dataframe from that. Perhaps try something like this (it no longer needs a file):
data = pd.read_csv('orderparameters.csv')
df = pd.DataFrame(data)
response_data = []
for i in range(len(df)):
    order_json = Client.new_order(...
                                  ...)
    response_data.append(eval(order_json))
response_dataframe = pd.DataFrame(response_data)
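One caveat about the eval call: if new_order really returns a JSON string, json.loads is the safer way to turn it into a dict (eval trips over JSON's true/false/null); if it already returns a dict, you can append it directly. A tiny illustration with a made-up response string:

import json

order_json = '{"symbol": "BTCUSDT", "orderId": 12345, "status": "FILLED"}'  # hypothetical response text
order_dict = json.loads(order_json)  # parse the JSON text into a plain dict
print(order_dict["symbol"])          # -> BTCUSDT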
If I understand your question correctly, you can simply do the following:
import pandas as pd
orders = pd.read_csv('orderparameters.csv')
responses = pd.DataFrame(Client.new_order(...) for _ in range(len(orders)))
I am trying to do a calculation and write the result to another txt file using a multiprocessing program. I am getting a count mismatch in the output txt file; every time I execute it, I get a different output count.
I am new to Python. Could someone please help?
import pandas as pd
import multiprocessing as mp

source = "\\share\usr\data.txt"
target = "\\share\usr\data_masked.txt"
Chunk = 10000

def process_calc(df):
    '''
    get source df do calc and return newdf
    ...
    '''
    return(newdf)

def calc_frame(df):
    output_df = process_calc(df)
    output_df.to_csv(target,index=None,sep='|',mode='a',header=False)

if __name__ == '__main__':
    reader= pd.read_table(source,sep='|',chunksize = chunk,encoding='ANSI')
    pool = mp.Pool(mp.cpu_count())
    jobs = []
    for each_df in reader:
        process = mp.Process(target=calc_frame,args=(each_df)
        jobs.append(process)
        process.start()
    for j in jobs:
        j.join()
You have several issues in your source as posted that would prevent it from even compiling, let alone running. I have attempted to correct those in an effort to also solve your main problem. But do check the code below thoroughly to make sure the corrections make sense.
First, the args argument to the Process constructor should be specified as a tuple. You have specified args=(each_df), but (each_df) is not a tuple, it is just a parenthesized expression; you need (each_df,) to make it a tuple (that statement is also missing a closing parenthesis).
In addition, you make no provision against multiple processes simultaneously appending to the same file, and you cannot be assured of the order in which the processes complete, so you have no real control over the order in which the dataframes end up in the csv file.
The solution is to use a processing pool with the imap method. The iterable to pass to this method is just the reader, which when iterated returns the next dataframe to process. The return value from imap is an iterable that when iterated will return the next return value from calc_frame in task-submission order, i.e. the same order that the dataframes were submitted. So as these new, modified dataframes are returned, the main process can simply append these to the output file one by one:
import pandas as pd
import multiprocessing as mp

source = r"\\share\usr\data.txt"
target = r"\\share\usr\data_masked.txt"
Chunk = 10000

def process_calc(df):
    '''
    get source df do calc and return newdf
    ...
    '''
    return(newdf)

def calc_frame(df):
    output_df = process_calc(df)
    return output_df

if __name__ == '__main__':
    with mp.Pool() as pool:
        reader = pd.read_table(source, sep='|', chunksize=Chunk, encoding='ANSI')
        # imap yields the results in submission order, so the output file stays in order
        for output_df in pool.imap(calc_frame, reader):
            output_df.to_csv(target, index=None, sep='|', mode='a', header=False)
I have an Excel file with 3k worth of sheets. I'm currently reading the sheets one by one, converting each to a dataframe, appending it to a list, and repeating.
An iteration of the for loop takes approximately 90 seconds, which is a huge amount of time. Each sheet has around 35 rows of data with 5 columns.
Can somebody suggest a better methodology for approaching this?
This is my code:
import pandas as pd
import time

nr_pages_workbook = list(range(1,3839))
nr_pages_workbook = ['Page '+str(x) for x in nr_pages_workbook]
list_df = []
start = time.time()
for number in nr_pages_workbook:
    data = pd.read_excel('D:\\DEV\\Stage\\Project\\Extras.xlsx',sheet_name=number)
    list_df.append(data)
    break
stop = time.time() - start
Df_Date_Raw = pd.concat(list_df)
You can try passing nr_pages_workbook directly to the sheet_name param in read_excel; according to the docs it can be a list, and the return value will be a dict of dataframes. This way you avoid the overhead of opening and reading the file in every cycle.
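Roughly like this, reusing the path and sheet list from the question (a sketch, not tested against your workbook):

import pandas as pd

nr_pages_workbook = ['Page ' + str(x) for x in range(1, 3839)]
# a single call opens the workbook once and returns a dict {sheet_name: DataFrame}
data = pd.read_excel('D:\\DEV\\Stage\\Project\\Extras.xlsx', sheet_name=nr_pages_workbook)
df = pd.concat(data.values(), ignore_index=True)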
Or simply pass sheet_name=None to read all sheets into a dict, and then concatenate from the dict:
data = pd.read_excel('D:\\DEV\\Stage\\Project\\Extras.xlsx', sheet_name=None)
df = pd.concat([v for k, v in data.items()])
You are reading the whole file again on every iteration of the loop. I would suggest reading it once using ExcelFile and then just accessing a particular sheet inside the loop. Try:
import pandas as pd

xl = pd.ExcelFile('foo.xls')
sheet_list = xl.sheet_names
for idx, name in enumerate(sheet_list):
    # sheet names are strings, so compare on the index rather than the name
    if idx == 0:
        df = xl.parse(name)
    else:
        df = df.append(xl.parse(name), ignore_index=True)
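One caveat from me: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same idea is usually written by parsing every sheet and concatenating once, for example:

import pandas as pd

xl = pd.ExcelFile('foo.xls')
# parse every sheet once and stitch them together in a single concat
df = pd.concat((xl.parse(name) for name in xl.sheet_names), ignore_index=True)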
I can combine 2 CSV files with this script and it works well.
import pandas
csv1=pandas.read_csv('1.csv')
csv2=pandas.read_csv('2.csv')
merged=csv1.merge(csv2,on='field1')
merged.to_csv('output.csv',index=False)
Now, I would like to combine more than 2 CSVs using the same method as above.
I have a list of CSVs which I defined like this:
import pandas

collection=['1.csv','2.csv','3.csv','4.csv']
for i in collection:
    csv=pandas.read_csv(i)
    merged=csv.merge(??,on='field1')
    merged.to_csv('output2.csv',index=False)
I haven't got it to work so far with more than 1 CSV. I guess it's just a matter of iterating over the list. Any ideas?
You need special handling for the first loop iteration:
import pandas

collection=['1.csv','2.csv','3.csv','4.csv']
result = None
for i in collection:
    csv = pandas.read_csv(i)
    if result is None:
        result = csv
    else:
        result = result.merge(csv, on='field1')
# note: a bare "if result:" would raise here, since a DataFrame has no single truth value
if result is not None:
    result.to_csv('output2.csv',index=False)
Another alternative would be to load the first CSV outside the loop, but this breaks when the collection is empty:
import pandas

collection=['1.csv','2.csv','3.csv','4.csv']
result = pandas.read_csv(collection[0])
for i in collection[1:]:
    csv = pandas.read_csv(i)
    result = result.merge(csv, on='field1')
result.to_csv('output2.csv',index=False)
A third idea would be to start from an empty DataFrame so the loop needs no special case:

import pandas

collection=['1.csv','2.csv','3.csv','4.csv']
result = pandas.DataFrame()   # empty frame
for i in collection:
    csv = pandas.read_csv(i)
    result = result.merge(csv, on='field1')
result.to_csv('output2.csv',index=False)

However, this doesn't actually work: the empty frame has no 'field1' column, so the very first merge raises a KeyError. Stick with one of the two variants above.
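For completeness, a more compact way to express the same chain of inner merges, using functools.reduce (my sketch, assuming every file has a 'field1' column and the list is non-empty):

import pandas
from functools import reduce

collection = ['1.csv', '2.csv', '3.csv', '4.csv']
frames = [pandas.read_csv(name) for name in collection]
# merge the frames pairwise, left to right, on the shared key
merged = reduce(lambda left, right: left.merge(right, on='field1'), frames)
merged.to_csv('output2.csv', index=False)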