I am trying to manipulate the following data from a websocket.
Here is the data:
{'e': 'kline', 'E': 1659440374345, 's': 'MATICUSDT', 'k': {'t': 1659440100000, 'T': 1659440399999, 's': 'MATICUSDT', 'i': '5m', 'f': 274454614, 'L': 274455188, 'o': '0.87210000', 'c': '0.87240000', 'h': '0.87240000', 'l': '0.87000000', 'v': '145806.50000000', 'n': 575, 'x': False, 'q': '127036.96453000', 'V': '76167.60000000', 'Q': '66365.16664000', 'B': '0'}}
I am trying to extract the following fields: 'E', 's' and 'c', and rename them so that 'E' becomes the time, 's' the symbol and 'c' the price.
def createframe(msg):
    df = pd.DataFrame([msg])
    df = df.loc[:, ['s', 'E', 'c']]
    df.columns = ['symbol', 'Time', 'Price']
    df.Price = df.Price.astype(float)
    df.Time = pd.to_datetime(df.Time, unit='ms')
    return df
When I run the following code to pull data:
async with stream as receiver:
    while True:
        data = await receiver.recv()
        data = json.loads(data)['data']
        df = createframe(data)
        print(df)
I am getting an error that 'c' is not defined. Please help, thank you!
If you look at the DataFrame, you'll see that column "k" holds a whole dictionary's worth of data. That's because the value of k is itself a dictionary. You're getting the error that c is not defined because c is not a column itself, just a piece of data inside column "k".
In order to get all this data into individual columns, you'll have to "flatten" the data. You can do something like this:
def createframe(msg):
    df = pd.DataFrame([msg])
    df = df.loc[:, ['s', 'E', 'c']]
    df.columns = ['symbol', 'Time', 'Price']
    df.Price = df.Price.astype(float)
    df.Time = pd.to_datetime(df.Time, unit='ms')
    return df
def flatten(msg):
    # copy every nested dict entry up to the top level,
    # so each field becomes its own key
    newdict = {}
    for each in msg:
        if isinstance(msg[each], dict):
            for i in msg[each]:
                newdict[i] = msg[each][i]
        else:
            newdict[each] = msg[each]
    return newdict
async with stream as receiver:
    while True:
        data = await receiver.recv()
        data = json.loads(data)['data']
        data = flatten(data)
        df = createframe(data)
        print(df)
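If you would rather not hand-roll the flattening, pandas can also do it for you. Here is a minimal sketch (my variation, not required for the fix above) using pandas.json_normalize, which expands the nested 'k' dict into dotted column names, so the close ends up in a column called 'k.c':

import pandas as pd

def createframe(msg):
    # json_normalize expands nested dicts: 'k' -> 'k.t', 'k.c', ...
    df = pd.json_normalize(msg)
    df = df.loc[:, ['s', 'E', 'k.c']]
    df.columns = ['symbol', 'Time', 'Price']
    df.Price = df.Price.astype(float)
    df.Time = pd.to_datetime(df.Time, unit='ms')
    return df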
Hope this helps! If you have questions just comment on this answer.
I am using Python multiprocessing to process files. The last processed date for each file is stored in a dict, i.e. dict_A = {'file1_xx': '8-04-22', 'file2_xx': '8-04-22', 'file3_xx': '8-04-22', 'file4_xx': '8-04-22'}
The files directory is scanned, and the filenames with their last modified dates are stored in dict_test. The files recorded in both dicts are then compared: each file's last modified date (e.g. for file1_xx) is checked against its last processed date in dict_A, and dict_A should be updated whenever the last modified date is greater than the last processed date for that file.
I am facing issues as the dictionary is not updated after the files are processed.
Ideally the dict_A should be updated with the latest modified date per file of same category. This dict_A is then uploaded to PostgreSQL db through sqlalchemy.
def compare_rec(i):
    a = dict_A[i]
    b = dict_test[i]
    if a >= b:
        print("none")
    else:
        lock.acquire()
        print("found")
        a = b
        lock.release()

def init(l):
    global lock
    lock = l

if __name__ == '__main__':
    file_cat = ['a', 'b', 'c', 'd']
    dict_A = {'a': '10', 'b': '10', 'c': '10', 'd': '10'}
    dict_test = {'a': '11', 'b': '11', 'c': '11', 'd': '11'}
    l = multiprocessing.Lock()
    pool = multiprocessing.Pool(initializer=init, initargs=(l,))
    pool.map(compare_rec, file_cat)
    pool.close()
    pool.join()
Processes don't share variables, and a = b only rebinds the local name a; it never changes dict_A.
In the function I would use return to send the filename and date back to the main process:

if ...:
    return i, a
else:
    return i, b

The main process should then collect the results from all workers

results = pool.map(compare_rec, file_cat)

and use them to update the dictionary:

dict_A.update(results)
Full code:
import multiprocessing

# defined at module level so that worker processes can see them
# even when they are spawned (Windows/macOS) rather than forked
dict_A = {'a': '10', 'b': '10', 'c': '10', 'd': '10'}
dict_test = {'a': '11', 'b': '11', 'c': '11', 'd': '11'}

def compare_rec(key):
    print('key:', key)
    a = dict_A[key]
    b = dict_test[key]
    if a >= b:
        print("none", key, a)
        return key, a
    else:
        print("found:", key, b)
        return key, b

if __name__ == '__main__':
    file_cat = ['a', 'b', 'c', 'd']

    pool = multiprocessing.Pool()
    results = pool.map(compare_rec, file_cat)
    print(results)

    print('before:', dict_A)
    dict_A.update(results)
    print('after :', dict_A)

    pool.close()
    pool.join()
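If you do want the workers to write into a shared dictionary directly instead of returning values, multiprocessing.Manager offers a proxied dict whose writes survive the worker processes. A minimal sketch of that alternative (my variation, not the answer above):

import multiprocessing

dict_test = {'a': '11', 'b': '11', 'c': '11', 'd': '11'}

def init(shared):
    global shared_A
    shared_A = shared

def compare_rec(key):
    # writes go through the manager process, so the main process sees them
    if dict_test[key] > shared_A[key]:
        shared_A[key] = dict_test[key]

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    shared_A = manager.dict({'a': '10', 'b': '10', 'c': '10', 'd': '10'})
    pool = multiprocessing.Pool(initializer=init, initargs=(shared_A,))
    pool.map(compare_rec, ['a', 'b', 'c', 'd'])
    pool.close()
    pool.join()
    print(dict(shared_A))  # {'a': '11', 'b': '11', 'c': '11', 'd': '11'}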
I am running a standard buy/sell trader using a websocket and TA-Lib. With the socket, I am able to get two messages through with different time intervals (1min and 3min). I am getting ETHUSDT data only, but with two different time intervals, using this socket:
TRADE_SYMBOL = 'ethusdt'
INTERVAL = '1m'
INTERVAL_2 = '3m'
SOCKET = f'wss://stream.binance.com:9443/ws/{TRADE_SYMBOL}@kline_{INTERVAL}/{TRADE_SYMBOL}@kline_{INTERVAL_2}'
This gives me a json.loads(message) of:
{'e': 'kline', 'E': 1646017123875, 's': 'ETHUSDT', 'k': {'t': 1646017080000, 'T': 1646017139999, 's': 'ETHUSDT', 'i': '1m', 'f': 769965188, 'L': 769965629, 'o': '2605.00000000', 'c': '2605.88000000', 'h': '2606.98000000', 'l': '2603.21000000', 'v': '191.57300000', 'n': 442, 'x': False, 'q': '499047.95132700', 'V': '78.57690000', 'Q': '204678.10094600', 'B': '0'}}
{'e': 'kline', 'E': 1646017123875, 's': 'ETHUSDT', 'k': {'t': 1646017020000, 'T': 1646017199999, 's': 'ETHUSDT', 'i': '3m', 'f': 769964266, 'L': 769965629, 'o': '2599.08000000', 'c': '2605.88000000', 'h': '2606.98000000', 'l': '2595.10000000', 'v': '922.85610000', 'n': 1364, 'x': False, 'q': '2399363.68094500', 'V': '356.83860000', 'Q': '928388.14101500', 'B': '0'}}
If you scroll across a bit, you'll see the 'i' field showing '1m' in one message and '3m' in the next.
I am then extracting the close 'c' from the line and compiling a list of closes.
What I want to be able to do is make a list of closes from the 1m list and then a separate list from the 3m list.
closes = []

def on_message(ws, message):
    global in_position
    json_message = json.loads(message)
    candle = json_message['k']
    is_candle_closed = candle['x']
    close = candle['c']
    if is_candle_closed:  # this is only True at the end of each candle close (1 minute)
        closes.append(float(close))
This is what I was using when I was only on 1m intervals, but now I don't know how to sort the two JSON payloads that are coming in with each message. How can I differentiate between the two pieces of JSON data so I can store their closing prices accordingly?
I need to be able to store the closing prices from the 1m and 3m candles as separate lists.
How can I split the closes into two separate lists, e.g. closes_1m and closes_3m, please?
I would like closes_1m to contain the close price after every 1 minute, and closes_3m to contain the close price after every 3 minutes.
I was able to separate them with a simple if statement.
closes_1m = []
closes_3m = []

def on_message(ws, message):
    json_message = json.loads(message)
    candle = json_message['k']
    is_candle_closed = candle['x']
    close = candle['c']
    close_time = int(candle['T']) / 1000  # close time in seconds (not used below)
    interval = candle['i']
    if interval == '1m' and is_candle_closed:  # runs at the end of each 1-minute candle
        closes_1m.append(float(close))
    if interval == '3m' and is_candle_closed:  # runs at the end of each 3-minute candle
        closes_3m.append(float(close))
So every 1 minute I appended a closing price to closes_1m and every 3 minutes I appended the closing price data to closes_3m.
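If more intervals get added later, a dict of lists keyed by the 'i' field scales better than one variable per interval. A small sketch of that idea (the names here are mine):

import json
from collections import defaultdict

closes = defaultdict(list)  # maps interval ('1m', '3m', ...) to its list of closes

def on_message(ws, message):
    candle = json.loads(message)['k']
    if candle['x']:  # only record closed candles
        closes[candle['i']].append(float(candle['c']))

closes['1m'] and closes['3m'] then play the roles of closes_1m and closes_3m.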
I have a list that already quite resembles a dictionary:
l=["'S':'NP''VP'", "'NP':'DET''N'", "'VP':'V'", "'DET':'a'", "'DET':'an'", "'N':'elephant'", "'N':'elephants'", "'V':'talk'", "'V':'smile'"]
I want to create a dictionary keeping all information:
dict = {'S': [['NP', 'VP']],
        'NP': [['DET', 'N']],
        'VP': [['V']],
        'DET': [['a'], ['an']],
        'N': [['elephants'], ['elephant']],
        'V': [['talk'], ['smile']]}
I tried using this:
d = {}
elems = filter(str.isalnum,l.replace('"',"").split("'"))
values = elems[1::2]
keys = elems[0::2]
d.update(zip(keys,values))
and this:
s = l.split(",")
dictionary = {}
for i in s:
dictionary[i.split(":")[0].strip('\'').replace("\"", "")] = i.split(":")[1].strip('"\'')
print(dictionary)
You can use collections.defaultdict with re:
import re, collections

l = ["'S':'NP''VP'", "'NP':'DET''N'", "'VP':'V'", "'DET':'a'", "'DET':'an'", "'N':'elephant'", "'N':'elephants'", "'V':'talk'", "'V':'smile'"]
d = collections.defaultdict(list)
for i in l:
    d[(k := re.findall(r'\w+', i))[0]].append(k[1:])

print(dict(d))
Output:
{'S': [['NP', 'VP']], 'NP': [['DET', 'N']], 'VP': [['V']], 'DET': [['a'], ['an']], 'N': [['elephant'], ['elephants']], 'V': [['talk'], ['smile']]}
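The walrus operator (:=) needs Python 3.8 or newer; the same idea written out without it, as a sketch:

import re

d = {}
for i in l:
    k = re.findall(r'\w+', i)             # e.g. "'S':'NP''VP'" -> ['S', 'NP', 'VP']
    d.setdefault(k[0], []).append(k[1:])  # first token is the key, the rest the production

print(d)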
Hi, I am working with two JSON files, and I'm having problems with the data cleaning.
Suppose a record in g1j or g2j looks like this:
{
    'cls_loc': 'QOEBBG_K0101',
    'date': 1584957443013,
    'dur': 32,
    'exp': [
        {
            'm': 'spot_excited',
            's': 8.5,
            't': 8.5,
            'w': 'spot_bored',
            'x': 'A'
        },
        {
            's': 1.1,
            't': 11.4,
            'w': 'spot_scared',
            'x': 'A'
        }
    ],
    'mod': 'Poster',
    'pre': False,
    'scr': 67,
    'usr': 'QOGOBN',
    'ver': '20.5.3'
}
What we want per row in our DataFrame is this:
{
    'student_pin': 'QOGOBN',            # from `usr`
    'date': datetime.date(2020, 3, 23), # from `date`, but parsed
    'duration': 32,                     # from `dur`
    'level': 3,                         # the "K" from `cls_loc`, mapped to int
    'unit': 1,                          # from `cls_loc`, mapped to int
    'module': 1,                        # from `cls_loc`, mapped to int
    'accuracy': 0.5,                    # calculated from `exp`
}
my code so far:
from datetime import datetime
import json

import numpy as np
import pandas as pd
from scipy import stats

with open('/content/drive/MyDrive/group1_exp_2020-04-08.json', 'r') as f:
    g1j = json.loads(f.read())
with open('/content/drive/MyDrive/group2_exp_2020-04-22.json', 'r') as f:
    g2j = json.loads(f.read())

# convert the integer timestamp to a datetime.date
def timestamp_to_date():
    l = []
    for item in g1j:
        timestamp = item['date']
        timestamp = timestamp / 1000
        dt_obj = datetime.fromtimestamp(timestamp).strftime('%Y, %m, %d ')
        l.append(dt_obj)
    return l

timestamp_to_date()

def timestamp_to_date():
    l = []
    for item in g2j:
        timestamp = item['date']
        timestamp = timestamp / 1000
        dt_obj = datetime.fromtimestamp(timestamp).strftime('%Y, %m, %d ')
        l.append(dt_obj)
    return l

# extract the level, unit, module, and accuracy here
def get_level(x):
    loc = x['cls_loc'].split('_')[-1]
    return level_map[loc[0]]

def get_unit(x):
    loc = x['cls_loc'].split('_')[-1]
    unit = loc[1:3]
    return int(unit)

def get_module(x):
    loc = x['cls_loc'].split('_')[-1]
    module = loc[3:]
    return int(module)

def get_accuracy(x):
    challenges = [x for x in x['exp'] if x['x'] == 'A']
    n = len(challenges)
    if n == 0:
        return 'N/A'
    mistakes = [x for x in challenges if 'm' in x.keys()]
    correct = n - len(mistakes)
    return correct / n

# create the function to convert experience records to the pandas.DataFrame
def exp_to_df(g1j):
    df = pd.DataFrame(f, columns=['exp'])
    return df

def exp_to_df(g2j):
    df = pd.DataFrame(f, columns=['exp'])
    return df

# uses the function you just implemented, and checks that your function keeps the records and uses the right column names
g1 = exp_to_df(g1j)
g2 = exp_to_df(g2j)

assert len(g1) == len(g1j)
assert len(g2) == len(g2j)

columns = ['student_pin', 'date', 'level', 'unit', 'module', 'accuracy']
assert all(c in g1.columns for c in columns)
assert all(c in g2.columns for c in columns)
What am I doing wrong? It seems like def exp_to_df(g1j) and def exp_to_df(g2j) are wrong. Any suggestions? Also, is my def timestamp_to_date() wrong as well?
I suggest using the pandas read_json() function to load your json directly into a dataframe (I added a couple dummy records):
g1 = pd.read_json('/content/drive/MyDrive/group1_exp_2020-04-08.json')
# cls_loc date dur exp mod pre scr usr ver
# 0 QOEBBG_K0101 2020-03-23 09:57:23.013 32 [{'m': 'spot_excited', 's': 8.5, 't': 8.5, 'w'... Poster False 67 QOGOBN 20.5.3
# 1 QOEBBG_K0102 2020-03-23 09:57:23.013 32 [{'m': 'spot_excited', 's': 8.5, 't': 8.5, 'w'... Poster False 67 QOGOBN 20.5.3
# 2 QOEBBG_K0103 2020-03-23 09:57:23.013 32 [{'s': 1.1, 't': 11.4, 'x': 'C'}] Poster False 67 QOGOBN 20.5.3
Then you can do all the data wrangling with pandas functions like str.extract(), assign(), to_datetime(), map(), and apply():
# extract level, unit, module as columns
g1 = g1.assign(**g1.cls_loc
               .str.extract(r'_([a-zA-Z])([0-9]{2})([0-9]{2})')
               .rename({0: 'level', 1: 'unit', 2: 'module'}, axis=1))

# convert date to datetime
g1.date = pd.to_datetime(g1.date, unit='ms')

# map level to int
level_map = {'K': 3}
g1.level = g1.level.map(level_map)

# compute accuracy
def accuracy(exp):
    challenges = [e for e in exp if e['x'] == 'A']
    n = len(challenges)
    if n == 0:
        return np.nan
    mistakes = [c for c in challenges if 'm' in c.keys()]
    correct = n - len(mistakes)
    return correct / n

g1['accuracy'] = g1.exp.apply(accuracy)

# rename usr -> student_pin
g1 = g1.rename({'usr': 'student_pin'}, axis=1)

# keep desired columns
columns = ['student_pin', 'date', 'level', 'unit', 'module', 'accuracy']
g1 = g1[columns]
Output:
student_pin date level unit module accuracy
0 QOGOBN 2020-03-23 09:57:23.013 3 01 01 0.500000
1 QOGOBN 2020-03-23 09:57:23.013 3 01 02 0.333333
2 QOGOBN 2020-03-23 09:57:23.013 3 01 03 NaN
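One caveat: str.extract() produces strings, so unit and module come out as '01' rather than 1. If you need the integers shown in the desired output, cast them afterwards (my addition, not part of the answer above):

g1[['unit', 'module']] = g1[['unit', 'module']].astype(int)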
dict = {'A': 71.07884,
'B': 110,
'C': 103.14484,
'D': 115.08864,
'E': 129.11552,
'F': 147.1766,
'G': 57.05196,
'H': 137.1412
}
def search_replace(search, replacement, searchstring):
    p = re.compile(search)
    searchstring = p.sub(replacement, searchstring)
    return searchstring

def main():
    with open(sys.argv[1]) as filetoread:
        lines = filetoread.readlines()
    file = ""
    for i in range(len(lines)):
        file += lines[i]
    file = search_replace('(?<=[BC])', ' ', file)
    letterlist = re.split(r'\s+', file)
    for j in range(len(letterlist)):
        print(letterlist[j])

if __name__ == '__main__':
    import sys
    import re
    main()
My program opens a file and splits the text after each B or C.
The file looks like:
ABHHFBFEACEGDGDACBGHFEDDCAFEBHGFEBCFHHHGBAHGBCAFEEAABCHHGFEEEAEAGHHCF
Now I want to sum up each substring using the letter values from the dict.
For example:
AB = 181.07884
HHFB = 531.4590000000001
And so on.
I don't know how to start. Thanks a lot for all your answers.
You already did most of the work! All you're missing is the sum for each substring.
Since substrings can occur more than once, I'll do the summation only once per distinct substring and store the values in a dict (and I renamed your letter-to-value dict to mydict, to avoid shadowing the built-in dict):
snippets = {}
for snippet in letterlist:
    if snippet not in snippets:
        value = 0
        for s in snippet:
            value += mydict.get(s)
        snippets[snippet] = value

print(snippets)
That gives me an output of
{
'AB': 181.07884,
'HHFB': 531.4590000000001,
'FEAC': 450.5158,
'EGDGDAC': 647.6204,
'B': 110,
'GHFEDDC': 803.8074,
'AFEB': 457.37096,
'HGFEB': 580.4852800000001,
'C': 103.14484,
'FHHHGB': 725.6521600000001,
'AHGB': 375.272,
'AFEEAAB': 728.64416,
'HHGFEEEAEAGHHC': 1571.6099199999999,
'F': 147.1766}
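As an aside, the inner loop can be collapsed with sum() in a dict comprehension; duplicate substrings just recompute the same value, so the result is identical:

snippets = {snippet: sum(mydict[s] for s in snippet) for snippet in letterlist}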
Try to simplify things...
Given you already have a string s and a dictionary d:
ctr = 0
temp = ''
for letter in s:
    ctr += d[letter]
    temp += letter
    if letter in 'BC':
        print(temp, ctr)
        ctr = 0
        temp = ''
In the case you supplied where:
s = "ABHHFBFEACEGDGDACBGHFEDDCAFEBHGFEBCFHHHGBAHGBCAFEEAABCHHGFEEEAEAGHHCF"
d = {'A': 71.07884,
'B': 110,
'C': 103.14484,
'D': 115.08864,
'E': 129.11552,
'F': 147.1766,
'G': 57.05196,
'H': 137.1412
}
You get the results printed to the terminal (note the trailing 'F' is never printed, because the string doesn't end in B or C):
('AB', 181.07884)
('HHFB', 531.4590000000001)
('FEAC', 450.5158)
('EGDGDAC', 647.6204)
('B', 110)
('GHFEDDC', 803.8074)
('AFEB', 457.37096)
('HGFEB', 580.4852800000001)
('C', 103.14484)
('FHHHGB', 725.6521600000001)
('AHGB', 375.272)
('C', 103.14484)
('AFEEAAB', 728.64416)
('C', 103.14484)
('HHGFEEEAEAGHHC', 1571.6099199999999)
Open your file, read it character by character, look each character up in the dictionary, and add its value to your total. Note that this gives one grand total for the whole file rather than per-substring sums:
sum_ = 0
letters = "letters_file"
opened = open(letters, "r")
for row in opened:
    for char in row.strip():  # strip the newline, which has no dictionary entry
        sum_ += your_dictionary[char]  # the values are floats, so don't truncate with int()
print(sum_)
You can use re.split with itertools.zip_longest in a dict comprehension:
import re
from itertools import zip_longest
i = iter(re.split('([BC])', s))
{w: sum(d[c] for c in w) for p in zip_longest(i, i, fillvalue='') for w in (''.join(p),)}
This returns:
{'AB': 181.07884, 'HHFB': 531.4590000000001, 'FEAC': 450.5158, 'EGDGDAC': 647.6204, 'B': 110, 'GHFEDDC': 803.8074, 'AFEB': 457.37096, 'HGFEB': 580.4852800000001, 'C': 103.14484, 'FHHHGB': 725.6521600000001, 'AHGB': 375.272, 'AFEEAAB': 728.64416, 'HHGFEEEAEAGHHC': 1571.6099199999999, 'F': 147.1766}
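To see why this works: re.split with a capturing group keeps the B/C delimiters in the result, and passing the same iterator to zip_longest twice pairs each chunk with the delimiter that follows it. Note that duplicate substrings (like the repeated 'C') collapse into a single dict key. A small illustrative snippet (my example string, not from the answer):

import re
from itertools import zip_longest

s = 'ABHHFBCF'
parts = re.split('([BC])', s)  # ['A', 'B', 'HHF', 'B', '', 'C', 'F']
it = iter(parts)
print([''.join(p) for p in zip_longest(it, it, fillvalue='')])
# ['AB', 'HHFB', 'C', 'F']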