Is there a better way via multiprocessing? - python

I am using Python multiprocessing to process files. The last processed date for each file is stored in a dict, e.g. dict_A = {'file1_xx': '8-04-22', 'file2_xx': '8-04-22', 'file3_xx': '8-04-22', 'file4_xx': '8-04-22'}.
The files directory is scanned and the filenames with their last-modified dates are stored in dict_test. The two dicts are then compared file by file: each file's last-modified date in dict_test is checked against its last processed date in dict_A, and dict_A should be updated whenever the modified date is newer.
I am facing issues because the dictionary is not updated after the files are processed.
Ideally dict_A should end up holding the latest modified date per file. This dict_A is then uploaded to a PostgreSQL db through SQLAlchemy.
import multiprocessing

def compare_rec(i):
    a = dict_A[i]
    b = dict_test[i]
    if a >= b:
        print("none")
    else:
        lock.acquire()
        print("found")
        a = b
        lock.release()

def init(l):
    global lock
    lock = l

if __name__ == '__main__':
    file_cat = ['a', 'b', 'c', 'd']
    dict_A = {'a': '10', 'b': '10', 'c': '10', 'd': '10'}
    dict_test = {'a': '11', 'b': '11', 'c': '11', 'd': '11'}
    l = multiprocessing.Lock()
    pool = multiprocessing.Pool(initializer=init, initargs=(l,))
    pool.map(compare_rec, file_cat)
    pool.close()
    pool.join()

Processes don't share variables. In the function I would use return to send the filename and date back to the main process:
if ...:
    return i, a
else:
    return i, b
The main process should collect the results from all workers:
results = pool.map(compare_rec, file_cat)
and use them to update the dictionary:
dict_A.update(results)
Full code:
import multiprocessing

def compare_rec(key):
    print('key:', key)
    a = dict_A[key]
    b = dict_test[key]
    if a >= b:
        print("none", key, a)
        return key, a
    else:
        print("found:", key, b)
        return key, b

if __name__ == '__main__':
    file_cat = ['a', 'b', 'c', 'd']
    dict_A = {'a': '10', 'b': '10', 'c': '10', 'd': '10'}
    dict_test = {'a': '11', 'b': '11', 'c': '11', 'd': '11'}

    pool = multiprocessing.Pool()
    results = pool.map(compare_rec, file_cat)
    print(results)

    print('before:', dict_A)
    dict_A.update(results)
    print('after :', dict_A)

    pool.close()
    pool.join()
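One caveat worth noting (my addition, not part of the original answer): the code above relies on the workers inheriting dict_A and dict_test as module globals, which works with the fork start method on Linux but not with spawn (the default on Windows and macOS), because both dicts are created inside the if __name__ == '__main__': block. A sketch that sidesteps this by passing each worker its data explicitly, assuming the same string-comparable dates:

```python
import multiprocessing

def compare_rec(args):
    # each worker receives its own (key, last_processed, last_modified)
    # tuple, so no global state needs to be shared between processes
    key, a, b = args
    return (key, b) if b > a else (key, a)

if __name__ == '__main__':
    dict_A = {'a': '10', 'b': '10', 'c': '10', 'd': '10'}
    dict_test = {'a': '11', 'b': '09', 'c': '11', 'd': '11'}
    with multiprocessing.Pool() as pool:
        results = pool.map(compare_rec,
                           [(k, dict_A[k], dict_test[k]) for k in dict_A])
    dict_A.update(results)
    print(dict_A)
```

This keeps the worker function a pure function of its arguments, which also makes it trivial to test without a pool.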


How to manipulate data from binance stream

I am trying to manipulate the following data from a websocket.
Here is the data:
{'e': 'kline', 'E': 1659440374345, 's': 'MATICUSDT', 'k': {'t': 1659440100000, 'T': 1659440399999, 's': 'MATICUSDT', 'i': '5m', 'f': 274454614, 'L': 274455188, 'o': '0.87210000', 'c': '0.87240000', 'h': '0.87240000', 'l': '0.87000000', 'v': '145806.50000000', 'n': 575, 'x': False, 'q': '127036.96453000', 'V': '76167.60000000', 'Q': '66365.16664000', 'B': '0'}}
I am trying to extract the following: 'E', 's' and 'c', renamed to 'E' = Time, 's' = symbol and 'c' = Price.
def createframe(msg):
    df = pd.DataFrame([msg])
    df = df.loc[:, ['s', 'E', 'c']]
    df.columns = ['symbol', 'Time', 'Price']
    df.Price = df.Price.astype(float)
    df.Time = pd.to_datetime(df.Time, unit='ms')
    return df
When I run the next line of code to pull data:
async with stream as receiver:
    while True:
        data = await receiver.recv()
        data = json.loads(data)['data']
        df = createframe(data)
        print(df)
I am getting an error that 'c' is not defined.
PLEASE HELP. THANK YOU
If you look at the data frame, you'll see that in column "k" you have a whole dictionary's worth of data. That's because the value of k is itself a dictionary. You're getting the error that c is not defined because it is not a column itself, just a piece of data in column "k".
In order to get all this data into individual columns, you'll have to "flatten" the data. You can do something like this:
def createframe(msg):
    df = pd.DataFrame([msg])
    df = df.loc[:, ['s', 'E', 'c']]
    df.columns = ['symbol', 'Time', 'Price']
    df.Price = df.Price.astype(float)
    df.Time = pd.to_datetime(df.Time, unit='ms')
    return df

def flatten(msg):
    newdict = {}
    for each in msg:
        if isinstance(msg[each], dict):
            # lift every key of the nested dict (e.g. 'k') to the top level
            for i in msg[each]:
                newdict[i] = msg[each][i]
        else:
            newdict[each] = msg[each]
    return newdict
async with stream as receiver:
    while True:
        data = await receiver.recv()
        data = json.loads(data)['data']
        data = flatten(data)
        df = createframe(data)
        print(df)
Hope this helps! If you have questions just comment on this answer.
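As an alternative to the hand-rolled flatten (my suggestion, not part of the answer above), pandas ships json_normalize, which flattens nested dicts into dotted column names such as 'k.c'. A sketch using a trimmed-down message:

```python
import pandas as pd

msg = {'e': 'kline', 'E': 1659440374345, 's': 'MATICUSDT',
       'k': {'i': '5m', 'c': '0.87240000', 'x': False}}

# nested keys become dotted columns: 'k.i', 'k.c', 'k.x', ...
df = pd.json_normalize(msg)
df = df.loc[:, ['s', 'E', 'k.c']]
df.columns = ['symbol', 'Time', 'Price']
df['Price'] = df['Price'].astype(float)
df['Time'] = pd.to_datetime(df['Time'], unit='ms')
print(df)
```

This keeps the original column 's' from the outer dict and reads the close from the nested 'k.c' column, so no separate flattening step is needed.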

How to compile two lists of closing prices from websocket client?

I am running a standard buy/sell trader using websocket and talib. With the socket, I am able to get two messages through with different time intervals (1min and 3min). I am getting ETHUSDT data only, but with two different time intervals, using this socket:
TRADE_SYMBOOL = 'ethusdt'
INTERVAL = '1m'
INTERVAL_2 = '3m'
SOCKET = f'wss://stream.binance.com:9443/ws/{TRADE_SYMBOOL}@kline_{INTERVAL}/{TRADE_SYMBOOL}@kline_{INTERVAL_2}'
This gives me a json.loads(message) of:
{'e': 'kline', 'E': 1646017123875, 's': 'ETHUSDT', 'k': {'t': 1646017080000, 'T': 1646017139999, 's': 'ETHUSDT', 'i': '1m', 'f': 769965188, 'L': 769965629, 'o': '2605.00000000', 'c': '2605.88000000', 'h': '2606.98000000', 'l': '2603.21000000', 'v': '191.57300000', 'n': 442, 'x': False, 'q': '499047.95132700', 'V': '78.57690000', 'Q': '204678.10094600', 'B': '0'}}
{'e': 'kline', 'E': 1646017123875, 's': 'ETHUSDT', 'k': {'t': 1646017020000, 'T': 1646017199999, 's': 'ETHUSDT', 'i': '3m', 'f': 769964266, 'L': 769965629, 'o': '2599.08000000', 'c': '2605.88000000', 'h': '2606.98000000', 'l': '2595.10000000', 'v': '922.85610000', 'n': 1364, 'x': False, 'q': '2399363.68094500', 'V': '356.83860000', 'Q': '928388.14101500', 'B': '0'}}
If you scroll across a bit, the 'i' tick is showing 1m in one and 3m in the next.
I am then extracting the close 'c' from the line and compiling a list of closes.
What I want to be able to do is make a list of closes from the 1m list and then a separate list from the 3m list.
closes = []

def on_message(ws, message):
    global in_position
    json_message = json.loads(message)
    candle = json_message['k']
    is_candle_closed = candle['x']
    close = candle['c']
    if is_candle_closed:  # this is only True at the end of each candle close (1 minute)
        closes.append(float(close))
This is what I was using when I was only using 1m intervals, but now I don't know how to sort the 2 json loads that are coming in with each message. How can I differentiate between the two pieces of json data so I can store their closing prices accordingly?
I need to be able to store the closing prices from the 1m and 3m candles as separate lists.
How can I split the closes into two separate lists, e.g. closes_1m and closes_3m?
I would like list closes_1m to contain the price of the close after every 1 minute.
I would like list closes_3m to contain the price of the close after every 3 minutes.
I was able to separate them with a simple if statement.
closes_1m = []
closes_3m = []

def on_message(ws, message):
    json_message = json.loads(message)
    candle = json_message['k']
    is_candle_closed = candle['x']
    close = candle['c']
    close_time = int(candle['T']) / 1000
    interval = candle['i']
    if interval == '1m' and is_candle_closed:  # runs at the end of each 1-minute candle
        closes_1m.append(float(close))
    if interval == '3m' and is_candle_closed:  # runs at the end of each 3-minute candle
        closes_3m.append(float(close))
So every 1 minute I appended a closing price to closes_1m and every 3 minutes I appended the closing price data to closes_3m.
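If more intervals get added later, a dict of lists keyed by the interval string avoids one if-branch per interval. A sketch of that idea (my generalization, assuming the same kline message shape as above):

```python
import json
from collections import defaultdict

closes = defaultdict(list)  # interval string ('1m', '3m', ...) -> list of closes

def on_message(ws, message):
    candle = json.loads(message)['k']
    if candle['x']:  # candle is closed
        closes[candle['i']].append(float(candle['c']))

# simulate two closed candles from different intervals
on_message(None, json.dumps({'k': {'i': '1m', 'x': True, 'c': '2605.88'}}))
on_message(None, json.dumps({'k': {'i': '3m', 'x': True, 'c': '2606.10'}}))
print(dict(closes))
```

closes['1m'] and closes['3m'] then play the roles of closes_1m and closes_3m, and a '5m' stream would need no code change.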

List that resembles a dict to dict

I have a list that already quite resembles a dictionary:
l=["'S':'NP''VP'", "'NP':'DET''N'", "'VP':'V'", "'DET':'a'", "'DET':'an'", "'N':'elephant'", "'N':'elephants'", "'V':'talk'", "'V':'smile'"]
I want to create a dictionary keeping all information:
dict = {'S': [['NP', 'VP']],
        'NP': [['DET', 'N']],
        'VP': [['V']],
        'DET': [['a'], ['an']],
        'N': [['elephants'], ['elephant']],
        'V': [['talk'], ['smile']]}
I tried using this:
d = {}
elems = filter(str.isalnum,l.replace('"',"").split("'"))
values = elems[1::2]
keys = elems[0::2]
d.update(zip(keys,values))
and this:
s = l.split(",")
dictionary = {}
for i in s:
    dictionary[i.split(":")[0].strip('\'').replace("\"", "")] = i.split(":")[1].strip('"\'')
print(dictionary)
You can use collections.defaultdict with re:
import re, collections

l = ["'S':'NP''VP'", "'NP':'DET''N'", "'VP':'V'", "'DET':'a'", "'DET':'an'", "'N':'elephant'", "'N':'elephants'", "'V':'talk'", "'V':'smile'"]
d = collections.defaultdict(list)
for i in l:
    d[(k := re.findall(r'\w+', i))[0]].append(k[1:])
print(dict(d))
Output:
{'S': [['NP', 'VP']], 'NP': [['DET', 'N']], 'VP': [['V']], 'DET': [['a'], ['an']], 'N': [['elephant'], ['elephants']], 'V': [['talk'], ['smile']]}
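If the walrus operator and regex feel too terse, the same structure can be built with str.partition and a quote split (a readability-focused sketch of my own, assuming the same input format; shown here on a shortened list):

```python
from collections import defaultdict

l = ["'S':'NP''VP'", "'NP':'DET''N'", "'DET':'a'", "'DET':'an'"]
d = defaultdict(list)
for item in l:
    key, _, rest = item.partition(':')
    # "'NP''VP'" -> ['NP', 'VP']: splitting on quotes yields empty
    # strings between tokens, which the filter drops
    d[key.strip("'")].append([v for v in rest.split("'") if v])
print(dict(d))
```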

Python: Sum each lines with their values from dict

dict = {'A': 71.07884,
        'B': 110,
        'C': 103.14484,
        'D': 115.08864,
        'E': 129.11552,
        'F': 147.1766,
        'G': 57.05196,
        'H': 137.1412
        }

def search_replace(search, replacement, searchstring):
    p = re.compile(search)
    searchstring = p.sub(replacement, searchstring)
    return searchstring

def main():
    with open(sys.argv[1]) as filetoread:
        lines = filetoread.readlines()
    file = ""
    for i in range(len(lines)):
        file += lines[i]
    file = search_replace('(?<=[BC])', ' ', file)
    letterlist = re.split(r'\s+', file)
    for j in range(len(letterlist)):
        print(letterlist[j])

if __name__ == '__main__':
    import sys
    import re
    main()
My program opens a file and splits the text into chunks of letters, splitting after each B or C.
The file looks like:
ABHHFBFEACEGDGDACBGHFEDDCAFEBHGFEBCFHHHGBAHGBCAFEEAABCHHGFEEEAEAGHHCF
Now I want to sum each chunk using the letter values from the dict.
For example:
AB = 181.07884
HHFB = 531.4590000000001
And so on.
I don't know how to start. Thanks a lot for all your answers.
You already did most of the work! All you're missing is the sum for each substring.
As substrings can occur more than once, I do the summation only once and store the value for each substring encountered in a dict (and I renamed your letter-value dict above to mydict, to avoid shadowing the built-in dict):
snippets = {}
for snippet in letterlist:
    if snippet not in snippets:
        value = 0
        for s in snippet:
            value += mydict.get(s)
        snippets[snippet] = value
print(snippets)
That gives me an output of
{
'AB': 181.07884,
'HHFB': 531.4590000000001,
'FEAC': 450.5158,
'EGDGDAC': 647.6204,
'B': 110,
'GHFEDDC': 803.8074,
'AFEB': 457.37096,
'HGFEB': 580.4852800000001,
'C': 103.14484,
'FHHHGB': 725.6521600000001,
'AHGB': 375.272,
'AFEEAAB': 728.64416,
'HHGFEEEAEAGHHC': 1571.6099199999999,
'F': 147.1766}
Try to simplify things...
Given you already have a string s and a dictionary d:
ctr = 0
temp = ''
for letter in s:
    ctr += d[letter]
    temp += letter
    if letter in 'BC':
        print(temp, ctr)
        ctr = 0
        temp = ''
In the case you supplied where:
s = "ABHHFBFEACEGDGDACBGHFEDDCAFEBHGFEBCFHHHGBAHGBCAFEEAABCHHGFEEEAEAGHHCF"
d = {'A': 71.07884,
     'B': 110,
     'C': 103.14484,
     'D': 115.08864,
     'E': 129.11552,
     'F': 147.1766,
     'G': 57.05196,
     'H': 137.1412
     }
You get the results (printed to terminal):
>>> ('AB', 181.07884)
('HHFB', 531.4590000000001)
('FEAC', 450.5158)
('EGDGDAC', 647.6204)
('B', 110)
('GHFEDDC', 803.8074)
('AFEB', 457.37096)
('HGFEB', 580.4852800000001)
('C', 103.14484)
('FHHHGB', 725.6521600000001)
('AHGB', 375.272)
('C', 103.14484)
('AFEEAAB', 728.64416)
('C', 103.14484)
('HHGFEEEAEAGHHC', 1571.6099199999999)
Open your file, read it character by character, look each character up in the dictionary and add its value to a running total:
sum_ = 0
letters = "letters_file"
opened = open(letters, "r")
for row in opened:
    for char in row.strip():  # strip the newline so every char is a valid key
        sum_ += your_dictionary[char]
print(sum_)
You can use re.split with itertools.zip_longest in a dict comprehension:
import re
from itertools import zip_longest

i = iter(re.split('([BC])', s))
{w: sum(d[c] for c in w) for p in zip_longest(i, i, fillvalue='') for w in (''.join(p),)}
This returns:
{'AB': 181.07884, 'HHFB': 531.4590000000001, 'FEAC': 450.5158, 'EGDGDAC': 647.6204, 'B': 110, 'GHFEDDC': 803.8074, 'AFEB': 457.37096, 'HGFEB': 580.4852800000001, 'C': 103.14484, 'FHHHGB': 725.6521600000001, 'AHGB': 375.272, 'AFEEAAB': 728.64416, 'HHGFEEEAEAGHHC': 1571.6099199999999, 'F': 147.1766}
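One more variation (mine, not from the answers above): the dict-based answers collapse repeated chunks such as 'C' into a single key. If each occurrence should keep its own entry, a list of pairs built from the question's own lookbehind split works; Python 3.7+ is assumed, since older versions reject splitting on a zero-width pattern. A sketch with a shortened input:

```python
import re

d = {'A': 71.07884, 'B': 110, 'C': 103.14484, 'F': 147.1766, 'H': 137.1412}
s = "ABHHFBC"

# split *after* every B or C; the trailing empty piece is dropped
chunks = [c for c in re.split(r'(?<=[BC])', s) if c]
totals = [(c, sum(d[ch] for ch in c)) for c in chunks]
print(totals)
```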

How to print first 10 lines instead of the whole list using pprint

What I am trying to accomplish is printing only the first 10 lines instead of the whole dict when using pprint(dict(str_types)).
Here is my code
import re
from collections import defaultdict

str_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road",
            "Trail", "Parkway", "Commons"]

def audit_str_type(str_types, str_name, rex):
    stn = rex.search(str_name)
    if stn:
        str_type = stn.group()
        if str_type not in expected:
            str_types[str_type].add(str_name)
I defined a function that audits tag elements where k="addr:street", and also any tag elements match the is_str_name function.
def audit(osmfile, rex):
    osm_file = open(osmfile, "r", encoding="utf8")
    str_types = defaultdict(set)
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_str_name(tag):
                    audit_str_type(str_types, tag.attrib['v'], rex)
    return str_types
In the code above , I used "is_str_name" function to filter tag when calling the audit function to audit street names.
def is_str_name(elem):
    return elem.attrib['k'] == "addr:street"

str_types = audit(mydata, rex=str_type_re)
pprint.pprint(dict(str_types[:10]))
Use pprint.pformat to get back the string representation of the object instead of printing it directly, then you can split it up by lines and only print out the first few:
whole_repr = pprint.pformat(dict(str_types))
for line in whole_repr.splitlines()[:10]:
    print(line)
Note that I couldn't test this since you did not have a MCVE but I did verify it with a more trivial example:
>>> import pprint
>>> thing = pprint.pformat({i:str(i) for i in range(10000)})
>>> type(thing), len(thing)
(<class 'str'>, 147779)
>>> for line in thing.splitlines()[:10]:print(line)
{0: '0',
1: '1',
2: '2',
3: '3',
4: '4',
5: '5',
6: '6',
7: '7',
8: '8',
9: '9',
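A related trick (my addition, not part of the answer above): if "first 10 entries" is acceptable instead of literally "first 10 printed lines", slicing the dict's items with itertools.islice avoids formatting the whole structure first:

```python
import itertools
import pprint

data = {i: str(i) for i in range(10000)}
# take the first 10 key/value pairs; nothing beyond them is ever formatted
first10 = dict(itertools.islice(data.items(), 10))
pprint.pprint(first10)
```

Note the difference: pformat-then-splitlines truncates by output line, while islice truncates by dict entry, which may span more or fewer lines once pretty-printed.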
