My Merkle Tree calculation does not match the actual one - python

I am studying bitcoin.
https://en.bitcoin.it/wiki/Protocol_documentation#Merkle_Trees
I read the page above and implemented the Merkle root calculation in Python.
Using the API below, I collected all transactions in block 641150 and calculated the Merkle Root.
https://www.blockchain.com/explorer/api/blockchain_api
The expected value is:
67a637b1c49d95165b3dd3177033adbbbc880f6da3620498d451ee0976d7b1f4
(https://www.blockchain.com/btc/block/641150 )
The value I calculated was:
f2a2207a1e8360b75729fd2f23659b1b79b14940b6e4982a985cf6aa6f941ad7
What is wrong?
My Python code is:
from hashlib import sha256
import requests, json

base_url = 'https://blockchain.info/rawblock/'
block_hash = '000000000000000000042cef688cf40b4a70ac814e4222e6646bd6bb79d18168'
end_point = base_url + block_hash

def reverse(hex_be):
    bytes_be = bytes.fromhex(hex_be)
    bytes_le = bytes_be[::-1]
    hex_le = bytes_le.hex()
    return hex_le

def dhash(hash):
    return sha256(sha256(hash.encode('utf-8')).hexdigest().encode('utf-8')).hexdigest()

def culculate_merkle(hash_list):
    if len(hash_list) == 1:
        return dhash(hash_list[0])
    hashed_list = list(map(dhash, hash_list))
    if len(hashed_list) % 2 == 1:
        hashed_list.append(hashed_list[-1])
    parent_hash_list = []
    it = iter(hashed_list)
    for hash1, hash2 in zip(it, it):
        parent_hash_list.append(hash1 + hash2)
    hashed_list = list(map(dhash, hash_list))
    return culculate_merkle(parent_hash_list)

data = requests.get(end_point)
jsondata = json.loads(data.text)
tx_list = list(map(lambda tx_object: tx_object['hash'], jsondata['tx']))
markleroot = '67a637b1c49d95165b3dd3177033adbbbc880f6da3620498d451ee0976d7b1f4'
tx_list = list(map(reverse, tx_list))
output = culculate_merkle(tx_list)
output = reverse(output)
print(output)
Result:
$ python merkleTree.py
f2a2207a1e8360b75729fd2f23659b1b79b14940b6e4982a985cf6aa6f941ad7
I expected the following output:
67a637b1c49d95165b3dd3177033adbbbc880f6da3620498d451ee0976d7b1f4

Self-solved.
First: there was a mistake in the dhash calculation; it operated on hex strings instead of raw bytes.
Second: the hashes returned by the API have already been double-hashed once, so the leaves must not be hashed again before pairing; my first attempt missed this.
The complete source code is summarized in the following blog post (in Japanese):
https://tec.tecotec.co.jp/entry/2022/12/25/000000
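For reference, here is a minimal sketch of the corrected approach implied by those two fixes (my own reconstruction, not the blog's exact code): dhash has to work on raw bytes rather than hex strings, and the leaf hashes coming from the API must not be hashed again before pairing.

from hashlib import sha256

def dhash(b):
    # Double SHA-256 over raw bytes, not over hex strings.
    return sha256(sha256(b).digest()).digest()

def merkle_root(leaves):
    # leaves: txids as bytes in internal (little-endian) byte order.
    # They are already double-SHA256 values, so they are not hashed again here.
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])   # duplicate the last hash on odd-sized levels
        level = [dhash(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Usage with the big-endian hex txids returned by the API:
# leaves = [bytes.fromhex(h)[::-1] for h in tx_list]
# print(merkle_root(leaves)[::-1].hex())   # reverse back to display order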

Related

What is the best way to return a variable or call a function to maximize code reuse?

I was wondering if I could get some input from some seasoned Python experts; I have a couple of questions.
I am extracting data from an API request and calculating the total vulnerabilities.
What is the best way to return this data so that I can call it in another function?
How can I add up all the vulnerabilities? Right now it only sums 500 at a time, and I would like the sum of every vulnerability.
def _request():
    third_party_patching_filer = {
        "asset": "asset.agentKey IS NOT NULL",
        "vulnerability" : "vulnerability.categories NOT IN ['microsoft patch']"}
    headers = _headers()
    print(headers)
    url1 = f"https://us.api.insight.rapid7.com/vm/v4/integration/assets"
    resp = requests.post(url=url1, headers=headers, json=third_party_patching_filer, verify=False).json()
    jsonData = resp
    #print(jsonData)
    has_next_cursor = False
    nextKey = ""
    if "cursor" in jsonData["metadata"]:
        has_next_cursor = True
        nextKey = jsonData["metadata"]["cursor"]
    while has_next_cursor:
        url2 = f"https://us.api.insight.rapid7.com/vm/v4/integration/assets?&size=500&cursor={nextKey}"
        resp2 = requests.post(url=url2, headers=headers, json=third_party_patching_filer, verify=False).json()
        cursor = resp2["metadata"]
        print(cursor)
        if "cursor" in cursor:
            nextKey = cursor["cursor"]
            print(f"next key {nextKey}")
            #print(desktop_support)
            for data in resp2["data"]:
                for tags in data['tags']:
                    total_critical_vul_osswin = []
                    total_severe_vul_osswin = []
                    total_modoer_vuln_osswin = []
                    if tags["name"] == 'OSSWIN':
                        print("OSSWIN")
                        critical_vuln_osswin = data['critical_vulnerabilities']
                        severe_vuln_osswin = data['severe_vulnerabilities']
                        modoer_vuln_osswin = data['moderate_vulnerabilities']
                        total_critical_vul_osswin.append(critical_vuln_osswin)
                        total_severe_vul_osswin.append(severe_vuln_osswin)
                        total_modoer_vuln_osswin.append(modoer_vuln_osswin)
                        print(sum(total_critical_vul_osswin))
                        print(sum(total_severe_vul_osswin))
                        print(sum(total_modoer_vuln_osswin))
                    if tags["name"] == 'DESKTOP_SUPPORT':
                        print("Desktop")
                        total_critical_vul_desktop = []
                        total_severe_vul_desktop = []
                        total_modorate_vuln_desktop = []
                        critical_vuln_desktop = data['critical_vulnerabilities']
                        severe_vuln_desktop = data['severe_vulnerabilities']
                        moderate_vuln_desktop = data['moderate_vulnerabilities']
                        total_critical_vul_desktop.append(critical_vuln_desktop)
                        total_severe_vul_desktop.append(severe_vuln_desktop)
                        total_modorate_vuln_desktop.append(moderate_vuln_desktop)
                        print(sum(total_critical_vul_desktop))
                        print(sum(total_severe_vul_desktop))
                        print(sum(total_modorate_vuln_desktop))
                    else:
                        pass
        else:
            has_next_cursor = False
If you have a lot of parameters to pass, consider using a dict to combine them. Then you can just return the dict and pass it along to the next function that needs that data. Another approach is to create a class and either access the variables directly or add helper functions that do so. The latter is a cleaner solution than a dict: with a dict you have to quote every variable name, while with a class you can easily add functionality beyond just being a container for a bunch of instance variables.
If you want the total across all the data, you should put these initializations:
total_critical_vul_osswin = []
total_severe_vul_osswin = []
total_modoer_vuln_osswin = []
before the while has_next_cursor loop (and similarly for the desktop totals). The way your code is currently written, they are re-initialized on every pass, i.e. for each batch of 500 records fetched from the URL.
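A rough sketch combining both suggestions, assuming the field and tag names from the question (everything else, including the shape of the totals dict, is illustrative): keep plain running totals that are initialized once before pagination, and return them as a dict so the caller can reuse them.

import requests

def _request():
    headers = _headers()  # helper assumed from the question
    filters = {
        "asset": "asset.agentKey IS NOT NULL",
        "vulnerability": "vulnerability.categories NOT IN ['microsoft patch']",
    }
    url = "https://us.api.insight.rapid7.com/vm/v4/integration/assets?size=500"

    # Initialize the totals once, before pagination, so they cover every page.
    totals = {
        "OSSWIN": {"critical": 0, "severe": 0, "moderate": 0},
        "DESKTOP_SUPPORT": {"critical": 0, "severe": 0, "moderate": 0},
    }

    cursor = None
    while True:
        page_url = url if cursor is None else f"{url}&cursor={cursor}"
        resp = requests.post(url=page_url, headers=headers, json=filters, verify=False).json()

        for asset in resp["data"]:
            for tag in asset["tags"]:
                group = totals.get(tag["name"])
                if group is not None:
                    group["critical"] += asset["critical_vulnerabilities"]
                    group["severe"] += asset["severe_vulnerabilities"]
                    group["moderate"] += asset["moderate_vulnerabilities"]

        cursor = resp["metadata"].get("cursor")
        if not cursor:
            break

    return totals  # the caller can then read totals["OSSWIN"]["critical"], etc.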

Improving the runtime of a pandas loop [duplicate]

This question already has an answer here:
Pandas - Explanation on apply function being slow
(1 answer)
Closed 7 months ago.
I am running some Python code in Jupyter on a df of about 84k rows, and at this rate I estimate it will take somewhere in the neighborhood of 9 hours. My code is below. I have read that ideally one would vectorize for maximum speed, but being fairly new to Python and coding in general, I'm not sure how to change the code below to vectorize it. The goal is to take the value in the first column of the dataframe, append it to the end of a URL, fetch the first line of the response, and compare it to some predetermined values to find out whether there is a match. Any advice would be greatly appreciated!
#Python 3
import pandas as pd
import urllib

no_res = "Item Not Found"
error = "Does Not Exist"

for i in df1.index:
    path = 'http://xxxx/xxx/xxx.pl?part=' + str(df1['ITEM_ID'][i])
    parsed_path = path.replace(' ','%20')
    f = urllib.request.urlopen(parsed_path)
    raw = str(f.read().decode("utf-8"))
    lines = raw.split('\n')
    r = lines[0]
    if r == no_res:
        sap = 'NO'
    elif r == error:
        sap = 'ERROR'
    else:
        sap = 'YES'
    df1["item exists"][i] = sap
    df1["Path"][i] = path
    df1["URL return value"][i] = r
Edit: adding test code below.
import concurrent.futures
import pandas as pd
import urllib
import numpy as np

def my_func(df_row):
    no_res = "random"
    error = "entered"
    path = "http://www.google.com"
    parsed_path = path.replace(' ','%20')
    f = urllib.request.urlopen(parsed_path)
    raw = str(f.read().decode("utf-8"))
    lines = raw.split('\n')
    r = df_row['0']
    if r == no_res:
        sap = "NO"
    elif r == error:
        sap = "ERROR"
    else:
        sap = "YES"
    df_row['4'] = sap
    df_row['5'] = lines[0]
    df_row['6'] = r

n = 1000
my_df = pd.DataFrame(np.random.choice(['random','words','entered'], size=(n,3)))
my_df['4'] = ""
my_df['5'] = ""
my_df['6'] = ""
my_df = my_df.apply(lambda col: col.astype('category'))

executor = concurrent.futures.ProcessPoolExecutor(8)
futures = [executor.submit(my_func, row) for _, row in my_df.iterrows()]
concurrent.futures.wait(futures)
This is throwing the following error (shortened):
DoneAndNotDoneFutures(done={<Future at 0x1cfe4938040 state=finished raised BrokenProcessPool>, <Future at 0x1cfe48b8040 state=finished raised BrokenProcessPool>,
Since you are doing an external operation with a URL, I do not think vectorization is a solution (if it is even possible).
The bottleneck of your operation is the following line:
f = urllib.request.urlopen(parsed_path)
This line waits for the response and blocks; as you mentioned, your operation is I/O bound. The CPU can start other jobs while waiting for the response, and the way to exploit that is concurrency.
Edit: my original answer used Python's built-in multithreading, which was problematic. The best way to do multiprocessing/threading with a pandas DataFrame is to use the dask library.
The following code was tested with the dummy data set on my PC and on average speeds up the naive for loop by roughly 12 times.
#%%
import time
import urllib.request
import pandas as pd
import numpy as np
import dask.dataframe as dd

def my_func(df_row):
    df_row = df_row.copy()
    no_res = "random"
    error = "entered"
    path = "http://www.google.com"
    parsed_path = path.replace(' ','%20')
    f = urllib.request.urlopen(parsed_path)
    # I had to change the encoding on my machine.
    raw = str(f.read().decode("windows-1252"))
    lines = raw.split('\n')
    r = df_row[0]
    if r == no_res:
        sap = "NO"
    elif r == error:
        sap = "ERROR"
    else:
        sap = "YES"
    df_row['4'] = sap
    df_row['5'] = lines[0]
    df_row['6'] = r
    return df_row

def run():
    print("started.")
    n = 1000
    my_df = pd.DataFrame(np.random.choice(['random','words','entered'], size=(n,3)))
    my_df = my_df.apply(lambda col: col.astype('category'))
    my_df['4'] = ""
    my_df['5'] = ""
    my_df['6'] = ""
    # Dask literally partitions the original dataframe into
    # npartitions chunks and uses them in the apply function
    # in parallel.
    my_ddf = dd.from_pandas(my_df, npartitions=15)
    start = time.time()
    q = my_ddf.apply(my_func, axis=1, meta=my_ddf)
    # num_workers is the number of threads used.
    print(q.compute(num_workers=50))
    time_end = time.time()
    print(f"Elapsed: {time_end - start:10.2f}")

if __name__ == "__main__":
    run()
dask provides many other tools and options to facilitate concurrent processing and it would be a good idea to take a look at its documentation to investigate other options.
P.S.: if you run the above code too many times against Google you will receive "HTTP Error 429: Too Many Requests". This exists to prevent something like a DDoS attack on a public server. So if your real job queries a public website, you may end up receiving the same 429 response if you try 84k queries in a short time.

Parsing Json with multiple "levels" with Python

I'm trying to parse the JSON returned by an API call.
I found this code that fits my need and am trying to adapt it to what I want:
import math, urllib2, json, re

def download():
    graph = {}
    page = urllib2.urlopen("http://fx.priceonomics.com/v1/rates/?q=1")
    jsrates = json.loads(page.read())
    pattern = re.compile("([A-Z]{3})_([A-Z]{3})")
    for key in jsrates:
        matches = pattern.match(key)
        conversion_rate = -math.log(float(jsrates[key]))
        from_rate = matches.group(1).encode('ascii','ignore')
        to_rate = matches.group(2).encode('ascii','ignore')
        if from_rate != to_rate:
            if from_rate not in graph:
                graph[from_rate] = {}
            graph[from_rate][to_rate] = float(conversion_rate)
    return graph
And I've turned it into:
import math, urllib2, json, re

def download():
    graph = {}
    page = urllib2.urlopen("https://bittrex.com/api/v1.1/public/getmarketsummaries")
    jsrates = json.loads(page.read())
    for pattern in jsrates['result'][0]['MarketName']:
        for key in jsrates['result'][0]['Ask']:
            matches = pattern.match(key)
            conversion_rate = -math.log(float(jsrates[key]))
            from_rate = matches.group(1).encode('ascii','ignore')
            to_rate = matches.group(2).encode('ascii','ignore')
            if from_rate != to_rate:
                if from_rate not in graph:
                    graph[from_rate] = {}
                graph[from_rate][to_rate] = float(conversion_rate)
    return graph
Now the problem is that there are multiple levels in the JSON: result > 0, 1, 2, etc.
(screenshot of the JSON structure)
for key in jsrates['result'][0]['Ask']:
I want the zero to be able to be any number (I don't know if that's clear), so that I can match every Ask price to its MarketName.
I have shortened the code so the post doesn't get too long.
Thanks
PS: sorry for my English, it's not my native language.
You could loop through all of the result values that are returned, ignoring the meaningless numeric index:
for result in jsrates['result'].values():
    ask = result.get('Ask')
    if ask is not None:
        # Do things with your ask...
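Putting that together with the original goal of building the conversion graph, a rough sketch might look as follows; it assumes each result entry has a MarketName shaped like "BTC-LTC" and an Ask price, handles result arriving as either a list or a dict, and skips markets with a missing or zero Ask:

import math, json
try:
    from urllib2 import urlopen          # Python 2, as in the question
except ImportError:
    from urllib.request import urlopen   # Python 3 fallback

def download():
    graph = {}
    page = urlopen("https://bittrex.com/api/v1.1/public/getmarketsummaries")
    jsrates = json.loads(page.read())

    results = jsrates['result']
    # 'result' may parse as a list (JSON array) or a dict keyed by index;
    # normalise it to a flat iterable either way.
    if isinstance(results, dict):
        results = results.values()

    for market in results:
        name = market.get('MarketName')   # e.g. "BTC-LTC" (assumed format)
        ask = market.get('Ask')
        if not name or '-' not in name or not ask:
            continue
        from_rate, to_rate = name.split('-', 1)
        if from_rate != to_rate:
            graph.setdefault(from_rate, {})[to_rate] = -math.log(float(ask))
    return graph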

Data gets mixed up while trying to transfer it to arangodb

I'm trying to transfer ca. 10 GB of JSON data (tweets, in my case) to a collection in ArangoDB, and I'm trying to use joblib for it:
from ArangoConn import ArangoConn
import Userdata as U
import encodings
from joblib import Parallel, delayed
import json
from glob import glob
import time

def progress(total, prog, start, stri = ""):
    if(prog == 0):
        print("")
        prog = 1;
    perc = prog / total
    diff = time.time() - start
    rem = (diff / prog) * (total - prog)
    bar = ""
    for i in range(0,int(perc*20)):
        bar = bar + "|"
    for i in range(int(perc*20),20):
        bar = bar + " "
    print("\r"+"progress: " + "[" + bar + "] " + str(prog) + " of " +
          str(total) + ": {0:.1f}% ".format(perc * 100) + "- " +
          time.strftime("%H:%M:%S", time.gmtime(rem)) + " " + stri, end="")

def processfile(filepath):
    file = open(filepath,encoding='utf-8')
    s = file.read()
    file.close()
    data = json.loads(s)
    Parallel(n_jobs=12, verbose=0, backend="threading")(
        map(delayed(ArangoConn.createDocFromObject), data))

files = glob(U.path+'/*.json')
i = 1
j = len(files)
starttime = time.time()
for f in files:
    progress(j,i,starttime,f)
    i = i+1
    processfile(f)
and
from pyArango.connection import Connection
import Userdata as U
import time

class ArangoConn:
    def __init__(self,server,user,pw,db,collectionname):
        self.server = server
        self.user = user
        self.pw = pw
        self.db = db
        self.collectionname = collectionname
        self.connection = None
        self.dbHandle = self.connect()
        if not self.dbHandle.hasCollection(name=self.collectionname):
            coll = self.dbHandle.createCollection(name=collectionname)
        else:
            coll = self.dbHandle.collections[collectionname]
        self.collection = coll

    def db_createDocFromObject(self, obj):
        data = obj.__dict__()
        doc = self.collection.createDocument()
        for key,value in data.items():
            doc[key] = value
        doc._key = str(int(round(time.time() * 1000)))
        doc.save()

    def connect(self):
        self.connection = Connection(arangoURL=self.server + ":8529",
                                     username=self.user, password=self.pw)
        if not self.connection.hasDatabase(self.db):
            db = self.connection.createDatabase(name=self.db)
        else:
            db = self.connection.databases.get(self.db)
        return db

    def disconnect(self):
        self.connection.disconnectSession()

    def getAllData(self):
        docs = []
        for doc in self.collection.fetchAll():
            docs.append(self.doc_to_result(doc))
        return docs

    def addData(self,obj):
        self.db_createDocFromObject(obj)

    def search(self,collection,search,prop):
        docs = []
        aql = """FOR q IN """+collection+""" FILTER q."""+prop+""" LIKE
            "%"""+search+"""%" RETURN q"""
        results = self.dbHandle.AQLQuery(aql, rawResults=False, batchSize=1)
        for doc in results:
            docs.append(self.doc_to_result(doc))
        return docs

    def doc_to_result(self,arangodoc):
        modstore = arangodoc.getStore()
        modstore["_key"] = arangodoc._key
        return modstore

    def db_createDocFromJson(self,json):
        for d in json:
            doc = self.collection.createDocument()
            for key,value in d.items():
                doc[key] = value
            doc._key = str(int(round(time.time() * 1000)))
            doc.save()

    @staticmethod
    def createDocFromObject(obj):
        c = ArangoConn(U.url, U.user, U.pw, U.db, U.collection)
        data = obj
        doc = c.collection.createDocument()
        for key, value in data.items():
            doc[key] = value
        doc._key = doc["id"]
        doc.save()
        c.connection.disconnectSession()
It kind of works like that. My problem is that the data that lands in the database is somehow mixed up.
As you can see in the screenshot, "id" and "id_str" are not the same, although they should be.
What I have investigated so far:
I thought that at some point the default keys in the database might "collide" because of the threading, so I set the key to the tweet id.
I tried to do it without multiple threads; threading doesn't seem to be the problem.
I looked at the data I send to the database and everything seems to be fine, but as soon as I communicate with the DB the data gets mixed up.
My professor thought that maybe something in pyArango isn't thread-safe and messes up the data, but I don't think so, as threading doesn't seem to be the problem.
I have no ideas left as to where this behavior could come from...
Any ideas?
The screenshot shows the following values:
id : 892886691937214500
id_str : 892886691937214465
It looks like somewhere along the way the value is converted to an IEEE754 double, which cannot safely represent the latter value. So there is potentially some precision loss due to conversion.
A quick example in Node.js (JavaScript uses IEEE754 doubles for any number values greater than 0xffffffff) shows that this is likely the cause of the problem:
$ node
> 892886691937214500
892886691937214500
> 892886691937214465
892886691937214500
So the question is where the conversion happens. Can you check whether the Python client program is actually sending the expected values to ArangoDB, or whether it already sends the converted/truncated values?
In general, any integer number that exceeds 0x7fffffffffffffff will be truncated when stored in ArangoDB, or converted to an IEEE754 double. This can be avoided by storing the number values inside a string, but of course comparing two number strings will produce different results than comparing two numbers (e.g. "10" < "9" vs. 10 > 9).
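For what it's worth, a small illustration from the Python side (a sketch, not a diagnosis of where the conversion happens in this particular pipeline): Python ints are arbitrary precision, so the loss only appears once the value passes through a double somewhere along the way, and storing the id as a string sidesteps it.

import json

n = 892886691937214465

# Python ints survive json.loads unchanged...
assert json.loads('{"id": 892886691937214465}')["id"] == n

# ...but any step that converts the value to a double loses the low bits:
print(int(float(n)))    # 892886691937214464 -- no longer the original value

# One workaround, as suggested above: store the identifier as a string,
# for example by using the tweet's already-string field:
#     doc["id"] = data["id_str"]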

Why are the videos on the most_recent standard feed so out of date?

I'm trying to grab the most recently uploaded videos. There's a standard feed for that - it's called most_recent. I don't have any problems grabbing the feed, but when I look at the entries inside, they're all half a year old, which is hardly recent.
Here's the code I'm using:
import requests
import os.path as P
import sys
from lxml import etree
import datetime

namespaces = {"a": "http://www.w3.org/2005/Atom", "yt": "http://gdata.youtube.com/schemas/2007"}
fmt = "%Y-%m-%dT%H:%M:%S.000Z"

class VideoEntry:
    """Data holder for the video."""
    def __init__(self, node):
        self.entry_id = node.find("./a:id", namespaces=namespaces).text
        published = node.find("./a:published", namespaces=namespaces).text
        self.published = datetime.datetime.strptime(published, fmt)

    def __str__(self):
        return "VideoEntry[id='%s']" % self.entry_id

def paginate(xml):
    root = etree.fromstring(xml)
    next_page = root.find("./a:link[@rel='next']", namespaces=namespaces)
    if next_page == None:
        next_link = None
    else:
        next_link = next_page.get("href")
    entries = [VideoEntry(e) for e in root.xpath("/a:feed/a:entry", namespaces=namespaces)]
    return entries, next_link

prefix = "https://gdata.youtube.com/feeds/api/standardfeeds/"
standard_feeds = set("top_rated top_favorites most_shared most_popular most_recent most_discussed most_responded recently_featured on_the_web most_viewed".split(" "))

feed_name = sys.argv[1]
assert feed_name in standard_feeds

feed_url = prefix + feed_name

all_video_ids = []
while feed_url is not None:
    r = requests.get(feed_url)
    if r.status_code != 200:
        break
    text = r.text.encode("utf-8")
    video_ids, feed_url = paginate(text)
    all_video_ids += video_ids

all_upload_times = [e.published for e in all_video_ids]
print min(all_upload_times), max(all_upload_times)
As you can see, it prints the min and max timestamps for the entire feed.
misha@misha-antec$ python get_standard_feed.py most_recent
2013-02-02 14:40:02 2013-02-02 14:54:00
misha@misha-antec$ python get_standard_feed.py top_rated
2006-04-06 21:30:53 2013-07-28 22:22:38
I've glanced through the downloaded XML and it appears to match the output. Am I doing something wrong?
Also, on an unrelated note, the feeds I'm getting each contain only about 100 entries (I'm paginating through them 25 at a time). Is this normal? I expected the feeds to be a bit bigger.
Regarding the "Most-Recent-Feed"-Topic: There is a ticket for this one here. Unfortunately, the YouTube-API-Teams doesn't respond or solved the problem so far.
Regarding the number of entries: That depends on the type of standardfeed, but for the most-recent-Feed it´s usually around 100.
Note: You could try using the "orderby=published" parameter to get recents videos, although I don´t know how "recent" they are.
https://gdata.youtube.com/feeds/api/videos?orderby=published&prettyprint=True
You can combine this query with the "category"-parameter or other ones (region-specific queries - like for the standard feeds - are not possible, afaik).
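If you want to experiment with that, a minimal sketch in the style of the question's code might look like this; orderby and prettyprint come from the URL above, while max-results and the commented-out category parameter are assumptions about the usual GData query parameters:

import requests

params = {
    "orderby": "published",
    "max-results": 25,
    # "category": "News",   # optional, as mentioned above
    "prettyprint": "true",
}
r = requests.get("https://gdata.youtube.com/feeds/api/videos", params=params)
print(r.status_code)
print(r.text[:500])   # inspect the first entries' <published> timestamps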
