How to speed up dictionary build in Python

I have looked at the related links, but nothing appears to apply. I am doing what I thought would be a simple build of three dictionaries that I use elsewhere. They are not all that large, but this function takes almost 4 minutes to complete. I am likely missing something, and I would like this to run faster. This is Python 3.4.
class VivifiedDictionary(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value
def dict_build(exclude_chrY):
    coordinate_intersection_dict = VivifiedDictionary()
    aberration_list_dict = VivifiedDictionary()
    gene_list_dict = VivifiedDictionary()
    if eval(exclude_chrY):
        chr_y = ""
    else:
        chr_y = "chrY"
    abr_type_list = ["del", "ins"]
    mouse_list = ["chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chrX", chr_y]
    for chrom in mouse_list:
        for aberration in abr_type_list:
            coordinate_intersection_dict[chrom][aberration] = []
            aberration_list_dict[chrom][aberration] = []
            gene_list_dict[chrom][aberration] = []

Please check The Python Profilers (the profiling chapter of the standard library documentation). I think this may help in finding the bottleneck in your bigger script.
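As a rough illustration of the profiler suggestion, here is a minimal sketch using cProfile and pstats from the standard library. The names build_dicts and profile_call are made up for this example, and build_dicts is only a stand-in for the real function from the question, so treat it as a starting point rather than a diagnosis.

```python
import cProfile
import io
import pstats


def build_dicts():
    """Hypothetical stand-in for the function being investigated."""
    chroms = ["chr%d" % i for i in range(1, 20)] + ["chrX", "chrY"]
    return {c: {kind: [] for kind in ("del", "ins")} for c in chroms}


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call chains come first,
    # and limit the report to the ten top entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


if __name__ == "__main__":
    _, report = profile_call(build_dicts)
    print(report)
```

If the report shows that the dictionary build itself takes a negligible share of the time, the four minutes are probably being spent elsewhere in the script.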

Related

Python - How to iterate over nested data

I've seen a lot of information on how to iterate over nested data, but none seem to be applicable to my problem - perhaps my method of nesting is not ideal...
I'm working on a web scraping problem to increase the efficiency of my workflow (breaking out and presenting data in a better format than the website provides). I have a class called ContractorData that contains information about a contractor. It has a property subs, which is a list that contains more ContractorData objects, and this nesting can continue (each contractor can have subcontractors, and those subcontractors can have subcontractors ...)
I have an efficient way to build this hierarchy, but I'm now struggling to find a way to iterate over every contractor down the hierarchy.
class ContractorData():
    def __init__(self, soup:BeautifulSoup, parentId=None):
        self.contractorName = soup.find('a', id=True).getText()
        self.id = int(soup.find('a', id=True).get('href').split('&')[-1].split('=')[1])
        status_options = ['Enrolled', 'Excluded', 'Pending']
        self.status = 'Unknown'
        for i in range(3):
            cls = f'contractorStatusCol{i+1}'
            if 'gray' not in soup.find('div', class_=cls).find('img').get('src'):
                self.status = status_options[i]
                break
        self.date_enrolled = soup.find('div', class_ = 'contractorStatusCol4').get_text().replace(u'\xa0', '')
        self.has_subs = soup.find('a', class_="expand") is not None
        self.subs = []
        self.parent = parentId
Building the hierarchy (general_contractor is the top of the hierarchy):
def load_subs(pid, cid, level):
    sub_data = {'mode': 'loadEnrollmentCRUD', 'projectId': pid, 'contractorId': cid, 'level':level}
    sub_resp = session.post('https://my-site.com/ajax-contractor.html', data=sub_data)
    sub_soup = BeautifulSoup(sub_resp.text, 'html.parser')
    return [ContractorData(x, cid) for x in sub_soup.find_all('div', class_='contractorStatusCRUD')]
level = 0
has_sub_list = [general_contractor]
while has_sub_list:
    new_sub_list = []
    level = level + 1
    for x in has_sub_list:
        x.subs = load_subs(PROJECT_ID, x.id, level)
        new_sub_list.extend([y for y in x.subs if y.has_subs])
    has_sub_list = new_sub_list
I'm thinking to use another while loop like the one I used to build the data, but I can't help but think my data architecture isn't optimal for this type of problem.
Edit based on comments:
The goal is to traverse through all contractors under the general_contractor and check their status to see if I need to take action on them.
I did just learn about a recursive function that can call itself which may work for this.
def walk(sub):
    if sub.status != 'Enrolled':
        # handle this case
        pass
    if sub.subs:
        walk(sub.subs)
Thanks!

Recursive functions to the rescue! I didn't know this was possible, but it's very crisp and clean IMO.
flags = []
def check_subs(contractor:ContractorData):
    if contractor.status == 'Pending':
        flags.append(contractor)
    for x in contractor.subs:
        check_subs(x)
check_subs(general_contractor)
for c in flags:
    print(f'{c.contractorName}, {c.status}')
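For very deep hierarchies, or when you would rather not collect into a global list, the same walk can be written as a generator with an explicit stack. This is only a sketch: Node is a hypothetical stand-in for ContractorData (just a name, a status, and subs), and iter_tree is a name invented here.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical stand-in for ContractorData: a name, a status, and subs."""
    name: str
    status: str = "Enrolled"
    subs: List["Node"] = field(default_factory=list)


def iter_tree(root: Node) -> Iterator[Node]:
    """Yield root and every descendant, depth-first, without recursion."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        # reversed() keeps children in their original left-to-right order
        stack.extend(reversed(node.subs))


if __name__ == "__main__":
    tree = Node("gc", subs=[Node("a", subs=[Node("a1", "Pending")]), Node("b")])
    for n in iter_tree(tree):
        if n.status == "Pending":
            print(f"{n.name}, {n.status}")
```

Because iter_tree yields every node, the status check stays at the call site and the traversal can be reused for other purposes.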

Python - Mocking a ZipFile

I am trying to get further into testing with Python, and right now I am stuck trying to write a test for the following code:
def get_files(zip_path: Path):
    archive = zipfile.ZipFile(os.path.join(os.path.dirname(__file__), '..', zip_path))
    python_files = []
    for x in archive.filelist:
        if x.filename.endswith(".py"):
            python_files.append(x)
    return python_files
The test I came up with looks like this:
@mock.patch('zipfile.ZipFile')
def test_get_files(mock_zipfile):
    mock_zipfile.return_value.filelist.return_value = [zipfile.ZipInfo('py_file.py'), zipfile.ZipInfo('py_file.py'),
                                                       zipfile.ZipInfo('any_file.any')]
    nodes = get_ast_nodes(Path('/dummy/path/archive.zip'))
    assert len(nodes) == 2
But I am not able to get the test to pass, nor do I have any idea what is going wrong.

In case someone is looking this up, I might as well add the answer. filelist is a plain attribute, not a method, so the list has to be assigned to it directly rather than to its return_value. This is how I got it to work:
@mock.patch('zipfile.ZipFile')
def test_get_files(mock_zipfile):
    mock_zipfile.return_value.filelist = [zipfile.ZipInfo('py_file.py'), zipfile.ZipInfo('py_file.py'),
                                          zipfile.ZipInfo('any_file.any')]
    nodes = get_python_files(zipfile.ZipFile("dummy"))
    assert len(nodes) == 2
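For reference, here is a self-contained sketch of the same idea that can be run as-is. The get_files below is a simplified version of the function from the question (it takes the path directly), and run_patched is a helper name invented for this example.

```python
import zipfile
from unittest import mock


def get_files(zip_path):
    """Return the ZipInfo entries whose names end in .py."""
    archive = zipfile.ZipFile(zip_path)
    return [info for info in archive.filelist if info.filename.endswith(".py")]


def run_patched():
    """Call get_files with zipfile.ZipFile replaced by a mock."""
    with mock.patch("zipfile.ZipFile") as mock_zipfile:
        # filelist is an attribute of the ZipFile instance, so the fake list
        # is assigned directly on return_value (the mocked instance).
        mock_zipfile.return_value.filelist = [
            zipfile.ZipInfo("a.py"),
            zipfile.ZipInfo("b.py"),
            zipfile.ZipInfo("notes.txt"),
        ]
        return get_files("/dummy/path/archive.zip")


if __name__ == "__main__":
    print([info.filename for info in run_patched()])
```

No archive is opened here: the patched class returns a mock instance, so the dummy path never touches the file system.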

Simplifying code blocks with functions [Python]

I'm searching for a way to simplify my code via functions. 90% of my operations are the same and differ only in the if condition.
E.g.
if isFile:
    fFound = False
    for key in files:
        if item["path"] in key:
            fFound = True
            for c in cmds.keys():
                if item["path"] in cmds[c]["files"]:
                    ifxchecker(item["requiredIFX"], cmds[c]["ifx_match"])
                    outputCFG()
    if not fFound:
        notFound.append(item['path'])
else:
    dir = item["path"][:-1]
    pFound = False
    for key in files:
        if dir in key:
            pFound = True
            for c in cmds.keys():
                for file in cmds[c]["files"]:
                    if dir in file:
                        ifxchecker(item["requiredIFX"], cmds[c]["ifx_match"])
                        outputCFG()
    if not pFound:
        notFound.append(dir)
My code is working fine; I'm just trying to move most of it into a function so that only these small if conditions differ. I can't find a way to simplify it, and I'm not even sure there is one.
I wrote some small functions, as you can see, but I think there should be a better way to simplify the whole construct.

Unfortunately I can't test it, because several variables and functions are not defined, but it should work. If you prefer, you can pass a boolean is_dir instead of elem: replace elem with is_dir in the signature and add the following line to the beginning of the function:
elem = item["path"][:-1] if is_dir else item["path"]
def do_stuff(elem, files, item, cmds):
    fFound = False
    for key in files:
        if elem in key:
            fFound = True
            for c in cmds.keys():
                if elem in cmds[c]["files"]:
                    ifxchecker(item["requiredIFX"], cmds[c]["ifx_match"])
                    outputCFG()
    if not fFound:
        return elem
    # implicitly returns None when elem was found
if isFile:
    res = do_stuff(item["path"], files, item, cmds)
else:
    res = do_stuff(item["path"][:-1], files, item, cmds)
if res is not None:
    notFound.append(res)

I solved it like this, using @azro's method:
def cfgFunction(x):
    global file
    fFound = False
    for file in files:
        if x in file:
            fFound = True
            for group in cmds.keys():
                if x in cmds[group]["files"]:
                    ifxchecker(item["requiredIFX"], cmds[group]["ifx_match"])
                    outputCFG()
    if not fFound:
        notFound.append(x)

Adding multiprocessing to existing Python script?

I'm trying to add multiprocessing to an existing password cracker, the source of which is located here: https://github.com/axcheron/pyvboxdie-cracker
The script works great but it's really slow; adding multiprocessing should speed it up. I've looked online (and on here) for some examples, but I've hit a wall of complete information overload and just can't get my head around it. I found a really helpful post on here by the user Camon (posted here: Python Multiprocessing password cracker) but I can't see how I can implement it in the script.
def crack_keystore(keystore, dict):
    wordlist = open(dict, 'r')
    hash = get_hash_algorithm(keystore)
    count = 0
    print("\n[*] Starting bruteforce...")
    for line in wordlist.readlines():
        kdf1 = PBKDF2HMAC(algorithm=hash, length=keystore['Key_Length'], salt=keystore['Salt1_PBKDF2'],
                          iterations=keystore['Iteration1_PBKDF2'], backend=backend)
        aes_key = kdf1.derive(line.rstrip().encode())
        cipher = Cipher(algorithms.AES(aes_key), modes.XTS(tweak), backend=backend)
        decryptor = cipher.decryptor()
        aes_decrypt = decryptor.update(keystore['Enc_Password'])
        kdf2 = PBKDF2HMAC(algorithm=hash, length=keystore['KL2_PBKDF2'], salt=keystore['Salt2_PBKDF2'],
                          iterations=keystore['Iteration2_PBKDF2'], backend=backend)
        final_hash = kdf2.derive(aes_decrypt)
        if random.randint(1, 20) == 12:
            print("\t%d password tested..." % count)
        count += 1
        if binascii.hexlify(final_hash).decode() == binascii.hexlify(keystore['Final_Hash'].rstrip(b'\x00')).decode():
            print("\n[*] Password Found = %s" % line.rstrip())
            exit(0)
    print("\t[-] Password Not Found. You should try another dictionary.")
This is the part of the script that I need to edit. The example by Carmon has a function to split the wordlist into chunks, and each process is given its own chunk. The problem I have implementing it is that the wordlist is only populated inside the function (after other tasks have been completed, full source on repo). How would I add multiprocessing to this section? Thanks for any help.

from multiprocessing import Process
# keystore = some_value
# dict1, dict2, dict3, dict4
proc_1 = Process(target=crack_keystore, args=(keystore, dict1))
proc_2 = Process(target=crack_keystore, args=(keystore, dict2))
proc_3 = Process(target=crack_keystore, args=(keystore, dict3))
proc_4 = Process(target=crack_keystore, args=(keystore, dict4))
proc_1.start()
proc_2.start()
proc_3.start()
proc_4.start()
proc_1.join()
proc_2.join()
proc_3.join()
proc_4.join()
print("All processes successfully ended!")
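Here dict1 to dict4 are four separate wordlist files, one per process. Since this work is CPU-bound, starting more processes than your CPU has cores will generally not make it any faster.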
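If you would rather keep a single wordlist, one option is to read it once, split it into chunks in memory, and hand the chunks to a multiprocessing.Pool. The sketch below is an assumption-laden illustration, not a drop-in patch: hashlib.pbkdf2_hmac stands in for the script's PBKDF2HMAC/AES pipeline, and chunked, check_chunk and crack_parallel are names invented here.

```python
import hashlib
from multiprocessing import Pool, cpu_count


def chunked(items, n_chunks):
    """Split items into at most n_chunks roughly equal, contiguous lists."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def check_chunk(args):
    """Return the first word in the chunk whose derived key matches, else None."""
    words, salt, iterations, target = args
    for word in words:
        # Stand-in for the real key derivation and decryption steps.
        derived = hashlib.pbkdf2_hmac("sha256", word.encode(), salt, iterations)
        if derived == target:
            return word
    return None


def crack_parallel(words, salt, iterations, target, workers=None):
    """Test all words across worker processes; return the match or None."""
    workers = workers or cpu_count()
    jobs = [(chunk, salt, iterations, target) for chunk in chunked(words, workers)]
    with Pool(workers) as pool:
        # imap_unordered yields results as chunks finish, so we can return
        # early; leaving the with-block terminates the remaining workers.
        for found in pool.imap_unordered(check_chunk, jobs):
            if found is not None:
                return found
    return None


if __name__ == "__main__":
    salt = b"demo-salt"
    target = hashlib.pbkdf2_hmac("sha256", b"hunter2", salt, 1000)
    print(crack_parallel(["a", "b", "hunter2", "c"], salt, 1000, target))
```

The `if __name__ == "__main__":` guard matters on platforms that start worker processes by re-importing the main module, such as Windows.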

How do I make it so I only need my api key referenced once?

I am teaching myself how to use Python and Django to access the Google Places API to make nearby searches for different types of gyms.
I was only taught how to use Python and Django with databases you build locally.
I wrote out a full GET request for each of the four different searches I am doing. I looked up examples but none seem to work for me.
allgyms = requests.get('https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=38.9208,-77.036&radius=2500&type=gym&key=AIzaSyDOwVK7bGap6b5Mpct1cjKMp7swFGi3uGg')
all_text = allgyms.text
alljson = json.loads(all_text)
healthclubs = requests.get('https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=38.9208,-77.036&radius=2500&type=gym&keyword=healthclub&key=AIzaSyDOwVK7bGap6b5Mpct1cjKMp7swFGi3uGg')
health_text = healthclubs.text
healthjson = json.loads(health_text)
crossfit = requests.get('https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=38.9208,-77.036&radius=2500&type=gym&keyword=crossfit&key=AIzaSyDOwVK7bGap6b5Mpct1cjKMp7swFGi3uGg')
cross_text = crossfit.text
crossjson = json.loads(cross_text)
I really would like to be pointed in the right direction on how to have the api key referenced only one time while changing the keywords.

Try this for better readability and better reusability
BASE_URL = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json?'
LOCATION = '38.9208,-77.036'
RADIUS = '2500'
TYPE = 'gym'
API_KEY = 'AIzaSyDOwVK7bGap6b5Mpct1cjKMp7swFGi3uGg'
KEYWORDS = ''
allgyms = requests.get(BASE_URL+'location='+LOCATION+'&radius='+RADIUS+'&type='+TYPE+'&key='+API_KEY)
all_text = allgyms.text
alljson = json.loads(all_text)
KEYWORDS = 'healthclub'
healthclubs = requests.get(BASE_URL+'location='+LOCATION+'&radius='+RADIUS+'&type='+TYPE+'&keyword='+KEYWORDS+'&key='+API_KEY)
health_text = healthclubs.text
healthjson = json.loads(health_text)
KEYWORDS = 'crossfit'
crossfit = requests.get(BASE_URL+'location='+LOCATION+'&radius='+RADIUS+'&type='+TYPE+'&keyword='+KEYWORDS+'&key='+API_KEY)
cross_text = crossfit.text
crossjson = json.loads(cross_text)
As V-R suggested in a comment, you can go further and define a function, which makes things more reusable and lets you call it from other places in your application.
Function implementation
def makeRequest(location, radius, type, keywords):
    BASE_URL = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json?'
    API_KEY = 'AIzaSyDOwVK7bGap6b5Mpct1cjKMp7swFGi3uGg'
    result = requests.get(BASE_URL+'location='+location+'&radius='+radius+'&type='+type+'&keyword='+keywords+'&key='+API_KEY)
    # parse the response body, not the Response object itself
    jsonResult = json.loads(result.text)
    return jsonResult
Function invocation
# use a name other than json so the json module is not shadowed
gyms_json = makeRequest('38.9208,-77.036', '2500', 'gym', '')
Let me know if there is an issue
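A variation on the same idea is to build the query string from a dict, so the key is passed in from one place and the keyword is only added when present. This is a sketch using only the standard library; build_search_url and the GOOGLE_PLACES_API_KEY environment variable are names assumed for this example, not anything the API requires.

```python
import os
from urllib.parse import urlencode

BASE_URL = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"


def build_search_url(api_key, keyword=None, location="38.9208,-77.036",
                     radius=2500, place_type="gym"):
    """Return a Nearby Search URL; keyword is included only when given."""
    params = {"location": location, "radius": radius, "type": place_type}
    if keyword:
        params["keyword"] = keyword
    params["key"] = api_key
    # urlencode percent-escapes the values, so the comma in the location
    # and any spaces in a keyword are handled for us.
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical variable name: read the key once from the environment.
    key = os.environ.get("GOOGLE_PLACES_API_KEY", "")
    for kw in (None, "healthclub", "crossfit"):
        print(build_search_url(key, kw))
```

Each resulting URL can be passed to requests.get exactly as in the code above. Reading the key from the environment also keeps it out of the source file.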
