Extract data from dictionary as fast as possible - python

I have a dictionary d with around 500 main keys (name1, name2, etc.). Each value is itself a small dictionary with 5 keys (ppty1, ppty2, etc.), and the corresponding values are floats converted to strings.
I want to extract data faster than I presently do, based on a list of lists of the form ['name1', 'ppty3', 'ppty4'] (name1 could be any other nameX, and ppty3 and ppty4 could be any other pptyX).
In my application, I have many dictionaries, but they differ only by the values of the fields ppty1, ..., ppty5. All the keys are "static". I do not care if there are some preliminary operations; I would just like the processing time of one dictionary to be, ideally, much faster than now. My naive implementation, which loops over every field, takes about 3 ms.
Here is the code to generate d and fields; this is just to simulate dummy data, it does not need to be improved:
import random
random.seed(314)

# build dictionary
def make_small_dict():
    d = {}
    for i in range(5):
        key = "ppty" + str(i)
        d[key] = str(random.random())
    return d

d = {}
for i in range(500):
    d["name" + str(i)] = make_small_dict()

# build fields
def make_row():
    line = ['name' + str(random.randint(0, 499))]
    line += ['ppty' + str(random.randint(0, 4)) for i in range(2)]
    return line

fields = [0] * 300
for i in range(300):
    fields[i] = [make_row() for j in range(3)]
For example, fields[0] returns
[['name420', 'ppty1', 'ppty1'],
['name206', 'ppty1', 'ppty2'],
['name21', 'ppty2', 'ppty4']]
so the first entry of the output should be something like
[[d['name420']['ppty1'], d['name420']['ppty1']],
 [d['name206']['ppty1'], d['name206']['ppty2']],
 [d['name21']['ppty2'], d['name21']['ppty4']]]
My solution:
import time

start = time.time()
data = [0] * len(fields)
i = 0
for field in fields:
    data2 = [0] * 3
    j = 0
    for row in field:
        lst = [d[row[0]][key] for key in [row[1], row[2]]]
        data2[j] = lst
        j += 1
    data[i] = data2
    i += 1
print(time.time() - start)
My main question is: how can I improve my code? A few additional questions:
Later, I need to do some operations such as column extraction and basic operations on some entries of data: would you recommend storing the extracted values directly in an np.array?
How can I avoid extracting the same values multiple times (fields has some redundant rows such as ['name1', 'ppty3', 'ppty4'])?
I read that things such as i += 1 take a little bit of time; how can I avoid them?

This was tough to read, so I started by breaking bits out into functions, then tested whether that worked using just a list comprehension. It's already faster: a comparison over 10000 runs with timeit showed this code runs in about 64% of the original code's time.
In this case I kept everything in lists to force execution so it is directly comparable, but you could use generators or map, and that'd push the computation back to when the data is actually consumed.
def row_lookup(name, key1, key2):
    return (d[name][key1], d[name][key2])  # tuple is faster to construct than a list

def field_lookup(field):
    return [row_lookup(*row) for row in field]

start = time.time()
result = [field_lookup(field) for field in fields]
print(time.time() - start)
print(data == result)

# without dupes in fields (note: groupby only collapses consecutive duplicates)
from itertools import groupby
result = [field_lookup(field) for field, _ in groupby(fields)]
Change just the result assignment line to:
result = map(field_lookup, fields)
And the runtime becomes negligible, because map is lazy (in Python 3 it returns an iterator), so it's not actually going to compute the data until you ask for the results. This is not a fair comparison, but if you're not going to consume all the data, you'd save time. Change the list comprehensions in the functions to generators and you'd get the same benefit there too. Multiprocessing and asyncio didn't improve performance in this case.
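For instance, a small sketch of that lazy behaviour in Python 3 (nothing is looked up until the map object is iterated):
lazy_result = map(field_lookup, fields)   # no lookups happen yet
first_field = next(lazy_result)           # lookups for the first field happen here
remaining = list(lazy_result)             # forces the rest of the lookups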
If you can change the structure, you can preprocess your fields into a flat list of rows [['nameX', 'pptyX', 'pptyX'], ...]. In this case, you can change it to just a single list comprehension, which gets this down to about 29% of the original runtime, ignoring the preprocessing to slim the fields.
from itertools import groupby, chain
slim_fields = [row for row, _ in groupby(chain.from_iterable(fields))]
results = [(d[name][key1], d[name][key2]) for name, key1, key2 in slim_fields]
In this case, results is just a list of tuples containing the values: [(value1, value2)..]
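On the np.array question: since the stored values are floats converted to strings, you could convert the extracted results into a float array in one step and then do column extraction by slicing. A minimal sketch, assuming NumPy is available and results has the [(value1, value2), ...] shape above:
import numpy as np

arr = np.array(results, dtype=float)   # shape (len(slim_fields), 2); the strings are parsed as floats
first_column = arr[:, 0]               # column extraction becomes a cheap slice
row_sums = arr.sum(axis=1)             # basic operations vectorise over the whole array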

Related

accelerate comparing dictionary keys and values to strings in list in python

Sorry if this is trivial, I'm still learning, but I have a list of dictionaries that looks as follows:
[{'1102': ['00576', '00577', '00578', '00579', '00580', '00581']},
{'1102': ['00582', '00583', '00584', '00585', '00586', '00587']},
{'1102': ['00588', '00589', '00590', '00591', '00592', '00593']},
{'1102': ['00594', '00595', '00596', '00597', '00598', '00599']},
{'1102': ['00600', '00601', '00602', '00603', '00604', '00605']}
...]
It contains ~89000 dictionaries. And I have a list containing 4473208 paths, for example:
['/****/**/******_1102/00575***...**0CT.csv',
'/****/**/******_1102/00575***...**1CT.csv',
'/****/**/******_1102/00575***...**2CT.csv',
'/****/**/******_1102/00575***...**3CT.csv',
'/****/**/******_1102/00575***...**4CT.csv',
'/****/**/******_1102/00578***...**1CT.csv',
'/****/**/******_1102/00578***...**2CT.csv',
'/****/**/******_1102/00578***...**3CT.csv',
...]
What I want to do is, for each dictionary, collect together the paths whose folder matches the key and whose file number is one of the values grouped under that key.
I tried using for loops like this:
grpd_cts = []
for elem in tqdm(dict_list):
    temp1 = []
    for file in ct_paths:
        for key, val in elem.items():
            if (file[16:20] == key) and (any(x in file[21:26] for x in val)):
                temp1.append(file)
    grpd_cts.append(temp1)
But this takes around 30 hours. Is there a way to make it more efficient? Any itertools function or something?
Thanks a lot!
ct_paths is iterated repeatedly in your inner loop, and you're only interested in a little bit of it for testing purposes; pull that out and use it to index the rest of your data, as a dictionary.
What makes your problem complicated is that you want to end up with the original list of filenames, so you need to construct a two-level dictionary whose values are lists of all the original paths grouped under those two keys.
ct_path_index = {}
for f in ct_paths:
    ct_path_index.setdefault(f[16:20], {}).setdefault(f[21:26], []).append(f)

grpd_cts = []
for elem in tqdm(dict_list):
    temp1 = []
    for key, val in elem.items():
        d2 = ct_path_index.get(key)
        if d2:
            for v in val:
                v2 = d2.get(v)
                if v2:
                    temp1 += v2
    grpd_cts.append(temp1)
ct_path_index looks like this, using your data:
{'1102': {'00575': ['/****/**/******_1102/00575***...**0CT.csv',
'/****/**/******_1102/00575***...**1CT.csv',
'/****/**/******_1102/00575***...**2CT.csv',
'/****/**/******_1102/00575***...**3CT.csv',
'/****/**/******_1102/00575***...**4CT.csv'],
'00578': ['/****/**/******_1102/00578***...**1CT.csv',
'/****/**/******_1102/00578***...**2CT.csv',
'/****/**/******_1102/00578***...**3CT.csv']}}
The use of setdefault (which can be a little hard to understand the first time you see it) is important when building up collections of collections, and is very common in these kinds of cases: it makes sure that the sub-collections are created on demand and then re-used for a given key.
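If setdefault reads awkwardly, collections.defaultdict gives the same on-demand creation; here is a sketch of an equivalent way to build the index (the lookup loop above is unchanged, since .get() does not trigger the default factory):
from collections import defaultdict

# Missing keys automatically get a fresh sub-dict, and missing sub-keys a fresh list.
ct_path_index = defaultdict(lambda: defaultdict(list))
for f in ct_paths:
    ct_path_index[f[16:20]][f[21:26]].append(f)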
Now, you've only got two nested loops; the inner checks are done using dictionary lookups, which are close to O(1).
Other optimizations would include turning the lists in dict_list into sets, which would be worthwhile if you made more than one pass through dict_list.
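For example, one way to do that conversion up front (a sketch; the one-time cost only pays off if dict_list is reused for several passes):
# Replace each value list with a set, so membership tests against it are O(1).
dict_list = [{key: set(val) for key, val in elem.items()} for elem in dict_list]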

What is faster: adding a key-value pair or checking for the existence of a key?

Imagine a CSV file with 3 columns: individual name, group name, group ID.
Obviously column 1 is different for every line, while columns 2 and 3 can repeat from line to line (every group name has its own ID though). The file is not sorted in any way.
For reasons I'm creating a dict to save: group ID (key) --> group name (value).
Now, which of the following variants is faster?
Checking if the key already exists and only saving if not:
if ID not in group_dict:
    group_dict[ID] = name
Just saving it every time again (replacing the value, which is the same anyway):
group_dict[ID] = name
It's really best to profile the code when you have a question like this. Python provides the timeit module, which is useful for this purpose. Here is some code you can use to experiment with:
import timeit

setup_code = """
import random
keysize = 20
valsize = 32
store = dict()
data = [(random.randint(0, 2**keysize), random.randint(0, 2**valsize)) for _ in range(1000000)]
"""

query = """
for key, val in data:
    if key not in store:
        store[key] = val
"""

no_query = """
for key, val in data:
    store[key] = val
"""

if __name__ == "__main__":
    print(timeit.timeit(stmt=query, setup=setup_code, number=1))
    print(timeit.timeit(stmt=no_query, setup=setup_code, number=1))
The performance of the code depends on the number of duplicate keys in the data. In this code, if you increase keysize you will have fewer duplicates, and checking the dict first will be slower, since the check rarely saves a write. Conversely, if you reduce keysize the number of duplicates increases and checking the dict first starts to perform better. The takeaway here is that the number of duplicate keys you have determines which of these approaches is preferable.
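A third variant worth timing the same way is dict.setdefault, which does the existence check and the insert in a single call; a small sketch, reusing the setup_code from the script above:
setdefault_stmt = """
for key, val in data:
    store.setdefault(key, val)
"""
print(timeit.timeit(stmt=setdefault_stmt, setup=setup_code, number=1))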

Is there a way to do it faster?

ladder has around 15000 elements, and this code snippet runs in 5-8 seconds. Is there any way to do it faster? I tried doing it without checking for duplicates and without creating the accs list, and the time went down to 2-3 seconds, but I don't want duplicates in the csv file.
I'm working in Python 2.7.9.
import codecs

accs = []
with codecs.open('test.csv', 'w', encoding="UTF-8") as out:
    row = ''
    for element in ladder:
        if element['account']['name'] not in accs:
            accs.append(element['account']['name'])
            row += element['account']['name']
            if 'twitch' in element['account']:
                row += "," + element['account']['twitch']['name'] + ","
            else:
                row += ",,"
            row += str(element['account']['challenges']['total']) + "\n"
    out.write(row)
seen = set()
results = []
for user in ladder:
    acc = user['account']
    name = acc['name']
    if name not in seen:
        seen.add(name)
        twitch_name = acc['twitch']['name'] if "twitch" in acc else ''
        challenges = acc['challenges']['total']
        results.append("%s,%s,%d" % (name, twitch_name, challenges))

with codecs.open('test.csv', 'w', encoding="UTF-8") as out:
    out.write("\n".join(results))
You can’t do much about the loop, since you need to go through every element in ladder after all. However, you can improve this membership test:
if element['account']['name'] not in accs:
Since accs is a list, this will essentially loop through all items of accs and check if the name is in there. And you loop for every element in ladder, so this can easily become very inefficient.
Instead, use a set rather than a list for accs, as this gives you constant-time membership lookups. That reduces your algorithm from quadratic to linear complexity. For that, just use accs = set() and change your code to use accs.add() instead of append.
Another issue is that you are doing string concatenation. Every time you do someString + "something" you are throwing away that string object and creating a new one. This can become inefficient for a high number of operations too. Instead, use a list here to collect all the elements you want to write, and then join them:
row = []
row.append(element['account']['name'])
if 'twitch' in element['account']:
    row.append(element['account']['twitch']['name'])
else:
    row.append('')
row.append(str(element['account']['challenges']['total']))
out.write(','.join(row))
out.write('\n')
Alternatively, since you are writing to a file anyway, you could just call out.write multiple times with each string part.
Finally, you could also look into the csv module if you are interested in writing out CSV data.
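A minimal sketch of that last suggestion, assuming the same ladder structure as in the question (note that on Python 2 the csv module works with byte strings, so this assumes the names are ASCII-safe or already encoded):
import csv

seen = set()
with open('test.csv', 'wb') as f:            # csv expects a binary-mode file on Python 2
    writer = csv.writer(f)
    for element in ladder:
        name = element['account']['name']
        if name in seen:
            continue                         # skip duplicate accounts
        seen.add(name)
        twitch = element['account']['twitch']['name'] if 'twitch' in element['account'] else ''
        writer.writerow([name, twitch, element['account']['challenges']['total']])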

Custom sort method in Python is not sorting list properly

I'm a student in a Computing class and we have to write a program which contains file handling and a sort. I've got the file handling done and I wrote out my sort (it's a simple sort) but it doesn't sort the list. My code is this:
namelist = []
scorelist = []

hs = open("hst.txt", "r")
namelist = hs.read().splitlines()

hss = open("hstscore.txt", "r")
for line in hss:
    scorelist.append(int(line))

scorelength = len(scorelist)
for i in range(scorelength):
    for j in range(scorelength + 1):
        if scorelist[i] > scorelist[j]:
            temp = scorelist[i]
            scorelist[i] = scorelist[j]
            scorelist[j] = temp
return scorelist
I've not been doing Python for very long, so I know the code may not be efficient, but I really don't want to use a completely different method for sorting it, and we're not allowed to use .sort() or sorted() since we have to write our own sort function. Is there something I'm doing wrong?
def super_simple_sort(my_list):
    switched = True
    while switched:
        switched = False
        for i in range(len(my_list) - 1):
            if my_list[i] > my_list[i + 1]:
                my_list[i], my_list[i + 1] = my_list[i + 1], my_list[i]
                switched = True

super_simple_sort(some_list)
print some_list
This is a very simple sorting implementation that is equivalent to yours but takes advantage of a few things to speed it up: we only need one for loop, we only need to repeat as long as the list is out of order, and Python doesn't require a temp variable for swapping values.
Since it changes the actual list values in place, you don't even need to return anything.
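For example, with some made-up scores:
some_list = [50, 10, 40, 20, 30]   # hypothetical sample data
super_simple_sort(some_list)
print some_list                    # prints [10, 20, 30, 40, 50]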

Nested for loop to search 2 lists

Using: Python 2.4
Currently, I have a nested for loop that iterates over 2 lists and makes a match based on two elements that exist in both lists. Once a match has been found, it takes the element from the r120Final list and puts it in a new list called "r120Delta":
for r120item in r120Final:
    for spectraItem in spectraFinal:
        if (str(spectraItem[0]) == r120item[2].strip()) and (str(spectraItem[25]) == r120item[10]):
            r120Delta.append(r120item)
            break
The problem is that this is SO SLOW and the lists aren't that deep. The R120 is about 64,000 lines and the Spectra is about 150,000 lines.
The r120Final list is a nested array and it looks like so:
r120Final[0] = [['xxx','xxx','12345','xxx','xxx','xxx','xxx','xxx','xxx','xxx','234567']]
...
r120Final[n] = [['xxx','xxx','99999','xxx','xxx','xxx','xxx','xxx','xxx','xxx','678901']]
The spectraFinal list is essentially the same, a nested array and it looks like so:
spectraFinal[0] = [['12345','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','234567']]
...
spectraFinal[n] = [['99999','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','xxx','678901']]
Finally, the reason for the "r120Delta" is so that way I can then do a list differential between r120Final and r120Delta and retrieve r120 data elements that were NOT matched. This is the function I defined for this very task, and again, slow:
def listDiff(diffList, completeList):
    returnList = []
    for completeItem in completeList:
        if completeItem not in diffList:
            returnList.append(completeItem)
    return returnList
Basically, I'm knowledgeable in Python but by no means an expert. I'm looking for some experts to show me how to speed this up. Any help is appreciated!
spectra_set = set((str(spectraItem[0]), str(spectraItem[25])) for spectraItem in spectraFinal)

returnList = []
for r120item in r120Final:
    if (r120item[2].strip(), r120item[10]) not in spectra_set:
        returnList.append(r120item)
This will add all items that didn't match to the returnList.
You can do the filtering in one line (if you really want), reusing the spectra_set built above so it isn't rebuilt for every item:
returnList = [r120item for r120item in r120Final
              if (r120item[2].strip(), r120item[10]) not in spectra_set]
If you need the full spectraItem:
spectra_dict = dict(((str(spectraItem[0]), str(spectraItem[25])), spectraItem) for spectraItem in spectraFinal)

returnList = []
for r120item in r120Final:
    key = (r120item[2].strip(), r120item[10])
    if key not in spectra_dict:
        returnList.append(r120item)
    else:
        return_item = some_function_of(r120item, spectra_dict[key])
        returnList.append(return_item)
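If you also still want the r120Delta list of matched items (as in your original code), one pass over r120Final with the same set gives you both; a minimal sketch:
spectra_set = set((str(spectraItem[0]), str(spectraItem[25])) for spectraItem in spectraFinal)

r120Delta = []      # items that matched something in spectraFinal
returnList = []     # items that did not match (what listDiff used to compute)
for r120item in r120Final:
    if (r120item[2].strip(), r120item[10]) in spectra_set:
        r120Delta.append(r120item)
    else:
        returnList.append(r120item)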
