I have a list of dictionaries, each containing strings, floats, and NumPy arrays, such as the one below:
import numpy as np

Product1 = {
    'Name': 'TLSK',
    'Name2': 'B1940',
    'Tagid': '23456222',
    'Cord': np.array(['09:42:23', '9', '-55:52:32', '9']),
    'Cord2': np.array([432.34, 222.115]),
    'Ref': 'Siegman corp. 234',
    'Exp': 22.0,
    'CX': np.array([0.00430, 0.00069, 0.00094])
}
I sometimes need access to certain elements of each dictionary for further calculations. What I do is first collect the dictionaries as follows:
Products = (Product1, Product2, Product3, ....)
Then I use a for loop to collect a certain element from each dictionary:
Expall = []
for i in Products:
    exp = i['Exp']
    Expall.append(exp)
To me this seems like inefficient/bad code, and I was wondering if there is a better way to do it. I am coming from IDL, where, for instance, you can access that information without a for loop: Expall = Products[*]['Exp']
Most of the time I even have to store the data first, and I use pickle in Python to do that. Since I am a bit new to Python and have heard a few good things about pandas and the like, I wanted to see if there is a more efficient/quicker way to handle all of this.
The following could work (list comprehension):
If you have the separate Product1, Product2, ... variables:
[product['Exp'] for product in [Product1, Product2]]
If you already have Products:
[product['Exp'] for product in Products]
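Since the question mentions pandas: a minimal sketch, assuming all the product dicts share the same keys, is to load Products into a DataFrame, which gives column-style access much like IDL's Products[*]['Exp']:

import pandas as pd

df = pd.DataFrame(Products)        # one row per product dict, one column per key
Expall = df['Exp'].tolist()        # the whole column at once, no explicit loop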
I've got a use case where I'm pulling in "links" to files on a server file share.
I then need to run some regex checks on these links and break them out into specific pieces so I can sort them.
Once sorted, I need to split the list among x servers to start pulling the files. The sorting is important as each split needs to be even.
['/local/custom_name/a_database/database_name_56_1118843/file_1.tgz']
['/local/custom_name/a_database/database_name_56_1118843/file_3.tgz']
['/local/custom_name/a_database/database_name_56_1118843/file_4.tgz']
['/local/custom_name/a_database/database_name_56_1118843/file_2.tgz']
['/local/custom_name/a_database/database_name_655_1118843/file_1.tgz']
['/local/custom_name/a_database/database_name_655_1118843/file_2.tgz']
['/local/custom_name/a_database/database_name_655_1118843/file_3.tgz']
['/local/custom_name/a_database/database_name_655_1118843/file_4.tgz']
['/local/custom_name_4/b_database/database_name_5242_11132428843/file_1.tgz']
['/local/custom_name_4/b_database/database_name_5242_11132428843/file_2.tgz']
['/shared/custom_name/c_database/database_name_56_1118843/file_1.tgz']
['/shared/custom_name/c_database/database_name_56_1118843/file_2.tgz']
['/local/custom_name_4/c_database/database_name_58_1118843/file_1.tgz']
['/local/custom_name/ac_database/database_name_58_1118843/file_2.tgz']
For example, since there are 8 files in a_database and 4 per name, then with say 4 servers, I'd need one file from each path to go to each server.
What I did was look at each link and then break out the path into a dictionary where the first value is a unique id:
{'uid' : 'local_custom_name_a_database_database_name_56', 'link_list': [] }
Then I go through the original list again and add any links that match to that dict:
{'uid' : 'local_custom_name_a_database_database_name_56', 'link_list': [
'/local/custom_name/a_database/database_name_56_1118843/file_1.tgz',
'/local/custom_name/a_database/database_name_56_1118843/file_3.tgz',
'/local/custom_name/a_database/database_name_56_1118843/file_4.tgz',
'/local/custom_name/a_database/database_name_56_1118843/file_2.tgz'
]}
Then split the link_list among the servers.
All of this works as intended; however, the second part, where I compare the original link to the new dictionary uid and add the link to the list, takes forever. With 10,000 items it takes a couple of minutes, but with 900,000 items it looks like it will take around 125 hours, which isn't OK.
The real data is more complicated and there's significant sorting going on, but that isn't the bottleneck. The bottleneck is where I described. While the logic works, I'm certain I'm not doing this in the most efficient way.
Any help is appreciated. Even just pointing me in the direction of a better way to handle this many items outside of native lists and lists of lists or dicts.
If performance is a concern, this type of data structure {'uid' : 'local_custom_name_a_database_database_name_56', 'link_list': [] } is going to be a problem. It's O(n) to find an element based on UID. Instead, you need a dictionary mapping the UID directly to the link list. This allows O(1) access. If needed, you can transform the data later.
I don't know the exact logic behind getting the UIDs, so I just have an example one:
l = [
'/local/custom_name/a_database/database_name_56_1118843/file_1.tgz',
'/local/custom_name/a_database/database_name_56_1118843/file_3.tgz',
'/local/custom_name/a_database/database_name_56_1118843/file_4.tgz',
'/local/custom_name/a_database/database_name_56_1118843/file_2.tgz',
'/local/custom_name/a_database/database_name_655_1118843/file_1.tgz',
'/local/custom_name/a_database/database_name_655_1118843/file_2.tgz',
'/local/custom_name/a_database/database_name_655_1118843/file_3.tgz',
'/local/custom_name/a_database/database_name_655_1118843/file_4.tgz',
'/local/custom_name_4/b_database/database_name_5242_11132428843/file_1.tgz',
'/local/custom_name_4/b_database/database_name_5242_11132428843/file_2.tgz',
'/shared/custom_name/c_database/database_name_56_1118843/file_1.tgz',
'/shared/custom_name/c_database/database_name_56_1118843/file_2.tgz',
'/local/custom_name_4/c_database/database_name_58_1118843/file_1.tgz',
'/local/custom_name/ac_database/database_name_58_1118843/file_2.tgz',
]
def getUid(s):
    # Drop the leading "/", strip the trailing "/file_x.tgz" and the final
    # "_<number>" suffix, then join the remaining path parts with "_".
    return s[1:].rpartition("/")[0].rpartition("_")[0].replace("/", "_")

result = {}
for s in l:
    result.setdefault(getUid(s), []).append(s)
print(result)
{'local_custom_name_a_database_database_name_56': ['/local/custom_name/a_database/database_name_56_1118843/file_1.tgz',
'/local/custom_name/a_database/database_name_56_1118843/file_3.tgz',
'/local/custom_name/a_database/database_name_56_1118843/file_4.tgz',
'/local/custom_name/a_database/database_name_56_1118843/file_2.tgz'],
'local_custom_name_a_database_database_name_655': ['/local/custom_name/a_database/database_name_655_1118843/file_1.tgz',
'/local/custom_name/a_database/database_name_655_1118843/file_2.tgz',
'/local/custom_name/a_database/database_name_655_1118843/file_3.tgz',
'/local/custom_name/a_database/database_name_655_1118843/file_4.tgz'],
'local_custom_name_4_b_database_database_name_5242': ['/local/custom_name_4/b_database/database_name_5242_11132428843/file_1.tgz',
'/local/custom_name_4/b_database/database_name_5242_11132428843/file_2.tgz'],
'shared_custom_name_c_database_database_name_56': ['/shared/custom_name/c_database/database_name_56_1118843/file_1.tgz',
'/shared/custom_name/c_database/database_name_56_1118843/file_2.tgz'],
'local_custom_name_4_c_database_database_name_58': ['/local/custom_name_4/c_database/database_name_58_1118843/file_1.tgz'],
'local_custom_name_ac_database_database_name_58': ['/local/custom_name/ac_database/database_name_58_1118843/file_2.tgz']}
Then, if needed:
transformed = [{"uid": k, "link_list": v} for k, v in result.items()]
print(transformed)
[{'uid': 'local_custom_name_a_database_database_name_56',
'link_list': ['/local/custom_name/a_database/database_name_56_1118843/file_1.tgz',
'/local/custom_name/a_database/database_name_56_1118843/file_3.tgz',
'/local/custom_name/a_database/database_name_56_1118843/file_4.tgz',
'/local/custom_name/a_database/database_name_56_1118843/file_2.tgz']},
{'uid': 'local_custom_name_a_database_database_name_655',
'link_list': ['/local/custom_name/a_database/database_name_655_1118843/file_1.tgz',
'/local/custom_name/a_database/database_name_655_1118843/file_2.tgz',
'/local/custom_name/a_database/database_name_655_1118843/file_3.tgz',
'/local/custom_name/a_database/database_name_655_1118843/file_4.tgz']},
{'uid': 'local_custom_name_4_b_database_database_name_5242',
'link_list': ['/local/custom_name_4/b_database/database_name_5242_11132428843/file_1.tgz',
'/local/custom_name_4/b_database/database_name_5242_11132428843/file_2.tgz']},
{'uid': 'shared_custom_name_c_database_database_name_56',
'link_list': ['/shared/custom_name/c_database/database_name_56_1118843/file_1.tgz',
'/shared/custom_name/c_database/database_name_56_1118843/file_2.tgz']},
{'uid': 'local_custom_name_4_c_database_database_name_58',
'link_list': ['/local/custom_name_4/c_database/database_name_58_1118843/file_1.tgz']},
{'uid': 'local_custom_name_ac_database_database_name_58',
'link_list': ['/local/custom_name/ac_database/database_name_58_1118843/file_2.tgz']}]
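The question also asks about splitting each link_list evenly among the servers; a minimal round-robin sketch on top of the result dict built above (NUM_SERVERS is a placeholder):

NUM_SERVERS = 4
server_batches = [[] for _ in range(NUM_SERVERS)]

# Deal the links out round-robin within each uid group, so each server
# ends up with roughly one file per path.
for links in result.values():
    for i, link in enumerate(links):
        server_batches[i % NUM_SERVERS].append(link)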
I'm using Python (requests) to query an API. The JSON response is a list of dictionaries, like below:
locationDescriptions = timeseries.publish.get('/GetLocationDescriptionList')['LocationDescriptions']
print(locationDescriptions[0])
{'Name': 'Test',
'Identifier': '000045',
'UniqueId': '3434jdfsiu3hk34uh8',
'IsExternalLocation': False,
'PrimaryFolder': 'All Locations',
'SecondaryFolders': [],
'LastModified': '2021-02-09T06:01:25.0446910+00:00'}
I'd like to extract 1 field (Identifier) as a list for further analysis (count, min, max, etc.) but I'm having a hard time figuring out how to do this.
Python has a syntax feature called "list comprehensions", and you can do something like:
identifiers = [item['Identifier'] for item in locationDescriptions]
Here is a small article that gives you more details, and also shows an alternate way using map. And here is one of the many resources detailing list comprehensions, should you need it.
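For reference, the map alternative mentioned there could look like this (using operator.itemgetter instead of a lambda):

from operator import itemgetter

identifiers = list(map(itemgetter('Identifier'), locationDescriptions))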
You could extract them with a list comprehension:
identifiers = [i['Identifier'] for i in locationDescriptions]
You allude to needing a list of numbers (count, min, max, etc...), in which case:
identifiers = [int(i['Identifier']) for i in locationDescriptions]
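From there, the analysis mentioned in the question is straightforward, for example:

print(len(identifiers), min(identifiers), max(identifiers))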
You can do
ids = [locationDescription['Identifier'] for locationDescription in locationDescriptions]
You will have the list of identifiers as strings.
Best regards
I am new to python, and I am trying to split a list of dictionaries into separate lists of dictionaries based on some condition.
This is how my list looks:
[{'username': 'AnastasiadesCY',
'created_at': '2020-12-02 18:58:16',
'id': 1.33421029132062e+18,
'language': 'en',
'contenttype': 'text/plain',
'content': 'Pleased to participate to the international conference in support of the Lebanese people. Cypriot citizens, together with the Government 🇨🇾, have provided significant quantities of material assistance, from the day of the explosion until today.\n\n#Lebanon 🇱🇧'},
{'username': 'AnastasiadesCY',
'created_at': '2020-11-19 18:13:06',
'id': 1.32948788307022e+18,
'language': 'en',
'contenttype': 'text/plain',
'content': '#Cyprus stand ready to support all efforts towards a coordinated approach of vaccination strategies across Europe, that will prove instrumental in our fight against the pandemic.\n\nUnited Against #COVID19 \n\n#EUCO'},...
I would like to split and group all of the list's elements that have the same username into separate lists of dictionaries. The elements of the list - so each dictionary - are ordered by username.
Is there a way to loop over the dictionaries and append each element to a list until username in "item 1" is equal to username in "item 1 + 1" and so on?
Thank you for your help!
Finding items with the same value works best if we first sort the list by that value - then all the same names end up next to each other.
But even after sorting, we don't need to do the grouping manually - there are already tools for that. :) See the itertools.groupby documentation and a nice explanation of how it works.
from itertools import groupby
from operator import itemgetter

my_list.sort(key=itemgetter("username"))

result = {}
for username, group in groupby(my_list, key=itemgetter("username")):
    result[username] = list(group)
result is a dict with usernames as keys
If you want a list-of-lists, do result = [] and then result.append(list(group)) instead.
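For example, that list-of-lists variant would be (a sketch, reusing the same sort and imports as above):

from itertools import groupby
from operator import itemgetter

my_list.sort(key=itemgetter("username"))

result = []
for username, group in groupby(my_list, key=itemgetter("username")):
    result.append(list(group))   # one inner list per username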
A better approach would be to create a dictionary with the username as key and a list of that user's attributes as value:
from collections import defaultdict, OrderedDict

op = defaultdict(list)
for user_dic in list_of_userdicts:
    op[user_dic.pop('username')].append(user_dic)
op = OrderedDict(sorted(op.items()))
List of dicts to separate lists, using pandas:
import pandas as pd

data = pd.DataFrame(your_list_of_dict)
username_list = data.username.values.tolist()
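Building on that DataFrame, the actual split into per-username lists of dicts could be done with DataFrame.groupby (a sketch):

# One list of row-dicts per username
split_by_user = {
    username: frame.to_dict('records')
    for username, frame in data.groupby('username')
}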
So I have a dictionary which is a hash object I'm getting from Redis, similar to the following dictionary:
source_data = {
b'key-1': b'{"age":33,"gender":"Male"}',
b'key-2': b'{"age":20,"gender":"Female"}'
}
My goal is to extract all the values from this dictionary and have them as a list of Python dictionaries, like so:
final_data = [
{
'age': 33,
'gender': 'Male'
},
{
'age': 20,
'gender': 'Female'
}
]
I tried list comprehension with json parsing:
import json
final_data = [json.loads(a) for a in source_data.values()]
It works, but for a large data set it takes too much time.
I switched to the third-party JSON module ujson, which is faster according to benchmarks, but I haven't noticed any improvement.
I tried using multi-threading:
from multiprocessing import Pool  # or multiprocessing.dummy for a thread-based Pool
import ujson

pool = Pool()
final_data = pool.map(ujson.loads, source_data.values(), chunksize=500)
pool.close()
pool.join()
I played a bit with chunksize, but the result is the same: it still takes too much time.
It would be super helpful if someone could suggest another solution or an improvement on my previous attempts; ideally I'd like to avoid using a loop altogether.
Assuming the values are, indeed, valid JSON, it might be faster to build a single JSON object to decode. I think it should be safe to just join the values into a single string.
>>> new_json = b'[%s]' % (b','.join(source_data.values()),)
>>> new_json
b'[{"age":33,"gender":"Male"},{"age":20,"gender":"Female"}]'
>>> json.loads(new_json)
[{'age': 33, 'gender': 'Male'}, {'age': 20, 'gender': 'Female'}]
This replaces the overhead of calling json.loads 2000+ times with the lesser overhead of a single call to b','.join and a single string-formatting operation.
For reference, I tried replicating the situation:
import json, timeit, random

source_data = { 'key-{}'.format(n).encode('ascii'):
                    '{{"age":{},"gender":"{}"}}'.format(
                        random.randint(18, 75),
                        random.choice(("Male", "Female"))
                    ).encode('ascii')
                for n in range(45000) }

timeit.timeit("{ k: json.loads(v) for (k,v) in source_data.items() }",
              number=1, globals={'json': json, 'source_data': source_data})
This completed in far less than a second. Times of over 30 seconds must come from something I'm not seeing.
My closest guess is that you had the data in some sort of proxy container, wherein each key fetch turned into a remote call, such as if using hscan rather than hgetall. A tradeoff between the two should be possible using the count hint to hscan.
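For illustration, assuming the hash is fetched with redis-py (the key name here is hypothetical), the two access patterns look like this:

import redis

r = redis.Redis()

# One round trip: fetch the whole hash at once.
source_data = r.hgetall('some-hash-key')

# Incremental scan: many round trips, tunable via the count hint.
source_data = dict(r.hscan_iter('some-hash-key', count=1000))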
Proper profiling should reveal where the delays come from.
I'm pretty new to Python, so I'm having a hard time even coming up with the proper jargon to describe my issue.
Basic idea is I have a dict that has the following structure:
myDict = {
    "SomeMetric": {
        "day": [
            {"date": "2013-01-01", "value": 1234},
            {"date": "2013-01-02", "value": 5678},
            etc...
I want to pull out the "value" where the date is known. So I want:
myDict["SomeMetric"]["day"]["value"] where myDict["SomeMetric"]["day"]["date"] = "2013-01-02"
Is there a nice one-line method for this without iterating through the whole dict? My dict is much larger and I'm already iterating through it, so I'd rather not do nested iteritems.
Generator expressions to the rescue:
next(d['value']
     for d in myDict['SomeMetric']['day']
     if d['date'] == "2013-01-02")
So, loop over all day dictionaries, and find the first one that matches the date you are looking for. This loop stops as soon as a match is found.
Do you have control over your data structure? It seems to be constructed in such a way that lends itself to sub-optimal lookups.
I'd structure it as such:
data = { 'metrics': { '2013-01-02': 1234, '2013-01-01': 4321 } }
And then your lookup is simply:
data['metrics']['2013-01-02']
Can you change the structure? If you can, you might find it much easier to change the day list to a dictionary which has dates as keys and values as values, so
myDict = {
    "SomeMetric": {
        "day": {
            "2013-01-01": 1234,
            "2013-01-02": 5678,
            etc...
Then you can just index into it directly with
myDict["SomeMetric"]["day"]["2013-01-02"]