Removing JSON items from an array if a value is a duplicate in Python

I am incredibly new to Python.
I have an array full of JSON objects. Some of the JSON objects contain duplicated values. The array looks like this:
[{"id":"1","name":"Paul","age":"21"},
{"id":"2","name":"Peter","age":"22"},
{"id":"3","name":"Paul","age":"23"}]
What I am trying to do is remove an item if its name is the same as another JSON object's, and keep the first one in the array.
So in this case I should be left with
[{"id":"1"."name":"Paul","age":"21"},
{"id":"2","name":"Peter","age":"22"}]
The code I currently have can be seen below and is largely based on this answer:
import json
ds = json.loads('python.json') #this file contains the json
unique_stuff = { each['name'] : each for each in ds }.values()
all_ids = [ each['name'] for each in ds ]
unique_stuff = [ ds[ all_ids.index(text) ] for text in set(texts) ]
print unique_stuff
I am not even sure that the line ds = json.loads('python.json') is working, as when I try to print ds nothing shows up in the console.

You might have overcomplicated your approach. I would tend to rewrite the list as a dictionary with "name" as the key and then fetch the values:
ds = [{"id":"1","name":"Paul","age":"21"},
{"id":"2","name":"Peter","age":"22"},
{"id":"3","name":"Paul","age":"23"}]
{elem["name"]:elem for elem in ds}.values()
Out[2]:
[{'age': '23', 'id': '3', 'name': 'Paul'},
{'age': '22', 'id': '2', 'name': 'Peter'}]
Of course, the items within the dictionary and the list may not be ordered, but I do not see that as much of a concern. If it is, let us know and we can think it over.
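Worth noting: on Python 3.7+ dictionaries preserve insertion order, so the comprehension above deterministically keeps the last record for each name. If you need the first record instead, a minimal sketch is to build the dict from the reversed list (overwritten keys keep their original position):

ds = [{"id": "1", "name": "Paul", "age": "21"},
      {"id": "2", "name": "Peter", "age": "22"},
      {"id": "3", "name": "Paul", "age": "23"}]

# Reversing the input means the first occurrence of each name is written last, so it wins.
unique = list({elem["name"]: elem for elem in reversed(ds)}.values())
print(unique)
# [{'id': '1', 'name': 'Paul', 'age': '21'},
#  {'id': '2', 'name': 'Peter', 'age': '22'}]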

If you need to keep the first instance of "Paul" in your data, a dictionary comprehension gives you the opposite result.
A simple solution could be as follows:
new = []
seen = set()
for record in old:
    name = record['name']
    if name not in seen:
        # first time this name appears: remember it and keep the record
        seen.add(name)
        new.append(record)
del seen
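For instance, run against the question's sample data (with old holding the already-parsed list), this gives:

old = [{"id": "1", "name": "Paul", "age": "21"},
       {"id": "2", "name": "Peter", "age": "22"},
       {"id": "3", "name": "Paul", "age": "23"}]

new = []
seen = set()
for record in old:
    if record['name'] not in seen:
        seen.add(record['name'])
        new.append(record)

print(new)
# [{'id': '1', 'name': 'Paul', 'age': '21'},
#  {'id': '2', 'name': 'Peter', 'age': '22'}]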

First of all, your JSON snippet has an invalid format: there is a dot instead of a comma separating some keys.
You can solve your problem using a dictionary with names as keys:
import json

with open('python.json') as fp:
    ds = json.load(fp)  # this file contains the JSON

mem = {}
for record in ds:
    name = record["name"]
    if name not in mem:
        mem[name] = record  # keep only the first record per name

print(mem.values())
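As a side note on why the original attempt appeared to do nothing: json.loads parses a JSON string, while json.load reads from an open file object, so passing a filename to json.loads makes it try to parse the filename itself as JSON, which fails rather than reading the file. A quick sketch of the difference:

import json

# json.loads parses a JSON *string*:
ds = json.loads('[{"id": "1", "name": "Paul", "age": "21"}]')

# json.load parses an open *file* object:
with open('python.json') as fp:
    ds = json.load(fp)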

Related

Iterate and append faster through millions of dictionaries in Python

I'm writing a script to handle millions of dictionaries from many files of 1 million lines each.
The main purpose of this script is to create a JSON file and send it to the Elasticsearch bulk API.
What I'm trying to do is read lines from the "entity" files and, for those entities, find their matching addresses in "sub" files (for sub-entities). My problem is that the function which should associate them takes far too long for a single iteration, and even after trying to optimize it as much as possible, the association is still insanely slow.
So, just to be clear about the data structure:
Entities are objects like Persons: an id, a uniqueId, a name, a list of postal addresses, a list of email addresses:
{id: 0, uniqueId: 'Z5ER1ZE5', name: 'John DOE', postalList: [], emailList: []}
Sub-entities are objects representing different types of addresses (email, postal, etc.): {personUniqueId: 'Z5ER1ZE5', 'Email': 'john.doe@gmail.com'}
So, I read the file contents with pandas using pd.read_csv(filename).
To optimize as much as possible, I've decided to handle every iteration over the data using multiprocessing (which works fine, even if I haven't handled RAM usage yet):
## Using a manager to be able to pass my main object and update it through processes
manager = multiprocessing.Manager()
main_obj = manager.dict({
    'dataframes': manager.dict(),
    'dicts': manager.dict(),
    'json': manager.dict()
})

## Just an example of how I use multiprocessing
pool = multiprocessing.Pool()
result = pool.map(partial(func, obj=main_obj), data_list_to_iterate)
pool.close()
pool.join()
I also have some reference tables: identifiers is a dict with entity names as keys and their uniqueId field as values, and sub_associations is a dict with sub-entity names as keys and their related collection as values:
sub_associations = {
    'Persons_emlAddr': 'emailList',
    'Persons_pstlAddr': 'postalList'
}
identifiers = {
    'Animals': 'uniqueAnimalId',
    'Persons': 'uniquePersonId',
    'Persons_emlAddr': 'uniquePersonId',
    'Persons_pstlAddr': 'uniquePersonId'
}
That said, I'm hitting a big problem in the function that fetches the sub-entities for each entity:
for key in list(main_obj['dicts'].keys()):
    main_obj['json'][key] = ''
    with mp.Pool() as stringify_pool:
        res_stringify = stringify_pool.map(partial(convert_to_json, obj=main_obj, name=key),
                                           main_obj['dicts'][key]['records'])
        stringify_pool.close()
        stringify_pool.join()
This is where I call the problematic function. I feed it the keys of main_obj['dicts'], where each key is just an entity filename (Persons, Animals, ...), and main_obj['dicts'][key] is a dict of the form {name: 'Persons', records: []}, where records is the list of entity dicts I need to iterate over.
def convert_to_json(item, obj, name):
    global sub_associations
    global identifiers
    dump = ''
    subs = [val for val in sub_associations.keys() if val.startswith(name)]
    if subs:
        for sub in subs:
            df = obj['dataframes'][sub]
            id_name = identifiers[name]
            sub_items = df[df[id_name] == item[id_name]].to_dict('records')
            if sub_items:
                item[sub_associations[sub]] = sub_items
            else:
                item[sub_associations[sub]] = []
    index = {
        "index": {
            "_index": name,
            "_id": item[identifiers[name]]
        }
    }
    dump += f'{json.dumps(index)}\n'
    dump += f'{json.dumps(item)}\n'
    obj['json'][name] += dump
    return 'Done'
Does anyone have an idea what the main issue could be, and how I could change it to make it faster?
If you need any additional information, or if I haven't been clear on something, feel free to ask.
Thank you in advance! :)
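One plausible bottleneck (not confirmed by the question) is the line df[df[id_name] == item[id_name]], which scans the whole sub-entity DataFrame once per entity. A minimal sketch, using trimmed-down stand-ins for the identifiers and dataframes structures above, of pre-grouping each DataFrame once so every association becomes a constant-time dict lookup:

import pandas as pd

# Hypothetical stand-ins for the structures described in the question.
identifiers = {'Persons_emlAddr': 'uniquePersonId'}
dataframes = {
    'Persons_emlAddr': pd.DataFrame([
        {'uniquePersonId': 'Z5ER1ZE5', 'Email': 'john.doe@gmail.com'},
    ])
}

# Group each sub-entity DataFrame once, up front: uniqueId -> list of records.
grouped = {
    sub: {uid: grp.to_dict('records')
          for uid, grp in df.groupby(identifiers[sub])}
    for sub, df in dataframes.items()
}

# Inside convert_to_json, the per-entity scan
#     sub_items = df[df[id_name] == item[id_name]].to_dict('records')
# would then become a constant-time lookup:
sub_items = grouped['Persons_emlAddr'].get('Z5ER1ZE5', [])
print(sub_items)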

Read a pickled dictionary Python

I'm working with a pickled file in Python, and I need to extract the data from it. The data was saved as a dictionary:
I read it like this:
import pickle
data = pickle.load( open("MyData.p", "rb") )
I read one dictionary:
data[0]
[{'StartTime': '2018-04-01 11:11:28',
'Name': 'AA',
'StudyName': '2018{AF4}',
'Data': [(10829.162109375,
13013.4033203125),
(11050.34375,
13063.3125),
(11514.7509765625,
13103.005859375)],
'Times': (5514.899,
5542.091,
5952.291),
'startOffset': 0.0}]
and get all the fields, and I can see them when printed. One of the fields is called "StartTime". However, when I try to access the field it says
data[0]["StartTime"]
TypeError: list indices must be integers or slices, not str
Same with all fields.
How can I access the fields individually?
You can always just pretty print the data to see what you got:
import pprint
pprint.pprint(data)
In your specific case, try this:
print(data[0][0]["StartTime"])
There is another nested list, so you need to select element 0 from it as well:
data[0][0]["StartTime"]
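If you are unsure how deeply nested a pickled object is, checking the type at each level makes the extra list obvious; a quick sketch:

import pickle

with open("MyData.p", "rb") as f:
    data = pickle.load(f)

print(type(data))        # list
print(type(data[0]))     # list -> one more level to unwrap
print(type(data[0][0]))  # dict -> now string keys work
print(data[0][0]["StartTime"])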

Extract values from json-file which has no unique markers

A JSON file which has unique markers (or, more appropriately, field names) preceding the values is rather easy to dissect, because you can perform a string search on the unique markers/field names to find the first and last positions of the value's characters within the string, and with that info you can pinpoint the value and extract it.
I have performed that kind of extraction with various Lua scripts and Python scripts (also on XML files).
Now I need to extract values from a JSON file which does not have unique markers/field names, but just multiple occurrences of "value_type" and "value" preceding the name and the value respectively; see below.
{
  "software_version": "NRZ-2017-099",
  "age": "78",
  "sensordatavalues": [
    {"value_type": "SDS_P1", "value": "4.43"},
    {"value_type": "SDS_P2", "value": "3.80"},
    {"value_type": "temperature", "value": "20.10"},
    {"value_type": "humidity", "value": "44.50"},
    {"value_type": "samples", "value": "614292"},
    {"value_type": "min_micro", "value": "233"},
    {"value_type": "max_micro", "value": "25951"},
    {"value_type": "signal", "value": "-66"}
  ]
}
The approach described above does not provide a working solution here.
Question: in this JSON file layout, how can I directly extract the specific, individual values (preferably with a Lua script)?
[Or might XML parsing provide an easier solution?]
Here is Python to read the JSON file and make it more convenient:
import json
import pprint

with open("/tmp/foo.json") as j:
    data = json.load(j)

for sdv in data.pop('sensordatavalues'):
    data[sdv['value_type']] = sdv['value']

pprint.pprint(data)
The results:
{'SDS_P1': '4.43',
'SDS_P2': '3.80',
'age': '78',
'humidity': '44.50',
'max_micro': '25951',
'min_micro': '233',
'samples': '614292',
'signal': '-66',
'software_version': 'NRZ-2017-099',
'temperature': '20.10'}
You might want to have a look into filter functions.
E.g., in your example JSON, to get only the dict that contains the value for samples, you could go with:
sample_sensordata = list(filter(lambda d: d["value_type"] == "samples", your_json_dict["sensordatavalues"]))
sample_value = sample_sensordata[0]["value"]  # filter returns a list, so take the first match
To make a dictionary like Ned Batchelder said, you could also go with a dict comprehension like this:
sensor_data_dict = {d['value_type']: d['value'] for d in your_json_dict['sensordatavalues']}
and then get the value you want just by sensor_data_dict['<ValueTypeYouAreLookingFor>']
A little bit late, and I'm using Anvil, where the previous answers didn't work; just for the curious people:
resp = anvil.http.request("http://<ipaddress>/data.json", json=True)
#print(resp) # prints json file
tempdict = resp['sensordatavalues'][2].values()
humiddict = resp['sensordatavalues'][3].values()
temperature = float(list(tempdict)[1])
humidity = float(list(humiddict)[1])
print(temperature)
print(humidity)
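Note that the hard-coded indexes [2] and [3] assume temperature and humidity always sit at the same positions in sensordatavalues. A sketch of a more robust variant (same resp structure assumed) that looks entries up by value_type instead:

# Build a value_type -> value mapping so the order in the JSON no longer matters.
readings = {d['value_type']: d['value'] for d in resp['sensordatavalues']}
temperature = float(readings['temperature'])
humidity = float(readings['humidity'])
print(temperature, humidity)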

Update dictionary if in list

I'm running through an Excel file, reading it line by line to create dictionaries and append them to a list, so I have a list like:
myList = []
and a dictionary in this format:
dictionary = {'name': 'John', 'code': 'code1', 'date': [123,456]}
so I do this: myList.append(dictionary), so far so good. Now I go on to the next line, where I have a pretty similar dictionary:
dictionary_two = {'name': 'John', 'code': 'code1', 'date': [789]}
I'd like to check if I already have a dictionary with 'name' = 'John' in myList so I check it with this function:
def checkGuy(dude_name):
    return any(d['name'] == dude_name for d in myList)
Currently I'm writing this function to add the guys to the list:
def addGuy(row_info):
    if not checkGuy(row_info[1]):
        myList.append({'name': row_info[1], 'code': row_info[0], 'date': [row_info[2]]})
    else:
        # HELP HERE
in this else I'd like to dict.update(updated_dict) but I don't know how to get the dictionary here.
Could someone help so dictionary appends the values of dictionary_two?
I would modify checkGuy to something like:
def findGuy(dude_name):
    for d in myList:
        if d['name'] == dude_name:
            return d
    else:
        return None  # or use pass
And then do:
def addGuy(row_info):
    guy = findGuy(row_info[1])
    if guy is None:
        myList.append({'name': row_info[1], 'code': row_info[0], 'date': [row_info[2]]})
    else:
        guy.update(updated_dict)
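Here updated_dict is a placeholder from the question. If the actual goal is to accumulate the dates of a repeated name rather than overwrite them (the question's dictionary_two carries a new date list), the else branch could append instead; a sketch, assuming row_info is laid out as [code, name, date] and findGuy from above:

def addGuy(row_info):
    guy = findGuy(row_info[1])
    if guy is None:
        myList.append({'name': row_info[1], 'code': row_info[0], 'date': [row_info[2]]})
    else:
        guy['date'].append(row_info[2])  # accumulate dates for an existing name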
This answer is based on the comments, where it was suggested that if "name" is the only criterion to search on, it could be used as a key in a dictionary instead of using a list.
master = {"John": {'code': 'code1', 'date': [123, 456]}}

def addGuy(row_info):
    key = row_info[1]
    code = row_info[0]
    date = row_info[2]
    if master.get(key):
        master.get(key).update({"code": code, "date": date})
    else:
        master[key] = {"code": code, "date": date}
If you dict.update the existing data each time you see a repeated name, your code can be reduced to a dict of dicts built right where you read the file. Calling update on existing dicts with the same keys overwrites the values, leaving you with the last occurrence, so even if you had multiple "John" dicts they would all contain the exact same data by the end.
def read_file():
    results = {name: {"code": code, "date": date}
               for code, name, date in how_you_read_into_rows}
If you actually think that the values get appended somehow, you are wrong; if you wanted that, you would need a very different approach. If you actually want to gather the dates and codes per user, then use a defaultdict, appending the (code, date) pair to a list with the name as the key:
from collections import defaultdict

d = defaultdict(list)

def read_file():
    for code, name, date in how_you_read_into_rows:
        d[name].append([code, date])
Or some variation depending on what you want the final output to look like.
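For the two John rows from the question, the defaultdict version would collect (a small runnable sketch):

from collections import defaultdict

rows = [('code1', 'John', [123, 456]),
        ('code1', 'John', [789])]

d = defaultdict(list)
for code, name, date in rows:
    d[name].append([code, date])

print(dict(d))
# {'John': [['code1', [123, 456]], ['code1', [789]]]}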

How do I turn list values into an array with an index that matches the other dic values?

Hoping someone can help me out. I've spent the past couple of hours trying to solve this, and fair warning, I'm still fairly new to Python.
This is a repost of a question I recently deleted. I had misrepresented my code in the last example. The correct example is:
I have a list of dictionaries that looks similar to:
dic = [
    {
        'name': 'john',
        'items': ['pants_1', 'shirt_2', 'socks_3']
    },
    {
        'name': 'bob',
        'items': ['jacket_1', 'hat_1']
    }
]
I'm using .append for both 'name', and 'items', which adds the dic values into two new lists:
for x in dic:
    dic_name.append(dic['name'])
    dic_items.append(dic['items'])
I need to split the item value using '_' as the delimiter, so I've also split the values by doing:
name, items = ([i if i is None else i.split('_')[0] for i in dic_name],
               [i if i is None else i.split('_')[0] for i in chain(*dic_items)])
None is used in case there is no value. This provides me with new lists for name and items, split on the delimiter. Disregard the fact that I split names on '_' in this example.
When I use this, the indexes for name and items no longer match. Do I need to create the listed items in an array that matches the name index, and if so, how?
Ideally, I want name[0] (which is john), to also match items[0] (as an array of the items in the list, so pants, shirt, socks). This way when I refer to index 0 for name, it also grabs all the values for items as index 0. The same thing regarding the index used for bob [1], which should match his items with the same index.
@avinash-raj, thanks for your patience; I've had to update my question to reflect more closely the code I'm working with.
I'm reading a little bit between the lines but are you trying to just collapse the list and get rid of the field names, e.g.:
>>> dic = [{'name': 'john', 'items':['pants_1','shirt_2','socks_3']},
{'name': 'bob', 'items':['jacket_1','hat_1']}]
>>> data = {d['name']: dict(i.split('_') for i in d['items']) for d in dic}
>>> data
{'bob': {'hat': '1', 'jacket': '1'},
'john': {'pants': '1', 'shirt': '2', 'socks': '3'}}
Now the data is directly related vs. indirectly related via a common index into 2 lists. If you want the dictionary split out you can always
>>> dic_name, dic_items = zip(*data.items())
>>> dic_name
('bob', 'john')
>>> dic_items
({'hat': '1', 'jacket': '1'}, {'pants': '1', 'shirt': '2', 'socks': '3'})
Your data needs to stay a list of dictionaries, because the duplicate name and items keys would overwrite each other in a single dict; you can build the two lists with comprehensions:
items = [[i.split('_')[0] for i in d['items']] for d in your_list]
names = [d['name'] for d in your_list] # then grab names from list
Alternatively, you can do this in one line with the built-in zip function and generators, like so:
names, items = zip(*((i['name'], [j.split('_')[0] for j in i['items']]) for i in dic))
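Applied to the sample dic from the question, the one-liner keeps the indexes aligned (a small runnable sketch):

dic = [{'name': 'john', 'items': ['pants_1', 'shirt_2', 'socks_3']},
       {'name': 'bob', 'items': ['jacket_1', 'hat_1']}]

names, items = zip(*((i['name'], [j.split('_')[0] for j in i['items']]) for i in dic))
print(names)  # ('john', 'bob')
print(items)  # (['pants', 'shirt', 'socks'], ['jacket', 'hat'])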
From Looping Techniques in the Tutorial.
for name, item_list in dic.items():
    names.append(name)
    items.append(item_list)
That will work if your dict is structured
{'name': [item1]}
In the loop body of
for x in dic:
    dic_name.append(dic['name'])
    dic_items.append(dic['items'])
you'll probably want to access x (to which the items in dic will be assigned in turn) rather than dic.
