I have a dict of NumPy arrays:
{'data1': array([[0.16461831, 0.82400555],
[0.02958593, 0.483629 ],
[0.51268564, 0.07030046],
[0.17027816, 0.35304705]]),
'data2': array([[0.8292598 , 0.78136548],
[0.30389913, 0.69250432],
[0.66608852, 0.42237639],
[0.72678807, 0.40486951]]),
'data3': array([[0.45614633, 0.96677904],
[0.87066105, 0.75826116],
[0.39431988, 0.73041888],
[0.65685809, 0.65498308]])}
Expected output:
[([0.16461831, 0.82400555], [0.8292598 , 0.78136548], [0.45614633, 0.96677904]),
([0.02958593, 0.483629 ], [0.30389913, 0.69250432], [0.87066105, 0.75826116]),
([0.51268564, 0.07030046], [0.66608852, 0.42237639], [0.39431988, 0.73041888]),
([0.17027816, 0.35304705], [0.72678807, 0.40486951], [0.65685809, 0.65498308])]
But when I try with zip:
list(zip(data.values()))
I get this output:
[(array([[0.16461831, 0.82400555],
[0.02958593, 0.483629 ],
[0.51268564, 0.07030046],
[0.17027816, 0.35304705]]),),
(array([[0.8292598 , 0.78136548],
[0.30389913, 0.69250432],
[0.66608852, 0.42237639],
[0.72678807, 0.40486951]]),),
(array([[0.45614633, 0.96677904],
[0.87066105, 0.75826116],
[0.39431988, 0.73041888],
[0.65685809, 0.65498308]]),)]
How do I zip a list of NumPy arrays?
Use
list(zip(*data.values()))
Output:
[(array([0.16461831, 0.82400555]),
array([0.8292598 , 0.78136548]),
array([0.45614633, 0.96677904])),
(array([0.02958593, 0.483629 ]),
array([0.30389913, 0.69250432]),
array([0.87066105, 0.75826116])),
(array([0.51268564, 0.07030046]),
array([0.66608852, 0.42237639]),
array([0.39431988, 0.73041888])),
(array([0.17027816, 0.35304705]),
array([0.72678807, 0.40486951]),
array([0.65685809, 0.65498308]))]
If a 3D array works for you, you can just stack on the 2nd axis (axis=1):
np.stack(data.values(), axis=1)
#[[[0.16461831 0.82400555]
# [0.8292598 0.78136548]
# [0.45614633 0.96677904]]
# [[0.02958593 0.483629 ]
# [0.30389913 0.69250432]
# [0.87066105 0.75826116]]
# [[0.51268564 0.07030046]
# [0.66608852 0.42237639]
# [0.39431988 0.73041888]]
# [[0.17027816 0.35304705]
# [0.72678807 0.40486951]
# [0.65685809 0.65498308]]]
The following code will create your output:
tmp = [data[d].tolist() for d in data]
tmp = list(zip(*tmp))
output:
[([0.16461831, 0.82400555], [0.8292598, 0.78136548], [0.45614633, 0.96677904]), ([0.02958593, 0.483629], [0.30389913, 0.69250432], [0.87066105, 0.75826116]), ([0.51268564, 0.07030046], [0.66608852, 0.42237639], [0.39431988, 0.73041888]), ([0.17027816, 0.35304705], [0.72678807, 0.40486951], [0.65685809, 0.65498308])]
The * operator unpacks data.values(), so each array is passed to zip as a separate positional argument.
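To see why the * is needed, here is a minimal sketch with toy arrays (the names and values here are made up for illustration):

```python
import numpy as np

data = {
    'a': np.array([[1, 2], [3, 4]]),
    'b': np.array([[5, 6], [7, 8]]),
}

# zip(*data.values()) is equivalent to zip(data['a'], data['b']):
# the * operator unpacks each 2-D array into a separate positional
# argument, so zip pairs up their rows.
rows = list(zip(*data.values()))
# rows == [(array([1, 2]), array([5, 6])), (array([3, 4]), array([7, 8]))]
```

Without the *, zip receives a single argument (the whole dict_values object) and simply wraps each array in a 1-tuple, which is the output the question shows.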
Another way:
[tuple(x) for x in np.stack(data.values(),axis=1).tolist()]
[([0.16461831, 0.82400555], [0.8292598, 0.78136548], [0.45614633, 0.96677904]),
([0.02958593, 0.483629], [0.30389913, 0.69250432], [0.87066105, 0.75826116]),
([0.51268564, 0.07030046], [0.66608852, 0.42237639], [0.39431988, 0.73041888]),
([0.17027816, 0.35304705], [0.72678807, 0.40486951], [0.65685809, 0.65498308])]
I have a dictionary like this:
no_empty_keys = {'783': [['4gsx', 'ADTQGS', 0.3333333333333333, {'A': ['A224', 'T226'], 'B': ['A224', 'T226']}, 504, 509], ['4gt0', 'ADTQGS', 0.3333333333333333, {'A': ['A224', 'T226'], 'B': ['A224', 'T226']}, 504, 509]],'1062': [['4gsx', 'AELTGY', 0.5, {'A': ['L175', 'T176', 'Y178'], 'B': ['L175', 'T176', 'Y178']}, 453, 458], ['4gt0', 'AELTGY', 0.5, {'A': ['L175', 'T176', 'Y178'], 'B': ['L175', 'T176', 'Y178']}, 453, 458]]}
My function to transform that into a CSV is this one:
epitope_df = pd.DataFrame(columns=['Epitope ID', 'PDB', 'Percent Identity', 'Epitope Mapped', 'Epitope Sequence', 'Starting Position', 'Ending Position'])
for x in no_empty_keys:
for y in no_empty_keys[x]:
epitope_df = epitope_df.append({'Epitope ID': x, 'PDB': y[0], 'Percent Identity': y[2], 'Epitope Mapped' : y[3], 'Epitope Sequence' : y[1], 'Starting Position' : y[4], 'Ending Position' : y[5]}, ignore_index=True)
epitope_df.to_csv('test.csv', index=False)
My output is a CSV file.
It is working, but it isn't well optimized. The process is very slow when I run it on a dictionary with more than 10,000 entries. Any ideas on how to speed this process up? Thank you for your time.
I'd start by getting rid of DataFrame.append (deprecated since pandas 1.4 and removed in 2.0). Appending rows to DataFrames is inefficient. You can create a DataFrame in one go:
result = []
for x in no_empty_keys:
for y in no_empty_keys[x]:
result.append(
{
'Epitope ID': x,
'PDB': y[0],
'Percent Identity': y[2],
'Epitope Mapped': y[3],
'Epitope Sequence': y[1],
'Starting Position': y[4],
'Ending Position': y[5]
}
)
epitope_df = pd.DataFrame.from_records(result)
epitope_df.to_csv('new.csv', index=False)
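The same idea can be written as one nested comprehension passed straight to the DataFrame constructor; a sketch, using a toy stand-in for no_empty_keys (same structure, illustrative values):

```python
import pandas as pd

# Toy stand-in for no_empty_keys (same shape as the real data)
no_empty_keys = {'783': [['4gsx', 'ADTQGS', 0.33, {'A': ['A224']}, 504, 509]]}

columns = ['Epitope ID', 'PDB', 'Percent Identity', 'Epitope Mapped',
           'Epitope Sequence', 'Starting Position', 'Ending Position']

# Build every row in a single pass, then construct the frame once
rows = [(k, y[0], y[2], y[3], y[1], y[4], y[5])
        for k, entries in no_empty_keys.items() for y in entries]
epitope_df = pd.DataFrame(rows, columns=columns)
```

Either way, the key point is the same: collect all rows in a plain Python list first, and call the DataFrame constructor exactly once.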
You can either write ad hoc code by hand or use the convtools library, which generates such converters for you:
from convtools import conversion as c
from convtools.contrib.tables import Table
no_empty_keys = {
"783": [
[ "4gsx", "ADTQGS", 0.3333333333333333, {"A": ["A224", "T226"], "B": ["A224", "T226"]}, 504, 509, ],
[ "4gt0", "ADTQGS", 0.3333333333333333, {"A": ["A224", "T226"], "B": ["A224", "T226"]}, 504, 509, ],
],
"1062": [
[ "4gsx", "AELTGY", 0.5, {"A": ["L175", "T176", "Y178"], "B": ["L175", "T176", "Y178"]}, 453, 458,],
[ "4gt0", "AELTGY", 0.5, {"A": ["L175", "T176", "Y178"], "B": ["L175", "T176", "Y178"]}, 453, 458, ],
],
}
columns = (
"Epitope ID",
"PDB",
"Percent Identity",
"Epitope Mapped",
"Epitope Sequence",
"Starting Position",
"Ending Position",
)
# this is just a function, so it can be run on startup once and stored for
# further reuse
converter = (
c.iter(
c.zip(
c.repeat(c.item(0)),
c.item(1)
).iter(
(c.item(0),) + tuple(c.item(1, i) for i in range(len(columns) - 1))
)
)
.flatten()
.gen_converter()
)
# here is the stuff to profile
Table.from_rows(
converter(no_empty_keys.items()),
header=columns,
).into_csv("out.csv")
Consider installing black and passing debug=True to gen_converter if you are curious about the code convtools generates under the hood.
I have an input variable (stud_id), a list (sub_code) and an array (data) with the values below.
stud_id: 10
sub_code: ['002', '003', '007']
data: [array([['867192', '5545']], dtype=object), array([['964433', '0430']], dtype=object), array([['965686', '2099']], dtype=object)]
How do I convert the above input into JSON format like this, with stud_id as the main key?
output = '{ "10" : { "002" : [ 867192, 5545 ], '\
' "003" : [ 964433, 0430 ], '\
' "007" : [ 965686, 2099 ] } }'
I had to adjust your array type for testing.
Try this code:
stud_id = 10
sub_code = ['002', '003', '007']
#data = [array([['867192', '5545']], dtype=object),
# array([['964433', '0430']], dtype=object),
# array([['965686', '2099']], dtype=object)]
data = [['867192', '5545'],
['964433', '0430'],
['965686', '2099']]
output = '{ "10" : { "002" : [ 867192, 5545 ], '\
' "003" : [ 964433, 0430 ], '\
' "007" : [ 965686, 2099 ] } }'
dd = {str(stud_id):{k:a for k,a in zip(sub_code, data)}}
print(dd)
Output
{'10': {'002': ['867192', '5545'], '003': ['964433', '0430'], '007': ['965686', '2099']}}
>>> import json
>>> from numpy import array
>>> stud_id = 10
>>> sub_code = ['002', '003', '007']
>>> data = [array([['867192', '5545']], dtype=object),
... array([['964433', '0430']], dtype=object),
... array([['965686', '2099']], dtype=object)]
>>> json.dumps({stud_id: dict(zip(sub_code, map(lambda arr: arr[0].tolist(), data)))})
'{"10": {"002": ["867192", "5545"], "003": ["964433", "0430"], "007": ["965686", "2099"]}}'
Zip sub_code and data, turn them into a dict with a dict comprehension, then put them in another dictionary with stud_id as a key, then dump as json:
import json
json.dumps({stud_id: {k: v.tolist()[0] for (k, v) in zip(sub_code, data)}})
# '{"10": {"002": ["867192", "5545"], "003": ["964433", "0430"], "007": ["965686", "2099"]}}'
BIG = { "Brand" : ["Clothes" , "Watch"], "Beauty" : ["Skin", "Hair"] }
SMA = { "Clothes" : ["T-shirts", "pants"], "Watch" : ["gold", "metal"],
"Skin" : ["lotion", "base"] , "Hair" : ["shampoo", "rinse"]}
I want to combine this data
like this...
BIG = {"Brand" : [ {"Clothes" : ["T-shirts", "pants"]}, {"Watch" : ["gold", "metal"]} ],...
Please tell me how to solve this problem.
First off, those are dictionaries, not lists. Also, I do not know your intention behind merging two dictionaries into that representation.
Anyway, if you want the output to be exactly as you described, this is the way to do it -
BIG = { "Brand" : ["Clothes" , "Watch"], "Beauty" : ["Skin", "Hair"] }
SMA = { "Clothes" : ["T-shirts", "pants"], "Watch" : ["gold", "metal"],"Skin" : ["lotion", "base"] , "Hair" : ["shampoo", "rinse"]}
for key,values in BIG.items(): #change to BIG.iteritems() in python 2.x
newValues = []
for value in values:
if value in SMA:
newValues.append({value:SMA[value]})
else:
newValues.append(value)
BIG[key]=newValues
Also, BIG.update(SMA) will not give you the right results in the way you want them to be.
Here is a test run -
>>> BIG.update(SMA)
>>> BIG
{'Watch': ['gold', 'metal'], 'Brand': ['Clothes', 'Watch'], 'Skin': ['lotion', 'base'], 'Beauty': ['Skin', 'Hair'], 'Clothes': ['T-shirts', 'pants'], 'Hair': ['shampoo', 'rinse']}
First, iterate over the first dictionary and look up each of its values as a key in the second dictionary.
BIG = { "Brand" : ["Clothes" , "Watch"], "Beauty" : ["Skin", "Hair"] }
SMA = { "Clothes" : ["T-shirts", "pants"], "Watch" : ["gold", "metal"], "Skin" : ["lotion", "base"] , "Hair" : ["shampoo", "rinse"]}
for key_big in BIG:
for key_sma in BIG[key_big]:
if key_sma in SMA:
BIG[key_big][BIG[key_big].index(key_sma)] = {key_sma: SMA.get(key_sma)}
print(BIG)
The result of code:
>>> {'Brand': [{'Clothes': ['T-shirts', 'pants']}, {'Watch': ['gold', 'metal']}], 'Beauty': [{'Skin': ['lotion', 'base']}, {'Hair': ['shampoo', 'rinse']}]}
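The loops above can also be condensed into a dict comprehension that builds a new dictionary instead of mutating BIG in place; a sketch (falling back to the bare name when SMA does not define it):

```python
BIG = {"Brand": ["Clothes", "Watch"], "Beauty": ["Skin", "Hair"]}
SMA = {"Clothes": ["T-shirts", "pants"], "Watch": ["gold", "metal"],
       "Skin": ["lotion", "base"], "Hair": ["shampoo", "rinse"]}

# Replace each inner name with {name: SMA[name]} when SMA defines it,
# leaving unknown names untouched; BIG itself is not modified.
merged = {k: [{name: SMA[name]} if name in SMA else name for name in names]
          for k, names in BIG.items()}
```

Building a new dict rather than mutating BIG keeps the original data intact in case it is needed elsewhere.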
So I have a list of students which looks something like this:
students = [ {'name': 'Jack' , 'status' : 'Average' , 'subjects' : { 'subject1' : 'English' , 'subject2' : 'Math' } , 'height' : '20cm' },
{'name': 'Tom' , 'status' : 'Good' , 'subjects' : { 'subject1' : 'English' , 'subject2' : 'Science' } , 'height' : '30cm' }
]
So the above list is of size 2. Assume that the size is pretty big, lets say 50 or 60 or more.
I want to build a list students_output where, for each student, there is a dictionary containing the same values fetched from the above list but with slightly modified keys. The end output should be something like this:
students_output = [ {'student_name': 'Jack' , 'student_status' : 'Average' , 'student_subjects' : { 'student_subject1' : 'English' , 'student_subject2' : 'Math' } , 'child_height' : '20cm' },
{'student_name': 'Tom' , 'student_status' : 'Good' , 'student_subjects' : { 'student_subject1' : 'English' , 'student_subject2' : 'Science' } , 'child_height' : '30cm' }
]
I can't figure out how to write a loop so that the keys in my resulting data structure are maintained as shown in the output while the data is fetched from the first list.
for example, in students_output, I know
students_output[0]['student_name']=students[0]['name']
But can anyone help me do it iteratively?
In order to achieve this, you have to prepend "student_" to each key, with an exception for the "height" key. You can do it with a combination of list comprehension and dict comprehension expressions:
students = [
{'name': 'Jack' , 'status' : 'Average' , 'subjects' : { 'subject1' : 'English' , 'subject2' : 'Math' } , 'height' : '20cm' },
{'name': 'Tom' , 'status' : 'Good' , 'subjects' : { 'subject1' : 'English' , 'subject2' : 'Science' } , 'height' : '30cm' }
]
def get_key(key):
return {
'height': 'child_height', # All exception you need in `key`
# apart from concatenating `"student_"`
}.get(key, 'student_' + key)
new_list = [{
    get_key(k): ({
        get_key(kk): vv for kk, vv in v.items()} if isinstance(v, dict) else v)
    for k, v in s.items()
} for s in students]
The value held by new_list will be:
[{'student_name': 'Jack', 'child_height': '20cm', 'student_status': 'Average', 'student_subjects': {'student_subject1': 'English', 'student_subject2': 'Math'}},
{'student_name': 'Tom', 'child_height': '30cm', 'student_status': 'Good', 'student_subjects': {'student_subject1': 'English', 'student_subject2': 'Science'}}]
Here's a quick-and-dirty function that will do what you need:
In [10]: def rename_keys(students):
...: d = {}
...: for k,v in students.items():
...: if isinstance(v,dict):
...: k = "student_" + k
...: v = rename_keys(v)
...: d[k] = v
...: elif k == 'height':
...: k = "child_height"
...: d[k] = v
...: else:
...: k = "student_" + k
...: d[k] = v
...: return d
...:
...:
In [11]: [rename_keys(d) for d in students]
Out[11]:
[{'child_height': '20cm',
'student_name': 'Jack',
'student_status': 'Average',
'student_subjects': {'student_subject1': 'English',
'student_subject2': 'Math'}},
{'child_height': '30cm',
'student_name': 'Tom',
'student_status': 'Good',
'student_subjects': {'student_subject1': 'English',
'student_subject2': 'Science'}}]
And really, this doesn't have to be recursive, you could substitute the recursive call with a dictionary comprehension:
v = {'student_'+key:value for key,value in v.items()}
You can use the following function inside a list comprehension like this:
def new_dict(d):
res = {}
for key, value in d.items():  # use d.iteritems() in Python 2
student_or_child = 'student' if key != 'height' else 'child'
if type(value) == dict:
res['{}_{}'.format(student_or_child, key)] = new_dict(value)
else:
res['{}_{}'.format(student_or_child, key)] = value
return res
The above function takes a dict as its argument; for each key/value pair in the passed dict, if the value is itself a dict, the same function is called recursively on it and the result is added to the res dict; otherwise the value is added unchanged.
Now, with a list comprehension, we can do:
[new_dict(d) for d in students]
Output:
>>> [new_dict(d) for d in students]
[{'child_height': '20cm', 'student_name': 'Jack', 'student_status': 'Average', 'student_subjects': {'student_subject1': 'English', 'student_subject2': 'Math'}}, {'child_height': '30cm', 'student_name': 'Tom', 'student_status': 'Good', 'student_subjects': {'student_subject1': 'English', 'student_subject2': 'Science'}}]
General question: how can I search a specific key:value pair in a JSON using Python?
Details for the specific case: I'm reading ~45,000 JSON objects, each of which looks like this one.
As you can see, inside every JSON object there are several dictionaries that have the same keys (but different values): "facetName", "facetLabel", "facetValues".
I'm interested in the dictionary that starts with "facetName": "soggettof", which goes like this:
{
"facetName": "soggettof",
"facetLabel": "Soggetto",
"facetValues": [
[
"chiesa - storia - documenti",
"chiesa - storia - documenti",
"1"
],
[
"espiazione - mare mediterraneo <bacino> - antichita - congressi - munster - 1999",
"espiazione - mare mediterraneo <bacino> - antichita - congressi - munster - 1999",
"1"
],
[
"lega rossa combattenti - storia",
"lega rossa combattenti - storia",
"1"
],
[
"pavia - storia ecclesiastica - origini-sec. 12.",
"pavia - storia ecclesiastica - origini-sec. 12.",
"1"
],
[
"pavia <diocesi> - storia - origini-sec. 12.",
"pavia <diocesi> - storia - origini-sec. 12.",
"1"
],
[
"persia - sviluppo economico - 1850-1900 - fonti diplomatiche inglesi",
"persia - sviluppo economico - 1850-1900 - fonti diplomatiche inglesi",
"1"
]
Please note that not all the JSON objects have that.
How can I grab the values of the facetValues list, but only in the dictionary that I'm interested in?
I found your question a little confusing, partly because the data shown in it was not the JSON object you need to extract the information from, but just an example of a sub-object you want to extract. Fortunately you included a link to the outermost container JSON object (even though the data in the corresponding sub-object there is different). Here's the data from that link:
json_obj = {"numFound":1,"start":0,"rows":3,"briefRecords":[{"progressivoId":0,"codiceIdentificativo":"IT\\ICCU\\LO1\\0120590","autorePrincipale":"Savoia, Carlo","titolo":"Per la inaugurazione dell'Asilo infantile Strozzi nei locali della caserma Filippini già convento della Vittoria / parole di mons. Carlo Savoia","pubblicazione":"Mantova : Tip. Eredi Segna, 1870","livello":"Monografia","tipo":"Testo a stampa","numeri":[],"note":[],"nomi":[],"luogoNormalizzato":[],"localizzazioni":[],"citazioni":[]}],"facetRecords":[{"facetName":"level","facetLabel":"Livello bibliografico","facetValues":[["Monografia","m","1"]]},{"facetName":"tiporec","facetLabel":"Tipo di documento","facetValues":[["Testo a stampa","a","1"]]},{"facetName":"nomef","facetLabel":"Autore","facetValues":[["savoia, carlo","savoia, carlo","1"]]},{"facetName":"soggettof","facetLabel":"Soggetto","facetValues":[["mantova - asili infantili","mantova - asili infantili","1"]]},{"facetName":"luogof","facetLabel":"Luogo di pubblicazione","facetValues":[["mantova","mantova","1"]]},{"facetName":"lingua","facetLabel":"Lingua","facetValues":[["italiano","ita","1"]]},{"facetName":"paese","facetLabel":"Paese","facetValues":[["italia","it","1"]]}]}
It's important to have this outermost container because it is through it that you drill down to the portion you want. Once you have the actual data, it's often helpful to reformat it to make its structure clear. You can do this by hand, or have the computer do it via print(json.dumps(json_obj, indent=2)), although the results from that can sometimes have a little too much white space (which can be counterproductive).
That being the case here, below is the more succinct version I came up with by reformatting it manually, which still lets me see the overall layout of the data:
json_obj = {"numFound" : 1,
"start" : 0,
"rows" : 3,
"briefRecords" : [
{"progressivoId" : 0,
"codiceIdentificativo" : "IT\\ICCU\\LO1\\0120590",
"autorePrincipale" : "Savoia, Carlo",
"titolo" : "Per la inaugurazione dell'Asilo infantile Strozzi nei locali della caserma Filippini già convento della Vittoria / parole di mons. Carlo Savoia",
"pubblicazione" : "Mantova : Tip. Eredi Segna, 1870",
"livello" : "Monografia",
"tipo" : "Testo a stampa",
"numeri" : [],
"note" : [],
"nomi" : [],
"luogoNormalizzato" : [],
"localizzazioni" : [],
"citazioni" : []
}
],
"facetRecords" : [
{"facetName" : "level" ,
"facetLabel" : "Livello bibliografico" ,
"facetValues" : [["Monografia" , "m" , "1"]]},
{"facetName" : "tiporec" ,
"facetLabel" : "Tipo di documento" ,
"facetValues" : [["Testo a stampa" , "a" , "1"]]},
{"facetName" : "nomef" ,
"facetLabel" : "Autore" ,
"facetValues" : [["savoia, carlo" , "savoia, carlo" , "1"]]},
{"facetName" : "soggettof" ,
"facetLabel" : "Soggetto" ,
"facetValues" : [["mantova - asili infantili" , "mantova - asili infantili" , "1"]]},
{"facetName" : "luogof" ,
"facetLabel" : "Luogo di pubblicazione" ,
"facetValues" : [["mantova" , "mantova" , "1"]]},
{"facetName" : "lingua" ,
"facetLabel" : "Lingua" ,
"facetValues" : [["italiano" , "ita" , "1"]]},
{"facetName" : "paese" ,
"facetLabel" : "Paese" ,
"facetValues" : [["italia" , "it" , "1"]]}
]
}
Once you have something like this, it's usually fairly easy to determine what code is needed. In this case it's:
target_facet_name = "soggettof"
for record in json_obj["facetRecords"]:
if record["facetName"] == target_facet_name:
for value in record["facetValues"]:
print(value)
Since facetRecords is a list, a linear search through them as shown is required to find the one(s) wanted.
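If at most one record per object matches, next() with a generator expression stops at the first hit and lets you supply a default for the JSON objects that lack the facet entirely; a sketch using a trimmed-down json_obj with the same structure:

```python
# Trimmed-down stand-in for the full json_obj shown above
json_obj = {
    "facetRecords": [
        {"facetName": "lingua", "facetValues": [["italiano", "ita", "1"]]},
        {"facetName": "soggettof",
         "facetValues": [["mantova - asili infantili",
                          "mantova - asili infantili", "1"]]},
    ]
}

# next() stops scanning at the first match; the None default covers
# JSON objects that have no "soggettof" facet.
record = next((r for r in json_obj["facetRecords"]
               if r["facetName"] == "soggettof"), None)
values = record["facetValues"] if record else []
```

This is handy when looping over all ~45,000 objects: the `if record else []` branch keeps the loop from raising on objects where the facet is absent.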