Can dictionary data be split into test and training sets randomly? - python

Suppose I have a set of dictionary data in JSON such as the example below:
data = {'a': '120120121',
        'b': '12301101',
        'c': '120120121',
        'd': '12301101',
        'e': '120120121',
        'f': '12301101',
        'g': '120120121',
        'h': '12301101',
        'i': '120120121',
        'j': '12301101'}
Is it possible to split the dictionary 70:30 at random using Python?
The output should look like:
training_data = {'a': '120120121',
                 'b': '12301101',
                 'c': '120120121',
                 'e': '120120121',
                 'g': '120120121',
                 'i': '120120121',
                 'j': '12301101'}
test_data = {'d': '12301101',
             'f': '12301101',
             'h': '12301101'}

The easiest way would be to use sklearn.model_selection.train_test_split here, then convert back to dictionaries if that is the structure you want:
import pandas as pd
from sklearn.model_selection import train_test_split

s = pd.Series(data)
training_data, test_data = [i.to_dict() for i in train_test_split(s, train_size=0.7)]
print(training_data)
# {'b': '12301101', 'j': '12301101', 'a': '120120121', 'f': '12301101',
# 'e': '120120121', 'c': '120120121', 'h': '12301101'}
print(test_data)
# {'i': '120120121', 'd': '12301101', 'g': '120120121'}
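If you'd rather not pull in pandas/sklearn, a minimal sketch using only the standard library's random module gives the same 70:30 split on the keys:
import random

keys = list(data)
train_keys = set(random.sample(keys, round(0.7 * len(keys))))

training_data = {k: data[k] for k in keys if k in train_keys}
test_data = {k: data[k] for k in keys if k not in train_keys}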

Related

Count of values of all categorical variables using Python

I have a dataset with a large number of columns. How do I calculate the frequency of values of all categorical variables in Python? I don't want the frequency for one or two specific columns; I need the frequency for all variables with type="category".
Use select_dtypes() to select the columns with type=category, and the count() method to get the number of values in each:
df.select_dtypes(include='category').count()
output:
col_cat1     9
col_cat2    21
Not entirely sure I know what you mean, but if you just want to keep a running count of frequencies, dictionaries are a great way to do this.
E.g. if we use the dummy data ['A', 'A', 'B', 'A', 'C', 'C']:
categories = ['A', 'A', 'B', 'A', 'C', 'C']
category_counts = {}
for category in categories:
    try:
        category_counts[category] += 1
    except KeyError:
        category_counts[category] = 1
print(category_counts)
returns:
{'A': 3, 'B': 1, 'C': 2}
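For what it's worth, collections.Counter does this bookkeeping in one step; a minimal sketch with the same dummy data:
from collections import Counter

categories = ['A', 'A', 'B', 'A', 'C', 'C']
category_counts = Counter(categories)
print(dict(category_counts))  # {'A': 3, 'B': 1, 'C': 2}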
EDIT: if you want a count of the categories in each column, the code only changes slightly:
table = [['Male/Female', 'M', 'M', 'F', 'M', 'F'],
         ['Age', '10-20', '30-40', '10-20', '20-30', '10-20']]
category_counts = {}
for column in table:
    category_counts[column[0]] = {}
    for data in column[1:]:
        try:
            category_counts[column[0]][data] += 1
        except KeyError:
            category_counts[column[0]][data] = 1
print(category_counts)
Which prints:
{'Male/Female': {'M': 3, 'F': 2}, 'Age': {'10-20': 3, '30-40': 1, '20-30': 1}}
But I'm unsure how you're currently storing your data.

How can this code be written in one line? How can it be written more pythonically?

I used a nested for loop to create a dictionary which holds a list of elements. But this feels clunky in Python. How can it be written more Pythonically? Is there an elegant way to write this in one line?
d1 = {}
for ds in datasets:
    d1[ds] = {}
    for mutType in mutTypes:
        d1[ds][mutType] = []
You can put this together as two nested dictionary comprehensions. I'm not sure I would call it more pythonic than the loops, which are quite readable, but I wouldn't argue with someone who preferred it:
datasets = ['a', 'b', 'c']
mutTypes = ['x', 'y', 'z']
d1 = {k:{mutType: [] for mutType in mutTypes} for k in datasets}
Result
{'a': {'x': [], 'y': [], 'z': []},
'b': {'x': [], 'y': [], 'z': []},
'c': {'x': [], 'y': [], 'z': []}}
There usually isn't a need to declare the data structure in advance in Python. What I would do is to use a defaultdict as my container, and use it directly.
from collections import defaultdict
d1 = defaultdict(lambda: defaultdict(list))
# Use d1 directly
d1[ds_1][mutType_1].append(123)
d1[ds_2][mutType_2].append(234)
# If you wish to strip out the nested defaultdict after, you can do something like this:
d2 = {key:dict(value) for key, value in d1.items()}
As always, it depends on what you're trying to do. Using d1 as such means it won't raise a KeyError when you're using keys not in datasets and/or mutTypes.
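To make that last caveat concrete, here is a tiny sketch of the silent-creation behavior:
from collections import defaultdict

d1 = defaultdict(lambda: defaultdict(list))
d1['typo_ds']['x'].append(1)  # no KeyError: the key is created on first access
print('typo_ds' in d1)        # True, even though it was never in datasets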

Extend dictionary values by same key for SVM training data

Hi, I'm quite new to Python and machine learning. I want to extract an SVM's x and y from two dictionaries.
The two dictionaries look like:
DIC_01
{'A': ['Low'],
 'B': ['High']}
DIC_02
{'A': [2623.83740234375,
       -1392.9608154296875,
       416.20831298828125],
 'B': [1231.1268310546875,
       -963.231201171875,
       1823.742431640625]}
About the data: the keys of the dictionaries are my 'keywords'. DIC_01 was converted from a dataframe; its values are each keyword's probability of sales. DIC_02 holds the vectors that represent the keywords.
I want to organise these dictionaries into SVM training-data format: x is the value of DIC_02, y is the value of DIC_01.
I don't know the most efficient way to do this task. At the moment I'm thinking...
step 1: merge values with the same keys
{'A': ([2623.83740234375,
        -1392.9608154296875,
        416.20831298828125], ['Low']),
 'B': ([1231.1268310546875,
        -963.231201171875,
        1823.742431640625], ['High'])}
step 2: extract the first and second values as the SVM's x and y, then train the model.
Thank you!
Hi, is this what you want to do?
DIC_01 = {'A': ['Low'],
          'B': ['High']}
DIC_02 = {'A': [2623.83740234375,
                -1392.9608154296875,
                416.20831298828125],
          'B': [1231.1268310546875,
                -963.231201171875,
                1823.742431640625]}
smv_X = []
smv_Y = []
for e in DIC_01:
    smv_X.append(DIC_02[e])
    smv_Y.append(DIC_01[e][0])
print(smv_X) # [[2623.83740234375, -1392.9608154296875, 416.20831298828125], [1231.1268310546875, -963.231201171875, 1823.742431640625]]
print(smv_Y) # ['Low', 'High']
You can also iterate over keys and values together:
for k, v in DIC_01.items():
    ...  # k is the key, v is the value
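From there, step 2 is just fitting a classifier on those two lists. A minimal sketch, assuming scikit-learn's SVC (note that two samples is only enough to smoke-test the plumbing, not to train a useful model):
from sklearn.svm import SVC

clf = SVC()
clf.fit(smv_X, smv_Y)  # X: the DIC_02 vectors, y: the 'Low'/'High' labels
print(clf.predict([[1000.0, -500.0, 800.0]]))  # hypothetical new keyword vector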

Avoiding key error storing values in nested dictionary (Python)

Introduction
The following dictionary has three levels of keys and then a value.
d = {
    1: {
        'A': {
            'i': 100,
            'ii': 200
        },
        'B': {
            'i': 300
        }
    },
    2: {
        'A': {
            'ii': 500
        }
    }
}
Examples of entries that need to be added:
d[1]['B']['ii'] = 600   # OK
d[2]['C']['iii'] = 700  # KeyError on 'C'
d[3]['D']['iv'] = 800   # KeyError on 3
Problem Statement
I wanted to create code that would create the necessary nested keys and avoid any key errors.
Solution 1
The first solution I came up with was:
def NewEntry_1(d, lv1, lv2, lv3, value):
    if lv1 in d:
        if lv2 in d[lv1]:
            d[lv1][lv2][lv3] = value
        else:
            d[lv1][lv2] = {lv3: value}
    else:
        d[lv1] = {lv2: {lv3: value}}
Seems legit, but embedding this in other pieces of code made it mind-boggling. I explored Stack Overflow for other solutions and read up on the get() and setdefault() functions.
Solution 2
There is plenty of material about get() and setdefault(), but not so much on nested dictionaries. Ultimately I was able to come up with:
def NewEntry_2(d, lv1, lv2, lv3, value):
    return d.setdefault(lv1, {}).setdefault(lv2, {}).setdefault(lv3, value)
It is one line of code, so it is not really necessary to make it a function. It is also easily modifiable to include operations:
d[lv1][lv2][lv3] = d.setdefault(lv1, {}).setdefault(lv2, {}).setdefault(lv3, 0) + value
Seems perfect?
Question
When adding large quantities of entries and making many modifications, is option 2 better than option 1? Or should I define function 1 and call it? The answers I'm looking for should take speed and/or potential for errors into account.
Examples
NewEntry_1(d, 1, 'B', 'ii', 600)
# d is now {1: {'A': {'i': 100, 'ii': 200}, 'B': {'i': 300, 'ii': 600}}, 2: {'A': {'ii': 500}}}
NewEntry_1(d, 2, 'C', 'iii', 700)
# d is now {1: {'A': {'i': 100, 'ii': 200}, 'B': {'i': 300, 'ii': 600}}, 2: {'A': {'ii': 500}, 'C': {'iii': 700}}}
NewEntry_1(d, 3, 'D', 'iv', 800)
# d is now {1: {'A': {'i': 100, 'ii': 200}, 'B': {'i': 300, 'ii': 600}}, 2: {'A': {'ii': 500}, 'C': {'iii': 700}}, 3: {'D': {'iv': 800}}}
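If raw speed is the main concern, a quick micro-benchmark is one way to settle it for your workload; a sketch assuming the NewEntry_1 and NewEntry_2 definitions above are in scope:
import timeit
from copy import deepcopy

base = {1: {'A': {'i': 100, 'ii': 200}, 'B': {'i': 300}}, 2: {'A': {'ii': 500}}}

def run_1():
    NewEntry_1(deepcopy(base), 3, 'D', 'iv', 800)

def run_2():
    NewEntry_2(deepcopy(base), 3, 'D', 'iv', 800)

print(timeit.timeit(run_1, number=10_000))  # time for 10k inserts via solution 1
print(timeit.timeit(run_2, number=10_000))  # time for 10k inserts via solution 2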
More background
I'm a business analyst exploring the use of Python to create a graph DB that would help me with very specific analyses. The dictionary structure is used to store the influence one node has on one of its neighbors:
lv1 is Node From
lv2 is Node To
lv3 is Iteration
value is Influence (in %)
In the first iteration Node 1 has direct influence on Node 2. In the second iteration Node 1 influences all the Nodes that Node 2 is influencing.
I'm aware of packages that can help me with this (networkx), but I'm trying to understand Python/GraphDB before I start using them.
As for the nested dictionaries, you should take a look at defaultdict. Using it will save you a lot of the function-calling overhead. The nested defaultdict construction resorts to lambda functions for their default factories:
from collections import defaultdict

d = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))  # new, shiny, empty
d[1]['B']['ii'] = 600 # OK
d[2]['C']['iii'] = 700 # OK
d[3]['D']['iv'] = 800 # OK
Update: a useful trick for creating an arbitrarily deep nested defaultdict is the following:
def tree():
    return defaultdict(tree)
d = tree()
# now any depth is possible
# d[1][2][3][4][5][6][7][8] = 9
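One caveat with tree() is that printing or pickling the result shows defaultdicts all the way down. A small recursive sketch (to_plain_dict is a hypothetical helper name) converts it back to plain dicts, assuming defaultdict is imported as above:
def to_plain_dict(t):
    # recursively replace nested defaultdicts with plain dicts
    return {k: to_plain_dict(v) if isinstance(v, defaultdict) else v
            for k, v in t.items()}

d = tree()
d[1]['B']['ii'] = 600
print(to_plain_dict(d))  # {1: {'B': {'ii': 600}}}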

Python - Load multiple Pickle objects into a single dictionary

So my problem is this... I have multiple pickle object files (which are pickled dictionaries) and I want to load them all, essentially merging each dictionary into a single larger dictionary.
E.g.
I have pickle_file1 and pickle_file2, both containing dictionaries. I would like the contents of pickle_file1 and pickle_file2 loaded into my_dict_final.
EDIT
As per request, here is what I have so far:
for pkl_file in pkl_file_list:
    pickle_in = open(pkl_file, 'rb')
    my_dict = pickle.load(pickle_in)
    pickle_in.close()
In essence it works, but it just overwrites the contents of my_dict rather than merging in each pickled dictionary.
Thanks in advance for the help.
import pickle

my_dict_final = {}  # create an empty dictionary
with open('pickle_file1', 'rb') as f:
    my_dict_final.update(pickle.load(f))  # merge the contents of file1 into the dictionary
with open('pickle_file2', 'rb') as f:
    my_dict_final.update(pickle.load(f))  # merge the contents of file2 into the dictionary
print(my_dict_final)
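If there are more than two files, the same idea folds straight into the loop from the question; a sketch assuming pkl_file_list holds the file paths:
import pickle

my_dict_final = {}
for pkl_file in pkl_file_list:  # pkl_file_list as in the question
    with open(pkl_file, 'rb') as f:
        my_dict_final.update(pickle.load(f))  # merge instead of overwrite
print(my_dict_final)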
You can use the dict.update function.
with open('pickle_file1', 'rb') as f1, open('pickle_file2', 'rb') as f2:
    pickle_dict1 = pickle.load(f1)
    pickle_dict2 = pickle.load(f2)
my_dict_final = pickle_dict1
my_dict_final.update(pickle_dict2)
Python Standard Library Docs
@Nunchux, @Vikas Ojha: if the dictionaries happen to have common keys, the update method will, unfortunately, overwrite the values for those common keys. Example:
>>> dict1 = {'a': 4, 'b': 3, 'c': 0, 'd': 4}
>>> dict2 = {'a': 1, 'b': 8, 'c': 5}
>>> All_dict = {}
>>> All_dict.update(dict1)
>>> All_dict.update(dict2)
>>> All_dict
{'a': 1, 'b': 8, 'c': 5, 'd': 4}
If you'd like to avoid this and keep adding the counts of common keys, one option is to use the following strategy. Applied to your example, here is a minimal working example:
import os
import pickle
from collections import Counter
dict1 = {'a': 4, 'b': 3, 'c': 0, 'd': 4}
dict2 = {'a': 1, 'b': 8, 'c': 5}
# just creating two pickle files:
pickle_out = open("dict1.pickle", "wb")
pickle.dump(dict1, pickle_out)
pickle_out.close()
pickle_out = open("dict2.pickle", "wb")
pickle.dump(dict2, pickle_out)
pickle_out.close()
# Here comes:
pkl_file_list = ["dict1.pickle", "dict2.pickle"]
All_dict = Counter({})
for pkl_file in pkl_file_list:
    if os.path.exists(pkl_file):
        with open(pkl_file, "rb") as pickle_in:
            dict_i = pickle.load(pickle_in)
        All_dict = All_dict + Counter(dict_i)
print(dict(All_dict))
This will happily give you:
{'a': 5, 'b': 11, 'd': 4, 'c': 5}
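One caveat worth knowing: adding Counters with + drops keys whose combined count is zero or negative, so a key like 'c': 0 would vanish if it appeared alone. Counter.update() also adds counts (unlike dict.update, which overwrites) but keeps zero totals; a minimal sketch:
from collections import Counter

total = Counter()
for d in (dict1, dict2):
    total.update(d)  # Counter.update adds counts instead of overwriting
print(dict(total))   # {'a': 5, 'b': 11, 'c': 5, 'd': 4}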
