Double layer key in data frame in Python

I need to create a data frame in Python with a double-layer key: each key has 3 subkeys, so to look up a value in the data frame I need to refer to it with 3 indexes:
df = [ key1                key2                key3
index  key11 key12 key13   key21 key22 key23   key31 key32 key33
0      12    32    45      345   34    43      3     54    134
1      143   41    14      4     1     13      14    41    43
2      114   11    54      11    13    13      43    13    13
]
So, to get the '11' in column 2, row 3, the call should look like df[2,'key1','key12'].
Is there any way to do this?
Thank you

You could quite easily set this up with nested dictionaries and lists. For example:
df = {
    'key1': {
        'key11': [12, 143, 114],
        'key12': [32, 41, 11],
        'key13': [45, 14, 54]},
    'key2': {
        'key21': [345, 4, 11],
        'key22': [34, 1, 13],
        'key23': [43, 13, 13]},
    'key3': {
        'key31': [3, 14, 43],
        'key32': [54, 41, 13],
        'key33': [134, 43, 13]}
}
The value that you want would then be df['key1']['key12'][2].
This dictionary could also be built automatically, given more detail on how your data comes in.
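If you want actual pandas behaviour rather than plain dictionaries, a two-level column MultiIndex gives exactly the addressing the question asks for. A minimal sketch, using the key and subkey names from the question:

```python
import pandas as pd

# Two-level column index: key1..key3 on top, key11..key33 underneath.
cols = pd.MultiIndex.from_tuples([
    ('key1', 'key11'), ('key1', 'key12'), ('key1', 'key13'),
    ('key2', 'key21'), ('key2', 'key22'), ('key2', 'key23'),
    ('key3', 'key31'), ('key3', 'key32'), ('key3', 'key33'),
])
rows = [
    [12, 32, 45, 345, 34, 43, 3, 54, 134],
    [143, 41, 14, 4, 1, 13, 14, 41, 43],
    [114, 11, 54, 11, 13, 13, 43, 13, 13],
]
mdf = pd.DataFrame(rows, columns=cols)

# Address a cell with a row label plus a (key, subkey) column tuple:
value = mdf.loc[2, ('key1', 'key12')]  # 11
```

Selecting a whole group also works: mdf['key1'] returns a 3-column sub-frame with key11..key13.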

Related

Turning raw DataFrame data into a dictionary of lists

I am trying to turn this data (in a Dataframe):
0 1
0 HT01 CC363292
29 RL01 CC363292
50 TN01 CC363292
4 BN02 CC363293
7 MR20 CC363294
9 TN01 CC363295
10 RL01 CC363296
13 HT01 CC363297
17 HT01 CC363298
21 SU01 CC363299
22 BN02 CC363300
25 MR20 CC363301
27 MR20 CC363302
54 BN02 CC363313
57 BN02 CC363314
60 BN02 CC363315
52 SU01 EA363303
32 RL01 EA363303
35 MR20 EA363304
37 HU01 EA363305
38 HU01 EA363306
39 BN02 EA363307
63 RL01 EA363311
66 MR20 EA363312
42 HT01 SC363308
46 RL01 SC363309
51 SP01 SC363309
53 FU01 SC363309
49 SP01 SC363310
into a dictionary that uses column 1 as the key and a list of the matching column 0 values (see below), i.e.
temp_dict = {
    CC363292: [HT01, RL01, TN01],
    CC363293: [BN02]
}
I have tried using a for loop to append to the list for each key, with no luck.
Can anyone assist?
zip_data = zip(df['col1'], df['col2'])
result = {}
for i in zip_data:
    result.setdefault(i[1], []).append(i[0])
This might work.
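For reference, the same idea run end to end on a small frame; the column names 'col1'/'col2' are an assumption carried over from the snippet above:

```python
import pandas as pd

df = pd.DataFrame(
    [['HT01', 'CC363292'], ['RL01', 'CC363292'],
     ['TN01', 'CC363292'], ['BN02', 'CC363293']],
    columns=['col1', 'col2'],
)

result = {}
for a, b in zip(df['col1'], df['col2']):
    # setdefault creates the empty list the first time a key is seen
    result.setdefault(b, []).append(a)
```

This produces {'CC363292': ['HT01', 'RL01', 'TN01'], 'CC363293': ['BN02']} for the sample rows.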
The question requires grouping the rows by one column and listing the values of another column for each group. A quick solution:
import pandas as pd

data = [
    [0, "HT01", "CC363292"],
    [29, "RL01", "CC363292"],
    [50, "TN01", "CC363292"],
    [4, "BN02", "CC363293"],
    [7, "MR20", "CC363294"],
    [9, "TN01", "CC363295"],
    [10, "RL01", "CC363296"],
    [13, "HT01", "CC363297"],
    [17, "HT01", "CC363298"],
    [21, "SU01", "CC363299"],
    [22, "BN02", "CC363300"],
    [25, "MR20", "CC363301"],
    [27, "MR20", "CC363302"],
    [54, "BN02", "CC363313"],
    [57, "BN02", "CC363314"],
    [60, "BN02", "CC363315"],
    [52, "SU01", "EA363303"],
    [32, "RL01", "EA363303"],
    [35, "MR20", "EA363304"],
    [37, "HU01", "EA363305"],
    [38, "HU01", "EA363306"],
    [39, "BN02", "EA363307"],
    [63, "RL01", "EA363311"],
    [66, "MR20", "EA363312"],
    [42, "HT01", "SC363308"],
    [46, "RL01", "SC363309"],
    [51, "SP01", "SC363309"],
    [53, "FU01", "SC363309"],
    [49, "SP01", "SC363310"],
]
df = pd.DataFrame(data)
# Group by the third column and list the second column for each group.
groups = df.groupby(df.columns[2])[df.columns[1]].apply(list)
print(groups)
The output should look similar to (middle rows elided):
CC363292 [HT01, RL01, TN01]
CC363293 [BN02]
CC363294 [MR20]
CC363295 [TN01]
CC363296 [RL01]
CC363297 [HT01]
CC363298 [HT01]
CC363299 [SU01]
CC363300 [BN02]
CC363301 [MR20]
...
EA363311 [RL01]
EA363312 [MR20]
SC363308 [HT01]
SC363309 [RL01, SP01, FU01]
SC363310 [SP01]
To convert to a dictionary, use dict(groups) instead. The output should be:
{
'CC363292': ['HT01', 'RL01', 'TN01'],
'CC363293': ['BN02'],
'CC363294': ['MR20'],
'CC363295': ['TN01'],
'CC363296': ['RL01'],
'CC363297': ['HT01'],
'CC363298': ['HT01'],
'CC363299': ['SU01'],
'CC363300': ['BN02'],
'CC363301': ['MR20'],
'CC363302': ['MR20'],
'CC363313': ['BN02'],
'CC363314': ['BN02'],
'CC363315': ['BN02'],
'EA363303': ['SU01', 'RL01'],
'EA363304': ['MR20'],
'EA363305': ['HU01'],
'EA363306': ['HU01'],
'EA363307': ['BN02'],
'EA363311': ['RL01'],
'EA363312': ['MR20'],
'SC363308': ['HT01'],
'SC363309': ['RL01', 'SP01', 'FU01'],
'SC363310': ['SP01']
}
I'm a little familiar with Pandas and here is the solution I came up with:
import pandas

# Create a DataFrame from the initial data
data = [
    ('HT01', 'CC363292'),
    ('RL01', 'CC363292'),
    ('TN01', 'CC363292'),
    ('BN02', 'CC363293'),
    ...
]
df = pandas.DataFrame(data=data, columns=['col1', 'col2'])
# This creates a dataframe like:
# col1 | col2     |
# HT01 | CC363292 |
# RL01 | CC363292 |
# TN01 | CC363292 |
# BN02 | CC363293 |
# .... | ........ |

# The next step is to one-hot encode 'col2' into dummy columns
_df = pandas.get_dummies(data=df, columns=['col2'], prefix='', prefix_sep='')
# This gives us:
# col1 | CC363292 | CC363293 | ...
# HT01 | 1        | 0        | ...
# RL01 | 1        | 0        | ...
# TN01 | 1        | 0        | ...
# BN02 | 0        | 1        | ...
# .... | ........ | ........ | ...

# Then we define a small lambda to build the list for each dummy column:
f = lambda col: [_df.col1[i] for i, val in enumerate(_df[col]) if val]

# And obtain the requested result with a dict comprehension:
my_dict = {col: f(col) for col in _df.columns[1:]}
# Important: using _df.columns[1:] is not very universal, but
# it is fine for the problem you described

How do I add a column based on selected row filter in pandas?

Hi, I would like to give each student a final score based on their current score plus their score in their favourite subject.
import pandas as pd
new_data = [['tom', 31, 50, 30, 20, 'English'], ['nick', 30, 42, 23, 21, 'Math'], ['juli', 39, 14, 40, 38, 'Science']]
df = pd.DataFrame(new_data, columns = ['Name','Current_Score','English','Science','Math','Favourite_Subject'])
for subj in df['Favourite_Subject'].unique():
    mask = (df['Favourite_Subject'] == subj)
    df['Final_Score'] = df[mask].apply(lambda row: row['Current_Score'] + row[subj], axis=1)

   Name  Current_Score  English  Science  Math Favourite_Subject  Final_Score
0   tom             31       50       30    20           English          NaN
1  nick             30       42       23    21              Math          NaN
2  juli             39       14       40    38           Science         79.0
When I apply the above function, I get NaN in the other 2 entries of the 'Final_Score' column. How do I get the following result without overwriting with NaN? Thanks!

   Name  Current_Score  English  Science  Math Favourite_Subject  Final_Score
0   tom             31       50       30    20           English           81
1  nick             30       42       23    21              Math           51
2  juli             39       14       40    38           Science           79
We can use a positional lookup to find the score corresponding to each row's Favourite_Subject, then add it to Current_Score to calculate Final_Score:
i = df.columns.get_indexer(df['Favourite_Subject'])
df['Final_Score'] = df['Current_Score'] + df.values[df.index, i]
Name Current_Score English Science Math Favourite_Subject Final_Score
0 tom 31 50 30 20 English 81
1 nick 30 42 23 21 Math 51
2 juli 39 14 40 38 Science 79
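Put together end to end, the lookup version runs like this (note that the df.values positional indexing assumes the default RangeIndex, as in the question):

```python
import pandas as pd

new_data = [['tom', 31, 50, 30, 20, 'English'],
            ['nick', 30, 42, 23, 21, 'Math'],
            ['juli', 39, 14, 40, 38, 'Science']]
df = pd.DataFrame(new_data, columns=['Name', 'Current_Score', 'English',
                                     'Science', 'Math', 'Favourite_Subject'])

# Positional column index of each row's favourite subject
i = df.columns.get_indexer(df['Favourite_Subject'])
# Fancy-index the underlying array: one score per row
df['Final_Score'] = df['Current_Score'] + df.values[df.index, i]
```

This yields Final_Score values 81, 51, 79 for the sample rows, matching the table above.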
You do not need a loop, you can apply this directly to the dataframe:
import pandas as pd
new_data = [['tom', 31, 50, 30, 20, 'English'], ['nick', 30, 42, 23, 21, 'Math'], ['juli', 39, 14, 40, 38, 'Science']]
df = pd.DataFrame(new_data, columns = ['Name','Current_Score','English','Science','Math','Favourite_Subject'])
df['Final_Score'] = df.apply(lambda x: x['Current_Score'] + x[x['Favourite_Subject']], axis=1)
You can use .apply() on axis=1 and get the column label from the column value of column Favourite_Subject to get the value of the corresponding column. Then, add the result to column Current_Score with df['Current_Score'], as follows:
df['Final_Score'] = df['Current_Score'] + df.apply(lambda x: x[x['Favourite_Subject']], axis=1)
Result:
print(df)
Name Current_Score English Science Math Favourite_Subject Final_Score
0 tom 31 50 30 20 English 81
1 nick 30 42 23 21 Math 51
2 juli 39 14 40 38 Science 79
It seems you are overwriting the previous values during each loop iteration, which is why only the final row has a Final_Score when the loop ends.
Here is my implementation:
import pandas as pd
new_data = [['tom', 31, 50, 30, 20, 'English'], ['nick', 30, 42, 23, 21, 'Math'], ['juli', 39, 14, 40, 38, 'Science']]
df = pd.DataFrame(new_data, columns = ['Name','Current_Score','English','Science','Math','Favourite_Subject'])
favsubj = df['Favourite_Subject'].to_list()
final_scores = []
for i in range(len(df)):
    final_scores.append(df['Current_Score'].iloc[i] + df[favsubj[i]].iloc[i])
df['Final_Score'] = final_scores

Is there a better way to convert JSON into DataFrame?

I want to convert a JSON object to a DataFrame.
This is my JSON object:
data = {'situation': {'OpenPlay': {'shots': 282,
'goals': 33,
'xG': 36.38206055667251,
'against': {'shots': 276, 'goals': 29, 'xG': 33.0840025995858}},
'FromCorner': {'shots': 46,
'goals': 2,
'xG': 2.861613758839667,
'against': {'shots': 46, 'goals': 4, 'xG': 3.420699148438871}},
'DirectFreekick': {'shots': 19,
'goals': 1,
'xG': 1.0674087516963482,
'against': {'shots': 10, 'goals': 0, 'xG': 0.6329493299126625}},
'SetPiece': {'shots': 14,
'goals': 1,
'xG': 0.6052199145779014,
'against': {'shots': 21, 'goals': 1, 'xG': 2.118571280501783}},
'Penalty': {'shots': 6,
'goals': 6,
'xG': 4.5670130252838135,
'against': {'shots': 2, 'goals': 1, 'xG': 1.5222634673118591}}}
Desired output: one row per situation, with the nested 'against' values flattened into their own columns (shown as an image in the original post).
My code:
df = pd.json_normalize(data['situation']['OpenPlay'])
for i in range(1, 4):
    # type_of_play is a list of the remaining situation names
    df = df.append(pd.json_normalize(data['situation'][type_of_play[i]]))
df = df.reset_index()
Is there a more efficient way of doing this?
First of all, your data is missing a '}' at the end.
Try this code:
obj = [pd.json_normalize(data['situation'][e]) for e in data['situation']]
pd.concat(obj, ignore_index=True)
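If you also want to keep the situation names, you could pass keys to pd.concat instead of ignore_index; a sketch on a trimmed-down copy of the dict (only two situations, rounded numbers):

```python
import pandas as pd

data = {'situation': {
    'OpenPlay': {'shots': 282, 'goals': 33, 'xG': 36.38,
                 'against': {'shots': 276, 'goals': 29, 'xG': 33.08}},
    'Penalty': {'shots': 6, 'goals': 6, 'xG': 4.57,
                'against': {'shots': 2, 'goals': 1, 'xG': 1.52}},
}}

frames = [pd.json_normalize(v) for v in data['situation'].values()]
# keys= labels each frame; droplevel removes the per-frame 0 index
out = pd.concat(frames, keys=list(data['situation'])).droplevel(1)
```

The nested 'against' dict is flattened by json_normalize into dotted columns (against.shots, against.goals, against.xG), and the index carries the situation names.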
You can load the data into a dataframe regularly, then run json_normalize on the column that contains the remaining dicts, and join it with the main dataframe:
df = pd.DataFrame(data['situation']).T.reset_index()
df = df.join(pd.json_normalize(df.against), lsuffix='_against', how='left').drop(columns=['against'])
Result:
            index  shots_against  goals_against  xG_against  shots  goals        xG
0        OpenPlay            282             33     36.3821    276     29    33.084
1      FromCorner             46              2     2.86161     46      4    3.4207
2  DirectFreekick             19              1     1.06741     10      0  0.632949
3        SetPiece             14              1     0.60522     21      1   2.11857
4         Penalty              6              6     4.56701      2      1   1.52226
For efficiency, it is best to process the data outside Pandas, within the dictionary, and then create the dataframe.
You could use jmespath to extract the data before passing it into Pandas; it should be more efficient, and you can run tests to check the speed.
The summary idea for jmespath: if you are accessing a key, use . ; if it's an array/list, use []:
import jmespath
expression = """{shots: *.*.shots[],
goals: *.*.goals[],
xG : *.*.xG[],
against_shots: *.*.against.shots[],
against_goals: *.*.against.goals[],
against_XG: *.*.against.xG[]
}"""
expression = jmespath.compile(expression)
expression = expression.search(data)
#dataframe
pd.DataFrame(expression)
shots goals xG against_shots against_goals against_XG
0 282 33 36.382061 276 29 33.084003
1 46 2 2.861614 46 4 3.420699
2 19 1 1.067409 10 0 0.632949
3 14 1 0.605220 21 1 2.118571
4 6 6 4.567013 2 1 1.522263
jmespath can be convenient, especially as the nesting in the dict/json becomes more convoluted; most efficient however, would be to use the dictionary data structure directly:
from collections import defaultdict

df = defaultdict(list)
for key, value in data['situation'].items():
    df['shots'].append(value['shots'])
    df['goals'].append(value['goals'])
    df['xG'].append(value['xG'])
    df['against_shots'].append(value['against']['shots'])
    df['against_goals'].append(value['against']['goals'])
    df['against_xG'].append(value['against']['xG'])

# create the dataframe
pd.DataFrame(df)
shots goals xG against_shots against_goals against_xG
0 282 33 36.382061 276 29 33.084003
1 46 2 2.861614 46 4 3.420699
2 19 1 1.067409 10 0 0.632949
3 14 1 0.605220 21 1 2.118571
4 6 6 4.567013 2 1 1.522263

How to create a dictionary of items from a dataframe?

I have a Pandas dataframe df which is of the form:
pk id_column date_column sales_column
0 111 03/10/19 23
1 111 04/10/19 24
2 111 05/10/19 25
3 111 06/10/19 26
4 112 07/10/19 27
5 112 08/10/19 28
6 112 09/10/19 29
7 112 10/10/19 30
8 113 11/10/19 31
9 113 12/10/19 32
10 113 13/10/19 33
11 113 14/10/19 34
12 114 15/10/19 35
13 114 16/10/19 36
14 114 17/10/19 37
15 114 18/10/19 38
How do I get a new dictionary whose keys come from id_column and whose values are lists built from sales_column, ordered by date_column, like below?
{
111: [23, 24, 25, 26],
112: [27, 28, 29, 30],
113: ...,
114: ...
}
First create a Series of lists with groupby and list, then convert it to a dictionary with Series.to_dict.
If you need sorting by id_column and date_column, first convert the values to datetimes and then use DataFrame.sort_values:
df['date_column'] = pd.to_datetime(df['date_column'], dayfirst=True)
df = df.sort_values(['id_column','date_column'])
d = df.groupby('id_column')['sales_column'].apply(list).to_dict()
print (d)
{111: [23, 24, 25, 26], 112: [27, 28, 29, 30], 113: [31, 32, 33, 34], 114: [35, 36, 37, 38]}

Inconsistent python print output

(Python 2.7.12) - I have created an NxN array; when I print it I get exactly the following output:
Sample a:
SampleArray=np.random.randint(1,100, size=(5,5))
[[49 72 88 56 41]
[30 73 6 43 53]
[83 54 65 16 34]
[25 17 73 10 46]
[75 77 82 12 91]]
Nice and clean.
However, when I sort this array by the elements in the last column (index 4) using the code:
SampleArray = sorted(SampleArray, key=lambda x: x[4])
I get the following output:
Sample b:
[array([90, 9, 77, 63, 48]), array([43, 97, 47, 74, 53]), array([60, 64, 97, 2, 73]), array([34, 20, 42, 80, 76]), array([86, 61, 95, 21, 82])]
How can I get my output to stay in the format of 'Sample a'? It will make debugging much easier if I can see the numbers in straight columns.
Simply with numpy.argsort() routine:
import numpy as np
a = np.random.randint(1,100, size=(5,5))
print(a) # initial array
print(a[np.argsort(a[:, -1])]) # sorted array
The output for # initial array:
[[21 99 34 33 55]
[14 81 92 44 97]
[68 53 35 46 22]
[64 33 52 40 75]
[65 35 35 78 43]]
The output for # sorted array:
[[68 53 35 46 22]
[65 35 35 78 43]
[21 99 34 33 55]
[64 33 52 40 75]
[14 81 92 44 97]]
You just need to convert the sample array back to a NumPy array by using
SampleArray = np.array(SampleArray)
Sample code:
import numpy as np
SampleArray=np.random.randint(1,100, size=(5,5))
print (SampleArray)
SampleArray=sorted(SampleArray, key=lambda x: x[4])
print (SampleArray)
SampleArray = np.array(SampleArray)
print (SampleArray)
Output:
[[28 25 33 56 54]
[77 88 10 68 61]
[30 83 77 87 82]
[83 93 70 1 2]
[27 70 76 28 80]]
[array([83, 93, 70, 1, 2]), array([28, 25, 33, 56, 54]), array([77, 88, 10, 68, 61]), array([27, 70, 76, 28, 80]), array([30, 83, 77, 87, 82])]
[[83 93 70 1 2]
[28 25 33 56 54]
[77 88 10 68 61]
[27 70 76 28 80]
[30 83 77 87 82]]
This can help:
from pprint import pprint
pprint(SampleArray)
The output is a little different from the one for Sample a, but it still looks neat and debugging will be easier.
Edit: here's my output:
[[92 8 41 64 61]
[18 67 91 80 35]
[68 37 4 6 43]
[26 81 57 26 52]
[ 6 82 95 15 69]]
[array([18, 67, 91, 80, 35]),
array([68, 37, 4, 6, 43]),
array([26, 81, 57, 26, 52]),
array([92, 8, 41, 64, 61]),
array([ 6, 82, 95, 15, 69])]
