How to create a dictionary of items from a dataframe? - python

I have a Pandas dataframe df which is of the form:
pk id_column date_column sales_column
0 111 03/10/19 23
1 111 04/10/19 24
2 111 05/10/19 25
3 111 06/10/19 26
4 112 07/10/19 27
5 112 08/10/19 28
6 112 09/10/19 29
7 112 10/10/19 30
8 113 11/10/19 31
9 113 12/10/19 32
10 113 13/10/19 33
11 113 14/10/19 34
12 114 15/10/19 35
13 114 16/10/19 36
14 114 17/10/19 37
15 114 18/10/19 38
How do I get a new dictionary that uses the values of id_column as keys and the matching sales_column values as lists, ordered by date_column, like below?
{
111: [23, 24, 25, 26],
112: [27, 28, 29, 30],
113: ...,
114: ...
}

If the rows need to be sorted by id_column and date_column first, convert the date values to datetimes and then use DataFrame.sort_values.
Then create a Series of lists with groupby and list, and convert it to a dictionary with Series.to_dict:
df['date_column'] = pd.to_datetime(df['date_column'], dayfirst=True)
df = df.sort_values(['id_column','date_column'])
d = df.groupby('id_column')['sales_column'].apply(list).to_dict()
print (d)
{111: [23, 24, 25, 26], 112: [27, 28, 29, 30], 113: [31, 32, 33, 34], 114: [35, 36, 37, 38]}
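The steps above can be sketched as a self-contained script. The sample rows are a shortened, deliberately shuffled stand-in for the question's data (an assumption for illustration), so the effect of the date sort is visible:

```python
import pandas as pd

# Shortened, shuffled stand-in for the question's data (illustrative only).
df = pd.DataFrame({
    "id_column": [111, 111, 112, 112],
    "date_column": ["04/10/19", "03/10/19", "08/10/19", "07/10/19"],
    "sales_column": [24, 23, 28, 27],
})

# Parse day-first dates so the sort is chronological rather than alphabetical.
df["date_column"] = pd.to_datetime(df["date_column"], dayfirst=True)
df = df.sort_values(["id_column", "date_column"])

# Collect each group's sales into a list, then turn the Series into a dict.
d = df.groupby("id_column")["sales_column"].apply(list).to_dict()
print(d)
```

Sorting before grouping matters because groupby keeps the rows of each group in the order in which they appear in the frame.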

How do I add a column based on selected row filter in pandas?

Hi I would like to give a final score to the students based on current Score + Score for their favourite subject.
import pandas as pd
new_data = [['tom', 31, 50, 30, 20, 'English'], ['nick', 30, 42, 23, 21, 'Math'], ['juli', 39, 14, 40, 38, 'Science']]
df = pd.DataFrame(new_data, columns = ['Name','Current_Score','English','Science','Math','Favourite_Subject'])
for subj in df['Favourite_Subject'].unique():
    mask = (df['Favourite_Subject'] == subj)
    df['Final_Score'] = df[mask].apply(lambda row: row['Current_Score'] + row[subj], axis=1)
Name Score English Science Math Favourite_Subject Final_Score
0 tom 31 50 30 20 English NaN
1 nick 30 42 23 21 Math NaN
2 juli 39 14 40 38 Science 79.0
When I run the code above, I get NaN in the other two entries of the 'Final_Score' column. How do I get the following result without overwriting values with NaN? Thanks!
Name Score English Science Math Favourite_Subject Final_Score
0 tom 31 50 30 20 English 81
1 nick 30 42 23 21 Math 51
2 juli 39 14 40 38 Science 79
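We can use a positional lookup: get_indexer finds the column position of each row's Favourite_Subject, and the scores at those positions are then added to Current_Score to calculate Final_Score. Note that df.values[df.index, i] uses the index labels as row positions, so it relies on the default 0..n-1 index.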
i = df.columns.get_indexer(df['Favourite_Subject'])
df['Final_Score'] = df['Current_Score'] + df.values[df.index, i]
Name Current_Score English Science Math Favourite_Subject Final_Score
0 tom 31 50 30 20 English 81
1 nick 30 42 23 21 Math 51
2 juli 39 14 40 38 Science 79
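If the frame does not have the default integer index, the same lookup can be written with explicit row positions. This is a minimal sketch; the string index and the names col_pos, row_pos and fav are assumptions added for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["tom", 31, 50, 30, 20, "English"],
     ["nick", 30, 42, 23, 21, "Math"],
     ["juli", 39, 14, 40, 38, "Science"]],
    columns=["Name", "Current_Score", "English", "Science", "Math", "Favourite_Subject"],
    index=["a", "b", "c"],  # deliberately not 0..n-1
)

# Column position of each row's favourite subject.
col_pos = df.columns.get_indexer(df["Favourite_Subject"])
# Row positions, independent of the index labels.
row_pos = np.arange(len(df))

# Pick one cell per row; the mixed-type frame yields an object array, so cast back to int.
fav = df.to_numpy()[row_pos, col_pos].astype(int)
df["Final_Score"] = df["Current_Score"] + fav
print(df)
```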
You do not need a loop, you can apply this directly to the dataframe:
import pandas as pd
new_data = [['tom', 31, 50, 30, 20, 'English'], ['nick', 30, 42, 23, 21, 'Math'], ['juli', 39, 14, 40, 38, 'Science']]
df = pd.DataFrame(new_data, columns = ['Name','Current_Score','English','Science','Math','Favourite_Subject'])
df['Final_Score'] = df.apply(lambda x: x['Current_Score'] + x[x['Favourite_Subject']], axis=1)
You can use .apply() with axis=1 and take the column label from each row's Favourite_Subject value to get the value of the corresponding column. Then add the result to the Current_Score column, as follows:
df['Final_Score'] = df['Current_Score'] + df.apply(lambda x: x[x['Favourite_Subject']], axis=1)
Result:
print(df)
Name Current_Score English Science Math Favourite_Subject Final_Score
0 tom 31 50 30 20 English 81
1 nick 30 42 23 21 Math 51
2 juli 39 14 40 38 Science 79
It seems you are overwriting the whole column on every pass of the loop, which is why only the rows matched in the last pass have a Final_Score when the loop ends.
Here is my implementation:
import pandas as pd
new_data = [['tom', 31, 50, 30, 20, 'English'], ['nick', 30, 42, 23, 21, 'Math'], ['juli', 39, 14, 40, 38, 'Science']]
df = pd.DataFrame(new_data, columns = ['Name','Current_Score','English','Science','Math','Favourite_Subject'])
favsubj = df['Favourite_Subject'].to_list()
final_scores = []
for i in range(0,len(df)):
    final_scores.append(df['Current_Score'].iloc[i] + df[favsubj[i]].iloc[i])
df['Final_Score'] = final_scores
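The overwriting can also be avoided while keeping the loop from the question, by writing only to the rows selected by the mask on each pass. A minimal sketch, assuming the column is initialised up front (the initial value 0 is an arbitrary placeholder):

```python
import pandas as pd

new_data = [["tom", 31, 50, 30, 20, "English"],
            ["nick", 30, 42, 23, 21, "Math"],
            ["juli", 39, 14, 40, 38, "Science"]]
df = pd.DataFrame(new_data, columns=["Name", "Current_Score", "English",
                                     "Science", "Math", "Favourite_Subject"])

# Create the column once, so each pass only fills in its own rows.
df["Final_Score"] = 0

for subj in df["Favourite_Subject"].unique():
    mask = df["Favourite_Subject"] == subj
    # .loc with the mask on the left-hand side leaves the other rows untouched.
    df.loc[mask, "Final_Score"] = df.loc[mask, "Current_Score"] + df.loc[mask, subj]

print(df)
```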

Pandas fillna multiple columns with values from corresponding columns without repeating for each

Let's say I have a DataFrame like this:
x = pd.DataFrame({'col1_x': [15, np.nan, 136, 93, 743, np.nan, np.nan, 91] ,
'col2_x': [np.nan, np.nan, 51, 22, 38, np.nan, 72, np.nan],
'col1_y': [10, 20, 30, 40, 50, 60, 70, 80],
'col2_y': [93, 24, 52, 246, 142, 53, 94, 2]})
I want to fill the NaN values in each col_x column with the values from the corresponding col_y column.
I can do this:
x['col1_x'] = x['col1_x'].fillna(x['col1_y'])
x['col2_x'] = x['col2_x'].fillna(x['col2_y'])
print(x)
Which will yield:
col1_x col2_x col1_y col2_y
0 15.0 93.0 10 93
1 20.0 24.0 20 24
2 136.0 51.0 30 52
3 93.0 22.0 40 246
4 743.0 38.0 50 142
5 60.0 53.0 60 53
6 70.0 72.0 70 94
7 91.0 2.0 80 2
But this requires repeating the same call with different column names. Now assume I have a bigger DataFrame with many more columns: is it possible to do this without the repetition?
You can pass **kwargs to assign().
Build the dict for **kwargs with a comprehension:
import pandas as pd
import numpy as np
x = pd.DataFrame({'col1_x': [15, np.nan, 136, 93, 743, np.nan, np.nan, 91] ,
'col2_x': [np.nan, np.nan, 51, 22, 38, np.nan, 72, np.nan],
'col1_y': [10, 20, 30, 40, 50, 60, 70, 80],
'col2_y': [93, 24, 52, 246, 142, 53, 94, 2]})
x.assign(**{c:x[c].fillna(x[c.replace("_x","_y")]) for c in x.columns if "_x" in c})
col1_x col2_x col1_y col2_y
0 15 93 10 93
1 20 24 20 24
2 136 51 30 52
3 93 22 40 246
4 743 38 50 142
5 60 53 60 53
6 70 72 70 94
7 91 2 80 2
How does it work
# core - loop through the columns that contain _x and generate each one's pair column _y
{c:c.replace("_x","_y")
 for c in x.columns if "_x" in c}
# now that we have all the column pairs, do what we want - fillna()
{c:x[c].fillna(x[c.replace("_x","_y")]) for c in x.columns if "_x" in c}
# this dictionary matches what assign() accepts: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html
# so the final part is to call the function with **kwargs
x.assign(**{c:x[c].fillna(x[c.replace("_x","_y")])
            for c in x.columns if "_x" in c})
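The same idea can be wrapped in a small helper. This is a sketch rather than part of the original answer: the function name fill_from_pairs and its suffix parameters are hypothetical, and it matches on the end of the column name instead of a substring so that a name such as "max_x_y" is not picked up by accident.

```python
import numpy as np
import pandas as pd

def fill_from_pairs(frame, target_suffix="_x", source_suffix="_y"):
    """Return a copy in which NaNs in each target column are filled from its paired source column."""
    fills = {}
    for col in frame.columns:
        if not col.endswith(target_suffix):
            continue
        partner = col[: -len(target_suffix)] + source_suffix
        if partner in frame.columns:  # skip target columns that have no partner
            fills[col] = frame[col].fillna(frame[partner])
    # assign() returns a new DataFrame and leaves the input untouched.
    return frame.assign(**fills)

x = pd.DataFrame({"col1_x": [15, np.nan, 136, 93, 743, np.nan, np.nan, 91],
                  "col2_x": [np.nan, np.nan, 51, 22, 38, np.nan, 72, np.nan],
                  "col1_y": [10, 20, 30, 40, 50, 60, 70, 80],
                  "col2_y": [93, 24, 52, 246, 142, 53, 94, 2]})

filled = fill_from_pairs(x)
print(filled)
```

Because assign() does not modify x, the result has to be kept, for example with x = fill_from_pairs(x).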
You can use the following notation -
x.fillna({"col1_x": x["col1_y"], "col2_x": x["col2_y"]})
Assuming you can extract all the index numbers into a list (called indices here), you can do -
replace_dict = {f"col{item}_x":x[f"col{item}_y"] for item in indices}
x = x.fillna(replace_dict)
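Are you trying to write this kind of function?
def fil(fill,fromm):
    # fills the passed Series in place; with pandas copy-on-write enabled the change does not reach x
    fill.fillna(fromm,inplace=True)
fil(x['col1_x'],x['col1_y'])
Or, if the function can refer to the dataframe (x) directly, this:
def fil(fill,fromm):
    # assign the result back instead of a chained inplace call on x[fill]
    x[fill] = x[fill].fillna(x[fromm])
fil('col1_x','col1_y')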
For your code :
import pandas as pd
import numpy as np
x = pd.DataFrame({'col1_x': [15, np.nan, 136, 93, 743, np.nan, np.nan, 91] ,
'col2_x': [np.nan, np.nan, 51, 22, 38, np.nan, 72, np.nan],
'col1_y': [10, 20, 30, 40, 50, 60, 70, 80],
'col2_y': [93, 24, 52, 246, 142, 53, 94, 2]})
def fil(fill,fromm):
    # assign the result back instead of a chained inplace call on x[fill]
    x[fill] = x[fill].fillna(x[fromm])
fil('col1_x','col1_y')
fil('col2_x','col2_y')
print(x)
"""
col1_x col2_x col1_y col2_y
0 15.0 93.0 10 93
1 20.0 24.0 20 24
2 136.0 51.0 30 52
3 93.0 22.0 40 246
4 743.0 38.0 50 142
5 60.0 53.0 60 53
6 70.0 72.0 70 94
7 91.0 2.0 80 2
"""
Additionally, if your column names follow a pattern like col1_x, col2_x, col3_x, ... (and the same for y), you can automate it like this:
for i in range(1,3):
    fil(f'col{i}_x',f'col{i}_y')

Pandas: Iterate and insert column with conditions within groups complex question

I have a quite complex question about how to add a new column with conditions for each group. Here is the example dataframe,
df = pd.DataFrame({
'id': ['AA', 'AA', 'AA', 'AA', 'BB', 'BB', 'BB', 'BB', 'BB',
'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC'],
'From_num': [80, 68, 751, 'Issued', 32, 68, 126, 'Issued', 'Missed', 105, 68, 114, 76, 68, 99, 'Missed'],
'To_num':[99, 80, 68, 751, 105, 32, 68, 126, 49, 324, 105, 68, 114, 76, 68, 99],
})
id From_num To_num
0 AA 80 99
1 AA 68 80
2 AA 751 68
3 AA Issued 751
4 BB 32 105
5 BB 68 32
6 BB 126 68
7 BB Issued 126
8 BB Missed 49
9 CC 105 324
10 CC 68 105
11 CC 114 68
12 CC 76 114
13 CC 68 76
14 CC 99 68
15 CC Missed 99
I have a 'flag' number, 68. Within each group, any row at or above the row holding the flag number in the 'From_num' column should be tagged "Forward" in the new column, and any row at or below the row holding the flag number in the 'To_num' column should be labelled "Back" in the same column. The hardest situation is when the flag number appears more than once in each column: the rows between those occurrences in 'From_num' and 'To_num' should be labelled "Forward&Back" in the new column. See the df above and the expected result below.
Expected result
id From_num To_num Direction
0 AA 80 99 Forward
1 AA 68 80 Forward
2 AA 751 68 Back
3 AA Issued 751 Back
4 BB 32 105 Forward
5 BB 68 32 Forward
6 BB 126 68 Back
7 BB Issued 126 Back
8 BB Missed 49 Back
9 CC 105 324 Forward
10 CC 68 105 Forward
11 CC 114 68 Forward&Back # From line 11 to 13, flag # 68 appears more than once
12 CC 76 114 Forward&Back # so the line 11, 12 and 13 labelled "Forward&Back"
13 CC 68 76 Forward&Back
14 CC 99 68 Back
15 CC Missed 99 Back
I tried writing many loops, but they all failed to produce the expected result. If anyone has ideas, please help. Hopefully the question is clear. Many thanks!
I've done it without "real looping":
preserve the row numbers (reset_index())
construct a new data frame holding only the records that contain the flag (68)
the simple logic for "Forward" and "Back" is based on the row being before or after the first sighting of 68
"Forward&Back" occurs when there are more than two sightings, for rows from the 2nd sighting up to (but not including) the last one
def direction(r):
    flagrow = df2[(df2["id"]==r["id"]) ]["index"].values
    if r["index"] <= flagrow[0]: val = "Forward"
    elif r["index"] > flagrow[0]: val = "Back"
    if len(flagrow)>2 and r["index"] >= flagrow[1] and r["index"]<flagrow[-1]: val = "Forward&Back"
    return val
df = pd.DataFrame({
'id': ['AA', 'AA', 'AA', 'AA', 'BB', 'BB', 'BB', 'BB', 'BB',
'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC'],
'From_num': [80, 68, 751, 'Issued', 32, 68, 126, 'Issued', 'Missed', 105, 68, 114, 76, 68, 99, 'Missed'],
'To_num':[99, 80, 68, 751, 105, 32, 68, 126, 49, 324, 105, 68, 114, 76, 68, 99],
})
df = df.reset_index()
df2 = df[(df.From_num==68) | (df.To_num==68)].copy()
df["Direction"] = df.apply(lambda r: direction(r), axis=1)
df
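The same rules can also be applied one group at a time, which avoids filtering the flag rows again for every row. This is a sketch of that variant, not the answer's exact code: FLAG and label_group are hypothetical names, and it assumes every group contains the flag at least once.

```python
import pandas as pd

FLAG = 68  # the flag number from the question

df = pd.DataFrame({
    "id": ["AA"] * 4 + ["BB"] * 5 + ["CC"] * 7,
    "From_num": [80, 68, 751, "Issued", 32, 68, 126, "Issued", "Missed",
                 105, 68, 114, 76, 68, 99, "Missed"],
    "To_num": [99, 80, 68, 751, 105, 32, 68, 126, 49,
               324, 105, 68, 114, 76, 68, 99],
})

def label_group(g):
    """Map each index label of one group to 'Forward', 'Back' or 'Forward&Back'."""
    # Positions inside the group where the flag shows up in either column.
    pos = [i for i, (f, t) in enumerate(zip(g["From_num"], g["To_num"]))
           if f == FLAG or t == FLAG]
    labels = {}
    for i, idx in enumerate(g.index):
        if len(pos) > 2 and pos[1] <= i < pos[-1]:
            labels[idx] = "Forward&Back"  # between the 2nd and the last sighting
        elif i <= pos[0]:                 # assumes the group has at least one sighting
            labels[idx] = "Forward"
        else:
            labels[idx] = "Back"
    return labels

all_labels = {}
for _, group in df.groupby("id", sort=False):
    all_labels.update(label_group(group))

# A Series built from the dict aligns with df on the index labels.
df["Direction"] = pd.Series(all_labels)
print(df)
```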

Inconsistent python print output

(Python 2.7.12) - I have created an NxN array, when I print it I get the exact following output:
Sample a:
SampleArray=np.random.randint(1,100, size=(5,5))
[[49 72 88 56 41]
[30 73 6 43 53]
[83 54 65 16 34]
[25 17 73 10 46]
[75 77 82 12 91]]
Nice and clean.
However, when I sort this array by the elements in the column at index 4 using the code:
SampleArray=sorted(SampleArray, key=lambda x: x[4])
I get the following output:
Sample b:
[array([90, 9, 77, 63, 48]), array([43, 97, 47, 74, 53]), array([60, 64, 97, 2, 73]), array([34, 20, 42, 80, 76]), array([86, 61, 95, 21, 82])]
How can I get my output to stay in the format of 'Sample a'. It will make debugging much easier if I can see the numbers in a straight column.
Simply with numpy.argsort() routine:
import numpy as np
a = np.random.randint(1,100, size=(5,5))
print(a) # initial array
print(a[np.argsort(a[:, -1])]) # sorted array
The output for # initial array:
[[21 99 34 33 55]
[14 81 92 44 97]
[68 53 35 46 22]
[64 33 52 40 75]
[65 35 35 78 43]]
The output for # sorted array:
[[68 53 35 46 22]
[65 35 35 78 43]
[21 99 34 33 55]
[64 33 52 40 75]
[14 81 92 44 97]]
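For a reproducible check, the same approach can be run on a fixed array instead of random numbers. In this sketch the three rows are taken from Sample a in the question, and kind="stable" is an optional addition that keeps rows with equal keys in their original order:

```python
import numpy as np

a = np.array([[49, 72, 88, 56, 41],
              [30, 73,  6, 43, 53],
              [83, 54, 65, 16, 34]])

# Row order that sorts by the column at index 4.
order = np.argsort(a[:, 4], kind="stable")

# Indexing with the order keeps a NumPy array, so it still prints as a grid.
sorted_a = a[order]
print(sorted_a)
```

Unlike sorted(), which returns a Python list of row arrays, indexing with the argsort result returns an ndarray, so the print format of 'Sample a' is preserved.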
You just need to convert the sorted list back to a NumPy array by using
SampleArray = np.array(SampleArray)
Sample code:
import numpy as np
SampleArray=np.random.randint(1,100, size=(5,5))
print (SampleArray)
SampleArray=sorted(SampleArray, key=lambda x: x[4])
print (SampleArray)
SampleArray = np.array(SampleArray)
print (SampleArray)
Output:
[[28 25 33 56 54]
[77 88 10 68 61]
[30 83 77 87 82]
[83 93 70 1 2]
[27 70 76 28 80]]
[array([83, 93, 70, 1, 2]), array([28, 25, 33, 56, 54]), array([77, 88, 10, 68, 61]), array([27, 70, 76, 28, 80]), array([30, 83, 77, 87, 82])]
[[83 93 70 1 2]
[28 25 33 56 54]
[77 88 10 68 61]
[27 70 76 28 80]
[30 83 77 87 82]]
This can help:
from pprint import pprint
pprint(SampleArray)
The output is a little bit different from the one for Sample A but it still looks neat and debugging will be easier.
Edit: here's my output
[[92 8 41 64 61]
[18 67 91 80 35]
[68 37 4 6 43]
[26 81 57 26 52]
[ 6 82 95 15 69]]
[array([18, 67, 91, 80, 35]),
array([68, 37, 4, 6, 43]),
array([26, 81, 57, 26, 52]),
array([92, 8, 41, 64, 61]),
array([ 6, 82, 95, 15, 69])]

double layer key in data frame python

I need to create a data frame in Python that has a double key: each key has 3 subkeys, so to get a value from the data frame I need to refer to it with 3 indexes.
df=[ key1 key2 key3
index key11 key12 key13 key21 key22 key23 key31 key32 key33
0 12 32 45 345 34 43 3 54 134
1 143 41 14 4 1 13 14 41 43
2 114 11 54 11 13 13 43 13 13
]
So to get the '11' in column 2, row 3, it should look like df[2,'key1','key12'].
Is there any way to do this?
Thank you
You could quite easily set this up with embedded dictionaries and a list. For example:
df = {
    'key1' : {
        'key11' : [12, 143, 114],
        'key12' : [32, 41, 11],
        'key13' : [45, 14, 54]},
    'key2' : {
        'key21' : [345, 4, 11],
        'key22' : [34, 1, 13],
        'key23' : [43, 13, 13]},
    'key3' : {
        'key31' : [3, 14, 43],
        'key32' : [54, 41, 13],
        'key33' : [134, 43, 13]}
}
The value that you want would then be df['key1']['key12'][2].
This dictionary could be set up automatically with more info on how your data was coming in.
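If an actual pandas data frame is wanted, the same nested structure can be turned into two-level (MultiIndex) columns. This is a sketch that goes beyond the answer above: it uses the values from the question's table, and the names nested and value are placeholders.

```python
import pandas as pd

# Values taken from the table in the question.
nested = {
    "key1": {"key11": [12, 143, 114], "key12": [32, 41, 11], "key13": [45, 14, 54]},
    "key2": {"key21": [345, 4, 11], "key22": [34, 1, 13], "key23": [43, 13, 13]},
    "key3": {"key31": [3, 14, 43], "key32": [54, 41, 13], "key33": [134, 43, 13]},
}

# Tuple keys (outer, inner) become a two-level MultiIndex on the columns.
df = pd.DataFrame({(outer, inner): values
                   for outer, sub in nested.items()
                   for inner, values in sub.items()})

# Row label first, then the (outer, inner) column key.
value = df.loc[2, ("key1", "key12")]
print(value)

# Selecting only the outer key returns the three sub-columns as a DataFrame.
print(df["key1"])
```

The lookup df.loc[2, ("key1", "key12")] is the pandas counterpart of the df[2,'key1','key12'] notation asked for in the question.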
