Create a dictionary of dictionaries of frequencies from dataframe - python

I have a big dataset like this, and I'm trying to build a dictionary of dictionaries from the dataframe that organizes the 'crime' column against the frequencies of the other columns.
train_data
23 Wednesday BAYVIEW CENTRAL INGLESIDE NORTHERN PARK RICHMOND crime
0 1 1 0 0 0 1 0 0 3
1 1 1 0 0 0 1 0 0 1
2 1 1 0 0 0 1 0 0 1
3 1 1 0 0 0 1 0 0 0
4 1 1 0 0 0 0 1 0 0
5 1 1 0 0 1 0 0 0 0
6 1 1 0 0 1 0 0 0 2
7 1 1 1 0 0 0 0 0 2
8 1 1 0 0 0 0 0 1 0
9 1 1 0 1 0 0 0 0 0
So first of all I decided to group the dataframe by the 'crime' column:
train_data=train_data.groupby(['crime']).sum()
23 Wednesday BAYVIEW CENTRAL INGLESIDE NORTHERN PARK RICHMOND
crime
0 5 5 0 1 1 1 1 1
1 2 2 0 0 0 2 0 0
2 2 2 1 0 1 0 0 0
3 1 1 0 0 0 1 0 0
Then I tried to organize them into a dictionary of dictionaries, but I can't make it work; I also tried iterating in a few ways, but something goes wrong with the dataframe.
The result should be something like this:
{0: {'23': 5, 'Wednesday': 5, 'BAYVIEW': 0, 'CENTRAL': 1, ...},
 1: {'23': 2, 'Wednesday': 2, 'BAYVIEW': 0, ...},
 2: {...}, 3: {...}}

You can use
d = train_data.to_dict(orient='index')
See http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.to_dict.html for more options.

If you're on pandas 0.17.0 or greater, as MaxNoe posted:
train_data.groupby('crime').sum().to_dict(orient='index')
otherwise:
train_data.groupby('crime').sum().T.to_dict()
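Putting the two steps together, a minimal runnable sketch with a shortened two-column stand-in for train_data (the column names are assumed for illustration):

```python
import pandas as pd

# Toy stand-in for train_data: two feature columns plus the 'crime' label.
train_data = pd.DataFrame({
    '23':      [1, 1, 1, 1],
    'BAYVIEW': [0, 0, 1, 0],
    'crime':   [3, 1, 1, 0],
})

# Sum the feature columns per crime value, then turn each row into an inner dict.
d = train_data.groupby('crime').sum().to_dict(orient='index')
print(d)
# {0: {'23': 1, 'BAYVIEW': 0}, 1: {'23': 2, 'BAYVIEW': 1}, 3: {'23': 1, 'BAYVIEW': 0}}
```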

Pandas DF Groupby

I have a dataframe of student responses [S1-S82], with each strand corresponding to a response. I want to know the count of each response given with respect to each strand: if the student marked an answer correctly, I want the strand name and the number of correct responses; if the answer is wrong, I want the strand name and the number of wrong responses (similar to value_counts). I am attaching a screenshot of the dataframe.
https://prnt.sc/1125odu
I have written the following code
data_transposed['Counts'] = data_transposed.groupby(['STRAND-->'])['S1'].transform('count')
but it is really not getting me what I want. I am looking for an option similar to value_counts so I can plot the data.
Please look into it and help me. Thank you.
I think you are looking to groupby the Strands for each student S1 thru S82.
Here's how I would do it.
Step 1: Create a DataFrame with groupby Strand--> where the value is 0.
Step 2: Create another DataFrame with groupby Strand--> where the value is 1.
Step 3: Add a column to each of the dataframes with a value of 0 or 1 to record which data it grouped.
Step 4: Concatenate both dataframes.
Step 5: Rearrange the columns to have Strand-->, val, then all students S1 thru S82.
Step 6: Sort the dataframe by Strand--> so you get the values in the right order.
The code is as shown below:
import pandas as pd
import numpy as np
d = {'Strand-->': ['Geometry', 'Geometry', 'Geometry', 'Geometry', 'Mensuration',
                   'Mensuration', 'Mensuration', 'Geometry', 'Algebra', 'Algebra',
                   'Comparing Quantities', 'Geometry', 'Data Handling', 'Geometry', 'Geometry']}
# Random 0/1 responses for students S1 thru S82
for i in range(1, 83):
    d['S' + str(i)] = np.random.randint(0, 2, size=15)
df = pd.DataFrame(d)
print (df)
# Count the zeros per strand for each student
df1 = df.groupby('Strand-->').agg(lambda x: x.eq(0).sum())
df1['val'] = 0
# Count the ones per strand for each student
df2 = df.groupby('Strand-->').agg(lambda x: x.ne(0).sum())
df2['val'] = 1
df3 = pd.concat([df1, df2]).reset_index()
# Reorder columns: Strand-->, val, then S1 thru S82
dx = [0, -1] + [i for i in range(1, 83)]
df3 = df3[df3.columns[dx]].sort_values('Strand-->').reset_index(drop=True)
print (df3)
The output of this will be as follows:
Original DataFrame:
Strand--> S1 S2 S3 S4 S5 ... S77 S78 S79 S80 S81 S82
0 Geometry 0 1 0 0 1 ... 1 0 0 0 1 0
1 Geometry 0 0 0 1 1 ... 1 1 1 0 0 0
2 Geometry 1 1 1 0 0 ... 0 0 1 0 0 0
3 Geometry 0 1 1 0 1 ... 1 0 0 1 0 1
4 Mensuration 1 1 1 0 1 ... 0 1 1 1 0 0
5 Mensuration 0 1 1 1 0 ... 1 0 0 1 1 0
6 Mensuration 1 0 1 1 1 ... 0 1 0 0 1 0
7 Geometry 1 0 1 1 1 ... 1 1 1 0 0 1
8 Algebra 0 0 1 0 1 ... 1 1 0 0 1 1
9 Algebra 0 1 0 1 1 ... 1 1 1 1 0 1
10 Comparing Quantities 1 1 0 1 1 ... 1 1 0 1 1 0
11 Geometry 1 1 1 1 0 ... 0 0 1 0 1 0
12 Data Handling 1 1 0 0 0 ... 1 0 1 1 0 0
13 Geometry 1 1 1 0 0 ... 1 1 1 1 0 0
14 Geometry 0 1 0 0 1 ... 0 1 1 0 1 0
Updated DataFrame:
Note that the 'val' column will be 0 or 1: if 0, the row holds the counts of 0s; if 1, the counts of 1s.
Strand--> val S1 S2 S3 S4 ... S77 S78 S79 S80 S81 S82
0 Algebra 0 2 1 1 1 ... 0 0 1 1 1 0
1 Algebra 1 0 1 1 1 ... 2 2 1 1 1 2
2 Comparing Quantities 0 0 0 1 0 ... 0 0 1 0 0 1
3 Comparing Quantities 1 1 1 0 1 ... 1 1 0 1 1 0
4 Data Handling 0 0 0 1 1 ... 0 1 0 0 1 1
5 Data Handling 1 1 1 0 0 ... 1 0 1 1 0 0
6 Geometry 0 4 2 3 5 ... 3 4 2 6 5 6
7 Geometry 1 4 6 5 3 ... 5 4 6 2 3 2
8 Mensuration 0 1 1 0 1 ... 2 1 2 1 1 3
9 Mensuration 1 2 2 3 2 ... 1 2 1 2 2 0
For a single student you can do:
df.groupby(['Strand-->', 'S1']).size().to_frame(name = 'size').reset_index()
If you want to calculate all students at once you can do:
df_m = (pd.melt(df, id_vars=['Strand-->'], value_vars=df.columns[1:])
          .rename({'variable': 'result'}, axis=1)
          .sort_values(['result']))
df_m['result'].groupby([df_m['Strand-->'], df_m['value']]).value_counts().unstack(fill_value=0).reset_index()
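As a self-contained sketch of the melt-based variant, using a tiny hypothetical frame (two strands, two students) so the result is easy to check by hand:

```python
import pandas as pd

df = pd.DataFrame({
    'Strand-->': ['Geometry', 'Geometry', 'Algebra'],
    'S1': [0, 1, 1],
    'S2': [1, 1, 0],
})

# Reshape to long form: one row per (strand, student, response).
df_m = (pd.melt(df, id_vars=['Strand-->'], value_vars=df.columns[1:])
          .rename({'variable': 'result'}, axis=1))

# Count responses per (strand, value) and pivot the students back into columns.
out = (df_m['result']
       .groupby([df_m['Strand-->'], df_m['value']])
       .value_counts()
       .unstack(fill_value=0)
       .reset_index())
print(out)
```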

Build a matrix for a multi regression model with qualitative data

I'm trying to build a multiple regression model with qualitative data.
To do that, I need to build a new data frame with one column per unique value, marked 1 where the index had that value.
Example:
import pandas as pd

d = {'City': ['Tokyo', 'Tokyo', 'Lisbon', 'Tokyo', 'Madrid', 'Lisbon',
              'Madrid', 'London', 'Tokyo', 'London', 'Tokyo'],
     'Card': ['Visa', 'Visa', 'Visa', 'Master Card', 'Bitcoin', 'Master Card',
              'Bitcoin', 'Visa', 'Master Card', 'Visa', 'Bitcoin'],
     'Client Number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
d = pd.DataFrame(data=d).set_index('Client Number')
And get a result equal to this
Let us try get_dummies
df = pd.get_dummies(d,prefix='', prefix_sep='')
Out[202]:
Lisbon London Madrid Tokyo Bitcoin Master Card Visa
Client Number
1 0 0 0 1 0 0 1
2 0 0 0 1 0 0 1
3 1 0 0 0 0 0 1
4 0 0 0 1 0 1 0
5 0 0 1 0 1 0 0
6 1 0 0 0 0 1 0
7 0 0 1 0 1 0 0
8 0 1 0 0 0 0 1
9 0 0 0 1 0 1 0
10 0 1 0 0 0 0 1
11 0 0 0 1 1 0 0
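As a self-contained check of the get_dummies call above, run on the question's own data:

```python
import pandas as pd

d = {'City': ['Tokyo', 'Tokyo', 'Lisbon', 'Tokyo', 'Madrid', 'Lisbon',
              'Madrid', 'London', 'Tokyo', 'London', 'Tokyo'],
     'Card': ['Visa', 'Visa', 'Visa', 'Master Card', 'Bitcoin', 'Master Card',
              'Bitcoin', 'Visa', 'Master Card', 'Visa', 'Bitcoin'],
     'Client Number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
d = pd.DataFrame(data=d).set_index('Client Number')

# One indicator column per unique City and Card value; the empty prefix
# keeps the raw value names as column names.
df = pd.get_dummies(d, prefix='', prefix_sep='')
print(df)
```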

Creating week flags from DOW

I have a dataframe:
DOW
0 0
1 1
2 2
3 3
4 4
5 5
6 6
This corresponds to the day of the week. Now I want to create this dataframe:
DOW MON_FLAG TUE_FLAG WED_FLAG THUR_FLAG FRI_FLAG SAT_FLAG
0 0 0 0 0 0 0 0
1 1 1 0 0 0 0 0
2 2 0 1 0 0 0 0
3 3 0 0 1 0 0 0
4 4 0 0 0 1 0 0
5 5 0 0 0 0 1 0
6 6 0 0 0 0 0 1
7 0 0 0 0 0 0 0
8 1 1 0 0 0 0 0
Depending on the DOW column: if it's 1 then MON_FLAG will be 1, if it's 2 then TUE_FLAG will be 1, and so on. I have kept Sunday as 0, which is why all the flag columns are zero in that case.
Use get_dummies and rename the columns with a dictionary:
d = {0: 'SUN_FLAG', 1: 'MON_FLAG', 2: 'TUE_FLAG',
     3: 'WED_FLAG', 4: 'THUR_FLAG', 5: 'FRI_FLAG', 6: 'SAT_FLAG'}
df = df.join(pd.get_dummies(df['DOW']).rename(columns=d))
print (df)
DOW SUN_FLAG MON_FLAG TUE_FLAG WED_FLAG THUR_FLAG FRI_FLAG SAT_FLAG
0 0 1 0 0 0 0 0 0
1 1 0 1 0 0 0 0 0
2 2 0 0 1 0 0 0 0
3 3 0 0 0 1 0 0 0
4 4 0 0 0 0 1 0 0
5 5 0 0 0 0 0 1 0
6 6 0 0 0 0 0 0 1
7 0 1 0 0 0 0 0 0
8 1 0 1 0 0 0 0 0
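The snippet above as a runnable whole, assuming DOW holds integers 0-6 with 0 = Sunday as stated in the question:

```python
import pandas as pd

# DOW holds 0-6, with 0 = Sunday (per the question).
df = pd.DataFrame({'DOW': [0, 1, 2, 3, 4, 5, 6, 0, 1]})

d = {0: 'SUN_FLAG', 1: 'MON_FLAG', 2: 'TUE_FLAG',
     3: 'WED_FLAG', 4: 'THUR_FLAG', 5: 'FRI_FLAG', 6: 'SAT_FLAG'}

# One dummy column per distinct DOW value, renamed via the dictionary.
df = df.join(pd.get_dummies(df['DOW']).rename(columns=d))
print(df)
```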

Transpose Pandas dataframe preserving the index

I have a problem while transposing a Pandas DataFrame that has the following structure:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
foo 0 4 0 0 0 0 0 0 0 0 14 1 0 1 0 0 0
bar 0 6 0 0 4 0 5 0 0 0 0 0 0 0 1 0 0
lorem 1 3 0 0 0 1 0 0 2 0 3 0 1 2 1 1 0
ipsum 1 2 0 1 0 0 1 0 0 0 0 0 4 0 6 0 0
dolor 1 2 4 0 1 0 0 0 0 0 2 0 0 1 0 0 2
..
With index:
foo,bar,lorem,ipsum,dolor,...
And this is basically a terms-documents matrix, where rows are terms and the headers (0-16) are document indexes.
Since my purpose is clustering documents and not terms, I want to transpose the dataframe and use this to perform a cosine-distance computation between documents themselves.
But when I transpose with:
df.transpose()
I get:
foo bar ... pippo lorem
0 0 0 ... 0 0
1 4 6 ... 0 0
2 0 0 ... 0 0
3 0 0 ... 0 0
4 0 4 ... 0 0
..
16 0 2 ... 0 1
With index:
0 , 1 , 2 , 3 , ... , 15, 16
What would I like?
I'm looking for a way to perform this operation while preserving the dataframe index. Basically, the header row of my new df should be the old index.
Thank you
We can use a series of unstack operations:
df2 = df.unstack().to_frame().unstack(1).droplevel(0,axis=1)
print(df2)
foo bar lorem ipsum dolor
0 0 0 1 1 1
1 4 6 3 2 2
2 0 0 0 0 4
3 0 0 0 1 0
4 0 4 0 0 1
5 0 0 1 0 0
6 0 5 0 1 0
7 0 0 0 0 0
8 0 0 2 0 0
9 0 0 0 0 0
10 14 0 3 0 2
11 1 0 0 0 0
12 0 0 1 4 0
13 1 0 2 0 1
14 0 1 1 6 0
15 0 0 1 0 0
16 0 0 0 0 2
Assuming the data is a square matrix (n x n), and if I understand the question correctly:
df = pd.DataFrame([[0, 4, 0], [0, 6, 0], [1, 3, 0]],
                  index=['foo', 'bar', 'lorem'],
                  columns=[0, 1, 2])
df_T = pd.DataFrame(df.values.T, index=df.index, columns=df.columns)
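For reference, a plain .T (the shorthand for transpose()) already carries the old index across as the new column labels; a quick sketch with the same 3x3 toy frame:

```python
import pandas as pd

df = pd.DataFrame([[0, 4, 0], [0, 6, 0], [1, 3, 0]],
                  index=['foo', 'bar', 'lorem'],
                  columns=[0, 1, 2])

# .T swaps the axes: the old index becomes the columns, and vice versa.
df_T = df.T
print(df_T)
```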

new column in pandas DataFrame based on unique values (lists) of an existing column

I have a dataframe where some cells contain lists of multiple values. How can I create new columns based on the unique values in those lists? The lists can contain values already seen in previous rows, and they can also be empty. How do I create new one-hot-encoded columns based on those values?
CHECK EDIT - Data is within quotation marks:
import pandas as pd

data = {'tokens': ['["Spain", "Germany", "England", "Japan"]',
                   '["Spain", "Germany"]',
                   '["Morocco"]',
                   '[]',
                   '["Japan"]',
                   '[]']}
my_new_pd = pd.DataFrame(data)
0 ["Spain", "Germany", "England", "Japan"]
1 ["Spain", "Germany"]
2 ["Morocco"]
3 []
4 ["Japan", ""]
5 []
Name: tokens, dtype: object
I want something like
tokens_Spain|tokens_Germany |tokens_England |tokens_Japan|tokens_Morocco
0 1 1 1 1 0
1 1 1 0 0 0
2 0 0 0 0 1
3 0 0 0 0 0
4 0 0 1 1 0
5 0 0 0 0 0
Method one, from sklearn, since you already have a list-type column in your df:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
yourdf = pd.DataFrame(mlb.fit_transform(df['tokens']),
                      columns=mlb.classes_, index=df.index)
Method two: we explode first, then build the dummies:
df['tokens'].explode().str.get_dummies().sum(level=0).add_prefix('tokens_')
tokens_A tokens_B tokens_C tokens_D tokens_Z
0 1 1 1 1 0
1 1 1 0 0 0
2 0 0 0 0 1
3 0 0 0 0 0
4 0 0 0 1 1
5 0 0 0 0 0
Method three, kind of like "explode" but along axis=0:
pd.get_dummies(pd.DataFrame(df.tokens.tolist()),
               prefix='tokens', prefix_sep='_').sum(level=0, axis=1)
tokens_A tokens_D tokens_Z tokens_B tokens_C
0 1 1 0 1 1
1 1 0 0 1 0
2 0 0 1 0 0
3 0 0 0 0 0
4 0 1 1 0 0
5 0 0 0 0 0
Update
df['tokens'].explode().str.get_dummies().sum(level=0).add_prefix('tokens_')
tokens_England tokens_Germany tokens_Japan tokens_Morocco tokens_Spain
0 1 1 1 0 1
1 0 1 0 0 1
2 0 0 0 1 0
3 0 0 0 0 0
4 1 0 1 0 0
5 0 0 0 0 0
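Since the question's edit notes the data is within quotation marks (strings that merely look like lists), here is a hedged sketch that parses them with ast.literal_eval before applying the explode-and-dummies idea; note the groupby(level=0).sum() stands in for the older .sum(level=0), which is deprecated in recent pandas:

```python
import ast
import pandas as pd

data = {'tokens': ['["Spain", "Germany", "England", "Japan"]',
                   '["Spain", "Germany"]',
                   '["Morocco"]',
                   '[]',
                   '["Japan"]',
                   '[]']}
df = pd.DataFrame(data)

# Parse the string cells into real lists, then explode, one-hot,
# and collapse back to one row per original index.
parsed = df['tokens'].apply(ast.literal_eval)
out = (parsed.explode()            # empty lists become NaN, kept as all-zero rows
             .str.get_dummies()
             .groupby(level=0).sum()
             .add_prefix('tokens_'))
print(out)
```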
