Collapsing multiple indices into groups based on overlapping targets - python

I'm currently looking at the correlation between features in my dataset and need to group features that have similar targets into larger supergroups that can be used for a more general correlation analysis.
The features are one-hot encoded and live in a pandas DataFrame that looks similar to this:
   1 2 3 4 5 6 7 8 9
A  0 0 1 0 0 1 0 1 0
B  0 0 0 1 0 0 0 0 0
C  1 0 0 0 1 0 0 0 0
D  1 0 0 1 0 0 0 0 0
E  0 1 0 1 0 0 0 0 1
I would like the resulting dataframe to look like this:
                 1 2 3 4 5 6 7 8 9
group1(A)        0 0 1 0 0 1 0 1 0
group2(B,D,E,C)  1 1 0 1 1 0 0 0 1
I've already tried all forms of groupby and some of the methods in networkx.

This is a hidden network problem, so we use networkx after a self-merge to find the connected components:
import networkx as nx
from collections import ChainMap

# Long format: one row per (feature, target) pair, keeping only the 1s.
s = df.reset_index().melt('index')
s = s.loc[s.value == 1]
# Self-merge on the target column: features sharing a target become edges.
s = s.merge(s, on='variable')

G = nx.from_pandas_edgelist(s, 'index_x', 'index_y')
l = list(nx.connected_components(G))

# Map every feature to its component id, then collapse each group.
L = dict(ChainMap(*[dict.fromkeys(y, x) for x, y in enumerate(l)]))
df.groupby(L).sum().ge(1).astype(int)
Out[133]:
1 2 3 4 5 6 7 8 9
0 1 1 0 1 1 0 0 0 1
1 0 0 1 0 0 1 0 1 0
L
Out[134]: {'A': 1, 'B': 0, 'C': 0, 'D': 0, 'E': 0}
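If you want the index labels from the question (e.g. group2(B,C,D,E)) rather than 0/1, one possible follow-up using the component list l from above:

# Name each group after its sorted members, matching the question's format.
labels = {x: f'group{i + 1}({",".join(sorted(y))})'
          for i, y in enumerate(l)
          for x in y}
df.groupby(labels).sum().ge(1).astype(int)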

Pandas DF Groupby

I have a dataframe of student responses [S1-S82], with each row's strand corresponding to the response. I want the count of each response given with respect to each strand: if a student marked an answer correctly, I want the strand name and the number of correct responses; if the answer is wrong, I want the strand name and the number of wrong responses (similar to value_counts). I am attaching a screenshot of the dataframe.
https://prnt.sc/1125odu
I have written the following code
data_transposed['Counts'] = data_transposed.groupby(['STRAND-->'])['S1'].transform('count')
but it is really not helping me get what I want. I am looking for an option similar to value_counts to plot the data.
Please look into it and help me. Thank you.
I think you are looking to groupby the Strands for each student S1 thru S82.
Here's how I would do it.
Step 1: Create a DataFrame with groupby Strand--> where the value is 0.
Step 2: Create another DataFrame with groupby Strand--> where the value is 1.
Step 3: Add a column in each of the DataFrames and assign a value of 0 or 1 to mark which data it grouped.
Step 4: Concatenate both DataFrames.
Step 5: Rearrange the columns to have Strand-->, val, then all students S1 thru S82.
Step 6: Sort the DataFrame on Strand--> so you get the values in the right order.
The code is as shown below:
import pandas as pd
import numpy as np

# Sample data: one strand label per row, 82 students with random 0/1 responses.
d = {'Strand-->': ['Geometry', 'Geometry', 'Geometry', 'Geometry', 'Mensuration',
                   'Mensuration', 'Mensuration', 'Geometry', 'Algebra', 'Algebra',
                   'Comparing Quantities', 'Geometry', 'Data Handling', 'Geometry', 'Geometry']}
for i in range(1, 83):
    d['S' + str(i)] = np.random.randint(0, 2, size=15)
df = pd.DataFrame(d)
print(df)

df1 = df.groupby('Strand-->').agg(lambda x: x.eq(0).sum())  # count of 0s per strand
df1['val'] = 0
df2 = df.groupby('Strand-->').agg(lambda x: x.ne(0).sum())  # count of 1s per strand
df2['val'] = 1
df3 = pd.concat([df1, df2]).reset_index()

# Reorder to Strand-->, val, S1..S82, then sort by strand.
dx = [0, -1] + [i for i in range(1, 83)]
df3 = df3[df3.columns[dx]].sort_values('Strand-->').reset_index(drop=True)
print(df3)
The output of this will be as follows:
Original DataFrame:
Strand--> S1 S2 S3 S4 S5 ... S77 S78 S79 S80 S81 S82
0 Geometry 0 1 0 0 1 ... 1 0 0 0 1 0
1 Geometry 0 0 0 1 1 ... 1 1 1 0 0 0
2 Geometry 1 1 1 0 0 ... 0 0 1 0 0 0
3 Geometry 0 1 1 0 1 ... 1 0 0 1 0 1
4 Mensuration 1 1 1 0 1 ... 0 1 1 1 0 0
5 Mensuration 0 1 1 1 0 ... 1 0 0 1 1 0
6 Mensuration 1 0 1 1 1 ... 0 1 0 0 1 0
7 Geometry 1 0 1 1 1 ... 1 1 1 0 0 1
8 Algebra 0 0 1 0 1 ... 1 1 0 0 1 1
9 Algebra 0 1 0 1 1 ... 1 1 1 1 0 1
10 Comparing Quantities 1 1 0 1 1 ... 1 1 0 1 1 0
11 Geometry 1 1 1 1 0 ... 0 0 1 0 1 0
12 Data Handling 1 1 0 0 0 ... 1 0 1 1 0 0
13 Geometry 1 1 1 0 0 ... 1 1 1 1 0 0
14 Geometry 0 1 0 0 1 ... 0 1 1 0 1 0
Updated DataFrame:
Note here that column 'val' will be 0 or 1. If 0, then it is the count of 0s. If 1, then it is the count of 1s.
Strand--> val S1 S2 S3 S4 ... S77 S78 S79 S80 S81 S82
0 Algebra 0 2 1 1 1 ... 0 0 1 1 1 0
1 Algebra 1 0 1 1 1 ... 2 2 1 1 1 2
2 Comparing Quantities 0 0 0 1 0 ... 0 0 1 0 0 1
3 Comparing Quantities 1 1 1 0 1 ... 1 1 0 1 1 0
4 Data Handling 0 0 0 1 1 ... 0 1 0 0 1 1
5 Data Handling 1 1 1 0 0 ... 1 0 1 1 0 0
6 Geometry 0 4 2 3 5 ... 3 4 2 6 5 6
7 Geometry 1 4 6 5 3 ... 5 4 6 2 3 2
8 Mensuration 0 1 1 0 1 ... 2 1 2 1 1 3
9 Mensuration 1 2 2 3 2 ... 1 2 1 2 2 0
For a single student you can do:
df.groupby(['Strand-->', 'S1']).size().to_frame(name='size').reset_index()
If you want to calculate all students at once you can do:
df_m = (pd.melt(df, id_vars=['Strand-->'], value_vars=df.columns[1:])
          .rename({'variable': 'result'}, axis=1)
          .sort_values(['result']))
df_m['result'].groupby([df_m['Strand-->'], df_m['value']]).value_counts().unstack(fill_value=0).reset_index()
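As an aside, pd.crosstab can produce the same per-strand 0/1 counts in one pass; a minimal sketch, assuming the df built above:

# Melt to long form, then cross-tabulate (strand, response value) against student.
# Note the student columns come out in lexicographic order (S1, S10, S11, ...).
m = df.melt(id_vars='Strand-->', var_name='student', value_name='val')
counts = pd.crosstab([m['Strand-->'], m['val']], m['student']).reset_index()
print(counts)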

How to label ascending consecutive numbers onto values in a column?

I have a column that looks something like this:
1
0
0
1
0
0
0
1
I want the output to look something like this:
1 <--
0
0
2 <--
0
0
0
3 <--
And so forth. I'm not sure where to begin. There are about 10,000 rows, and I feel like writing an if statement might take a while. How do I achieve this output?
Efficient and concise (here s is the column as a Series):
s.cumsum()*s
0 1
1 0
2 0
3 2
4 0
5 0
6 0
7 3
dtype: int64
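For reference, a minimal self-contained run of the snippet above, using the column values from the question:

import pandas as pd

s = pd.Series([1, 0, 0, 1, 0, 0, 0, 1])
# cumsum numbers every row; multiplying by s zeroes out the rows that are 0.
print(s.cumsum() * s)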
Use Series.cumsum + Series.where
Here is an example:
print(df)
0
0 1
1 0
2 0
3 1
4 0
5 0
6 0
7 1
df['0'] = df['0'].cumsum().where(df['0'].ne(0), df['0'])
print(df)
0
0 1
1 0
2 0
3 2
4 0
5 0
6 0
7 3
Try this:
s = pd.Series([1,0,0,1,0,0,0,1])
s.cumsum().mask(s==0, 0)
Output:
0 1
1 0
2 0
3 2
4 0
5 0
6 0
7 3
dtype: int64
np.where and cumsum (this assumes numpy is imported as np):
df['cum_sum'] = np.where(df.val > 0, df.val.cumsum(), 0)
output:
val cum_sum
0 1 1
1 0 0
2 0 0
3 1 2
4 0 0
5 0 0
6 0 0
7 1 3
you could do something like this
df = {'col1': [1, 0, 0, 0, 1, 0, 0, 1]}
count = 0
col = []
for val in df['col1']:
    if val == 1:
        count += 1
        col.append(count)
    else:
        col.append(val)
and you get [1, 0, 0, 0, 2, 0, 0, 3]
Select only the rows that are non-zero and replace those values with the cumulative sum:
import pandas as pd

df = pd.DataFrame({'col': [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]})
index = df['col'] != 0
df.loc[index, 'col'] = df.loc[index, 'col'].cumsum()
print(df)
col
0 0
1 1
2 0
3 0
4 2
5 0
6 0
7 0
8 3
9 0

Is there a way to break a pandas column with categories into separate true/false columns with the category name as the column name

I have a dataframe with the following column:
df = pd.DataFrame({"A": [1,2,1,2,2,2,0,1,0]})
and I want:
df2 = pd.DataFrame({"0": [0,0,0,0,0,0,1,0,1], "1": [1,0,1,0,0,0,0,1,0], "2": [0,1,0,1,1,1,0,0,0]})
Is there an elegant way of doing this with a one-liner?
NOTE
I can do this using df['0'] = df['A'].apply(find_zeros)
I don't mind if 'A' is included in the final result.
Use get_dummies:
df2 = pd.get_dummies(df.A)
print (df2)
0 1 2
0 0 1 0
1 0 0 1
2 0 1 0
3 0 0 1
4 0 0 1
5 0 0 1
6 1 0 0
7 0 1 0
8 1 0 0
In [50]: df.A.astype(str).str.get_dummies()
Out[50]:
0 1 2
0 0 1 0
1 0 0 1
2 0 1 0
3 0 0 1
4 0 0 1
5 0 0 1
6 1 0 0
7 0 1 0
8 1 0 0
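If you'd rather keep 'A' alongside the indicator columns (the question allows it), a small sketch:

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 1, 2, 2, 2, 0, 1, 0]})
# astype(int) guards against newer pandas versions where
# get_dummies returns boolean columns.
df2 = df.join(pd.get_dummies(df["A"]).astype(int))
print(df2)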

Encode integer pandas dataframe column to padded 16 bit binary

I would like to encode integers stored in a pandas dataframe column into respective 16-bit binary numbers which correspond to bit positions in those integers. I would also need to pad leading zeros for numbers with corresponding binary less than 16 bits. For example, given one column containing integers ranging from 0 to 33000, for an integer value of 20 (10100 in binary) I would like to produce 16 columns with values 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0 and so on across the entire column.
Setup
Consider the DataFrame df with column 'A':
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=range(16)))
Numpy broadcasting and bit shifting
a = df.A.values
n = int(np.log2(a.max() + 1))  # smallest bit width that holds the max value
b = (a[:, None] >> np.arange(n)[::-1]) & 1  # shift each value by each bit position, keep the low bit
pd.DataFrame(b)
0 1 2 3
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
5 0 1 0 1
6 0 1 1 0
7 0 1 1 1
8 1 0 0 0
9 1 0 0 1
10 1 0 1 0
11 1 0 1 1
12 1 1 0 0
13 1 1 0 1
14 1 1 1 0
15 1 1 1 1
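Note that n here is the minimal width for this data (4 bits for values 0-15); to get the zero-padded 16-bit layout the question asks for, fix the width instead of computing it:

# Fixed 16-bit width, regardless of the column's maximum value.
b16 = (a[:, None] >> np.arange(16)[::-1]) & 1
pd.DataFrame(b16)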
String formatting with f-strings
n = int(np.log2(df.A.max() + 1))
pd.DataFrame([list(map(int, f'{i:0{n}b}')) for i in df.A])
0 1 2 3
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
5 0 1 0 1
6 0 1 1 0
7 0 1 1 1
8 1 0 0 0
9 1 0 0 1
10 1 0 1 0
11 1 0 1 1
12 1 1 0 0
13 1 1 0 1
14 1 1 1 0
15 1 1 1 1
Could you do something like this?
x = 20
bin_string = format(x, '016b')
df = pd.DataFrame(list(bin_string)).T
I don't know enough about what you're trying to do to know if that's sufficient.
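Extending that idea to the whole column, one possible sketch (each zero-padded character becomes its own integer column):

import pandas as pd

df = pd.DataFrame({'A': [0, 5, 20, 33000]})
# format(x, '016b') zero-pads to 16 binary digits; splitting the string
# yields one column per bit.
bits = df['A'].apply(lambda x: pd.Series([int(c) for c in format(x, '016b')]))
print(bits)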

Create a dictionary of dictionaries of frequencies from dataframe

I have a big dataset like this, and I'm trying to build a dictionary of dictionaries from the dataframe that organizes the 'crime' column with the frequencies of the other columns.
train_data
23 Wednesday BAYVIEW CENTRAL INGLESIDE NORTHERN PARK RICHMOND crime
0 1 1 0 0 0 1 0 0 3
1 1 1 0 0 0 1 0 0 1
2 1 1 0 0 0 1 0 0 1
3 1 1 0 0 0 1 0 0 0
4 1 1 0 0 0 0 1 0 0
5 1 1 0 0 1 0 0 0 0
6 1 1 0 0 1 0 0 0 2
7 1 1 1 0 0 0 0 0 2
8 1 1 0 0 0 0 0 1 0
9 1 1 0 1 0 0 0 0 0
So first of all I decided to group the dataframe by the 'crime' column:
train_data=train_data.groupby(['crime']).sum()
23 Wednesday BAYVIEW CENTRAL INGLESIDE NORTHERN PARK RICHMOND
crime
0 5 5 0 1 1 1 1 1
1 2 2 0 0 0 2 0 0
2 2 2 1 0 1 0 0 0
3 1 1 0 0 0 1 0 0
Then I tried to organize them into a dictionary of dictionaries, but I can't make it work; I also tried iterating over it in a few ways, but something always goes wrong with the dataframe.
The result should be something like this:
{0: {'23': 5, 'Wednesday': 5, 'BAYVIEW': 0, 'CENTRAL': 1, ...},
 1: {'23': 2, 'Wednesday': 2, 'BAYVIEW': 0, ...},
 2: {...}, 3: {...}}
You can use:
d = train_data.to_dict(orient='index')
See http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.to_dict.html for more options.
If you're on pandas 0.17.0 or greater as MaxNoe posted:
train_data.groupby('crime').sum().to_dict(orient='index')
otherwise:
train_data.groupby('crime').sum().T.to_dict()
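For reference, a self-contained check built from the grouped table shown above (the frame is reconstructed by hand here purely for illustration):

import pandas as pd

# Reconstruction of the grouped frame from the question.
train_data = pd.DataFrame(
    {'23': [5, 2, 2, 1], 'Wednesday': [5, 2, 2, 1], 'BAYVIEW': [0, 0, 1, 0],
     'CENTRAL': [1, 0, 0, 0], 'INGLESIDE': [1, 0, 1, 0], 'NORTHERN': [1, 2, 0, 1],
     'PARK': [1, 0, 0, 0], 'RICHMOND': [1, 0, 0, 0]},
    index=pd.Index([0, 1, 2, 3], name='crime'))

d = train_data.to_dict(orient='index')
print(d[0])  # {'23': 5, 'Wednesday': 5, 'BAYVIEW': 0, 'CENTRAL': 1, ...}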
