Adding new records to a numpy structured array - python

This is a continuation of an earlier exploration of numpy structured arrays.
A structured array is created from the elements of a list and thereafter populated with values (not shown below).
>>> o = ['x','y','z']
>>> import numpy as np
>>> b = np.zeros((len(o),), dtype=[(i,object) for i in o])
>>> b
array([(0, 0, 0), (0, 0, 0), (0, 0, 0)],
dtype=[('x', '|O4'), ('y', '|O4'), ('z', '|O4')])
The populated array looks as below:
x y z
x 0 1 0
y 1 0 1,5
z 0 1,5 0
1. How do we add new vertices to the above?
2. Once the vertices have been added, what is the cleanest way to add the following array to the structured array (note: not all vertices in this array are new):
d e y
d 0 '1,2' 0
e '1,2' 0 '1'
f 0 '1' 0
The expected output (please bear with me):
x y z d e f
x 0 1 0 0 0 0
y 1 0 1,5 0 1 0
z 0 1,5 0 0 0 0
d 0 0 0 0 1,2 0
e 0 1 0 1,2 0 0
f 0 0 0 0 1 0

Seems like a job for python pandas.
>>> import numpy as np
>>> import pandas as pd
>>> data=np.zeros((4,5))
>>> df=pd.DataFrame(data,columns=['x','y','z','a','b'])
>>> df
x y z a b
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
>>> df['c']=0 #Add a new column
>>> df
x y z a b c
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
>>> new_data=pd.DataFrame([['0','1,2','0'],['1,2','0','1'],['0','1','0']],columns=['d','e','y'])
>>> new_data
d e y
0 0 1,2 0
1 1,2 0 1
2 0 1 0
>>> df.merge(new_data,how='outer') #Merge data
x y z a b c d e
0 0 0 0 0 0 0 NaN NaN
1 0 0 0 0 0 0 NaN NaN
2 0 0 0 0 0 0 NaN NaN
3 0 0 0 0 0 0 NaN NaN
4 NaN 0 NaN NaN NaN NaN 0 1,2
5 NaN 0 NaN NaN NaN NaN 0 1
6 NaN 1 NaN NaN NaN NaN 1,2 0
There are many ways to merge the data you showed; could you explain in more detail exactly what you would like the final array to look like?
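Pending that clarification, here is one hedged sketch of how the merge could go, assuming the vertex labels live on both the index and the columns of each piece, the matrices are kept symmetric, and plain floats stand in for the comma-decimal strings ('1,5', '1,2'): reindex both matrices onto the union of labels and add them.

```python
import pandas as pd

# Original matrix over vertices x, y, z (floats instead of '1,5')
a = pd.DataFrame([[0, 1, 0],
                  [1, 0, 1.5],
                  [0, 1.5, 0]],
                 index=list('xyz'), columns=list('xyz'))

# New, symmetric matrix over d, e, f and the already-known vertex y
b = pd.DataFrame(0.0, index=list('defy'), columns=list('defy'))
b.loc['d', 'e'] = b.loc['e', 'd'] = 1.2
b.loc['e', 'y'] = b.loc['y', 'e'] = 1.0
b.loc['e', 'f'] = b.loc['f', 'e'] = 1.0

# Reindex both onto the union of vertex labels, then add elementwise
labels = a.index.union(b.index)
merged = (a.reindex(index=labels, columns=labels, fill_value=0)
          + b.reindex(index=labels, columns=labels, fill_value=0))
```

Because reindexing pads missing rows and columns with 0, the addition simply overlays the two adjacency matrices; any edge present in both would be summed, so this assumes the overlapping vertices carry no conflicting weights.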

Related

How to split the column of a dataframe

I would like to split the column of a dataframe as follows.
Here is the main dataframe.
import pandas as pd
df_az = pd.DataFrame(list(zip(storage_AZ)),columns =['AZ Combination'])
df_az
Then, I applied this code to split the column.
out_az = (df_az.stack().apply(pd.Series).rename(columns=lambda x: f'a combination').unstack().swaplevel(0,1,axis=1).sort_index(axis=1))
out_az = pd.concat([out_az], axis=1)
out_az.head()
However, the result is not what I expected (the actual and expected outputs were attached as images).
Could anyone help me figure out what to change in the code, please? Thank you in advance.
You can apply np.ravel:
>>> pd.DataFrame.from_records(df_az['AZ Combination'].apply(np.ravel))
0 1 2 3 4 5
0 0 0 0 0 0 0
1 0 0 0 0 0 1
Convert the column to a list and reshape it into a 2d array, so the DataFrame constructor can be used.
Then set the column names; to avoid duplicate column names, a counter is appended:
storage_AZ = [[[0,0,0],[0,0,0]],
[[0,0,0],[0,0,1]],
[[0,0,0],[0,1,0]],
[[0,0,0],[1,0,0]],
[[0,0,0],[1,0,1]]]
df_az = pd.DataFrame(list(zip(storage_AZ)),columns =['AZ Combination'])
N = 3
L = ['a combination','z combination']
df = pd.DataFrame(np.array(df_az['AZ Combination'].tolist()).reshape(df_az.shape[0],-1))
df.columns = [f'{L[a]}_{b}' for a, b in zip(df.columns // N, df.columns % N)]
print(df)
a combination_0 a combination_1 a combination_2 z combination_0 \
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 1
4 0 0 0 1
z combination_1 z combination_2
0 0 0
1 0 1
2 1 0
3 0 0
4 0 1
If need MultiIndex:
df = pd.concat({'AZ Combination':df}, axis=1)
print(df)
AZ Combination \
a combination_0 a combination_1 a combination_2 z combination_0
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 1
4 0 0 0 1
z combination_1 z combination_2
0 0 0
1 0 1
2 1 0
3 0 0
4 0 1

Expand Pandas series into dataframe by unique values

I would like to expand a series or dataframe into a sparse matrix based on the unique values of the series. It's a bit hard to explain verbally but an example should be clearer.
First, the simpler version. If I start with this:
Idx Tag
0 A
1 B
2 A
3 C
4 B
I'd like to get something like this, where the unique values in the starting series are the column values here (could be 1s and 0s, Boolean, etc.):
Idx A B C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
Second, the more advanced version: if I have values associated with each entry, I'd like to preserve those and fill the rest of the matrix with a placeholder (0, NaN, or something else), e.g. starting from this:
Idx Tag Val
0 A 5
1 B 2
2 A 3
3 C 7
4 B 1
And ending up with this:
Idx A B C
0 5 0 0
1 0 2 0
2 3 0 0
3 0 0 7
4 0 1 0
What's a Pythonic way to do this?
Here's how to do it, using pandas.get_dummies(), which was designed specifically for this (often called "one-hot encoding" in ML). I've done it step by step so you can see how it's done ;)
>>> df
Idx Tag Val
0 0 A 5
1 1 B 2
2 2 A 3
3 3 C 7
4 4 B 1
>>> pd.get_dummies(df['Tag'])
A B C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
>>> pd.concat([df[['Idx']], pd.get_dummies(df['Tag'])], axis=1)
Idx A B C
0 0 1 0 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 0 1 0
>>> pd.get_dummies(df['Tag']).to_numpy()
array([[1, 0, 0],
[0, 1, 0],
[1, 0, 0],
[0, 0, 1],
[0, 1, 0]], dtype=uint8)
>>> df[['Val']].to_numpy()
array([[5],
[2],
[3],
[7],
[1]])
>>> pd.get_dummies(df['Tag']).to_numpy() * df[['Val']].to_numpy()
array([[5, 0, 0],
[0, 2, 0],
[3, 0, 0],
[0, 0, 7],
[0, 1, 0]])
>>> pd.DataFrame(pd.get_dummies(df['Tag']).to_numpy() * df[['Val']].to_numpy(), columns=df['Tag'].unique())
A B C
0 5 0 0
1 0 2 0
2 3 0 0
3 0 0 7
4 0 1 0
>>> pd.concat([df, pd.DataFrame(pd.get_dummies(df['Tag']).to_numpy() * df[['Val']].to_numpy(), columns=df['Tag'].unique())], axis=1)
Idx Tag Val A B C
0 0 A 5 5 0 0
1 1 B 2 0 2 0
2 2 A 3 3 0 0
3 3 C 7 0 0 7
4 4 B 1 0 1 0
Based on @user17242583's answer, I found a pretty simple way to do it using pd.get_dummies combined with DataFrame.multiply:
>>> df
Tag Val
0 A 5
1 B 2
2 A 3
3 C 7
4 B 1
>>> pd.get_dummies(df['Tag'])
A B C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
>>> pd.get_dummies(df['Tag']).multiply(df['Val'], axis=0)
A B C
0 5 0 0
1 0 2 0
2 3 0 0
3 0 0 7
4 0 1 0
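For completeness, a similar result can be had with DataFrame.pivot; this is a sketch, assuming the placeholder should be 0 and the default integer index is acceptable:

```python
import pandas as pd

df = pd.DataFrame({'Tag': ['A', 'B', 'A', 'C', 'B'],
                   'Val': [5, 2, 3, 7, 1]})

# Spread Tag values into columns, keeping Val where it belongs;
# everything else comes out NaN, which we fill with 0
wide = df.pivot(columns='Tag', values='Val').fillna(0).astype(int)
```

Unlike the get_dummies route, pivot carries the values directly, so there is no intermediate 0/1 matrix to multiply.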

How to replace integer values in pandas in python?

I have a pandas dataframe as follows.
a b c d e
a 0 1 0 1 1
b 1 0 1 6 3
c 0 1 0 1 2
d 5 1 1 0 8
e 1 3 2 8 0
I want to replace values below 6 (i.e. <= 5) with 0, so my output should be as follows.
a b c d e
a 0 0 0 0 0
b 0 0 0 6 0
c 0 0 0 0 0
d 0 0 0 0 8
e 0 0 0 8 0
I was trying to do this using the following code.
df['a'].replace([1, 2, 3, 4, 5], 0)
df['b'].replace([1, 2, 3, 4, 5], 0)
df['c'].replace([1, 2, 3, 4, 5], 0)
df['d'].replace([1, 2, 3, 4, 5], 0)
df['e'].replace([1, 2, 3, 4, 5], 0)
However, I am sure that there is an easier way of doing this task in pandas.
I am happy to provide more details if needed.
Using mask:
df=df.mask(df<=5,0)
df
Out[380]:
a b c d e
a 0 0 0 0 0
b 0 0 0 6 0
c 0 0 0 0 0
d 0 0 0 0 8
e 0 0 0 8 0
For performance, I recommend np.where. You can assign the array back inplace using sliced assignment (df[:] = ...).
df[:] = np.where(df < 6, 0, df)
df
a b c d e
a 0 0 0 0 0
b 0 0 0 6 0
c 0 0 0 0 0
d 0 0 0 0 8
e 0 0 0 8 0
Another option involves fillna:
df[df>=6].fillna(0, downcast='infer')
a b c d e
a 0 0 0 0 0
b 0 0 0 6 0
c 0 0 0 0 0
d 0 0 0 0 8
e 0 0 0 8 0
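The mask call above has a mirror image in DataFrame.where, which keeps values where the condition holds and replaces the rest; a small runnable sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 0, 1, 1],
                   [1, 0, 1, 6, 3],
                   [0, 1, 0, 1, 2],
                   [5, 1, 1, 0, 8],
                   [1, 3, 2, 8, 0]],
                  index=list('abcde'), columns=list('abcde'))

# Keep values >= 6, replace everything else with 0
out = df.where(df >= 6, 0)
```

`df.where(cond, other)` and `df.mask(cond, other)` are complements: where keeps values where cond is True, mask replaces them there.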

Is it possible to do a boolean OR row by row in pandas?

I would like to 'OR' the rows together (row with row+1), for example:
A B C D E F G
r0 0 1 1 0 0 1 0
r1 0 0 0 0 0 0 0
r2 0 0 1 0 1 0 1
and the expected output will be like this
result 0 1 1 0 1 1
I know only how to sum it.
df.loc['result'] = df.sum()
but in this case I would like to do an OR.
Thank you in advance.
You can apply any along the first axis (axis=0).
>>> df
A B C D E F G
r0 0 1 1 0 0 1 0
r1 0 0 0 0 0 0 0
r2 0 0 1 0 1 0 1
>>> df.loc['result'] = df.any(axis=0).astype(int)
>>> df
A B C D E F G
r0 0 1 1 0 0 1 0
r1 0 0 0 0 0 0 0
r2 0 0 1 0 1 0 1
result 0 1 1 0 1 1 1
... assuming that in your output you forgot the last column.
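An equivalent formulation of the approach above, shown as a self-contained sketch, folds all the rows together with NumPy's bitwise OR instead of any:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 1, 0, 0, 1, 0],
                   [0, 0, 0, 0, 0, 0, 0],
                   [0, 0, 1, 0, 1, 0, 1]],
                  index=['r0', 'r1', 'r2'], columns=list('ABCDEFG'))

# OR all rows together, column by column (same result as df.any(axis=0))
result = np.bitwise_or.reduce(df.to_numpy(), axis=0)
```

Since the data is 0/1 integers, bitwise OR and boolean OR coincide; the reduce runs the OR pairwise down the rows, which matches the "row with row+1" phrasing in the question.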

Pandas DataFrame with levels of graph nodes and edges to square matrix

My Googlefu has failed me!
I have a Pandas DataFrame of the form:
Level 1 Level 2 Level 3 Level 4
-------------------------------------
A B C NaN
A B D E
A B D F
G H NaN NaN
G I J K
It basically contains nodes of a graph, with the levels depicting an outgoing edge from a lower-order level to a higher-order level. I want to convert the DataFrame/create a new DataFrame of the form:
A B C D E F G H I J K
---------------------------------------------
A | 0 1 0 0 0 0 0 0 0 0 0
B | 0 0 1 1 0 0 0 0 0 0 0
C | 0 0 0 0 0 0 0 0 0 0 0
D | 0 0 0 0 1 1 0 0 0 0 0
E | 0 0 0 0 0 0 0 0 0 0 0
F | 0 0 0 0 0 0 0 0 0 0 0
G | 0 0 0 0 0 0 0 1 1 0 0
H | 0 0 0 0 0 0 0 0 0 0 0
I | 0 0 0 0 0 0 0 0 0 1 0
J | 0 0 0 0 0 0 0 0 0 0 1
K | 0 0 0 0 0 0 0 0 0 0 0
A cell containing 1 depicts an outgoing edge from the corresponding row to the corresponding column. Is there a Pythonic way to achieve this without loops and conditions in Pandas?
Try this code:
df = pd.DataFrame({'level_1': ['A', 'A', 'A', 'G', 'G'],
                   'level_2': ['B', 'B', 'B', 'H', 'I'],
                   'level_3': ['C', 'D', 'D', np.nan, 'J'],
                   'level_4': [np.nan, 'E', 'F', np.nan, 'K']})
Your input dataframe is:
level_1 level_2 level_3 level_4
0 A B C NaN
1 A B D E
2 A B D F
3 G H NaN NaN
4 G I J K
And the solution is:
# Get unique values from the input dataframe and filter out NaN values
list_nodes = []
for i_col in df.columns.tolist():
    list_nodes.extend(filter(lambda v: v == v, df[i_col].unique().tolist()))
# Initialize the result dataframe
df_res = pd.DataFrame(columns=sorted(list_nodes), index=sorted(list_nodes))
df_res = df_res.fillna(0)
# Get 'index-column' pairs from the input dataframe (NaNs are excluded)
list_indexes = []
for i_col in range(df.shape[1] - 1):
    list_indexes.extend(set(tuple(i) for i in df.iloc[:, i_col:i_col + 2]
                            .dropna(axis=0).values.tolist()))
# Use the 'index-column' pairs to fill the result dataframe
# (DataFrame.set_value was removed in pandas 1.0; use .at instead)
for src, dst in list_indexes:
    df_res.at[src, dst] = 1
And the final result is:
A B C D E F G H I J K
A 0 1 0 0 0 0 0 0 0 0 0
B 0 0 1 1 0 0 0 0 0 0 0
C 0 0 0 0 0 0 0 0 0 0 0
D 0 0 0 0 1 1 0 0 0 0 0
E 0 0 0 0 0 0 0 0 0 0 0
F 0 0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 1 1 0 0
H 0 0 0 0 0 0 0 0 0 0 0
I 0 0 0 0 0 0 0 0 0 1 0
J 0 0 0 0 0 0 0 0 0 0 1
K 0 0 0 0 0 0 0 0 0 0 0
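A loop-free alternative (a sketch; it assumes edge weights are 0/1 and that adjacent level columns encode the edges) stacks every adjacent pair of level columns into an edge list and counts it with pd.crosstab:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'level_1': ['A', 'A', 'A', 'G', 'G'],
                   'level_2': ['B', 'B', 'B', 'H', 'I'],
                   'level_3': ['C', 'D', 'D', np.nan, 'J'],
                   'level_4': [np.nan, 'E', 'F', np.nan, 'K']})

# Stack every adjacent pair of level columns into (src, dst) edge rows
cols = df.columns
edges = pd.concat(
    [df[[cols[i], cols[i + 1]]].set_axis(['src', 'dst'], axis=1)
     for i in range(len(cols) - 1)]
).dropna().drop_duplicates()

# Count the edges into a matrix, then pad it to a square over all nodes
nodes = sorted(set(edges['src']) | set(edges['dst']))
adj = (pd.crosstab(edges['src'], edges['dst'])
         .reindex(index=nodes, columns=nodes, fill_value=0))
```

drop_duplicates collapses repeated edges (e.g. the three A-B rows) so crosstab yields 0/1 counts, and the reindex fills in the all-zero rows and columns for sink nodes like C, E, or K.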
