This is a continuation of an earlier learning on numpy arrays.
A structured array is created from the elements of a list - and thereafter populated with values(not shown below).
>>> o = ['x','y','z']
>>> import numpy as np
>>> b = np.zeros((len(o),), dtype=[(i,object) for i in o])
>>> b
array([(0, 0, 0, 0, 0), (0, 0, 0, 0, 0), (0, 0, 0, 0, 0)],
dtype=[('x', '|O4'), ('y', '|O4'), ('z', '|O4')])
The populated array looks as below:
x y z
x 0 1 0
y 1 0 1,5
z 0 1,5 0
1.How do we add new vertices to the above?
2.Once the vertices have been added,what is the cleanest process to add the following array to the structured array(NOTE:not all vertices in this array are new):
d e y
d 0 '1,2' 0
e '1,2' 0 '1'
f 0 '1' 0
The expected output(please bear with me):
x y z d e f
x 0 1 0 0 0 0
y 1 0 1,5 0 1 0
z 0 1,5 0 0 0 0
d 0 0 0 0 1,2 0
e 0 1 0 1,2 0 0
f 0 0 0 0 1 0
Seems like a job for python pandas.
>>> import numpy as np
>>> import pandas as pd
>>> data=np.zeros((4,5))
>>> df=pd.DataFrame(data,columns=['x','y','z','a','b'])
>>> df
x y z a b
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
>>> df['c']=0 #Add a new column
>>> df
x y z a b c
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
>>> new_data=pd.DataFrame([['0','1,2','0'],['1,2','0','1'],['0','1','0']],columns=['d','e','y'])
>>> new_data
d e y
0 0 1,2 0
1 1,2 0 1
2 0 1 0
>>> df.merge(new_data,how='outer') #Merge data
x y z a b c d e
0 0 0 0 0 0 0 NaN NaN
1 0 0 0 0 0 0 NaN NaN
2 0 0 0 0 0 0 NaN NaN
3 0 0 0 0 0 0 NaN NaN
4 NaN 0 NaN NaN NaN NaN 0 1,2
5 NaN 0 NaN NaN NaN NaN 0 1
6 NaN 1 NaN NaN NaN NaN 1,2 0
There are many ways to merge the data that you showed, can you please explain in more detail exactly what you would like the ending array to look like?
Related
I would like to split the column of a dataframe as follows.
Here is the main dataframe.
import pandas as pd
df_az = pd.DataFrame(list(zip(storage_AZ)),columns =['AZ Combination'])
df_az
Then, I applied this code to split the column.
out_az = (df_az.stack().apply(pd.Series).rename(columns=lambda x: f'a combination').unstack().swaplevel(0,1,axis=1).sort_index(axis=1))
out_az = pd.concat([out_az], axis=1)
out_az.head()
However, the result is as follows.
Meanwhile, the expected result is:
Could anyone help me what to change on the code, please? Thank you in advance.
You can apply np.ravel:
>>> pd.DataFrame.from_records(df_az['AZ Combination'].apply(np.ravel))
0 1 2 3 4 5
0 0 0 0 0 0 0
1 0 0 0 0 0 1
Convert column to list and reshape for 2d array, so possible use Dataframe contructor.
Then set columns names, for avoid duplicated columns names are add counter:
storage_AZ = [[[0,0,0],[0,0,0]],
[[0,0,0],[0,0,1]],
[[0,0,0],[0,1,0]],
[[0,0,0],[1,0,0]],
[[0,0,0],[1,0,1]]]
df_az = pd.DataFrame(list(zip(storage_AZ)),columns =['AZ Combination'])
N = 3
L = ['a combination','z combination']
df = pd.DataFrame(np.array(df_az['AZ Combination'].tolist()).reshape(df_az.shape[0],-1))
df.columns = [f'{L[a]}_{b}' for a, b in zip(df.columns // N, df.columns % N)]
print(df)
a combination_0 a combination_1 a combination_2 z combination_0 \
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 1
4 0 0 0 1
z combination_1 z combination_2
0 0 0
1 0 1
2 1 0
3 0 0
4 0 1
If need MultiIndex:
df = pd.concat({'AZ Combination':df}, axis=1)
print(df)
AZ Combination \
a combination_0 a combination_1 a combination_2 z combination_0
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 1
4 0 0 0 1
z combination_1 z combination_2
0 0 0
1 0 1
2 1 0
3 0 0
4 0 1
I would like to expand a series or dataframe into a sparse matrix based on the unique values of the series. It's a bit hard to explain verbally but an example should be clearer.
First, simpler version - if I start with this:
Idx Tag
0 A
1 B
2 A
3 C
4 B
I'd like to get something like this, where the unique values in the starting series are the column values here (could be 1s and 0s, Boolean, etc.):
Idx A B C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
Second, more advanced version - if I have values associated with each entry, preserving those and filling the rest of the matrix with a placeholder (0, NaN, something else), e.g. starting from this:
Idx Tag Val
0 A 5
1 B 2
2 A 3
3 C 7
4 B 1
And ending up with this:
Idx A B C
0 5 0 0
1 0 2 0
2 3 0 0
3 0 0 7
4 0 1 0
What's a Pythonic way to do this?
Here's how to do it, using pandas.get_dummies() which was designed specifically for this (often called "one-hot-encoding" in ML). I've done it step-by-step so you can see how it's done ;)
>>> df
Idx Tag Val
0 0 A 5
1 1 B 2
2 2 A 3
3 3 C 7
4 4 B 1
>>> pd.get_dummies(df['Tag'])
A B C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
>>> pd.concat([df[['Idx']], pd.get_dummies(df['Tag'])], axis=1)
Idx A B C
0 0 1 0 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 0 1 0
>>> pd.get_dummies(df['Tag']).to_numpy()
array([[1, 0, 0],
[0, 1, 0],
[1, 0, 0],
[0, 0, 1],
[0, 1, 0]], dtype=uint8)
>>> df2[['Val']].to_numpy()
array([[5],
[2],
[3],
[7],
[1]])
>>> pd.get_dummies(df2['Tag']).to_numpy() * df2[['Val']].to_numpy()
array([[5, 0, 0],
[0, 2, 0],
[3, 0, 0],
[0, 0, 7],
[0, 1, 0]])
>>> pd.DataFrame(pd.get_dummies(df['Tag']).to_numpy() * df[['Val']].to_numpy(), columns=df['Tag'].unique())
A B C
0 5 0 0
1 0 2 0
2 3 0 0
3 0 0 7
4 0 1 0
>>> pd.concat([df, pd.DataFrame(pd.get_dummies(df['Tag']).to_numpy() * df[['Val']].to_numpy(), columns=df['Tag'].unique())], axis=1)
Idx Tag Val A B C
0 0 A 5 5 0 0
1 1 B 2 0 2 0
2 2 A 3 3 0 0
3 3 C 7 0 0 7
4 4 B 1 0 1 0
Based on #user17242583 's answer, found a pretty simple way to do it using pd.get_dummies combined with DataFrame.multiply:
>>> df
Tag Val
0 A 5
1 B 2
2 A 3
3 C 7
4 B 1
>>> pd.get_dummies(df['Tag'])
A B C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
>>> pd.get_dummies(df['Tag']).multiply(df['Val'], axis=0)
A B C
0 5 0 0
1 0 2 0
2 3 0 0
3 0 0 7
4 0 1 0
I have a pandas dataframe as follows.
a b c d e
a 0 1 0 1 1
b 1 0 1 6 3
c 0 1 0 1 2
d 5 1 1 0 8
e 1 3 2 8 0
I want to replace values that is below 6 <=5 with 0. So my output should be as follows.
a b c d e
a 0 0 0 0 0
b 0 0 0 6 0
c 0 0 0 0 0
d 0 0 0 0 8
e 0 0 0 8 0
I was trying to do this using the following code.
df['a'].replace([1, 2, 3, 4, 5], 0)
df['b'].replace([1, 2, 3, 4, 5], 0)
df['c'].replace([1, 2, 3, 4, 5], 0)
df['d'].replace([1, 2, 3, 4, 5], 0)
df['e'].replace([1, 2, 3, 4, 5], 0)
However, I am sure that there is a more easy way of doing this task in pandas.
I am happy to provide more details if needed.
Using mask
df=df.mask(df<=5,0)
df
Out[380]:
a b c d e
a 0 0 0 0 0
b 0 0 0 6 0
c 0 0 0 0 0
d 0 0 0 0 8
e 0 0 0 8 0
For performance, I recommend np.where. You can assign the array back inplace using sliced assignment (df[:] = ...).
df[:] = np.where(df < 6, 0, df)
df
a b c d e
a 0 0 0 0 0
b 0 0 0 6 0
c 0 0 0 0 0
d 0 0 0 0 8
e 0 0 0 8 0
Another option involves fillna:
df[df>=6].fillna(0, downcast='infer')
a b c d e
a 0 0 0 0 0
b 0 0 0 6 0
c 0 0 0 0 0
d 0 0 0 0 8
e 0 0 0 8 0
I would like to 'OR' between row and row+1
for example,
A B C D E F G
r0 0 1 1 0 0 1 0
r1 0 0 0 0 0 0 0
r2 0 0 1 0 1 0 1
and the expected output will be like this
result 0 1 1 0 1 1
I know only how to sum it.
df.loc['result'] = df.sum()
but in this case i would like to do OR
thank you in advance
You can apply any over the first axis.
>>> df
>>>
A B C D E F G
r0 0 1 1 0 0 1 0
r1 0 0 0 0 0 0 0
r2 0 0 1 0 1 0 1
>>>
>>> df.loc['result'] = df.any(axis=0).astype(int)
>>> df
>>>
A B C D E F G
r0 0 1 1 0 0 1 0
r1 0 0 0 0 0 0 0
r2 0 0 1 0 1 0 1
result 0 1 1 0 1 1 1
... assuming that in your output you forgot the last column.
My Googlefu has failed me!
I have a Pandas DataFrame of the form:
Level 1 Level 2 Level 3 Level 4
-------------------------------------
A B C NaN
A B D E
A B D F
G H NaN NaN
G I J K
It basically contains nodes of a graph with the levels depicting an outgoing edge from a level of lower order to a level of a higher order. I want to convert the DataFrame/create a new DataFrame of the form:
A B C D E F G H I J K
---------------------------------------------
A | 0 1 0 0 0 0 0 0 0 0 0
B | 0 0 1 1 0 0 0 0 0 0 0
C | 0 0 0 0 0 0 0 0 0 0 0
D | 0 0 0 0 1 1 0 0 0 0 0
E | 0 0 0 0 0 0 0 0 0 0 0
F | 0 0 0 0 0 0 0 0 0 0 0
G | 0 0 0 0 0 0 0 1 1 0 0
H | 0 0 0 0 0 0 0 0 0 0 0
I | 0 0 0 0 0 0 0 0 0 1 0
J | 0 0 0 0 0 0 0 0 0 0 1
K | 0 0 0 0 0 0 0 0 0 0 0
A cell containing 1 depicts an outgoing edge from the corresponding row to the corresponding column. Is there a Pythonic way to achieve this without loops and conditions in Pandas?
Try this code:
df = pd.DataFrame({'level_1':['A', 'A', 'A', 'G', 'G'], 'level_2':['B', 'B', 'B', 'H', 'I'],
'level_3':['C', 'D', 'D', np.nan, 'J'], 'level_4':[np.nan, 'E', 'F', np.nan, 'K']})
Your input dataframe is:
level_1 level_2 level_3 level_4
0 A B C NaN
1 A B D E
2 A B D F
3 G H NaN NaN
4 G I J K
And the solution is:
# Get unique values from input dataframe and filter out 'nan' values
list_nodes = []
for i_col in df.columns.tolist():
list_nodes.extend(filter(lambda v: v==v, df[i_col].unique().tolist()))
# Initialize your result dataframe
df_res = pd.DataFrame(columns=sorted(list_nodes), index=sorted(list_nodes))
df_res = df_res.fillna(0)
# Get 'index-column' pairs from input dataframe ('nan's are exluded)
list_indexes = []
for i_col in range(df.shape[1]-1):
list_indexes.extend(list(set([tuple(i) for i in df.iloc[:, i_col:i_col+2]\
.dropna(axis=0).values.tolist()])))
# Use 'index-column' pairs to fill the result dataframe
for i_list_indexes in list_indexes:
df_res.set_value(i_list_indexes[0], i_list_indexes[1], 1)
And the final result is:
A B C D E F G H I J K
A 0 1 0 0 0 0 0 0 0 0 0
B 0 0 1 1 0 0 0 0 0 0 0
C 0 0 0 0 0 0 0 0 0 0 0
D 0 0 0 0 1 1 0 0 0 0 0
E 0 0 0 0 0 0 0 0 0 0 0
F 0 0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 1 1 0 0
H 0 0 0 0 0 0 0 0 0 0 0
I 0 0 0 0 0 0 0 0 0 1 0
J 0 0 0 0 0 0 0 0 0 0 1
K 0 0 0 0 0 0 0 0 0 0 0