Python Pandas: DataFrame modification with diagonal value = 0 [duplicate] - python

This question already has answers here:
Set values on the diagonal of pandas.DataFrame
(8 answers)
Closed 5 years ago.
I have a pandas DataFrame question. I have a df whose index is the same as its columns. It looks like this:
df:
DNA Cat2
Item A B C D E F F H I J .......
DNA Item
Cat2 A 812 62 174 0 4 46 46 7 2 15
B 62 427 27 0 0 12 61 2 4 11
C 174 27 174 0 0 13 22 5 2 4
D 0 0 0 0 0 0 0 0 0 0
E 4 0 0 0 130 10 57 33 4 5
F 46 12 13 0 10 187 4 5 0 0
......
In other words, df equals df.transpose(). All I want is a pandas function (or a numpy function applied to df.values) that sets the values where index equals column, i.e. the diagonal, to 0. My ideal output would be as below.
df:
DNA Cat2
Item A B C D E F F H I J .......
DNA Item
Cat2 A 0 62 174 0 4 46 46 7 2 15
B 62 0 27 0 0 12 61 2 4 11
C 174 27 0 0 0 13 22 5 2 4
D 0 0 0 0 0 0 0 0 0 0
E 4 0 0 0 0 10 57 33 4 5
F 46 12 13 0 10 0 4 5 0 0
......
Is there a Python function that makes this step very fast? I tried a for loop with df.iloc[i,i]=0, but since my dataset is very big, it takes a long time to finish. Thanks in advance!

Setup
import numpy as np
import pandas as pd

np.random.seed([3,1415])
i = pd.MultiIndex.from_product(
    [['Cat2'], list('ABCDEFGHIJ')],
    names=['DNA', 'Item']
)
a = np.random.randint(5, size=(10, 10))
df = pd.DataFrame(a + a.T + 1, i, i)
df
DNA Cat2
Item A B C D E F G H I J
DNA Item
Cat2 A 1 6 6 7 7 7 4 4 8 2
B 6 1 3 6 1 6 6 4 8 5
C 6 3 9 8 9 6 7 8 4 9
D 7 6 8 1 6 9 4 5 4 3
E 7 1 9 6 9 7 3 7 2 6
F 7 6 6 9 7 9 3 4 6 6
G 4 6 7 4 3 3 9 4 5 5
H 4 4 8 5 7 4 4 5 4 5
I 8 8 4 4 2 6 5 4 9 7
J 2 5 9 3 6 6 5 5 7 3
Option 1
The simplest way is to multiply by 1 minus the identity matrix
df * (1 - np.eye(len(df), dtype=int))
DNA Cat2
Item A B C D E F G H I J
DNA Item
Cat2 A 0 6 6 7 7 7 4 4 8 2
B 6 0 3 6 1 6 6 4 8 5
C 6 3 0 8 9 6 7 8 4 9
D 7 6 8 0 6 9 4 5 4 3
E 7 1 9 6 0 7 3 7 2 6
F 7 6 6 9 7 0 3 4 6 6
G 4 6 7 4 3 3 0 4 5 5
H 4 4 8 5 7 4 4 0 4 5
I 8 8 4 4 2 6 5 4 0 7
J 2 5 9 3 6 6 5 5 7 0
Option 2
However, we can also use pd.DataFrame.mask with np.eye. Masking is nice because it still works when the data is not numeric.
df.mask(np.eye(len(df), dtype=bool), 0)
DNA Cat2
Item A B C D E F G H I J
DNA Item
Cat2 A 0 6 6 7 7 7 4 4 8 2
B 6 0 3 6 1 6 6 4 8 5
C 6 3 0 8 9 6 7 8 4 9
D 7 6 8 0 6 9 4 5 4 3
E 7 1 9 6 0 7 3 7 2 6
F 7 6 6 9 7 0 3 4 6 6
G 4 6 7 4 3 3 0 4 5 5
H 4 4 8 5 7 4 4 0 4 5
I 8 8 4 4 2 6 5 4 0 7
J 2 5 9 3 6 6 5 5 7 0
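Both options above build a full n x n helper array. Since the question is about speed on a large frame, another possibility is NumPy's np.fill_diagonal, which writes the diagonal in place on an array. A minimal sketch, assuming all columns share one numeric dtype; the small frame and the names arr and result are made up for illustration:

```python
import numpy as np
import pandas as pd

# Small symmetric frame standing in for the real data (illustrative only)
labels = list("ABCD")
a = np.arange(16).reshape(4, 4)
df = pd.DataFrame(a + a.T + 1, index=labels, columns=labels)

# Work on an explicit copy so the original frame is left untouched
arr = df.to_numpy(copy=True)

# Overwrite the main diagonal in place; no n x n mask is built
np.fill_diagonal(arr, 0)

# Re-wrap the array with the original labels
result = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(result)
```

Copying first avoids relying on whether df.values returns a writable view, which can differ between pandas versions and dtypes.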
Option 3
In the event the columns and indices are not identical, or they are out of order, we can use equality to tell us where to mask.
d = df.iloc[::-1]
d.mask(d.index == d.columns.values[:, None], 0)
DNA Cat2
Item A B C D E F G H I J
DNA Item
Cat2 J 2 5 9 3 6 6 5 5 7 0
I 8 8 4 4 2 6 5 4 0 7
H 4 4 8 5 7 4 4 0 4 5
G 4 6 7 4 3 3 0 4 5 5
F 7 6 6 9 7 0 3 4 6 6
E 7 1 9 6 0 7 3 7 2 6
D 7 6 8 0 6 9 4 5 4 3
C 6 3 0 8 9 6 7 8 4 9
B 6 0 3 6 1 6 6 4 8 5
A 0 6 6 7 7 7 4 4 8 2
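To make the label comparison explicit, here is a small sketch on a frame with a plain (non-MultiIndex) index whose rows are in a different order from the columns. The frame and variable names are hypothetical. The row labels are turned into a column vector and compared against the column labels, so mask[i, j] is True exactly where row label i equals column label j:

```python
import numpy as np
import pandas as pd

# Rows deliberately ordered differently from the columns (illustrative data)
df = pd.DataFrame(
    np.arange(1, 10).reshape(3, 3),
    index=list("BCA"),
    columns=list("ABC"),
)

# Shape (3, 1) row labels broadcast against shape (3,) column labels -> (3, 3)
mask = df.index.to_numpy()[:, None] == df.columns.to_numpy()

# Replace every cell whose row label matches its column label
result = df.mask(mask, 0)
print(result)
```

With this orientation the masked cells are (B, B), (C, C) and (A, A), wherever they happen to sit in the frame.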

Related

Fixing columns in sequence

I have pasted part of the df below, but the actual df has more than 400 columns.
>>> df_final
c d name e f g h g h
0 0 0 aa 0 0 0 0 0 0
1 1 2 bb 1 2 1 2 1 2
2 2 4 cc 2 4 2 4 2 4
3 3 6 dd 3 6 3 6 3 6
4 4 8 ee 4 8 4 8 4 8
5 5 10 ff 5 10 5 10 5 10
6 6 12 gg 6 12 6 12 6 12
I want 'name' and 'c' in the first and second positions; the order of the other columns doesn't matter. I would like to use
cols = ['name' , 'c']
col_position = [1 , 2]
How can I re-order the data frame using the lists cols and col_position?
How can I set the datatype to str for cols and float for the other columns?
Thanks in advance
I think you need:
df1 = df[cols + np.setdiff1d(df.columns, cols).tolist()]
print (df1)
name c d e f g g.1 h h.1
0 aa 0 0 0 0 0 0 0 0
1 bb 1 2 1 2 1 1 2 2
2 cc 2 4 2 4 2 2 4 4
3 dd 3 6 3 6 3 3 6 6
4 ee 4 8 4 8 4 4 8 8
5 ff 5 10 5 10 5 5 10 10
6 gg 6 12 6 12 6 6 12 12
And:
c1 = df.columns[col_position].tolist()
df1 = df[c1 + np.setdiff1d(df.columns, c1).tolist()]
print (df1)
d name c e f g g.1 h h.1
0 0 aa 0 0 0 0 0 0 0
1 2 bb 1 1 2 1 1 2 2
2 4 cc 2 2 4 2 2 4 4
3 6 dd 3 3 6 3 3 6 6
4 8 ee 4 4 8 4 4 8 8
5 10 ff 5 5 10 5 5 10 10
6 12 gg 6 6 12 6 6 12 12
Alternative with select by positions:
c1 = np.arange(len(df.columns))
df1 = df.iloc[:, col_position + np.setdiff1d(c1, col_position).tolist()]
print (df1)
d name c e f g h g.1 h.1
0 0 aa 0 0 0 0 0 0 0
1 2 bb 1 1 2 1 2 1 2
2 4 cc 2 2 4 2 4 2 4
3 6 dd 3 3 6 3 6 3 6
4 8 ee 4 4 8 4 8 4 8
5 10 ff 5 5 10 5 10 5 10
6 12 gg 6 6 12 6 12 6 12
Construct a list to slice by
cols = ['name', 'c']
df[cols + df.columns.difference(cols).tolist()]
name c d e f g g.1 h h.1
0 aa 0 0 0 0 0 0 0 0
1 bb 1 2 1 2 1 1 2 2
2 cc 2 4 2 4 2 2 4 4
3 dd 3 6 3 6 3 3 6 6
4 ee 4 8 4 8 4 4 8 8
5 ff 5 10 5 10 5 5 10 10
6 gg 6 12 6 12 6 6 12 12
Slice, drop, and join
cols = ['name', 'c']
df[cols].join(df.drop(columns=cols))
name c d e f g h g.1 h.1
0 aa 0 0 0 0 0 0 0 0
1 bb 1 2 1 2 1 2 1 2
2 cc 2 4 2 4 2 4 2 4
3 dd 3 6 3 6 3 6 3 6
4 ee 4 8 4 8 4 8 4 8
5 ff 5 10 5 10 5 10 5 10
6 gg 6 12 6 12 6 12 6 12
Slice, drop, and concat
cols = ['name', 'c']
pd.concat([df[cols], df.drop(columns=cols)], axis=1)
name c d e f g h g.1 h.1
0 aa 0 0 0 0 0 0 0 0
1 bb 1 2 1 2 1 2 1 2
2 cc 2 4 2 4 2 4 2 4
3 dd 3 6 3 6 3 6 3 6
4 ee 4 8 4 8 4 8 4 8
5 ff 5 10 5 10 5 10 5 10
6 gg 6 12 6 12 6 12 6 12
By position with iloc
positions = df.columns.map({'name': 0, 'c': 1}.get).argsort()
df.iloc[:, positions]
name c d e f g h g.1 h.1
0 aa 0 0 0 0 0 0 0 0
1 bb 1 2 1 2 1 2 1 2
2 cc 2 4 2 4 2 4 2 4
3 dd 3 6 3 6 3 6 3 6
4 ee 4 8 4 8 4 8 4 8
5 ff 5 10 5 10 5 10 5 10
6 gg 6 12 6 12 6 12 6 12
Or with a focus on OP's vars
cols = ['name' , 'c']
col_position = [1 , 2]
m = dict(zip(cols, col_position))
positions = df.columns.map(m.get).argsort()
df.iloc[:, positions]
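I tried this:
l = df.columns.tolist()  # a plain list; the array from .values has no remove/insert
cols = ['name', 'c']
col_position = [1, 2]  # treated as 1-based target positions
for u in zip(cols, col_position):
    l.remove(u[0])
    l.insert(u[1] - 1, u[0])
df = df[l]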
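The second part of the question (str for the columns in cols, float for the rest) is not covered by the answers above. A minimal sketch, assuming unique column names and that every remaining column can be converted to float; the small frame and the names others and dtypes are made up for illustration:

```python
import pandas as pd

# Small stand-in for the real frame (illustrative data, unique column names)
df = pd.DataFrame({
    "c": [0, 1, 2],
    "d": [0, 2, 4],
    "name": ["aa", "bb", "cc"],
    "e": [0, 1, 2],
})

cols = ["name", "c"]

# Remaining columns, kept in their original order
others = [col for col in df.columns if col not in cols]

# Move the chosen columns to the front
df1 = df[cols + others]

# One dtype per column: str for the leading columns, float for the rest
dtypes = {**{col: str for col in cols}, **{col: float for col in others}}
df1 = df1.astype(dtypes)
print(df1.dtypes)
```

DataFrame.astype accepts a column-to-dtype mapping, so the reorder and the conversion stay two separate, easy-to-check steps.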

Holding a first value in a column while another column equals a value?

I would like to hold the first value in a column while another column equals zero. For Column B, values alternate between -1, 0, 1. For Column C, values equal any integer. The objective is holding the first value of Column C while Column B equals zero. The current DataFrame is as follows:
A B C
1 8 1 9
2 2 1 1
3 3 0 7
4 9 0 8
5 5 0 9
6 6 0 1
7 1 1 9
8 6 1 10
9 3 0 4
10 8 0 8
11 5 0 9
12 6 0 10
The resulting DataFrame should be as follows:
A B C
1 8 1 9
2 2 1 1
3 3 0 7
4 9 0 7
5 5 0 7
6 6 0 7
7 1 1 9
8 6 1 10
9 3 0 4
10 8 0 4
11 5 0 4
12 6 0 4
13 3 1 9
You first need to create NaNs in column C by condition and then fill them with ffill:
mask = (df['B'].shift().fillna(False)).astype(bool) | (df['B'])
df['C'] = df.loc[mask, 'C']
df['C'] = df['C'].ffill().astype(int)
print (df)
A B C
1 8 1 9
2 2 1 1
3 3 0 7
4 9 0 7
5 5 0 7
6 6 0 7
7 1 1 9
8 6 1 10
9 3 0 4
10 8 0 4
11 5 0 4
12 6 0 4
13 3 1 9
Or use where, and if all values are integers, add astype:
mask = (df['B'].shift().fillna(False)).astype(bool) | (df['B'])
df['C'] = df['C'].where(mask).ffill().astype(int)
print (df)
A B C
1 8 1 9
2 2 1 1
3 3 0 7
4 9 0 7
5 5 0 7
6 6 0 7
7 1 1 9
8 6 1 10
9 3 0 4
10 8 0 4
11 5 0 4
12 6 0 4
13 3 1 9
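For reference, the same idea as a self-contained sketch that can be run as is. It rebuilds the twelve input rows from the question and writes the condition with ne and shift(fill_value=...) rather than fillna; the names nonzero and keep are introduced here for illustration only:

```python
import pandas as pd

# The twelve input rows from the question
df = pd.DataFrame(
    {
        "A": [8, 2, 3, 9, 5, 6, 1, 6, 3, 8, 5, 6],
        "B": [1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
        "C": [9, 1, 7, 8, 9, 1, 9, 10, 4, 8, 9, 10],
    },
    index=range(1, 13),
)

# True where B is non-zero
nonzero = df["B"].ne(0)

# Keep C where B is non-zero, or where the previous row's B was non-zero
# (that is the first row of a run of zeros); the first row is always kept
keep = nonzero | nonzero.shift(fill_value=True)

# Blank out the other rows, then carry the kept value forward
df["C"] = df["C"].where(keep).ffill().astype(int)
print(df)
```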

Reshaping dataframe in Pandas

Is there a quick pythonic way to transform this table
index = pd.date_range('2000-1-1', periods=36, freq='M')
df = pd.DataFrame(np.random.randn(36,4), index=index, columns=list('ABCD'))
In[1]: df
Out[1]:
A B C D
2000-01-31 H 1.368795 0.106294 2.108814
2000-02-29 -1.713401 0.557224 0.115956 -0.851140
2000-03-31 -1.454967 -0.791855 -0.461738 -0.410948
2000-04-30 1.688731 -0.216432 -0.690103 -0.319443
2000-05-31 -1.103961 0.181510 -0.600383 -0.164744
2000-06-30 0.216871 -1.018599 0.731617 -0.721986
2000-07-31 0.621375 0.790072 0.967000 1.347533
2000-08-31 0.588970 -0.360169 0.904809 0.606771
...
into this table
2001 2000
12 11 10 9 8 7 6 5 4 3 2 1 12 11 10 9 8 7 6 5 4 3 2 1
A H
B
C
D
Please excuse the missing values. I added the "H" manually. I hope it is clear what I am looking for.
To make checking easier, I've created a dataframe of the same shape but with integers as values.
The core of the solution is pandas.DataFrame.transpose, but you need to use index.year and index.month together as the new index:
>>> df = pd.DataFrame(np.random.randint(10,size=(36, 4)), index=index, columns=list('ABCD'))
>>> df.set_index(keys=[df.index.year, df.index.month]).transpose()
2000 2001 2002
1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12
A 0 0 8 7 8 0 7 1 5 1 5 4 2 1 9 5 2 0 5 3 6 4 9 3 5 1 7 3 1 7 6 5 6 8 4 1
B 4 9 9 5 2 0 8 0 9 5 2 7 5 6 3 6 8 8 8 8 0 6 3 7 5 9 6 3 9 7 1 4 7 8 3 3
C 3 2 4 3 1 9 7 6 9 6 8 6 3 5 3 2 2 1 3 1 1 2 8 2 2 6 9 6 1 5 6 5 4 6 7 5
D 8 1 3 9 2 3 8 7 3 2 1 0 1 3 9 1 8 6 4 7 4 6 3 2 9 8 9 9 0 7 4 7 3 6 5 2
Of course, this will not work properly if you have more than one record per year+month. In this case you need to group your data first:
>>> i = pd.date_range('2000-1-1', periods=36, freq='W') # weekly index
>>> df = pd.DataFrame(np.random.randint(10,size=(36, 4)), index=i, columns=list('ABCD'))
>>> df.groupby(by=[df.index.year, df.index.month]).sum().transpose()
2000
1 2 3 4 5 6 7 8 9
A 12 13 15 23 9 21 21 31 7
B 33 24 19 30 15 19 20 7 4
C 20 24 26 24 15 18 29 17 4
D 23 29 14 30 19 12 12 11 5
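The target table in the question lists the most recent year and month first. A minimal sketch of getting that column order by sorting the transposed frame's columns in descending order; it assumes one record per month, uses sequential integers instead of random values so the result is easy to check, and uses month-start dates ("MS") purely for illustration:

```python
import numpy as np
import pandas as pd

# One record per month for three years (illustrative data)
index = pd.date_range("2000-01-01", periods=36, freq="MS")
df = pd.DataFrame(
    np.arange(36 * 4).reshape(36, 4), index=index, columns=list("ABCD")
)

# Replace the date index with (year, month) and flip rows and columns
wide = df.set_index([df.index.year, df.index.month]).T

# Newest year first, and within each year month 12 down to month 1
wide = wide.sort_index(axis=1, ascending=False)
print(wide.iloc[:, :3])
```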

pandas add variables according to variable value

Suppose the following pandas dataframe
Wafer_Id v1 v2
0 0 9 6
1 0 7 8
2 0 1 5
3 1 6 6
4 1 0 8
5 1 5 0
6 2 8 8
7 2 2 6
8 2 3 5
9 3 5 1
10 3 5 6
11 3 9 8
I want to group it according to Wafer_Id and I would like to get something like
w
Out[60]:
Wafer_Id v1_1 v1_2 v1_3 v2_1 v2_2 v2_3
0 0 9 7 1 6 ... ...
1 1 6 0 5 6
2 2 8 2 3 8
3 3 5 5 9 1
I think that I can obtain the result with the pivot function, but I am not sure how to do it.
Possible solution
oes = pd.DataFrame()
oes['Wafer_Id'] = [0,0,0,1,1,1,2,2,2,3,3,3]
oes['v1'] = np.random.randint(0, 10, 12)
oes['v2'] = np.random.randint(0, 10, 12)
oes['id'] = [0, 1, 2] * 4
oes.pivot(index='Wafer_Id', columns='id')
oes
Out[74]:
Wafer_Id v1 v2 id
0 0 8 7 0
1 0 3 3 1
2 0 8 0 2
3 1 2 5 0
4 1 4 1 1
5 1 8 8 2
6 2 8 6 0
7 2 4 7 1
8 2 4 3 2
9 3 4 6 0
10 3 9 2 1
11 3 7 1 2
oes.pivot(index='Wafer_Id', columns='id')
Out[75]:
v1 v2
id 0 1 2 0 1 2
Wafer_Id
0 8 3 8 7 3 0
1 2 4 8 5 1 8
2 8 4 4 6 7 3
3 4 9 7 6 2 1
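The answer above hard-codes the id column and leaves two-level column labels. A sketch of one way to derive the within-group counter and flatten the labels into the v1_1 ... v2_3 form asked for, using the values from the question; the use of groupby(...).cumcount() and the f-string naming are choices made here, not part of the original answer:

```python
import pandas as pd

# Data from the question
oes = pd.DataFrame({
    "Wafer_Id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "v1": [9, 7, 1, 6, 0, 5, 8, 2, 3, 5, 5, 9],
    "v2": [6, 8, 5, 6, 8, 0, 8, 6, 5, 1, 6, 8],
})

# Number the rows within each wafer: 1, 2, 3, ...
oes["id"] = oes.groupby("Wafer_Id").cumcount() + 1

# One row per wafer, one column per (variable, id) pair
w = oes.pivot(index="Wafer_Id", columns="id")

# Flatten the two-level column labels: ('v1', 1) -> 'v1_1'
w.columns = [f"{var}_{n}" for var, n in w.columns]
w = w.reset_index()
print(w)
```

Deriving the counter from the data means the same code works when the rows are not already arranged in blocks of three per wafer, as long as each wafer has the same number of rows.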

nested loops results bunched together Python

for j in range(10):
    for i in range(10):
        print(j,end=" ")
My results are bunched together and I need to have 10 numbers per line. I can't use print("0123456789"). I have tried print(j,j,j,j,j,j,j,j,j) and I get the results that I'm looking for, but I'm sure this isn't the proper way to write the code.
If print(j,j,j,j,j,j,j,j,j) works then you simply need to add another print() after each iteration:
for j in range(10):
    for i in range(10):
        print(j,end=" ")
    print()
Output:
0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9 9 9
Or simply:
for j in range(10):
    print(" ".join(str(j) * 10))
0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9 9 9
Why are you using a nested for loop when you can use a single for loop:
for i in range(10):
    print('{} '.format(i) * 10)
This is similar to Malik Brahimi's solution, except it doesn't put a space after the last digit on each line:
for i in range(10):
    print(' '.join([str(i)]*10))
output
0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9 9 9
Just for fun, here's another way to do it with a single loop, this time using a format string with numbered fields.
fmt = ('{0} ' * 10)[:-1]
for i in range(10):
    print(fmt.format(i))
