Removing matching index values from dataframe - python

df:
          0          1          2
0  0.0481948  0.1054251  0.1153076
1  0.0407258  0.0890868  0.0974378
2  0.0172071  0.0376403  0.0411687
etc.
I would like to remove all values where the row and column labels of the dataframe are equal, so my expected output would be something like:
          0          1          2
0        NaN  0.1054251  0.1153076
1  0.0407258        NaN  0.0974378
2  0.0172071  0.0376403        NaN
etc.
As shown, the values at (0,0), (1,1), (2,2), and so on have been removed/replaced.
I thought of looping through the index as follows:
for (idx, row) in df.iterrows():
    if (row.index) == ???
but I don't know how to carry on from there, or whether this is even the right approach.
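For completeness, the loop could be finished like this, assuming each row label also exists among the column labels (the vectorized answers below are preferable, though):

import numpy as np

for idx, row in df.iterrows():
    if idx in df.columns:           # the row label matches a column label
        df.loc[idx, idx] = np.nan   # blank out that cell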

You can set the diagonal:
In [11]: df.iloc[[np.arange(len(df))] * 2] = np.nan
In [12]: df
Out[12]:
          0         1         2
0       NaN  0.105425  0.115308
1  0.040726       NaN  0.097438
2  0.017207  0.037640       NaN
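Note that newer pandas versions are stricter about what .iloc accepts, so the list-of-arrays trick above may no longer work. A sketch of the same idea that indexes the underlying NumPy array instead, assuming a square frame with a single float dtype (so that .values is a writable view into the data):

import numpy as np

i = np.arange(len(df))
df.values[i, i] = np.nan  # pointwise fancy indexing hits exactly the diagonal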

@AndyHayden's answer is really cool and taught me something. However, it relies on iloc and assumes the frame is square, with rows and columns in the same order.
I generalized the concept here.
Consider the data frame df
df = pd.DataFrame(1, list('abcd'), list('xcya'))
df
   x  c  y  a
a  1  1  1  1
b  1  1  1  1
c  1  1  1  1
d  1  1  1  1
Then we use numpy broadcasting and np.where to perform the same fancy index assignment:
ij = np.where(df.index.values[:, None] == df.columns.values)
df.iloc[list(map(list, ij))] = 0
df
   x  c  y  a
a  1  1  1  0
b  1  1  1  1
c  1  0  1  1
d  1  1  1  1
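If passing list(map(list, ij)) to .iloc feels fragile, the same pointwise assignment can, I believe, be done on the underlying array, under the same single-dtype assumption as above:

df.values[ij] = 0  # ij is a (rows, cols) tuple, so NumPy assigns element-wise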

where n is the number of rows/columns:
df.values[[np.arange(n)] * 2] = np.nan
or:
np.fill_diagonal(df.values, np.nan)
see https://stackoverflow.com/a/24475214/
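A minimal end-to-end sketch of the np.fill_diagonal approach, with two caveats that I believe hold: it mutates the array in place, so .values must be a writable view (true for a single-dtype frame on classic pandas), and the dtype must be float, since writing np.nan into an integer array raises:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3, 3))  # single float dtype: .values is a view
np.fill_diagonal(df.values, np.nan)      # in place: df's diagonal is now NaN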

Related

Iteration over columns problems on condition

I have a dataset with several columns and a row named "Total" that stores values between 1 and 4.
I want to iterate over each column and, based on the number stored in the row "Total", add a new row with "yes" or "no".
I also have a list columns to iterate over.
All data are float64.
I'm new at Python and I don't know if I'm doing this the right way, because I'm getting all "yes".
for c in columns:
    if dados_por_periodo.at['Total', c] < 4:
        dados_por_periodo.loc['VA'] = "yes"
    else:
        dados_por_periodo.loc['VA'] = "no"
My dataset: (screenshot of the dataframe omitted)
Thanks.
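The reason every column comes out "yes" is that dados_por_periodo.loc['VA'] = "yes" assigns to the entire 'VA' row, so each pass of the loop overwrites the whole row and only the last comparison survives. Indexing the row and the column together writes one cell at a time; a minimal fix of the loop above:

for c in columns:
    if dados_por_periodo.at['Total', c] < 4:
        dados_por_periodo.loc['VA', c] = "yes"
    else:
        dados_por_periodo.loc['VA', c] = "no"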
You can try this; I hope it works for you:
import pandas as pd
import numpy as np

# creation of a dummy df
columns = 'A B C D E F G H I'.split()
data = [np.random.choice(2, len(columns)).tolist() for col in range(3)]
data.append([1, 8, 1, 1, 2, 4, 1, 4, 1])  # not real sums of the values, just dummy values for testing
index = ['Otono', 'Inverno', 'Primavera', 'Totals']
df = pd.DataFrame(data, columns=columns, index=index)
df.index.name = 'periodo'  # just adding the index name
print(df)

# addition of the new 'Yes'/'No' row
df = pd.concat([df, pd.DataFrame([np.where(df.iloc[len(df.index) - 1, :].lt(4), 'Yes', 'No')],
                                 columns=df.columns, index=['VA'])])
df.index.name = 'periodo'  # concat loses the index name, so set it again
print(df)
Output:

df
           A  B  C  D  E  F  G  H  I
periodo
Otono      1  0  0  1  1  0  1  1  0
Inverno    0  1  1  1  0  1  1  1  1
Primavera  1  1  0  0  1  1  1  1  0
Totals     1  8  1  1  2  4  1  4  1

df (with the added row)
             A   B    C    D    E   F    G   H    I
periodo
Otono        1   0    0    1    1   0    1   1    0
Inverno      0   1    1    1    0   1    1   1    1
Primavera    1   1    0    0    1   1    1   1    0
Totals       1   8    1    1    2   4    1   4    1
VA         Yes  No  Yes  Yes  Yes  No  Yes  No  Yes
Also, please include a data sample next time instead of an image of the dataset, so people can help you more easily :)
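For what it's worth, since .loc can create a new row by label, the concat can probably be skipped; a shorter sketch against the same dummy df:

df.loc['VA'] = np.where(df.loc['Totals'].lt(4), 'Yes', 'No')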

How to use lambda function on a pandas data frame via map/apply where lambda takes different values for each column

The idea is to transform a data frame in the fastest way, according to values specific to each column.
For simplicity, here is an example where each element of a column is compared to the mean of the column it belongs to, and replaced with 1 if it is greater than mean(column), or 0 otherwise.
In [26]: df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
In [27]: df
Out[27]:
   0  1  2
0  1  2  3
1  4  5  6
In [28]: df.mean().values.tolist()
Out[28]: [2.5, 3.5, 4.5]
The snippet below is not real code; it just exemplifies the desired behavior. I used the apply method, but it can be whatever works fastest.
In [29]: f = lambda x: 0 if x < means else 1
In [30]: df.apply(f)
In [31]: df  # the desired result
Out[31]:
   0  1  2
0  0  0  0
1  1  1  1
This is a toy example, but the solution has to be applied to a big data frame, so it has to be fast.
Cheers!
You can create a boolean mask of the dataframe by comparing each element with the mean of that column. It can be easily achieved using
df > df.mean()
       0      1      2
0  False  False  False
1   True   True   True
Since True equates to 1 and False to 0, a boolean dataframe can be easily converted to integer using astype.
(df > df.mean()).astype(int)
   0  1  2
0  0  0  0
1  1  1  1
If you need the output to be strings rather than 0 and 1, use np.where, which works as np.where(condition, value if true, value otherwise):
pd.DataFrame(np.where(df > df.mean(), 'm', 'n'))
   0  1  2
0  n  n  n
1  m  m  m
Edit: addressing the question in the comments: what if m and n are column-dependent?
df = pd.DataFrame(np.arange(12).reshape(4, 3))

   0   1   2
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

pd.DataFrame(np.where(df > df.mean(), df.min(), df.max()))

   0   1   2
0  9  10  11
1  9  10  11
2  0   1   2
3  0   1   2
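Since the title asks about apply with a per-column lambda: apply passes each column in as a Series, so the column's own statistics are available inside the lambda. This sketch should reproduce the 0/1 mask above; it is typically slower than the vectorized comparison, but it generalizes to arbitrary per-column logic:

df.apply(lambda col: (col > col.mean()).astype(int))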

Drop all columns where all values are zero

I have a simple question which relates to similar questions asked here before.
I am trying to drop all columns from a pandas dataframe which contain only zeroes (checking vertically down each column, then dropping along axis=1). Let me give you an example:
df = pd.DataFrame({'a': [0, 0, 0, 0], 'b': [0, -1, 0, 1]})

   a  b
0  0  0
1  0 -1
2  0  0
3  0  1
I'd like to drop column a since it has only zeroes.
However, I'd like to do it in a nice, vectorized fashion if possible. My data set is huge, so I don't want to loop. Hence I tried

df = df.loc[(df).any(1), (df != 0).any(0)]

    b
1  -1
3   1

which allows me to drop both columns and rows. But if I just try to drop the columns, loc seems to fail. Any ideas?
You are really close. Use any; zeros are cast to False:

df = df.loc[:, df.any()]
print (df)

   b
0  0
1 -1
2  0
3  1
If it's a matter of 0s and not sum, use df.any:
In [291]: df.T[df.any()].T
Out[291]:
   b
0  0
1 -1
2  0
3  1
Alternatively:
In [296]: df.T[(df != 0).any()].T  # or df.loc[:, (df != 0).any()]
Out[296]:
   b
0  0
1 -1
2  0
3  1
In [73]: df.loc[:, df.ne(0).any()]
Out[73]:
   b
0  0
1 -1
2  0
3  1
or:
In [71]: df.loc[:, ~df.eq(0).all()]
Out[71]:
   b
0  0
1 -1
2  0
3  1
If we want to keep only the columns that do NOT sum up to 0, note that this is a different condition: with the sample data, column b sums to 0 + (-1) + 0 + 1 = 0, so it gets dropped as well:
In [78]: df.loc[:, df.sum().astype(bool)]
Out[78]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]

How to use trailing rows on a column for calculations on that same column | Pandas Python

I'm trying to figure out how to compare the element of the previous row of a column to a different column on the current row in a Pandas DataFrame. For example:
data = pd.DataFrame({'a':['1','1','1','1','1'],'b':['0','0','1','0','0']})
Output:

   a  b
0  1  0
1  1  0
2  1  1
3  1  0
4  1  0
And now I want to make a new column that asks whether (data['a'] + data['b']) is greater than the previous value of that same column.
Theoretically:
data['c'] = np.where(data['a']==( the previous row value of data['a'] ),min((data['b']+( the previous row value of data['c'] )),1),data['b'])
So that I can theoretically output:

   a  b  c
0  1  0  0
1  1  0  0
2  1  1  1
3  1  0  1
4  1  0  1
I'm wondering how to do this because I'm trying to recreate this Excel conditional statement: =IF(A70=A69,MIN((P70+Q69),1),P70),
where data['a'] is column A, data['b'] is column P, and the new data['c'] plays the role of column Q.
If anyone has any ideas on how to do this, I'd greatly appreciate your advice.
According to your statement, 'a new column that asks if (data['a'] + data['b']) is greater than the previous value of that same column', I can suggest solving it this way:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'a':['1','1','1','1','1'],'b':['0','0','1','0','3']})
>>> df
   a  b
0  1  0
1  1  0
2  1  1
3  1  0
4  1  3
>>> df['c'] = np.where(df['a']+df['b'] > df['a'].shift(1)+df['b'].shift(1), 1, 0)
>>> df
   a  b  c
0  1  0  0
1  1  0  0
2  1  1  1
3  1  0  0
4  1  3  1
But it doesn't look at the 'previous value of that same column'. If you tried to write df['c'].shift(1) inside the np.where(), it would raise KeyError: 'c', because column c doesn't exist yet at that point.
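The Excel formula is a running recurrence (each c depends on the previous c), which a single np.where cannot express. Assuming b only ever holds 0 or 1 and the string columns are cast to integers first, the recurrence c = min(b + previous c, 1) within each run of equal a values reduces to a capped cumulative sum, so one possible sketch is:

import pandas as pd

data = pd.DataFrame({'a': ['1', '1', '1', '1', '1'],
                     'b': ['0', '0', '1', '0', '0']}).astype(int)

# label each run of consecutive equal values in 'a'
run = (data['a'] != data['a'].shift()).cumsum()

# within a run, min(b + previous c, 1) with b in {0, 1} is a capped running sum
data['c'] = data.groupby(run)['b'].cumsum().clip(upper=1)

This yields c = [0, 0, 1, 1, 1], matching the expected output in the question.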

How can I assign a value to a different column for each row in a dataframe?

I have a dataframe dat that looks like this:
   p1  p2  type  replace
1   0   1     1        1
2   1   0     1        1
3   0   0     2        1
...
I want to do something like dat['p' + str(type)] = replace, row by row, to get:
   p1  p2  type  replace
1   1   1     1        1
2   1   0     1        1
3   0   1     2        1
...
How can I do this? Of course I can't assign in a loop using something like iterrows...
Maybe there is a one-liner to do this, but if performance is not really an issue, you can easily do it with a simple for loop:
In [134]: df
Out[134]:
   p1  p2  type  replace
0   0   1     1        1
1   1   0     1        1
2   0   0     2        1

In [135]: for i in df.index:
     ...:     df.loc[i, 'p' + str(df.loc[i, 'type'])] = df.loc[i, 'replace']

In [136]: df
Out[136]:
   p1  p2  type  replace
0   1   1     1        1
1   1   0     1        1
2   0   1     2        1
If you have many more rows than columns, this will be much faster and is actually easier (and if necessary you can loop over 1, 2, ...):
df["p1"][df["type"]==1] = df["replace"][df["type"]==1]
df["p2"][df["type"]==2] = df["replace"][df["type"]==2]
In [47]: df['p1'].where(~(df['type'] == 1), df['replace'], inplace=True)
In [48]: df['p2'].where(~(df['type'] == 2), df['replace'], inplace=True)
In [49]: df
Out[49]:
   p1  p2  type  replace
1   1   1     1        1
2   1   0     1        1
3   0   1     2        1
Just for completeness, I ended up doing the following, which may or may not be the same as what Dan Allan suggested:
for i in range(2):
    df.loc[df['type'] == i + 1, 'p' + str(i + 1)] = df.loc[df['type'] == i + 1, 'replace']
I have a much larger problem than the example I gave (with something like 30 types and thousands of rows in the dataframe), and this solution seems very fast. Thanks to all for your help in thinking about this problem!
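Since the real data has around 30 types, the loop can also be driven by the values actually present rather than a hard-coded range; a small variation on the same idea:

for t in df['type'].unique():
    df.loc[df['type'] == t, 'p' + str(t)] = df.loc[df['type'] == t, 'replace']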
