Creating boolean matrix from one column with pandas - python

I have been searching for an answer, but I don't know what to search for, so I'll ask here instead. I'm a beginner Python and pandas enthusiast.
I have a dataset where I would like to produce a matrix from a column. The matrix should contain 1 where the column value and its transposed counterpart are equal, and 0 where they are not.
input:

id  x1
A    1
B    3
C    1
D    5

output:

   A  B  C  D
A  1  0  1  0
B  0  1  0  0
C  1  0  1  0
D  0  0  0  1
I would like to do this for six different columns and add the resulting matrices into one matrix, so that the values range from 0 to 6 instead of just 0 to 1.

Partly because there's as yet no convenient cartesian join (whistles and looks away), I tend to drop down to the numpy level and use broadcasting when I need to do things like this. IOW, because we can do things like this:
>>> df.x1.values - df.x1.values[:, None]
array([[ 0,  2,  0,  4],
       [-2,  0, -2,  2],
       [ 0,  2,  0,  4],
       [-4, -2, -4,  0]])
We can do
>>> pdf = pd.DataFrame(index=df.id.values, columns=df.id.values,
...                    data=(df.x1.values == df.x1.values[:, None]).astype(int))
>>> pdf
   A  B  C  D
A  1  0  1  0
B  0  1  0  0
C  1  0  1  0
D  0  0  0  1
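The question also asks about doing this for six columns and summing the results. Building on the same broadcasting trick, here is a minimal sketch, using three hypothetical columns x1 to x3 instead of six for brevity; each equality matrix is 0/1, so the sum ranges from 0 up to the number of columns compared:

import numpy as np
import pandas as pd

# Hypothetical data: the question's x1 plus two made-up columns.
df = pd.DataFrame({'id': ['A', 'B', 'C', 'D'],
                   'x1': [1, 3, 1, 5],
                   'x2': [2, 2, 1, 5],
                   'x3': [1, 3, 1, 1]})

cols = ['x1', 'x2', 'x3']
# Sum one 0/1 equality matrix per column; with six columns the
# values would range from 0 to 6.
total = sum((df[c].values == df[c].values[:, None]).astype(int) for c in cols)
pdf = pd.DataFrame(total, index=df.id.values, columns=df.id.values)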

Related

Compare two columns in a for loop with Pandas

I have the following dataframe, where I want to determine whether column A is greater than column B, and whether column B is greater than column C. Where a column's value is smaller than the previous column's, I want to change that value to 0.
d = {'A': [6, 8, 10, 1, 3], 'B': [4, 9, 12, 0, 2], 'C': [3, 14, 11, 4, 9] }
df = pd.DataFrame(data=d)
df
I have tried this with the np.where and it is working:
df['B'] = np.where(df['A'] > df['B'], 0, df['B'])
df['C'] = np.where(df['B'] > df['C'], 0, df['C'])
However, I have a huge number of columns and I want to know if there is any way to do this without writing each comparison separately, for example with a for loop.
Thanks
A solution with different output, because it compares the original columns with DataFrame.diff and sets values whose difference is less than 0 to 0 with DataFrame.mask:
df1 = df.mask(df.diff(axis=1).lt(0), 0)
print (df1)
    A   B   C
0   6   0   0
1   8   9  14
2  10  12   0
3   1   0   4
4   3   0   9
If you instead loop over pairs of adjacent column names with zip, the output is different, because each comparison uses the already reassigned columns B, C, ...:
for a, b in zip(df.columns, df.columns[1:]):
    df[b] = np.where(df[a] > df[b], 0, df[b])
print (df)
    A   B   C
0   6   0   3
1   8   9  14
2  10  12   0
3   1   0   4
4   3   0   9
For a vectorial approach, you cannot simply use a diff, because the condition depends on whether the previous value was itself replaced by 0: two consecutive replacements cannot happen.
You can achieve a correct vectorial replacement using a shifted mask:
m1 = df.diff(axis=1).lt(0)                # is the value less than the previous one?
m2 = ~m1.shift(axis=1, fill_value=False)  # and the previous one was not already zeroed
df2 = df.mask(m1 & m2, 0)
output:
    A   B   C
0   6   0   3
1   8   9  14
2  10  12   0
3   1   0   4
4   3   0   9
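To see where the two approaches differ, take a hypothetical row such as [6, 4, 3]: the plain diff mask zeroes both B and C, while the shifted mask zeroes only B and keeps C, matching the loop:

import pandas as pd

df = pd.DataFrame({'A': [6], 'B': [4], 'C': [3]})  # hypothetical example row

m1 = df.diff(axis=1).lt(0)                # B and C are both smaller than their predecessors
m2 = ~m1.shift(axis=1, fill_value=False)  # but B is zeroed first, so C should survive
print(df.mask(m1, 0))       # plain diff:   A=6, B=0, C=0
print(df.mask(m1 & m2, 0))  # shifted mask: A=6, B=0, C=3 (matches the loop)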

How to conditionally add one hot vector to a Pandas DataFrame

I have the following Pandas DataFrame in Python:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3], [3, 2, 1], [2, 1, 1]]),
                  columns=['a', 'b', 'c'])
df
It looks like the following when you output it:
   a  b  c
0  1  2  3
1  3  2  1
2  2  1  1
I need to add 3 new columns: column "d", column "e", and column "f".
Values in each new column will be determined based on the values of column "b" and column "c".
In a given row:
If the value of column "b" is bigger than the value of column "c", columns [d, e, f] will have the values [1, 0, 0].
If the value of column "b" is equal to the value of column "c", columns [d, e, f] will have the values [0, 1, 0].
If the value of column "b" is smaller than the value of column "c", columns [d, e, f] will have the values [0, 0, 1].
After this operation, the DataFrame needs to look like the following:
   a  b  c  d  e  f
0  1  2  3  0  0  1   # since b is smaller than c
1  3  2  1  1  0  0   # since b is bigger than c
2  2  1  1  0  1  0   # since b = c
My original DataFrame is much bigger than the one in this example.
Is there a good way of doing this in Python without looping through the DataFrame?
You can use np.where to create a condition vector and str.get_dummies to create the dummies:
df['vec'] = np.where(df.b > df.c, 'd', np.where(df.b == df.c, 'e', 'f'))
df = df.assign(**df['vec'].str.get_dummies()).drop(columns='vec')
   a  b  c  d  e  f
0  1  2  3  0  0  1
1  3  2  1  1  0  0
2  2  1  1  0  1  0
Let us try np.sign with get_dummies: -1 means c < b, 0 means c = b, and 1 means c > b.
df = df.join(np.sign(df.eval('c - b')).map({-1: 'd', 0: 'e', 1: 'f'}).astype(str).str.get_dummies())
df
Out[29]:
   a  b  c  d  e  f
0  1  2  3  0  0  1
1  3  2  1  1  0  0
2  2  1  1  0  1  0
You simply harness the Boolean conditions you've already specified.
df["d"] = np.where(df.b > df.c, 1, 0)
df["e"] = np.where(df.b == df.c, 1, 0)
df["f"] = np.where(df.b < df.c, 1, 0)

`pandas.merge` not recognising same index

I have two dataframes with overlapping columns but identical indexes, and I want to combine them. I feel like this should be straightforward, but I have worked through so many examples and SO questions, and it's not working; it also seems inconsistent with other examples.
import pandas as pd
# create test data
df = pd.DataFrame({'gen1': [1, 0, 0, 1, 1], 'gen3': [1, 0, 0, 1, 0], 'gen4': [0, 1, 1, 0, 1]},
                  index=['a', 'b', 'c', 'd', 'e'])
df1 = pd.DataFrame({'gen1': [1, 0, 0, 1, 1], 'gen2': [0, 1, 1, 1, 1], 'gen3': [1, 0, 0, 1, 0]},
                   index=['a', 'b', 'c', 'd', 'e'])
In [1]: df
Out[1]:
   gen1  gen3  gen4
a     1     1     0
b     0     0     1
c     0     0     1
d     1     1     0
e     1     0     1

In [2]: df1
Out[2]:
   gen1  gen2  gen3
a     1     0     1
b     0     1     0
c     0     1     0
d     1     1     1
e     1     1     0
After working through all the examples here (https://pandas.pydata.org/pandas-docs/stable/merging.html), I'm convinced I have found the correct examples (the first and second merge examples). The second example is this:
In [43]: result = pd.merge(left, right, on=['key1', 'key2'])
In their example they have two DFs (left and right) with overlapping columns and identical indexes, and their resulting dataframe has one version of each column and the original indexes, but this is not what happens when I do that:
# get the intersection of columns (I need this to be general)
In [3]: column_intersection = list(set(df).intersection(set(df1)))
In [4]: pd.merge(df, df1, on=column_intersection)
Out[4]:
   gen1  gen2  gen3  gen4
0     1     0     1     0
1     1     0     1     0
2     1     1     1     0
3     1     1     1     0
4     0     1     0     1
5     0     1     0     1
6     0     1     0     1
7     0     1     0     1
8     1     1     0     1
Here we see that merge has not noticed that the indexes are the same! I have fiddled around with the options but cannot get the result I want.
A similar but different question was asked here: How to keep index when using pandas merge. But I don't really understand the answers, and so can't relate them to my problem.
Points for this specific example:
Index will always be identical.
Columns with the same name will always have identical entries (i.e. they are duplicates).
It would be great to have a solution for this specific problem, but I would also really like to understand it, because I find myself spending a lot of time combining dataframes. I love pandas and in general find it very intuitive, but I just can't seem to get comfortable with anything other than trivial combinations of dataframes.
Starting with v0.23, you can specify an index name for the join key, if you have one:
df.index.name = df1.index.name = 'idx'
df.merge(df1, on=list(set(df).intersection(set(df1)) | {'idx'}))
     gen1  gen3  gen4  gen2
idx
a       1     1     0     0
b       0     0     1     1
c       0     0     1     1
d       1     1     0     1
e       1     0     1     1
The assumption above is that your actual DataFrames do not have exactly the same values in overlapping columns. If they do, then your question is one of concatenation, and you can use pd.concat for that:
c = list(set(df).intersection(set(df1)))
pd.concat([df1, df.drop(columns=c)], axis=1)
   gen1  gen2  gen3  gen4
a     1     0     1     0
b     0     1     0     1
c     0     1     0     1
d     1     1     1     0
e     1     1     0     1
In this special case, you can use assign. Things in df take priority, but all other things in df1 are included.
df1.assign(**df)
   gen1  gen2  gen3  gen4
a     1     0     1     0
b     0     1     0     1
c     0     1     0     1
d     1     1     1     0
e     1     1     0     1
**df unpacks df in a dictionary context. The unpacking delivers keyword arguments to assign, with the column names as the keywords and the columns as the arguments.
It is the same as
df1.assign(gen1=df.gen1, gen3=df.gen3, gen4=df.gen4)
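Because the indexes are identical and the overlapping columns are exact duplicates, combine_first is another option for this special case: values from df take priority, and df1 fills in whatever df lacks (here, gen2). A minimal sketch; note that depending on your pandas version the filled-in columns may come back as floats:

df.combine_first(df1)  # union of columns; df wins on overlaps, df1 supplies gen2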

Use a list to conditionally fill a new column based on values in multiple columns

I am trying to populate a new column within a pandas dataframe by using values from several columns. The original columns contain either 0 or 1, with exactly a single 1 per row. The new column corresponds to the columns ['A', 'B', 'C', 'D'] through new_col = [1, 3, 7, 10], as shown below (a 1 in A means new_col = 1; if B = 1, new_col = 3, etc.).
df
   A  B  C  D
1  1  0  0  0
2  0  0  1  0
3  0  0  0  1
4  0  1  0  0
The new df should look like this:
df
   A  B  C  D  new_col
1  1  0  0  0        1
2  0  0  1  0        7
3  0  0  0  1       10
4  0  1  0  0        3
I've tried to use map, loc, and where, but can't seem to formulate an efficient way to get it done. The problem seems very close to this one. I've looked at a couple of other posts as well, but none of them show how to use multiple columns conditionally to fill a new column based on a list.
I can think of a few ways, mostly involving argmax or idxmax, to get either an ndarray or a Series which we can use to fill the column.
We could drop down to numpy, find the maximum locations (where the 1s are) and use those to index into an array version of new_col:
In [148]: np.take(new_col,np.argmax(df.values,1))
Out[148]: array([ 1, 7, 10, 3])
We could make a Series with new_col as the values and the columns as the index, and index into that with idxmax:
In [116]: pd.Series(new_col, index=df.columns).loc[df.idxmax(1)].values
Out[116]: array([ 1, 7, 10, 3])
We could use get_indexer to turn the column idxmax results into integer offsets we can use with new_col:
In [117]: np.array(new_col)[df.columns.get_indexer(df.idxmax(axis=1))]
Out[117]: array([ 1, 7, 10, 3])
Or (and this seems very wasteful) we could make a new frame with the new columns and use idxmax directly:
In [118]: pd.DataFrame(df.values, columns=new_col).idxmax(1)
Out[118]:
0     1
1     7
2    10
3     3
dtype: int64
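Since each row contains exactly one 1, a plain matrix product is yet another option: multiplying the 0/1 values by new_col picks out the matching entry per row. A small sketch in the same vein:

In [119]: df.values @ np.array(new_col)
Out[119]: array([ 1,  7, 10,  3])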
It's not the most elegant solution, but for me it beats the if/elif/elif loop:
d = {'A': 1, 'B': 3, 'C': 7, 'D': 10}

def new_col(row):
    k = row[row == 1].index.tolist()[0]
    return d[k]

df['new_col'] = df.apply(new_col, axis=1)
Output:
   A  B  C  D  new_col
1  1  0  0  0        1
2  0  0  1  0        7
3  0  0  0  1       10
4  0  1  0  0        3

pandas: Use if-else to populate new column

I have a DataFrame like this:
col1  col2
1     0
0     1
0     0
0     0
3     3
2     0
0     4
I'd like to add a column that is 1 if col2 > 0, and 0 otherwise. If I were using R, I'd do something like this:
df1[,'col3'] <- ifelse(df1$col2 > 0, 1, 0)
How would I do this in python / pandas?
You could convert the boolean series df.col2 > 0 to an integer series (True becomes 1 and False becomes 0):
df['col3'] = (df.col2 > 0).astype('int')
(To create a new column, you simply need to name it and assign it to a Series, array or list of the same length as your DataFrame.)
This produces col3 as:
   col2  col3
0     0     0
1     1     1
2     0     0
3     0     0
4     3     1
5     0     0
6     4     1
Another way to create the column is to use np.where, which lets you specify the values for the true and false cases and is perhaps closer to the syntax of the R function ifelse. For example:
>>> np.where(df['col2'] > 0, 4, -1)
array([-1, 4, -1, -1, 4, -1, 4])
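Applied to the question's actual requirement (1 where col2 > 0, else 0), that becomes:

df['col3'] = np.where(df['col2'] > 0, 1, 0)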
I assume that you're using Pandas (because of the 'df' notation). If so, you can assign col3 a boolean flag by using .gt (greater than) to compare col2 against zero. Multiplying the result by one will convert the boolean flags into ones and zeros.
df1 = pd.DataFrame({'col1': [1, 0, 0, 0, 3, 2, 0],
                    'col2': [0, 1, 0, 0, 3, 0, 4]})
df1['col3'] = df1.col2.gt(0) * 1
>>> df1
Out[70]:
   col1  col2  col3
0     1     0     0
1     0     1     1
2     0     0     0
3     0     0     0
4     3     3     1
5     2     0     0
6     0     4     1
You can also use a lambda expression to achieve the same result, but I believe the method above is simpler for your given example.
df1['col3'] = df1['col2'].apply(lambda x: 1 if x > 0 else 0)
