Counting differences from the consensus in each row via Pandas - python

I have a DataFrame that looks like this:
import pandas as pd
df = pd.DataFrame({'A':['a','b','c','d'],'B':['a','b','c','x'],'C':['y','b','c','d']})
df
A B C
0 a a y
1 b b b
2 c c c
3 d x d
I want to identify the most common character in each row, and total the number of differences from the consensus:
A B C Consensus
0 a a y a
1 b b b b
2 c c c c
3 d x d d
Total 0 1 1 0
Running through loops is one approach, but it seems inefficient:
consensus = []
for idx in df.index:
    consensus.append(df.loc[idx].value_counts().index[0])
df['Consensus'] = consensus
(and so on)
Is there a straightforward way to get the consensus and count differences from it?

You could use the mode to get the consensus value:
>>> df.mode(axis=1)
0
0 a
1 b
2 c
3 d
Note the caveats in the docs though:
Gets the mode(s) of each element along the axis selected. Empty if nothing has 2+ occurrences. Adds a row for each mode per label, fills in gaps with nan.
Note that there could be multiple values returned for the selected axis (when more than one item share the maximum frequency), which is the reason why a dataframe is returned. If you want to impute missing values with the mode in a dataframe df, you can just do this: df.fillna(df.mode().iloc[0])
To count the differences from the consensus for each column you could compare with ne and then sum:
>>> df['consensus'] = df.mode(axis=1)[0]
>>> df.loc[:, 'A':'C'].ne(df['consensus'], axis=0).sum(axis=0)
A 0
B 1
C 1
dtype: int64
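Putting the pieces together, here is a runnable sketch of the whole workflow, including the Total counts from the question (using the question's own sample data):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd'],
                   'B': ['a', 'b', 'c', 'x'],
                   'C': ['y', 'b', 'c', 'd']})

# Row-wise mode; take column 0 in case a row has ties (mode returns a DataFrame).
df['Consensus'] = df.mode(axis=1)[0]

# Per column, count how many cells differ from the consensus.
totals = df[['A', 'B', 'C']].ne(df['Consensus'], axis=0).sum()
print(totals.tolist())  # [0, 1, 1]
```

With ties, `mode(axis=1)` returns more than one column per row, so taking column 0 is an arbitrary tie-break; pick your own rule if that matters.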

Related

pandas dataframe get the value with most occurence per row (Python2)

I have the dataframe
df =
A B B A B
B B B B A
A A A B B
A A B A A
And I want to get a vector with the element that appeared the most, per row.
So here I will get [B, B, A, A].
What is the best way to do it in Python 2?
Let us use mode:
df.T.mode()
0 1 2 3
0 B B A A
You can get your vector v of the most frequent values with
v = [row.value_counts().idxmax() for _, row in df.iterrows()]
Be careful when you have multiple elements that occur the most.
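Combining the two answers, a sketch of extracting the vector from `mode` directly (reconstructing the question's sample data, and taking the first row in case of ties):

```python
import pandas as pd

# Hypothetical reconstruction of the question's DataFrame.
df = pd.DataFrame([list('ABBAB'),
                   list('BBBBA'),
                   list('AAABB'),
                   list('AABAA')])

# Transpose so each original row becomes a column, take the column-wise
# mode, and keep the first result row in case of ties.
v = df.T.mode().iloc[0].tolist()
print(v)  # ['B', 'B', 'A', 'A']
```

This avoids the Python-level loop of `iterrows`, though the tie-break caveat still applies.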

Pythonic Way to Change Series Values for Index Elements in Series and a Separate Dataframe

I have a series:
s
A 1
B 0
C 1
D -1
E -1
F 0
...
and a dataframe with a subset of the series index values:
df
one two three ....
A
C
D
F
...
The contents of df are not relevant to my question.
I am looking for the most Pythonic way to check the series index against the dataframe index and, wherever an index element appears in both, set the series value to zero.
The result I'm looking for is for the series to look like this based on sample s and df provided above:
s
A 0
B 0
C 0
D 0
E -1
F 0
Note that some series values were 0 to begin with, and they stay 0; the ones whose index elements appear in both are changed to 0.
I can iterate through the index, but I'm looking for a more Pythonic way to do this.
Thanks in advance.
Just do:
s[s.index.isin(df.index)] = 0
Yields:
A 0
B 0
C 0
D 0
E -1
F 0
dtype: int64
You could use update with a dummy series of all zeros. It should be fast.
import pandas as pd
s.update(pd.Series(0, index=df.index))
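Both answers can be sketched end-to-end; the data below is a reconstruction of the question's sample (the contents of df don't matter, only its index):

```python
import pandas as pd

s = pd.Series([1, 0, 1, -1, -1, 0], index=list('ABCDEF'))
df = pd.DataFrame(index=list('ACDF'))

# Approach 1: boolean mask on the shared index labels.
masked = s.copy()
masked[masked.index.isin(df.index)] = 0

# Approach 2: update() with a dummy all-zero series aligned on df's index.
updated = s.copy()
updated.update(pd.Series(0, index=df.index))

print(masked.tolist())  # [0, 0, 0, 0, -1, 0]
```

Both leave E untouched (its label is not in df's index) and keep B and F at their original 0.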

Find index value of a dataframe by comparing with another series

I am having a problem extracting index values from a dataframe by comparing one of its columns with a list.
list = [a, b, c, d]
Comparing the list with column X of this dataframe:
X Y Z
0 a r t
1 e t y
2 c f h
3 d r t
4 b g q
this should return the index values like
X
0 a
4 b
2 c
3 d
I tried this method
z=dataframe.loc[(dataframe['X'] == list)]
You should use isin as you are comparing to a list of elements:
dataframe = pd.DataFrame(columns = ['X','Y','Z'])
dataframe['X'] = ['a','e','c','d','b']
dataframe['Y'] = ['r','t','f','r','g']
dataframe['Z'] = ['t','y','h','y','k']
mylist = ['a','b','c','d']
(always post a way to create your dataframe in your question; it makes answering faster)
dataframe[dataframe['X'].isin(mylist)].X
0 a
2 c
3 d
4 b
Name: X, dtype: object
You need to use isin:
Make sure your list is a list of strings, then use dropna to get rid of unwanted rows and columns (note: avoid naming it list, which shadows the built-in):
mylist = ['a','b','c','d']
df[df.isin(mylist)].dropna(how='all').dropna(axis=1)
Or, if you only want to compare with column X:
df.X[df.X.isin(mylist)]
Output:
X
0 a
2 c
3 d
4 b
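The question's expected output is ordered by value rather than by position; a sketch that adds a sort_values step to the isin filter (using the sample data reconstructed in the first answer):

```python
import pandas as pd

dataframe = pd.DataFrame({'X': ['a', 'e', 'c', 'd', 'b'],
                          'Y': ['r', 't', 'f', 'r', 'g'],
                          'Z': ['t', 'y', 'h', 'y', 'k']})
mylist = ['a', 'b', 'c', 'd']

# Filter X by membership, then sort so the order matches the question's output.
result = dataframe.loc[dataframe['X'].isin(mylist), 'X'].sort_values()
print(result.index.tolist())  # [0, 4, 2, 3]
```

`result.index` then carries the dataframe index values the question asked for.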

Selecting rows by a list of values without using several ands

I have a dataframe with columns (a,b,c).
I have a list of values (x,y,z)
How can I select the rows containing exactly these three values, something like:
df = df[df[(a,b,c)] == (x,y,z)]
I know that
df = df[(df[a] == x) & (df[b] == y) & (df[c] == z)]
should work, but I'm looking for something more convenient. Does it exist?
Solution using Indexing
I would set the three columns as the index and use the .loc function.
Index lookups like this are generally much faster than boolean masking on larger datasets.
In [4]: df = pd.DataFrame({'a':[1,2,3,4,5],
'b':['a','b','c','d','e'],
'c':['z','x','y','v','u'],
'othervalue':range(100, 105)})
In [5]: df
Out[5]:
a b c othervalue
0 1 a z 100
1 2 b x 101
2 3 c y 102
3 4 d v 103
4 5 e u 104
In [6]: df.set_index(['a','b','c'], inplace=True)
In [7]: df
Out[7]:
othervalue
a b c
1 a z 100
2 b x 101
3 c y 102
4 d v 103
5 e u 104
In [8]: df.loc[[(4, 'd', 'v')]]
Out[8]:
othervalue
a b c
4 d v 103
Extra bonus
Also, if you just want a certain column's value for that row, you can extend the .loc call to select that column:
In [9]: df.loc[[(4, 'd', 'v')], 'othervalue']
Out[9]:
a b c
4 d v 103
Name: othervalue, dtype: int64
If you want to match the (x, y, z) values regardless of their order across the columns (just within the same row), you could use isin:
df = df[df[['a','b','c']].isin([x,y,z])].dropna()
It would be nice to compare the timing with your boolean mask on a big dataframe.
df = df[(df[['a','b','c']] == [x, y, z]).all(axis=1)]
Hope this is helpful.
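If convenience is the main goal, DataFrame.query is another option: it references local variables with @ and reads close to the "several ands" version without the repeated df[...] noise. A minimal sketch with made-up sample data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': ['a', 'b', 'c'],
                   'c': ['z', 'x', 'y'],
                   'othervalue': [100, 101, 102]})
x, y, z = 2, 'b', 'x'

# query() evaluates the expression against the columns; @name pulls in
# local Python variables, and the match is order-sensitive per column.
match = df.query('a == @x and b == @y and c == @z')
print(match['othervalue'].tolist())  # [101]
```

Unlike the set_index approach, this needs no index manipulation, at the cost of expression-string evaluation.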

Vectorized update to pandas DataFrame?

I have a dataframe for which I'd like to update a column with some values from an array. The array is a different length from the dataframe; however, I have the indices of the dataframe rows I'd like to update.
I can do this with a loop through the rows (below), but I expect there is a much more efficient vectorized approach; I just can't seem to get the syntax correct.
In the example below I just fill the column with nan and then use the indices directly through a loop.
df['newcol'] = np.nan
j = 0
for i in update_idx:
    df['newcol'][i] = new_values[j]
    j += 1
If you have a list of indices already, then you can use loc to perform label (row) selection. You can also pass a new column name; the rows that are not selected will have NaN assigned:
df.loc[update_idx, 'new_col'] = new_value
Example:
In [4]:
df = pd.DataFrame({'a':np.arange(5), 'b':np.random.randn(5)}, index = list('abcde'))
df
Out[4]:
a b
a 0 1.800300
b 1 0.351843
c 2 0.278122
d 3 1.387417
e 4 1.202503
In [5]:
idx_list = ['b','d','e']
df.loc[idx_list, 'c'] = np.arange(3)
df
Out[5]:
a b c
a 0 1.800300 NaN
b 1 0.351843 0
c 2 0.278122 NaN
d 3 1.387417 1
e 4 1.202503 2
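A self-contained sketch of the same pattern with the question's variable names (the labels and values below are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(5)}, index=list('abcde'))
update_idx = ['b', 'd', 'e']   # labels of the rows to update
new_values = [10, 20, 30]      # replacement values, same length as update_idx

# Single vectorized assignment: the new column is created, and rows
# whose labels are not in update_idx are filled with NaN.
df.loc[update_idx, 'newcol'] = new_values
print(df['newcol'].tolist())  # rows 'a' and 'c' are NaN; 'b', 'd', 'e' get 10, 20, 30
```

This also avoids the chained-assignment pattern `df['newcol'][i] = ...` from the question, which pandas warns about because it may write to a copy.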
