I have the dataframe
df =
   0  1  2  3  4
0  A  B  B  A  B
1  B  B  B  B  A
2  A  A  A  B  B
3  A  A  B  A  A
And I want to get a vector with the element that appears the most in each row.
So here I would get [B, B, A, A].
What is the best way to do it in Python 2?
Let us use mode:
df.T.mode()
0 1 2 3
0 B B A A
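If you want that as a plain Python list (a small sketch, assuming the first mode row is all you need, i.e. ties do not matter):
df.T.mode().iloc[0].tolist()
# ['B', 'B', 'A', 'A']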
You can get your vector v of the most frequent value per row with
v = [row.value_counts().idxmax() for _, row in df.iterrows()]
Be careful when several elements are tied for the most frequent value: idxmax returns only the first of them.
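A loop-free variant of the same idea (a sketch; like idxmax above, it keeps only the first value in case of a tie):
v = df.apply(lambda row: row.value_counts().idxmax(), axis=1).tolist()
# ['B', 'B', 'A', 'A']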
I have a dataframe which consists of five columns and five rows:
Pasquil_gifford_stability_table = pd.DataFrame({"1": ['A','B','B','C','C'],
                                                "2": ['A','B','C','D','D'],
                                                "3": ['B','C','C','D','D'],
                                                "4": ['D','E','D','D','D'],
                                                "5": ['D','F','E','D','D']})
When I want to take the element from the second column and the second row, I can do it like this:
Pasquil_gifford_stability_table.loc[2][2]
'C'
When I want to take the element from the third column and the first row, it also works:
Pasquil_gifford_stability_table.loc[1][3]
'E'
When I try to do it with arrays of indices, I get the wrong result:
Pasquil_gifford_stability_table.loc[[2,2]],[[1,3]]
( 1 2 3 4 5
2 B C C D E
2 B C C D E, [[1, 3]])
But as the result I should get
['C','E']
How should I solve that problem?
You want lookup:
df.lookup([2, 1], ['3', '4'])
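Note that lookup was deprecated in pandas 1.2 and removed in 2.0, so on a recent version you may need the NumPy-based replacement with get_indexer (a sketch, with df standing for the table above):
rows, cols = [2, 1], ['3', '4']
df.to_numpy()[df.index.get_indexer(rows), df.columns.get_indexer(cols)]
# array(['C', 'E'], dtype=object)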
I have the dataframe below.
I want to transform it to the output below, merging cells that have the same value in a column.
Can anyone provide some sample code?
Try this:
df.loc[df.duplicated(['A', 'B']), ['A', 'B']] = ''
This finds the duplicated (A, B) values and masks them with an empty string.
I/P:
A B C
0 1 a A
1 1 a B
2 2 b C
3 2 b A
O/P:
A B C
0 1 a A
1 B
2 2 b C
3 A
Note: you can't actually merge cells using pandas; the idea is to suppress the repeated values in every record except the first.
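For reference, a self-contained sketch that reproduces the I/P to O/P step above (column names taken from the sample data):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': ['a', 'a', 'b', 'b'],
                   'C': ['A', 'B', 'C', 'A']})

# blank out A and B on rows that repeat an already seen (A, B) pair
df.loc[df.duplicated(['A', 'B']), ['A', 'B']] = ''
print(df)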
Based on the sample data generated by @mohamed thasin ah,
df.groupby(['A', 'B'], as_index=False).agg(', '.join)
A B C
0 1 a A, B
1 2 b C, A
so try:
df.groupby(['cd', 'ci', 'ui', 'module_behavior', 'feature_behavior', 'at']).agg(', '.join)
The output that you want seems to be an Excel file. If that is the case, I suggest:
df.groupby(['cn', 'ci', 'ui', 'module_behaviour', 'feature_behaviour', 'at']).apply(
lambda x: x.sort_values('caseid')).to_excel('filename.xlsx')
Pandas will group by those columns and turn them into a multilevel index, and to_excel saves the DataFrame to an Excel file with the default setting merge_cells=True.
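If the merged cells in Excel are really the goal, a minimal sketch of that idea on the small sample data above (it needs an Excel writer such as openpyxl installed; 'merged.xlsx' is just a placeholder file name):
(df.set_index(['A', 'B'])
   .sort_index()
   .to_excel('merged.xlsx', merge_cells=True))
With a MultiIndex on the rows and merge_cells=True (the default), to_excel merges the repeated index labels vertically in the spreadsheet.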
I am having a problem extracting index values from a data frame by comparing one of its columns with another list.
list = [a, b, c, d]
Data frame (I am comparing the list with column X):
X Y Z
0 a r t
1 e t y
2 c f h
3 d r t
4 b g q
This should return the index values like:
X
0 a
4 b
2 c
3 d
I tried this method:
z = dataframe.loc[dataframe['X'] == list]
You should use isin as you are comparing to a list of elements:
dataframe = pd.DataFrame(columns = ['X','Y','Z'])
dataframe['X'] = ['a','e','c','d','b']
dataframe['Y'] = ['r','t','f','r','g']
dataframe['Z'] = ['t','y','h','y','k']
mylist = ['a','b','c','d']
(Always post a way to create your dataframe in your question; it will be faster to get an answer.)
dataframe[dataframe['X'].isin(mylist)].X
0 a
2 c
3 d
4 b
Name: X, dtype: object
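If what you actually need are the index labels themselves, ordered by the values of X as in the expected output, a short follow-up sketch:
matches = dataframe[dataframe['X'].isin(mylist)].sort_values('X')
matches.index.tolist()
# [0, 4, 2, 3]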
You need to use isin:
Make sure your list is a list of strings, then use dropna to get rid of unwanted rows and columns.
list = ['a','b','c','d']
df[df.isin(list)].dropna(how='all').dropna(axis=1)
Or, if you only want to compare with column X:
df.X[df.X.isin(list)]
Output:
X
0 a
2 c
3 d
4 b
I have a DataFrame that looks like this:
import pandas as pd
df = pd.DataFrame({'A':['a','b','c','d'],'B':['a','b','c','x'],'C':['y','b','c','d']})
df
A B C
0 a a y
1 b b b
2 c c c
3 d x d
I want to identify the most common character in each row, and total the number of differences from the consensus:
A B C Consensus
0 a a y a
1 b b b b
2 c c c c
3 d x d d
Total 0 1 1 0
Running through loops is one approach, but it seems inefficient:
consensus = []
for idx in df.index:
    consensus.append(df.loc[idx].value_counts().index[0])
df['Consensus'] = consensus
(and so on)
Is there a straightforward way to get the consensus and count differences from it?
You could use the mode to get the consensus value:
>>> df.mode(axis=1)
0
0 a
1 b
2 c
3 d
Note the caveats in the docs though:
Gets the mode(s) of each element along the axis selected. Empty if nothing has 2+ occurrences. Adds a row for each mode per label, fills in gaps with nan.
Note that there could be multiple values returned for the selected axis (when more than one item share the maximum frequency), which is the reason why a dataframe is returned. If you want to impute missing values with the mode in a dataframe df, you can just do this: df.fillna(df.mode().iloc[0])
To count the differences from the consensus for each column you could compare with ne and then sum:
>>> df['consensus'] = df.mode(axis=1)
>>> df.loc[:, 'A':'C'].ne(df['consensus'], axis=0).sum(axis=0)
A 0
B 1
C 1
dtype: int64
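Putting it together, a short sketch that also appends the Total row shown in the question (using the column name Consensus from the question):
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd'],
                   'B': ['a', 'b', 'c', 'x'],
                   'C': ['y', 'b', 'c', 'd']})

df['Consensus'] = df.mode(axis=1)[0]            # first mode per row
totals = df.ne(df['Consensus'], axis=0).sum()   # differences per column; Consensus vs itself is 0
df.loc['Total'] = totals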
I have a dataframe with columns (a,b,c).
I have a list of values (x,y,z)
How can I select the rows containing exactly these three values, something like:
df = df[df[(a,b,c)] == (x,y,z)]
I know that
df = df[(df[a] == x) & (df[b] == y) & (df[c] == z)]
should work, but I'm looking for something more convenient. Does it exist?
Solution using Indexing
I would set the columns as the index and use the .loc function.
Indexing like this is the fastest way of accessing rows, while boolean masking tends to be much slower on larger datasets.
In [4]: df = pd.DataFrame({'a':[1,2,3,4,5],
'b':['a','b','c','d','e'],
'c':['z','x','y','v','u'],
'othervalue':range(100, 105)})
In [5]: df
Out[5]:
a b c othervalue
0 1 a z 100
1 2 b x 101
2 3 c y 102
3 4 d v 103
4 5 e u 104
In [6]: df.set_index(['a','b','c'], inplace=True)
In [7]: df
Out[7]:
othervalue
a b c
1 a z 100
2 b x 101
3 c y 102
4 d v 103
5 e u 104
In [8]: df.loc[[4,'d','v']]
Out[8]:
othervalue
a b c
4 d v 103
Extra bonus
Also, if you just want a single value from a specific column, you can extend the .loc call to select that column as well:
In [9]: df.loc[[4,'d','v'], 'othervalue']
Out[9]:
a b c
4 d v 103
Name: othervalue, dtype: int64
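One caveat, assuming a recent pandas: the usual key for a single MultiIndex row is a tuple rather than a list, so you may have to write it as:
df.loc[(4, 'd', 'v')]                  # the row as a Series
df.loc[[(4, 'd', 'v')]]                # the same row as a one-row DataFrame
df.loc[(4, 'd', 'v'), 'othervalue']    # the scalar 103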
If you're looking to match the (x, y, z) values in any order across the columns (just in the same row), I would use isin:
df = df[df[['a','b','c']].isin([x,y,z])].dropna()
It would be worth comparing the timing against your boolean mask on a big dataframe.
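One caveat: isin checks membership cell by cell, so a row like (x, x, y) would also survive the dropna. A stricter unordered match could compare each row as a multiset (a sketch, with x, y and z standing for your three values):
from collections import Counter

target = Counter([x, y, z])
mask = df[['a', 'b', 'c']].apply(lambda row: Counter(row) == target, axis=1)
df[mask]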
df = df[(df[['a','b','c']] == [x, y, z]).all(axis=1)]
Hope it will be helpful.