how to count unique values after groupby ID - python

I have the following pandas DataFrame df:

ID from to
A  0x   0c
A  0x   0f
A  0f   0n
B  0f   0c
B  0c   0f
C  0k   0j
C  0j   0k
C  0k   0a

First I want to group by ID and only keep a group if the number of unique values in its from and to columns combined is at most 3.

So the desired df will be:

ID from to
B  0f   0c
B  0c   0f
C  0k   0j
C  0j   0k
C  0k   0a

What about using a groupby filter with a lambda function that checks whether the number of unique values in the from and to columns is less than or equal to 3? You can use DataFrame.stack() as a hacky solution to put all of the values of a dataframe into a single Series, so that Series.nunique() can be used:
import pandas as pd

# The dataframe from the question
df = pd.DataFrame({"ID": list("AAABBCCC"),
                   "from": ["0x", "0x", "0f", "0f", "0c", "0k", "0j", "0k"],
                   "to": ["0c", "0f", "0n", "0c", "0f", "0j", "0k", "0a"]})
out = df.groupby("ID").filter(lambda x: x[["from", "to"]].stack().nunique() <= 3)
out:

  ID from  to
3  B   0f  0c
4  B   0c  0f
5  C   0k  0j
6  C   0j  0k
7  C   0k  0a
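If stacking feels too hacky, a minimal alternative sketch (same df, same result) is to concatenate the two columns and call nunique() on the combined Series:

# Concatenate the two columns instead of stacking the sub-frame
out = df.groupby("ID").filter(
    lambda g: pd.concat([g["from"], g["to"]]).nunique() <= 3
)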

Related

applying pivot table on pandas dataframe instead of grouping

I have a dataframe like this, and I can group it by the library and sample columns to create new columns:
df = pd.DataFrame({'barcode': ['b1', 'b2', 'b1', 'b2', 'b1', 'b2', 'b1', 'b2'],
                   'library': ['l1', 'l1', 'l1', 'l1', 'l2', 'l2', 'l2', 'l2'],
                   'sample': ['s1', 's1', 's2', 's2', 's1', 's1', 's2', 's2'],
                   'category': ['c1', 'c2', 'c1', 'c2', 'c1', 'c2', 'c1', 'c2'],
                   'count': [10, 21, 13, 54, 51, 16, 67, 88]})
df

  barcode library sample category  count
0      b1      l1     s1       c1     10
1      b2      l1     s1       c2     21
2      b1      l1     s2       c1     13
3      b2      l1     s2       c2     54
4      b1      l2     s1       c1     51
5      b2      l2     s1       c2     16
6      b1      l2     s2       c1     67
7      b2      l2     s2       c2     88
I used grouping to reduce the dimensions of the df:
grp = df.groupby(['library', 'sample'])
df = grp.get_group(('l1', 's1')).rename(
    columns={"count": "l1_s1_count"}).reset_index(drop=True)
df['l1_s2_count'] = grp.get_group(('l1', 's2'))[['count']].values
df['l2_s1_count'] = grp.get_group(('l2', 's1'))[['count']].values
df['l2_s2_count'] = grp.get_group(('l2', 's2'))[['count']].values
df = df.drop(['sample', 'library'], axis=1)
result:

  barcode category  l1_s1_count  l1_s2_count  l2_s1_count  l2_s2_count
0      b1       c1           10           13           51           67
1      b2       c2           21           54           16           88
I think there should be a neater way to do this transformation, for example with a pivot table, which I tried and failed with. Could you please suggest how this could be done with a pivot table? Thanks.
Try the pivot_table function as below. It will produce a MultiIndex result, which will need to be flattened:
df2 = pd.pivot_table(df, index=['barcode', 'category'],
                     columns=['sample', 'library'], values='count').reset_index()
df2.columns = ["_".join(a) for a in df2.columns.to_flat_index()]
out:

  barcode_ category_  s1_l1  s1_l2  s2_l1  s2_l2
0       b1        c1     10     51     13     67
1       b2        c2     21     16     54     88
Or even without the values='count' argument:
df2 = pd.pivot_table(df, index=['barcode', 'category'],
                     columns=['sample', 'library']).reset_index()
df2.columns = ["_".join(a) for a in df2.columns.to_flat_index()]
out:

  barcode__ category__  count_s1_l1  count_s1_l2  count_s2_l1  count_s2_l2
0        b1         c1           10           51           13           67
1        b2         c2           21           16           54           88
Use whichever you prefer.
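One small follow-up, not part of the original answer: after flattening, the barcode and category columns keep trailing underscores from the empty extra levels of the MultiIndex. You can strip them afterwards:

# Strip the trailing "_" left by the empty column levels
df2.columns = [c.rstrip("_") for c in df2.columns]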

How to initialize or change a dataframe according to my index length?

I have an index that looks like:
MyIndex
11
12
13
and a dataframe which might be longer than my index (they could be equal in some situations):
OldIndex c1
0 00
1 01
2 02
3 03
4 04
I want to fit the dataframe into the index (by dropping the extra rows at the tail):
MyIndex c1
11 00
12 01
13 02
Is there any simple solution? It would be better if I can achieve this without creating a new dataframe.
You can try this:

my_idx = pd.Series([11, 12, 13], name='MyIndex')
df = pd.DataFrame({'OldIndex': [0, 1, 2, 3, 4],
                   'c1': ['00', '01', '02', '03', '04']}).set_index('OldIndex')
n = len(my_idx)
df = df.iloc[:n].set_index(my_idx)
         c1
MyIndex
11       00
12       01
13       02
A simple way to do it would just be this, assuming you have:
df1 = pd.DataFrame(index=[0,1,2])
df2 = pd.DataFrame({'c1':[1,2,3,4,5]},index=[0,1,2,3,4])
df1['c1'] = df2['c1'].values[:len(df1.index)]
output:
>>> df1
c1
0 1
1 2
2 3
And without creating a new df, say ind = pd.Index([0, 1, 2]):
df2, df2.index = df2.iloc[:len(ind)], ind
output:
>>> df2
c1
0 1
1 2
2 3
You have to create the desired index (as a pd.Index) and then set this index on the new df (a subset based on the length of the new index):
myindex = pd.Index(['I1','I2','I3'])
df = df.iloc[:len(myindex)].set_index(myindex)
Output:
c1
I1 00
I2 01
I3 02
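If avoiding a new DataFrame really matters, a minimal in-place sketch (assuming the df and my_idx from the first answer) trims the tail and overwrites the index without rebinding the variable:

# Drop the extra tail rows in place, then relabel what remains
df.drop(df.index[len(my_idx):], inplace=True)
df.index = my_idx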

Python Data frame: If Column Name is contained in the String Row of Another Column Then 1 Otherwise 0

Column A          2C  GAD  D2  6F  ABCDE
2C 1B D2 6F ABC    1    0   1   1      0
2C 1248 Bulers     1    0   0   0      0
Above is the dataframe I want to create. The first row represents the field names. The logic I want to employ is as follows: if the column name is contained in the "Column A" row, then 1, otherwise 0.
I have scoured Google looking for code answering a question similar to mine so I can test it out and reverse engineer a solution. Unfortunately, I have not been able to find anything. Otherwise I would post some code that I attempted, but I literally have no clue.
You can use a list comprehension to create the desired data based on the columns and rows:
In [39]: row = ['2C 1B D2 6F ABC', '2C 1248 Bulers']

In [40]: columns = ['2C', 'GAD', 'D2', '6F', 'ABCDE']

In [41]: df = pd.DataFrame([[int(k in r) for k in columns] for r in row],
    ...:                   index=row, columns=columns)
In [42]: df
Out[42]:
                 2C  GAD  D2  6F  ABCDE
2C 1B D2 6F ABC   1    0   1   1      0
2C 1248 Bulers    1    0   0   0      0
If you want a pure pandas approach, create the rows and columns as pd.Series instead of lists, then use Series.apply with Series.str.contains to get the desired result:

In [71]: row = pd.Series(['2C 1B D2 6F ABC', '2C 1248 Bulers'])

In [72]: columns = pd.Series(['2C', 'GAD', 'D2', '6F', 'ABCDE'])

In [73]: data = columns.apply(row.str.contains).astype(int).transpose()

In [74]: df = pd.DataFrame(data.values, index=['2C 1B D2 6F ABC', '2C 1248 Bulers'],
    ...:                   columns=['2C', 'GAD', 'D2', '6F', 'ABCDE'])
In [75]: df
Out[75]:
                 2C  GAD  D2  6F  ABCDE
2C 1B D2 6F ABC   1    0   1   1      0
2C 1248 Bulers    1    0   0   0      0
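For what it's worth, a compact variant of the same idea (my sketch, not from either answer) builds one 0/1 column per candidate name directly:

import pandas as pd

row = ['2C 1B D2 6F ABC', '2C 1248 Bulers']
columns = ['2C', 'GAD', 'D2', '6F', 'ABCDE']

# One 0/1 column per name; regex=False matches each name literally
s = pd.Series(row, index=row)
df = pd.DataFrame({c: s.str.contains(c, regex=False).astype(int) for c in columns})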

combine row with different name for a column pandas python

I have a sample data set:

import pandas as pd

df = pd.DataFrame({'columA': ['1A', '2A', '3A', '4A', '5A', '6A'],
                   'count': [1, 12, 34, 52, 3, 2],
                   'columnB': ['a', 'dd', 'dd', 'ee', 'd', 'f']})

It looks like this:
columA columnB count
1A a 1
2A dd 12
3A dd 34
4A ee 52
5A d 3
6A f 2
Update: the combined 2A and 3A name should be something arbitrary like 'SAB' or '2A plus 3A', etc. I used '2A|3A' as the example and it confused some people.
I want to sum up the counts of rows 2A and 3A and give the combined row a name such as SAB.
desired output:
columA columnB count
1A a 1
SAB dd 46
4A ee 52
5A d 3
6A f 2
We can use a groupby on columnB:
df = pd.DataFrame({'columA': ['1A', '2A', '3A', '4A', '5A', '6A'],
                   'count': [1, 12, 34, 52, 3, 2],
                   'columnB': ['a', 'dd', 'dd', 'ee', 'd', 'f']})

df.groupby('columnB').agg({'count': 'sum', 'columA': 'sum'})
        columA  count
columnB
a           1A      1
d           5A      3
dd        2A3A     46
ee          4A     52
f           6A      2
If you're concerned about the index name, you can write a function like so:

def join_by_pipe(s):
    return '|'.join(s)

df.groupby('columnB').agg({'count': 'sum', 'columA': join_by_pipe})
        columA  count
columnB
a           1A      1
d           5A      3
dd       2A|3A     46
ee          4A     52
f           6A      2
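To get the exact 'SAB' label from the update, a minimal sketch (my addition; the replacement mapping is the arbitrary part) is to relabel the rows first and group on the result:

# Map the rows to be merged onto one arbitrary label, then group on it
label = df['columA'].replace({'2A': 'SAB', '3A': 'SAB'})
out = df.groupby([label, df['columnB']], sort=False)['count'].sum().reset_index()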

Slice pandas DataFrame by MultiIndex level or sublevel

Inspired by this answer and the lack of an easy answer to this question I found myself writing a little syntactic sugar to make life easier to filter by MultiIndex level.
def _filter_series(x, level_name, filter_by):
    """
    Filter a pd.Series or pd.DataFrame x by `filter_by` on the MultiIndex level
    `level_name`.

    Uses `pd.Index.get_level_values()` in the background. `filter_by` is either
    a string or an iterable.
    """
    if isinstance(x, (pd.Series, pd.DataFrame)):
        if isinstance(filter_by, str):
            filter_by = [filter_by]
        index = x.index.get_level_values(level_name).isin(filter_by)
        return x[index]
    else:
        print("Not a pandas object")
But if I know the pandas development team (and I'm starting to, slowly!) there's already a nice way to do this, and I just don't know what it is yet!
Am I right?
I actually upvoted joris's answer... but unfortunately the refactoring he mentions did not happen in 0.14 and is not happening in 0.17 either. So for the moment, let me suggest a quick and dirty solution (obviously derived from Jeff's):
def filter_by(df, constraints):
    """Filter MultiIndex by sublevels."""
    indexer = [constraints[name] if name in constraints else slice(None)
               for name in df.index.names]
    return df.loc[tuple(indexer)] if len(df.shape) == 1 else df.loc[tuple(indexer), :]

pd.Series.filter_by = filter_by
pd.DataFrame.filter_by = filter_by
... to be used as
df.filter_by({'level_name': value})
where value can indeed be a single value, but also a list or a slice...
(untested with Panels and higher-dimensional objects, but I do expect it to work)
This is very easy using the new multi-index slicers in master/0.14 (releasing soon), see here.
There is an open issue to make this syntactically easier (it's not hard to do), see here.
E.g. something like this: df.loc[{'third': ['C1', 'C3']}] I think is reasonable.
Here's how you can do it (requires master/0.14):
In [2]: def mklbl(prefix, n):
   ...:     return ["%s%s" % (prefix, i) for i in range(n)]
   ...:

In [11]: index = MultiIndex.from_product([mklbl('A', 4),
                                          mklbl('B', 2),
                                          mklbl('C', 4),
                                          mklbl('D', 2)],
                                         names=['first', 'second', 'third', 'fourth'])

In [12]: columns = ['value']

In [13]: df = DataFrame(np.arange(len(index) * len(columns)).reshape(
    ...:                    (len(index), len(columns))),
    ...:                index=index, columns=columns).sortlevel()
In [14]: df
Out[14]:
value
first second third fourth
A0 B0 C0 D0 0
D1 1
C1 D0 2
D1 3
C2 D0 4
D1 5
C3 D0 6
D1 7
B1 C0 D0 8
D1 9
C1 D0 10
D1 11
C2 D0 12
D1 13
C3 D0 14
D1 15
A1 B0 C0 D0 16
D1 17
C1 D0 18
D1 19
C2 D0 20
D1 21
C3 D0 22
D1 23
B1 C0 D0 24
D1 25
C1 D0 26
D1 27
C2 D0 28
D1 29
C3 D0 30
D1 31
A2 B0 C0 D0 32
D1 33
C1 D0 34
D1 35
C2 D0 36
D1 37
C3 D0 38
D1 39
B1 C0 D0 40
D1 41
C1 D0 42
D1 43
C2 D0 44
D1 45
C3 D0 46
D1 47
A3 B0 C0 D0 48
D1 49
C1 D0 50
D1 51
C2 D0 52
D1 53
C3 D0 54
D1 55
B1 C0 D0 56
D1 57
C1 D0 58
D1 59
...
[64 rows x 1 columns]
Create an indexer across all of the levels, selecting all entries
In [15]: indexer = [slice(None)]*len(df.index.names)
Make the level we care about only have the entries we care about
In [16]: indexer[df.index.names.index('third')] = ['C1','C3']
Select it (it's important that this is a tuple!):
In [18]: df.loc[tuple(indexer),:]
Out[18]:
value
first second third fourth
A0 B0 C1 D0 2
D1 3
C3 D0 6
D1 7
B1 C1 D0 10
D1 11
C3 D0 14
D1 15
A1 B0 C1 D0 18
D1 19
C3 D0 22
D1 23
B1 C1 D0 26
D1 27
C3 D0 30
D1 31
A2 B0 C1 D0 34
D1 35
C3 D0 38
D1 39
B1 C1 D0 42
D1 43
C3 D0 46
D1 47
A3 B0 C1 D0 50
D1 51
C3 D0 54
D1 55
B1 C1 D0 58
D1 59
C3 D0 62
D1 63
[32 rows x 1 columns]
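As a side note, the same selection can be spelled with pd.IndexSlice (also new in 0.14), which saves building the indexer list by hand; a minimal sketch assuming the df above and pandas imported as pd:

# Restrict only the 'third' level; every other level stays a full slice
idx = pd.IndexSlice
df.loc[idx[:, :, ['C1', 'C3'], :], :]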
You have the filter method, which can do things like this. E.g. with the example that was asked in the linked SO question:
In [188]: df.filter(like='0630', axis=0)
Out[188]:
sales cogs net_pft
STK_ID RPT_Date
876 20060630 857483000 729541000 67157200
20070630 1146245000 1050808000 113468500
20080630 1932470000 1777010000 133756300
2254 20070630 501221000 289167000 118012200
The filter method is being refactored at the moment (in the upcoming 0.14), and a level keyword will be added (because right now you can have a problem if the same labels appear in different levels of the index).
