Say I have the following DataFrame with a MultiIndex on both the index and the columns.
first x y
second m n
A B
A0 B0 0 0
B1 0 0
A1 B0 0 0
B1 0 0
I'm trying to update the values based on conditions. The conditions will be something like:
rules = [{'condition': {'A': 'A0', 'B': 'B0'}, 'value': 5},
         {'condition': {'B': 'B1'}, 'value': 3},
         ...]
I'm trying to find something with functionality similar to the following.
Using pandas.DataFrame.xs to set values:
for each rule in rules:
    df.xs(conditions.values, level=[conditions.keys]) = value
Or passing more than one level to pandas.Index.get_level_values to set values:
for each rule in rules:
    df.loc[df.index.get_level_values(conditions.keys) == [conditions.values]] = value
The result should be
first x y
second m n
A B
A0 B0 5 5
B1 3 3
A1 B0 0 0
B1 3 3
Unfortunately, selection by dictionary on a MultiIndex is not yet supported in pandas, so you need a custom solution adapted to your rules:
rules = [{'condition': {'A': 'A0', 'B': 'B0'}, 'value': 5},
         {'condition': {'B': 'B1'}, 'value': 3}]

for rule in rules:
    d = rule['condition']
    # build one indexer per index level: the condition value if given, else select everything
    indexer = [d[name] if name in d else slice(None) for name in df.index.names]
    df.loc[tuple(indexer), :] = rule['value']

print(df)
first x y
second m n
A B
A0 B0 5 5
B1 3 3
A1 B0 0 0
B1 3 3
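For reference, a minimal sketch of the boolean-mask approach the question hints at with get_level_values, assuming the same df and rules as above (conditions within a rule are ANDed together):

import pandas as pd

for rule in rules:
    mask = pd.Series(True, index=df.index)
    for name, val in rule['condition'].items():
        # compare one index level's values against the condition value
        mask &= df.index.get_level_values(name) == val
    df.loc[mask] = rule['value']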
I'm in need of some advice on the following issue:
I have a DataFrame that looks like this:
ID SEQ LEN BEG_GAP END_GAP
0 A1 AABBCCDDEEFFGG 14 2 4
1 A1 AABBCCDDEEFFGG 14 10 12
2 B1 YYUUUUAAAAMMNN 14 4 6
3 B1 YYUUUUAAAAMMNN 14 8 12
4 C1 LLKKHHUUTTYYYYYYYYAA 20 7 9
5 C1 LLKKHHUUTTYYYYYYYYAA 20 12 15
6 C1 LLKKHHUUTTYYYYYYYYAA 20 17 18
What I need is the SEQ split on the different BEG_GAP and END_GAP ranges. I have already worked it out (thanks to a previous question) for sequences with only one pair of gaps, but here they have several.
This is what the sequences should look like:
ID SEQ
0 A1 AA---CDDEE---GG
1 B1 YYUU---A-----NN
2 C1 LLKKHHU---YY----Y--A
Or in an exploded DF:
ID Seq_slice
0 A1 AA
1 A1 CDDEE
2 A1 GG
3 B1 YYUU
4 B1 A
5 B1 NN
6 C1 LLKKHHU
7 C1 YY
8 C1 Y
9 C1 A
At the moment, I'm using a piece of code (that I got thanks to a previous question) that works only if there's one gap, and it looks like this:
import pandas as pd
df = pd.read_csv(r"..\path_to_the_csv.csv")
df["BEG_GAP"] = df["BEG_GAP"].astype(int)
df["END_GAP"]= df["END_GAP"].astype(int)
df['SEQ'] = df.apply(lambda x: [x.SEQ[:x.BEG_GAP], x.SEQ[x.END_GAP+1:]], axis=1)
output = df.explode('SEQ').query('SEQ!=""')
But this has the problem that it generates a bunch of sequences that don't really exist because they actually have another gap in the middle.
I.e., this is what it would generate:
ID Seq_slice
0 A1 AA
1 A1 CDDEEFFG #<- this one shouldn't exist! Because there's another gap in 10-12
2 A1 AABBCCDDEE #<- Also, this one shouldn't exist, it's missing the previous gap.
3 A1 GG
And so on, with the other sequences. As you can see, there are some slices that are not being generated and some that are wrong, because I don't know how to tell the code to have in mind all the gaps while analyzing the sequence.
All advice is appreciated, I hope I was clear!
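For reference, a minimal sketch reconstructing the example frame from the table above, so the answers below can be run without the CSV:

import pandas as pd

df = pd.DataFrame({
    'ID': ['A1', 'A1', 'B1', 'B1', 'C1', 'C1', 'C1'],
    'SEQ': ['AABBCCDDEEFFGG', 'AABBCCDDEEFFGG',
            'YYUUUUAAAAMMNN', 'YYUUUUAAAAMMNN',
            'LLKKHHUUTTYYYYYYYYAA', 'LLKKHHUUTTYYYYYYYYAA', 'LLKKHHUUTTYYYYYYYYAA'],
    'LEN': [14, 14, 14, 14, 20, 20, 20],
    'BEG_GAP': [2, 10, 4, 8, 7, 12, 17],
    'END_GAP': [4, 12, 6, 12, 9, 15, 18]})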
Let's try defining a function and using apply:
def truncate(data):
    seq = data.SEQ.iloc[0]
    ll = data.LEN.iloc[0]
    # slice from each END_GAP to the next BEG_GAP (and from 0 / to LEN at the ends)
    return [seq[x:y] for x, y in zip([0] + list(data.END_GAP),
                                     list(data.BEG_GAP) + [ll])]

(df.groupby('ID').apply(truncate)
   .explode().reset_index(name='Seq_slice')
)
Output:
ID Seq_slice
0 A1 AA
1 A1 CCDDEE
2 A1 GG
3 B1 YYUU
4 B1 AA
5 B1 NN
6 C1 LLKKHHU
7 C1 TYY
8 C1 YY
9 C1 AA
In one line:
df.groupby('ID').agg({'BEG_GAP': list, 'END_GAP': list, 'SEQ': max, 'LEN': max}).apply(lambda x: [x['SEQ'][b: e] for b, e in zip([0] + x['END_GAP'], x['BEG_GAP'] + [x['LEN']])], axis=1).explode()
ID
A1 AA
A1 CCDDEE
A1 GG
B1 YYUU
B1 AA
B1 NN
C1 LLKKHHU
C1 TYY
C1 YY
C1 AA
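The one-liner above returns a Series indexed by ID; if you want the same two-column frame as the first version, you can reset the index (a sketch of the same expression, just reformatted and finished with reset_index):

out = (df.groupby('ID')
         .agg({'BEG_GAP': list, 'END_GAP': list, 'SEQ': max, 'LEN': max})
         .apply(lambda x: [x['SEQ'][b:e] for b, e in zip([0] + x['END_GAP'],
                                                         x['BEG_GAP'] + [x['LEN']])], axis=1)
         .explode()
         .reset_index(name='Seq_slice'))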
I'm new to pandas and have tried going through the docs and experimenting with various examples, but this problem I'm tackling has really stumped me.
I have the following two dataframes (DataA/DataB), which I would like to merge on a per global_index/item_id basis.
DataA DataB
row item_id valueA row item_id valueB
0 x A1 0 x B1
1 y A2 1 y B2
2 z A3 2 x B3
3 x A4 3 y B4
4 z A5 4 z B5
5 x A6 5 x B6
6 y A7 6 y B7
7 z A8 7 z B8
The list of items (item_ids) is finite, and each of the two dataframes represents the value of a trait (trait A, trait B) for an item at a given global_index value.
The global_index can roughly be thought of as a unit of "time".
The mapping between each data frame (DataA/DataB) and the global_index is done via the following two mapper DFs:
DataA_mapper
global_index start_row num_rows
0 0 3
1 3 2
3 5 3
DataB_mapper
global_index start_row num_rows
0 0 2
2 2 3
4 5 3
Simply put, for a given global_index (e.g. 1) the mapper defines the range of rows in the respective DF (DataA or DataB) that are associated with that global_index.
For example, for a global_index value of 0:
In DF DataA rows 0..2 are associated with global_index 0
In DF DataB rows 0..1 are associated with global_index 0
Another example, for a global_index value of 2:
In DF DataB rows 2..4 are associated with global_index 2
In DF DataA there are no rows associated with global_index 2
The ranges [start_row,start_row + num_rows) presented do not overlap each other and represent a unique sequence/range of rows in their respective dataframes (DataA, DataB)
In short, no row in either DataA or DataB will be found in more than one range.
I would like to merge the DFs so that I get the following dataframe:
row global_index item_id valueA valueB
0 0 x A1 B1
1 0 y A2 B2
2 0 z A3 NaN
3 1 x A4 B1
4 1 z A5 NaN
5 2 x A4 B3
6 2 y A2 B4
7 2 z A5 NaN
8 3 x A6 B3
9 3 y A7 B4
10 3 z A8 B5
11 4 x A6 B6
12 4 y A7 B7
13 4 z A8 B8
In the final dataframe, for any pair of global_index/item_id there will only ever be either:
a value for both valueA and valueB
a value only for valueA
a value only for valueB
The requirement is that if there is only one value for a given global_index/item (e.g. valueA but no valueB), the last seen value of the missing one should be used.
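For reference, here is a minimal sketch reconstructing the inputs from the tables above, using the variable names the answer below works with (df_A, df_B, map_A, map_B):

import numpy as np
import pandas as pd

df_A = pd.DataFrame({'row': list(range(8)),
                     'item_id': ['x', 'y', 'z', 'x', 'z', 'x', 'y', 'z'],
                     'valueA': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8']})
df_B = pd.DataFrame({'row': list(range(8)),
                     'item_id': ['x', 'y', 'x', 'y', 'z', 'x', 'y', 'z'],
                     'valueB': ['B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8']})
map_A = pd.DataFrame({'global_index': [0, 1, 3], 'start_row': [0, 3, 5], 'num_rows': [3, 2, 3]})
map_B = pd.DataFrame({'global_index': [0, 2, 4], 'start_row': [0, 2, 5], 'num_rows': [2, 3, 3]})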
First, you can create the 'global_index' column using the function pd.cut:
for df, m in [(df_A, map_A), (df_B, map_B)]:
    bins = np.insert(m['num_rows'].cumsum().values, 0, 0)  # create bins and add zero at the beginning
    df['global_index'] = pd.cut(df['row'], bins=bins, labels=m['global_index'], right=False)
Next, you can use outer join to merge both data frames:
df = df_A.merge(df_B, on=['global_index', 'item_id'], how='outer')
And finally, you can use groupby and ffill to fill the missing values:
for val in ['valueA', 'valueB']:
    df[val] = df.groupby('item_id')[val].ffill()
Output:
item_id global_index valueA valueB
0 x 0 A1 B1
1 y 0 A2 B2
2 z 0 A3 NaN
3 x 1 A4 B1
4 z 1 A5 NaN
5 x 3 A6 B1
6 y 3 A7 B2
7 z 3 A8 NaN
8 x 2 A6 B3
9 y 2 A7 B4
10 z 2 A8 B5
11 x 4 A6 B6
12 y 4 A7 B7
13 z 4 A8 B8
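If you also want the rows ordered by global_index and item_id as in the desired result, one possible final step (a sketch):

df = df.sort_values(['global_index', 'item_id']).reset_index(drop=True)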
I haven't tested this out, since I don't have any good test data, but I think something like this should work. Basically, rather than trying to pull off some sort of complicated join, it builds a set of lists to hold your data, which you can then put back together into a final dataframe at the end.
DataA = DataA.set_index('row')
DataB = DataB.set_index('row')

# we're going to create the new dataframe from scratch, creating a list for each column we want
global_index = []
AValues = []
AIndex = []
BValues = []
BIndex = []

# totalIndexes is assumed to be the list of global_index values to process
for indexNum in totalIndexes:
    # for each global index, we get the range of rows to extract from DataA and DataB
    AStart = DataA_mapper.loc[DataA_mapper['global_index'] == indexNum, 'start_row'].values[0]
    ARows = DataA_mapper.loc[DataA_mapper['global_index'] == indexNum, 'num_rows'].values[0]
    AStop = AStart + ARows
    BStart = DataB_mapper.loc[DataB_mapper['global_index'] == indexNum, 'start_row'].values[0]
    BRows = DataB_mapper.loc[DataB_mapper['global_index'] == indexNum, 'num_rows'].values[0]
    BStop = BStart + BRows
    # Next we extract values from DataA and DataB, turn them into lists, and add them to our data
    AValues = AValues + list(DataA.iloc[AStart:AStop, 1].values)
    AIndex = AIndex + list(DataA.iloc[AStart:AStop, 0].values)
    BValues = BValues + list(DataB.iloc[BStart:BStop, 1].values)
    BIndex = BIndex + list(DataB.iloc[BStart:BStop, 0].values)
    # Create a temporary list of the current global_index, and add it to our data
    global_index_temp = []
    for row in range(max(ARows, BRows)):
        global_index_temp.append(indexNum)
    global_index = global_index + global_index_temp

# combine all these individual lists into a dataframe
finalData = list(zip(global_index, AIndex, BIndex, AValues, BValues))
df = pd.DataFrame(data=finalData, columns=['global_index', 'item1', 'item2', 'valueA', 'valueB'])
# lastly you just need to merge item1 and item2 to get your item_id column
I've tried to comment it nicely so that hopefully the general plan makes sense and you can follow along, correct my mistakes, or rewrite it in your own way.
I have a pandas dataframe with a MultiIndex, where I want to aggregate the duplicate key rows as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame({'S': [0, 5, 0, 5, 0, 3, 5, 0],
                   'Q': [6, 4, 10, 6, 2, 5, 17, 4],
                   'A': ['A1', 'A1', 'A1', 'A1', 'A2', 'A2', 'A2', 'A2'],
                   'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1', 'B1', 'B2']})
df.set_index(['A', 'B'])
Q S
A B
A1 B1 6 0
B1 4 5
B2 10 0
B2 6 5
A2 B1 2 0
B1 5 3
B1 17 5
B2 4 0
and I would like to group this dataframe to aggregate the Q values (sum) and keep the S value from the row with the maximal Q value, yielding this:
df2 = pd.DataFrame({'S': [0, 0, 5, 0],
                    'Q': [10, 16, 24, 4],
                    'A': ['A1', 'A1', 'A2', 'A2'],
                    'B': ['B1', 'B2', 'B1', 'B2']})
df2.set_index(['A', 'B'])
Q S
A B
A1 B1 10 0
B2 16 0
A2 B1 24 5
B2 4 0
I tried the following, but it didn't work:
df.groupby(by=['A','B']).agg({'Q':'sum','S':df.S[df.Q.idxmax()]})
Any hints?
One way is to use agg, apply, and join:
g = df.groupby(['A','B'], group_keys=False)
g.apply(lambda x: x.loc[x.Q == x.Q.max(),['S']]).join(g.agg({'Q':'sum'}))
Output:
S Q
A B
A1 B1 0 10
B2 0 16
A2 B1 5 24
B2 0 4
Here's one way:
In [1800]: def agg(x):
      ...:     m = x.S.iloc[np.argmax(x.Q.values)]
      ...:     return pd.Series({'Q': x.Q.sum(), 'S': m})
      ...:

In [1801]: df.groupby(['A', 'B']).apply(agg)
Out[1801]:
Q S
A B
A1 B1 10 0
B2 16 0
A2 B1 24 5
B2 4 0
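For completeness, a shorter alternative (a sketch, not from the answers above): sort by Q first, so that taking the last S in each group picks the S of the max-Q row while the Q sum is unaffected by the ordering:

df.sort_values('Q').groupby(['A', 'B']).agg({'Q': 'sum', 'S': 'last'})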
I have a dataframe with many columns (around 1000).
Given a set of columns (around 10), which have 0 or 1 as values, I would like to select all the rows where I have 1s in the aforementioned set of columns.
Toy example. My dataframe is something like this:
c1,c2,c3,c4,c5
'a',1,1,0,1
'b',0,1,0,0
'c',0,0,1,1
'd',0,1,0,0
'e',1,0,0,1
And I would like to get the rows where the columns c2 and c5 are equal to 1:
'a',1,1,0,1
'e',1,0,0,1
What would be the most efficient way to do it?
Thanks!
This would be more generic for multiple columns cols:
In [1277]: cols = ['c2', 'c5']
In [1278]: df[(df[cols] == 1).all(1)]
Out[1278]:
c1 c2 c3 c4 c5
0 'a' 1 1 0 1
4 'e' 1 0 0 1
Or,
In [1284]: df[np.logical_and.reduce([df[x]==1 for x in cols])]
Out[1284]:
c1 c2 c3 c4 c5
0 'a' 1 1 0 1
4 'e' 1 0 0 1
Or,
In [1279]: df.query(' and '.join(['%s==1'%x for x in cols]))
Out[1279]:
c1 c2 c3 c4 c5
0 'a' 1 1 0 1
4 'e' 1 0 0 1
Can you try doing something like this:
df.loc[(df['c2'] == 1) & (df['c5'] == 1)]
import pandas as pd

frame = pd.DataFrame([
    ['a', 1, 1, 0, 1],
    ['b', 0, 1, 0, 0],
    ['c', 0, 0, 1, 1],
    ['d', 0, 1, 0, 0],
    ['e', 1, 0, 0, 1]], columns='c1,c2,c3,c4,c5'.split(','))

print(frame.loc[(frame['c2'] == 1) & (frame['c5'] == 1)])
Now I would like to handle this dataframe:
df
A B
1 A0
1 A1
1 B0
2 B1
2 B2
3 B3
3 A2
3 A3
First, I would like to group by df.A
sub1
A B
1 A0
1 A1
1 B0
Second, I would like to extract the first row which contains the letter A:
A B
1 A0
If there is no A in the group, as in
sub2
A B
2 B1
2 B2
I would like to extract the first row:
A B
2 B1
So, I would like to get the result below
A B
1 A0
2 B1
3 A2
I would like to handle this priority-based extraction. I tried grouping but couldn't figure it out. How can I handle this?
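For reference, a minimal sketch building the example frame used below:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 3, 3, 3],
                   'B': ['A0', 'A1', 'B0', 'B1', 'B2', 'B3', 'A2', 'A3']})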
You can group by column A and, for each group, use idxmax() on str.contains("A"); if there is an A in column B, it will get the first index which contains the letter A, otherwise it falls back to the first row since all values are False:
df.groupby("A", as_index=False).apply(lambda g: g.loc[g.B.str.contains("A").idxmax()])
# A B
#0 1 A0
#1 2 B1
#2 3 A2
In cases where you may have a duplicated index, you can use numpy.ndarray.argmax() with iloc, which accepts an integer as positional indexing:
df.groupby("A", as_index=False).apply(lambda g: g.iloc[g.B.str.contains("A").values.argmax()])
# A B
#0 1 A0
#1 2 B1
#2 3 A2
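A further alternative (a sketch, not part of the answer above): flag the rows whose B contains "A", stably sort so those come first, then take the first row per group:

out = (df.assign(has_A=df['B'].str.contains('A'))
         .sort_values('has_A', ascending=False, kind='mergesort')  # stable sort keeps the original order within ties
         .groupby('A', as_index=False)
         .first()
         .drop(columns='has_A'))
print(out)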