Pandas DataFrame : selection of multiple elements in several columns


I have this pandas DataFrame, DF:
import numpy as np
import pandas as pd

DICT = {'letter': ['A','B','C','A','B','C','A','B','C'],
        'number': [1,1,1,2,2,2,3,3,3],
        'word': ['one','two','three','three','two','one','two','one','three']}
DF = pd.DataFrame(DICT)
It looks like this:
letter number word
0 A 1 one
1 B 1 two
2 C 1 three
3 A 2 three
4 B 2 two
5 C 2 one
6 A 3 two
7 B 3 one
8 C 3 three
I want to extract these rows:
letter number word
A 1 one
B 2 two
C 3 three
First I tried:
DF[(DF['letter'].isin(("A","B","C"))) &
DF['number'].isin((1,2,3)) &
DF['word'].isin(('one','two','three'))]
Of course it didn't work: every row is selected, because each condition is checked independently for its own column.
Then I tested:
Bool = DF[['letter','number','word']].isin(("A",1,"one"))
DF[np.all(Bool,axis=1)]
Good, it works, but only for one row.
If we take the next step and pass an iterable of tuples to .isin():
Bool = DF[['letter','number','word']].isin((("A",1,"one"),
("B",2,"two"),
("C",3,"three")))
Then it fails: the Boolean array is all False.
What am I doing wrong? Is there a more elegant way to do this selection based on several columns?
(I want to avoid a for loop, because the real DataFrames I'm using are very large, so I'm looking for the fastest way to do the job.)

The idea is to create a new DataFrame containing all the triples and then merge it with the original DataFrame:
L = [("A",1,"one"),
("B",2,"two"),
("C",3,"three")]
df1 = pd.DataFrame(L, columns=['letter','number','word'])
print (df1)
letter number word
0 A 1 one
1 B 2 two
2 C 3 three
df = DF.merge(df1)
print (df)
letter number word
0 A 1 one
1 B 2 two
2 C 3 three
Another idea is to create a list of tuples from the rows, convert it to a Series, and then compare with isin (this keeps the original index):
s = pd.Series(list(map(tuple, DF[['letter','number','word']].values.tolist())),index=DF.index)
df1 = DF[s.isin(L)]
print (df1)
letter number word
0 A 1 one
4 B 2 two
8 C 3 three
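
A third option, sketched here on the assumption that pandas 0.24 or later is installed (so that MultiIndex.from_frame is available), builds a MultiIndex from the three columns and tests it against the list of tuples. Like the Series approach, it keeps the original row labels. The names cols, wanted and result are only illustrative.

```python
import pandas as pd

DF = pd.DataFrame({
    'letter': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'number': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'word': ['one', 'two', 'three', 'three', 'two', 'one', 'two', 'one', 'three'],
})

# Illustrative names: the triples to keep and the columns they refer to
wanted = [("A", 1, "one"), ("B", 2, "two"), ("C", 3, "three")]
cols = ['letter', 'number', 'word']

# Each row becomes one tuple in the MultiIndex; isin compares whole tuples
mask = pd.MultiIndex.from_frame(DF[cols]).isin(wanted)
result = DF[mask]
print(result)
```

Whether this is faster than the merge depends on the data, so it is worth timing both on the real DataFrame.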

Related

Define column values to be selected / deselected as default

I would like to automate the selection of values in one column, Step_ID.
Instead of listing which Step_ID values I want to keep (as in the code below), I would like to specify that the first and the last Step_ID are excluded.
df = df.set_index(['Step_ID'])
df.loc[df.index.isin(['Step_2','Step_3','Step_4','Step_5','Step_6','Step_7','Step_8','Step_9','Step_10','Step_11','Step_12','Step_13','Step_14','Step_15','Step_16','Step_17','Step_18','Step_19','Step_20','Step_21','Step_22','Step_23','Step_24'])]
Is there any option to exclude the first and last value in the column? In this example, Step_1 and Step_25.
Or to include all values except the first and the last? In this example, Step_2 to Step_24.
The reason is that the files have different numbers of Step_ID values.
I would like a solution that simplifies this filtering so that I don't have to redefine the list every time. The first and last values in the Step_ID column must be excluded, but the number of Step_IDs differs from file to file.
Given Step_1 to Step_X, I need Step_2 to Step_(X-1).

Use:
df = pd.DataFrame({
'Step_ID': ['Step_1','Step_1','Step_2','Step_2','Step_3','Step_4','Step_5',
'Step_6','Step_6'],
'B': list(range(9))})
print (df)
Step_ID B
0 Step_1 0
1 Step_1 1
2 Step_2 2
3 Step_2 3
4 Step_3 4
5 Step_4 5
6 Step_5 6
7 Step_6 7
8 Step_6 8
Select all rows except those whose index value equals the first or last index value, which are obtained with df.index[[0, -1]]:
df = df.set_index(['Step_ID'])
df = df.loc[~df.index.isin(df.index[[0, -1]].tolist())]
print (df)
B
Step_ID
Step_2 2
Step_2 3
Step_3 4
Step_4 5
Step_5 6
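
If Step_ID should remain an ordinary column, the same idea can be sketched without set_index. This assumes the rows are ordered so that the first and last rows hold the first and last step; the names first_last and middle are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Step_ID': ['Step_1', 'Step_1', 'Step_2', 'Step_2', 'Step_3',
                'Step_4', 'Step_5', 'Step_6', 'Step_6'],
    'B': list(range(9)),
})

# Step_ID values found in the first and the last row
first_last = df['Step_ID'].iloc[[0, -1]]

# Keep every row whose Step_ID is not one of those two values
middle = df[~df['Step_ID'].isin(first_last)]
print(middle)
```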

Sum values in df column based on partial name of another column

Given the dataframe
a b
foo123 5
foo456 8
bar234 1
bar324 6
How do I sum the values in b based only on the first several characters of a? The output I'm looking for is:
a b
foo 13
bar 7
There are too many entries for column a to set manually, so something like the following won't work:
if df['a'].startswith('foo'):
sum(b)
I'm thinking of something more like: if the first three characters of df['a'] match, add up all the corresponding values of b.

If your substrings do not all have the same length, use str.extract, extract relevant portions from a and then use that to perform a groupby + sum operation on b:
# assuming your frame is df1
df1.groupby(df1['a'].str.extract(r'^(\D+)', expand=False))['b'].sum().reset_index()
a b
0 bar 7
1 foo 13
For better performance, reassign a first:
df1['a'] = df1['a'].str.extract(r'^(\D+)', expand=False)
df1.groupby('a', as_index=False)['b'].sum()
a b
0 bar 7
1 foo 13
If all substrings are of the same size, just slice and groupby:
df1.groupby(df1['a'].str[:3])['b'].sum().reset_index()
a b
0 bar 7
1 foo 13
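
For reference, here is a self-contained sketch of the str.extract variant on the sample data from the question. Passing sort=False is an addition of this sketch, not part of the answer above: it keeps the groups in order of first appearance, which matches the order the question asks for. The names prefix and out are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'a': ['foo123', 'foo456', 'bar234', 'bar324'],
                    'b': [5, 8, 1, 6]})

# Leading run of non-digit characters, e.g. 'foo123' -> 'foo'
prefix = df1['a'].str.extract(r'^(\D+)', expand=False)

# sort=False keeps groups in order of first appearance (foo, then bar)
out = df1.groupby(prefix, sort=False)['b'].sum().reset_index()
print(out)
```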

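Replace the digits with an empty string (regex=True is required in recent pandas versions, where str.replace treats the pattern as a literal string by default):
df.groupby(df.a.str.replace(r'\d+', '', regex=True)).b.sum()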
Out[1353]:
a
bar 7
foo 13
Name: b, dtype: int64

Python: Efficiently extract a single value for every group

I need to add a description column to a dataframe that is built by grouping items from another dataframe.
grouped= df1.groupby('item')
list= grouped['total'].agg(np.sum)
list= list.reset_index()
To assign a description label to every item, I've come up with this solution:
def des(item):
return df1['description'].loc[df1['item']== item].iloc[0]
list['description'] = list['item'].apply(des)
It works, but it takes an enormous amount of time to execute.
I'd like to do something like this:
list=list.assign(description= df1['description'].loc[df1['item']==list['item']]
or
list=list.assign(description= df1['description'].loc[df1['item'].isin(list['item'])]
These are very wrong, but I hope you get the idea. I'm hoping pandas has something that does the trick more efficiently, but I can't find it.
Any ideas?

I think you need DataFrameGroupBy.agg with a dict of functions: sum for the total column and first for the description column:
df = df1.groupby('item', as_index=False).agg({'total':'sum', 'description':'first'})
Also, don't use list as a variable name, because it shadows the Python built-in list.
Sample:
df1 = pd.DataFrame({'description':list('abcdef'),
'B':[4,5,4,5,5,4],
'total':[5,3,6,9,2,4],
'item':list('aaabbb')})
print (df1)
B description item total
0 4 a a 5
1 5 b a 3
2 4 c a 6
3 5 d b 9
4 5 e b 2
5 4 f b 4
df = df1.groupby('item', as_index=False).agg({'total':'sum', 'description':'first'})
print (df)
item total description
0 a 14 a
1 b 15 d
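
The same aggregation can also be written with named aggregation, sketched here on the assumption that pandas 0.25 or later is available. The result name summary is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'description': list('abcdef'),
                    'B': [4, 5, 4, 5, 5, 4],
                    'total': [5, 3, 6, 9, 2, 4],
                    'item': list('aaabbb')})

# Each keyword is an output column: (input column, aggregation function)
summary = df1.groupby('item', as_index=False).agg(
    total=('total', 'sum'),
    description=('description', 'first'),
)
print(summary)
```

Named aggregation makes the output column names explicit, which helps when the same input column is aggregated in more than one way.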

Python Pandas - filtering df by the number of unique values within a group

Here is an example of data I'm working on. (as a pandas df)
index inv Rev_stream Bill_type Net_rev
1 1 A Original -24.77
2 1 B Original -24.77
3 2 A Original -409.33
4 2 B Original -409.33
5 2 C Original -409.33
6 2 D Original -409.33
7 3 A Original -843.11
8 3 A Rebill 279.5
9 3 B Original -843.11
10 4 A Rebill 279.5
11 4 B Original -843.11
12 5 B Rebill 279.5
How could I filter this df to get only the rows where an invoice/Rev_stream combination has both Original and Rebill as Bill_type? In the example above, that would be only the rows with index 7 and 8.
Is there an easy way to do it, without iterating over the whole dataframe and building dictionaries of invoice+RevStream : Bill_type?
What I'm looking for is some kind of
df = df[df[['inv','Rev_stream']]['Bill_type'].unique().len() == 2]
Unfortunately the code above doesn't work.
Thanks in advance.

You can group your data by inv and Rev_stream columns and then check for each group if both Original and Rebill are in the Bill_type values and filter based on the condition:
(df.groupby(['inv', 'Rev_stream'])
.filter(lambda g: 'Original' in g.Bill_type.values and 'Rebill' in g.Bill_type.values))
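
A vectorised variant that is closer to the nunique idea in the question can be sketched with groupby plus transform. It assumes that Original and Rebill are the only two values that occur in Bill_type, so that two distinct values in a group means both are present. The names mask and both are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'inv': [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5],
    'Rev_stream': list('ABABCDAABABB'),
    'Bill_type': ['Original'] * 7 + ['Rebill', 'Original', 'Rebill', 'Original', 'Rebill'],
    'Net_rev': [-24.77, -24.77, -409.33, -409.33, -409.33, -409.33,
                -843.11, 279.5, -843.11, 279.5, -843.11, 279.5],
}, index=range(1, 13))

# Number of distinct Bill_type values per (inv, Rev_stream), broadcast to every row
n_types = df.groupby(['inv', 'Rev_stream'])['Bill_type'].transform('nunique')

# Assumption: only two bill types exist, so 2 distinct values means both occur
mask = n_types == 2
both = df[mask]
print(both)
```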

Only allow one to one mapping between two columns in pandas dataframe

I have a two-column dataframe df in which every row is distinct. An element in one column can map to one or more elements in the other column. I want to filter out the elements that map to more than one, so that in the final dataframe each element in one column maps to a unique element in the other column.
What I am doing is grouping by one column, counting the duplicates, and removing the rows with counts greater than 1, then doing the same for the other column. I am wondering if there is a better, simpler way.
Thanks
Edit 1: I just realized my solution is INCORRECT: removing multi-mapping elements in column A reduces the number of mappings in column B. Consider the following example:
A B
1 4
1 3
2 4
1 maps to 3 and 4, so the first two rows should be removed, and 4 maps to 1 and 2. The final table should be empty. However, my solution will keep the last row.
Can anyone provide a fast and simple solution? Thanks.

Well, you could do something like the following:
>>> df
A B
0 1 4
1 1 3
2 2 4
3 3 5
You only want to keep a row if no other row has its value of 'A' and no other row has its value of 'B'. Only the row with index 3 meets those conditions in this example:
>>> Aone = df.groupby('A').filter(lambda x: len(x) == 1)
>>> Bone = df.groupby('B').filter(lambda x: len(x) == 1)
>>> Aone.merge(Bone,on=['A','B'],how='inner')
A B
0 3 5
Explanation:
>>> Aone = df.groupby('A').filter(lambda x: len(x) == 1)
>>> Aone
A B
2 2 4
3 3 5
The above grabs the rows that may be allowed based on looking at column 'A' alone.
>>> Bone = df.groupby('B').filter(lambda x: len(x) == 1)
>>> Bone
A B
1 1 3
3 3 5
The above grabs the rows that may be allowed based on looking at column 'B' alone. Merging the two results keeps their intersection, which leaves only the rows that meet both conditions:
>>> Aone.merge(Bone,on=['A','B'],how='inner')
Note, you could also do a similar thing using groupby/transform. But transform tends to be slowish so I didn't do it as an alternative.
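
Another way to express the same condition, offered as a sketch rather than as part of the answer above, is Series.duplicated with keep=False, which flags every occurrence of a repeated value. It assumes, as the question states, that the rows themselves are distinct. The names dup_a, dup_b and one_to_one are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3],
                   'B': [4, 3, 4, 5]})

# keep=False marks all occurrences of a repeated value, not just the later ones
dup_a = df['A'].duplicated(keep=False)
dup_b = df['B'].duplicated(keep=False)

# Keep rows whose A value and B value each appear exactly once
one_to_one = df[~dup_a & ~dup_b]
print(one_to_one)
```

Unlike the merge, this keeps the original row labels.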
