How to delete rows in a data frame based on values in a set? - python

I want to delete a row in my data frame if its value matches any value in a set. I have tried many different iterations of the following code, but none of them work:
for x in range(len(df_1)):
    if subid in intersection == df_1["SubId"][x]:
        del df_1.iloc[x]
I am getting KeyError: 0. Any ideas?
Thanks in advance.
Edit: I have defined intersection as follows:
intersection = set(ABC).intersection(XYZ)

If you just want to remove those then use isin:
df_1[~df_1["SubId"].isin(intersection)]
This produces a boolean mask of the rows whose SubId matches one of the values in intersection, and we invert the mask using ~.
What you're doing will be slow, and won't your indexing potentially run off the end of the df if you keep removing rows?
Example:
In [1]:
import numpy as np
import pandas as pd
In [2]:
df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': np.random.randn(5)})
df
Out[2]:
   a         b
0  0  0.987283
1  1  0.683479
2  2  1.640774
3  3  1.262665
4  4 -1.462040
In [3]:
df[~df.a.isin([0,3])]
Out[3]:
   a         b
1  1  0.683479
2  2  1.640774
4  4 -1.462040
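Note that isin filtering returns a new DataFrame rather than modifying the original in place, so with the question's own names you would assign the result back:
df_1 = df_1[~df_1["SubId"].isin(intersection)]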

Related

Selecting rows with min and max values of a defined column in pandas

I have the following dataframe:
A,B,C,D
10,1,2,3
1,4,7,3
10,5,2,3
40,7,9,3
9,9,9,9
I would like to create another dataframe, starting from the previous one, which has only two rows. The selection of these two rows is based on the minimum and maximum values in column "A". I would like to get:
A,B,C,D
1,4,7,3
40,7,9,3
Do you think I should work with a sort of index.min and index.max and then select only the two rows and append them to a new dataframe? Do you have some other suggestions?
Thanks for any kind of help,
Best
IIUC you can simply subset the dataframe with an OR condition on df.A.min() and df.A.max():
df = df[(df.A==df.A.min())|(df.A==df.A.max())]
df
    A  B  C  D
1   1  4  7  3
3  40  7  9  3
Yes, you can use idxmin/idxmax and then use loc:
df.loc[df['A'].agg(['idxmin', 'idxmax'])]
Output:
    A  B  C  D
1   1  4  7  3
3  40  7  9  3
Note that this only gives one row for the min and one for the max. If you want all rows that tie for those values, you should use @CHRD's solution.

Get only matching rows for groups in Pandas groupby

I have the following df:
d = {"Col1":['a','d','b','c','a','d','b','c'],
"Col2":['x','y','x','z','x','y','z','y'],
"Col3":['n','m','m','l','m','m','l','l'],
"Col4":[1,4,2,2,1,4,2,2]}
df = pd.DataFrame(d)
When I groupby on three fields, I get the result:
gb = df.groupby(['Col1', 'Col2', 'Col3'])['Col4'].agg(['sum', 'mean'])
How can I extract only the groups and rows where a row of one group matches at least one row of another group on the grouped columns (Col2 and Col3)? The original post highlighted the desired rows in a screenshot: the rows to extract are the ones whose (Col2, Col3) pair also appears in another group.
Apologies if my statement is ambiguous. Any help would be appreciated.
You can reset_index, then use duplicated and a boolean mask to filter your dataframe:
gb = gb.reset_index()
gb[gb.duplicated(subset=['Col2','Col3'], keep=False)]
Output:
  Col1 Col2 Col3  sum  mean
0    a    x    m    1     1
2    b    x    m    2     2
3    b    z    l    2     2
5    c    z    l    2     2
Make a table with all allowed combinations and then inner join it with this dataframe.
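As a sketch of that suggestion, assuming the "allowed" combinations are the (Col2, Col3) pairs that occur in more than one group:
# Aggregated table from the question, with the group keys as columns
gb = df.groupby(['Col1', 'Col2', 'Col3'])['Col4'].agg(['sum', 'mean']).reset_index()
# Allowed combinations: (Col2, Col3) pairs shared by more than one group
counts = gb.groupby(['Col2', 'Col3']).size()
allowed = counts[counts > 1].reset_index()[['Col2', 'Col3']]
# Inner join keeps only the rows whose (Col2, Col3) pair is allowed
result = gb.merge(allowed, on=['Col2', 'Col3'], how='inner')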

Reindexing a pandas DataFrame using a dict (python3)

Is there a way, without use of loops, to reindex a DataFrame using a dict? Here is an example:
df = pd.DataFrame([[1,2], [3,4]])
dic = {0:'first', 1:'second'}
I want to apply something efficient to df to obtain:
        0  1
first   1  2
second  3  4
Speed is important, as the index in the actual DataFrame I am dealing with has a huge number of unique values. Thanks
You need the rename function:
df.rename(index=dic)
#         0  1
# first   1  2
# second  3  4
(I modified the dic to get these results: dic = {0: 'first', 1: 'second'}.)
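If rename is too slow for a huge index, a vectorized alternative worth trying is mapping the index directly. This is a sketch that assumes every index value appears in dic; values missing from dic would become NaN:
df.index = df.index.map(dic)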

How do I find duplicate indices in a DataFrame?

I have a pandas DataFrame with a multi-level index ("instance" and "index"). I want to find all the first-level ("instance") index values which are non-unique and to print out those values.
My frame looks like this:
                A
instance index
a        1     10
         2     12
         3      4
b        1     12
         2      5
         3      2
b        1     12
         2      5
         3      2
I want to find "b" as the duplicate 0-level index and print its value ("b") out.
You can use the get_duplicates() method:
>>> df.index.get_level_values('instance').get_duplicates()
[0, 1]
(In my example data 0 and 1 both appear multiple times.)
The get_level_values() method can accept a label (such as 'instance') or an integer and retrieves the relevant part of the MultiIndex.
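Note that get_duplicates() was deprecated and removed in later pandas versions; on recent releases an equivalent sketch using duplicated would be:
idx = df.index.get_level_values('instance')
idx[idx.duplicated()].unique()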
Assuming that your df has an index made of 'instance' and 'index' you could do this:
df1 = df.reset_index().pivot_table(index=['instance','index'], values='A', aggfunc='count')
df1[df1 > 1].index.get_level_values(0).drop_duplicates()
Which yields:
Index([u'b'], dtype='object')
Adding .values at the end (.drop_duplicates().values) will make an array:
array(['b'], dtype=object)
Or the same with one line using .groupby:
df[df.groupby(level=['instance','index']).count() > 1].dropna().index.get_level_values(0).drop_duplicates()
This should give you the whole rows rather than just the index values, which isn't quite what you asked for but might be close enough:
df[df.index.get_level_values('instance').duplicated()]
You want the duplicated method, applied to the 'instance' level of the index:
df.index.get_level_values('instance').duplicated()

Only allow one to one mapping between two columns in pandas dataframe

I have a two-column dataframe df where each row is distinct, and one element in one column can map to one or more elements in the other column. I want to filter OUT those elements, so that in the final dataframe one element in one column maps only to a unique element in the other column.
What I am doing is to group by one column and count the duplicates, then remove rows with counts of more than 1, and do it again for the other column. I am wondering if there is a better, simpler way.
Thanks
Edit 1: I just realized my solution is INCORRECT; removing multi-mapping elements in column A changes the number of mappings in column B. Consider the following example:
A B
1 4
1 3
2 4
1 maps to 3 and 4, so the first two rows should be removed, and 4 maps to 1 and 2, so the last row should be removed as well. The final table should be empty. However, my solution will keep the last row.
Can anyone provide a fast and simple solution? Thanks.
Well, you could do something like the following:
>>> df
   A  B
0  1  4
1  1  3
2  2  4
3  3  5
You only want to keep a row if no other row has that value of 'A' and no other row has that value of 'B'. Only row 3 meets those conditions in this example:
>>> Aone = df.groupby('A').filter(lambda x: len(x) == 1)
>>> Bone = df.groupby('B').filter(lambda x: len(x) == 1)
>>> Aone.merge(Bone, on=['A','B'], how='inner')
   A  B
0  3  5
Explanation:
>>> Aone = df.groupby('A').filter(lambda x: len(x) == 1)
>>> Aone
   A  B
2  2  4
3  3  5
The above grabs the rows that may be allowed based on looking at column 'A' alone.
>>> Bone = df.groupby('B').filter(lambda x: len(x) == 1)
>>> Bone
   A  B
1  1  3
3  3  5
The above grabs the rows that may be allowed based on looking at column 'B' alone. Merging the two with an inner join then leaves only the rows that meet both conditions:
>>> Aone.merge(Bone, on=['A','B'], how='inner')
Note, you could also do a similar thing using groupby/transform, but transform tends to be slowish, so I didn't pursue it as an alternative.
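For completeness, here is a minimal sketch of that transform-based alternative; it should produce the same result as the merge above:
mask = (df.groupby('A')['B'].transform('size') == 1) & \
       (df.groupby('B')['A'].transform('size') == 1)
df[mask]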
