Sorting data. Pandas dataframe - python

Looking for advice on how to solve the following problem.
I have a Pandas DataFrame, let's say 1,000,000 rows by 10 columns (A, B, C, ..., J). The data type is float.
The task is to remove every row (i) from the dataframe if there exists another row (j) whose values are all equal to or greater than the values in the original row (i):
(Ai<=Aj) and (Bi<=Bj) and (Ci<=Cj) ... and (Ji<=Jj)
I wonder whether there are any tools in the pandas toolkit, or in any other Python analytics module, that solve this problem efficiently.
I have a very inefficient solution with multiple iterations over a simple array. Hoping to find something more promising.
Simplified example, original data:
0 1 5 4 4 2
2 5 6 4 3 7
-2 5 6 5 3 7
0 0 0 0 0 1
0 0 0 0 0 8
The result should be:
0 1 5 4 4 2
2 5 6 4 3 7
-2 5 6 5 3 7
0 0 0 0 0 8

There is a way using numpy:
import numpy as np
# keep each row that is the strict maximum of at least one column; such a row cannot be dominated
df[np.any(np.sum(df.values>=df.values[:,None],1)==1,1)]
Out[40]:
A B C D E F
0 0 1 5 4 4 2
1 2 5 6 4 3 7
2 -2 5 6 5 3 7
4 0 0 0 0 0 8
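For illustration only, here is a sketch of the dominance check written out explicitly (my own variant, not part of the answer above; the helper name pareto_filter is made up). It builds an n-by-n comparison, so for a million rows it would have to be applied in chunks rather than all at once.

import numpy as np
import pandas as pd

def pareto_filter(df):
    """Drop every row i for which some other row j is >= row i in every column."""
    v = df.values
    # ge_all[i, j] is True when row j is >= row i in every column
    ge_all = (v[None, :, :] >= v[:, None, :]).all(axis=2)
    np.fill_diagonal(ge_all, False)  # a row should not eliminate itself
    return df[~ge_all.any(axis=1)]

df = pd.DataFrame([[0, 1, 5, 4, 4, 2],
                   [2, 5, 6, 4, 3, 7],
                   [-2, 5, 6, 5, 3, 7],
                   [0, 0, 0, 0, 0, 1],
                   [0, 0, 0, 0, 0, 8]],
                  columns=list('ABCDEF'))
print(pareto_filter(df))  # reproduces the expected result above

Note that with this strict reading, exact duplicate rows dominate each other and both disappear; if duplicates should be kept, deduplicate first.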

Related

Shuffling one Column of a DataFrame By Group Efficiently

I am trying to implement a permutation test on a large Pandas dataframe. The dataframe looks like the following:
group some_value label
0 1 8 1
1 1 7 0
2 1 6 2
3 1 5 2
4 2 1 0
5 2 2 0
6 2 3 1
7 2 4 2
8 3 2 1
9 3 4 1
10 3 2 1
11 3 4 2
I want to group by the group column, shuffle the label column within each group, and write the result back to the data frame, preferably in place. The some_value column should remain intact. The result should look something like the following:
group some_value label
0 1 8 1
1 1 7 2
2 1 6 2
3 1 5 0
4 2 1 1
5 2 2 0
6 2 3 0
7 2 4 2
8 3 2 1
9 3 4 2
10 3 2 1
11 3 4 1
I used np.random.permutation but found it was very slow.
df["label"] = df.groupby("group")["label"].transform(np.random.permutation
It seems that df.sample is much faster. How can I solve this problem using df.sample() instead of np.random.permutation, and in place?
We can use sample. Notice this assumes df = df.sort_values('group').
df['New']=df.groupby('group').label.apply(lambda x : x.sample(len(x))).values
Or we can do it by
df['New']=df.sample(len(df)).sort_values('group').New.values
What about providing a custom transform function?
def sample(x):
    return x.sample(n=x.shape[0])

df.groupby("group")["label"].transform(sample)
This SO explanation of printing out what gets passed into the custom function via transform is helpful.
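For completeness, a small runnable sketch of the sample-based shuffle assigned back in place (my own illustration of the approach above, using the question's example data):

import pandas as pd

df = pd.DataFrame({"group": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                   "some_value": [8, 7, 6, 5, 1, 2, 3, 4, 2, 4, 2, 4],
                   "label": [1, 0, 2, 2, 0, 0, 1, 2, 1, 1, 1, 2]})

# sample(frac=1) draws each group's labels in a random order; .values drops the
# shuffled index so the assignment goes back by position instead of realigning
df["label"] = df.groupby("group")["label"].transform(lambda x: x.sample(frac=1).values)
print(df)

Without the .values, the shuffled Series may be realigned by its index, which can silently undo the shuffle.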

Python Pandas Update Value Based on Index using .iloc

I have a data frame
a=pd.DataFrame([[1,1,9],[2,1,9],[3,2,9],[4,2,9]],columns=['a','b','c'])
a b c
0 1 1 9
1 2 1 9
2 3 2 9
3 4 2 9
if I run
a['c'].iloc[0]=100
it works and I get
a b c
0 1 1 100
1 2 1 9
2 3 2 9
3 4 2 9
But if I want to update the first observation of group b==2 by running
a['c'][a['b']==2].iloc[0]=100
It doesn't do what I want it to do; I still get the same dataframe.
a b c
0 1 1 100
1 2 1 9
2 3 2 9
3 4 2 9
I wonder why? and what's a possible solution for this?
Thank you for your help.
You should use .loc like this; chained indexing (mixing [] with .iloc or .loc) will sometimes cause this issue. As the pandas documentation warns:
Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided.
a.loc[a.index[a.b==2][0],'c']=10000
a
Out[761]:
a b c
0 1 1 9
1 2 1 9
2 3 2 10000
3 4 2 9
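As a side note, a compact way to express "the first row where b == 2" in a single .loc call (a sketch of my own, assuming at least one matching row exists):

import pandas as pd

a = pd.DataFrame([[1, 1, 9], [2, 1, 9], [3, 2, 9], [4, 2, 9]],
                 columns=['a', 'b', 'c'])

# idxmax on a boolean mask returns the index label of the first True value,
# so this sets 'c' on the first row with b == 2 without chained indexing
a.loc[(a['b'] == 2).idxmax(), 'c'] = 100
print(a)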

Is it possible to obtain groupby style counts without collapsing Pandas DataFrame?

I have a DataFrame with 9 columns, and I'm trying to add a column of counts of unique values based on the first 3 columns (e.g. columns A, B, and C must match to count as a unique value, but the remaining columns can vary). I attempted to do this with groupby:
df = pd.DataFrame(resultsFile500.groupby(['chr','start','end']).size().reset_index().rename(columns={0:'count'}))
This returns a DataFrame with 5 columns, and the counts are what I want. However, I also need values from the original data frame, so what I have been trying to do is get those counts as a column in the original df. That would mean that if two rows had identical values in columns chr, start, and end, the counts column would be 2 in both rows, but they would not be collapsed into one row. Is there an easy solution here that I'm missing, or do I need to hack something together?
You can use .transform to get non-collapsing behavior:
>>> df
a b c d e
0 3 4 1 3 0
1 3 1 4 3 0
2 4 3 3 2 1
3 3 4 1 4 0
4 0 4 3 3 2
5 1 2 0 4 1
6 3 1 4 2 1
7 0 4 3 4 0
8 1 3 0 1 1
9 3 4 1 2 1
>>> df.groupby(['a','b','c']).transform('count')
d e
0 3 3
1 2 2
2 1 1
3 3 3
4 2 2
5 1 1
6 2 2
7 2 2
8 1 1
9 3 3
>>>
Note, you'll have to choose an arbitrary column from the .transform result, but then just do:
>>> df['unique_count'] = df.groupby(['a','b','c']).transform('count')['d']
>>> df
a b c d e unique_count
0 3 4 1 3 0 3
1 3 1 4 3 0 2
2 4 3 3 2 1 1
3 3 4 1 4 0 3
4 0 4 3 3 2 2
5 1 2 0 4 1 1
6 3 1 4 2 1 2
7 0 4 3 4 0 2
8 1 3 0 1 1 1
9 3 4 1 2 1 3
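If picking an arbitrary column from the transform result feels awkward, a slightly shorter variant (my own sketch, not from the answer above) is to transform one of the key columns with 'size', which counts rows per group directly:

import pandas as pd

df = pd.DataFrame({'a': [3, 3, 4, 3, 0, 1, 3, 0, 1, 3],
                   'b': [4, 1, 3, 4, 4, 2, 1, 4, 3, 4],
                   'c': [1, 4, 3, 1, 3, 0, 4, 3, 0, 1],
                   'd': [3, 3, 2, 4, 3, 4, 2, 4, 1, 2],
                   'e': [0, 0, 1, 0, 2, 1, 1, 0, 1, 1]})

# 'size' broadcasts the per-(a, b, c) row count back onto every row
df['unique_count'] = df.groupby(['a', 'b', 'c'])['a'].transform('size')
print(df)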

Finding the min (or max) of a class in Pandas

I'm working on a large dataset (with pandas in Python) and I have a dataframe with a structure similar to the following:
class value
0 1 6
1 1 4
2 1 5
3 5 6
4 5 2
...
n 225 3
The class labels increase continuously through the dataframe, but some values are skipped, as shown in the example. I was wondering how I can get simple stats like the min or the max of each class and assign them to a new feature.
class value min
0 1 6 4
1 1 4 4
2 1 5 4
3 5 6 2
4 5 2 2
...
n 225 3 3
The only solution I can come up with is a time-consuming loop.
By using transform:
df['min']=df.groupby('class')['value'].transform('min')
df
Out[497]:
class value min
0 1 6 4
1 1 4 4
2 1 5 4
3 5 6 2
4 5 2 2
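The same transform pattern works for any reducing statistic; a quick sketch (toy data of my own, mirroring the example above) adding both a min and a max column:

import pandas as pd

df = pd.DataFrame({'class': [1, 1, 1, 5, 5],
                   'value': [6, 4, 5, 6, 2]})

# each group's statistic is broadcast back onto every row of that group
df['min'] = df.groupby('class')['value'].transform('min')
df['max'] = df.groupby('class')['value'].transform('max')
print(df)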

TypeError: unhashable type: 'slice' for pandas

I have a pandas data structure, which I create like this:
test_inputs = pd.read_csv("../input/test.csv", delimiter=',')
Its shape
print(test_inputs.shape)
is this
(28000, 784)
I would like to print a subset of its rows, like this:
print(test_inputs[100:200, :])
print(test_inputs[100:200, :].shape)
However, I am getting:
TypeError: unhashable type: 'slice'
Any idea what could be wrong?
Indexing in pandas is really confusing, as it looks like list indexing, but it is not. You need to use .iloc, which indexes by position:
print(test_inputs.iloc[100:200, :])
And if you don't use column selection you can omit it
print(test_inputs.iloc[100:200])
P.S. Using .loc is not what you want, as it looks not for the row number but for the row index (which can be filled with anything, not necessarily numbers, and not necessarily unique). Ranges in .loc will find the rows with index values 100 and 200 and return the lines between them. If you just created the DataFrame, .iloc and .loc may give the same result, but using .loc in this case is very bad practice, as it will lead to hard-to-understand problems when the index changes for some reason (for example, you select some subset of rows, and from that moment on the row number and the index are no longer the same).
P.P.S. You can use test_inputs[100:200], but not test_inputs[100:200, :], because the pandas designers tried to combine several popular approaches into one construction. test_inputs['column'] is equivalent to test_inputs.loc[:, 'column'], but, surprisingly, slicing with integers, test_inputs[100:200], is equivalent to test_inputs.iloc[100:200] (while slicing with non-integer values is equivalent to .loc row slicing). And if you pass a pair of values to [], it is treated as a tuple for multilevel column indexing, so multi_level_columns_df['level_1', 'level_2'] is equivalent to multi_level_columns_df.loc[:, ('level_1', 'level_2')]. That is why your original construction led to the error: a slice can't be used as part of a multilevel column key.
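A tiny sketch (my own illustration, with made-up column names) of the multilevel-column behaviour described above:

import pandas as pd

cols = pd.MultiIndex.from_tuples([('level_1', 'level_2'), ('level_1', 'other')])
mdf = pd.DataFrame([[1, 2], [3, 4]], columns=cols)

# the pair in [] is read as a single tuple key into the column MultiIndex ...
print(mdf['level_1', 'level_2'])
# ... which is why a slice in that position raises "unhashable type: 'slice'"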
There are more possible solutions, but the output is not the same:
loc selects by labels (both bounds are included), while iloc and slicing without a function select by positions, where the start bound is included and the upper bound is excluded (see the docs on selection by position):
import numpy as np
import pandas as pd

test_inputs = pd.DataFrame(np.random.randint(10, size=(28, 7)))
print(test_inputs.loc[10:20])
0 1 2 3 4 5 6
10 3 2 0 6 6 0 0
11 5 0 2 4 1 5 2
12 5 3 5 4 1 3 5
13 9 5 6 6 5 0 1
14 7 0 7 4 2 2 5
15 2 4 3 3 7 2 3
16 8 9 6 0 5 3 4
17 1 1 0 7 2 7 7
18 1 2 2 3 5 8 7
19 5 1 1 0 1 8 9
20 3 6 7 3 9 7 1
print(test_inputs.iloc[10:20])
0 1 2 3 4 5 6
10 3 2 0 6 6 0 0
11 5 0 2 4 1 5 2
12 5 3 5 4 1 3 5
13 9 5 6 6 5 0 1
14 7 0 7 4 2 2 5
15 2 4 3 3 7 2 3
16 8 9 6 0 5 3 4
17 1 1 0 7 2 7 7
18 1 2 2 3 5 8 7
19 5 1 1 0 1 8 9
print(test_inputs[10:20])
0 1 2 3 4 5 6
10 3 2 0 6 6 0 0
11 5 0 2 4 1 5 2
12 5 3 5 4 1 3 5
13 9 5 6 6 5 0 1
14 7 0 7 4 2 2 5
15 2 4 3 3 7 2 3
16 8 9 6 0 5 3 4
17 1 1 0 7 2 7 7
18 1 2 2 3 5 8 7
19 5 1 1 0 1 8 9
print(test_inputs.values[100:200, :])
print(test_inputs.values[100:200, :].shape)
This code also works for me.
I was facing the same problem, and even the above solutions couldn't fix it. It was some problem with pandas; what I did was convert the dataframe into a NumPy array, which fixed the issue.
import pandas as pd
import numpy as np
test_inputs = pd.read_csv("../input/test.csv", delimiter=',')
test_inputs = np.asarray(test_inputs)
