Python Pandas Sort DataFrame by Duplicate Rows - python

What is the nicest way to see which rows are duplicated in DataFrame with the duplicate rows sorted and stacked on top of each other? I know I can filter for duplicates with df.duplicated() or something like df[df.duplicated()==True] but need to be able to produce a dataframe with the duplicates and then sort them to show both records in the Dataframe. I also do not need to use a col subset argument for this. -Thank you

One idea is to sort by all columns. Not sure how efficient that is though.
In [20]: df = pd.DataFrame (np.random.randint (100,size=(3,3)), columns= list('ABC'))
In [21]: df = df.append(df, ignore_index=True)
In [22]: df
Out[22]:
A B C
0 23 71 65
1 63 0 47
2 47 13 44
3 23 71 65
4 63 0 47
5 47 13 44
In [23]: df.sort(df.columns.tolist())
Out[23]:
A B C
0 23 71 65
3 23 71 65
2 47 13 44
5 47 13 44
1 63 0 47
4 63 0 47

Related

Using a pandas function to convert the multiindex name into column?

Say I have a dataframe like this:
I know I can drop levels so that I have columns: ['Season', 'Players', 'Teams']. Is there a function in pandas where I can collapse 'First team' name into a column so that the entire column says 'First team'?
IIUC, you can do a few different things, here are two ways:
Where dummy dataframe,
df = pd.DataFrame(np.random.randint(0,100,(5,3)),
columns = pd.MultiIndex.from_tuples([('Season', 'Season'),
('First team', 'Players'),
('First team', 'Teams')]))
Input dummy dataframe:
Season First team
Season Players Teams
0 28 41 53
1 62 87 87
2 43 94 4
3 23 12 93
4 14 43 62
Then use droplevel:
df = df.droplevel(0, axis=1)
Output:
Season Players Teams
0 54 94 19
1 54 47 91
2 56 35 40
3 37 68 14
4 17 78 68
Or flatten multiindex column header using list comprehension:
df.columns = [f'{i}_{j}' for i, j in df.columns]
#Also, can use df.columns = df.columns.map('_'.join)
df
Output:
Season_Season First team_Players First team_Teams
0 54 94 19
1 54 47 91
2 56 35 40
3 37 68 14
4 17 78 68

Looping a function with Pandas DataFrames

I have some function that takes a DataFrame and an integer as arguments:
func(df, int)
The function returns a new DataFrame, e.g.:
df2 = func(df,2)
I'd like to write a loop for integers 2-10, resulting in 9 DataFrames. If I do this manually it would look like this:
df2 = func(df,2)
df3 = func(df2,3)
df4 = func(df3,4)
df5 = func(df4,5)
df6 = func(df5,6)
df7 = func(df6,7)
df8 = func(df7,8)
df9 = func(df8,9)
df10 = func(df9,10)
Is there a way to write a loop that does this?
This type of thing is what lists are for.
data_frames = [df]
for i in range(2, 11):
data_frames.append(func(data_frames[-1], i))
It's a sign of brittle code when you see variable names like df1, df2, df3, etc. Use lists when you have a sequence of related objects to build.
To clarify, this data_frames is a list of DataFrames that can be concatenated with data_frames = pd.concat(data_frames, sort=False), resulting in one DataFrame that combines the original df with everything that results from the loop, correct?
Yup, that's right. If your goal is one final data frame, you can concatenate the entire list at the end to combine the information into a single frame.
Do you mind explaining why data_frames[-1], which takes the last item of the list, returns a DataFrame? Not clear on this.
Because as you're building the list, at all times each entry is a data frame. data_frames[-1] evaluates to the last element in the list, which in this case, is the data frame you most recently appended.
You may try using itertools.accumulate as follows:
sample data
df:
a b c
0 75 18 17
1 48 56 3
import itertools
def func(x, y):
return x + y
dfs = list(itertools.accumulate([df] + list(range(2, 11)), func))
[ a b c
0 75 18 17
1 48 56 3, a b c
0 77 20 19
1 50 58 5, a b c
0 80 23 22
1 53 61 8, a b c
0 84 27 26
1 57 65 12, a b c
0 89 32 31
1 62 70 17, a b c
0 95 38 37
1 68 76 23, a b c
0 102 45 44
1 75 83 30, a b c
0 110 53 52
1 83 91 38, a b c
0 119 62 61
1 92 100 47, a b c
0 129 72 71
1 102 110 57]
dfs is the list of result dataframes where each one is the adding of 2 - 10 to the previous result
If you want concat them all into one dataframe, Use pd.concat
pd.concat(dfs)
Out[29]:
a b c
0 75 18 17
1 48 56 3
0 77 20 19
1 50 58 5
0 80 23 22
1 53 61 8
0 84 27 26
1 57 65 12
0 89 32 31
1 62 70 17
0 95 38 37
1 68 76 23
0 102 45 44
1 75 83 30
0 110 53 52
1 83 91 38
0 119 62 61
1 92 100 47
0 129 72 71
1 102 110 57
You can use exec with a formatted string:
for i in range(2, 11):
exec("df{0} = func(df{1}, {0})".format(i, i - 1 if i > 2 else ''))

Dask DataFrame calculate mean within multi-column groupings

I have a data frame as shown in Image, what I want to do is to take the mean along the column 'trial'. It for every subject, condition and sample (when all these three columns has value one), take average of data along column trial (100 rows).
what I have done in pandas is as following
sub_erp_pd= pd.DataFrame()
for j in range(1,4):
sub_c=subp[subp['condition']==j]
for i in range(1,3073):
sub_erp_pd=sub_erp_pd.append(sub_c[sub_c['sample']==i].mean(),ignore_index=True)
But this take alot of time..
So i am thinking to use dask instead of Pandas.
But in dask i am having issue in creating an empty data frame. Like we create an empty data frame in pandas and append data to it.
image of data frame
as suggested by #edesz I made changes in my approach
EDIT
%%time
sub_erp=pd.DataFrame()
for subno in progressbar.progressbar(range(1,82)):
try:
sub=pd.read_csv('../input/data/{}.csv'.format(subno,subno),header=None)
except:
sub=pd.read_csv('../input/data/{}.csv'.format(subno,subno),header=None)
sub_erp=sub_erp.append(sub.groupby(['condition','sample'], as_index=False).mean())
Reading a file using pandas take 13.6 seconds while reading a file using dask take 61.3 ms. But in dask, I am having trouble in appending.
NOTE - The original question was titled Create an empty dask dataframe and append values to it.
If I understand correctly, you need to
use groupby (read more here) in order to group the subject, condition and sample columns
this will gather all rows, which have the same value in each of these three columns, into a single group
take the average using .mean()
this will give you the mean within each group
Generate some dummy data
df = df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)),
columns=['trial','condition','sample'])
df.insert(0,'subject',[1]*10 + [2]*30 + [5]*60)
print(df.head())
subject trial condition sample
0 1 71 96 34
1 1 2 89 66
2 1 90 90 81
3 1 93 43 18
4 1 29 82 32
Pandas approach
Aggregate and take mean
df_grouped = df.groupby(['subject','condition','sample'], as_index=False)['trial'].mean()
print(df_grouped.head(15))
subject condition sample trial
0 1 18 24 89
1 1 43 18 93
2 1 67 47 81
3 1 82 32 29
4 1 85 28 97
5 1 88 13 48
6 1 89 59 23
7 1 89 66 2
8 1 90 81 90
9 1 96 34 71
10 2 0 81 19
11 2 2 39 58
12 2 2 59 94
13 2 5 42 13
14 2 9 42 4
Dask approach
Step 1. Imports
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
Step 2. Convert Pandas DataFrame to Dask DataFrame, using .from_pandas
ddf = dd.from_pandas(df, npartitions=2)
Step 3. Aggregate and take mean
ddf_grouped = (
ddf.groupby(['subject','condition','sample'])['trial']
.mean()
.reset_index(drop=False)
)
with ProgressBar():
df_grouped = ddf_grouped.compute()
[ ] | 0% Completed | 0.0s
[########################################] | 100% Completed | 0.1s
print(df_grouped.head(15))
subject condition sample trial
0 1 18 24 89
1 1 43 18 93
2 1 67 47 81
3 1 82 32 29
4 1 85 28 97
5 1 88 13 48
6 1 89 59 23
7 1 89 66 2
8 1 90 81 90
9 1 96 34 71
10 2 0 81 19
11 2 2 39 58
12 2 2 59 94
13 2 5 42 13
14 2 9 42 4
IMPORTANT NOTE: The approach in this answer does not use the approach of creating an empty Dask DataFrame and append values to it in order to calculate a mean within groupings of subject, condition and trial. Instead, this answer provides an alternate approach (using GROUP BY) to obtaining the desired end result (of calculating the mean within groupings of subject, condition and trial).

Pandas Keyerror

I have a very simple code:
stats2 = {'a':[1,2,3,4,5,6],
'b':[43,34,65,56,29,76],
'c':[65,67,78,65,45,52],
'cac':['mns','ab','cd','cd','ab','k']}
f2 = pd.DataFrame(stats2)
f2.set_index(['cac'], inplace = True)
print(f2.ix['mns'])
print(f2['mns'])
f2.ix['mns'] works just fine. However, f2['mns'] reports KeyError. I am trying to understand why it does that. Is that how pandas work? Do I have to use ix even though I have set the index before?
This is your original dataframe:
>>> df
a b c cac
0 1 43 65 mns
1 2 34 67 ab
2 3 65 78 cd
3 4 56 65 cd
4 5 29 45 ab
5 6 76 52 k
>>> df.set_index(['cac'], inplace=True)
>>> df
a b c
cac
mns 1 43 65
ab 2 34 67
cd 3 65 78
cd 4 56 65
ab 5 29 45
k 6 76 52
So, setting the index in pandas is simply replacing the before counter values(0,1,2,...,5) to the new row values i.e (mns, ab,...,k) of cac column name.
>>> df.ix['mns']
a 1
b 43
c 65
This command specifically searches for row in the index column, cac whose value is equal to mns and retrieves it's corresponding elements.
Note: As mns is not a column name of the dataframe, df['mns'] throws a key error.

Fastest way to sort each row in a pandas dataframe

I need to find the quickest way to sort each row in a dataframe with millions of rows and around a hundred columns.
So something like this:
A B C D
3 4 8 1
9 2 7 2
Needs to become:
A B C D
8 4 3 1
9 7 2 2
Right now I'm applying sort to each row and building up a new dataframe row by row. I'm also doing a couple of extra, less important things to each row (hence why I'm using pandas and not numpy). Could it be quicker to instead create a list of lists and then build the new dataframe at once? Or do I need to go cython?
I think I would do this in numpy:
In [11]: a = df.values
In [12]: a.sort(axis=1) # no ascending argument
In [13]: a = a[:, ::-1] # so reverse
In [14]: a
Out[14]:
array([[8, 4, 3, 1],
[9, 7, 2, 2]])
In [15]: pd.DataFrame(a, df.index, df.columns)
Out[15]:
A B C D
0 8 4 3 1
1 9 7 2 2
I had thought this might work, but it sorts the columns:
In [21]: df.sort(axis=1, ascending=False)
Out[21]:
D C B A
0 1 8 4 3
1 2 7 2 9
Ah, pandas raises:
In [22]: df.sort(df.columns, axis=1, ascending=False)
ValueError: When sorting by column, axis must be 0 (rows)
To Add to the answer given by #Andy-Hayden, to do this inplace to the whole frame... not really sure why this works, but it does. There seems to be no control on the order.
In [97]: A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
In [98]: A
Out[98]:
one two three four five
0 22 63 72 46 49
1 43 30 69 33 25
2 93 24 21 56 39
3 3 57 52 11 74
In [99]: A.values.sort
Out[99]: <function ndarray.sort>
In [100]: A
Out[100]:
one two three four five
0 22 63 72 46 49
1 43 30 69 33 25
2 93 24 21 56 39
3 3 57 52 11 74
In [101]: A.values.sort()
In [102]: A
Out[102]:
one two three four five
0 22 46 49 63 72
1 25 30 33 43 69
2 21 24 39 56 93
3 3 11 52 57 74
In [103]: A = A.iloc[:,::-1]
In [104]: A
Out[104]:
five four three two one
0 72 63 49 46 22
1 69 43 33 30 25
2 93 56 39 24 21
3 74 57 52 11 3
I hope someone can explain the why of this, just happy that it works 8)
You could use pd.apply.
Eg:
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
print (A)
one two three four five
0 2 75 44 53 46
1 18 51 73 80 66
2 35 91 86 44 25
3 60 97 57 33 79
A = A.apply(np.sort, axis = 1)
print(A)
one two three four five
0 2 44 46 53 75
1 18 51 66 73 80
2 25 35 44 86 91
3 33 57 60 79 97
Since you want it in descending order, you can simply multiply the dataframe with -1 and sort it.
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
A = A * -1
A = A.apply(np.sort, axis = 1)
A = A * -1
Instead of using pd.DataFrame constructor, an easier way to assign the sorted values back is to use double brackets:
original dataframe:
A B C D
3 4 8 1
9 2 7 2
df[['A', 'B', 'C', 'D']] = np.sort(df)[:, ::-1]
A B C D
0 8 4 3 1
1 9 7 2 2
This way you can also sort a part of the columns:
df[['B', 'C']] = np.sort(df[['B', 'C']])[:, ::-1]
A B C D
0 3 8 4 1
1 9 7 2 2
One could try this approach to preserve the integrity of the df:
import pandas as pd
import numpy as np
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
print (A)
print(type(A))
one two three four five
0 85 27 64 50 55
1 3 90 65 22 8
2 0 7 64 66 82
3 58 21 42 27 30
<class 'pandas.core.frame.DataFrame'>
B = A.apply(lambda x: np.sort(x), axis=1, raw=True)
print(B)
print(type(B))
one two three four five
0 27 50 55 64 85
1 3 8 22 65 90
2 0 7 64 66 82
3 21 27 30 42 58
<class 'pandas.core.frame.DataFrame'>

Categories

Resources