I want to order this DataFrame by a given column field and the number of entries I have for this given field.
So let's say I have a very simple dataframe, looking something like this:
name age
0 Paul 12
1 Ryan 17
2 Michael 100
3 Paul 36
4 Paul 66
5 Michael 45
What I want as a result is something like
name age
0 Paul 12
1 Paul 36
2 Paul 66
3 Michael 100
4 Michael 45
5 Ryan 17
There are three Pauls, so they come up first, then two Michaels, and finally only one Ryan.
One option: use value_counts to get the names ordered by frequency, then set the index, select the rows in that order with .loc, and reset the index:
x = list(df['name'].value_counts().index)
df.set_index('name').loc[x].reset_index()
returns
name age
0 Paul 12
1 Paul 36
2 Paul 66
3 Michael 100
4 Michael 45
5 Ryan 17
The idea is to create a helper column to sort on, in this case the size of each name group. Add a .reset_index(drop=True) at the end if you prefer a brand new RangeIndex, or keep it as is if the original index is useful.
Sorting does not change the ordering within equal values (pass kind='stable' if you want that guaranteed), so the first 'Paul' row still appears first among the Pauls.
(df.assign(s = df.groupby('name').name.transform('size'))
.sort_values('s', ascending=False)
.drop(columns='s'))
Output
name age
0 Paul 12
3 Paul 36
4 Paul 66
2 Michael 100
5 Michael 45
1 Ryan 17
To allay fears raised in the comments, this method is performant, much more so than the value_counts approach above, and it does not ruin your initial index.
import pandas as pd
import numpy as np
np.random.seed(42)
N = 10**6
df = pd.DataFrame({'name': np.random.randint(1, 10000, N),
'age': np.random.normal(0, 1, N)})
%%timeit
(df.assign(s = df.groupby('name').name.transform('size'))
.sort_values('s', ascending=False)
.drop(columns='s'))
#500 ms ± 31.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
x = list(df['name'].value_counts().index)
df.set_index('name').loc[x].reset_index()
#2.67 s ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
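If you are on pandas 1.1 or newer, the key parameter of sort_values can do the same thing without a helper column. A minimal sketch, not part of the original answer, assuming that pandas version:
# Map each name to its frequency and sort by it, descending;
# kind='stable' keeps the original row order within each name.
# (Names that happen to share a count are not grouped together,
# which is fine for this example.)
df.sort_values('name',
               key=lambda s: s.map(s.value_counts()),
               ascending=False,
               kind='stable')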
The only change here is the ability to sort by the count of each name and then by age within each name.
df['name_count'] = df['name'].map(df['name'].value_counts())
df = df.sort_values(by=['name_count', 'age'],
ascending=[False,True]).drop('name_count', axis=1)
df.reset_index(drop=True)
name age
0 Paul 12
1 Paul 36
2 Paul 66
3 Michael 45
4 Michael 100
5 Ryan 17
Related
I am trying to go through my dataframe two lines at a time, checking if a column value is the same in both rows and removing such rows. My dataframe tracks the locations of different people during different encounters.
I have a dataframe, called transfers, in which each row consists of an ID number for a person, an encounter number, and a location. The transfers dataframe was created by running duplicated() on my original dataframe to find rows sharing the same person ID and grouping them together.
For example, we would want to get rid of the rows with ID = 2 in the dataframe below because the location was "D" in both encounters, so this person has not moved.
However, we would want to keep the rows with ID = 3 because that person moved from "A" to "F".
Another issue arises because some people have more than two rows, for example where ID = 1. For this person, we would want to keep their rows because they have moved from "A" -> "B" and then from "B" -> "C". However, if you only compare the encounters 12 and 13, it does not look like this person has changed locations.
Example dataframe df:
ID Encounter Location
1 11 A
1 12 B
1 13 B
1 14 C
2 21 D
2 22 D
3 31 A
3 32 F
Expected output:
ID Encounter Location
1 11 A
1 12 B
1 13 B
1 14 C
3 31 A
3 32 F
I have tried nested for loops using .iterrows(), but this did not work: it was terribly slow and did not properly handle cases where a person had more than two encounters. I have also tried applying a function to my dataframe, but the runtime was nearly the same as crude looping.
EDIT: I should have explicitly stated this, I am trying to keep the data of any person who has moved locations even if they end up back where they started.
Given
>>> df
ID Encounter Location
0 1 11 A
1 1 12 B
2 1 13 B
3 1 14 C
4 2 21 D
5 2 22 D
6 3 31 A
7 3 32 F
you can filter your dataframe via
>>> places = df.groupby('ID')['Location'].transform('nunique')
>>> df[places > 1]
ID Encounter Location
0 1 11 A
1 1 12 B
2 1 13 B
3 1 14 C
6 3 31 A
7 3 32 F
The idea is to count the number of unique places per group (ID) and then drop the rows where a person has only been to one place.
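For illustration, here is what the intermediate places series looks like for the example data (each row gets the number of distinct locations of its ID):
>>> places
0    3
1    3
2    3
3    3
4    1
5    1
6    2
7    2
Name: Location, dtype: int64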
Comparison versus a groupby().filter() solution:
# setup
>>> df = pd.concat([df.assign(ID=df['ID'] + i) for i in range(1000)], ignore_index=True)
>>> df
ID Encounter Location
0 1 11 A
1 1 12 B
2 1 13 B
3 1 14 C
4 2 21 D
... ... ... ...
7995 1000 14 C
7996 1001 21 D
7997 1001 22 D
7998 1002 31 A
7999 1002 32 F
[8000 rows x 3 columns]
# timings on an i5-6200U CPU @ 2.30GHz
>>> %timeit df.groupby('ID').filter(lambda x: x['Location'].nunique() > 1)
356 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit df[df.groupby('ID')['Location'].transform('nunique') > 1]
5.56 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I have a dataframe in pandas:
In [10]: df
Out[10]:
col_a col_b col_c col_d
0 France Paris 3 4
1 UK Londo 4 5
2 US Chicago 5 6
3 UK Bristol 3 3
4 US Paris 8 9
5 US London 44 4
6 US Chicago 12 4
I need to count unique cities. I can count unique states
In [11]: df['col_a'].nunique()
Out[11]: 3
and I can try to count unique cities
In [12]: df['col_b'].nunique()
Out[12]: 5
but it is wrong because Paris in the US and Paris in France are different cities. So for now I'm doing it like this:
In [13]: df['col_a_b'] = df['col_a'] + ' - ' + df['col_b']
In [14]: df
Out[14]:
col_a col_b col_c col_d col_a_b
0 France Paris 3 4 France - Paris
1 UK Londo 4 5 UK - Londo
2 US Chicago 5 6 US - Chicago
3 UK Bristol 3 3 UK - Bristol
4 US Paris 8 9 US - Paris
5 US London 44 4 US - London
6 US Chicago 12 4 US - Chicago
In [15]: df['col_a_b'].nunique()
Out[15]: 6
Is there a better way, without creating an additional column?
By using ngroups
df.groupby(['col_a', 'col_b']).ngroups
Out[101]: 6
Or using set
len(set(zip(df['col_a'],df['col_b'])))
Out[106]: 6
You can select col_a and col_b, drop the duplicates, then check the shape/len of the result data frame:
df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 6
len(df[['col_a', 'col_b']].drop_duplicates())
# 6
Because groupby ignores NaNs and may unnecessarily invoke a sorting process, choose which method to use accordingly if you have NaNs in the columns.
Consider a data frame like the following:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_a': [1, 2, 2, np.nan, 1, 4],
    'col_b': [2, 2, 3, np.nan, 2, np.nan]
})
print(df)
# col_a col_b
#0 1.0 2.0
#1 2.0 2.0
#2 2.0 3.0
#3 NaN NaN
#4 1.0 2.0
#5 4.0 NaN
Timing:
df = pd.concat([df] * 1000)
%timeit df.groupby(['col_a', 'col_b']).ngroups
# 1000 loops, best of 3: 625 µs per loop
%timeit len(df[['col_a', 'col_b']].drop_duplicates())
# 1000 loops, best of 3: 1.02 ms per loop
%timeit df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 1000 loops, best of 3: 1.01 ms per loop
%timeit len(set(zip(df['col_a'],df['col_b'])))
# 10 loops, best of 3: 56 ms per loop
%timeit len(df.groupby(['col_a', 'col_b']))
# 1 loop, best of 3: 260 ms per loop
Result:
df.groupby(['col_a', 'col_b']).ngroups
# 3
len(df[['col_a', 'col_b']].drop_duplicates())
# 5
df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 5
len(set(zip(df['col_a'],df['col_b'])))
# 2003
len(df.groupby(['col_a', 'col_b']))
# 2003
So the difference:
Option 1:
df.groupby(['col_a', 'col_b']).ngroups
is fast, and it excludes rows that contain NaNs.
Option 2 & 3:
len(df[['col_a', 'col_b']].drop_duplicates())
df[['col_a', 'col_b']].drop_duplicates().shape[0]
Reasonably fast; NaN is kept as a value, so rows containing NaN are counted (identical NaN rows are still deduplicated).
Option 4 & 5:
len(set(zip(df['col_a'],df['col_b'])))
len(df.groupby(['col_a', 'col_b']))
Slow, and they follow the logic that numpy.nan == numpy.nan is False, so (nan, nan) rows from different positions are counted as distinct.
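If you want the fast ngroups route but with NaN combinations counted the way drop_duplicates counts them, newer pandas versions (1.1+) accept a dropna flag on groupby. A small sketch, assuming such a version is available:
df.groupby(['col_a', 'col_b'], dropna=False).ngroups
# 5 -- the (NaN, NaN) and (4.0, NaN) combinations are now kept as groups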
In [105]: len(df.groupby(['col_a', 'col_b']))
Out[105]: 6
import pandas as pd
data = {'field1':[1,4,1,68,9],'field2':[1,1,4,5,9]}
df = pd.DataFrame(data)
results = df.groupby('field1')['field2'].nunique()
results
Output:
field1
1 2
4 1
9 1
68 1
Name: field2, dtype: int64
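Applied back to the original country/city frame, summing the per-group uniques gives the same total as the other answers. A small sketch:
df.groupby('col_a')['col_b'].nunique()
# col_a
# France    1
# UK        2
# US        3
# Name: col_b, dtype: int64
df.groupby('col_a')['col_b'].nunique().sum()
# 6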
Try this: it subtracts the number of duplicated rows from the total number of rows in df. This assumes we are counting combinations across all the grouping columns in the df.
df.shape[0] - df[['col_a','col_b']].duplicated().sum()
774 µs ± 603 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Test data:
In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{'AAA' : [4,5,6,7,9,10],
'BBB' : [10,20,30,40,11,10],
'CCC' : [100,50,25,10,10,11]});
In [2]:df
Out[2]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 25
3 7 40 10
4 9 11 10
5 10 10 11
In [3]: thresh = 2
df['aligned'] = np.where(df.AAA == df.BBB,max(df.AAA)|(df.BBB),np.nan)
The np.where statement above provides max(df.AAA or df.BBB) when df.AAA and df.BBB are exactly aligned. I would like to have the max when the columns are within thresh of each other, and also to consider all columns. It does not have to be via np.where. Can you please show me ways of approaching this?
So for row 5 it should be 11.0 in df.aligned as this is the max value and within thresh of df.AAA and df.BBB.
Ultimately I am looking for ways to find levels across multiple columns where the values are closely aligned.
Current Output with my code:
df
AAA BBB CCC aligned
0 4 10 100 NaN
1 5 20 50 NaN
2 6 30 25 NaN
3 7 40 10 NaN
4 9 11 10 NaN
5 10 10 11 10.0
Desired Output:
df
AAA BBB CCC aligned
0 4 10 100 NaN
1 5 20 50 NaN
2 6 30 25 NaN
3 7 40 10 NaN
4 9 11 10 11.0
5 10 10 11 11.0
The desired output shows rows 4 and 5 with values in df.aligned, as these rows have values within thresh of each other (the values 10 and 11 are within the range specified by the thresh variable).
"Within thresh distance" to me means that the difference between the max
and the min of a row should be less than thresh. We can use DataFrame.apply with parameter axis=1 so that we apply the lambda function on each row.
In [1]: filt_thresh = df.apply(lambda x: (x.max() - x.min())<thresh, axis=1)
100 loops, best of 3: 1.89 ms per loop
Alternatively, there's a faster solution as pointed out below by @root:
filt_thresh = np.ptp(df.values, axis=1) < thresh
10000 loops, best of 3: 48.9 µs per loop
Or, staying with pandas:
filt_thresh = df.max(axis=1) - df.min(axis=1) < thresh
1000 loops, best of 3: 943 µs per loop
We can now use boolean indexing and calculate the max of each row that matches (hence the axis=1 parameter in max() again):
In [2]: df.loc[filt_thresh, 'aligned'] = df[filt_thresh].max(axis=1)
In [3]: df
Out[3]:
AAA BBB CCC aligned
0 4 10 100 NaN
1 5 20 50 NaN
2 6 30 25 NaN
3 7 40 10 NaN
4 9 11 10 NaN
5 10 10 11 11.0
Update:
If you wanted to calculate the minimum distance between elements for each row, that would be equivalent to sorting the values of each row (np.sort()), calculating the differences between consecutive numbers (np.diff()), taking the min of the resulting array, and finally comparing it to thresh.
Here's the apply way, which has the advantage of being a bit clearer to understand:
filt_thresh = df.apply(lambda row: np.min(np.diff(np.sort(row))) < thresh, axis=1)
1000 loops, best of 3: 713 µs per loop
And here's the vectorized equivalent:
filt_thresh = np.diff(np.sort(df)).min(axis=1) < thresh
The slowest run took 4.31 times longer than the fastest.
This could mean that an intermediate result is being cached.
10000 loops, best of 3: 67.3 µs per loop
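Putting the update together with the boolean-indexing assignment from earlier gives the desired output. A sketch, restricting the row-wise sort to the three value columns so the aligned column itself is not included:
vals = df[['AAA', 'BBB', 'CCC']]
filt_thresh = np.diff(np.sort(vals)).min(axis=1) < thresh
df.loc[filt_thresh, 'aligned'] = vals[filt_thresh].max(axis=1)
df
#    AAA  BBB  CCC  aligned
# 0    4   10  100      NaN
# 1    5   20   50      NaN
# 2    6   30   25      NaN
# 3    7   40   10      NaN
# 4    9   11   10     11.0
# 5   10   10   11     11.0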
I have a pandas dataframe like the one below, containing a person ID, a characteristic, and a count. It is currently in deep/long format.
Person Id Characteristics Count
123 Apple 2
123 Banana 4
124 Pineaple 1
125 Apple 2
I want to efficiently convert this into a wide format and create a matrix which needs to be fed into an algorithm for reducing components.
It should look something like below
Person Id Apple Banana Pineapple
123 2 4 0
124 0 0 1
125 2 0 0
I am looking for an efficient way of doing this. There are currently 2000+ characteristics, so there will be about 2000 or more columns, and about 300K person IDs.
As you can see, if a characteristic is not present for a person, we need to fill it with zero. My approach seems to be clogging up a lot of memory and I was getting memory errors.
I am confused as to how to implement this in an efficient way.
You can use pivot_table with reset_index and rename_axis (new in pandas 0.18.0), but pivoting needs a lot of memory:
print(df.pivot_table(index='Person Id',
                     columns='Characteristics',
                     values='Count',
                     fill_value=0).reset_index().rename_axis(None, axis=1))
Person Id Apple Banana Pineaple
0 123 2 4 0
1 124 0 0 1
2 125 2 0 0
Possibly faster is:
print(df.pivot(index='Person Id',
               columns='Characteristics',
               values='Count').fillna(0).reset_index().rename_axis(None, axis=1))
Person Id Apple Banana Pineaple
0 123 2.0 4.0 0.0
1 124 0.0 0.0 1.0
2 125 2.0 0.0 0.0
Timings:
In [69]: %timeit df.pivot_table(index='Person Id', columns='Characteristics', values='Count', fill_value=0).reset_index().rename_axis(None, axis=1)
100 loops, best of 3: 5.26 ms per loop
In [70]: %timeit df.pivot(index='Person Id', columns='Characteristics', values='Count').fillna(0).reset_index().rename_axis(None, axis=1)
1000 loops, best of 3: 1.87 ms per loop
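If the full 300K x 2000 dense matrix still blows up memory, a different route, not covered in the answers above, is to build a sparse matrix directly from the long data. A rough sketch, assuming scipy is installed and the downstream algorithm accepts scipy sparse input:
import pandas as pd
from scipy import sparse

# df is the long-format frame from the question.
# Encode persons and characteristics as integer codes.
persons = pd.Categorical(df['Person Id'])
chars = pd.Categorical(df['Characteristics'])

# rows = person codes, cols = characteristic codes, data = counts;
# duplicate (person, characteristic) pairs are summed on conversion.
mat = sparse.coo_matrix(
    (df['Count'].to_numpy(),
     (persons.codes, chars.codes)),
    shape=(len(persons.categories), len(chars.categories))
).tocsr()

# persons.categories / chars.categories hold the row / column labels.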
I'm having some trouble filling NaNs. I want to take a dataframe column with a few NaNs and fill them with a value derived from a 'lookup table' based on a value from another column.
(You might recognize my data from the Titanic data set)...
Pclass Age
0 1 33
1 3 24
2 1 23
3 2 NaN
4 1 Nan
I want to fill the NaN with a value from series 'pclass_lookup':
pclass_lookup
1 38.1
2 29.4
3 25.2
I have tried doing fillna with indexing like:
df.Age.fillna(pclass_lookup[df.Pclass]), but it gives me an error of
ValueError: cannot reindex from a duplicate axis
I also tried a lambda:
df.Age.map(lambda x: x if x else pclass_lookup[df.Pclass])
but that does not seem to fill it right either. Am I totally missing the boat here?
Firstly, you have a duff value in row 4: it is actually the string 'Nan', which is not the same as NaN, so even if your code did work this value would never be replaced.
So you need to replace that duff value and then you can just call map to perform the lookup on the NaN values:
In [317]:
df.Age.replace('Nan', np.NaN, inplace=True)
df.loc[df['Age'].isnull(),'Age'] = df['Pclass'].map(df1.pclass_lookup)
df
Out[317]:
Pclass Age
0 1 33
1 3 24
2 1 23
3 2 29.4
4 1 38.1
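A closely related one-liner, for the record: once the 'Nan' string has been replaced and the column holds real NaNs, fillna accepts an index-aligned Series, so the map result can be passed to it directly. A sketch, using the pclass_lookup Series from the question:
# Fill only the missing ages from the per-class lookup.
df['Age'] = df['Age'].fillna(df['Pclass'].map(pclass_lookup))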
Timings
For a df with 5000 rows:
In [26]:
%timeit df.loc[df['Age'].isnull(),'Age'] = df['Pclass'].map(df1.pclass_lookup)
100 loops, best of 3: 2.41 ms per loop
In [27]:
%%timeit
def remove_na(x):
if pd.isnull(x['Age']):
return df1[x['Pclass']]
else:
return x['Age']
df['Age'] = df.apply(remove_na, axis=1)
1 loops, best of 3: 278 ms per loop
In [28]:
%%timeit
nulls = df.loc[df.Age.isnull(), 'Pclass']
df.loc[df.Age.isnull(), 'Age'] = df1.loc[nulls].values
100 loops, best of 3: 3.37 ms per loop
So you can see here that apply, as it iterates row-wise, scales poorly compared to the other two methods, which are vectorised; map is still the fastest.
Building on the response of @vrajs5:
# Create dummy data
df = pd.DataFrame()
df['Pclass'] = [1,3,1,2,1]
df['Age'] = [33,24,23,None, None]
pclass_lookup = pd.Series([38.1,29.4,25.2], index = range(1,4))
# Solution:
nulls = df.loc[df.Age.isnull(), 'Pclass']
df.loc[df.Age.isnull(), 'Age'] = pclass_lookup.loc[nulls].values
>>> df
Pclass Age
0 1 33.0
1 3 24.0
2 1 23.0
3 2 29.4
4 1 38.1
The following should work for you:
df = pd.DataFrame()
df['Pclass'] = [1,3,1,2,1]
df['Age'] = [33,24,23,None, None]
df
Pclass Age
0 1 33
1 3 24
2 1 23
3 2 NaN
4 1 NaN
pclass_lookup = pd.Series([38.1,29.4,25.2], index = range(1,4))
pclass_lookup
1 38.1
2 29.4
3 25.2
dtype: float64
def remove_na(x):
if pd.isnull(x['Age']):
return pclass_lookup[x['Pclass']]
else:
return x['Age']
df['Age'] = df.apply(remove_na, axis=1)
Pclass Age
0 1 33.0
1 3 24.0
2 1 23.0
3 2 29.4
4 1 38.1