Pandas np.where with matching range of values on a row - python

Test data:
In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {'AAA': [4, 5, 6, 7, 9, 10],
     'BBB': [10, 20, 30, 40, 11, 10],
     'CCC': [100, 50, 25, 10, 10, 11]})
In [2]: df
Out[2]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 25
3 7 40 10
4 9 11 10
5 10 10 11
In [3]: thresh = 2
df['aligned'] = np.where(df.AAA == df.BBB,max(df.AAA)|(df.BBB),np.nan)
The np.where statement above gives the max of df.AAA and df.BBB when they are exactly aligned. I would like to get the max when the columns are within thresh of each other, and I would also like to consider all columns, not just AAA and BBB. It does not have to be via np.where. Can you please show me ways of approaching this?
So for row 5, df.aligned should be 11.0, as this is the max value and it is within thresh of both df.AAA and df.BBB.
Ultimately I am looking for ways to find levels across multiple columns where the values are closely aligned.
Current Output with my code:
df
AAA BBB CCC aligned
0 4 10 100 NaN
1 5 20 50 NaN
2 6 30 25 NaN
3 7 40 10 NaN
4 9 11 10 NaN
5 10 10 11 10.0
Desired Output:
df
AAA BBB CCC aligned
0 4 10 100 NaN
1 5 20 50 NaN
2 6 30 25 NaN
3 7 40 10 NaN
4 9 11 10 11.0
5 10 10 11 11.0
The desired output shows values in df.aligned for rows 4 and 5, as these rows have values within thresh of each other (10 and 11 fall within the range specified by the thresh variable).

"Within thresh distance" to me means that the difference between the max
and the min of a row should be less than thresh. We can use DataFrame.apply with parameter axis=1 so that we apply the lambda function on each row.
In [1]: filt_thresh = df.apply(lambda x: (x.max() - x.min())<thresh, axis=1)
100 loops, best of 3: 1.89 ms per loop
Alternatively, there's a faster solution, as pointed out by @root:
filt_thresh = np.ptp(df.values, axis=1) < thresh
10000 loops, best of 3: 48.9 µs per loop
Or, staying with pandas:
filt_thresh = df.max(axis=1) - df.min(axis=1) < thresh
1000 loops, best of 3: 943 µs per loop
We can now use boolean indexing and calculate the max of each matching row (hence the axis=1 parameter in max() again):
In [2]: df.loc[filt_thresh, 'aligned'] = df[filt_thresh].max(axis=1)
In [3]: df
Out[3]:
AAA BBB CCC aligned
0 4 10 100 NaN
1 5 20 50 NaN
2 6 30 25 NaN
3 7 40 10 NaN
4 9 11 10 NaN
5 10 10 11 11.0
Update:
If you wanted to calculate the minimum distance between elements within each row, that'd be equivalent to sorting the values (np.sort()), calculating the differences between consecutive numbers (np.diff()), taking the min of the resulting array, and finally comparing that to thresh.
Here's the apply way, which has the advantage of being a bit clearer:
filt_thresh = df.apply(lambda row: np.min(np.diff(np.sort(row))) < thresh, axis=1)
1000 loops, best of 3: 713 µs per loop
And here's the vectorized equivalent:
filt_thresh = np.diff(np.sort(df)).min(axis=1) < thresh
The slowest run took 4.31 times longer than the fastest.
This could mean that an intermediate result is being cached.
10000 loops, best of 3: 67.3 µs per loop
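Putting the pieces together, here is a minimal sketch (my own summary, using the vectorized min-consecutive-difference filter above) that reproduces the desired output:
import numpy as np
import pandas as pd

df = pd.DataFrame({'AAA': [4, 5, 6, 7, 9, 10],
                   'BBB': [10, 20, 30, 40, 11, 10],
                   'CCC': [100, 50, 25, 10, 10, 11]})
thresh = 2

# Rows where the smallest gap between any two (sorted) values is below thresh
filt_thresh = np.diff(np.sort(df.values)).min(axis=1) < thresh

# Assign the row maximum only for the matching rows; other rows stay NaN
df['aligned'] = np.where(filt_thresh, df.max(axis=1), np.nan)
This marks rows 4 and 5 with 11.0, matching the desired output above.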

Calculating tolerance

I am working with one data set. Data contains values with different decimal places. Data and code you can see below :
data = {
    'value': [9.1, 10.5, 11.8,
              20.1, 21.2, 22.8,
              9.5, 10.3, 11.9]
}
df = pd.DataFrame(data, columns=['value'])
Which gives the following dataframe:
value
0 9.1
1 10.5
2 11.8
3 20.1
4 21.2
5 22.8
6 9.5
7 10.3
8 11.9
Now I want to add a new column with the title adjusted. I want to calculate this column with the numpy.isclose function using a tolerance of 2 (plus or minus 2). At the end I expect the results shown in the next table:
value adjusted
0 9.1 10
1 10.5 10
2 11.8 10
3 20.1 21
4 21.2 21
5 22.8 21
6 9.5 10
7 10.3 10
8 11.9 10
I tried this line, but I only get True/False results, and it only handles one value (10), not all of them:
np.isclose(df['value'], 10, atol=2)
So can anybody help me solve this problem and calculate the tolerance for the values 10 and 21 in one line?
The exact logic and how this would generalize are not fully clear. Below are two options.
Assuming you want to test your values against a list of defined references, you can use the underlying numpy array and broadcasting:
vals = np.array([10, 21])
a = df['value'].to_numpy()
m = np.isclose(a[:, None], vals, atol=2)
df['adjusted'] = np.where(m.any(1), vals[m.argmax(1)], np.nan)
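To see what the broadcasting does, here is a small illustrative check (my addition, not part of the answer): a[:, None] has shape (9, 1) and vals has shape (2,), so m gets one row per value and one column per reference, and argmax picks the first matching reference per row.
# Intermediate mask for the first four values (assumes df and vals as above)
m = np.isclose(df['value'].to_numpy()[:, None], np.array([10, 21]), atol=2)
print(m[:4])
# [[ True False]   -> 9.1, 10.5 and 11.8 are within 2 of 10
#  [ True False]
#  [ True False]
#  [False  True]]  -> 20.1 is within 2 of 21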
Assuming you want to group successive values, you can get the diff and start a new group when the difference is above threshold. Then round and get the median per group with groupby.transform:
group = df['value'].diff().abs().gt(2).cumsum()
df['adjusted'] = df['value'].round().groupby(group).transform('median')
Output:
value adjusted
0 9.1 10.0
1 10.5 10.0
2 11.8 10.0
3 20.1 21.0
4 21.2 21.0
5 22.8 21.0
6 9.5 10.0
7 10.3 10.0
8 11.9 10.0
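For the second (grouping) option, it can help to look at the intermediate group labels; here is a quick check of what the cumsum produces on this data (my own addition):
# A new group starts whenever the jump between consecutive values exceeds 2
group = df['value'].diff().abs().gt(2).cumsum()
print(group.tolist())   # [0, 0, 0, 1, 1, 1, 2, 2, 2]
Rounding and taking the median within each of those three groups then yields 10, 21 and 10 respectively.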

Can dtype be used for assignment in order to set -99.0's for numeric and X's for text?

I have 20+ columns of data. There has to be a non-manual way to use the data type to fill in blanks with -99.0 for numeric columns (the software I use recognizes -99.0 as numeric missing) and X for text columns (the software I use recognizes X as text missing). I searched and only found manual approaches that list out all the column names. That would work if the column names never changed, but from project to project I won't always have the same columns or column names, so I'm trying to automate this. Here's a small example:
ID  Project  From  To   Value1  Value2
1   AAA      0     10   15      0.578
1   AAA      10    20   7.6
2            0     100  14      0.777
2            100   200  6.5
1   ABA      0     5    22.7    0.431
1   BBB      15    20   0.8     17.4
2            0     10           1.200
2   BBB      10    20   6.9     200.8
I know I can just do this but it only does numeric:
result.fillna(0, inplace=True)
I could also try this, but with -99.0 instead of 0:
dataframe[list_of_columns].replace(r'\s+', 0, regex=True)
But that is very manual, and I want this automated since I have a lot of projects and am looking to save time; it also only handles numeric columns, not text.
I also found this one, but I can't convert text blanks to "X" with it. I assume the solution would be something similar, where I save the list_of_columns and then use a for loop?
def recode_empty_cells(dataframe, list_of_columns):
    for column in list_of_columns:
        dataframe[column] = dataframe[column].replace(r'\s+', np.nan, regex=True)
        dataframe[column] = dataframe[column].fillna(0)
    return dataframe
In the end I want it to look like this:
ID  Project  From  To   Value1  Value2
1   AAA      0     10   15      0.578
1   AAA      10    20   7.6     -99.0
2   X        0     100  14      0.777
2   X        100   200  6.5     -99.0
1   ABA      0     5    22.7    0.431
1   BBB      15    20   0.8     17.4
2   X        0     10   -99.0   1.200
2   BBB      10    20   6.9     200.8
Thanks in advance!
If your columns have the correct dtypes then you can use DataFrame.select_dtypes. Select the numeric types and fill with -99 and then exclude the numeric types and fill with X. Then join the results back and reindex (if you care about column ordering).
import pandas as pd
import numpy as np
df = (pd.concat([df.select_dtypes(include=np.number).fillna(-99),
                 df.select_dtypes(exclude=np.number).fillna('X')], axis=1)
        .reindex(df.columns, axis=1))
ID Project From To Value1 Value2
0 1 AAA 0 10 15.0 0.578
1 1 AAA 10 20 7.6 -99.000
2 2 X 0 100 14.0 0.777
3 2 X 100 200 6.5 -99.000
4 1 ABA 0 5 22.7 0.431
5 1 BBB 15 20 0.8 17.400
6 2 X 0 10 -99.0 1.200
7 2 BBB 10 20 6.9 200.800
Another valid option is to use select_dtypes just to get the column labels and then fill each set of columns directly. Because select_dtypes returns a slice of the DataFrame, it becomes slow for larger DataFrames; since a column always has a single dtype and we only care about the labels, calling it on .head(1) is enough.
num_cols = df.head(1).select_dtypes(include=np.number).columns
oth_cols = df.head(1).select_dtypes(exclude=np.number).columns
df[num_cols] = df[num_cols].fillna(-99)
df[oth_cols] = df[oth_cols].fillna('X')
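Since this comes up across many projects, a small helper that wraps both the whitespace-to-NaN step from the question and the dtype-based fill might be convenient. This is only a sketch, assuming text columns are stored with object/string dtype; the function name and the anchored regex are my own choices (result is the DataFrame variable from the question):
import numpy as np
import pandas as pd

def fill_missing_by_dtype(df, num_fill=-99.0, text_fill='X'):
    """Fill missing values by dtype: num_fill for numeric columns, text_fill for the rest."""
    out = df.copy()
    # Treat whitespace-only cells in non-numeric columns as missing first
    txt_cols = out.select_dtypes(exclude=np.number).columns
    out[txt_cols] = out[txt_cols].replace(r'^\s*$', np.nan, regex=True)
    num_cols = out.select_dtypes(include=np.number).columns
    out[num_cols] = out[num_cols].fillna(num_fill)
    out[txt_cols] = out[txt_cols].fillna(text_fill)
    return out

result = fill_missing_by_dtype(result)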

Remove Rows Where the Person Has Not Changed Locations

I am trying to go through my dataframe two rows at a time, checking whether a column value is the same in both rows and removing such rows. My dataframe tracks the locations of different people during different encounters.
I have a dataframe, called transfers, in which each row consists of an ID number for a person, an encounter number, and a location. The transfers dataframe was created by running duplicated() on my original dataframe to find rows with the same person ID and grouping them together.
For example, we would want to get rid of the rows with ID = 2 in the dataframe below because the location was "D" in both encounters, so this person has not moved.
However, we would want to keep the rows with ID = 3 because that person moved from "A" to "F".
Another issue arises because some people have more than two rows, for example where ID = 1. For this person, we would want to keep their rows because they have moved from "A" -> "B" and then from "B" -> "C". However, if you only compare the encounters 12 and 13, it does not look like this person has changed locations.
Example dataframe df:
ID Encounter Location
1 11 A
1 12 B
1 13 B
1 14 C
2 21 D
2 22 D
3 31 A
3 32 F
Expected output:
ID Encounter Location
1 11 A
1 12 B
1 13 B
1 14 C
3 31 A
3 32 F
I have tried nested for loops using .iterrows(), but this did not work: it was terribly slow and did not properly handle cases where a person had more than two encounters. I have also tried applying a function to my dataframe, but the runtime was nearly the same as crude looping.
EDIT: I should have stated this explicitly: I am trying to keep the data of any person who has moved locations, even if they end up back where they started.
Given
>>> df
ID Encounter Location
0 1 11 A
1 1 12 B
2 1 13 B
3 1 14 C
4 2 21 D
5 2 22 D
6 3 31 A
7 3 32 F
you can filter your dataframe via
>>> places = df.groupby('ID')['Location'].transform('nunique')
>>> df[places > 1]
ID Encounter Location
0 1 11 A
1 1 12 B
2 1 13 B
3 1 14 C
6 3 31 A
7 3 32 F
The idea is to count the number of unique places per group (ID) and then drop the rows where a person has only been to one place.
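For reference, the groupby.filter version used in the comparison below expresses the same idea, though it is slower because the lambda runs once per group:
# Equivalent filter-based approach (keeps groups with more than one distinct location)
df.groupby('ID').filter(lambda g: g['Location'].nunique() > 1)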
Comparison versus the filter solution:
# setup
>>> df = pd.concat([df.assign(ID=df['ID'] + i) for i in range(1000)], ignore_index=True)
>>> df
ID Encounter Location
0 1 11 A
1 1 12 B
2 1 13 B
3 1 14 C
4 2 21 D
... ... ... ...
7995 1000 14 C
7996 1001 21 D
7997 1001 22 D
7998 1002 31 A
7999 1002 32 F
[8000 rows x 3 columns]
# timings @ i5-6200U CPU @ 2.30GHz
>>> %timeit df.groupby('ID').filter(lambda x: x['Location'].nunique() > 1)
356 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit df[df.groupby('ID')['Location'].transform('nunique') > 1]
5.56 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
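Regarding the edit in the question: since nunique() counts distinct locations, a person who leaves and later returns to their starting location still has more than one distinct location and is therefore kept. A quick check with a made-up extra person (my own example):
extra = pd.DataFrame({'ID': [4, 4, 4],
                      'Encounter': [41, 42, 43],
                      'Location': ['A', 'B', 'A']})
# nunique per row is 2, so every row of this person survives the filter
print(extra.groupby('ID')['Location'].transform('nunique').gt(1).all())  # True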

Pandas fillna with a lookup table

Having some trouble with filling NaNs. I want to take a dataframe column with a few NaNs and fill them with a value derived from a 'lookup table' based on a value from another column.
(You might recognize my data from the Titanic data set)...
Pclass Age
0 1 33
1 3 24
2 1 23
3 2 NaN
4 1 Nan
I want to fill the NaN with a value from series 'pclass_lookup':
pclass_lookup
1 38.1
2 29.4
3 25.2
I have tried doing fillna with indexing like:
df.Age.fillna(pclass_lookup[df.Pclass]), but it gives me an error of
ValueError: cannot reindex from a duplicate axis
Lambdas were a try too:
df.Age.map(lambda x: x if x else pclass_lookup[df.Pclass])
but that doesn't seem to fill it right either. Am I totally missing the boat here?
Firstly, you have a duff value in row 4: it is in fact the string 'Nan', which is not the same as NaN, so even if your code did work that value would never be replaced.
So you need to replace that duff value, and then you can just call map to perform the lookup for the NaN values:
In [317]:
df.Age.replace('Nan', np.nan, inplace=True)
df.loc[df['Age'].isnull(), 'Age'] = df['Pclass'].map(pclass_lookup)
df
Out[317]:
Pclass Age
0 1 33
1 3 24
2 1 23
3 2 29.4
4 1 38.1
Timings
For a df with 5000 rows:
In [26]:
%timeit df.loc[df['Age'].isnull(), 'Age'] = df['Pclass'].map(pclass_lookup)
100 loops, best of 3: 2.41 ms per loop
In [27]:
%%timeit
def remove_na(x):
    if pd.isnull(x['Age']):
        return pclass_lookup[x['Pclass']]
    else:
        return x['Age']
df['Age'] = df.apply(remove_na, axis=1)
1 loops, best of 3: 278 ms per loop
In [28]:
%%timeit
nulls = df.loc[df.Age.isnull(), 'Pclass']
df.loc[df.Age.isnull(), 'Age'] = pclass_lookup.loc[nulls].values
100 loops, best of 3: 3.37 ms per loop
So you can see that apply, because it iterates row-wise, scales poorly compared to the other two vectorised methods, and map is still the fastest.
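As an aside, a more compact equivalent (my addition, assuming the 'Nan' string has already been replaced with a real NaN) is to pass the mapped Series straight to fillna, which aligns on the index:
# Fill only the missing ages with the class average from the lookup Series
df['Age'] = df['Age'].fillna(df['Pclass'].map(pclass_lookup))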
Building on the response of @vrajs5:
# Create dummy data
df = pd.DataFrame()
df['Pclass'] = [1,3,1,2,1]
df['Age'] = [33,24,23,None, None]
pclass_lookup = pd.Series([38.1,29.4,25.2], index = range(1,4))
# Solution:
nulls = df.loc[df.Age.isnull(), 'Pclass']
df.loc[df.Age.isnull(), 'Age'] = pclass_lookup.loc[nulls].values
>>> df
Pclass Age
0 1 33.0
1 3 24.0
2 1 23.0
3 2 29.4
4 1 38.1
The following should work for you:
df = pd.DataFrame()
df['Pclass'] = [1,3,1,2,1]
df['Age'] = [33,24,23,None, None]
df
Pclass Age
0 1 33
1 3 24
2 1 23
3 2 NaN
4 1 NaN
pclass_lookup = pd.Series([38.1,29.4,25.2], index = range(1,4))
pclass_lookup
1 38.1
2 29.4
3 25.2
dtype: float64
def remove_na(x):
    if pd.isnull(x['Age']):
        return pclass_lookup[x['Pclass']]
    else:
        return x['Age']
df['Age'] = df.apply(remove_na, axis=1)
Pclass Age
0 1 33.0
1 3 24.0
2 1 23.0
3 2 29.4
4 1 38.1

Python Pandas: remove entries based on the number of occurrences

I'm trying to remove entries from a data frame which occur less than 100 times.
The data frame data looks like this:
pid tag
1 23
1 45
1 62
2 24
2 45
3 34
3 25
3 62
Now I count the number of tag occurrences like this:
bytag = data.groupby('tag').aggregate(np.count_nonzero)
But then I can't figure out how to remove those entries which have low count...
New in 0.12, groupby objects have a filter method, allowing you to do these types of operations:
In [11]: g = data.groupby('tag')
In [12]: g.filter(lambda x: len(x) > 1) # pandas 0.13.1
Out[12]:
pid tag
1 1 45
2 1 62
4 2 45
7 3 62
The function (the first argument of filter) is applied to each group (sub-frame), and the result includes the rows of the original DataFrame belonging to groups that evaluated to True.
Note: in 0.12 the ordering differs from the original DataFrame; this was fixed in 0.13+:
In [21]: g.filter(lambda x: len(x) > 1) # pandas 0.12
Out[21]:
pid tag
1 1 45
4 2 45
2 1 62
7 3 62
Edit: Thanks to @WesMcKinney for showing this much more direct way:
data[data.groupby('tag').pid.transform(len) > 1]
import pandas
import numpy as np
data = pandas.DataFrame(
    {'pid': [1, 1, 1, 2, 2, 3, 3, 3],
     'tag': [23, 45, 62, 24, 45, 34, 25, 62]})
bytag = data.groupby('tag').aggregate(np.count_nonzero)
tags = bytag[bytag.pid >= 2].index
print(data[data['tag'].isin(tags)])
yields
pid tag
1 1 45
2 1 62
4 2 45
7 3 62
Here are some run times for a couple of the solutions posted here, along with one that was not posted (using value_counts()) and is much faster than the others:
Create the data:
import pandas as pd
import numpy as np
# Generate some 'users'
np.random.seed(42)
df = pd.DataFrame({'uid': np.random.randint(0, 500, 500)})
# Prove that some entries are 1
print("{:,} users only occur once in dataset".format(sum(df.uid.value_counts() == 1)))
Output:
171 users only occur once in dataset
Time a few different ways of removing users with only one entry. These were run in separate cells in a Jupyter Notebook:
%%timeit
df.groupby(by='uid').filter(lambda x: len(x) > 1)
%%timeit
df[df.groupby('uid').uid.transform(len) > 1]
%%timeit
vc = df.uid.value_counts()
df[df.uid.isin(vc.index[vc.values > 1])].uid.value_counts()
These gave the following outputs:
10 loops, best of 3: 46.2 ms per loop
10 loops, best of 3: 30.1 ms per loop
1000 loops, best of 3: 1.27 ms per loop
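A modern pandas equivalent (my addition, not one of the timed answers) is to use transform('size'), which stays vectorised much like the value_counts approach:
# Keep only uids that appear more than once
df[df.groupby('uid')['uid'].transform('size') > 1]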
df = pd.DataFrame([(1, 2), (1, 3), (1, 4), (2, 1),(2,2,)], columns=['col1', 'col2'])
In [36]: df
Out[36]:
col1 col2
0 1 2
1 1 3
2 1 4
3 2 1
4 2 2
gp = df.groupby('col1').aggregate(np.count_nonzero)
In [38]: gp
Out[38]:
col2
col1
1 3
2 2
Let's get the groups where the count is greater than 2:
tf = gp[gp.col2 > 2].reset_index()
df[df.col1.isin(tf.col1)]
Out[41]:
col1 col2
0 1 2
1 1 3
2 1 4
