Search for a value within a range in a pandas dataframe?

I am attempting to search for matching values within a given uncertainty in a pandas dataframe. For instance, if I have a dataframe:
A B C
0 12 12.6 111.20
1 14 23.4 112.20
2 16 45.6 112.30
3 18 56.6 112.40
4 27 34.5 121.60
5 29 65.2 223.23
6 34 45.5 654.50
7 44 65.6 343.50
How can I search for a value that matches 112.6 +/- 0.4 without having to build long-winded criteria like:
TargetVal_Max = 112.6 + 0.4
TargetVal_Min = 112.6 - 0.4
Basically, I want to create a "buffer window" so that all values falling inside the window are returned. I have the uncertainties package, but have yet to get it working for this.
Ideally, I'd like to be able to return all index values that match a value in both C and B within a given error range.
Edit
As pointed out by @MaxU, the np.isclose function works very well if you know the exact number. But is it possible to match against a list of values, so that if I had a second dataframe I could check whether the values in C from one match the values in C of the second dataframe within a tolerance? I have attempted to put them into a list and do it that way, but I run into problems when trying to do it for more than a single value at a time.
TEST = Dataframe_2["C"]
HopesNdreams = sample[sample["C"].apply(np.isclose, b=TEST, atol=1.0)]
Edit 2
I found, after trying a couple of different workarounds, that I can just do:
TEST1 = Dataframe_2["C"].tolist()
for i in TEST1:
    HopesNdreams = sample[sample["C"].apply(np.isclose, b=i, atol=1.0)]
And this returns the hits for the given column. Using the logic set forth in the first answer, I think this will work very well for what I need. Are there any hangups that I don't see with this method?
Cheers and thanks for the help!

IIUC you can use the np.isclose() function:
In [180]: df[['B','C']].apply(np.isclose, b=112.6, atol=0.4)
Out[180]:
B C
0 False False
1 False True
2 False True
3 False True
4 False False
5 False False
6 False False
7 False False
In [181]: df[['B','C']].apply(np.isclose, b=112.6, atol=0.4).any(axis=1)
Out[181]:
0 False
1 True
2 True
3 True
4 False
5 False
6 False
7 False
dtype: bool
In [182]: df[df[['B','C']].apply(np.isclose, b=112.6, atol=0.4).any(axis=1)]
Out[182]:
A B C
1 14 23.4 112.2
2 16 45.6 112.3
3 18 56.6 112.4

Use Series.between() (note the argument order is (lower, upper)):
df['C'].between(112.6 - .4, 112.6 + .4)
This returns a boolean mask; wrap it in df[...] to get the matching rows.
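One hangup with the Edit 2 loop above: HopesNdreams is overwritten on every pass, so only the hits for the last target survive. A broadcast comparison checks all targets at once; here is a minimal sketch, assuming both frames have a C column (the frame names and data are illustrative):
import numpy as np
import pandas as pd

sample = pd.DataFrame({"C": [111.2, 112.2, 112.3, 112.4, 121.6]})
Dataframe_2 = pd.DataFrame({"C": [112.25, 500.0]})

# shapes (n, 1) and (m,) broadcast to an (n, m) boolean matrix
targets = Dataframe_2["C"].to_numpy()
close = np.isclose(sample["C"].to_numpy()[:, None], targets, atol=1.0)

# keep rows of sample that are close to at least one target
hits = sample[close.any(axis=1)]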

Related

I need to create a dataframe where values reference previous rows

I am just starting to use Python and I'm trying to learn some general things about it. As I was playing around, I wanted to see if I could make a dataframe that shows a starting number compounded by a return. Sorry if this description doesn't make much sense, but I basically want a dataframe x rows long that shows me:
number * (1 + return)^(row number) in each row
so, for example, if the number is 10 and the return is 10%, I would like the dataframe to give me the series
1 11
2 12.1
3 13.3
4 14.6
5 ...
6 ...
Thanks so much in advance!
Let us try
import numpy as np
import pandas as pd

val = 10
det = 0.1
n = 4
out = val * ((1 + det) ** np.arange(n))
s = pd.Series(out)
s
Out[426]:
0 10.00
1 11.00
2 12.10
3 13.31
dtype: float64
Notice that here the index starts from 0, since 1.1**0 yields the original value.
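If the row numbers should start at 1 as in the example, one small follow-up tweak is to shift the index afterwards:
s.index = s.index + 1   # rows now numbered 1..n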
I think this does what you want:
import pandas as pd

df = pd.DataFrame({'returns': list(range(1, 10))})
df.index = df.index + 1
df['returns'] = df['returns'].apply(lambda x: 10 * (1.1 ** x))
print(df)
Out:
returns
1 11.000000
2 12.100000
3 13.310000
4 14.641000
5 16.105100
6 17.715610
7 19.487171
8 21.435888
9 23.579477
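A vectorized variant of the same idea, as a sketch (the names here are illustrative):
import numpy as np
import pandas as pd

n = np.arange(1, 10)
df = pd.DataFrame({'returns': 10 * 1.1 ** n}, index=n)
print(df)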

Equivalent of Dataframe "diff" with strings

Dataframes in pandas have some functions to perform computation between different rows, like diff (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.diff.html).
However, this only works with numeric computation (or at least objects that support the - operation).
Is there a way to perform a diff between strings and return a boolean indicating whether consecutive strings are equal?
For example:
>>> s = pd.Series(list("ABCCDEF"))
>>> s.str_diff()
0 NaN
1 False
2 False
3 True
4 False
5 False
6 False
dtype: bool
Thanks to Quang Hoang for pointing out the answer.
You just need to shift the Series or DataFrame by one row and then compare.
>>> s = pd.Series(list("ABBCDDEF"))
# If you search for strings that differ from the previous row
>>> s.ne(s.shift())
# If you search for strings that equal the previous row
>>> s.eq(s.shift())
0 False
1 False
2 True
3 False
4 False
5 True
6 False
7 False
dtype: bool
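The same shift-and-compare works column-wise on a whole DataFrame of strings, e.g. (a small sketch with made-up data):
>>> df = pd.DataFrame({"a": list("ABBC"), "b": list("XXYZ")})
>>> df.eq(df.shift())   # True wherever a value equals the one in the row above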

Isn't taking the mean of a pandas column of boolean values supposed to return the proportion that is True?

So I took the mean of a pandas dataframe column that contains boolean values. I've done this multiple times in the past and understood it to return the proportion that is True. But in this particular instance it didn't work: it returns the proportion that is False, and on top of that, the denominator it uses doesn't seem to relate to anything. I have no idea where it pulls the denominator from to calculate the proportion. I discovered it works the way I want when I remove the second line of code (datadf = datadf[1:]).
# get current row value minus previous row value and returns True if > 0
datadf['increase'] = datadf.index.map(lambda x: datadf.loc[x]['price'] - datadf.loc[x-1]['price'] > 0 if x > 0 else None)
# remove first row because it gives 'None'
datadf = datadf[1:]
# calculate proportion that is True
accretionscore = datadf['increase'].mean()
This is the output
date price increase
1 2020-09-28 488.51 True
2 2020-09-29 489.33 True
3 2020-09-30 490.43 True
4 2020-10-01 499.51 True
5 2020-10-02 478.99 False
correct value: 0.8
value given: 0.2
When I try adding another sample, things get weirder:
date price increase
1 2020-09-27 479.78 False
2 2020-09-28 488.51 True
3 2020-09-29 489.33 True
4 2020-09-30 490.43 True
5 2020-10-01 499.51 True
6 2020-10-02 478.99 False
correct value: 0.6666666666666666
value given: 0.16666666666666666
they don't even add up to 1!
I'm so confused. Can anyone tell me what is going on? How does taking out the second line fix the problem?
Hint: if you want to convert from boolean to int, you can just use:
datadf['increase'] = datadf['increase'].astype(int)
and that way things will work as expected.
If we run your code, you can see that datadf['increase'] is an object column instead of a boolean one, so taking the mean of it does not behave like the mean of a proper boolean column and produces the strange numbers you saw:
import pandas as pd
datadf = pd.DataFrame({'price':[470,488.51,489.33,490.43,499.51,478.99]})
datadf['increase'] = datadf.index.map(lambda x: datadf.loc[x]['price'] - datadf.loc[x-1]['price'] > 0 if x > 0 else None)
datadf['increase']
Out[8]:
0 None
1 True
2 True
3 True
4 True
5 False
Name: increase, dtype: object
datadf['increase'].dtype
dtype('O')
From what I can see, you want True / False on whether each row is larger than the preceding one, so do:
datadf['increase'] = datadf.price > datadf.price.shift(1)
datadf['increase'].dtype
dtype('bool')
And we just omit the first row by doing:
datadf['increase'][1:].mean()
0.8
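An alternative sketch using Series.diff, which yields NaN for the first row automatically (NaN > 0 evaluates to False, so that row should still be sliced off before taking the mean):
increase = datadf['price'].diff() > 0
increase[1:].mean()   # proportion of rows where the price went up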

Select rows where multiple column values are in multiple lists

I want to select values from a dataframe such as:
Vendor_1 Vendor_2 Vendor_3
0 1 0 0
1 0 20 0
2 0 0 300
3 4 0 0
4 0 50 0
5 0 0 500
The values I want to keep from Vendor_1, Vendor_2, and Vendor_3 are each inside a separate list, i.e. v_1, v_2, v_3. For example, say v_1 = [1], v_2 = [20], v_3 = [500], meaning I want only these rows to stay.
I've tried something like:
df = df[(df['Vendor_1'].isin(v_1)) & (df['Vendor_2'].isin(v_2)) & ... ]
This gives me an empty dataframe. Is the problem with the above logic, or is it that no rows satisfy these constraints (highly unlikely in my real dataframe)?
Cheers
EDIT:
OK, so I've realised a fundamental difference between my example and my actual df: if there is a value for Vendor_1, then Vendor_2 and Vendor_3 must be 0, etc. So my logic with the isin chain doesn't make sense, right? I'll update the example df.
So I feel like I need to make 3 subsets and then merge them or something?
isin accepts a dictionary:
d = {
    'Vendor_1': [1],
    'Vendor_2': [20],
    'Vendor_3': [500]
}
df.isin(d)
Output:
Vendor_1 Vendor_2 Vendor_3
0 True False False
1 False True False
2 False False False
3 False False False
4 False False False
5 False False True
And then depending on your logic, you want to check for any or all:
df[df.isin(d).any(axis=1)]
Output:
Vendor_1 Vendor_2 Vendor_3
0 1 0 0
1 0 20 0
5 0 0 500
With all, on the other hand, you would require Vendor_1=1, Vendor_2=20, and Vendor_3=500 to hold on the same row, and only such rows would be kept (none, in this example, since each row has a value for only one vendor).
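As a quick check of the all variant on this data:
df[df.isin(d).all(axis=1)]   # empty here, since no row matches all three lists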
The example you're giving should work unless there are effectively no rows that match that condition.
Those expressions are a bit tricky with the parens so I'd rather split the line in two for easier debugging:
mask = (df['Vendor_1'].isin(v_1)) & (df['Vendor_2'].isin(v_2))
# sanity check that the mask is selecting something
assert mask.any()
df = df[mask]
Note that you must put parentheses around each condition when combining them with &, because of operator precedence rules.
For example (a small sketch; the column names are made up), without the parentheses Python's precedence rules make & bind before ==, and pandas raises a ValueError about an ambiguous truth value:
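import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# OK: each comparison is wrapped in parentheses before combining with &
ok = df[(df['A'] == 1) & (df['B'] == 3)]

# Broken: & binds tighter than ==, so Python reads this as a chained
# comparison involving (1 & df['B']) and raises
# "ValueError: The truth value of a Series is ambiguous..."
# bad = df[df['A'] == 1 & df['B'] == 3]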

Why is this lambda operation not working?

I want to take any values in my dataframe that are shown as 'less than' and report them as numbers half of the less-than value.
e.g. <1 becomes 0.5, <0.5 becomes 0.25, <5 becomes 2.5 etc.
ordinary numbers and text should be unchanged.
I have the following lambda function to apply to my dataframe that I thought was working but it isn't:
df_no_less_thans= df.apply(lambda x: x if str(x[0])!='<' else float(x[1:])/2)
I am still getting '<' values in the new df, no error messages.
What have I done wrong?
df=pd.DataFrame()
df['Cu']=[3.7612,1.3693, 2.7502,1.407,4.2066,6.4409,6.8136,"<0.05","<0.05",0.94,0.07,1.82,2.63,1.36,0.78]
df.apply(lambda x: x if str(x)[0]!='<' else float(str(x)[1:])/2)
df
gives
Cu
0 3.7612
1 1.3693
2 2.7502
3 1.407
4 4.2066
5 6.4409
6 6.8136
7 <0.05
8 <0.05
9 0.94
10 0.07
11 1.82
12 2.63
13 1.36
14 0.78
Your code won't work with non-strings like integers or floats, since you cannot index them without first converting them to a string. You can explicitly cast everything to string and then perform your indexing.
You would also want to check for empty strings before applying the lambda, since str(x)[0] would raise an IndexError on an empty string.
# Explicitly cast to string and perform the indexing
func = lambda x: x if str(x)[0] != '<' else float(str(x)[1:]) / 2
li = ['<1', '<0.5', '<5', 1, 'hello', 4.0, '']
# Filter out empty strings
print([func(item) for item in li if item])
The output will be
[0.5, 0.25, 2.5, 1, 'hello', 4.0]
The method apply has an axis argument. By default, axis=0, which means that your lambda function is applied successively to each column in the dataframe. In your case, the lambda function is applied to the column 'Cu', meaning that the argument x is actually a column and str(x)[0] is not what you think.
You should use applymap instead, to apply the lambda function element-wise:
df.applymap(lambda x: x if str(x)[0] != '<' else float(str(x)[1:])/2)
I think you need to apply the lambda function only to the Cu column, so the correct solution is to use Series.apply:
df['Cu'] = df['Cu'].apply(lambda x: x if str(x)[0] != '<' else float(str(x)[1:])/2)
print(df)
Cu
0 3.7612
1 1.3693
2 2.7502
3 1.4070
4 4.2066
5 6.4409
6 6.8136
7 0.0250
8 0.0250
9 0.9400
10 0.0700
11 1.8200
12 2.6300
13 1.3600
14 0.7800
If you need to apply the function to all columns, use IanS's solution.
Here is how it works:
import pandas as pd

df = pd.DataFrame()
df['Cu'] = [3.7612, 1.3693, 2.7502, 1.407, 4.2066, 6.4409, 6.8136, "<0.05", "<0.05", 0.94, 0.07, 1.82, 2.63, 1.36, 0.78]
# with raw=True each row arrives as a plain array, so x[0] is the cell value
df['Cu'] = df.apply(lambda x: x[0] if not isinstance(x[0], str) else float(x[0][1:])/2, axis=1, raw=True)
print(df)
print(df)
result:
Cu
0 3.7612
1 1.3693
2 2.7502
3 1.407
4 4.2066
5 6.4409
6 6.8136
7 0.025
8 0.025
9 0.94
10 0.07
11 1.82
12 2.63
13 1.36
14 0.78
In your question you say
e.g. <1 becomes 0.5, <0.5 becomes 0.25, <5 becomes 2.5 etc. ordinary numbers and text should be unchanged.
Now, in the example you have given there are only the first two kinds of data, strings like <1 and floats, but you seem to want to be able to retain any other kind of text, too. However, I see mixing different dtypes in one column as a bad dataframe layout that will only cause trouble down the road.
If, for example, you had some text hello in your column, a simple operation like the following would silently concatenate the strings:
df['Cu'] * 2
# [...]
# 6 13.6272
# 7 hellohello
# 8 0.05
# 9 1.88
# [...]
# Name: Cu, dtype: object
This is most likely not what you want.
Now, I don't know what other kinds of text you have, but for the examples given I would recommend normalizing the dtypes first. For that, we create a new column df['less_than'] from the "uncertainty information":
import pandas as pd

df = pd.DataFrame()
df['Cu'] = [3.7612, 1.3693, 2.7502, 1.407, 4.2066, 6.4409, 6.8136, "<0.05", "<0.05", 0.94, 0.07, 1.82, 2.63, 1.36, 0.78]
# .str methods return NaN for non-string cells; na=False maps those to False
df['less_than'] = df['Cu'].str.startswith('<', na=False)
# strip the leading '<' from the flagged cells, then make the column numeric
df.loc[df['less_than'], 'Cu'] = df.loc[df['less_than'], 'Cu'].str.slice(1)
df['Cu'] = df['Cu'].astype(float)
# Cu less_than
# 0 3.7612 False
# 1 1.3693 False
# 2 2.7502 False
# 3 1.4070 False
# 4 4.2066 False
# 5 6.4409 False
# 6 6.8136 False
# 7 0.0500 True
# 8 0.0500 True
# 9 0.9400 False
# 10 0.0700 False
# 11 1.8200 False
# 12 2.6300 False
# 13 1.3600 False
# 14 0.7800 False
This enables us to treat the entire column df['Cu'] uniformly, making your "<1 becomes 0.5" operation a simple one-liner:
df.loc[df['less_than'], 'Cu'] /= 2
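And since the column is now plain float throughout, ordinary numeric operations behave as expected, for instance:
df['Cu'].mean()   # a proper numeric mean, with no silent string concatenation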
