Python dataframe check if a value in a column dataframe is within a range of values reported in another dataframe - python

Apologies if the problem is trivial, but as a Python newbie I wasn't able to find the right solution.
I have two dataframes, and I need to add a column to the first dataframe that is True if a certain value of the first dataframe falls between two values of the second dataframe, and False otherwise.
for example:
first_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2':[10,22,15,15,7,130,2]})
second_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2_start':[5,20,11,11,5,110,220],'code2_end':[15,25,20,20,10,120,230]})
first_df
code1 code2
0 1 10
1 1 22
2 2 15
3 2 15
4 3 7
5 1 130
6 1 2
second_df
code1 code2_end code2_start
0 1 15 5
1 1 25 20
2 2 20 11
3 2 20 11
4 3 10 5
5 1 120 110
6 1 230 220
For each row in the first dataframe I need to check whether the value in the code2 column falls within one of the possible ranges identified by the rows of the second dataframe, second_df. For example:
in row 1 of first_df, code1=1 and code2=22;
checking second_df, I have 4 rows with code1=1 (rows 0, 1, 5 and 6), and the value code2=22 is in the interval identified by code2_start=20 and code2_end=25, so the function should return True.
Considering an example where the function should return False:
in row 5 of first_df, code1=1 and code2=130,
but there is no interval containing 130 where code1=1.
I have tried to use this function
def check(first_df, second_df):
    for i in range(len(first_df)):
        return ((second_df.code2_start <= first_df.code2[i]) & (second_df.code2_end <= first_df.code2[i]) & (second_df.code1 == first_df.code1[i])).any()
and to vectorize it
first_df['output'] = np.vectorize(check)(first_df, second_df)
but obviously with no success.
I would be happy for any input you could provide.
thx.
A.
As a practical example:
first_df.code1[0] = 1
therefore I need to search in second_df for all the instances where
second_df.code1 == first_df.code1[0]
0 True
1 True
2 False
3 False
4 False
5 True
6 True
for the instances 0,1,5,6 where the status is True I need to check if the value
first_df.code2[0]
10
is within one of the ranges identified by
second_df[second_df.code1 == first_df.code1[0]][['code2_start','code2_end']]
code2_start code2_end
0 5 15
1 20 25
5 110 120
6 220 230
Since the value of first_df.code2[0] is 10, it is between 5 and 15, i.e. the range identified by row 0, therefore my function should return True. In the case of first_df.code1[6] the value would still be 1, so the range table would still be the same as above, but first_df.code2[6] is 2 in this case and there is no interval containing 2, therefore the result should be False.

first_df['output'] = (second_df.code2_start <= first_df.code2) & (second_df.code2_end >= first_df.code2)
This works because when you do something like: second_df.code2_start <= first_df.code2
You get a boolean Series. If you then perform a logical AND on two of these boolean series, you get a Series which has value True where both Series were True and False otherwise.
Here's an example:
>>> import pandas as pd
>>> a = pd.DataFrame([{1:2,2:4,3:6},{1:3,2:6,3:9},{1:4,2:8,3:10}])
>>> a['output'] = (a[2] <= a[3]) & (a[2] >= a[1])
>>> a
1 2 3 output
0 2 4 6 True
1 3 6 9 True
2 4 8 10 True
EDIT:
So based on your updated question and my new interpretation of your problem, I would do something like this:
import pandas as pd
# Define some data to work with
df_1 = pd.DataFrame([{'c1':1,'c2':5},{'c1':1,'c2':10},{'c1':1,'c2':20},{'c1':2,'c2':8}])
df_2 = pd.DataFrame([{'c1':1,'start':3,'end':6},{'c1':1,'start':7,'end':15},{'c1':2,'start':5,'end':15}])
# Function checks if c2 value is within any range matching c1 value
def checkRange(x, code_range):
    idx = code_range.c1 == x.c1
    code_range = code_range.loc[idx]
    check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
    return check.any()
# Apply the checkRange function to each row of the DataFrame
df_1['output'] = df_1.apply(lambda x: checkRange(x, df_2), axis=1)
What I do here is define a function called checkRange which takes as input x, a single row of df_1, and code_range, the entire df_2 DataFrame. It first finds the rows of code_range which have the same c1 value as the given row, x.c1, and the non-matching rows are discarded. This is done in the first two lines:
idx = code_range.c1 == x.c1
code_range = code_range.loc[idx]
Next, we get a boolean Series which tells us if x.c2 falls within any of the ranges given in the reduced code_range DataFrame:
check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
Finally, since we only care whether x.c2 falls within one of the ranges, we return the value of check.any(). When we call any() on a boolean Series, it returns True if any of the values in the Series are True.
To call the checkRange function on each row of df_1, we can use apply(). I define a lambda expression in order to send the checkRange function the row as well as df_2. axis=1 means that the function will be called on each row (instead of each column) for the DataFrame.
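If apply() turns out to be slow on a large frame, a merge-based version of the same check can be written in vectorized form. This is only a sketch using the same toy data and column names as above (c1, c2, start, end), not part of the original answer; the groupby at the end collapses the merged rows back to one True/False per original row:
import pandas as pd

df_1 = pd.DataFrame([{'c1':1,'c2':5},{'c1':1,'c2':10},{'c1':1,'c2':20},{'c1':2,'c2':8}])
df_2 = pd.DataFrame([{'c1':1,'start':3,'end':6},{'c1':1,'start':7,'end':15},{'c1':2,'start':5,'end':15}])

# pair every row of df_1 with every row of df_2 that shares its c1 value,
# then flag the pairs where c2 falls inside the interval
merged = df_1.reset_index().merge(df_2, on='c1', how='left')
merged['hit'] = (merged['start'] <= merged['c2']) & (merged['c2'] <= merged['end'])

# a row of df_1 passes if any of its merged rows is a hit
df_1['output'] = merged.groupby('index')['hit'].any().values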

Related

Pandas TypeError bad operand type for unary ~: 'list', Python 3.10.1

This is a sample of the dataset where I am getting this error.
In the first row the value 'Week' appears twice.
What I wanted to do is:
look in the first row and check every cell for the value 'Week',
make a Boolean list of the cells that have the value 'Week',
and remove those columns.
import pandas as pd
dftest = pd.DataFrame([*zip(['Week','Total','Total'],
[4,685,633],
['2017-01-23 00:00:00',369.37913186561053,341.67926027078333],
['2017-01-24 00:00:00',349.89972501652701,340.126939283226434],
['2017-01-28 00:00:00',353.74896050667999,314.016037939271868],
[5,675,619],
['Week','Total','Total']
)])
df2 = ~dftest.iloc[0].isin(['Week', 'Total']).tolist()
When I try to invert the Boolean list with tilde ~ I get this error. I can use Numpy invert to solve it, but I'm not sure why the tilde is not working.
b = np.invert(df2)
b[0] = True ## Skip first column
df = df.iloc[:,b ]
End result
0 1 2 3 4 5
1 Week 4 2017-01-23 2017-01-24 2017-01-28 5
2 Total 685 369.379132 349.899725 353.748961 675
3 Total 633 341.679260 340.126939 314.016038 619
IIUC, you want to check only in the first row and return True / False for whether the value is 'Week' or 'Total'.
To do that with iloc, you can slightly rewrite your code to this, which returns your expectation:
dftest.iloc[0,:].isin(['Week','Total'])
0 True # --> Column 0 contains either Week or Total
1 False
2 False
3 False
4 False
5 False
6 True # --> Column 6 contains either Week or Total
Then you can use your ~.
To get your new filtered dataframe, you can wrap it in loc, which can be used with a boolean array:
df2 = dftest.loc[~dftest.iloc[0,:].isin(['Week','Total'])]
print(df2)
0 1 2 3 4 5 6
1 Total 685 369.379132 349.899725 353.748961 675 Total
2 Total 633 341.679260 340.126939 314.016038 619 Total
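A side note on the original error (just a sketch of an alternative aimed at the column-dropping end result in the question, not part of the answer above): the ~ fails because .tolist() turns the boolean Series into a plain Python list, and lists do not support the ~ operator. Keeping the mask as a Series lets ~ work, and passing it to .loc with a column slice drops the matching columns:
# keep the mask as a boolean Series so ~ works, then select the columns whose
# first-row value is not Week/Total; re-keep the first column as in the question
mask = ~dftest.iloc[0].isin(['Week', 'Total'])
mask.iloc[0] = True
df_filtered = dftest.loc[:, mask]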

Select rows where multiple column values are in multiple lists

I want to select values from a dataframe such as:
Vendor_1 Vendor_2 Vendor_3
0 1 0 0
1 0 20 0
2 0 0 300
3 4 0 0
4 0 50 0
5 0 0 500
The values I want to keep from Vendor_1, 2, 3 are all inside separate lists, i.e. v_1, v_2, v_3. For example, say v_1 = [1], v_2 = [20], v_3 = [500], meaning I want only these rows to stay.
I've tried something like:
df = df[(df['Vendor_1'].isin(v_1)) & (df['Vendor_2'].isin(v_2)) & ... ]
This gives me an empty dataframe. Is the problem with the above logic, or is it that no rows satisfy these constraints (highly unlikely in my real dataframe)?
Cheers
EDIT:
OK, so I've realised a fundamental difference between my example and what my df actually looks like: if there is a value for Vendor_1 then Vendor_2 and Vendor_3 must be 0, etc. So my logic with the isin chain doesn't make sense, right? I'll update the example df.
So I feel like I need to make 3 subsets and then merge them or something?
isin accepts a dictionary:
d = {
    'Vendor_1': [1],
    'Vendor_2': [20],
    'Vendor_3': [500]
}
df.isin(d)
Output:
Vendor_1 Vendor_2 Vendor_3
0 True False False
1 False True False
2 False False False
3 False False False
4 False False False
5 False False True
And then depending on your logic, you want to check for any or all:
df[df.isin(d).any(1)]
Output:
Vendor_1 Vendor_2 Vendor_3
0 1 0 0
1 0 20 0
5 0 0 500
If you use all instead, you require that Vendor_1=1, Vendor_2=20, and Vendor_3=500 all occur in the same row, and only such rows would be kept.
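For completeness, here is a small sketch contrasting the two aggregations (using the explicit axis=1 keyword; .any(1) above is the same thing written positionally):
matches = df.isin(d)

# keep rows where at least one vendor column matches its list (same as .any(1) above)
df[matches.any(axis=1)]

# keep rows where every vendor column matches its list; for this data no row has
# Vendor_1=1, Vendor_2=20 and Vendor_3=500 at once, so this returns an empty frame
df[matches.all(axis=1)]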
The example you're giving should work unless there are effectively no rows that match that condition.
Those expressions are a bit tricky with the parens so I'd rather split the line in two for easier debugging:
mask = (df['Vendor_1'].isin(v_1)) & (df['Vendor_2'].isin(v_2))
# sanity check that the mask is selecting something
assert mask.any()
df = df[mask]
Note that when combining conditions with &, each condition needs its own parentheses because of operator precedence rules: & binds more tightly than comparison operators such as ==.
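A minimal illustration of the pitfall, using a made-up two-column frame rather than the question's data: without parentheses Python reads the expression as a chained comparison around 1 & df['b'] and raises an error about the ambiguous truth value of a Series.
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [2, 2]})

ok = (df['a'] == 1) & (df['b'] == 2)   # what you want
# bad = df['a'] == 1 & df['b'] == 2    # parsed as df['a'] == (1 & df['b']) == 2 -> ValueError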

Modify function to evaluate all values on row

Test data:
import pandas as pd
import numpy as np
from itertools import combinations
df2 = pd.DataFrame({'AAA': [4, 5, 6, 7, 9, 10],
                    'BBB': [10, 20, 30, 40, 11, 10],
                    'CCC': [100, 50, 25, 10, 10, 11],
                    'DDD': [100, 50, 25, 10, 10, 11]})
thresh = 10
My function:
def closeCols2(df):
    for k1, k2 in combinations(df.keys(), 2):
        if abs(df[k1] - df[k2]) < thresh:
            return max(df[k1], df[k2])
This gives me the following output showing the max value of a row if two columns are within thresh:
df2.apply(closeCols2, axis=1)
0 10
1 50
2 30
3 10
4 11
5 10
dtype: int64
But columns CCC (100) and DDD (100) in the first row also have values within thresh, and these are not being evaluated. How do I modify my function to capture this?
In your code the function returns as soon as it finds an absolute difference less than the defined thresh. So the first time the condition is met in the first row, for columns 'AAA' (4) and 'BBB' (10), it returns the value (10) and stops without even evaluating the remaining columns. I don't know exactly what you want to do, but you may try to adapt your function like this:
def closeCols2(df):
    max_value = None
    for k1, k2 in combinations(df.keys(), 2):
        if abs(df[k1] - df[k2]) < thresh:
            if max_value is None:
                max_value = max(df[k1], df[k2])
            else:
                # Max of the max
                max_value = max(max_value, max(df[k1], df[k2]))
    return max_value
df2.apply(closeCols2, axis=1)
# 0 100
# 1 50
# 2 30
# 3 10
# 4 11
# 5 11
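As a slightly more compact variant of the same idea (just a sketch, behaviour matches the fixed function above), you can take the max over all qualifying pairs with a generator expression and let default=None cover rows where no pair is within thresh:
from itertools import combinations

def closeCols3(row):
    # larger value of every pair whose absolute difference is below thresh;
    # default=None is returned when no pair qualifies
    return max((max(row[k1], row[k2])
                for k1, k2 in combinations(row.keys(), 2)
                if abs(row[k1] - row[k2]) < thresh),
               default=None)

df2.apply(closeCols3, axis=1)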

Python Pandas Counting the Occurrences of a Specific value

I am trying to find the number of times a certain value appears in one column.
I have made the dataframe with data = pd.DataFrame.from_csv('data/DataSet2.csv')
and now I want to find the number of times something appears in a column. How is this done?
I thought it was the below, where I am looking in the education column and counting the number of times '?' occurs.
The code below shows that I am trying to find the number of times 9th appears and the error is what I am getting when I run the code
Code
missing2 = df.education.value_counts()['9th']
print(missing2)
Error
KeyError: '9th'
You can create a subset of the data with your condition and then use shape or len:
print df
col1 education
0 a 9th
1 b 9th
2 c 8th
print df.education == '9th'
0 True
1 True
2 False
Name: education, dtype: bool
print df[df.education == '9th']
col1 education
0 a 9th
1 b 9th
print df[df.education == '9th'].shape[0]
2
print len(df[df['education'] == '9th'])
2
Performance is interesting; the fastest solution is to compare the numpy array and sum:
Code:
import perfplot, string
import numpy as np
import pandas as pd

np.random.seed(123)

def shape(df):
    return df[df.education == 'a'].shape[0]

def len_df(df):
    return len(df[df['education'] == 'a'])

def query_count(df):
    return df.query('education == "a"').education.count()

def sum_mask(df):
    return (df.education == 'a').sum()

def sum_mask_numpy(df):
    return (df.education.values == 'a').sum()

def make_df(n):
    L = list(string.ascii_letters)
    df = pd.DataFrame(np.random.choice(L, size=n), columns=['education'])
    return df

perfplot.show(
    setup=make_df,
    kernels=[shape, len_df, query_count, sum_mask, sum_mask_numpy],
    n_range=[2**k for k in range(2, 25)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
A couple of ways using count or sum:
In [338]: df
Out[338]:
col1 education
0 a 9th
1 b 9th
2 c 8th
In [335]: df.loc[df.education == '9th', 'education'].count()
Out[335]: 2
In [336]: (df.education == '9th').sum()
Out[336]: 2
In [337]: df.query('education == "9th"').education.count()
Out[337]: 2
An elegant way to count the occurrences of '?' or any symbol in any column is to use the built-in isin method of a DataFrame object.
Suppose that we have loaded the 'Automobile' dataset into the df object.
We do not know which columns contain missing values (the '?' symbol), so let's do:
df.isin(['?']).sum(axis=0)
The official documentation for DataFrame.isin(values) says:
it returns a boolean DataFrame showing whether each element in the DataFrame
is contained in values
Note that isin accepts an iterable as input, thus we need to pass a list containing the target symbol to this function. df.isin(['?']) will return a boolean dataframe as follows.
symboling normalized-losses make fuel-type aspiration-ratio ...
0 False True False False False
1 False True False False False
2 False True False False False
3 False False False False False
4 False False False False False
5 False True False False False
...
To count the number of occurrences of the target symbol in each column, let's take the sum over all the rows of the above dataframe by indicating axis=0.
The final (truncated) result shows what we expect:
symboling 0
normalized-losses 41
...
bore 4
stroke 4
compression-ratio 0
horsepower 2
peak-rpm 2
city-mpg 0
highway-mpg 0
price 4
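If you just need the grand total of '?' cells across the whole frame rather than per column, a second sum() collapses the per-column counts (a small extension, not part of the original answer):
# total number of '?' cells in the entire DataFrame
df.isin(['?']).sum(axis=0).sum()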
Try this:
(df['education']=='9th').sum()
easy but not efficient:
list(df.education).count('9th')
Simple example to count occurrences (unique values) in a column in Pandas data frame:
import pandas as pd
# URL to .csv file
data_url = 'https://yoursite.com/Arrests.csv'
# Reading the data
df = pd.read_csv(data_url, index_col=0)
# pandas count distinct values in column
df['education'].value_counts()
Outputs:
Education 47516
9th 41164
8th 25510
7th 25198
6th 25047
...
3rd 2
2nd 2
1st 2
Name: name, Length: 190, dtype: int64
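One caveat worth adding (not part of the original answer): indexing the value_counts() result with ['9th'] raises the same KeyError as in the question whenever the value never occurs; Series.get() returns a default instead:
# returns 0 instead of raising KeyError when '9th' never occurs in the column
count_9th = df['education'].value_counts().get('9th', 0)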
For finding a specific value in a column you can use the code below; irrespective of preference, you can use whichever of the methods you like:
df.col_name.value_counts().Value_you_are_looking_for
Take the example of the Titanic dataset:
df.Sex.value_counts().male
This gives a count of all males on the ship.
However, if the value you are counting is numeric, the attribute-style lookup above will not work (you would need bracket indexing, e.g. value_counts()[1]), so for that you can use the second method.
The second method is:
# this is an example of counting rows of a data frame that match a condition
df[(df['Survived']==1) & (df['Sex']=='male')].shape[0]
This is not as efficient as value_counts(), but it will surely help if you want to count the rows of a data frame that match a condition.
hope this helps
EDIT --
If you want to look up a value that has a space in it (where attribute access will not work), you may use bracket indexing:
df.country.value_counts()['united states']
I believe this should solve the problem.
I think this could be an easier solution. Suppose you have the following data frame:
DATE LANG POSTS
2008-07-01 c# 3
2008-08-01 assembly 8
2008-08-01 javascript 2
2008-08-01 c 85
2008-08-01 python 11
2008-07-01 c# 3
2008-08-01 assembly 8
2008-08-01 javascript 62
2008-08-01 c 85
2008-08-01 python 14
You can find the total for each LANG value like this:
df.groupby('LANG').sum()
and you will have the sum for each individual language.
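For the frame above, selecting the POSTS column before summing keeps the DATE strings out of the aggregation; a sketch of the expected result, assuming the sample data shown, is:
df.groupby('LANG')['POSTS'].sum()
# LANG
# assembly       16
# c             170
# c#              6
# javascript     64
# python         25
# Name: POSTS, dtype: int64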

pandas dataframe groupby like mysql, yet into new column

df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
df
A mask values
0 11 0 10
1 11 0 15
2 22 0 20
3 22 1 25
Now how can I group by A, keep the column names intact, and yet put a custom function into Z:
def calculate_df_stats(dfs):
    mask_ = list(dfs['B'])
    mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
    return mean
df['Z'] = df.groupby('A').agg(calculate_df_stats) # does not work
and generate:
A mask values Z
0 11 0 10 12.5
1 22 0 20 25
Whatever I do, it only replaces the values column with the masked mean.
And can your solution be applied to a function that works on two columns and returns the result in a new column?
Thanks!
Edit:
To clarify more: let's say I have such a table in Mysql:
SELECT * FROM `Reader_datapoint` WHERE `wavelength` = '560'
LIMIT 200;
which gives me such result:
http://pastebin.com/qXiaWcJq
If I run now this:
SELECT *, avg(action_value) FROM `Reader_datapoint` WHERE `wavelength` = '560'
group by `reader_plate_ID`;
I get:
datapoint_ID plate_ID coordinate_x coordinate_y res_value wavelength ignore avg(action_value)
193 1 0 0 2.1783 560 NULL 2.090027083333334
481 2 0 0 1.7544 560 NULL 1.4695583333333333
769 3 0 0 2.0161 560 NULL 1.6637885416666673
How can I replicate this behaviour in Pandas? Note that all the column names stay the same, the first value is taken, and the new column is added.
If you want the original columns in your result, you can first calculate the grouped and aggregated dataframe (but you will have to aggregate your original columns in some way; I took the first occurring value as an example):
>>> df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
>>>
>>> grouped = df.groupby("A")
>>>
>>> result = grouped.agg('first')
>>> result
mask values
A
11 0 10
22 0 20
and then add a column 'Z' to that result by applying your function on the groupby result 'grouped':
>>> def calculate_df_stats(dfs):
... mask_ = list(dfs['mask'])
... mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
... return mean
...
>>> result['Z'] = grouped.apply(calculate_df_stats)
>>>
>>> result
mask values Z
A
11 0 10 12.5
22 0 20 20.0
In your function definition you can always use more columns (just by their name) to return the result.
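For instance, a function that combines two columns of each group into one new value might look like this (a hedged sketch; ratio is a made-up name purely to illustrate the pattern, reusing the grouped and result objects from above):
>>> def ratio(dfs):
...     # any expression over several columns of the group can be returned here
...     return dfs['values'].sum() / (dfs['mask'].sum() + 1)
...
>>> result['ratio'] = grouped.apply(ratio)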
