I want to select values from a dataframe such as:
Vendor_1 Vendor_2 Vendor_3
0 1 0 0
1 0 20 0
2 0 0 300
3 4 0 0
4 0 50 0
5 0 0 500
The values I want to keep from Vendor_1, Vendor_2 and Vendor_3 are each stored in a separate list, i.e. v_1, v_2, v_3. For example, say v_1 = [1], v_2 = [20], v_3 = [500], meaning I want only the rows containing these values to stay.
I've tried something like:
df = df[(df['Vendor_1'].isin(v_1)) & (df['Vendor_2'].isin(v_2)) & ... ]
This gives me an empty dataframe. Is the problem with the logic above, or is it that no rows satisfy these constraints (highly unlikely in my real dataframe)?
Cheers
EDIT:
OK, so I've realised a fundamental difference between my example and what my df actually looks like: if there is a value for Vendor_1, then Vendor_2 and Vendor_3 must be 0, and so on. So my logic with the isin chain doesn't make sense, right? I'll update the example df.
So I feel like I need to make 3 subsets and then merge them or something?
isin accepts a dictionary:
d = {
    'Vendor_1': [1],
    'Vendor_2': [20],
    'Vendor_3': [500]
}
df.isin(d)
Output:
Vendor_1 Vendor_2 Vendor_3
0 True False False
1 False True False
2 False False False
3 False False False
4 False False False
5 False False True
And then depending on your logic, you want to check for any or all:
df[df.isin(d).any(axis=1)]
Output:
Vendor_1 Vendor_2 Vendor_3
0 1 0 0
1 0 20 0
5 0 0 500
But if you use all in this case, you require that Vendor_1=1, Vendor_2=20, and Vendor_3=500 all occur in the same row, and only such rows are kept.
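As a minimal, self-contained sketch of the any/all difference, the frame below rebuilds the example data from the question (the names kept_any and kept_all are only illustrative):

```python
import pandas as pd

# Example frame from the question: each row has a value for only one vendor.
df = pd.DataFrame({
    "Vendor_1": [1, 0, 0, 4, 0, 0],
    "Vendor_2": [0, 20, 0, 0, 50, 0],
    "Vendor_3": [0, 0, 300, 0, 0, 500],
})
d = {"Vendor_1": [1], "Vendor_2": [20], "Vendor_3": [500]}

mask = df.isin(d)                 # per-column membership test
kept_any = df[mask.any(axis=1)]   # row matches in at least one column
kept_all = df[mask.all(axis=1)]   # row must match in every column

print(kept_any)
print(kept_all)                   # empty for this data
```

With data shaped like this, all can never be satisfied, which would explain the empty result from chaining the conditions with &.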
The example you're giving should work unless there are effectively no rows that match that condition.
Those expressions are a bit tricky with the parens so I'd rather split the line in two for easier debugging:
mask = (df['Vendor_1'].isin(v_1)) & (df['Vendor_2'].isin(v_2))
# sanity check that the mask is selecting something
assert mask.any()
df = df[mask]
Note that you must put parens around each condition joined by & because of operator precedence rules (& binds more tightly than comparisons such as ==).
For example:
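A small sketch of the precedence issue, using a made-up frame (the columns a and b are assumptions for illustration only). Without parens, Python evaluates 1 & df["b"] first and then treats the rest as a chained comparison, which pandas rejects:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [2, 3, 2]})

# Correct: each comparison is parenthesised before combining with &.
mask = (df["a"] == 1) & (df["b"] == 2)

# Without parens this parses as df["a"] == (1 & df["b"]) == 2,
# a chained comparison that needs the truth value of a Series.
raised = False
try:
    df["a"] == 1 & df["b"] == 2
except ValueError:
    raised = True

print(mask.tolist(), raised)
```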
So I took the mean of a pandas data frame column that contains boolean values. I've done this in the past multiple times and understood that it would return the proportion that is True. But when I wrote it in this particular instance, it didn't work. It returns the proportion that is False and not only that, the denominator it uses doesn't seem to relate to anything. I have no idea where it pulls the denominator from to calculate the proportion value. I discovered it works the way I want it to when I remove the second line of code (datadf = datadf[1:])
# get current row value minus previous row value and returns True if > 0
datadf['increase'] = datadf.index.map(lambda x: datadf.loc[x]['price'] - datadf.loc[x-1]['price'] > 0 if x > 0 else None)
# remove first row because it gives 'None'
datadf = datadf[1:]
# calculate proportion that is True
accretionscore = datadf['increase'].mean()
This is the output
date price increase
1 2020-09-28 488.51 True
2 2020-09-29 489.33 True
3 2020-09-30 490.43 True
4 2020-10-01 499.51 True
5 2020-10-02 478.99 False
correct value: 0.8
value given: 0.2
When I try adding another sample that's when things get weirder:
date price increase
1 2020-09-27 479.78 False
2 2020-09-28 488.51 True
3 2020-09-29 489.33 True
4 2020-09-30 490.43 True
5 2020-10-01 499.51 True
6 2020-10-02 478.99 False
correct value: 0.6666666666666666
value given: 0.16666666666666666
they don't even add up to 1!
I'm so confused. Can anyone tell me what is going on? How does taking out the second line fix the problem?
Hint: if you want to convert from boolean to int, then you just can use:
datadf['increase'] = datadf['increase'].astype(int)
and this way things will work fine.
If we run your code, you can see that datadf['increase'] has dtype object instead of bool (the None in the first row prevents pandas from storing it as a boolean column), so mean() is not averaging a proper boolean column and you get something weird:
import pandas as pd
datadf = pd.DataFrame({'price':[470,488.51,489.33,490.43,499.51,478.99]})
datadf['increase'] = datadf.index.map(lambda x: datadf.loc[x]['price'] - datadf.loc[x-1]['price'] > 0 if x > 0 else None)
datadf['increase']
Out[8]:
0 None
1 True
2 True
3 True
4 True
5 False
Name: increase, dtype: object
datadf['increase'].dtype
dtype('O')
From what I can see, you want True / False on whether the row is larger than its preceding, so do:
datadf['increase'] = datadf.price > datadf.price.shift(1)
datadf['increase'].dtype
dtype('bool')
And we just omit the first row by doing:
datadf['increase'][1:].mean()
0.8
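The same idea can be sketched with diff(), which subtracts the previous row directly. This is an alternative formulation under the same example prices, not the original poster's code:

```python
import pandas as pd

datadf = pd.DataFrame({"price": [470, 488.51, 489.33, 490.43, 499.51, 478.99]})

# diff() gives NaN for the first row; NaN > 0 is False, so the result stays bool.
increase = datadf["price"].diff().gt(0)

# Skip the first row, which has no predecessor, then take the share of True values.
accretionscore = increase.iloc[1:].mean()
print(increase.dtype, accretionscore)
```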
I have a pandas series, for example: x = pandas.Series([-1,20,"test"]).
Now I would like to test if -1 is contained in x without looping over the whole series. I could transform the whole series to string and then test if "-1" in x, but sometimes I have -1.0 and sometimes -1 and so on, so this is not a good choice.
Is there another possibility to approach this?
What about
x.isin([-1])
output:
0 True
1 False
2 False
dtype: bool
Or if you want to have a count of how many instances:
x.isin([-1]).sum()
Output:
1
I think you can do something like this to handle data that appears to be string-like and integer-like. Pandas Series are all a single datatype.
x = pd.Series([-1,20,"test","-1.0"])
print(x)
0 -1
1 20
2 test
3 -1.0
dtype: object
(pd.to_numeric(x, errors='coerce') == -1).sum()
Note: Any value that cannot be cast into a number will return NaN.
Output
2
If you just want to see if a -1 appears in x then you can use
(pd.to_numeric(x, errors='coerce') == -1).sum() > 0
Output:
True
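Wrapped up as a small helper, the coercion approach might look like the sketch below (contains_number is a hypothetical name, not a pandas function):

```python
import pandas as pd

def contains_number(series, target):
    """Return True if any element equals target after numeric coercion."""
    # Values that cannot be parsed as numbers become NaN and never match.
    numeric = pd.to_numeric(series, errors="coerce")
    return bool((numeric == target).any())

x = pd.Series([-1, 20, "test", "-1.0"])
print(contains_number(x, -1))
print(contains_number(x, 5))
```

Using any() instead of sum() > 0 expresses the "does it appear at all" intent directly.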
x.isin([-1])
Gives me:
0 True
1 False
2 False
dtype: bool
You can refer to docs for more info.
Apologies if the problem is trivial, but as a Python newbie I wasn't able to find the right solution.
I have two dataframes and I need to add a column to the first dataframe that is true if a certain value of the first dataframe is between two values of the second dataframe otherwise false.
for example:
first_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2':[10,22,15,15,7,130,2]})
second_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2_start':[5,20,11,11,5,110,220],'code2_end':[15,25,20,20,10,120,230]})
first_df
code1 code2
0 1 10
1 1 22
2 2 15
3 2 15
4 3 7
5 1 130
6 1 2
second_df
code1 code2_end code2_start
0 1 15 5
1 1 25 20
2 2 20 11
3 2 20 11
4 3 10 5
5 1 120 110
6 1 230 220
For each row in the first dataframe I should check whether the value in the code2 column falls within one of the ranges defined by the rows of the second dataframe second_df, for example:
in row 1 of first_df code1=1 and code2=22
checking second_df I have 4 rows with code1=1, rows 0,1,5 and 6, the value code2=22 is in the interval identified by code2_start=20 and code2_end=25 so the function should return True.
Considering an example where the function should return False,
in row 5 of first_df code1=1 and code2=130
but there is no interval containing 130 where code1=1
I have tried to use this function
def check(first_df,second_df):
    for i in range(len(first_df)):
        return ((second_df.code2_start <= first_df.code2[i]) & (second_df.code2_end <= first_df.code2[i]) & (second_df.code1 == first_df.code1[i])).any()
and to vectorize it
first_df['output'] = np.vectorize(check)(first_df, second_df)
but obviously with no success.
I would be happy for any input you could provide.
thx.
A.
As a practical example:
first_df.code1[0] = 1
therefore I need to search second_df for all the instances where
second_df.code1 == first_df.code1[0]
0 True
1 True
2 False
3 False
4 False
5 True
6 True
for the instances 0,1,5,6 where the status is True I need to check if the value
first_df.code2[0]
10
is between one of the range identified by
second_df[second_df.code1 == first_df.code1[0]][['code2_start','code2_end']]
code2_start code2_end
0 5 15
1 20 25
5 110 120
6 220 230
Since the value of first_df.code2[0] is 10, it is between 5 and 15, i.e. within the range identified by row 0, therefore my function should return True. In the case of first_df.code1[6] the value would still be 1, so the range table would be the same as above, but first_df.code2[6] is 2 and there is no interval containing 2, therefore the result should be False.
first_df['output'] = (second_df.code2_start <= first_df.code2) & (second_df.code2_end >= first_df.code2)
This works because when you do something like: second_df.code2_start <= first_df.code2
You get a boolean Series. If you then perform a logical AND on two of these boolean series, you get a Series which has value True where both Series were True and False otherwise.
Here's an example:
>>> import pandas as pd
>>> a = pd.DataFrame([{1:2,2:4,3:6},{1:3,2:6,3:9},{1:4,2:8,3:10}])
>>> a['output'] = (a[2] <= a[3]) & (a[2] >= a[1])
>>> a
1 2 3 output
0 2 4 6 True
1 3 6 9 True
2 4 8 10 True
EDIT:
So based on your updated question and my new interpretation of your problem, I would do something like this:
import pandas as pd
# Define some data to work with
df_1 = pd.DataFrame([{'c1':1,'c2':5},{'c1':1,'c2':10},{'c1':1,'c2':20},{'c1':2,'c2':8}])
df_2 = pd.DataFrame([{'c1':1,'start':3,'end':6},{'c1':1,'start':7,'end':15},{'c1':2,'start':5,'end':15}])
# Function checks if c2 value is within any range matching c1 value
def checkRange(x, code_range):
    idx = code_range.c1 == x.c1
    code_range = code_range.loc[idx]
    check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
    return check.any()
# Apply the checkRange function to each row of the DataFrame
df_1['output'] = df_1.apply(lambda x: checkRange(x, df_2), axis=1)
What I do here is define a function called checkRange which takes as input x, a single row of df_1 and code_range, the entire df_2 DataFrame. It first finds the rows of code_range which have the same c1 value as the given row, x.c1. Then the non matching rows are discarded. This is done in the first 2 lines:
idx = code_range.c1 == x.c1
code_range = code_range.loc[idx]
Next, we get a boolean Series which tells us if x.c2 falls within any of the ranges given in the reduced code_range DataFrame:
check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
Finally, since we only care that the x.c2 falls within one of the ranges, we return the value of check.any(). When we call any() on a boolean Series, it will return True if any of the values in the Series are True.
To call the checkRange function on each row of df_1, we can use apply(). I define a lambda expression in order to send the checkRange function the row as well as df_2. axis=1 means that the function will be called on each row (instead of each column) for the DataFrame.
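For larger frames, a row-wise apply can be slow. One possible alternative, sketched here under the assumption that df_1 has a unique default index, is to merge the two frames on c1 and reduce the per-range checks back to one value per original row (the names merged and in_range are illustrative):

```python
import pandas as pd

df_1 = pd.DataFrame({"c1": [1, 1, 1, 2], "c2": [5, 10, 20, 8]})
df_2 = pd.DataFrame({"c1": [1, 1, 2], "start": [3, 7, 5], "end": [6, 15, 15]})

# Pair every row of df_1 with every range that shares its c1 value,
# keeping the original row label in a column named "index".
merged = df_1.reset_index().merge(df_2, on="c1", how="left")

# True where c2 lies inside the paired range (both bounds inclusive).
in_range = merged["c2"].between(merged["start"], merged["end"])

# A row passes if any of its paired ranges contains c2.
df_1["output"] = in_range.groupby(merged["index"]).any()
print(df_1)
```

The merge can produce many rows when a c1 value has many ranges, so this trades memory for speed.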
I am trying to find the number of times a certain value appears in one column.
I have made the dataframe with data = pd.DataFrame.from_csv('data/DataSet2.csv')
and now I want to find the number of times something appears in a column. How is this done?
I thought it was the below, where I am looking in the education column and counting the number of time ? occurs.
The code below shows that I am trying to find the number of times 9th appears and the error is what I am getting when I run the code
Code
missing2 = df.education.value_counts()['9th']
print(missing2)
Error
KeyError: '9th'
You can create subset of data with your condition and then use shape or len:
print(df)
col1 education
0 a 9th
1 b 9th
2 c 8th
print(df.education == '9th')
0 True
1 True
2 False
Name: education, dtype: bool
print(df[df.education == '9th'])
col1 education
0 a 9th
1 b 9th
print(df[df.education == '9th'].shape[0])
2
print(len(df[df['education'] == '9th']))
2
Performance is interesting: the fastest solution is to compare the underlying numpy array and sum:
Code:
import string
import numpy as np
import pandas as pd
import perfplot
np.random.seed(123)
def shape(df):
    return df[df.education == 'a'].shape[0]
def len_df(df):
    return len(df[df['education'] == 'a'])
def query_count(df):
    return df.query('education == "a"').education.count()
def sum_mask(df):
    return (df.education == 'a').sum()
def sum_mask_numpy(df):
    return (df.education.values == 'a').sum()
def make_df(n):
    # random single-letter values in one column named 'education'
    L = list(string.ascii_letters)
    df = pd.DataFrame(np.random.choice(L, size=n), columns=['education'])
    return df
perfplot.show(
    setup=make_df,
    kernels=[shape, len_df, query_count, sum_mask, sum_mask_numpy],
    n_range=[2**k for k in range(2, 25)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
Couple of ways using count or sum
In [338]: df
Out[338]:
col1 education
0 a 9th
1 b 9th
2 c 8th
In [335]: df.loc[df.education == '9th', 'education'].count()
Out[335]: 2
In [336]: (df.education == '9th').sum()
Out[336]: 2
In [337]: df.query('education == "9th"').education.count()
Out[337]: 2
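If the KeyError comes from the value simply not occurring in the column (an assumption, since the original data is not shown), value_counts() can still be used by looking the label up with a default. A minimal sketch on the small frame above:

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "b", "c"],
                   "education": ["9th", "9th", "8th"]})

counts = df["education"].value_counts()

# Series.get returns the default instead of raising KeyError for a missing label.
n_9th = counts.get("9th", 0)
n_7th = counts.get("7th", 0)
print(n_9th, n_7th)
```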
An elegant way to count the occurrence of '?' or any symbol in any column, is to use built-in function isin of a dataframe object.
Suppose that we have loaded the 'Automobile' dataset into df object.
We do not know which columns contain missing values (the '?' symbol), so let's do:
df.isin(['?']).sum(axis=0)
DataFrame.isin(values) official document says:
it returns boolean DataFrame showing whether each element in the DataFrame
is contained in values
Note that isin accepts an iterable as input, thus we need to pass a list containing the target symbol to this function. df.isin(['?']) will return a boolean dataframe as follows.
symboling normalized-losses make fuel-type aspiration-ratio ...
0 False True False False False
1 False True False False False
2 False True False False False
3 False False False False False
4 False False False False False
5 False True False False False
...
To count the number of occurrence of the target symbol in each column, let's take sum over all the rows of the above dataframe by indicating axis=0.
The final (truncated) result shows what we expect:
symboling 0
normalized-losses 41
...
bore 4
stroke 4
compression-ratio 0
horsepower 2
peak-rpm 2
city-mpg 0
highway-mpg 0
price 4
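The same call can be tried on a tiny stand-in frame. The values below are invented for illustration and are not taken from the Automobile dataset:

```python
import pandas as pd

# Hypothetical miniature of a dataset that marks missing values with '?'.
df = pd.DataFrame({
    "normalized-losses": ["?", "164", "?", "158"],
    "bore": ["3.47", "?", "3.19", "3.19"],
    "price": [13495, 16500, 13950, 17450],
})

# Boolean frame of matches, summed down each column.
missing_per_column = df.isin(["?"]).sum(axis=0)
print(missing_per_column)
```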
Try this:
(df['education']=='9th').sum()
easy but not efficient:
list(df.education).count('9th')
Simple example to count occurrences (unique values) in a column in Pandas data frame:
import pandas as pd
# URL to .csv file
data_url = 'https://yoursite.com/Arrests.csv'
# Reading the data
df = pd.read_csv(data_url, index_col=0)
# pandas count distinct values in column
df['education'].value_counts()
Outputs:
Education 47516
9th 41164
8th 25510
7th 25198
6th 25047
...
3rd 2
2nd 2
1st 2
Name: name, Length: 190, dtype: int64
To find the count of a specific value in a column you can use either of the methods below, whichever you prefer.
df.col_name.value_counts().Value_you_are_looking_for
Take the Titanic dataset as an example:
df.Sex.value_counts().male
This gives the count of all males on the ship.
However, attribute access like this only works when the value is a valid Python identifier, so it fails for numeric values. In that case index the result with brackets instead, e.g. df.Survived.value_counts()[1], or use the second method.
The second method is:
# count the rows of a data frame that satisfy a condition
len(df[(df['Survived']==1)&(df['Sex']=='male')])
This is not as efficient as value_counts(), but it will surely help if you want to count rows of a data frame by condition.
hope this helps
EDIT --
If you want to look for a value with a space in it,
you may use
(df.country == 'united states').sum()
I believe this should solve the problem
I think this could be an easier solution. Suppose you have the following data frame.
DATE LANG POSTS
2008-07-01 c# 3
2008-08-01 assembly 8
2008-08-01 javascript 2
2008-08-01 c 85
2008-08-01 python 11
2008-07-01 c# 3
2008-08-01 assembly 8
2008-08-01 javascript 62
2008-08-01 c 85
2008-08-01 python 14
you can find the total POSTS for each LANG item like this
df.groupby('LANG')['POSTS'].sum()
and you will have the sum for each individual language (use df['LANG'].value_counts() if you want the number of rows per language instead)
I need to filter a data frame with a dict, constructed with the key being the column name and the value being the value that I want to filter:
filter_v = {'A':1, 'B':0, 'C':'This is right'}
# this would be the normal approach
df[(df['A'] == 1) & (df['B'] ==0)& (df['C'] == 'This is right')]
But I want to do something on the lines
for column, value in filter_v.items():
    df[df[column] == value]
but this will filter the data frame several times, one value at a time, and not apply all filters at the same time. Is there a way to do it programmatically?
EDIT: an example:
df1 = pd.DataFrame({'A':[1,0,1,1, np.nan], 'B':[1,1,1,0,1], 'C':['right','right','wrong','right', 'right'],'D':[1,2,2,3,4]})
filter_v = {'A':1, 'B':0, 'C':'right'}
df1.loc[df1[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
gives
A B C D
0 1 1 right 1
1 0 1 right 2
3 1 0 right 3
but the expected result was
A B C D
3 1 0 right 3
only the last one should be selected.
IIUC, you should be able to do something like this:
>>> df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)]
A B C D
3 1 0 right 3
This works by making a Series to compare against:
>>> pd.Series(filter_v)
A 1
B 0
C right
dtype: object
Selecting the corresponding part of df1:
>>> df1[list(filter_v)]
A B C
0 1 1 right
1 0 1 right
2 1 1 wrong
3 1 0 right
4 NaN 1 right
Finding where they match:
>>> df1[list(filter_v)] == pd.Series(filter_v)
A B C
0 True False True
1 False False True
2 True False False
3 True True True
4 False False True
Finding where they all match:
>>> (df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)
0 False
1 False
2 False
3 True
4 False
dtype: bool
And finally using this to index into df1:
>>> df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)]
A B C D
3 1 0 right 3
Abstraction of the above for case of passing array of filter values rather than single value (analogous to pandas.core.series.Series.isin()). Using the same example:
df1 = pd.DataFrame({'A':[1,0,1,1, np.nan], 'B':[1,1,1,0,1], 'C':['right','right','wrong','right', 'right'],'D':[1,2,2,3,4]})
filter_v = {'A':[1], 'B':[1,0], 'C':['right']}
## Start with a boolean Series of all True, aligned to df1
ind = pd.Series(True, index=df1.index)
## Loop through filters, updating the mask
for col, vals in filter_v.items():
    ind = ind & df1[col].isin(vals)
## Return filtered dataframe
df1[ind]
##Returns
A B C D
0 1.0 1 right 1
3 1.0 0 right 3
Here is a way to do it:
df.loc[df[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
UPDATE:
With values being the same across columns you could then do something like this:
# Create your filtering function:
def filter_dict(df, dic):
    return df[df[list(dic.keys())].apply(
        lambda x: x.equals(pd.Series(list(dic.values()), index=x.index, name=x.name)), axis=1)]
# Use it on your DataFrame:
filter_dict(df1, filter_v)
Which yields:
A B C D
3 1 0 right 3
If it something that you do frequently you could go as far as to patch DataFrame for an easy access to this filter:
pd.DataFrame.filter_dict_ = filter_dict
And then use this filter like this:
df1.filter_dict_(filter_v)
Which would yield the same result.
BUT, it is not the right way to do it, clearly.
I would use DSM's approach.
For Python 2, the approach in primer's answer is OK. But you should be careful in Python 3 because of dict_keys. For instance,
>> df.loc[df[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
>> TypeError: unhashable type: 'dict_keys'
The correct way in Python 3:
df.loc[df[list(filter_v.keys())].isin(list(filter_v.values())).all(axis=1), :]
Here's another way:
filterSeries = pd.Series(True, index=df.index)
for column, value in filter_v.items():
    filterSeries = ((df[column] == value) & filterSeries)
This gives:
>>> df[filterSeries]
A B C D
3 1 0 right 3
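The loop can also be folded into a single expression with functools.reduce. This is just a compact sketch of the same AND-combination, using the example df1 and filter_v from the question:

```python
from functools import reduce
import operator

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "A": [1, 0, 1, 1, np.nan],
    "B": [1, 1, 1, 0, 1],
    "C": ["right", "right", "wrong", "right", "right"],
    "D": [1, 2, 2, 3, 4],
})
filter_v = {"A": 1, "B": 0, "C": "right"}

# Build one boolean Series per filter and AND them all together.
mask = reduce(operator.and_,
              (df1[col] == val for col, val in filter_v.items()))
result = df1[mask]
print(result)
```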
To follow up on DSM's answer, you can also use any() to turn your query into an OR operation (instead of AND):
df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).any(axis=1)]
You can also create a query
query_string = ' and '.join(
[f'({key} == "{val}")' if type(val) == str else f'({key} == {val})' for key, val in filter_v.items()]
)
df1.query(query_string)
Combining previous answers, here's a function you can feed to df1.loc. Allows for AND/OR (using how='all'/'any'), plus it allows comparisons other than == using the op keyword, if desired.
import operator
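def quick_mask(df, filters, how='all', op=operator.eq) -> pd.Series:
    # op(...) yields a boolean DataFrame, so reduce it with the DataFrame methods
    if how == 'all':
        comb = pd.DataFrame.all
    elif how == 'any':
        comb = pd.DataFrame.any
    else:
        raise ValueError("how must be 'all' or 'any'")
    return comb(op(df[[*filters]], pd.Series(filters)), axis=1)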
# Usage
df1.loc[quick_mask(df1, filter_v)]
I had an issue due to my dictionary having multiple values for the same key.
I was able to change DSM's query to:
df1.loc[df1[list(filter_v)].isin(filter_v).all(axis=1), :]
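A runnable sketch of that form, assuming the filter dictionary maps each column name to a list of accepted values (the df1 below is the example frame from the question):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "A": [1, 0, 1, 1, np.nan],
    "B": [1, 1, 1, 0, 1],
    "C": ["right", "right", "wrong", "right", "right"],
    "D": [1, 2, 2, 3, 4],
})
# Each key is a column; each value is the list of accepted entries for it.
filter_v = {"A": [1], "B": [1, 0], "C": ["right"]}

# isin with a dict checks each column only against its own list.
result = df1.loc[df1[list(filter_v)].isin(filter_v).all(axis=1), :]
print(result)
```

Passing the dict itself, rather than its pooled values, is what keeps the check per-column.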