How can I create a function that squares the specific column value from a dataframe in pandas?
It should look like this:
def func(dataframe, column, value):
Suppose you have dataframe named df
Just create a function:
def func(data, column, val):
    return data[column] ** val
Now just call the function, passing the DataFrame, the column name and the exponent as parameters:
func(df, 'col3', 2)
By the way, you can do this without creating a function at all:
df['column name'] ** 2
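Putting the pieces together, a minimal runnable sketch (the column name `col3` is assumed for illustration):

```python
import pandas as pd

df = pd.DataFrame({'col3': [1, 2, 3]})

def func(data, column, val):
    # Raise every value in the given column to the power `val`
    return data[column] ** val

squared = func(df, 'col3', 2)  # Series: 1, 4, 9
```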
I suppose that you wanted to square only those values in the column given by the column parameter that are equal to the value parameter:
def func(dataframe, column, value):
    s = dataframe[column]
    dataframe[column] = s.mask(s == value, lambda x: x**2)
Note:
This function changes the DataFrame in place, so in accordance with Python conventions it returns None. (Why? Because there is no return statement.)
The explanation:
In the first command (in the body of the function definition) we assign the appropriate column (i.e. a series) to the variable s.
In the second command we apply the method .mask() with 2 arguments:
The first argument is a condition for using the second argument,
the second argument is a function (which is used only for elements satisfying the condition given in the first argument).
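A tiny standalone illustration of `.mask()` with a callable second argument (sample values assumed):

```python
import pandas as pd

s = pd.Series([1, 4, 2, 4])
# Where the condition holds, the element is replaced by the corresponding
# value from the callable's result; the callable receives the whole Series.
masked = s.mask(s == 4, lambda x: x ** 2)
# masked is now 1, 16, 2, 16
```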
A test:
>>> df
A B C D
0 4 4 3 4
1 3 3 4 4
2 4 4 2 2
3 3 2 3 4
4 4 2 4 3
>>> func(df, "D", 4)
>>> df
A B C D
0 4 4 3 16
1 3 3 4 16
2 4 4 2 2
3 3 2 3 16
4 4 2 4 3
Related
I want to find the values corresponding to a column such that no values in another column takes value greater than 3.
For example, in the following dataframe
df = pd.DataFrame({'a':[1,2,3,1,2,3,1,2,3], 'b':[4,5,6,4,5,6,4,5,6], 'c':[4,3,5,4,3,5,4,3,3]})
I want the values of column 'a' for which all the values of 'c' are greater than 3.
I think groupby is the correct way to do it. My code below comes close to it.
df.groupby('a')['c'].max()>3
a
1     True
2    False
3     True
Name: c, dtype: bool
The above code gives me a boolean Series. How can I get the values of 'a' for which it is True?
I want my output to be [1, 3].
Is there a better and more efficient way to get this on a very large DataFrame (more than 30 million rows)?
From your code I see that you actually want to output:
group keys for each group (df grouped by a),
where no value in c column (within the current group) is greater than 3.
In order to get some non-empty result, let's change the source DataFrame to:
a b c
0 1 4 4
1 2 5 1
2 3 6 5
3 1 4 4
4 2 5 2
5 3 6 5
6 1 4 4
7 2 5 2
8 3 6 3
For readability, let's group df by a and print each group.
The code to do it:
for key, grp in df.groupby('a'):
    print(f'\nGroup: {key}\n{grp}')
gives result:
Group: 1
a b c
0 1 4 4
3 1 4 4
6 1 4 4
Group: 2
a b c
1 2 5 1
4 2 5 2
7 2 5 2
Group: 3
a b c
2 3 6 5
5 3 6 5
8 3 6 3
And now take a look at each group.
Only group 2 meets the condition that each element in c column
is less than 3.
So you actually need groupby and filter, passing through only the groups meeting the above condition.
To get the full rows from the "good" groups, you can run:
df.groupby('a').filter(lambda grp: grp.c.lt(3).all())
getting:
a b c
1 2 5 1
4 2 5 2
7 2 5 2
But you want only values from a column, without repetitions.
So extend the above code to:
df.groupby('a').filter(lambda grp: grp.c.lt(3).all()).a.unique().tolist()
getting:
[2]
Note that your code: df.groupby('a')['c'].max() > 3 is wrong,
as it marks with True groups for which max is greater than 3
(instead of ">" there should be "<").
So an alternative solution is:
res = df.groupby('a')['c'].max()<3
res[res].index.tolist()
giving the same result.
Yet another solution can be based on a list comprehension:
[ key for key, grp in df.groupby('a') if grp.c.lt(3).all() ]
Details:
for key, grp in df.groupby('a') - creates groups,
if grp.c.lt(3).all() - filters groups,
key (at the start) - adds particular group key to the result.
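All three approaches can be checked against each other on the modified DataFrame from above (a quick sanity sketch):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'b': [4, 5, 6, 4, 5, 6, 4, 5, 6],
                   'c': [4, 1, 5, 4, 2, 5, 4, 2, 3]})

# filter-based solution
r1 = df.groupby('a').filter(lambda grp: grp.c.lt(3).all()).a.unique().tolist()

# boolean-indexing solution
res = df.groupby('a')['c'].max() < 3
r2 = res[res].index.tolist()

# list-comprehension solution
r3 = [key for key, grp in df.groupby('a') if grp.c.lt(3).all()]

# all three give [2]
```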
import pandas as pd

# Create DataFrame
df = pd.DataFrame({'a': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'b': [4, 5, 6, 4, 5, 6, 4, 5, 6],
                   'c': [4, 3, 5, 4, 3, 5, 4, 3, 3]})

# Function that returns the first value greater than 3, if any
def grt(x):
    for i in x:
        if i > 3:
            return i

# Group by column a and aggregate column c with grt
p = {'c': grt}
grp = df.groupby(['a']).agg(p)
print(grp)
I am using apply to leverage one dataframe to manipulate a second dataframe and return results. Here is a simplified example that I realize could be more easily answered with "in" logic, but for now let's keep the use of .apply() as a constraint:
import pandas as pd
df1 = pd.DataFrame({'Name':['A','B'],'Value':range(1,3)})
df2 = pd.DataFrame({'Name':['A']*3+['B']*4+['C'],'Value':range(1,9)})
def filter_df(x, df):
    return df[df['Name'] == x['Name']]
df1.apply(filter_df, axis=1, args=(df2, ))
Which is returning:
0 Name Value
0 A 1
1 A 2
2 ...
1 Name Value
3 B 4
4 B 5
5 ...
dtype: object
What I would like to see instead is one formatted DataFrame with Name and Value headers. All advice appreciated!
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
In my opinion, this cannot be done solely with apply; you also need pandas.concat:
result = pd.concat(df1.apply(filter_df, axis=1, args=(df2,)).to_list())
print(result)
Output
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
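For completeness, the "in" logic the question set aside gives the same result without apply at all (a sketch on the same data):

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['A', 'B'], 'Value': range(1, 3)})
df2 = pd.DataFrame({'Name': ['A'] * 3 + ['B'] * 4 + ['C'], 'Value': range(1, 9)})

# Keep only the rows of df2 whose Name appears somewhere in df1
result = df2[df2['Name'].isin(df1['Name'])]
```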
I just tried filtering my dataframe using the following expression:
df[df.count >= df.count.quantile(.95)]
It returned the error:
AttributeError: 'function' object has no attribute 'quantile'
But bracketing the series works fine:
df[df['count'] >= df['count'].quantile(.95)]
This isn't the first time I've gotten different results based on this distinction, though usually it doesn't matter, and I always thought the two were identical objects.
Why does this happen?
Because count is one of the DataFrame's built-in methods, dot access resolves to the method instead of the column count; i.e., dot access prioritizes built-in attributes and methods over columns:
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [2, 3, 4],
    'count': [4, 5, 6]
})
df.count()
#A 3
#B 3
#count 3
#dtype: int64
df.count
# V V V V V V V V
#<bound method DataFrame.count of A B count
#0 1 2 4
#1 2 3 5
#2 3 4 6>
Another distinction between dot and bracket notation is that you cannot use dot to create a new column. If the column doesn't exist, df.column = ... won't work; you have to use brackets, as in df[column] = .... Using the above dummy data frame:
# original data frame
df
# A B count
#0 1 2 4
#1 2 3 5
#2 3 4 6
Using dot to create a new column won't work; C is set as an attribute instead of a column:
df.C = 2
df
# A B count
#0 1 2 4
#1 2 3 5
#2 3 4 6
While bracket is the standard way to add a new column to the data frame:
df['C'] = 2
df
# A B count C
#0 1 2 4 2
#1 2 3 5 2
#2 3 4 6 2
If a column already exists, it's valid to modify it with dot, provided the data frame doesn't have an attribute or method with the same name (as is the case with count above):
df.B = 3
df
# A B count C
#0 1 3 4 2
#1 2 3 5 2
#2 3 3 6 2
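The attribute-vs-column pitfall can be verified directly (a quick sketch; recent pandas versions also emit a UserWarning on the attribute assignment):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})

df.C = 2                                  # sets a plain attribute, not a column
has_c_after_dot = 'C' in df.columns       # False

df['C'] = 2                               # brackets actually create the column
has_c_after_bracket = 'C' in df.columns   # True
```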
So I have an extremely simple dataframe:
values
1
1
1
2
2
I want to add a new column and for each row assign the number of times its value occurs, so the table would look like:
values unique_sum
1 3
1 3
1 3
2 2
2 2
I have seen some examples in R, but for Python and pandas I have not come across anything and am stuck. I can list the value counts using .value_counts() and I have tried groupby routines but cannot fathom it.
Just use map to map your column onto its value_counts:
>>> x
A
0 1
1 1
2 1
3 2
4 2
>>> x['unique'] = x.A.map(x.A.value_counts())
>>> x
A unique
0 1 3
1 1 3
2 1 3
3 2 2
4 2 2
(I named the column A instead of values. values is not a great choice for a column name, because DataFrames have a special attribute called values, which prevents you from getting the column with x.values --- you'd have to use x['values'] instead.)
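An equivalent, also idiomatic way is groupby with transform (keeping the assumed column name A):

```python
import pandas as pd

x = pd.DataFrame({'A': [1, 1, 1, 2, 2]})
# transform('size') broadcasts each group's row count back onto the original rows
x['unique'] = x.groupby('A')['A'].transform('size')
# x['unique'] is now 3, 3, 3, 2, 2
```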
Given a DataFrame like this:
>>> df
0 1 2
0 2 3 5
1 3 4 7
and a function that returns multiple results, like this:
def sumprod(x, y, z):
    return x + y + z, x * y * z
I want to add new columns, so the result would be:
>>> df
0 1 2 sum prod
0 2 3 5 10 30
1 3 4 7 14 84
I have been successful with functions that return one result:
df["sum"] = df.apply(sum, axis=1)
but not with functions that return more than one result.
One way to do this is to pass the columns of the DataFrame to the function by unpacking the transpose of the array:
>>> df['sum'], df['prod'] = sumprod(*df.values.T)
>>> df
0 1 2 sum prod
0 2 3 5 10 30
1 3 4 7 14 84
sumprod returns a tuple of columns and, since Python supports multiple assignment, you can assign them to new column labels as above.
You could write df['sum'], df['prod'] = sumprod(df[0], df[1], df[2]) to get the same result. This is clearer and is preferable if you need to pass the columns to the function in a particular order. On the other hand, it's a lot more verbose if you have a lot of columns to pass to the function.
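Another option, if you prefer to stay with row-wise apply, is `result_type='expand'` (a sketch on the same data):

```python
import pandas as pd

df = pd.DataFrame([[2, 3, 5], [3, 4, 7]])

def sumprod(x, y, z):
    return x + y + z, x * y * z

# result_type='expand' turns each returned tuple into a row of a new frame,
# which can then be assigned to two new columns at once
df[['sum', 'prod']] = df.apply(lambda r: sumprod(r[0], r[1], r[2]),
                               axis=1, result_type='expand')
```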