pandas Dataframe create new column - python

I have this snippet of the code working with pandas dataframe, i am trying to use the apply function to create a new column called STDEV_TV but i keep running into this error all the columns i am working with are type float
TypeError: ("'float' object is not iterable", 'occurred at index 0')
Can someone help me understand why i keep getting this error
def sigma(df):
val = df.volume2Sum / df.volumeSum - df.vwap * df.vwap
return math.sqrt(max(val))
df['STDEV_TV'] = df.apply(sigma, axis=1)

Try:
import pandas as pd
import numpy as np
import math
df = pd.DataFrame(np.random.randint(1, 10, (5, 3)),
columns=['volume2Sum', 'volumeSum', 'vwap'])
def sigma(df):
val = df.volume2Sum / df.volumeSum - df.vwap * df.vwap
return math.sqrt(val) if val >= 0 else val
df['STDEV_TV'] = df.apply(sigma, axis=1)
Output:
>>> df
volume2Sum volumeSum vwap STDEV_TV
0 4 5 8 -63.200000
1 2 8 4 -15.750000
2 3 3 3 -8.000000
3 8 3 4 -13.333333
4 4 2 3 -7.000000

You need to apply sigma to each set of values not the whole DataFrame.
I would use a lambda function, eg:
def sigma(volume2Sum, volumeSum, vwap):
val = volume2Sum / volumeSum - vwap * vwap
return math.sqrt(val)
df['STDEV_TV'] = df.apply(lambda x: sigma(x.volume2Sum, x.volumeSum, x.vwap), axis=1)
That should put val into the STDEV_TV column and you can find the max value separately.
Take care you not to take the squareroot of a negative number.

You function sigma gives you one number as a result. Because, the first step you find the maximum:
max(val)
and it's only the one number...
After that you try uses you function for data series.
You should use in your code this last string:
df['STDEV_TV'] = sigma(df)
It will be working

Change
return math.sqrt(max(val))
to
return math.sqrt(max(val)) if isinstance(val, pd.Series) else (math.sqrt(val) if val >= 0 else val)
max() iterates over an iterable and find the maximum value. The problem here is since you're applying sigma to every row, local variable val is a float, not a list, so what you have similar to max(1.3).

Related

Apply a function row by row using other dataframes' rows as list inputs in python

I'm trying to apply a function row-by-row which takes 5 inputs, 3 of which are lists. I want these lists to come from each row of 3 correspondings dataframes.
I've tried using 'apply' and 'lambda' as follows:
sol['tf_dd']=sol.apply(lambda tsol, rfsol, rbsol:
taurho_difdif(xy=xy,
l=l,
t=tsol,
rf=rfsol,
rb=rbsol),
axis=1)
However I get the error <lambda>() missing 2 required positional arguments: 'rfsol' and 'rbsol'
The DataFrame sol and the DataFrames tsol, rfsol and rbsol all have the same length. For each row, I want the entire row from tsol, rfsol and rbsol to be input as three lists.
Here is much simplified example (first with single lists, which I then want to replicate row by row with dataframes):
The output with single lists is a single value (120). With dataframes as inputs I want an output dataframe of length 10 where all values are 120.
t=[1,2,3,4,5]
rf=[6,7,8,9,10]
rb=[11,12,13,14,15]
def simple_func(t, rf, rb):
x=sum(t)
y=sum(rf)
z=sum(rb)
return x+y+z
out=simple_func(t,rf,rb)
# dataframe rows as lists
tsol=pd.DataFrame((t,t,t,t,t,t,t,t,t,t))
rfsol=pd.DataFrame((rf,rf,rf,rf,rf,rf,rf,rf,rf,rf))
rbsol=pd.DataFrame((rb,rb,rb,rb,rb,rb,rb,rb,rb,rb))
out2 = pd.DataFrame(index=range(len(tsol)), columns=['output'])
out2['output'] = out2.apply(lambda tsol, rfsol, rbsol:
simple_func(t=tsol.tolist(),
rf=rfsol.tolist(),
rb=rbsol.tolist()),
axis=1)
Try to use "name" field in Series Type to get index value, and then get the same index for the other DataFrame
import pandas as pd
import numpy as np
def postional_sum(inot, df1, df2, df3):
"""
Get input index and gather the same position for the other DataFrame collection
"""
position = inot.name
x = df1.iloc[position].sum()
y = df2.iloc[position].sum()
z = df3.iloc[position].sum()
return x + y + z
# dataframe rows as lists
tsol = pd.DataFrame(np.random.randn(10, 5), columns=range(5))
rfsol = pd.DataFrame(np.random.randn(10, 5), columns=range(5))
rbsol = pd.DataFrame(np.random.randn(10, 5), columns=range(5))
out2 = pd.DataFrame(index=range(len(tsol)), columns=["output"])
out2["output"] = out2.apply(lambda x: postional_sum(x, tsol, rfsol, rbsol), axis=1)
out2
Hope this helps!
When you run df.apply() with axis=1, it does not pass on the columns as individual arguments to the function, but as a Series object, as explained here. The correct way to do this would be
out2['output'] = out2.apply(lambda row:
simple_func(t=row["tsol"],
rf=row["rfsol"],
rb=row["rbsol"]),
axis=1)
You can eliminate the simple function using this:
out2["output"] = tsol.sum(axis=1) + rfsol.sum(axis=1) + rbsol.sum(axis=1)
Here is the complete code:
t=[1,2,3,4,5]
rf=[6,7,8,9,10]
rb=[11,12,13,14,15]
# dataframe rows as lists
tsol=pd.DataFrame((t,t,t,t,t,t,t,t,t,t))
rfsol=pd.DataFrame((rf,rf,rf,rf,rf,rf,rf,rf,rf,rf))
rbsol=pd.DataFrame((rb,rb,rb,rb,rb,rb,rb,rb,rb,rb))
out2 = pd.DataFrame(index=range(len(tsol)), columns=["output"])
out2["output"] = tsol.sum(axis=1) + rfsol.sum(axis=1) + rbsol.sum(axis=1)
print(out2)
OUTPUT:
output
0 120
1 120
2 120
3 120
4 120
5 120
6 120
7 120
8 120
9 120

I want to make a function if the common key is found in both dataframes

I have two dataframes df1 and df2 each of them have column containing product code and product price, I wanted to check the difference between prices in the 2 dataframes and store the result of this function I created in a new dataframe "df3" containing the product code and the final price, Here is my attempt :
Function to calculate the difference in the way I want:
def range_calc(z, y):
final_price = pd.DataFrame(columns = ["Market_price"])
res = z-y
abs_res = abs(res)
if abs_res == 0:
return (z)
if z>y:
final_price = (abs_res / z ) * 100
else:
final_price = (abs_res / y ) * 100
return(final_price)
For loop I created to check the two df and use the function:
Last_df = pd.DataFrame(columns = ["Product_number", "Market_Price"])
for i in df1["product_ID"]:
for x in df2["product_code"]:
if i == x:
Last_df["Product_number"] = i
Last_df["Market_Price"] = range_calc(df1["full_price"],df2["tot_price"])
The problem is that I am getting this error every time:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Why you got the error message "The truth value of a Series is ambiguous"
You got the error message The truth value of a Series is ambiguous
because you tried input a pandas.Series into an if-clause
nums = pd.Series([1.11, 2.22, 3.33])
if nums == 0:
print("nums == zero (nums is equal to zero)")
else:
print("nums != zero (nums is not equal to zero)")
# AN EXCEPTION IS RAISED!
The Error Message is something like the following:
ValueError: The truth value of a Series is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().
Somehow, you got a Series into the inside of the if-clause.
Actually, I know how it happened, but it will take me a moment to explain:
Well, suppose that you want the value out of row 3 and column 4 of a pandas dataframe.
If you attempt to extract a single value out of a pandas table in a specific row and column, then that value is sometimes a Series object, not a number.
Consider the following example:
# DATA:
# Name Age Location
# 0 Nik 31 Toronto
# 1 Kate 30 London
# 2 Evan 40 Kingston
# 3 Kyra 33 Hamilton
To create the dataframe above, we can write:
df = pd.DataFrame.from_dict({
'Name': ['Nik', 'Kate', 'Evan', 'Kyra'],
'Age': [31, 30, 40, 33],
'Location': ['Toronto', 'London', 'Kingston', 'Hamilton']
})
Now, let us try to get a specific row of data:
evans_row = df.loc[df['Name'] == 'Evan']
and we try to get a specific value out of that row of data:
evans_age = evans_row['Age']
You might think that evans_age is the integer 40, but you would be wrong.
Let us see what evans_age really is:
print(80*"*", "EVAN\'s AGE", type(Evans_age), sep="\n")
print(Evans_age)
We have:
EVAN's AGE
<class 'pandas.core.series.Series'>
2 40
Name: Age, dtype: int64
Evan's Age is not a number.
evans_age is an instance of the class stored as pandas.Series
After extracting a single cell out of a pandas dataframe you can write .tolist()[0] to extract the number out of that cell.
evans_real_age = evans_age.tolist()[0]
print(80*"*", "EVAN\'s REAL AGE", type(evans_real_age), sep="\n")
print(evans_real_age)
EVAN's REAL AGE
<class 'numpy.int64'>
40
The exception in your original code was probably thrown by if abs_res == 0.
If abs_res is a pandas.Series then abs_res == 0 returns another Series.
There is no way to compare if an entire list of numbers is equal to zero.
Normally people just enter one input to an if-clause.
if (912):
print("912 is True")
else:
print("912 is False")
When an if-statement receives more than one value, then the python interpreter does not know what to do.
For example, what should the following do?
import pandas as pd
data = pd.Series([1, 565, 120, 12, 901])
if data:
print("data is true")
else:
print("data is false")
You should only input one value into an if-condition. Instead, you entered a pandas.Series object as input to the if-clause.
In your case, the pandas.Series only had one number in it. However, in general, pandas.Series contain many values.
The authors of the python pandas library assume that a series contains many numbers, even if it only has one.
The computer thought that you tired to put many different numbers inside of one single if-clause.
The difference between a "function definition" and a "function call"
Your original question was,
"I want to make a function if the common key is found"
Your use of the phrase "make a function" is incorrect. You probably meant, "I want to call a function if a common key is found."
The following are all examples of function "calls":
import pandas as pd
import numpy as np
z = foo(1, 91)
result = funky_function(811, 22, "green eggs and ham")
output = do_stuff()
data = np.random.randn(6, 4)
df = pd.DataFrame(data, index=dates, columns=list("ABCD"))
Suppose that you have two containers.
If you truly want to "make" a function if a common key is found, then you would have code like the following:
dict1 = {'age': 26, 'phone':"303-873-9811"}
dict2 = {'name': "Bob", 'phone':"303-873-9811"}
def foo(dict1, dict2):
union = set(dict2.keys()).intersection(set(dict1.keys()))
# if there is a shared key...
if len(union) > 0:
# make (create) a new function
def bar(*args, **kwargs):
pass
return bar
new_function = foo(dict1, dict2)
print(new_function)
If you are not using the def keyword, that is known as a function call
In python, you "make" a function ("define" a function) with the def keyword.
I think that your question should be re-titled.
You could write, "How do I call a function if two pandas dataframes have a common key?"
A second good question be something like,
"What went wrong if we see the error message, ValueError: The truth value of a Series is ambiguous.?"
Your question was worded strangely, but I think I can answer it.
Generating Test Data
Your question did not include test data. If you ask a question on stack overflow again, please provide a small example of some test data.
The following is an example of data we can use:
product_ID full_price
0 prod_id 1-1-1-1 11.11
1 prod_id 2-2-2-2 22.22
2 prod_id 3-3-3-3 33.33
3 prod_id 4-4-4-4 44.44
4 prod_id 5-5-5-5 55.55
5 prod_id 6-6-6-6 66.66
6 prod_id 7-7-7-7 77.77
------------------------------------------------------------
product_code tot_price
0 prod_id 3-3-3-3 34.08
1 prod_id 4-4-4-4 45.19
2 prod_id 5-5-5-5 56.30
3 prod_id 6-6-6-6 67.41
4 prod_id 7-7-7-7 78.52
5 prod_id 8-8-8-8 89.63
6 prod_id 9-9-9-9 100.74
Products 1 and 2 are unique to data-frame 1
Products 8 and 9 are unique to data-frame 2
Both data-frames contain data for products 3, 4, 5, ..., 7.
The prices are slightly different between data-frames.
The test data above is generated by the following code:
import pandas as pd
from copy import copy
raw_data = [
[
"prod_id {}-{}-{}-{}".format(k, k, k, k),
int("{}{}{}{}".format(k, k, k, k))/100
] for k in range(1, 10)
]
raw_data = [row for row in raw_data]
df1 = pd.DataFrame(data=copy(raw_data[:-2]), columns=["product_ID", "full_price"])
df2 = pd.DataFrame(data=copy(raw_data[2:]), columns=["product_code", "tot_price"])
for rowid in range(0, len(df2.index)):
df2.at[rowid, "tot_price"] += 0.75
print(df1)
print(60*"-")
print(df2)
Add some error checking
It is considered to be "best-practice" to make sure that your function
inputs are in the correct format.
You wrote a function named range_calc(z, y). I reccomend making sure that z and y are integers, and not something else (such as a pandas Series object).
def range_calc(z, y):
try:
z = float(z)
y = float(y)
except ValueError:
function_name = inspect.stack()[0][3]
with io.StringIO() as string_stream:
print(
"Error: In " + function_name + "(). Inputs should be
like decimal numbers.",
"Instead, we have: " + str(type(y)) + " \'" +
repr(str(y))[1:-1] + "\'",
file=string_stream,
sep="\n"
)
err_msg = string_stream.getvalue()
raise ValueError(err_msg)
# DO STUFF
return
Now we get error messages:
import pandas as pd
data = pd.Series([1, 565, 120, 12, 901])
range_calc("I am supposed to be an integer", data)
# ValueError: Error in function range_calc(). Inputs should be like
decimal numbers.
# Instead, we have: <class 'str'> "I am supposed to be an integer"
Code which Accomplishes what you Wanted.
The following is some rather ugly code which computes what you wanted:
# You can continue to use your original `range_calc()` function unmodified
# Use the test data I provided earlier in this answer.
def foo(df1, df2):
last_df = pd.DataFrame(columns = ["Product_number", "Market_Price"])
df1_ids = set(df1["product_ID"].tolist())
df2_ids = set(df2["product_code"].tolist())
pids = df1_ids.intersection(df2_ids) # common_product_ids
for pid in pids:
row1 = df1.loc[df1['product_ID'] == pid]
row2 = df2.loc[df2["product_code"] == pid]
price1 = row1["full_price"].tolist()[0]
price2 = row2["tot_price"].tolist()[0]
price3 = range_calc(price1, price2)
row3 = pd.DataFrame([[pid, price3]], columns=["Product_number", "Market_Price"])
last_df = pd.concat([last_df, row3])
return last_df
# ---------------------------------------
last_df = foo(df1, df2)
The result is:
Product_number Market_Price
0 prod_id 6-6-6-6 1.112595
0 prod_id 7-7-7-7 0.955171
0 prod_id 4-4-4-4 1.659659
0 prod_id 5-5-5-5 1.332149
0 prod_id 3-3-3-3 2.200704
Note that one of many reasons that my solution is ugly is in the following line of code:
last_df = pd.concat([last_df, row3])
if last_df is large (thousands of rows), then the code will run very slowly.
This is because instead of inserting a new row of data, we:
copy the original dataframe
append a new row of data to the copy.
delete/destroy the original data-frame.
It is really silly to copy 10,000 rows of data only to add one new value, and then delete the old 10,000 rows.
However, my solution has fewer bugs than your original code, relatively speaking.
sometimes when you check a condition on series or dataframes, your output is a series such as ( , False).
In this case you must use any, all, item,...
use print function for your condition to see the series.
Also I must tell your code is very very slow and has O(n**2). You can first calculate df3 as joining df1 and df2 then using apply method for fast calculating.

Check if a pandas Dataframe string column contains all the elements given in an array

I have a dataframe as shown below:
>>> import pandas as pd
>>> df = pd.DataFrame(data = [['app;',1,2,3],['app; web;',4,5,6],['web;',7,8,9],['',1,4,5]],columns = ['a','b','c','d'])
>>> df
a b c d
0 app; 1 2 3
1 app; web; 4 5 6
2 web; 7 8 9
3 1 4 5
I have an input array that looks like this: ["app","web"]
For each of these values I want to check against a specific column of a dataframe and return a decision as shown below:
>>> df.a.str.contains("app")
0 True
1 True
2 False
3 False
Since str.contains only allows me to look for an individual value, I was wondering if there's some other direct way to determine the same something like:
df.a.str.contains(["app","web"]) # Returns TypeError: unhashable type: 'list'
My end goal is not to do an absolute match (df.a.isin(["app", "web"]) but rather a 'contains' logic that says return true even if it has those characters present in that cell of data frame.
Note: I can of course use apply method to create my own function for the same logic such as:
elementsToLookFor = ["app","web"]
df[header] = df.apply(lambda element: all([a in element for a in elementsToLookFor]))
But I am more interested in the optimal algorithm for this and so prefer to use a native pandas function within pandas, or else the next most optimized custom solution.
This should work too:
l = ["app","web"]
df['a'].str.findall('|'.join(l)).map(lambda x: len(set(x)) == len(l))
also this should work as well:
pd.concat([df['a'].str.contains(i) for i in l],axis=1).all(axis = 1)
so many solutions, which one is the most efficient
The str.contains-based answers are generally fastest, though str.findall is also very fast on smaller dfs:
values = ['app', 'web']
pattern = ''.join(f'(?=.*{value})' for value in values)
def replace_dummies_all(df):
return df.a.str.replace(' ', '').str.get_dummies(';')[values].all(1)
def findall_map(df):
return df.a.str.findall('|'.join(values)).map(lambda x: len(set(x)) == len(values))
def lower_contains(df):
return df.a.astype(str).str.lower().str.contains(pattern)
def contains_concat_all(df):
return pd.concat([df.a.str.contains(l) for l in values], axis=1).all(1)
def contains(df):
return df.a.str.contains(pattern)
Try with str.get_dummies
df.a.str.replace(' ','').str.get_dummies(';')[['web','app']].all(1)
0 False
1 True
2 False
3 False
dtype: bool
Update
df['a'].str.contains(r'^(?=.*web)(?=.*app)')
Update 2: (To ensure case insenstivity doesn't matter and the column dtype is str without which the logic may fail):
elementList = ['app','web']
for eachValue in elementList:
valueString += f'(?=.*{eachValue})'
df[header] = df[header].astype(str).str.lower() #To ensure case insenstivity and the dtype of the column is string
result = df[header].str.contains(valueString)

How to implement python custom function on dictionary of dataframes

I have a dictionary that contains 3 dataframes.
How do I implement a custom function to each dataframes in the dictionary.
In simpler terms, I want to apply the function find_outliers as seen below
# User defined function : find_outliers
#(I)
from scipy import stats
outlier_threshold = 1.5
ddof = 0
def find_outliers(s: pd.Series):
outlier_mask = np.abs(stats.zscore(s, ddof=ddof)) > outlier_threshold
# replace boolean values with corresponding strings
return ['background-color:blue' if val else '' for val in outlier_mask]
To the dictionary of dataframes dict_of_dfs below
# the dataset
import numpy as np
import pandas as pd
df = {
'col_A':['A_1001', 'A_1001', 'A_1001', 'A_1001', 'B_1002','B_1002','B_1002','B_1002','D_1003','D_1003','D_1003','D_1003'],
'col_X':[110.21, 191.12, 190.21, 12.00, 245.09,4321.8,122.99,122.88,134.28,148.14,161.17,132.17],
'col_Y':[100.22,199.10, 191.13,199.99, 255.19,131.22,144.27,192.21,7005.15,12.02,185.42,198.00],
'col_Z':[140.29, 291.07, 390.22, 245.09, 4122.62,4004.52,395.17,149.19,288.91,123.93,913.17,1434.85]
}
df = pd.DataFrame(df)
df
#dictionary_of_dataframes
#(II)
dict_of_dfs=dict(tuple(df.groupby('col_A')))
and lastly, flag outliers in each df of the dict_of_dfs
# end goal is to have find/flag outliers in each `df` of the `dict_of_dfs`
#(III)
desired_cols = ['col_X','col_Y','col_Z']
dict_of_dfs.style.apply(find_outliers, subset=desired_cols)
summarily, I want to apply I to II and finally flag outliers in III
Thanks for your attempt. :)
Desired output should look like this, but for the three dataframes
This may not be what you want, but this is how I'd approach it, but you'll have to work out the details of the function because you have it written to receive a series rather a dataframe. Groupby apply() will send the subsets of rows and then you can perform the actions on that subset and return the result.
For consideration:
inside the function you may be able to handle all columns like so:
def find_outliers(x):
for col in ['col_X','col_Y','col_Z']:
outlier_mask = np.abs(stats.zscore(x[col], ddof=ddof)) > outlier_threshold
x[col] = ['outlier' if val else '' for val in outlier_mask]
return x
newdf = df.groupby('col_A').apply(find_outliers)
col_A col_X col_Y col_Z
0 A_1001 outlier
1 A_1001
2 A_1001
3 A_1001 outlier
4 B_1002 outlier
5 B_1002 outlier
6 B_1002
7 B_1002
8 D_1003 outlier
9 D_1003
10 D_1003

Pass a function with parameters specified for resample() method on a pandas dataframe

I want to pass a function to resample() on a pandas dataframe with certain parameters specified when it is passed (as opposed to defining several separate functions).
This is the function
import itertools
def spell(X, kind='wet', how='mean', threshold=0.5):
if kind=='wet':
condition = X>threshold
else:
condition = X<=threshold
length = [sum(1 if x==True else nan for x in group) for key,group in itertools.groupby(condition)]
if not length:
res = 0
elif how=='mean':
res = np.mean(length)
else:
res = np.max(length)
return res
here is a dataframe
idx = pd.DatetimeIndex(start='1960-01-01', periods=100, freq='d')
values = np.random.random(100)
df = pd.DataFrame(values, index=idx)
And heres sort of what I want to do with it
df.resample('M', how=spell(kind='dry',how='max',threshold=0.7))
But I get the error TypeError: spell() takes at least 1 argument (3 given). I want to be able to pass this function with these parameters specified except for the input array. Is there a way to do this?
EDIT:
X is the input array that is passed to the function when calling the resample method on a dataframe object like so df.resample('M', how=my_func) for a monthly resampling interval.
If I try df.resample('M', how=spell) I get:
0
1960-01-31 1.875000
1960-02-29 1.500000
1960-03-31 1.888889
1960-04-30 3.000000
which is exactly what I want for the default parameters but I want to be able to specify the input parameters to the function before passing it. This might include storing the definition in another variable but I'm not sure how to do this with the default parameters changed.
I think this may be what you're looking for, though it's a little hard to tell.. Let me know if this helps. First, the example dataframe:
idx = pd.DatetimeIndex(start='1960-01-01', periods=100, freq='d')
values = np.random.random(100)
df = pd.DataFrame(values, index=idx)
EDIT- had a greater than instead of less than or equal to originally...
Next, the function:
def spell(df, column='', kind='wet', rule='M', how='mean', threshold=0.5):
if kind=='wet':
df = df[df[column] > threshold]
else:
df = df[df[column] <= threshold]
df = df.resample(rule=rule, how=how)
return df
So, you would call it by:
spell(df, 0)
To get:
0
1960-01-31 0.721519
1960-02-29 0.754054
1960-03-31 0.746341
1960-04-30 0.654872
You can change around the parameters as well:
spell(df, 0, kind='something else', rule='W', how='max', threshold=0.7)
0
1960-01-03 0.570638
1960-01-10 0.529357
1960-01-17 0.565959
1960-01-24 0.682973
1960-01-31 0.676349
1960-02-07 0.379397
1960-02-14 0.680303
1960-02-21 0.654014
1960-02-28 0.546587
1960-03-06 0.699459
1960-03-13 0.626460
1960-03-20 0.611464
1960-03-27 0.685950
1960-04-03 0.688385
1960-04-10 0.697602

Categories

Resources