Inner function with "not enough values to unpack" - python

I have been stuck on an attempt at an inner function, and after a lot of changes I'm still seeing the same error when I run it.
My function code is below:
def test(name, df, col='', col2=''):
    format_type = list(df[col])
    d = {name: pd.DataFrame() for name in format_type}  # create two sub-dataframes
    for name, df, col2 in d.items():  # look at one sub-df at a time
        df['revenue_share'] = (df[col2] / df[col2].sum()) * 100  # calculate revenue_share of each line
        print(df['revenue_share'])

        def function(df, col3='revenue_share'):  # separate the companies into groups depending on their rev_share
            if df[col3] < 0.5:
                return 'longtail'
            else:
                return df['company_name']

        df['company_name'] = df.apply(lambda x: function(x[col2], x), axis=1)  # create a new column with the group name
        return df
and the traceback when I run test(print,format_company_df,col='format',col2='buyer_spend'):
ValueError Traceback (most recent call last)
<ipython-input-42-19c1a2b58a26> in <module>
----> 1 test(display,miq,col='format',col2='buyer_spend')
2
<ipython-input-41-5380164aff21> in test(name, df, col, col2)
5 d = {name: pd.DataFrame() for name in format_type} #create two sub_dataframe - filtered by format (display or video)
6
----> 7 for name, df, col2 in d.items(): #look at display or video df
8
9 df['revenue_share']= (df[col2]/df[col2].sum())*100 #calculate revenue_share of each line
ValueError: not enough values to unpack (expected 3, got 2)
Thanks a lot for your help!

d is a dictionary. d.items() yields one (key, value) pair at a time, so each item can only be unpacked into 2 variables.
for name, df, col2 in d.items():
Here you are trying to unpack each pair into 3 variables, and that is exactly what the error is saying.
for df_name, sub_df in d.items():
This should work.
It has nothing to do with inner functions.
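The unpacking rule can be seen with a few lines of standalone Python (toy dictionary, not the asker's data):

```python
# dict.items() yields 2-tuples, so each item unpacks into exactly two names.
d = {"display": [1, 2], "video": [3]}

for key, value in d.items():      # two names: works
    print(key, len(value))

try:
    for a, b, c in d.items():     # three names: same ValueError as the question
        pass
except ValueError as e:
    print(e)                      # not enough values to unpack (expected 3, got 2)
```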


How to tell if a pandas date time difference is null?

I need to fill in missing dates in a pandas data frame. The dataframe consists of weekly sales data for multiple items. I am looping through each item to see if there are missing weeks of dates with the intention of filling in those dates with a '0' for sales and all other information copied down.
I use the following code to find the missing dates:
pd.date_range(start="2017-01-13", end="2022-12-16", freq = "W-SAT").difference(df_['week_date'])
While I can print the missing dates and search manually for the few items that are missing sales weeks, I have not found a way to do this programmatically.
I tried
for item in df['ord_base7'].unique():
    df_ = df[df['ord_base7'] == item]
    if pd.date_range(start="2017-01-13", end="2022-12-16", freq="W-SAT").difference(df_['week_date']).isnan() == True:
        pass
    else:
        print(item, pd.date_range(start="2017-01-13", end="2022-12-16", freq="W-SAT").difference(df_['week_date']))
That yielded the error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/tmp/ipykernel_55320/2582723605.py in <module>
1 for item in df['ord_base7'].unique():
2 df_ = df[df['ord_base7'] == item]
----> 3 if pd.date_range(start="2017-01-13", end="2022-12-16", freq = "W-SAT").difference(df_['week_date']).isnan() == True:
4 pass
5 else:
AttributeError: 'DatetimeIndex' object has no attribute 'isnan'
How can I program a way to see if there are no dates missing so those items can be passed over?
Looping over a pandas dataframe is not a good idea because it's inefficient. Just use .fillna() and pass in whatever value you want in place of NaN:
df['week_date'].fillna(0)
Nevermind... I just tried the following and it worked.
for item in df['ord_base7'].unique():
    df_ = df[df['ord_base7'] == item]
    if pd.date_range(start="2017-01-13", end="2022-12-16", freq="W-SAT").difference(df_['week_date']).empty:
        pass
    else:
        print(item, pd.date_range(start="2017-01-13", end="2022-12-16", freq="W-SAT").difference(df_['week_date']))
Checking .empty is how to do this with a DatetimeIndex.
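A minimal sketch of the .empty check, using made-up weekly dates rather than the asker's sales data:

```python
import pandas as pd

# Five consecutive week-ending Saturdays, with one deleted to simulate a gap.
full_range = pd.date_range(start="2023-01-07", end="2023-02-04", freq="W-SAT")
observed = pd.Series(full_range.delete(2))

missing = full_range.difference(observed)
print(missing.empty)      # False: one week is absent, so this item gets printed

complete = full_range.difference(pd.Series(full_range))
print(complete.empty)     # True: nothing missing, so this item can be passed over
```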

I want to make a function if the common key is found in both dataframes

I have two dataframes, df1 and df2, each of which has a column containing a product code and a product price. I want to check the difference between the prices in the two dataframes and store the result of the function I created in a new dataframe "df3" containing the product code and the final price. Here is my attempt:
Function to calculate the difference in the way I want:
def range_calc(z, y):
    final_price = pd.DataFrame(columns=["Market_price"])
    res = z - y
    abs_res = abs(res)
    if abs_res == 0:
        return z
    if z > y:
        final_price = (abs_res / z) * 100
    else:
        final_price = (abs_res / y) * 100
    return final_price
For loop I created to check the two df and use the function:
Last_df = pd.DataFrame(columns=["Product_number", "Market_Price"])
for i in df1["product_ID"]:
    for x in df2["product_code"]:
        if i == x:
            Last_df["Product_number"] = i
            Last_df["Market_Price"] = range_calc(df1["full_price"], df2["tot_price"])
The problem is that I am getting this error every time:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Why you got the error message "The truth value of a Series is ambiguous"
You got the error message The truth value of a Series is ambiguous because you tried to pass a pandas.Series into an if-clause:
nums = pd.Series([1.11, 2.22, 3.33])

if nums == 0:
    print("nums == zero (nums is equal to zero)")
else:
    print("nums != zero (nums is not equal to zero)")
# AN EXCEPTION IS RAISED!
The error message is something like the following:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Somehow, a Series ended up inside your if-clause.
I know how it happened, but it will take a moment to explain.
Suppose that you want the value in row 3, column 4 of a pandas dataframe.
If you extract a single value from a specific row and column of a pandas dataframe, that value is sometimes a Series object, not a number.
Consider the following example:
# DATA:
# Name Age Location
# 0 Nik 31 Toronto
# 1 Kate 30 London
# 2 Evan 40 Kingston
# 3 Kyra 33 Hamilton
To create the dataframe above, we can write:
df = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Evan', 'Kyra'],
    'Age': [31, 30, 40, 33],
    'Location': ['Toronto', 'London', 'Kingston', 'Hamilton']
})
Now, let us try to get a specific row of data:
evans_row = df.loc[df['Name'] == 'Evan']
and we try to get a specific value out of that row of data:
evans_age = evans_row['Age']
You might think that evans_age is the integer 40, but you would be wrong.
Let us see what evans_age really is:
print(80*"*", "EVAN\'s AGE", type(evans_age), sep="\n")
print(evans_age)
We have:
EVAN's AGE
<class 'pandas.core.series.Series'>
2 40
Name: Age, dtype: int64
Evan's age is not a number.
evans_age is an instance of the pandas.Series class.
After extracting a single cell out of a pandas dataframe you can write .tolist()[0] to extract the number out of that cell.
evans_real_age = evans_age.tolist()[0]
print(80*"*", "EVAN\'s REAL AGE", type(evans_real_age), sep="\n")
print(evans_real_age)
EVAN's REAL AGE
<class 'numpy.int64'>
40
The exception in your original code was probably thrown by if abs_res == 0.
If abs_res is a pandas.Series, then abs_res == 0 returns another Series.
Python has no way to decide whether an entire list of numbers "is equal to zero".
Normally, an if-clause receives exactly one truth value.
if (912):
    print("912 is True")
else:
    print("912 is False")
When an if-statement receives more than one value, then the python interpreter does not know what to do.
For example, what should the following do?
import pandas as pd

data = pd.Series([1, 565, 120, 12, 901])

if data:
    print("data is true")
else:
    print("data is false")
You should only pass one value into an if-condition. Instead, you passed in a pandas.Series object.
In your case, the pandas.Series only had one number in it. In general, however, a pandas.Series contains many values, and the authors of the pandas library assume it does, even when it only has one.
The interpreter thought that you tried to put many different numbers inside a single if-clause.
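The remedies named in the error message can be sketched in a few lines (toy Series, not the asker's data):

```python
import pandas as pd

nums = pd.Series([1.11, 0.0, 3.33])
mask = nums == 0           # elementwise comparison: a boolean Series, not a bool

print(mask.any())          # True: at least one element equals zero
print(mask.all())          # False: not every element equals zero

# For a single-element Series, extract the scalar before comparing:
one = pd.Series([0.0])
print(one.item() == 0)     # True: .item() returns the lone scalar
```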
The difference between a "function definition" and a "function call"
Your original question was,
"I want to make a function if the common key is found"
Your use of the phrase "make a function" is incorrect. You probably meant, "I want to call a function if a common key is found."
The following are all examples of function "calls":
import pandas as pd
import numpy as np
z = foo(1, 91)
result = funky_function(811, 22, "green eggs and ham")
output = do_stuff()
data = np.random.randn(6, 4)
df = pd.DataFrame(data, index=dates, columns=list("ABCD"))
Suppose that you have two containers.
If you truly want to "make" a function if a common key is found, then you would have code like the following:
dict1 = {'age': 26, 'phone': "303-873-9811"}
dict2 = {'name': "Bob", 'phone': "303-873-9811"}

def foo(dict1, dict2):
    common_keys = set(dict2.keys()).intersection(set(dict1.keys()))
    # if there is a shared key...
    if len(common_keys) > 0:
        # make (create) a new function
        def bar(*args, **kwargs):
            pass
        return bar

new_function = foo(dict1, dict2)
print(new_function)
In Python, you "make" (define) a function with the def keyword. Writing a function's name with parentheses, and no def, is a function call.
I think that your question should be re-titled.
You could write, "How do I call a function if two pandas dataframes have a common key?"
A second good question would be something like,
"What went wrong if we see the error message ValueError: The truth value of a Series is ambiguous?"
Your question was worded strangely, but I think I can answer it.
Generating Test Data
Your question did not include test data. If you ask a question on Stack Overflow again, please provide a small example of some test data.
The following is an example of data we can use:
product_ID full_price
0 prod_id 1-1-1-1 11.11
1 prod_id 2-2-2-2 22.22
2 prod_id 3-3-3-3 33.33
3 prod_id 4-4-4-4 44.44
4 prod_id 5-5-5-5 55.55
5 prod_id 6-6-6-6 66.66
6 prod_id 7-7-7-7 77.77
------------------------------------------------------------
product_code tot_price
0 prod_id 3-3-3-3 34.08
1 prod_id 4-4-4-4 45.19
2 prod_id 5-5-5-5 56.30
3 prod_id 6-6-6-6 67.41
4 prod_id 7-7-7-7 78.52
5 prod_id 8-8-8-8 89.63
6 prod_id 9-9-9-9 100.74
Products 1 and 2 are unique to data-frame 1
Products 8 and 9 are unique to data-frame 2
Both data-frames contain data for products 3, 4, 5, ..., 7.
The prices are slightly different between data-frames.
The test data above is generated by the following code:
import pandas as pd
from copy import copy

raw_data = [
    [
        "prod_id {}-{}-{}-{}".format(k, k, k, k),
        int("{}{}{}{}".format(k, k, k, k)) / 100
    ] for k in range(1, 10)
]

df1 = pd.DataFrame(data=copy(raw_data[:-2]), columns=["product_ID", "full_price"])
df2 = pd.DataFrame(data=copy(raw_data[2:]), columns=["product_code", "tot_price"])

for rowid in range(0, len(df2.index)):
    df2.at[rowid, "tot_price"] += 0.75

print(df1)
print(60*"-")
print(df2)
Add some error checking
It is considered best practice to make sure that your function inputs are in the correct format.
You wrote a function named range_calc(z, y). I recommend making sure that z and y are numbers, and not something else (such as a pandas Series object).
import inspect
import io

def range_calc(z, y):
    try:
        z = float(z)
        y = float(y)
    except (TypeError, ValueError):
        function_name = inspect.stack()[0][3]
        with io.StringIO() as string_stream:
            print(
                "Error: In " + function_name + "(). Inputs should be like decimal numbers.",
                "Instead, we have: " + str(type(z)) + " \'" + repr(str(z))[1:-1] + "\'",
                file=string_stream,
                sep="\n"
            )
            err_msg = string_stream.getvalue()
        raise ValueError(err_msg)
    # DO STUFF
    return
Now we get error messages:
import pandas as pd

data = pd.Series([1, 565, 120, 12, 901])
range_calc("I am supposed to be an integer", data)
# ValueError: Error: In range_calc(). Inputs should be like decimal numbers.
# Instead, we have: <class 'str'> 'I am supposed to be an integer'
Code which Accomplishes what you Wanted.
The following is some rather ugly code which computes what you wanted:
# You can continue to use your original `range_calc()` function unmodified.
# Use the test data I provided earlier in this answer.
def foo(df1, df2):
    last_df = pd.DataFrame(columns=["Product_number", "Market_Price"])
    df1_ids = set(df1["product_ID"].tolist())
    df2_ids = set(df2["product_code"].tolist())
    pids = df1_ids.intersection(df2_ids)  # common product ids
    for pid in pids:
        row1 = df1.loc[df1["product_ID"] == pid]
        row2 = df2.loc[df2["product_code"] == pid]
        price1 = row1["full_price"].tolist()[0]
        price2 = row2["tot_price"].tolist()[0]
        price3 = range_calc(price1, price2)
        row3 = pd.DataFrame([[pid, price3]], columns=["Product_number", "Market_Price"])
        last_df = pd.concat([last_df, row3])
    return last_df

# ---------------------------------------
last_df = foo(df1, df2)
The result is:
Product_number Market_Price
0 prod_id 6-6-6-6 1.112595
0 prod_id 7-7-7-7 0.955171
0 prod_id 4-4-4-4 1.659659
0 prod_id 5-5-5-5 1.332149
0 prod_id 3-3-3-3 2.200704
Note that one of the many reasons my solution is ugly is the following line of code:
last_df = pd.concat([last_df, row3])
If last_df is large (thousands of rows), then the code will run very slowly.
This is because instead of inserting a new row of data, we:
copy the original dataframe
append a new row of data to the copy.
delete/destroy the original data-frame.
It is really silly to copy 10,000 rows of data only to add one new value, and then delete the old 10,000 rows.
However, my solution has fewer bugs than your original code, relatively speaking.
Sometimes when you check a condition on a Series or dataframe, your output is itself a Series of booleans.
In this case you must use any, all, item, etc.
Use the print function on your condition to see the Series.
Also, I must point out that your code is very slow: the nested loops make it O(n**2). You can instead compute df3 by joining df1 and df2, then use the apply method for a fast calculation.
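A sketch of that join-then-apply approach, with toy frames reusing the question's column names (the compact range_calc below keeps the original's divide-by-the-larger-price logic):

```python
import pandas as pd

df1 = pd.DataFrame({"product_ID": ["a", "b", "c"], "full_price": [10.0, 20.0, 30.0]})
df2 = pd.DataFrame({"product_code": ["b", "c", "d"], "tot_price": [22.0, 27.0, 40.0]})

# Inner merge keeps only the common keys; no nested loops needed.
df3 = df1.merge(df2, left_on="product_ID", right_on="product_code")

def range_calc(z, y):
    # scalar version of the question's function
    abs_res = abs(z - y)
    if abs_res == 0:
        return z
    return (abs_res / max(z, y)) * 100

df3["Market_Price"] = df3.apply(lambda r: range_calc(r["full_price"], r["tot_price"]), axis=1)
print(df3[["product_ID", "Market_Price"]])
```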

While passing same number of column name and column data get the error

a=["ExpNCCIFactor","Requestid","EffDate","TransresposnseDate","QuoteEffDate","ApplicationID","PortUrl","UQuestion","DescriptionofOperations","Error"]
d = [ExpNCCIFactor,Requestid,EffDate,TransresposnseDate,QuoteEffDate,ApplicationID,PortUrl,UQuestion,DescriptionofOperations,Error]
df2 = pd.DataFrame(data = d , columns = a)
I got this error:
Traceback (most recent call last):
File "C:\Users\praka\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\internals\construction.py", line 982, in _finalize_columns_and_data
columns = _validate_or_indexify_columns(contents, columns)
File "C:\Users\praka\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\internals\construction.py", line 1030, in _validate_or_indexify_columns
raise AssertionError(
AssertionError: 10 columns passed, passed data had 11 columns
You are creating a dataframe from a list.
In your case, the data argument should be a list of lists:
a=["ExpNCCIFactor","Requestid","EffDate","TransresposnseDate","QuoteEffDate","ApplicationID","PortUrl","UQuestion","DescriptionofOperations","Error"]
d = [[ExpNCCIFactor,Requestid,EffDate,TransresposnseDate,QuoteEffDate,ApplicationID,PortUrl,UQuestion,DescriptionofOperations,Error]]
It seems like one of the items in your variable d has a different shape, probably a nested list inside it.
Try this code to iterate over them and print their shapes:
import numpy as np

c = 0
for i in d:
    print('Shape of column {} is {}'.format(c, np.shape(i)))
    c += 1
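A toy illustration of the fix (made-up column names, not the asker's variables): a flat list of scalars is treated as a single column of rows, while wrapping it in an outer list yields one row. Note the exact exception pandas raises depends on the shape of the data.

```python
import pandas as pd

a = ["col1", "col2", "col3"]
d = [1, "two", 3.0]

try:
    pd.DataFrame(data=d, columns=a)       # flat list: 3 rows x 1 column vs 3 column names
except ValueError as e:
    print(type(e).__name__, e)

df2 = pd.DataFrame(data=[d], columns=a)   # outer list: one row, three columns
print(df2.shape)                          # (1, 3)
```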

TypeError: can only concatenate str (not "numpy.float64") to str data set question

Please help. I've spent three hours on Stack Overflow now. I tried ''.join, str(), and swapping the "+" for ",", and nothing removes this error! The last two comments are where the error happens!
#Description: Build an anime recommendation using python
import pandas as pd

#Store the data
df = pd.read_csv('animes.csv')

#Show the first rows of data
df.head(2)

#Count the number of rows/animes and the number of columns in the data set
df.shape

#List of wanted columns for anime recommendations
columns = ['title', 'synopsis', 'genre', 'aired', 'episodes']

#Data updated
df[columns].head(3)

#Missing values check
df[columns].isnull().values.any()

#Create a function to combine the values of the new columns into a single string
def get_new_features(data):
    new_features = []
    for i in range(0, data.shape[0]):
        new_features.append(data['title'][i]+''+data['synopsis'][i]+''+data['genre'][i]+''+data['aired'][i]+''+data['episodes'][i])
    return new_features

#Create a column to hold the combined strings
df['new_features'] = get_new_features(df)

#Show data
df.head(4)
--------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-97-d5676456ab85> in <module>()
1 #Create a column to hold combine strings
----> 2 df ['new_features'] = get_new_features(df)
3
4 #show data
5 df.head(4)
<ipython-input-95-842623950c0e> in get_new_features(data)
3 new_features =[]
4 for i in range (0, data.shape[0]):
----> 5 new_features.append(data['title'][i]+''+data['synopsis'][i]+''+data['genre'][i]+''+data['aired'][i]+''+data['episodes'][i])
6
7 return new_features
TypeError: can only concatenate str (not "numpy.float64") to str
First off, I would print data to see what items/types it contains:
for key, value in data.items():
    print(value)
    print(key)
    print(type(data[key]))
Using this output you should be able to discern which parts of data you need to convert to strings or reformat.
Most likely you can take the lazy route of just casting each item to a string:
new_features.append(str(data['title'][i])+''+str(data['synopsis'][i])+''+str(data['genre'][i])+''+str(data['aired'][i])+''+str(data['episodes'][i]))
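As a sketch of an alternative, the whole loop can be replaced by a vectorized cast-and-join; the toy frame below stands in for the real animes.csv:

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["A", "B"],
    "synopsis": ["x", None],     # a missing value like the one raising the TypeError
    "genre": ["action", "drama"],
    "aired": ["2001", "2002"],
    "episodes": [12, 24.0],      # numeric column that broke the string concatenation
})

cols = ["title", "synopsis", "genre", "aired", "episodes"]
df["new_features"] = df[cols].fillna("").astype(str).agg(" ".join, axis=1)
print(df["new_features"].tolist())
```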

How to iterate through two columns in a pandas dataframe to add the values to a list

I'm trying to evaluate a condition on one pandas column and, depending on the result, take the value from another pandas column and append it to a list.
I tried the following:
def roc_table(df, row_count, signal, returns):
    """
    Parameters
    ----------
    df : dataframe
    row_count : length of data
    signal : signal/s
    returns : log returns

    Returns
    -------
    table - hopefully
    """
    df = df.copy()
    bins = [-48.13, -38.70, -29.28, -19.85, -10.42, -1.01,
            8.42, 17.85, 27.27, 36.7]
    win_above = 0
    lose_above = 0
    lose_below = 0
    win_below = 0
    # df = df.sort_values([signal, returns])
    for bin in bins:
        k = bin
        for row, value in df.iterrows():
            if row[signal] < k:
                lose_below += row[returns]
            else:
                win_below -= row[returns]
        for row, value in df.iterrows():
            if row[signal] >= k:
                win_above += row[returns]
            else:
                lose_above -= row[returns]
    print(win_above, lose_above, lose_below, win_below)

roc_table(df=df_train, row_count=df_train.shape[0],
          signal='predicted_RSI_indicator',
          returns='log_return')
But I only get
Traceback (most recent call last):
File "<ipython-input-135-cd5513bb0778>", line 50, in <module>
roc_table(df = df_train, row_count = df_train.shape[0],
File "<ipython-input-135-cd5513bb0778>", line 32, in roc_table
if row[signal] < k:
TypeError: 'Timestamp' object is not subscriptable
The index is a date time stamp.
Here is a sample of the input df
signal returns
-.23 .045
2.3 -.09
9.8 1.2
The output would look something like this
bins win_above lose_above win_below lose_below
-48.13 123
-38.70 -98
-29.28 100
-19.85 -34
-10.42 567
...
So the idea is: if df[signal] is below the bin, the associated return is added to win_below if it is greater than 0, else it is added to lose_below.
I'll eventually add a loop for the signals greater than the bin and add those to win_above and lose_above.
As per Pandas documentation, pandas.DataFrame.iterrows yields "the index of the row and the data of the row as a Series".
So, you should be doing (twice in your for loop):
for i, row in df.iterrows():
...
instead of:
for row, value in df.iterrows():
...
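A runnable miniature of the corrected loop, with a timestamp index and made-up numbers shaped like the question's sample:

```python
import pandas as pd

df = pd.DataFrame(
    {"signal": [-0.23, 2.3, 9.8], "returns": [0.045, -0.09, 1.2]},
    index=pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06"]),
)

total = 0.0
for i, row in df.iterrows():      # i is the Timestamp index; row is the data Series
    if row["signal"] < 1.0:
        total += row["returns"]

print(total)                      # 0.045: only the first row's signal is below 1.0
```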
