How would you combine selected rows of a dataframe with built-in functions using correct syntax?
The key equation (the one that raises the error) is marked with '***' below. There are three aspects to this equation:
(1) The operation applies only to the selected rows [lo:hi] and the column [ColumnName] of a dataframe.
(2) It goes through the NaN entries in this selection and sets each of them to a random number as defined by (3).
(3) The random number is generated by the library function np.random.randint, where
(a) the values range between (avg-std) and (avg+std), and a total of size=null_total[ColumnName] entries are generated;
(b) each random number is then divided by avg to normalize the value.
avg and std are the mean and standard deviation of all the selected row values under [ColumnName], as computed by the built-in dataframe functions .mean and .std, respectively. avg, std and null_total are declared as DataFrame objects, although they could just be Series.
def process_Fill_and_Normalize(df,lo,hi,ColumnName):
    avg = pd.DataFrame()
    std = pd.DataFrame()
    null_total = pd.DataFrame()
    avg[ColumnName] = df[ColumnName][lo:hi].mean()
    std[ColumnName] = df[ColumnName][lo:hi].std()
    null_total[ColumnName] = df[ColumnName][lo:hi].isnull().sum()
    ***df[ColumnName][lo:hi][np.isnan(combined[ColumnName][lo:hi])] = np.random.randint(avg[ColumnName] - std[ColumnName], avg[ColumnName] + std[ColumnName], size=null_total[ColumnName])/avg[ColumnName]
    return df
Error message is as follows:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
/opt/conda/lib/python3.6/site-packages/pandas/core/generic.py in __nonzero__(self)
915 raise ValueError("The truth value of a {0} is ambiguous. "
916 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
--> 917 .format(self.__class__.__name__))
918
919 __bool__ = __nonzero__
Your advice on how to modify the syntax would be very much appreciated.
Many thanks to hpaulj, whose reply advised breaking up the long equation. The index expression is now defined in a separate statement. The following modified code works:
def process_Fill_and_Normalize(df,lo,hi,ColumnName):
    avg = pd.DataFrame()
    std = pd.DataFrame()
    null_total = pd.DataFrame()
    avg[ColumnName] = df[ColumnName][lo:hi].mean()
    std[ColumnName] = df[ColumnName][lo:hi].std()
    null_total[ColumnName] = df[ColumnName][lo:hi].isnull().sum()
    # Build the Boolean index of the NaN entries in its own statement
    null_entry_index = np.isnan(df[ColumnName][lo:hi])
    df[ColumnName][lo:hi][null_entry_index] = np.random.randint(
        avg[ColumnName] - std[ColumnName],
        avg[ColumnName] + std[ColumnName],
        size=null_total[ColumnName]) / avg[ColumnName]
    return df
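For comparison, here is a minimal sketch of the same fill-and-normalize idea with avg and std kept as plain scalars and the assignment done through .loc instead of chained indexing. The function name fill_and_normalize, the seed parameter, and the use of np.random.default_rng are assumptions for illustration and are not part of the original post. The sketch also assumes a unique index and a spread wide enough that int(avg - std) is smaller than int(avg + std).

```python
import numpy as np
import pandas as pd


def fill_and_normalize(df, lo, hi, column_name, seed=None):
    """Fill NaNs in positional rows lo:hi of one column (hypothetical helper).

    Each NaN is replaced by a random integer drawn from
    [avg - std, avg + std) and divided by avg.
    """
    rng = np.random.default_rng(seed)

    # Select the rows by position, then work with their index labels.
    rows = df.index[lo:hi]
    selected = df.loc[rows, column_name]

    # mean() and std() skip NaN by default and return scalars.
    avg = selected.mean()
    std = selected.std()

    # Labels of the NaN entries inside the selection.
    nan_rows = selected.index[selected.isna()]

    # One random value per NaN entry, normalized by avg.
    fill = rng.integers(int(avg - std), int(avg + std), size=len(nan_rows)) / avg

    # A single .loc assignment writes back into df itself.
    df.loc[nan_rows, column_name] = fill
    return df


# Sample data invented for this sketch.
sample = pd.DataFrame({"x": [10.0, 20.0, np.nan, 30.0, np.nan, 40.0]})
result = fill_and_normalize(sample, 0, 6, "x", seed=0)
```

Because avg, std and the NaN count are scalars here, none of them is ever evaluated as a Series in a context that expects a single value.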
Related
I want to use a value from a specific column in my Pandas dataframe as the Y-axis label. The reason for this is that the label could change depending on the Unit of Measure (UoM) - it could be kg, number of bags etc.
# Create a function that uses plant and material inputs to chart planned and actual manufactured quantities
def filter_df(df, plant: str = "", material: str = ""):
    output_df = df.loc[(df['Plant'] == plant) & (df['Material'].str.contains(material))].reset_index()
    return output_df['Planned_Qty_Cumsum'].plot.area(label = 'Planned Quantity'),\
           output_df['Goods_Receipted_Qty_Cumsum'].plot.line(label = 'Delivered Quantity'),\
           plt.title('Planned and Delivered Quantities'),\
           plt.legend(),\
           plt.xlabel('Number of Process Orders'),\
           plt.ylabel(output_df['UoM (of GR)']),\
           plt.show()
# Run the function
filter_df(df_yield_data_formatted,'*plant*','*material*')
When running the function I get the following error message:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Yes, you can. However, output_df['UoM (of GR)'] passes every value in that column of the dataframe, whereas the label needs a single value. Indicate which row and column you want for the label, for instance with iloc, and it will work:
plt.ylabel(df.iloc[2,1])
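As a rough sketch of the same idea, the snippet below reduces the unit column to one plain string before it is used as a label. The helper name uom_label and the sample frame are hypothetical; only the column name 'UoM (of GR)' comes from the question.

```python
import pandas as pd


def uom_label(frame, column="UoM (of GR)"):
    """Return one string built from the distinct units in a column (hypothetical helper)."""
    units = frame[column].dropna().unique()
    # Join distinct units so a mixed column still yields a single string.
    return ", ".join(str(unit) for unit in units)


# Sample data invented for this sketch.
output_df = pd.DataFrame({"UoM (of GR)": ["kg", "kg", "kg"]})
label = uom_label(output_df)
# label is a plain str, so it can be passed on as plt.ylabel(label).
```

Joining the distinct units instead of taking a fixed row means a filter that unexpectedly mixes units shows up in the label rather than being hidden.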
I am trying to create an indicator equal to 1 if my meeting_date variable matches my date variable, and zero otherwise. I am getting an error in my code that consists of the following:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Please let me know what I am doing wrong! Here is my code:
if crsp_12['meeting_date'] == crsp_12['date']:
    crsp_12['i_meeting_date_dayof'] == 1
else:
    crsp_12['i_meeting_date_dayof'] == 0
You should avoid classical if/for constructs with pandas. Use vectorized code instead:
crsp_12['i_meeting_date_dayof'] = crsp_12['meeting_date'].eq(crsp_12['date']).astype(int)
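A small self-contained illustration of that one-liner follows. The dates are invented sample data; only the column names come from the question.

```python
import pandas as pd

# Sample data invented for this sketch.
crsp_12 = pd.DataFrame({
    "meeting_date": pd.to_datetime(["2020-01-29", "2020-03-18", "2020-04-29"]),
    "date": pd.to_datetime(["2020-01-29", "2020-03-17", "2020-04-29"]),
})

# eq() compares row by row and returns a Boolean Series;
# astype(int) turns True/False into 1/0.
crsp_12["i_meeting_date_dayof"] = (
    crsp_12["meeting_date"].eq(crsp_12["date"]).astype(int)
)

print(crsp_12["i_meeting_date_dayof"].tolist())  # [1, 0, 1]
```

The comparison is never passed to if, so no single truth value is needed for the whole Series.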
With more than 250 independent variables, I am trying to find variables that are statistically significant. For this, I am trying to build a for loop which will only return the variables whose P-value is less than alpha.
cols = x2.columns
alpha = 0.05
for i in cols:
    if (est2.pvalues[i] < alpha) == True:
        print(i)
where est2 = sm.OLS(y,x2).fit(). This is the output that I get:
LotArea
OverallQual
OverallCond
YearBuilt
YearRemodAdd
BsmtFinSF1
TotalBsmtSF
1stFlrSF
2ndFlrSF
GrLivArea
BsmtFullBath
HalfBath
GarageArea
WoodDeckSF
EnclosedPorch
ScreenPorch
MSZoning_FV
MSZoning_RH
MSZoning_RL
MSZoning_RM
LotConfig_FR2
LotConfig_Inside
LandSlope_Sev
Neighborhood_Crawfor
Neighborhood_Edwards
Neighborhood_MeadowV
Neighborhood_NridgHt
Neighborhood_StoneBr
Condition1_Norm
Condition1_PosN
Condition2_PosN
Condition2_RRAe
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-192-8387e9b8424a> in <module>
2 alpha = 0.05
3 for i in cols:
----> 4 if (est2.pvalues[i] < alpha) == True:
5 print(i)
6 #print(i, est2.pvalues[i] > alpha)
~\anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
1440 @final
1441 def __nonzero__(self):
-> 1442 raise ValueError(
1443 f"The truth value of a {type(self).__name__} is ambiguous. "
1444 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
It stops in the middle like this.
First, the == True is superfluous, but that shouldn't have anything to do with the error.
The error indicates that for one of the variable names i, the expression est2.pvalues[i] is a pandas Series rather than a single value. (The name i is a bit misleading, by the way, since i would usually be an integer in a loop like this.) Exactly why that happens is impossible to tell without seeing the problematic variable name.
In any case, est2.pvalues is a pandas Series, so you can get all the low p-values (and the corresponding variable names) with Boolean indexing like this:
est2.pvalues[est2.pvalues < 0.05]
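One possible cause, offered only as an assumption since the offending name is not shown, is a duplicated column label: looking up a label that appears more than once in a Series index returns a Series instead of a scalar. The sketch below uses an invented p-value Series (the label "Dup" and all numbers are hypothetical) to show both the Boolean indexing and a check for duplicated labels.

```python
import pandas as pd

# Invented p-values; "Dup" deliberately appears twice in the index.
pvalues = pd.Series(
    [0.01, 0.20, 0.03, 0.50],
    index=["LotArea", "HalfBath", "Dup", "Dup"],
)
alpha = 0.05

# Boolean indexing keeps every entry below alpha, no loop required.
significant = pvalues[pvalues < alpha]

# A duplicated label makes pvalues[label] a Series, not a single number.
lookup = pvalues["Dup"]
duplicated_labels = pvalues.index[pvalues.index.duplicated()].tolist()

print(significant.index.tolist())  # ['LotArea', 'Dup']
print(duplicated_labels)           # ['Dup']
```

Running the same duplicated() check on est2.pvalues.index would confirm or rule out this explanation.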
I have a dataframe containing a float64 column and an object column that holds strings.
If the object column contains the substring 'abc', I want to subtract 12 from the float column. If the object column contains the substring 'def', I want to subtract 24 from the float column. If the object column contains neither 'abc' nor 'def', I want to leave the float column as is.
Example:
Nmbr Strng
52 abcghi
80 defghi
10 ghijkl
Expected output:
Nmbr Strng
40 abcghi
56 defghi
10 ghijkl
I have tried the following but keep getting an error:
if df.Strng.str.contains("abc"):
    df.Nmbr = (df.Nmbr - 12)
elif df.Strng.str.contains("def"):
    df.Nmbr = (df.Nmbr - 24)
else:
    df.Nmbr = df.Nmbr
The error I'm getting is as follows:
915 raise ValueError("The truth value of a {0} is ambiguous. "
916 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
917 .format(self.__class__.__name__))
918
919 __bool__ = __nonzero__
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Note: Line 917 is the one that is highlighted as the error.
Your error occurs because you are testing whether a Boolean series is True or False. This is not possible. You could test if all or any values are True, to return a single Boolean, but this isn't what you are looking for.
It is good practice to vectorize your calculations rather than introduce loops. Below is how you can implement your logic via the .loc accessor.
df.loc[df['Strng'].str.contains('abc', regex=False, na=False), 'Nmbr'] -= 12
df.loc[df['Strng'].str.contains('def', regex=False, na=False), 'Nmbr'] -= 24
Result:
Nmbr Strng
0 40 abcghi
1 56 defghi
2 10 ghijkl
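An alternative sketch, assuming NumPy is available, computes the amount to subtract in a single pass with np.select. The variable names contains_abc, contains_def and offset are made up for this example; the data is the sample from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Nmbr": [52.0, 80.0, 10.0],
    "Strng": ["abcghi", "defghi", "ghijkl"],
})

# One Boolean mask per condition; na=False treats missing strings as no match.
contains_abc = df["Strng"].str.contains("abc", regex=False, na=False)
contains_def = df["Strng"].str.contains("def", regex=False, na=False)

# np.select picks the first matching choice per row, or the default of 0.
offset = np.select([contains_abc, contains_def], [12, 24], default=0)
df["Nmbr"] = df["Nmbr"] - offset

print(df["Nmbr"].tolist())  # [40.0, 56.0, 10.0]
```

Unlike the two successive .loc updates, np.select applies only the first matching condition, so a string containing both 'abc' and 'def' would have 12 subtracted rather than 36.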
I have a similar problem. I'm trying to create a new field in a pandas dataframe with a simple condition:
if df_reader['email1_b']=='NaN':
    df_reader['email1_fin']=df_reader['email1_a']
else:
    df_reader['email1_fin']=df_reader['email1_b']
But I get this strange error:
ValueError Traceback (most recent call last)
<ipython-input-92-46d604271768> in <module>()
----> 1 if df_reader['email1_b']=='NaN':
2 df_reader['email1_fin']=df_reader['email1_a']
3 else:
4 df_reader['email1_fin']=df_reader['email1_b']
/home/user/GL-env_py-gcc4.8.5/lib/python2.7/site-packages/pandas/core/generic.pyc in __nonzero__(self)
953 raise ValueError("The truth value of a {0} is ambiguous. "
954 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
--> 955 .format(self.__class__.__name__))
956
957 __bool__ = __nonzero__
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Can anybody explain what I need to do about this?
df_reader['email1_b']=='NaN' is a vector of Boolean values (one per row), but you need one Boolean value for if to work. Use this instead:
df_reader['email1_fin'] = np.where(df_reader['email1_b']=='NaN',
df_reader['email1_a'],
df_reader['email1_b'])
As a side note, are you sure about 'NaN'? Is it not NaN? In the latter case, your expression should be:
df_reader['email1_fin'] = np.where(df_reader['email1_b'].isnull(),
df_reader['email1_a'],
df_reader['email1_b'])
if expects a scalar value; it does not understand an array of Booleans, which is what your condition returns. If you think about it, what should it do if only a single value in this array is False/True?
To do this properly, you can do the following:
df_reader['email1_fin'] = np.where(df_reader['email1_b'] == 'NaN', df_reader['email1_a'], df_reader['email1_b'] )
Also, you seem to be comparing against the str 'NaN' rather than the numerical NaN. Is this intended?
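If both kinds of missing marker could occur in the column, a hedged sketch like the following treats the literal string 'NaN' and a real NaN the same way before choosing between the two columns. The sample addresses and the intermediate name email_b are invented for illustration.

```python
import numpy as np
import pandas as pd

# Sample data invented for this sketch: one real NaN and one 'NaN' string.
df_reader = pd.DataFrame({
    "email1_a": ["a1@example.com", "a2@example.com", "a3@example.com"],
    "email1_b": ["b1@example.com", np.nan, "NaN"],
})

# mask() blanks out entries where the condition holds, so the
# literal string 'NaN' becomes a real missing value.
email_b = df_reader["email1_b"].mask(df_reader["email1_b"] == "NaN")

# Row by row: take email1_a where email1_b is missing, else email1_b.
df_reader["email1_fin"] = np.where(
    email_b.isnull(), df_reader["email1_a"], email_b
)

print(df_reader["email1_fin"].tolist())
# ['b1@example.com', 'a2@example.com', 'a3@example.com']
```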