I need to resample some data with numpy's weighted-average function, and it just doesn't work.
This is my test case:
import datetime
import numpy as np
import pandas as pd

time_vec = [datetime.datetime(2007, 1, 1, 0, 0),
            datetime.datetime(2007, 1, 1, 0, 1),
            datetime.datetime(2007, 1, 1, 0, 5),
            datetime.datetime(2007, 1, 1, 0, 8),
            datetime.datetime(2007, 1, 1, 0, 10)]
df = pd.DataFrame([2, 3, 1, 7, 4], index=time_vec)
A normal resampling without weights works fine (using a lambda function as a parameter to how, as suggested here: Pandas resampling using numpy percentile? Thanks!):
df.resample('5min',how = lambda x: np.average(x[0]))
But if I try to use some weights, it always returns TypeError: Axis must be specified when shapes of a and weights differ:
df.resample('5min',how = lambda x: np.average(x[0],weights = [1,2,3,4,5]))
I tried this with many different numbers of weights, but it did not get any better:
for i in xrange(20):
    try:
        print range(i)
        print df.resample('5min', how=lambda x: np.average(x[0], weights=range(i)))
        print i
        break
    except TypeError:
        print i, 'typeError'
I'd be glad about any suggestions.
The short answer here is that the weights in your lambda need to be created dynamically based on the length of the series that is being averaged. In addition, you need to be careful about the types of objects that you're manipulating.
The code that I got to compute what I think you're trying to do is as follows:
df.resample('5min', how=lambda x: np.average(x, weights=1+np.arange(len(x))))
There are two differences compared with the line that was giving you problems:
1. x[0] is now just x. The x object in the lambda is a pd.Series, so x[0] gives just the first value in the series. This was working without raising an exception in the first example (without the weights) because np.average(c) just returns c when c is a scalar. But I think it was actually computing incorrect averages even in that case, because each of the sampled subsets was just returning its first value as the "average".
2. The weights are created dynamically based on the length of the data in the Series being resampled. You need to do this because the x in your lambda might be a Series of a different length for each time interval being computed.
The way I figured this out was through some simple type debugging, by replacing the lambda with a proper function definition:
def avg(x):
    print(type(x), x.shape, type(x[0]))
    return np.average(x, weights=np.arange(1, 1 + len(x)))

df.resample('5Min', how=avg)
This let me have a look at what was happening with the x variable. Hope that helps!
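As a quick sanity check on that weighting scheme (a sketch; the first 5-minute bin of the test data contains the values at 00:00 and 00:01, and the weights 1..n match the 1+np.arange(len(x)) expression above):

import numpy as np

vals = np.array([2, 3])             # values in the first 5-minute bin
w = 1 + np.arange(len(vals))        # array([1, 2])
print(np.average(vals, weights=w))  # (2*1 + 3*2) / (1+2) = 2.666...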
Related
I have a function that operates on a pandas DataFrame. It works with pandas.apply() but it does not work with np.vectorize(). Find the function below:
def AMTTL(inputData, amortization = []):
    rate = inputData['EIR']
    payment = inputData['INSTALMENT']
    amount = inputData['OUTSTANDING']
    amortization = [amount]
    if amount - payment <= 0:
        return amortization
    else:
        while amount > 0:
            amount = BALTL(rate, payment, amount)  # BALTL is defined elsewhere
            if amount <= 0:
                continue
            amortization.append(amount)
        return amortization
The function receives inputData in pandas DataFrame format; EIR, INSTALMENT and OUTSTANDING are the column names. This function works well with pandas.apply():
data.apply(AMTTL, axis = 1)
However, when I try np.vectorize() with the code below, it does not work:
vfunc = np.vectorize(AMTTL)
vfunc(data)
It gives an error like 'Timestamp' object is not subscriptable. So I tried to drop the other columns that are not used, but then I got another error, invalid index to scalar variable.
I am not sure how to adapt the pandas.apply() call to np.vectorize().
Any suggestion? Thank you in advance.
np.vectorize is nothing more than a map function that is applied to all the elements of the array, meaning you cannot differentiate between the columns within the function. It has no idea of column names like EIR or INSTALMENT. Therefore your current implementation will not work with numpy.
From the docs:
The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.
The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
Based on your problem, you should try np.apply_along_axis instead, where you can refer to the different columns by their indexes, as in the sketch below.
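For illustration, a minimal sketch of that pattern (the column order and the one-step balance formula here are assumptions for demonstration only; note also that np.apply_along_axis expects each call to return a result of the same shape, so a variable-length amortization list would still need a different approach, e.g. sticking with apply):

import numpy as np
import pandas as pd

# Hypothetical data with the same three columns in a known order.
data = pd.DataFrame({'EIR': [0.05, 0.04],
                     'INSTALMENT': [100.0, 120.0],
                     'OUTSTANDING': [250.0, 300.0]})

def first_balance(row):
    # row arrives as a plain ndarray, so columns are addressed
    # by position instead of by name
    rate, payment, amount = row[0], row[1], row[2]
    return amount * (1 + rate) - payment  # stand-in for one BALTL step

result = np.apply_along_axis(first_balance, axis=1, arr=data.to_numpy())
print(result)  # one balance per row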
I'm trying to pass every column of a DataFrame through a custom function by using apply(lambda x: ...) in Python.
The custom function I have created works individually, but when put into the apply(lambda x: ...) structure it only returns NaN values into the selected DataFrame.
First is the custom function:
def snr_pd(wavenumber_arr):
    intensity_arr = Zhangfit_output
    signal_low = 1650
    signal_high = 1750
    noise_low = 1750
    noise_high = 1850
    signal_mask = np.logical_and((wavenumber_arr >= signal_low),
                                 (wavenumber_arr < signal_high))
    noise_mask = np.logical_and((wavenumber_arr >= noise_low),
                                (wavenumber_arr < noise_high))
    signal = np.max(intensity_arr[signal_mask])
    noise = np.std(intensity_arr[noise_mask])
    return signal / noise
And this is the setup of the lambda function:
sd['s/n'] = df.apply(lambda x: snr_pd(x), axis=0)
Currently I believe this is taking the columns from df, passing them to snr_pd(), and appending the results to sd under the column ['s/n'], but the only answer produced is NaN.
I have also tried a couple of structural changes, like using applymap() instead of apply():
sd['s/n'] = df.applymap(lambda x: snr_pd(x), na_action='ignore')
However, this returns this error instead:
ValueError: zero-size array to reduction operation maximum which has no identity
Which I have even less understanding of.
Any help would be much appreciated.
It looks as though your function snr_pd() expects an entire array as an argument.
Without seeing your data it's hard to say, but you should be able to apply the function directly to the DataFrame using np.apply_along_axis():
np.apply_along_axis(snr_pd, axis=0, arr=df)
Note that this assumes that every column in df is numeric. If not, then simply select the columns of the df on which you'd like to apply the function.
Note also that np.apply_along_axis() will return a numpy array.
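For instance, a hedged sketch of both notes (assuming df may contain non-numeric columns; select_dtypes keeps only the numeric ones, and the plain ndarray result is wrapped back into a labeled Series):

import numpy as np
import pandas as pd

numeric = df.select_dtypes(include='number')  # drop any non-numeric columns
snr = np.apply_along_axis(snr_pd, axis=0, arr=numeric.to_numpy())

# re-attach the column labels to the plain ndarray result
snr_series = pd.Series(snr, index=numeric.columns)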
import pandas as pd

Department = input("what dept")
editfile = pd.read_csv('52.csv', encoding='Latin-1')
editfilevalues = editfile.loc[editfile['Customer'].str.contains(Department, na=False), 'May-18\nQty']
editfilevalues = editfilevalues.fillna(int(0))
print(int(editfilevalues) * 1.3)
I have looked through Stack Overflow and no answer seems to help me with this problem. I simply want to be able to manipulate data in a Series like this, but I get different errors; with the current code I receive this:
"{0}".format(str(converter))) TypeError: cannot convert the series to <class 'int'>
My main issue is converting a Series to an int type; I have tried several different ways to do this and none are giving me the results I need.
So a pandas Series is a bit like a list, but with different functions and properties. You can't convert a Series to int using int(), because that function wasn't designed to work on list-like objects in that way.
If you need to convert the Series to all integers, this method will work.
int_series = your_series.astype(int)
This will convert your entire Series to an integer dtype ('int64' on most platforms, 'int32' on Windows). Below is a bonus if you want it in a numpy array.
int_array = your_series.values.astype(int)
From here you have a few options to do your calculation.
# where x is a value in your series and lambda is a nameless function
calculated_series = int_series.apply(lambda x: some_number*x)
The output will be another Series object with your rows calculated. Bonus using numpy array below.
calculated_array = int_array * some_number
Edit to show everything at once.
# for series
int_series = your_series.astype(int)
calculated_series = int_series.apply(lambda x: x * some_number)
# for np.array
int_array = your_series.values.astype(int)
calculated_array = int_array * some_number
Either will work, and it is ultimately up to what kind of data structure you want at the end of it all.
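Applied to the snippet from the question, the fix might look like this (a sketch reusing the question's variable names; multiplying the Series directly is element-wise, so no int() cast around the whole Series is needed):

editfilevalues = editfilevalues.fillna(0).astype(int)
print(editfilevalues * 1.3)  # element-wise multiplication returns a float Series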
I am doing some data handling based on a DataFrame with the shape of (135150, 12), so double-checking my results manually is no longer feasible.
I encountered some 'strange' behavior when I tried to check whether an element is part of the DataFrame or of a given column.
This behavior is reproducible with even smaller dataframes as follows:
import numpy as np
import pandas as pd
start = 1e-3
end = 2e-3
step = 0.01e-3
arr = np.arange(start, end+step, step)
val = 0.0019
df = pd.DataFrame(arr, columns=['example_value'])
print(val in df) # prints `False`
print(val in df['example_value']) # prints `True`
print(val in df.values) # prints `False`
print(val in df['example_value'].values) # prints `False`
print(df['example_value'].isin([val]).any()) # prints `False`
Since I am a beginner in data analysis, I am not able to explain this behavior.
I know that I am using different approaches involving different datatypes (like pd.Series, np.ndarray or np.array) to check whether the given value exists in the DataFrame. Additionally, when using np.array or np.ndarray, machine accuracy comes into play, which I am aware of.
However, in the end I need to implement several functions to filter the DataFrame and count the occurrences of some values, which I have done successfully several times before, based on boolean columns combined with comparison operations like > and <.
But in this case I need to filter by the exact value and count its occurrences, which after all led me to the issue described above.
So could anyone explain, what's going on here?
The underlying issue, as Divakar suggested, is floating point precision. Because DataFrames/Series are built on top of numpy, there isn't really a penalty for using numpy methods though, so you can just do something like:
df['example_value'].apply(lambda x: np.isclose(x, val)).any()
or
np.isclose(df['example_value'], val).any()
both of which correctly return True.
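Since the end goal is to filter and count occurrences, the same idea extends naturally. A sketch (the default tolerances of np.isclose may need tuning via its rtol/atol arguments for your data):

import numpy as np

mask = np.isclose(df['example_value'], val)  # one boolean per row
count = mask.sum()    # number of rows approximately equal to val
matches = df[mask]    # filter the DataFrame down to those rows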
I have been trying to use pandas groupby to analyze data, and I encountered an issue after updating pandas from version 0.15.0 to 0.18.1 that did not exist before.
I want to calculate the number of consecutive periods where the value of 'equality' is 1 (it can only take values of 0 or 1). I defined the following lambda function and used the groupby command as follows:
import numpy as np
import pandas as pd

E = lambda x: np.sum(x.diff() == 1) + x.head(1)
grouped = df.groupby(['run_'])
agg_data = grouped[['equality', 'avg_payoff']].mean()
agg_data['E'] = grouped.equality.agg(E)  # number of "equality" epochs
but received the error message for the last line of code:
ValueError: Function does not reduce
It is weird that this code ran perfectly before the update. This is not the first time that I have encountered an issue after updating scientific computing packages, which makes me a bit frustrated. Could anyone help solve the issue? Or do I have to roll back to the old versions?
x.head(1) returns a Series (with one row, but still a Series), so the aggregation does not reduce each group to a scalar, which is what "Function does not reduce" is complaining about.
You can make a silly workaround like this:
E = lambda x: np.sum(x.diff()==1) + np.sum(x.head(1))
or a little bit smarter:
E = lambda x: np.sum(x.diff()==1) + x.iloc[0]
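Putting it back into the original pipeline (a sketch; df, run_, equality and avg_payoff are the names from the question):

import numpy as np
import pandas as pd

E = lambda x: np.sum(x.diff() == 1) + x.iloc[0]  # scalar result, so agg reduces

grouped = df.groupby(['run_'])
agg_data = grouped[['equality', 'avg_payoff']].mean()
agg_data['E'] = grouped.equality.agg(E)  # number of "equality" epochs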