Using latest pandas APIs to compute exponential moving average - python

I have a Python 3.6 function that uses pandas to compute the exponential moving average of a list of floating-point numbers. Here is the function, which is tested and works:
def get_moving_average(values, period):
    import pandas as pd
    import numpy as np

    values = np.array(values)
    moving_average = pd.ewma(values, span=period)[-1]
    return moving_average
However, pd.ewma is a deprecated function, and although it still works, I would like to use the latest API and call pandas the correct way.
Here is the documentation for the latest exponential moving average API.
http://pandas.pydata.org/pandas-docs/stable/api.html#exponentially-weighted-moving-window-functions
I modified the original function into this to use the latest API:
def get_moving_average(values, period, type="exponential"):
    import pandas as pd
    import numpy as np

    values = np.array(values)
    moving_average = 0
    moving_average = pd.ewm.mean(values, span=period)[-1]
    return moving_average
Unfortunately, I got the error AttributeError: module 'pandas' has no attribute 'EWM'

The ewm() method now has a similar API to rolling() and expanding(): you call ewm() and then chain a compatible method such as mean(). For example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.randn(5)})
df['x'].ewm(halflife=2).mean()
0 -0.442148
1 -0.318170
2 0.099168
3 -0.062827
4 -0.371739
Name: x, dtype: float64
If you try df['x'].ewm() without arguments it will tell you:
Must pass one of com, span, halflife, or alpha
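Applied to the function from the question, a minimal sketch might look like this (wrapping the input in a Series first, since .ewm() is a Series/DataFrame method rather than a top-level function like pd.ewma was):
import numpy as np
import pandas as pd

def get_moving_average(values, period):
    # .ewm() lives on Series/DataFrame, so wrap the raw values first
    series = pd.Series(np.asarray(values, dtype=float))
    return series.ewm(span=period).mean().iloc[-1]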
See below for documentation that may be clearer than the link in the OP:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ewm.html#pandas.DataFrame.ewm
http://pandas.pydata.org/pandas-docs/stable/computation.html#exponentially-weighted-windows

Related

AttributeError: 'SingleBlockManager' object has no attribute 'log'

I am working with big data: a million rows and 1000 columns. I have already referred to this post here, so please don't mark this as a duplicate.
If sample data is required, you can use the below:
import numpy as np
import pandas as pd

m = pd.DataFrame(np.array([[1, 0],
                           [2, 3]]))
I have some continuous variables with 0 values in them.
I would like to compute a logarithmic transformation of all those continuous variables.
However, I encounter a divide-by-zero error. So, I tried the suggestions below, based on the post linked above:
df['salary'] = np.log(df['salary'], where=0 < df['salary'], out=np.nan * df['salary'])  # not working: "python stopped working" problem

from numpy import ma
ma.log(df['app_reg_diff'])  # error
My questions are as follows:
a) How to avoid divide by zero error when applying for 1000 columns? How to do this for all continuous columns?
b) How to exclude zeros from log transformation and get the log values for rest of the non-zero observations?
You can replace the zero values with a value of your choice and then apply the logarithm normally:
import numpy as np
import pandas as pd
m = pd.DataFrame(np.array([[1,0], [2,3]]))
m[m == 0] = 1
print(np.log(m))
Here you get zeros for the items that were zero. You could, for example, replace them with -1 instead to get NaN (the log of a negative number is NaN).
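To address both questions at once, a sketch (assuming every numeric column should be transformed; the frame and column names here are only illustrative) is to mask non-positive values to NaN first, so the log never sees a zero:
import numpy as np
import pandas as pd

df = pd.DataFrame({'salary': [1.0, 0.0, 2.0],
                   'app_reg_diff': [3.0, 0.0, 5.0]})

# Select the numeric columns, turn values <= 0 into NaN,
# then take the elementwise log of what remains.
num_cols = df.select_dtypes(include=[np.number]).columns
df[num_cols] = np.log(df[num_cols].where(df[num_cols] > 0))
This leaves NaN where the original value was zero and the plain logarithm everywhere else, with no divide-by-zero warnings.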

How do I check if pandas import is modin or original

While doing some OLS regressions, I discovered that statsmodels.api.add_constant() does the following:
if _is_using_pandas(data, None) or _is_recarray(data):
    from statsmodels.tsa.tsatools import add_trend
    return add_trend(data, trend='c', prepend=prepend, has_constant=has_constant)
If not, it treats data as an ndarray, and so you lose some contextual information (e.g. the column names, which are the regressor variable names). When pandas is imported from modin, the _is_using_pandas() check above returns False.
It is possible that statsmodels needs to add modin as a supported option in their _is_using_pandas(), but for now I'd like to do something like:
if is_using_modin_pandas(x):
    from statsmodels.tsa.tsatools import add_trend
    X = add_trend(x, trend='c', prepend=True, has_constant='skip')
else:
    X = sm.add_constant(x)
How would one write is_using_modin_pandas()?
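One possible heuristic, sketched on the assumption that modin's DataFrame and Series classes live in modules under modin.pandas (e.g. modin.pandas.dataframe), is to inspect the module of the object's type:
def is_using_modin_pandas(obj):
    """Heuristic: True if obj's class is defined inside modin.pandas."""
    return type(obj).__module__.startswith('modin.pandas')
A stricter variant could import modin.pandas and use isinstance checks against modin.pandas.DataFrame and modin.pandas.Series, at the cost of making modin a hard dependency.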

Pandas .round() is not rounding as desired

I have the following row from a huge dataframe with a lot of ids; I picked this one in particular to show you the problem:
                 id  year  anual_jobs   anual_wage
874180  20001150368  2010        10.5  1071.595917
After this, I run:
df.anual_jobs = df.anual_jobs.round()
I get this warning, but the code runs anyway:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value
My result is:

                 id  year  anual_jobs   anual_wage
874180  20001150368  2010        10.0  1071.595917

but I want anual_jobs rounded to 11.0 instead of 10.0.
As #cᴏʟᴅsᴘᴇᴇᴅ pointed out, this is happening because numpy rounds half-values to the nearest even integer (see docs here and a more general discussion here), and pandas uses numpy for most of its numerical work. You can resolve this by rounding the "old-fashioned" way:
import numpy as np
df.anual_jobs = np.floor(df.anual_jobs + 0.5)
or
import pandas as pd
df.anual_jobs = pd.np.floor(df.anual_jobs + 0.5)
As #cᴏʟᴅsᴘᴇᴇᴅ pointed out you can also resolve the slice assignment warning by creating your dataframe as a free-standing frame instead of a view on an older dataframe, i.e., execute the following at some point before you assign values into the dataframe:
df = df.copy()
If what bothers you is specifically the half-integer behavior, use the decimal module:
from decimal import Decimal, ROUND_HALF_UP
print(Decimal(10.5).quantize(0, ROUND_HALF_UP))
print(Decimal(10.2).quantize(0, ROUND_HALF_UP))
>> 11
>> 10
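To apply that half-up rounding to the whole column rather than one scalar, a small sketch (using the column name from the question; Decimal('1') is the canonical quantize target for rounding to an integer) could be:
from decimal import Decimal, ROUND_HALF_UP

# Round each value half-up via Decimal, then cast back to float
df.anual_jobs = df.anual_jobs.map(
    lambda x: float(Decimal(str(x)).quantize(Decimal('1'), ROUND_HALF_UP)))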

Ignoring NaN/null values while looping through data

I wasn't able to find any clear answers to what I assume is a simple question. This is for Python 3. What are some of your tips and tricks when applying functions, loops, etc. to your data when a column has both null and non-null values?
Here is the example I ran into when I was cleaning some data today. I have a function that takes two columns from my merged dataframe then calculates a ratio showing how similar two strings are.
imports:
from difflib import SequenceMatcher
import pandas as pd
import numpy as np
import pyodbc
import difflib
import os
from functools import partial
import datetime
my function:
def apply_sm(merged, c1, c2):
    return difflib.SequenceMatcher(None, merged[c1], merged[c2]).ratio()
Here is how I call the function in my code:
merged['NameMatchRatio'] = merged.apply(partial(apply_sm, c1='CLIENT NAME', c2='ClientName'), axis=1)
CLIENT NAME has no null values, while ClientName does have null values (which throw errors when I try to apply my function). How can I apply my function while ignoring the NaN values (in either column, just in case)?
Thank you for your time and assistance.
You can use math.isnan to check whether a value is NaN and skip it. Alternatively, you can replace NaN with zero or something else and then apply your function. It really depends on what you want to achieve.
A simple example:
import math

test_variable = math.nan
if math.isnan(test_variable):
    print("it is a nan value")
Just incorporate this logic into your code as you deem fit.
def apply_sm(merged, c1, c2):
    if not merged[[c1, c2]].isnull().any():
        return difflib.SequenceMatcher(None, merged[c1], merged[c2]).ratio()
    return 0.0  # <-- you could handle the null case here
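An alternative sketch that stays in pandas: filter with a null mask up front so the original function only ever sees non-null pairs (column names are the ones from the question; rows with a null in either column are left as NaN in NameMatchRatio):
# Compute the ratio only for rows where both columns are non-null
mask = merged['CLIENT NAME'].notna() & merged['ClientName'].notna()
merged.loc[mask, 'NameMatchRatio'] = merged[mask].apply(
    partial(apply_sm, c1='CLIENT NAME', c2='ClientName'), axis=1)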

interesting pandas loc behavior

I've attached a screenshot of a pd.DataFrame I am using, and I've observed some behavior within loc that is counter-intuitive to me: after reading the pandas API docs I would have thought the methods below were equivalent (.at being, for example, just a faster version of .loc).
Effectively, in a two-dimensional dataframe I thought these methods should be equivalent, but they come up with different results:
Method 1
df.loc[label, column]
Method 2
df.at[label, column]
Method 3
df[column].loc[label]
Out[77] in the screenshot depicts the structure of the table. What I find interesting is that df.loc[label, column] for label '3T19' and column 'wing1' returns a result that (I don't understand and that) is at odds with the results from either of the other two methods, AND different from the result of method 1 on any other label.
Thanks a ton for your patient, kind help; this must be one of the most basic questions.
Running Python 3.4 on Anaconda 2.1 with pandas 0.14.1.
The problem can be reproduced with:
import pandas as pd

belly = '216 3T19'.split()
wing1 = '2T15 4H19'.split()
wing2 = '416 4T20'.split()
mat = pd.to_datetime('2016-01-22 2019-09-07'.split())

tbondfly = pd.DataFrame({'wing1': wing1, 'wing2': wing2, 'mat': mat}, index=belly)
#              mat wing1 wing2
# 216   2016-01-22  2T15   416
# 3T19  2019-09-07  4H19  4T20
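For reference, the three access patterns from the question run against the frame above; on current pandas versions all three return the same scalar, '4H19':
print(tbondfly.loc['3T19', 'wing1'])   # Method 1: row label, column label
print(tbondfly.at['3T19', 'wing1'])    # Method 2: fast scalar access
print(tbondfly['wing1'].loc['3T19'])   # Method 3: column first, then label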
