I am working with a large dataset of a million rows and 1000 columns. I already referred to this post here. Please don't mark it as a duplicate.
If sample data is required, you can use the below:
import numpy as np
import pandas as pd
m = pd.DataFrame(np.array([[1, 0],
                           [2, 3]]))
I have some continuous variables with 0 values in them.
I would like to compute logarithmic transformation of all those continuous variables.
However, I encounter a divide-by-zero error. So, I tried the suggestions below, based on the post linked above:
df['salary'] = np.log(df['salary'], where=0<df['salary'], out=np.nan*df['salary'])  # not working: "python stopped working" crash
from numpy import ma
ma.log(df['app_reg_diff']) # error
My questions are as follows:
a) How can I avoid the divide-by-zero error when applying the log transformation across 1000 columns? How do I do this for all continuous columns?
b) How can I exclude zeros from the log transformation and get log values for the rest of the non-zero observations?
You can replace the zero values with a value of your choice and then apply the logarithm normally.
import numpy as np
import pandas as pd
m = pd.DataFrame(np.array([[1,0], [2,3]]))
m[m == 0] = 1
print(np.log(m))
Here the zero entries come out as zeros, since log(1) = 0. You can, for example, replace them with -1 instead to get NaN.
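To address the broader question of doing this for all columns at once without touching the zeros, here is a minimal sketch (assuming the continuous variables are simply the numeric columns, and using the salary/app_reg_diff names from the question as stand-ins for the real data):
import numpy as np
import pandas as pd

# toy stand-in for the real dataframe
df = pd.DataFrame({'salary': [0.0, 1000.0, 2500.0],
                   'app_reg_diff': [3.0, 0.0, 7.0]})

# pick out the continuous (numeric) columns automatically
num_cols = df.select_dtypes(include='number').columns

# mask out zeros (and negatives) as NaN, then take the log
df[num_cols] = np.log(df[num_cols].where(df[num_cols] > 0))
print(df)
df.where keeps values where the condition holds and fills NaN elsewhere, so zeros never reach np.log and no divide-by-zero warning is raised; the non-zero entries get their usual log values.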
I am trying to run a Kruskal-Wallis test in Python that gives me not only the H statistic and the p value, but also the effect size.
I have tried scipy's stats.kruskal() function, but only H and p were returned.
I am working with a pandas dataframe, so I converted the two columns of interest (in the future I may need more than two) into two arrays, L_arr and E_arr. Then I ran:
import scipy.stats as stats
stats.kruskal(L_arr, E_arr)
The result I got:
KruskalResult(statistic=1.2752179327521276, pvalue=0.2587900768563777)
Is there some way for me to get the effect size as well?
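One way is to derive an effect size from H yourself, since scipy only returns the statistic and p value. A minimal sketch, using the eta-squared and epsilon-squared formulas often attributed to Tomczak & Tomczak (2014), with made-up arrays standing in for L_arr and E_arr:
import numpy as np
import scipy.stats as stats

# made-up stand-ins for the two converted columns
L_arr = np.array([1.2, 3.4, 2.2, 5.1, 4.0])
E_arr = np.array([2.0, 2.9, 3.8, 1.1, 6.2])

H, p = stats.kruskal(L_arr, E_arr)

k = 2                          # number of groups
n = len(L_arr) + len(E_arr)    # total number of observations

eta_squared = (H - k + 1) / (n - k)           # eta-squared based on H
epsilon_squared = H / ((n**2 - 1) / (n + 1))  # epsilon-squared alternative

print(H, p, eta_squared, epsilon_squared)
Treat the formulas as a sketch to verify against your preferred reference rather than a definitive recipe.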
I am using the numpy rate function in order to mimic the Excel Rate function on loans.
The function returns the correct result when working with a subset of my dataframe (1 million records).
However, when working with the entire dataframe (over 10 million records), it returns null results for all.
Could this be a memory issue? If that is the case, how can it be solved?
I have already tried to chunk the data and use a while/for loop to calculate, but this didn't solve the problem.
This worked (though not when I looped through the 10 million records):
test = df2.iloc[:1000000,:]
test = test.loc[:, ['LoanTerm', 'Instalment', 'LoanAmount']]
test['True_Effective_Rate'] = ((1+np.rate(test['LoanTerm'],-test['Instalment'],test['LoanAmount'],0))**12-1)*100
I am trying to get this to work:
df2['True_Effective_Rate'] = ((1+np.rate(df2['LoanTerm'],-df2['Instalment'],df2['LoanAmount'],0))**12-1)*100
I see a similar question has been asked in the past where all the returned values are nulls when one of the parameter inputs is incorrect:
Using numpy.rate, on numpy array returns nan's unexpectedly
My dataframe doesn't have 0 values though. How can I prevent this from happening?
You can use apply to calculate this value once per row, so only the invalid rows will be NaN, not the entire result.
import pandas as pd
import numpy_financial as npf  # I get a warning using np.rate

i = {
    'LoanAmount': [5_000, 20_000, 15_000, 50_000.0, 14_000, 1_000_000, 10_000],
    'LoanTerm': [72, 12, 60, 36, 72, 12, -1],
    'Instalment': [336.0, 5000.0, 333.0, 0.0, -10, 1000.0, 20],
}
df = pd.DataFrame(i)
df.apply(lambda x: npf.rate(nper=x.LoanTerm, pv=x.LoanAmount, pmt=-1 * x.Instalment, fv=0), axis=1)
This will be slower for large datasets since you cannot take advantage of vectorisation.
You can also filter your dataframe down to only the valid rows. It is hard to reproduce what exactly is invalid, since you are not sharing the inputs, but in my example above both the loan term and the instalment must be > 0.
valid = df.loc[(df.Instalment > 0) & (df.LoanTerm > 0)]
npf.rate(nper=valid.LoanTerm, pv=valid.LoanAmount, pmt=-1 * valid.Instalment, fv=0)
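If you want the results lined up with the original dataframe, a small sketch (reusing the df and npf from the example above) is to start the column as NaN and fill in only the valid rows:
import numpy as np

# reuses df and npf from the example above
df['True_Effective_Rate'] = np.nan
valid = df.loc[(df.Instalment > 0) & (df.LoanTerm > 0)]
r = npf.rate(nper=valid.LoanTerm, pv=valid.LoanAmount, pmt=-1 * valid.Instalment, fv=0)
df.loc[valid.index, 'True_Effective_Rate'] = ((1 + r) ** 12 - 1) * 100
Rows that were filtered out simply stay NaN instead of turning the whole column into nulls.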
Can anyone explain why the second method of computing the log change yields a numpy array, discarding the index, instead of a DataFrame? If I wrap the result in a DataFrame I get one with an integer-based index. The first method works as desired. Thanks for any insight.
import numpy as np
import pandas as pd
import pandas_datareader as pdr
aapl = pdr.get_data_yahoo('AAPL')
close = pd.DataFrame(aapl['Close'])
change = np.log(close['Close'] / close['Close'].shift(1))
another_change = np.diff(np.log(close['Close']))
I can't find documentation to back this up, but it seems that the type returned is being converted to ndarray when there's a reduction in dimension from the Series input. This happens with diff but not with log.
Taking the simple example:
x = pd.Series(range(5))
change = np.log(x / x.shift(1)) # Series of float64 of length 5
another_change = np.diff(np.log(x)) # array of float64 of length 4
We can observe that x / x.shift(1) is still a 5-element Series (even though elements 0 and 1 are NaN and inf). So np.log, which doesn't reduce dimension, will still return a 5-element something, which matches the dimensionality of x.
However, np.diff does reduce dimension -- it is supposed to return (according to doc)
diff : ndarray
The n-th differences. The shape of the output is the same as a except along axis where the dimension is smaller by n. [...]
The next sentence appears in the above doc for numpy 1.13 but not 1.12 and earlier:
[...] The type of the output is the same as that of the input.
So the type of the output is still an array-like structure, but because the dimension is reduced, perhaps it doesn't get converted back to the array-like type of the input (a Series), at least in versions 1.12 and earlier.
That's my best guess.
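A practical way to keep the index, if that is what you are after, is to stay inside pandas and use the Series diff method instead of np.diff; a small sketch:
import numpy as np
import pandas as pd

x = pd.Series(range(1, 6), index=pd.date_range('2020-01-01', periods=5))

# np.log on a Series returns a Series (same index); Series.diff also keeps
# the index, putting NaN in the first position instead of shortening it
log_change = np.log(x).diff()
print(log_change)
Mathematically this is the same as np.log(x / x.shift(1)), so the two approaches give the same values, with the index intact.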
I've currently got a set of data as you can see here:
I am trying to use the .std() and .mean() functions within pandas to find the standard deviation and mean so I can reject outliers. Unfortunately I keep getting the error shown below the piece of code. I have no idea why; it might be because the headers are not numerical, but I am not sure.
def reject_outliers(new1, m=3):
return new1[abs(new1 - np.mean(new1)) < m * np.std(new1)]
new2 = reject_outliers(new1, m=3)
new2.to_csv('final.csv')
ValueError: can only convert an array of size 1 to a Python scalar
Isolate the numeric columns and only apply the transformation to them:
# get list of numeric columns
numcols = list(new1.select_dtypes(include=['number']).columns.values)
# run function only on numeric columns
new1[numcols] = reject_outliers(new1[numcols], m=3)
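Putting it together, a minimal self-contained sketch with made-up data standing in for new1 (the real data isn't shown):
import numpy as np
import pandas as pd

def reject_outliers(new1, m=3):
    # values outside m standard deviations of the mean become NaN via the mask
    return new1[abs(new1 - np.mean(new1)) < m * np.std(new1)]

# made-up data: one text column plus two numeric columns
new1 = pd.DataFrame({'label': ['a', 'b', 'c', 'd'],
                     'x': [1.0, 2.0, 1.5, 1.8],
                     'y': [10.0, 11.0, 9.0, 10.5]})

# get list of numeric columns and run the function only on them
numcols = list(new1.select_dtypes(include=['number']).columns.values)
new1[numcols] = reject_outliers(new1[numcols], m=3)
new1.to_csv('final.csv')
With the text column excluded, the ValueError goes away; whether any value is actually dropped depends on your data and the choice of m.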
I am doing some data handling based on a DataFrame with the shape (135150, 12), so double-checking my results manually is no longer practical.
I encountered some 'strange' behavior when I tried to check if an element is part of the dataframe or a given column.
This behavior is reproducible with even smaller dataframes as follows:
import numpy as np
import pandas as pd
start = 1e-3
end = 2e-3
step = 0.01e-3
arr = np.arange(start, end+step, step)
val = 0.0019
df = pd.DataFrame(arr, columns=['example_value'])
print(val in df) # prints `False`
print(val in df['example_value']) # prints `True`
print(val in df.values) # prints `False`
print(val in df['example_value'].values) # prints `False`
print(df['example_value'].isin([val]).any()) # prints `False`
Since I am very much a beginner in data analysis, I am not able to explain this behavior.
I know that I am using different approaches involving different datatypes (like pd.Series and np.ndarray) to check whether the given value exists in the dataframe, and I am aware that with numpy arrays machine precision comes into play.
However, in the end I need to implement several functions to filter the dataframe and count the occurrences of some values, which I have successfully done several times before using boolean columns built from comparisons like > and <.
But in this case I need to filter by an exact value and count its occurrences, which is what led me to the issue described above.
So could anyone explain, what's going on here?
The underlying issue, as Divakar suggested, is floating point precision. Because DataFrames/Series are built on top of numpy, there isn't really a penalty for using numpy methods though, so you can just do something like:
df['example_value'].apply(lambda x: np.isclose(x, val)).any()
or
np.isclose(df['example_value'], val).any()
both of which correctly return True.
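Since the end goal is to filter and count occurrences of an exact value, the same tolerance-based comparison can drive both; a small sketch rebuilding the example data from the question:
import numpy as np
import pandas as pd

arr = np.arange(1e-3, 2e-3 + 0.01e-3, 0.01e-3)
df = pd.DataFrame(arr, columns=['example_value'])
val = 0.0019

mask = np.isclose(df['example_value'], val)
print(mask.sum())   # number of rows matching within tolerance
print(df[mask])     # the matching rows, with their original index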