Can anyone explain why the second method of computing the log change yields a NumPy array, discarding the index, instead of a DataFrame? If I wrap the result in a DataFrame I get one with an integer-based index. The first method works as desired. Thanks for any insight.
import numpy as np
import pandas as pd
import pandas_datareader as pdr

aapl = pdr.get_data_yahoo('AAPL')
close = pd.DataFrame(aapl['Close'])

# Method 1: ratio of consecutive closes, then log -- keeps the date index
change = np.log(close['Close'] / close['Close'].shift(1))

# Method 2: log first, then np.diff -- comes back as a plain ndarray
another_change = np.diff(np.log(close['Close']))
I can't find documentation to back this up, but it seems that the type returned is being converted to ndarray when there's a reduction in dimension from the Series input. This happens with diff but not with log.
Taking the simple example:
x = pd.Series(range(5))
change = np.log(x / x.shift(1)) # Series of float64 of length 5
another_change = np.diff(np.log(x)) # array of float64 of length 4
We can observe that x / x.shift(1) is still a 5-element Series (even though elements 0 and 1 are NaN and inf). So np.log, which doesn't reduce dimension, still returns a 5-element result, matching the dimensionality of x.
However, np.diff does reduce dimension -- according to the docs it is supposed to return:
diff : ndarray
The n-th differences. The shape of the output is the same as a except along axis where the dimension is smaller by n. [...]
The next sentence appears in the above doc for numpy 1.13 but not 1.12 and earlier:
[...] The type of the output is the same as that of the input.
So the type of the output is still an array-like structure, but because of the dimension being reduced, perhaps it doesn't get re-converted to a Series (the array-like input). At least in versions 1.12 and earlier.
That's my best guess.
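If the goal is simply to keep the index, a minimal sketch of a workaround (my own example, not from the question) is to use the pandas diff method on the logged Series instead of np.diff:

import numpy as np
import pandas as pd

x = pd.Series(range(1, 6), index=list('abcde'))

# np.diff reduces the length by one and hands back a plain ndarray,
# so the pandas index is lost
arr_change = np.diff(np.log(x))    # ndarray of length 4

# Series.diff keeps the index; the first element becomes NaN instead of being dropped
ser_change = np.log(x).diff()      # Series of length 5, index 'a'..'e'

print(type(arr_change), type(ser_change))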
Related
I currently have a pretty large 3D numpy array (atlasarray - 14M elements with type int64) in which I want to create a duplicate array where every element is a float based on a separate dataframe lookup (organfile).
I'm very much a beginner, so I'm sure that there must be a better (quicker) way to do this. Currently, it takes around 90s, which isn't ages but I'm sure can probably be reduced. Most of this code below is taken from hours of Googling, so surely isn't optimised.
import numpy as np
import pandas as pd
from tqdm import tqdm

organfile = pd.read_excel('/media/sf_VMachine_Shared_Path/ValidationData/ICRP110/AF/AF_OrgansSimp.xlsx')

densityarray = atlasarray
densityarray = densityarray.astype(float)

# iterate over every element and overwrite it with the corresponding density
for idx, x in tqdm(np.ndenumerate(densityarray), total=densityarray.size):
    densityarray[idx] = organfile.loc[x, 'Density']
All of the elements in the original numpy array are integers which correspond to an OrganID. I used pandas to read in the key from an Excel file and generate a 4-column dataframe; in this particular case I want to extract the 4th column (Density, which is a float). OrganIDs go up to 142. The first rows of the table look like this:
| OrganID | OrganName | TissueType | Density |
|---------|-----------|------------|---------|
| 0       | Air       | 53         | 0.001   |
| 1       | Adrenal   | 43         | 1.030   |
Any recommendations on ways I can speed this up would be gratefully received.
Put the density from the dataframe into a numpy array:
density = np.array(organfile['Density'])
Then run:
density[atlasarray]
Don't use loops; they are slow. The following example with 14M elements takes less than 1 second to run:
density = np.random.random(143)                           # one density per OrganID (0 to 142)
atlasarray = np.random.randint(0, 143, (1000, 1000, 14))  # 14M random organ IDs
densityarray = density[atlasarray]                        # fancy indexing does the whole lookup at once
Shape of densityarray:
print(densityarray.shape)
(1000, 1000, 14)
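If the OrganID values are not a contiguous 0..N range in the spreadsheet, one hedged way to build the lookup vector is to reindex the Density column by OrganID first; the inline organfile below is just a stand-in mirroring the table from the question:

import numpy as np
import pandas as pd

# Stand-in for the real spreadsheet, using the column names shown in the question
organfile = pd.DataFrame({'OrganID': [0, 1],
                          'OrganName': ['Air', 'Adrenal'],
                          'TissueType': [53, 43],
                          'Density': [0.001, 1.030]})

# Dense lookup vector indexed by OrganID; any missing IDs become NaN
max_id = organfile['OrganID'].max()
density = (organfile.set_index('OrganID')['Density']
           .reindex(range(max_id + 1))
           .to_numpy())

atlasarray = np.array([[0, 1], [1, 0]])   # tiny stand-in for the 14M-element atlas
densityarray = density[atlasarray]        # [[0.001, 1.03], [1.03, 0.001]]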
I am working with a large dataset with a million rows and 1000 columns. I have already referred to this post here; please don't mark it as a duplicate.
If sample data is required, you can use the snippet below:
import numpy as np
import pandas as pd

m = pd.DataFrame(np.array([[1, 0],
                           [2, 3]]))
I have some continuous variables with 0 values in them.
I would like to compute a logarithmic transformation of all those continuous variables.
However, I encounter a divide-by-zero error. So I tried the suggestions below, based on the post linked above:
df['salary'] = np.log(df['salary'], where=0<df['salary'], out=np.nan*df['salary'])  # not working: crashes with a "python stopped working" error
from numpy import ma
ma.log(df['app_reg_diff']) # error
My questions are as follows:
a) How to avoid divide by zero error when applying for 1000 columns? How to do this for all continuous columns?
b) How to exclude zeros from log transformation and get the log values for rest of the non-zero observations?
You can replace the zero values with a value you like and do the logarithm operation normally.
import numpy as np
import pandas as pd
m = pd.DataFrame(np.array([[1,0], [2,3]]))
m[m == 0] = 1
print(np.log(m))
With this replacement the zero entries simply become zeros in the output, since log(1) = 0. You could, for example, replace them with -1 instead to get NaN.
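To address question (b) directly, a minimal sketch (with made-up column names) that excludes the zeros from the transformation across all numeric columns, rather than replacing them:

import numpy as np
import pandas as pd

df = pd.DataFrame({'salary': [0.0, 1000.0, 2500.0],
                   'app_reg_diff': [3.0, 0.0, 7.0]})

# Mask non-positive entries as NaN first, then take the log;
# NaN passes through np.log unchanged, so zeros produce no divide-by-zero warning
num_cols = df.select_dtypes(include='number').columns
logged = np.log(df[num_cols].where(df[num_cols] > 0))
print(logged)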
I have a question on the difference between just using Python's built-in max and np.max on a list or array.
Is the only difference here the time it takes for Python to return the result?
They may differ in edge cases, such as a list containing NaNs.
import numpy as np
a = max([2, 4, np.nan]) # 4
b = np.max([2, 4, np.nan]) # nan
NumPy propagates NaN in such cases, while the result of Python's max depends on the order of the elements: every comparison with NaN is False, so a leading NaN is never replaced.
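A small sketch of that order dependence (my own example, extending the one above):

import numpy as np

print(max([2, 4, np.nan]))     # 4   -- NaN never "wins" a comparison, so 4 survives
print(max([np.nan, 2, 4]))     # nan -- the leading NaN is never replaced
print(np.max([2, 4, np.nan]))  # nan -- NumPy propagates NaN regardless of position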
There are also subtle issues regarding data types:
a = max([10**n for n in range(20)]) # a is an integer
b = np.max([10**n for n in range(20)]) # b is a float
And of course there are the running time differences documented in "numpy.max or max? Which one is faster?".
Generally, one should use max for Python lists and np.max for NumPy arrays to minimize the number of surprises. For instance, my second example is not really about np.max but about the data type conversion: to use np.max the list is first converted to a NumPy array, but elements like 10**19 are too large to be represented by NumPy integer types so they become floats.
I am doing some data handling based on a DataFrame with the shape of (135150, 12) so double checking my results manually is not applicable anymore.
I encountered some 'strange' behavior when I tried to check if an element is part of the dataframe or a given column.
This behavior is reproducible with even smaller dataframes as follows:
import numpy as np
import pandas as pd
start = 1e-3
end = 2e-3
step = 0.01e-3
arr = np.arange(start, end+step, step)
val = 0.0019
df = pd.DataFrame(arr, columns=['example_value'])
print(val in df) # prints `False`
print(val in df['example_value']) # prints `True`
print(val in df.values) # prints `False`
print(val in df['example_value'].values) # prints `False`
print(df['example_value'].isin([val]).any()) # prints `False`
Since I am very much a beginner in data analysis, I am not able to explain this behavior.
I know that I am using different approaches involving different datatypes (like pd.Series, np.ndarray or np.array) to check if the given value exists in the dataframe. I am also aware that machine precision comes into play when using np.array or np.ndarray.
However, in the end I need to implement several functions to filter the dataframe and count the occurrences of some values, which I have done successfully several times before using boolean columns combined with operations like > and <.
But in this case I need to filter by the exact value and count its occurrences, which is what led me to the issue described above.
So could anyone explain, what's going on here?
The underlying issue, as Divakar suggested, is floating point precision. Because DataFrames/Series are built on top of numpy, there isn't really a penalty for using numpy methods though, so you can just do something like:
df['example_value'].apply(lambda x: np.isclose(x, val)).any()
or
np.isclose(df['example_value'], val).any()
both of which correctly return True.
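Since the stated end goal is filtering and counting occurrences, here is a short sketch of how the same np.isclose mask can be reused for both, using the setup from the question:

import numpy as np
import pandas as pd

arr = np.arange(1e-3, 2e-3 + 0.01e-3, 0.01e-3)
df = pd.DataFrame(arr, columns=['example_value'])
val = 0.0019

mask = np.isclose(df['example_value'], val)
print(mask.any())   # True, as above
print(mask.sum())   # number of rows approximately equal to val
print(df[mask])     # the matching rows themselves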
I would like to combine an array full of floats with an array full of strings. Is there a way to do this?
(I am also having trouble rounding my floats; insert is changing them to scientific notation. I am unable to reproduce this with a small example.)
import numpy as np

A = np.array([[1/3, 257/35], [3, 4], [5, 6]], dtype=float)
B = np.array([7, 8, 9], dtype=float)
C = np.insert(A, A.shape[1], B, axis=1)   # append B as the last column of A
print(np.around(B, decimals=2))
D = np.array(['name1', 'name2', 'name3'])
How do I append D onto the end of C in the same way that I appended B onto A (insert D as the last column of C)?
I suspect that there is a type issue with having strings and floats in the same array. It would also answer my question if there were a way to change a float (or a number in scientific notation; my numbers are displayed as '5.02512563e-02') to a string with about 4 digits (.0502).
I believe concatenate will not work, because the array dimensions are (3,3) and (3,): D is a 1-D array, so D.T is no different from D. Also, when I plug this in I get "ValueError: all the input arrays must have same number of dimensions."
I don't care about accuracy loss due to appending, as this is the last step before I print.
Use dtype=object in your numpy array, like below:
np.array([1, 'a'], dtype=object)
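Expanding on that idea, a hedged sketch of how the object dtype lets the float columns and the name column live in one array (C and D as defined in the question):

import numpy as np

C = np.array([[1/3, 257/35, 7.0], [3.0, 4.0, 8.0], [5.0, 6.0, 9.0]])
D = np.array(['name1', 'name2', 'name3'])

# Cast both pieces to object so nothing is coerced to a single string dtype
E = np.concatenate((C.astype(object), D.reshape(-1, 1).astype(object)), axis=1)
print(E)   # floats stay floats, names stay strings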
Try making D a numpy array first, then transposing and concatenating with C:
D=np.array([['name1','name2','name3']])
np.concatenate((C, D.T), axis=1)
See the documentation for concatenate for explanation and examples:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html
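On the rounding part of the question, one hedged option is to format the floats as fixed-precision strings before concatenating, so nothing ends up in scientific notation (the exact values here are made up):

import numpy as np

C = np.array([[0.0502512563, 7.342857, 7.0], [3.0, 4.0, 8.0], [5.0, 6.0, 9.0]])
D = np.array([['name1', 'name2', 'name3']])

C_str = np.char.mod('%.4f', C)                   # e.g. 0.0502512563 -> '0.0503'
combined = np.concatenate((C_str, D.T), axis=1)  # names appended as the last column
print(combined)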
numpy arrays support only one data type per array, and converting the floats to strings is not ideal, since the string representation will only approximate the original values.
Try using pandas, which supports multiple data types in a single column (as an object column).
import numpy as np
import pandas as pd

np_ar1 = np.array([1.3, 1.4, 1.5])
np_ar2 = np.array(['name1', 'name2', 'name3'])
df1 = pd.DataFrame({'ar1': np_ar1})
df2 = pd.DataFrame({'ar2': np_ar2})

# Stacks the floats and the names into one object-dtype Series
pd.concat([df1.ar1, df2.ar2], axis=0)
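If what you actually want is the names as a separate column next to the floats (as in the original question), a minimal sketch with hypothetical column names:

import numpy as np
import pandas as pd

C = np.array([[1/3, 257/35, 7.0], [3.0, 4.0, 8.0], [5.0, 6.0, 9.0]])
D = np.array(['name1', 'name2', 'name3'])

# Each column keeps its own dtype: the floats stay floats, the names stay strings
df = pd.DataFrame(C, columns=['col1', 'col2', 'col3'])
df['name'] = D
print(df.round(4))   # rounding only touches the float columns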