Unable to update specific entries in a DataFrame - python

I have a DataFrame where some entries in column_1 are NaN. I want to replace them with the corresponding values from column_2. Both columns hold float64 values.
I tried the following but strangely it does not update the values.
ix = np.isnan(mydf.loc[:, 'column_1'])
mydf[ix]['column_1'] = mydf[ix]['column_2']
Really strange, since I can perfectly see that:
mydf[ix]['column_1']
is the series with the NaN values
and that
mydf[ix]['column_2']
has valid values.
Why isn't it working?
I can't even do:
mydf[ix]['column_1'] = 45

This is an example of chained indexing. For getting values this is generally OK; for setting values, however, it may or may not work, because you may be setting values on a copy. It is always better to set via a single indexer such as .loc for multi-dimensional setting.
In this example, use mydf.loc[ix, 'column_1'] = 45
See the pandas documentation on returning a view versus a copy for a more complete explanation.
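For the original question, a minimal sketch of the .loc fix with made-up data (assuming both columns live in mydf, as the question suggests):

import numpy as np
import pandas as pd

mydf = pd.DataFrame({'column_1': [1.0, np.nan, 3.0, np.nan],
                     'column_2': [10.0, 20.0, 30.0, 40.0]})

# Boolean mask of the rows where column_1 is NaN
ix = mydf['column_1'].isna()

# A single .loc call, so the assignment hits the original frame
# rather than a temporary copy
mydf.loc[ix, 'column_1'] = mydf.loc[ix, 'column_2']

Equivalently, mydf['column_1'] = mydf['column_1'].fillna(mydf['column_2']) does the whole replacement in one step.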

Related

How to create a partial dataframe from another dataframe with a different index

In Python 3.9, I have declared the following pandas DataFrame, raw:
>>> print(raw)
Date AAPL.O MSFT.O INTC.O AMZN.O GS.N SPY \
1 2010-01-04 30.572827 30.950 20.88 133.90 173.08 113.33
...
from which I would like to extract one column (e.g. AAPL.O) and the corresponding Date (as index)
I thought that a possible way would be
raw2 = pandas.DataFrame(data=raw['AAPL.O'], index=raw['Date'])
This doesn't return any error, however, when I print it, I get:
            AAPL.O
Date
2010-01-04     NaN
2010-01-05     NaN
...
If I remove index=raw['Date'] from the declaration
raw2 = pandas.DataFrame(data=raw['AAPL.O'])
it works as expected (except for the fact that I don't have the index that I wanted).
So I don't understand why passing a column of the original DataFrame as data and another column as index brings up these unexpected NaNs.
While there are better ways to achieve my goal, I expected this one to work as well. So probably I am missing something more fundamental which I would like to understand
Well, the index is not part of the data, and that is exactly the problem: when data is a Series, the DataFrame constructor aligns (reindexes) that Series to the index you pass. Your Series is labelled 0, 1, 2, ..., while the new index holds dates, so no labels match and every value comes out as NaN. (Passing a df.index whose labels match the data would work.)
Anyway, what you can do is:
raw2 = pandas.DataFrame(data=raw[['AAPL.O', 'Date']])
raw2 = raw2.set_index('Date')
In this case, I have chosen the data from both columns. In fact, I might just do raw[['AAPL.O','Date']].copy().
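To make the alignment behaviour concrete, here is a small sketch with toy data (the real raw frame is assumed to look roughly like the printout above):

import pandas as pd

raw = pd.DataFrame({'Date': ['2010-01-04', '2010-01-05'],
                    'AAPL.O': [30.57, 30.63]})

# Passing a Series as `data` makes the constructor align it to the
# given index; its labels (0, 1) never match the dates, so every
# value becomes NaN:
bad = pd.DataFrame(data=raw['AAPL.O'], index=raw['Date'])

# Setting the index on a column selection sidesteps the alignment:
good = raw.set_index('Date')[['AAPL.O']]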

pandas - vectorized formula computation with nans

I have a DataFrame (called signal) that is a simple time series with 5 columns. This is what its .describe() looks like:
ES NK NQ YM
count 5294.000000 6673.000000 4798.000000 3415.000000
mean -0.000340 0.000074 -0.000075 -0.000420
std 0.016726 0.018401 0.023868 0.015399
min -0.118724 -0.156342 -0.144667 -0.103101
25% -0.008862 -0.010297 -0.011481 -0.008162
50% -0.001422 -0.000590 -0.001747 -0.001324
75% 0.007069 0.009163 0.009841 0.006304
max 0.156365 0.192686 0.181245 0.132630
I want to apply a simple function on every single row, and receive back a matrix with the same dimensions:
weights = -2 * signal.subtract(signal.mean(axis=1), axis=0).divide(
    signal.sub(signal.mean(axis=1), axis=0).abs().sum(axis=1), axis=0)
However, when I run this line, the program gets stuck. I believe the issue comes from the differing lengths of the columns and the presence of NaNs. Dropping or filling the NaNs is not an option: for any given row that has a NaN, I want that NaN simply excluded from the computation. A temporary workaround would be to do this iteratively using .iterrows(), but that is not an efficient solution.
Are there any smart solutions to this problem?
The thing is, the pandas mean and sum methods already exclude NaN values by default (see the description of the skipna keyword in the docs). Additionally, subtract and divide allow for the use of a fill_value keyword arg:
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
So you may be able to get what you want by setting fill_value=0 in the calls to subtract, and fill_value=1 in the calls to divide.
However, I suspect that the default behavior (NaN is ignored in mean and sum, NaN - anything = NaN, NaN / anything = NaN) is what you actually want. In that case, your problem isn't directly related to NaNs, and you're going to have to clarify your statement "when I run this line, the program gets stuck" in order to get a useful answer.
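A small sketch with made-up numbers, showing that the default skipna behavior already does the row-wise exclusion the question asks for:

import numpy as np
import pandas as pd

signal = pd.DataFrame({'ES': [0.01, np.nan, -0.02],
                       'NK': [0.02, 0.01, 0.00],
                       'NQ': [np.nan, -0.01, 0.01]})

# mean/sum skip NaNs by default (skipna=True), so each row's mean is
# taken over the values that are actually present in that row
demeaned = signal.sub(signal.mean(axis=1), axis=0)

# Row-wise normalization; NaN cells stay NaN, everything else is finite
weights = -2 * demeaned.div(demeaned.abs().sum(axis=1), axis=0)

On this input, every cell of weights that was present in signal comes out as a finite number; only the cells that were NaN in signal stay NaN.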

df.set_index returns key error python pandas dataframe

I have this pandas DataFrame, and I have to convert some of the items into coordinates (meaning they have to be floats), but the conversion picks up the indexes when I try to convert the values. So I tried to set the index to the first thing in the DataFrame, but it doesn't work. I wonder if it has anything to do with the fact that it is a slice of the whole DataFrame, only the section that is "Latitude" and "Longitude".
df = df_volc.iloc(axis=0)[0:, 3:5]
df.set_index("hello", inplace=True, drop=True)
df
and I get a really long error, but this is the last part of it:
KeyError: '34.50'
if I don't do the set_index part I get:
Latitude Longitude
0 34.50 131.60
1 -23.30 -67.62
2 14.50 -90.88
I just wanna know if it's possible to get rid of the indexes or set them.
The parameter you need to pass to the set_index() function is keys: a column label or a list of column labels/arrays. In your scenario, it seems that "hello" is not a column name.
I just wanna know if it's possible to get rid of the indexes or set them.
It is possible to replace the 0, 1, 2 index with something else, though it doesn't sound like it's necessary for your end goal:
to convert some of the items into [...] floats
To achieve this, you could overwrite the existing values by using astype():
df['Latitude'] = df['Latitude'].astype('float')
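A short sketch of both points, using a hypothetical stand-in for the Latitude/Longitude slice of df_volc:

import pandas as pd

# Hypothetical stand-in for df_volc's coordinate columns
df = pd.DataFrame({'Latitude': ['34.50', '-23.30', '14.50'],
                   'Longitude': ['131.60', '-67.62', '-90.88']})

# Cast both coordinate columns to floats; the default 0, 1, 2 index
# is just a set of row labels and does not interfere with this
df = df.astype({'Latitude': float, 'Longitude': float})

# set_index only accepts existing column labels:
df2 = df.set_index('Latitude')   # fine: 'Latitude' is a column
# df.set_index('hello')          # KeyError: 'hello' is not a column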

Values being altered in numpy array

So I have a 2D numpy array (256, 256) containing values between 0 and 10, which is essentially an image. I need to take the 0 values and set them to NaN so that I can plot the array using a specific library (APLpy). However, whenever I try to change all of the 0 values, some of the other values get altered, in some cases to 100 times their original value (no idea why).
The code I'm using is:
for index, value in np.ndenumerate(tex_data):
    if value == 0:
        tex_data[index] = 'NaN'
where tex_data is the data array from which I need to remove the zeros. Unfortunately I can't just use a mask for the values I don't need, as APLpy won't accept masked arrays as far as I can tell.
Is there anyway I can set the 0 values to NaN without changing the other values in the array?
Use boolean (fancy) indexing, like this:
tex_data[tex_data==0] = np.nan
I don't know why your original code was failing. It looks correct to me, although terribly inefficient.
Using floating-point rules,
tex_data / tex_data * tex_data
also does the job here, since 0/0 evaluates to NaN while x/x * x gives back x for every other value.
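One caveat worth adding to either approach: NaN only exists for float dtypes, so an integer array has to be cast first. A minimal sketch:

import numpy as np

tex_data = np.array([[0, 2], [7, 0]])

# NaN is only representable in float arrays; cast before assigning
tex_data = tex_data.astype(float)

# Boolean indexing: select every zero entry and overwrite it in place
tex_data[tex_data == 0] = np.nan
# tex_data is now [[nan, 2.], [7., nan]]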

How to calculate percentage and store it in a new column?

I have the following dataframe and I want to add a new column with the percentage value:
df =
TIME_1 TIME_2
80 150
120 20
I want to get a new column TIME_1_PROC that will store the percentage that TIME_1 makes up of TIME_1 + TIME_2.
This is my code, but it triggers a warning:
df.TIME_1_PROC = (df.TIME_1 * 100 / (df.TIME_1 + df.TIME_2))
Warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
This creates a new variable:
df['TIME1_PROC'] = (df.TIME_1 * 100 / (df.TIME_1 + df.TIME_2))
Out[27]:
TIME_1 TIME_2 TIME1_PROC
0 80 150 34.782609
1 120 20 85.714286
Generally speaking...
Just a quick elaboration on @Imo's correct answer. Most of the time you are better off creating and referring to columns like this:
df['x']
rather than this:
df.x
And when you are creating a new variable, you MUST use the first method. Even for existing variables, the first way is considered better, because you avoid potential errors if you happen to have a column called "index": if you type df.index, will that return the index or the column named "index"? Of course, we all use the attribute style as a shortcut on occasion, so perhaps a more reasonable rule of thumb is to use the shortcut only on the right-hand side.
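A toy illustration of the pitfall (the column name "index" is chosen deliberately):

import pandas as pd

df = pd.DataFrame({'index': [10, 20], 'x': [1, 2]})

df['index']    # the column named "index"
df.index       # the RangeIndex: the attribute wins over the column

df.y = [3, 4]     # creates a plain attribute, not a column
                  # (newer pandas versions warn about this)
df['y'] = [3, 4]  # actually adds a column
list(df.columns)  # ['index', 'x', 'y']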
This particular example...
All that said, the behavior by pandas here does not seem ideal. The warning message you got here is a common one in pandas and often ignorable (as it is here). But what is unfortunate is that you didn't get an error message about attempting to access a non-existent column. And furthermore consider the following:
df['TIME_1_PROC'] # KeyError: 'TIME_1_PROC'
df.TIME_1_PROC
0 34.782609
1 85.714286
dtype: float64
So your new column did get created, but as an attribute rather than a column. To be more explicit: usually when we use an attribute-style reference, pandas interprets it as referring to a column, but in this case the assignment actually created a plain attribute (and that's not what you want).
Use pd.set_option('mode.chained_assignment', None) to avoid such messages.
