I experienced some quite unexpected behavior when using the clip function of pandas.
So, here is a reproducible example:
import pandas as pd
df_init = pd.DataFrame({'W': [-1.00, 0.0, 0.0, 0.3, 0.5, 1.0]})
df_init['W_EX'] = df_init['W'] + 0.1
df_init['W_EX'].clip(upper=1.0, inplace=True)
df_init.loc[df_init['W']==-1.0, 'W_EX'] = -1.0
df_init
The output is, as one would expect:
Out[2]:
W W_EX
0 -1.0 -1.0
1 0.0 0.1
2 0.0 0.1
3 0.3 0.4
4 0.5 0.6
5 1.0 1.0
However, when I inspect a specific value:
df_init.loc[df_init['W']==-1.0, 'W_EX']
I see the following output:
Out[3]:
0 -0.9
Name: W_EX, dtype: float64
Although I used .loc to overwrite the first value in the column, and although the new value shows up when I print the data frame, when I select the column with .loc and a boolean row filter I still see the value the column had before the .clip call.
Now it gets more complicated. If I inspect the series on the new column, I can see the value has been indeed updated:
df_init.loc[df_init['W']==-1.0, ['W_EX']]
Out[4]:
W_EX
0 -1.0
And lastly, like in Schrödinger's cat experiment, if I now go back and look at the column values again, after having inspected the single-column frame, I actually see that the value is indeed the one I would have expected in the first place (the same expression as in Out[3]):
df_init.loc[df_init['W']==-1.0, 'W_EX']
Out[5]:
0 -1.0
Name: W_EX, dtype: float64
If I skip the .clip call, all is fine. Would someone more knowledgeable than myself please explain to me what is going on here?
It may be better to default to explicit assignment so that it's clearer what's happening. inplace=True performed on this slice of the dataframe doesn't consistently assign back to the frame as expected.
There's some debate on whether the flag should stick around at all. In pandas, is inplace = True considered harmful, or not?
df_init['W_EX'] = df_init['W_EX'].clip(upper=1.0)
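For reference, a minimal sketch of the question's example rewritten with explicit assignment (same data as above); the later .loc overwrite and the single-column lookup then agree:
import pandas as pd

df_init = pd.DataFrame({'W': [-1.00, 0.0, 0.0, 0.3, 0.5, 1.0]})
df_init['W_EX'] = df_init['W'] + 0.1

# Explicit assignment instead of clip(..., inplace=True) on the column
df_init['W_EX'] = df_init['W_EX'].clip(upper=1.0)
df_init.loc[df_init['W'] == -1.0, 'W_EX'] = -1.0

print(df_init.loc[df_init['W'] == -1.0, 'W_EX'])   # 0   -1.0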
Related
I've got a pandas dataframe, and I'm trying to fill a new column in the dataframe, which takes the maximum value of two values situated in another column of the dataframe, iteratively. I'm trying to build a loop to do this, and save time with computation as I realise I could probably do it with more lines of code.
for x in ((jac_input.index)):
jac_output['Max Load'][x] = jac_input[['load'][x],['load'][x+1]].max()
However, I keep getting this error during the comparison
IndexError: list index out of range
Any ideas as to where I'm going wrong here? Any help would be appreciated!
Many things are wrong with your current code.
When you do ['abc'][x], x can only take the value 0, and this returns 'abc' because you are indexing a one-element list - not at all what you expect it to do (I imagine you meant to index the Series). Any other value of x raises the IndexError you see.
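A quick way to see what that sub-expression does on its own (hypothetical REPL lines, just to illustrate the list indexing):
['load'][0]   # returns 'load' -- indexing a one-element list
['load'][1]   # IndexError: list index out of range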
For your code to be valid, you should do something like:
jac_input = pd.DataFrame({'load': [1,0,3,2,5,4]})
for x in jac_input.index:
    print(jac_input['load'].loc[x:x+1].max())
output:
1
3
3
5
5
4
Also, when assigning, if you use jac_output['Max Load'][x] = ... you will likely encounter a SettingWithCopyWarning. You should rather use loc: jac_output.loc[x, 'Max Load'] = ....
But you do not need all that, use vectorial code instead!
You can perform rolling on the reversed dataframe:
jac_output['Max Load'] = jac_input['load'][::-1].rolling(2, min_periods=1).max()[::-1]
Or using concat:
jac_output['Max Load'] = pd.concat([jac_input['load'], jac_input['load'].shift(-1)], axis=1).max(1)
output (without assignment):
0 1.0
1 3.0
2 3.0
3 5.0
4 5.0
5 4.0
dtype: float64
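For completeness, a minimal self-contained sketch of both vectorial options (assuming jac_output is a DataFrame sharing jac_input's index, as in the question):
import pandas as pd

jac_input = pd.DataFrame({'load': [1, 0, 3, 2, 5, 4]})
jac_output = pd.DataFrame(index=jac_input.index)

# Option 1: rolling max over the reversed series, then reverse back
jac_output['Max Load'] = jac_input['load'][::-1].rolling(2, min_periods=1).max()[::-1]

# Option 2: row-wise max of the series and its shift(-1) neighbour
jac_output['Max Load'] = pd.concat(
    [jac_input['load'], jac_input['load'].shift(-1)], axis=1
).max(axis=1)

print(jac_output)   # Max Load: 1.0, 3.0, 3.0, 5.0, 5.0, 4.0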
I would like to round the index values of a pandas dataframe with the name results, such that they do not have any decimal values. I use the following code, which I took from here: Round columns in pandas dataframe. So basically I have a column with the name "set_timeslots" and I would like to round its values and then use it as an index.
cols = ['set_timeslots']
results[cols] = results[cols].round(0)
results.set_index('set_timeslots', inplace=True)
However, I still get a decimal value as you can see in the screenshot
Do you know what I have to do in order to get rid of the decimal values? I'd appreciate every comment.
If you need to round and convert to integers, add Series.astype:
results[cols] = results[cols].round(0).astype(int)
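For the full chain the question describes (round, cast to int, then set as index), a minimal sketch on made-up data; the extra value column is only there for illustration:
import pandas as pd

results = pd.DataFrame({'set_timeslots': [1.0, 2.0, 3.0, 4.0, 5.0],
                        'value': [10, 20, 30, 40, 50]})   # hypothetical payload column

cols = ['set_timeslots']
results[cols] = results[cols].round(0).astype(int)
results.set_index('set_timeslots', inplace=True)

print(results.index)   # integer index: 1, 2, 3, 4, 5 -- no trailing .0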
We can't rely on pandas.DataFrame.round() alone in this scenario, because rounding only removes the fractional part of the values while the column keeps its float dtype, so the trailing .0 remains. Let's take our case:
# Import all-Important Libraries
import pandas as pd
# Reproduced Sample 'set_timeslots'
results = pd.DataFrame({
    'set_timeslots': [1.0, 2.0, 3.0, 4.0, 5.0]
})
# Declaration of 'cols' variable for storing 'set_timeslots' column
cols = ['set_timeslots']
# Print result
results[cols]
# Output of above cell:-
set_timeslots
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
# Implementation of the 'round' method:-
results[cols] = results[cols].round(0)
# Print result after round function
results
# Output of above cell:-
set_timeslots
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
Appropriate Solution:-
So, for the conversion of set_timeslots from float to int we can use the pandas.DataFrame.astype() method.
Code for the above-mentioned scenario is stated below:-
# Implementation of 'astype(int)' function
results[cols] = results[cols].astype(int)
# Print result after the Conversion
results
# Output of above cell:-
set_timeslots
0 1
1 2
2 3
3 4
4 5
As you can see, we have achieved our desired output, which is to remove the decimal points from the set_timeslots column. Hope this solution helps clarify the round() and astype() functions.
To learn more about pandas.DataFrame.round() and pandas.DataFrame.astype(), see the pandas documentation.
I am using python 3.6.
I have a pandas.core.frame.DataFrame and would like to filter the entire DataFrame based on if the column called "Closed Date" is not null. In other words, if it is null in the "Closed Date" column, then remove the whole row from the DataFrame.
My code right now is the following:
data = raw_data.ix[raw_data['Closed Date'].notnull()]
Though it gets the job done, I get a warning message saying the following:
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
I tried this code:
data1 = raw_data.loc[raw_data.notnull(), 'Closed Date']
But get this error:
ValueError: Cannot index with multidimensional key
How do I fix this? Any suggestions?
This should work for you:
data1 = raw_data.loc[raw_data['Closed Date'].notnull()]
.ix was very similar to the current .loc (which is why the correct .loc syntax is equivalent to what you were originally doing with .ix). The difference, according to this detailed answer, is: "ix usually tries to behave like loc but falls back to behaving like iloc if a label is not present in the index"
Example:
Taking this dataframe as an example (let's call it raw_data):
Closed Date x
0 1.0 1.0
1 2.0 2.0
2 3.0 NaN
3 NaN 3.0
4 4.0 4.0
raw_data.notnull() returns this DataFrame:
Closed Date x
0 True True
1 True True
2 True False
3 False True
4 True True
You can't index using .loc based on a dataframe of boolean values. However, when you do raw_data['Closed Date'].notnull(), you end up with a Series:
0 True
1 True
2 True
3 False
4 True
This Series can be passed to .loc as a sort of "boolean filter" to apply to your dataframe.
Alternate Solution
As pointed out by John Clemens, the same can be achieved with raw_data.dropna(subset=['Closed Date']). The documentation for the .dropna method outlines how this can be more flexible in some situations (for instance, allowing you to drop rows or columns in which any or all values are NaN using the how argument, and so on).
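A small sketch showing that both approaches give the same result on the example frame above:
import numpy as np
import pandas as pd

raw_data = pd.DataFrame({'Closed Date': [1.0, 2.0, 3.0, np.nan, 4.0],
                         'x': [1.0, 2.0, np.nan, 3.0, 4.0]})

via_loc = raw_data.loc[raw_data['Closed Date'].notnull()]
via_dropna = raw_data.dropna(subset=['Closed Date'])

print(via_loc.equals(via_dropna))   # True -- only the row with a NaN Closed Date is removed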
I have the following example:
import pandas as pd
import numpy as np
df = pd.DataFrame([(0,2,5), (2,4,None),(7,-5,4), (1,None,None)])
def clean(series):
    start = np.min(list(series.index[pd.isnull(series)]))
    end = len(series)
    series[start:] = series[start-1]
    return series
My objective is to obtain a dataframe in which each row that contains a None value is filled in with the last available numerical value.
So, for example, running this function on just the 3rd row of the dataframe, I would produce the following:
row = df.ix[3]
test = clean(row)
test
0 1.0
1 1.0
2 1.0
Name: 3, dtype: float64
I cannot get this to work using the .apply() method, i.e. df.apply(clean, axis=1)
I should mention that this is a toy example - the custom function I would write in the real one is more dynamic in how it fills the values - so I am not looking for basic utilities like .ffill or .fillna
The apply method didn't work because, when a row is completely filled, your clean function has no index to start from: series.index[pd.isnull(series)] is an empty array for that row, so taking its minimum fails.
So use a condition before altering series data i.e
def clean(series):
    # Create a copy for the sake of safety
    series = series.copy()
    # Alter the series only if there is a None value
    if pd.isnull(series).any():
        start = np.min(list(series.index[pd.isnull(series)]))
        # for a completely filled row,
        # series.index[pd.isnull(series)] would return
        # Int64Index([], dtype='int64')
        end = len(series)
        series[start:] = series[start-1]
    return series
df.apply(clean,1)
Output :
0 1 2
0 0.0 2.0 5.0
1 2.0 4.0 4.0
2 7.0 -5.0 4.0
3 1.0 1.0 1.0
Hope this clarifies why apply didn't work. I also suggest taking the built-in methods into consideration to clean the data, rather than writing functions from scratch.
First, this is the code to solve your toy problem. But this code isn't what you want.
df.ffill(axis=1)
Next, I try to test your code.
df.apply(clean,axis=1)
#...start = np.min(list(series.index[pd.isnull(series)]))...
#=>ValueError: ('zero-size array to reduction operation minimum
# which has no identity', 'occurred at index 0')
To understand the situation, test with a lambda function.
df.apply(lambda series:list(series.index[pd.isnull(series)]),axis=1)
0 []
1 [2]
2 []
3 [1, 2]
dtype: object
And the next expression raises the same ValueError:
import numpy as np
np.min([])
In conclusion, pandas.apply() works well, but the clean function doesn't.
Could you use something like fillna with backfill? I think this might be more efficient, if backfill meets your scenario.
i.e.
df.fillna(method='backfill')
However, this assumes np.nan values in the cells?
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
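As a quick check (a sketch, not part of the original answer): the None values in the question's frame already come through as NaN, so fillna applies directly. Note that the question's stated goal of carrying the last value forward along each row would be ffill(axis=1) rather than a backfill down the columns.
import pandas as pd

df = pd.DataFrame([(0, 2, 5), (2, 4, None), (7, -5, 4), (1, None, None)])

print(df.isna().sum().sum())           # 3 -- the None entries are stored as NaN
print(df.fillna(method='backfill'))    # each NaN is filled from the row below, where one exists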
I have a dataframe with multiple columns and a few 1000 rows with text data. One column contains floats that represent time in ascending order (0, 0.45, 0.87, 1.10 etc). From this I want to build a new dataframe that contains only all the rows where these time values are closest to the integers x = 0,1,2,3......etc
Here on Stack Overflow I found an answer to a very similar question, posted by DSM. The code is essentially this, modified (hopefully) to give the closest number to x; df is my data frame.
df.loc[(df.ElapsedTime-x).abs().argsort()[:1]]
This seems to do what I need for one x value, but I can't figure out how to iterate it over the entire data frame to extract all rows where the column value is closest to x = 0, 1, 2, 3, ... in ascending order. This code gives me a data frame; there must be a way to loop this and append the resulting data frames to get the desired result?
I have tried this:
L=[]
for x in np.arange(len(df)):
L.append(df.loc[(df.ElapsedTime-x).abs().argsort()[:1]])
L
L, in principle, has the right rows, but it is a messy list and it takes a long time to execute, because for loops are not a great way to iterate over a data frame. I'd prefer to get a data frame as the result.
I feel I am missing something trivial.
Not sure how to post the desired dataframe.
Lets say the timevalues are (taken from my dataframe):
0.00,0.03,0.58,1.59,1.71,1.96,2.21,2.33,2.46,2.58,2.7,2.83,2.95,3.07
The values grabbed for 0,1,2,3 would be 0, .58, 1.96, 2.95
@beroe: if the numbers are 0.8, 1.1, 1.4, 2.8, then 1.1 should be grabbed for 1 and 1.4 should be grabbed for 2. If, for example, the numbers are 0.5, 1.5, 2.5 - while I think it is unlikely this will happen in my data - I think it would be fine to grab 1.5 for 1 and 2.5 for 2. In this application I don't think it is that critical, although I am not sure how I would implement it.
Please let me know if anyone needs any additional info.
Don't know how fast this would be, but you could round the times to get "integer" candidates, take the absolute value of the difference to give yourself a way to find the closest, then sort by difference, and then groupby the integer time to return just the rows that are close to integers:
# setting up my fake data
df=pd.DataFrame()
df['ElapsedTime']=pd.Series([0.5, 0.8, 1.1, 1.4, 1.8, 2.2, 3.1])
# To use your own data set, assign it to df and start here...
df['bintime'] = df.ElapsedTime.round()
df['d'] = abs(df.ElapsedTime - df.bintime)
dfindex = df.sort_values('d').groupby('bintime').first()
For the fake time series defined above, the contents of dfindex is:
ElapsedTime d
bintime
0 0.5 0.5
1 1.1 0.1
2 1.8 0.2
3 3.1 0.1
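Running the same steps on the time values listed in the question (a sketch using the sort_values spelling above) picks out exactly the rows the question expects:
import pandas as pd

times = [0.00, 0.03, 0.58, 1.59, 1.71, 1.96, 2.21, 2.33, 2.46,
         2.58, 2.70, 2.83, 2.95, 3.07]
df = pd.DataFrame({'ElapsedTime': times})

df['bintime'] = df.ElapsedTime.round()
df['d'] = (df.ElapsedTime - df.bintime).abs()
closest = df.sort_values('d').groupby('bintime').first()

print(closest['ElapsedTime'].tolist())   # [0.0, 0.58, 1.96, 2.95]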
Consider the following pd.Series s
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5000), np.random.rand(5000) * 100).sort_index()
s.head()
0.002587 3007
0.003418 4332
0.060767 2045
0.125182 3179
0.134487 4614
dtype: int64
Get all the integers we want to be closest to with:
idx = (s.index // 1).unique()
Then reindex with method='nearest'
s.reindex(idx, method='nearest').head()
0.0 3912
1.0 3617
2.0 2574
3.0 811
4.0 932
dtype: int64
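The same idea should carry over to the question's DataFrame by making ElapsedTime the index first; a sketch on the time values quoted in the question (the text column is made up, standing in for the other columns):
import pandas as pd

times = [0.00, 0.03, 0.58, 1.59, 1.71, 1.96, 2.21, 2.33, 2.46,
         2.58, 2.70, 2.83, 2.95, 3.07]
df = pd.DataFrame({'ElapsedTime': times, 'text': list('abcdefghijklmn')})  # 'text' is hypothetical

indexed = df.set_index('ElapsedTime')
targets = (indexed.index // 1).unique()             # 0.0, 1.0, 2.0, 3.0
print(indexed.reindex(targets, method='nearest'))   # rows at 0.00, 0.58, 1.96, 2.95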