I would like to round the index values of a pandas dataframe named results, such that they do not have any decimal values. I use the following code, which I took from here: Round columns in pandas dataframe. So basically I have a column named "set_timeslots" and I would like to round its values and then use it as an index:
cols = ['set_timeslots']
results[cols] = results[cols].round(0)
results.set_index('set_timeslots', inplace=True)
However, I still get a decimal value, as you can see in the screenshot.
Do you know what I have to do in order to get rid of the decimal values? I'd appreciate any comments.
If you need to round and convert to integers, add Series.astype:
results[cols] = results[cols].round(0).astype(int)
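For example, a minimal sketch reproducing the setup (the value column is just an illustrative extra column, not from the original question):
import pandas as pd

results = pd.DataFrame({'set_timeslots': [1.0, 2.0, 3.0],
                        'value': [10, 20, 30]})
cols = ['set_timeslots']
# Round, cast to int, then use the column as the index
results[cols] = results[cols].round(0).astype(int)
results.set_index('set_timeslots', inplace=True)
print(results.index)  # integer index, no trailing .0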
pandas.DataFrame.round() alone can't solve this scenario, because round() only trims the decimal precision; the column's dtype stays float, so the values still display with a trailing .0. Let's walk through our case.
# Import the pandas library
import pandas as pd
# Reproduce a sample 'set_timeslots' column
results = pd.DataFrame({
    'set_timeslots': [1.0, 2.0, 3.0, 4.0, 5.0]
})
# Store the target column name in the 'cols' variable
cols = ['set_timeslots']
# Print the result
results[cols]
# Output of the above cell:
set_timeslots
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
# Apply the 'round' method (note: the dtype remains float)
results[cols] = results[cols].round(0)
# Print the result after rounding
results
# Output of the above cell:
set_timeslots
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
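Checking the dtype confirms why the trailing .0 persists after rounding; the column is still floating point:
# The column dtype is unchanged by round()
results['set_timeslots'].dtype
# Output: dtype('float64')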
The appropriate solution:
To convert set_timeslots from float to int, we can use the pandas.DataFrame.astype() method.
The code for the above scenario is shown below:
# Convert the column to int with astype(int)
results[cols] = results[cols].astype(int)
# Print the result after the conversion
results
# Output of the above cell:
set_timeslots
0 1
1 2
2 3
3 4
4 5
As you can see, we have achieved the desired output: the decimal points have been removed from the set_timeslots column. I hope this helps clarify the round() and astype() functions.
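One caveat worth noting (an assumption about your data, since the screenshot isn't visible here): a plain astype(int) raises an error if the column contains NaN. pandas' nullable integer dtype handles that case:
# 'Int64' (capital I) is the nullable integer dtype; it tolerates missing values
results[cols] = results[cols].round(0).astype('Int64')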
To learn more, see the pandas documentation for pandas.DataFrame.round() and pandas.DataFrame.astype().
I experienced some quite unexpected behavior when using the clip function of pandas.
So, here is a reproducible example:
import pandas as pd
df_init = pd.DataFrame({'W': [-1.00, 0.0, 0.0, 0.3, 0.5, 1.0]})
df_init['W_EX'] = df_init['W'] + 0.1
df_init['W_EX'].clip(upper=1.0, inplace=True)
df_init.loc[df_init['W']==-1.0, 'W_EX'] = -1.0
df_init
The output is, as one would expect:
Out[2]:
W W_EX
0 -1.0 -1.0
1 0.0 0.1
2 0.0 0.1
3 0.3 0.4
4 0.5 0.6
5 1.0 1.0
However, when I inspect a specific value:
df_init.loc[df_init['W']==-1.0, 'W_EX']
I see the following output:
Out[3]:
0 -0.9
Name: W_EX, dtype: float64
Although I used .loc to overwrite the first value in the column, and although the new value shows up when I print the data frame, when I use .loc with a row slice I see the value I had before using .clip.
Now it gets more complicated. If I select the new column with a list (which returns a one-column frame rather than a Series), I can see the value has indeed been updated:
df_init.loc[df_init['W']==-1.0, ['W_EX']]
Out[4]:
W_EX
0 -1.0
And lastly, like in Schrödinger's cat experiment, if I now go back and look at the column values, after having inspected the column as a one-column frame, I can now actually see that the value is indeed the one I would have expected to see back in Out[3]:
df_init.loc[df_init['W']==-1.0, 'W_EX']
Out[5]:
0 -1.0
Name: W_EX, dtype: float64
If I skip the .clip call, all is fine. Would someone more knowledgeable than myself please explain to me what is going on here?
It may be better to default to explicit assignment so that it's clearer what's happening. inplace=True performed on this slice of the dataframe doesn't appear to assign consistently as expected.
There's some debate on whether the flag should stick around at all. In pandas, is inplace = True considered harmful, or not?
df_init['W_EX'] = df_init['W_EX'].clip(upper=1.0)
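A minimal sketch of the corrected snippet, using the same setup as the question but with the clip result assigned back explicitly:
import pandas as pd

df_init = pd.DataFrame({'W': [-1.00, 0.0, 0.0, 0.3, 0.5, 1.0]})
df_init['W_EX'] = df_init['W'] + 0.1
# Explicit assignment instead of inplace=True
df_init['W_EX'] = df_init['W_EX'].clip(upper=1.0)
df_init.loc[df_init['W'] == -1.0, 'W_EX'] = -1.0
# Both views now agree on the updated value
print(df_init.loc[df_init['W'] == -1.0, 'W_EX'])  # 0   -1.0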
I've got a pandas dataframe, and I'm trying to fill a new column with the maximum of each pair of consecutive values in another column, iteratively. I'm trying to build a loop to do this and save computation time, although I realise I could probably do it with more lines of code.
for x in ((jac_input.index)):
jac_output['Max Load'][x] = jac_input[['load'][x],['load'][x+1]].max()
However, I keep getting this error during the comparison
IndexError: list index out of range
Any ideas as to where I'm going wrong here? Any help would be appreciated!
Many things are wrong with your current code.
When you do ['abc'][x], x can only take the value 0, and this will return 'abc' because you are indexing a one-element Python list. That is not at all what you expect it to do (I imagine you meant to slice the Series).
For your code to be valid, you should do something like:
import pandas as pd
jac_input = pd.DataFrame({'load': [1,0,3,2,5,4]})
for x in jac_input.index:
    print(jac_input['load'].loc[x:x+1].max())
output:
1
3
3
5
5
4
Also, when assigning with jac_output['Max Load'][x] = ... you will likely encounter a SettingWithCopyWarning. You should rather use loc: jac_output.loc[x, 'Max Load'] = ...; a corrected version of the loop is sketched below.
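For completeness, the fixed loop might look like this (a sketch, assuming jac_output is a dataframe sharing jac_input's index):
for x in jac_input.index:
    # .loc slicing is inclusive, so x:x+1 covers the current and the next row
    jac_output.loc[x, 'Max Load'] = jac_input['load'].loc[x:x+1].max()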
But you do not need all that; use vectorized code instead!
You can perform rolling on the reversed dataframe:
jac_output['Max Load'] = jac_input['load'][::-1].rolling(2, min_periods=1).max()[::-1]
Or using concat:
jac_output['Max Load'] = pd.concat([jac_input['load'], jac_input['load'].shift(-1)], axis=1).max(1)
output (without assignment):
0 1.0
1 3.0
2 3.0
3 5.0
4 5.0
5 4.0
dtype: float64
'azdias' is a dataframe which is my main dataset, and its metadata or feature summary lies in the dataframe 'feat_info'. 'feat_info' lists, for every column, the values that should be treated as NaN.
Ex: column1 has the values [-1,0] as NaN values. So my job is to find these -1 and 0 values in column1 and replace them with NaN.
(Screenshots of the azdias and feat_info dataframes omitted; samples of both are attached below.)
I have tried the following in a Jupyter notebook.
import numpy as np

def NAFunc(x, miss_unknown_list):
    x_output = x
    for i in miss_unknown_list:
        try:
            miss_unknown_value = float(i)
        except ValueError:
            miss_unknown_value = i
        if x == miss_unknown_value:
            x_output = np.nan
            break
    return x_output

for cols in azdias.columns.tolist():
    NAList = feat_info[feat_info.attribute == cols]['missing_or_unknown'].values[0]
    azdias[cols] = azdias[cols].apply(lambda x: NAFunc(x, NAList))
Question 1: I am trying to impute NaN values, but my code is very slow. I wish to speed up my process of execution.
I have attached sample of both dataframes:
azdias_sample
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST
0 -1 2 1 2.0 3
1 -1 1 2 5.0 1
2 -1 3 2 3.0 1
3 2 4 2 2.0 4
4 -1 3 1 5.0 4
feat_info_sample
attribute information_level type missing_or_unknown
AGER_TYP person categorical [-1,0]
ALTERSKATEGORIE_GROB person ordinal [-1,0,9]
ANREDE_KZ person categorical [-1,0]
CJT_GESAMTTYP person categorical [0]
FINANZ_MINIMALIST person ordinal [-1]
If the azdias dataset is obtained from read_csv or a similar IO function, the na_values keyword argument can be used to specify column-specific missing-value representations, so that the returned data frame already has NaN values in place from the very beginning. Sample code is shown below.
from ast import literal_eval
import pandas as pd

feat_info.set_index("attribute", inplace=True)
# A more concise but less efficient alternative is
# na_dict = feat_info["missing_or_unknown"].apply(literal_eval).to_dict()
na_dict = {attr: literal_eval(val) for attr, val in feat_info["missing_or_unknown"].items()}
df_azdias = pd.read_csv("azdias.csv", na_values=na_dict)
As for the data type, there is no built-in NaN representation for integer data types. Hence a float data type is needed. If the missing values are imputed using fillna, the downcast argument can be specified to make the returned series or data frame have an appropriate data type.
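For example (a sketch; the fill value 0 is just an illustrative choice, and the downcast argument is deprecated in recent pandas versions):
# Fill NaN and downcast back to an integer dtype where possible
azdias["AGER_TYP"] = azdias["AGER_TYP"].fillna(0, downcast="infer")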
Try using the DataFrame's replace method. How about this?
for c in azdias.columns.tolist():
    replace_list = feat_info[feat_info['attribute'] == c]['missing_or_unknown'].values
    azdias[c] = azdias[c].replace(to_replace=list(replace_list), value=np.nan)
A couple things I'm not sure about without being able to execute your code:
In your example, you used .values[0]. Don't you want all the values?
I'm not sure if it's necessary to do to_replace=list(replace_list), it may work to just use to_replace=replace_list.
In general, I recommend thinking to yourself "surely Pandas has a function to do this for me." Often, they do. For performance with Pandas generally, avoid looping over and setting things. Vectorized methods tend to be much faster.
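For instance, building a column-to-values mapping once and then doing a single vectorized replace per column (a sketch reusing literal_eval from the earlier answer to parse the string lists):
from ast import literal_eval
import numpy as np

# Map each attribute to its parsed list of missing-value codes
na_dict = {row.attribute: literal_eval(row.missing_or_unknown)
           for row in feat_info.itertuples()}
for col, na_vals in na_dict.items():
    # Series.replace with a list is vectorized; no Python-level apply needed
    azdias[col] = azdias[col].replace(na_vals, np.nan)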
I am trying to subtract two columns in the dataframe, but it is giving me the same result for all the values.
Here is my data:
a b
0 0.35805 -0.01315
1 0.35809 -0.01311
2 0.35820 -0.01300
3 0.35852 -0.01268
I tried the approach suggested here, but it repeats the same result for me in all the rows.
This looks more like a precision issue; I always use Decimal in such cases:
from decimal import Decimal
# 'z' and 'dist' here are column names from the linked question's data,
# not the 'a'/'b' sample above
df.z.map(Decimal) - df.dist.map(Decimal)
Out[189]:
0 0.3711999999999999796246319406
1 0.3712000000000000195232718880
2 0.3712000000000000177885484121
3 0.3712000000000000056454840802
dtype: object
I think this will work fine
df['a-b'] = df['a']-df['b']
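The results are in fact distinct floats that only look identical at pandas' default display precision of 6 digits. One way to see this (a sketch using the a/b data above):
# map each result through repr to show its full round-trip precision
print((df['a'] - df['b']).map(repr))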
I have the following example:
import pandas as pd
import numpy as np
df = pd.DataFrame([(0,2,5), (2,4,None), (7,-5,4), (1,None,None)])
def clean(series):
    start = np.min(list(series.index[pd.isnull(series)]))
    end = len(series)
    series[start:] = series[start-1]
    return series
My objective is to obtain a dataframe in which each row that contains a None value is filled in with the last available numerical value.
So, for example, running this function on just the 3rd row of the dataframe, I would produce the following:
row = df.loc[3]
test = clean(row)
test
0 1.0
1 1.0
2 1.0
Name: 3, dtype: float64
I cannot get this to work using the .apply() method, i.e. df.apply(clean, axis=1).
I should mention that this is a toy example; the custom function I would write in the real one is more dynamic in how it fills the values, so I am not looking for basic utilities like .ffill or .fillna.
The apply method didn't work because, when a row is completely filled, your clean function has no idea where to start the index from: pd.isnull(series) selects nothing, so np.min receives an empty array.
So use a condition before altering the series data, i.e.
def clean(series):
    # Creating a copy for the sake of safety
    series = series.copy()
    # Alter the series only if there exists a None value
    if pd.isnull(series).any():
        start = np.min(list(series.index[pd.isnull(series)]))
        # For a completely filled row,
        # series.index[pd.isnull(series)] would return
        # Int64Index([], dtype='int64')
        end = len(series)
        series[start:] = series[start-1]
    return series

df.apply(clean, axis=1)
Output:
0 1 2
0 0.0 2.0 5.0
1 2.0 4.0 4.0
2 7.0 -5.0 4.0
3 1.0 1.0 1.0
Hope that clarifies why apply didn't work. I also suggest considering the built-in methods for cleaning the data rather than writing functions from scratch.
First, this is the code that solves your toy problem, though it isn't what you asked for:
df.ffill(axis=1)
Next, let's test your code:
df.apply(clean,axis=1)
#...start = np.min(list(series.index[pd.isnull(series)]))...
#=>ValueError: ('zero-size array to reduction operation minimum
# which has no identity', 'occurred at index 0')
To understand the situation, test with a lambda function:
df.apply(lambda series:list(series.index[pd.isnull(series)]),axis=1)
0 []
1 [2]
2 []
3 [1, 2]
dtype: object
And the following expression raises the same ValueError:
import numpy as np
np.min([])
In conclusion, pandas.apply() works fine, but the clean function doesn't.
Could you use something like fillna with backfill? I think this might be more efficient, if backfill meets your scenario:
i.e.
df.fillna(method='backfill')
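As a side note, newer pandas versions deprecate the method= argument to fillna in favour of the dedicated methods, and filling along axis=1 matches the row-wise goal, as the earlier answer showed:
df.bfill()           # equivalent column-wise backfill
df.ffill(axis=1)     # row-wise forward fill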
However, note that this assumes np.nan values in the cells.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html