How do I round off columns two (beta) and three (gama) of this dataframe to two decimal places?
df
alpha beta gama theta
4.615556 4.637778 4.415556 4.5
3.612727 3.616364 3.556364 5.5
2.608000 2.661333 2.680000 7.5
2.512500 2.550000 2.566250 8.0
You can use the .round function.
Here is an example program:
# Import pandas
import pandas as pd
# Example data from question
alpha = [4.615556, 3.612727, 2.608000, 2.512500]
beta = [4.637778, 3.616364, 2.661333, 2.550000]
gamma = [4.415556, 3.556364, 2.680000, 2.566250]
theta = [4.5, 5.5, 7.5, 8.0]
# Build dataframe
df = pd.DataFrame({'alpha':alpha, 'beta':beta, 'gamma':gamma, 'theta':theta})
# Print it out
print(df)
# Use round function
df[['beta', 'gamma']] = df[['beta', 'gamma']].round(2)
# Show results
print(df)
Yields:
alpha beta gamma theta
0 4.615556 4.637778 4.415556 4.5
1 3.612727 3.616364 3.556364 5.5
2 2.608000 2.661333 2.680000 7.5
3 2.512500 2.550000 2.566250 8.0
alpha beta gamma theta
0 4.615556 4.64 4.42 4.5
1 3.612727 3.62 3.56 5.5
2 2.608000 2.66 2.68 7.5
3 2.512500 2.55 2.57 8.0
Or, keeping the original column name from the question:
df[['beta','gama']] = df[['beta','gama']].round(2)
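As a side note, DataFrame.round also accepts a dict mapping column names to decimal places, so the column selection does not need to be repeated. A minimal sketch, using the column names from the question:
# Round only beta and gama; the other columns are left untouched
df = df.round({'beta': 2, 'gama': 2})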
My goal is, for a dataset similar to the example below, to group by [s_num, ip, f_num, direction], then compare the score columns against separate thresholds and count how many values in each group are above each threshold.
id s_num ip f_num direction algo_1_x algo_2_x algo_1_score algo_2_score
0 0.0 0.0 0.0 0.0 X -4.63 -4.45 0.624356 0.664009
15 19.0 0.0 2.0 0.0 X -5.44 -5.02 0.411217 0.515843
16 20.0 0.0 2.0 0.0 X -12.36 -5.09 0.397237 0.541112
20 24.0 0.0 2.0 1.0 X -4.94 -5.15 0.401744 0.526032
21 25.0 0.0 2.0 1.0 X -4.78 -4.98 0.386410 0.564934
22 26.0 0.0 2.0 1.0 X -4.89 -5.03 0.394326 0.513896
24 28.0 0.0 2.0 2.0 X -4.78 -5.00 0.420078 0.521993
25 29.0 0.0 2.0 2.0 X -4.91 -5.14 0.407355 0.485878
26 30.0 0.0 2.0 2.0 X 11.83 -4.97 0.392242 0.659122
27 31.0 0.0 2.0 2.0 X -4.73 -5.07 0.377011 0.524774
The result should look something like this: each entry in an algo_i column is the number of values in the group larger than the corresponding threshold.
So far I have tried grouping first and then applying a custom aggregation, like so:
def count_success(x, thresh):
    return ((x > thresh) * 1).sum()
thresholds=[0.1,0.2]
df.groupby(attr_cols).agg({f'algo_{i+1}_score':count_success(thresh) for i, thresh in enumerate(thresholds)})
but this results in an error:
count_success() missing 1 required positional argument: 'thresh'
So, how can I pass another argument to a function using .agg()? Or is there an easier way to do it using some pandas function?
Named aggregation does not allow extra parameters to be passed to your function. You can use numpy broadcasting:
import numpy as np
import pandas as pd
attr_cols = ["s_num", "ip", "f_num", "direction"]
score_cols = df.columns[df.columns.str.match(r"algo_\d+_score")]
# Convert everything to numpy to prepare for broadcasting
score = df[score_cols].to_numpy()
threshold = np.array([0.1, 0.5])
# Raise `threshold` up 2 dimensions so that every value in `score` is
# broadcast against every value in `threshold`
mask = score > threshold[:, None, None]
# Assemble the result
row_index = pd.MultiIndex.from_frame(df[attr_cols])
col_index = pd.MultiIndex.from_product([threshold, score_cols], names=["threshold", "algo"])
result = (
pd.DataFrame(np.hstack(mask), index=row_index, columns=col_index)
.groupby(attr_cols)
.sum()
)
Result:
threshold 0.1 0.5
algo algo_1_score algo_2_score algo_1_score algo_2_score
s_num ip f_num direction
0.0 0.0 0.0 X 1 1 1 1
2.0 0.0 X 2 2 0 2
1.0 X 3 3 0 3
2.0 X 4 4 0 3
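If you only need to fix the error from the question, agg also accepts a plain dict mapping each column to a callable, and the extra thresh argument can be bound with functools.partial (or a lambda). A minimal sketch reusing count_success and thresholds from the question; note that, unlike the broadcasting approach above, this applies each threshold to a single column rather than crossing every threshold with every column:
from functools import partial
# Each dict value must be a callable of one argument, so bind thresh first
df.groupby(attr_cols).agg(
    {f'algo_{i+1}_score': partial(count_success, thresh=thresh)
     for i, thresh in enumerate(thresholds)}
)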
This is my data frame (labeled unp):
LOCATION TIME Unemployment_Rate Unit_Labour_Cost GDP_CAP PTEmployment HR_WKD Collective IndividualCollective Individual Temp GDPCAP_ULC GDP_Growth
0 AUT 2013 5.336031 2.632506 47936.67796 19.863556 1632.1 2.14 1.80 1.66 1.47 18209.522774 NaN
1 AUT 2014 5.621219 1.996807 48813.53441 20.939237 1621.6 2.14 1.80 1.66 1.47 24445.794917 876.85645
2 AUT 2015 5.723468 1.515733 49925.22780 21.026548 1598.9 2.14 1.80 1.66 1.47 32938.009399 1111.69339
3 AUT 2016 6.014071 1.610391 50923.69330 20.889132 1609.4 2.14 1.80 1.66 1.47 31621.943553 998.46550
4 BEL 2013 8.425185 1.988013 43745.95156 18.212509 1558.0 2.48 2.22 2.11 1.91 22004.861920 -7177.74174
... ... ... ... ... ... ... ... ... ... ... ... ... ...
101 SWE 2016 6.991096 1.899792 48690.14644 13.800736 1626.0 2.72 2.54 2.48 1.55 25629.198586 779.74573
102 USA 2013 7.375000 1.099109 53016.28880 12.255613 1782.0 1.33 1.31 1.30 0.27 48235.697096 4326.14236
103 USA 2014 6.166667 2.027852 54935.20048 10.611552 1784.0 1.33 1.31 1.30 0.27 27090.340163 1918.91168
104 USA 2015 5.291667 1.912012 56700.88042 9.879047 1785.0 1.33 1.31 1.30 0.27 29655.086066 1765.67994
105 USA 2016 4.866667 1.045644 57797.46221 9.454144 1781.0 1.33 1.31 1.30 0.27 55274.512367 1096.58179
I want to fill the column GDP_Growth, which is currently blank, with the value of:
unp.GDP_CAP - unp.GDP_CAP.shift(1)
provided that 'TIME' is 2014 or later; otherwise the value should be NaN.
I tried using an if statement directly, but it's not working:
if unp.loc[unp['TIME'] > 2014]:
    unp['GDP_Growth'] = unp.GDP_CAP - unp.GDP_CAP.shift(1)
else:
    return
You should avoid if statements when working with dataframes, as they are slower (less efficient). Instead, depending on what you need, you can use np.where().
Because the dataframe in the question is a picture (as opposed to text), I will give you the standard implementation, which looks like this:
import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
'B': [5, 6, 7, 8, 9]})
# Use np.where() to select values from column 'A' where column 'B' is greater than 7
result = np.where(df['B'] > 7, df['A'], 0)
# Print the result
print(result)
The result of the above is this:
[0 0 0 4 5]
You will need to modify the above for your particular dataframe.
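Adapted to the unp frame from the question, it could look like the sketch below (an assumption on my part: the first year of each country, 2013, should stay NaN):
import numpy as np
# Year-over-year change where TIME is 2014 or later; earlier rows get NaN
unp['GDP_Growth'] = np.where(unp['TIME'] >= 2014,
                             unp.GDP_CAP - unp.GDP_CAP.shift(1),
                             np.nan)
# If rows from several countries are stacked, a per-country diff is safer:
# unp['GDP_Growth'] = unp.groupby('LOCATION')['GDP_CAP'].diff()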
The question's title is currently "Python: How do I use the if function when calling out a specific row?", which my answer will not apply to. Instead, we will compute the derivative / 'growth' and selectively apply it.
Explanation: In Python, you generally want to use a functional programming style to keep most computations outside of the Python interpreter and instead work with C-implemented functions.
Solution:
A. Obtain the derivate/'growth'
For your dataframe df = pd.DataFrame(...) you can obtain the change in value for a specific column with df['column_name'].diff(), e.g.
# This is your dataframe
In : df
Out:
gdp growth year
0 0 <NA> 2000
1 1 <NA> 2001
2 2 <NA> 2002
3 3 <NA> 2003
4 4 <NA> 2004
In : df['gdp'].diff()
Out:
0 NaN
1 1.0
2 1.0
3 1.0
4 1.0
Name: gdp, dtype: float64
B. Apply it to the 'growth' column
In : df['growth'] = df['gdp'].diff()
df
Out:
gdp growth year
0 0 NaN 2000
1 1 1.0 2001
2 2 1.0 2002
3 3 1.0 2003
4 4 1.0 2004
C. Selectively exclude values
If you then want specific years to have a certain value, apply them selectively
In : df['growth'].iloc[np.where(df['year']<2003)] = np.nan
df
Out:
gdp growth year
0 0 NaN 2000
1 1 NaN 2001
2 2 NaN 2002
3 3 1.0 2003
4 4 1.0 2004
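As an aside, the chained assignment in step C (df['growth'].iloc[...] = ...) can trigger pandas' SettingWithCopyWarning; an equivalent, more idiomatic single .loc assignment would be:
# Same effect as step C, as one label-based assignment
df.loc[df['year'] < 2003, 'growth'] = np.nan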
There is a large csv file imported. Below is the output, where Flavor_Score and Overall_Score are the results of applying df.groupby('beer_name').mean() across a multitude of testers. I would like to add a standard-deviation column for each of Flavor_Score and Overall_Score, to the right of each mean column. The function to use is clear, but how do I add the columns for display? Of course, I could generate an array and append it (right?), but that seems like a cumbersome way.
Beer_name Beer_Style Flavor_Score Overall_Score
Coors Light 2.0 3.0
Sam Adams Dark 4.0 4.5
Becks Light 3.5 3.5
Guinness Dark 2.0 2.2
Heineken Light 3.5 3.7
You could use
df.groupby('Beer_name').agg(['mean','std'])
This computes the mean and the std for each group.
For example,
import numpy as np
import pandas as pd
np.random.seed(2015)
N = 100
beers = ['Coors', 'Sam Adams', 'Becks', 'Guinness', 'Heineken']
style = ['Light', 'Dark', 'Light', 'Dark', 'Light']
df = pd.DataFrame({'Beer_name': np.random.choice(beers, N),
'Flavor_Score': np.random.uniform(0, 10, N),
'Overall_Score': np.random.uniform(0, 10, N)})
df['Beer_Style'] = df['Beer_name'].map(dict(zip(beers, style)))
print(df.groupby('Beer_name')[['Flavor_Score', 'Overall_Score']].agg(['mean', 'std']))
yields
Flavor_Score Overall_Score
mean std mean std
Beer_name
Becks 5.779266 3.033939 6.995177 2.697787
Coors 6.521966 2.008911 4.066374 3.070217
Guinness 4.836690 2.644291 5.577085 2.466997
Heineken 4.622213 3.108812 6.372361 2.904932
Sam Adams 5.443279 3.311825 4.697961 3.164757
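If you prefer flat column names, newer pandas versions (0.25+) also support named aggregation; a sketch, with output column names of my own choosing:
print(df.groupby('Beer_name').agg(
    Flavor_mean=('Flavor_Score', 'mean'),
    Flavor_std=('Flavor_Score', 'std'),
    Overall_mean=('Overall_Score', 'mean'),
    Overall_std=('Overall_Score', 'std'),
))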
groupby.agg([fun1, fun2]) computes any number of functions in one step:
from random import choice, random
import pandas as pd
import numpy as np
beers = ['Coors', 'Sam Adams', 'Becks', 'Guinness', 'Heineken']
styles = ['Light', 'Dark']
def generate():
    for i in range(100):
        yield dict(beer=choice(beers), style=choice(styles),
                   flavor_score=random() * 10.0,
                   overall_score=random() * 10.0)
pd.options.display.float_format = ' {:,.1f} '.format
df = pd.DataFrame(list(generate()))
print(df.groupby(['beer', 'style']).agg([np.mean, np.std]))
=>
flavor_score overall_score
mean std mean std
beer style
Becks Dark 7.1 3.6 1.9 1.6
Light 4.7 2.4 2.0 1.0
Coors Dark 5.5 3.2 2.6 1.1
Light 5.3 2.5 1.9 1.1
Guinness Dark 3.3 1.4 2.1 1.1
Light 4.7 3.6 2.2 1.1
Heineken Dark 4.4 3.0 2.7 1.0
Light 6.0 2.3 2.1 1.3
Sam Adams Dark 3.4 3.0 1.7 1.2
Light 5.2 3.6 1.6 1.3
What if I need to apply a user-defined function to just the flavor_score column? Let's say I want to subtract 0.5 from the flavor_score mean of all rows, except for Heineken, for which I want to add 0.25:
grouped[grouped.beer != 'Heineken']['flavor_score']['mean'] - 0.5
grouped[grouped.beer == 'Heineken']['flavor_score']['mean'] + 0.25
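Those two expressions only select values; they do not write anything back. A minimal sketch of one way to apply the adjustment, assuming grouped is the result of the groupby/agg call above (so beer is an index level and the columns are a MultiIndex):
# Adjust the flavor_score mean per beer: -0.5 everywhere, +0.25 for Heineken
grouped = df.groupby(['beer', 'style']).agg([np.mean, np.std])
is_heineken = grouped.index.get_level_values('beer') == 'Heineken'
grouped.loc[~is_heineken, ('flavor_score', 'mean')] -= 0.5
grouped.loc[is_heineken, ('flavor_score', 'mean')] += 0.25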
I'm using Pandas for data analysis. I have an input file like this snippet:
VEH SEC POS ACCELL SPEED
2 8.4 36.51 -0.2929 27.39
3 8.4 23.57 -0.7381 33.09
4 8.4 6.18 0.6164 38.8
1 8.5 47.76 0 25.57
I need to reorganize the data so that the unique (ordered) values from SEC form the first column, and the other columns are VEH1_POS, VEH1_SPEED, VEH1_ACCELL, VEH2_POS, VEH2_SPEED, VEH2_ACCELL, etc.:
TIME VEH1_POS VEH1_SPEED VEH1_ACCEL VEH2_POS, VEH2_SPEED, etc.
0.1 6.2 3.7 0.0 7.5 2.1
0.2 6.8 3.2 -0.5 8.3 2.1
etc.
So, for example, the value of VEH1_POS in each row of the new dataframe would be filled in by selecting values from the POS column of the original dataframe, using the row where the SEC value matches the new row's TIME value and VEH == 1.
To set up the rows in the new data frame I'm doing this:
start = inputdf['SIMSEC'].min()
end = inputdf['SIMSEC'].max()
time_steps = frange(start, end, 0.1)
outputdf['TIME'] = time_steps
But I'm lost at how to select the right values from the input dataframe and create the rest of the new dataframe for further analysis. Note also that the input file will NOT have data for every VEH for every SEC (time stamp). So the solution needs to handle that as well. My best guess was:
outputdf['veh1_pos'] = np.where((inputdf['VEH NO'] == 1) & (inputdf['SIMSEC'] == row['Time Step']))
but that doesn't work.
import pandas as pd
# your data
# ==========================
print(df)
Out[272]:
VEH SEC POS ACCELL SPEED
0 2 8.4 36.51 -0.2929 27.39
1 3 8.4 23.57 -0.7381 33.09
2 4 8.4 6.18 0.6164 38.80
3 1 8.5 47.76 0.0000 25.57
# reshaping
# ==========================
result = df.set_index(['SEC','VEH']).unstack()
Out[278]:
POS ACCELL SPEED
VEH 1 2 3 4 1 2 3 4 1 2 3 4
SEC
8.4 NaN 36.51 23.57 6.18 NaN -0.2929 -0.7381 0.6164 NaN 27.39 33.09 38.8
8.5 47.76 NaN NaN NaN 0 NaN NaN NaN 25.57 NaN NaN NaN
Here the columns have a multi-level index, where the 1st level is POS/ACCELL/SPEED and the 2nd level is VEH=1,2,3,4.
# if you want to rename the column
temp_z = result.columns.get_level_values(0)
temp_y = result.columns.get_level_values(1)
temp_x = ['VEH'] * len(temp_y)
result.columns = ['{}{}_{}'.format(x,y,z) for x,y,z in zip(temp_x, temp_y, temp_z)]
Out[298]:
VEH1_POS VEH2_POS VEH3_POS VEH4_POS VEH1_ACCELL VEH2_ACCELL VEH3_ACCELL VEH4_ACCELL VEH1_SPEED VEH2_SPEED VEH3_SPEED VEH4_SPEED
SEC
8.4 NaN 36.51 23.57 6.18 NaN -0.2929 -0.7381 0.6164 NaN 27.39 33.09 38.8
8.5 47.76 NaN NaN NaN 0 NaN NaN NaN 25.57 NaN NaN NaN
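As a side note, the same reshape can be written with pivot, and the regular 0.1-second grid the question asks for can be imposed afterwards with reindex; a sketch, assuming the column names from the sample data:
import numpy as np
# pivot is equivalent to set_index(['SEC', 'VEH']).unstack()
result = df.pivot(index='SEC', columns='VEH')
# Impose a regular 0.1 s grid; time stamps with no data become all-NaN rows
start, end = df['SEC'].min(), df['SEC'].max()
result = result.reindex(np.arange(start, end + 0.1, 0.1).round(1))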
I have a data frame (df) in pandas with four columns, and I want a new column to represent the mean of these four columns: df['mean'] = df.mean(1)
1 2 3 4 mean
NaN NaN NaN NaN NaN
5.9 5.4 2.4 3.2 4.225
0.6 0.7 0.7 0.7 0.675
2.5 1.6 1.5 1.2 1.700
0.4 0.4 0.4 0.4 0.400
So far so good. But when I save the results to a csv file this is what I found:
5.9,5.4,2.4,3.2,4.2250000000000005
0.6,0.7,0.7,0.7,0.6749999999999999
2.5,1.6,1.5,1.2,1.7
0.4,0.4,0.4,0.4,0.4
I guess I can force the format in the mean column, but any idea why this is happening?
I am using winpython with python 3.3.2 and pandas 0.11.0
You could use the float_format parameter:
import pandas as pd
import io
content = '''\
1 2 3 4 mean
NaN NaN NaN NaN NaN
5.9 5.4 2.4 3.2 4.225
0.6 0.7 0.7 0.7 0.675
2.5 1.6 1.5 1.2 1.700
0.4 0.4 0.4 0.4 0.400'''
df = pd.read_csv(io.StringIO(content), sep=r'\s+')
df.to_csv('/tmp/test.csv', float_format='%g', index=False)
yields
1,2,3,4,mean
,,,,
5.9,5.4,2.4,3.2,4.225
0.6,0.7,0.7,0.7,0.675
2.5,1.6,1.5,1.2,1.7
0.4,0.4,0.4,0.4,0.4
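Rounding the stored values before saving is another option; unlike float_format, this changes the data itself rather than just the output:
# Round the stored means, then write as usual
df['mean'] = df['mean'].round(3)
df.to_csv('/tmp/test.csv', index=False)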
The values are in fact correct. Floating point numbers cannot be perfectly represented on our systems, so there are bound to be some differences. Read The Floating Point Guide.
>>> a = 5.9+5.4+2.4+3.2
>>> a / 4
4.2250000000000005
As you said, you can always format the results if you only want a fixed number of digits after the decimal point.
>>> "{:.3f}".format(a/4)
'4.225'