Round float columns in pandas dataframe - python

I have the following pandas DataFrame:
Y X id WP_NER
0 35.973496 -2.734554 1 WP_01
1 35.592138 -2.903913 2 WP_02
2 35.329853 -3.391070 3 WP_03
3 35.392608 -3.928513 4 WP_04
4 35.579265 -3.942995 5 WP_05
5 35.519728 -3.408771 6 WP_06
6 35.759485 -3.078903 7 WP_07
I'd like to round the Y and X columns using pandas.
How can I do that?

You can now use round directly on the DataFrame.
Option 1
In [661]: df.round({'Y': 2, 'X': 2})
Out[661]:
Y X id WP_NER
0 35.97 -2.73 1 WP_01
1 35.59 -2.90 2 WP_02
2 35.33 -3.39 3 WP_03
3 35.39 -3.93 4 WP_04
4 35.58 -3.94 5 WP_05
5 35.52 -3.41 6 WP_06
6 35.76 -3.08 7 WP_07
Option 2
In [662]: cols = ['Y', 'X']
In [663]: df[cols] = df[cols].round(2)
In [664]: df
Out[664]:
Y X id WP_NER
0 35.97 -2.73 1 WP_01
1 35.59 -2.90 2 WP_02
2 35.33 -3.39 3 WP_03
3 35.39 -3.93 4 WP_04
4 35.58 -3.94 5 WP_05
5 35.52 -3.41 6 WP_06
6 35.76 -3.08 7 WP_07

You can apply round:
In [142]:
df[['Y','X']].apply(pd.Series.round)
Out[142]:
Y X
0 36 -3
1 36 -3
2 35 -3
3 35 -4
4 36 -4
5 36 -3
6 36 -3
If you want to round to a specific number of decimal places:
In [143]:
df[['Y','X']].apply(lambda x: pd.Series.round(x, 3))
Out[143]:
Y X
0 35.973 -2.735
1 35.592 -2.904
2 35.330 -3.391
3 35.393 -3.929
4 35.579 -3.943
5 35.520 -3.409
6 35.759 -3.079
EDIT
You can assign the result back to the columns you want to modify, like so:
In [144]:
df[['Y','X']] = df[['Y','X']].apply(lambda x: pd.Series.round(x, 3))
df
Out[144]:
Y X id WP_NER
0 35.973 -2.735 1 WP_01
1 35.592 -2.904 2 WP_02
2 35.330 -3.391 3 WP_03
3 35.393 -3.929 4 WP_04
4 35.579 -3.943 5 WP_05
5 35.520 -3.409 6 WP_06
6 35.759 -3.079 7 WP_07

round is smart enough to operate only on the float columns, so the simplest solution is just:
df = df.round(2)
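A minimal sketch (with made-up values mirroring the question's frame) showing that the integer and string columns pass through DataFrame.round untouched:
import pandas as pd

df = pd.DataFrame({
    "Y": [35.973496, 35.592138],
    "X": [-2.734554, -2.903913],
    "id": [1, 2],                  # integer column, unchanged by rounding
    "WP_NER": ["WP_01", "WP_02"],  # string column, skipped entirely
})
print(df.round(2))
# Expected output (roughly):
#        Y     X  id WP_NER
# 0  35.97 -2.73   1  WP_01
# 1  35.59 -2.90   2  WP_02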

You can do the following:
df['column_name'] = df['column_name'].apply(lambda x: round(x,2) if isinstance(x, float) else x)
This also checks whether the cell value is a float; if it is not, the same value is returned unchanged. That matters because a cell value can be a string or NaN.

You can also first check which columns are of type float, then round only those columns:
import math

for col in df.select_dtypes(include=['float']).columns:
    df[col] = df[col].apply(lambda x: x if math.isnan(x) else round(x, 1))
This also guards against NaN values via the math.isnan(x) check before rounding.
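If you prefer to avoid apply entirely, a hedged alternative (assuming only the float columns need rounding; Series.round already leaves NaN values as NaN, so no explicit isnan check is required) is:
float_cols = df.select_dtypes(include='float').columns
df[float_cols] = df[float_cols].round(1)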

Related

Faster way to count occurrences of values over a certain value in a column of lists in pandas?

I have this DataFrame. Each row of the data column is a list containing around 50 data points, and I want to count the number of occurrences of values over 50 and over 20.
>>> df['data'].head(10)
0 [33.23, 51.02, 32.01 ...
1 [99.04, 38.06, 39.57...
2 [96.04, 96.72, 401.93...
3 [96.64, 99.15, 99.83...
4 [96.71, 38.93, 53.02....
5 [88.72, 37.61, 39.61...
6 [38.93, 88.72, 37.31...
7 [88.72, 39.61, 35.71...
8 [97.44, 99.04, 88.56....
9 [00.14, 89.61, 39.95...
If we transform the df to a dict, it looks like this:
>>> df.to_dict()
{'data': {'row1': [33.23, 51.02, 32.01,...], 'row2': [99.04, 38.06, 39.57,...],'row3': [96.04, 96.72, 401.93,...],'row4'...}}
The expected result is a new column called result that stores the count of values in the data column over 50.0, or over 20.0 if no values are over 50.0:
>>> df.show()
data result
0 [33.23, 51.02, 32.01 ... 1
1 [99.04, 38.06, 39.57... 1
2 [96.04, 96.72, 401.93... 3
3 [96.64, 99.15, 99.83... 3
4 [96.71, 38.93, 53.02.... 2
This is the method I used:
def count_values(numlist):
    count1 = sum(x >= 50.0 for x in numlist)
    count2 = sum(x >= 20.0 for x in numlist)
    return count1 if count1 > 0 else count2

pandas_data_frame['result'] = pandas_data_frame.apply(
    lambda row: count_values(row['data']), axis=1)
However, the DataFrame can be extremely large, and I was wondering if there is any pandas method to improve the performance. Thanks.
Try with explode and groupby:
df[[">=50", ">=20"]] = (df.explode("data")
.groupby(level=0)["data"]
.agg([lambda x: x.ge(50).sum(),
lambda x: x.ge(20).sum()]
)
)
>>> df
data >=50 >=20
0 [33.23, 51.02, 32.01] 1 3
1 [99.04, 38.06, 39.57] 1 3
2 [96.04, 96.72, 401.93] 3 3
3 [96.64, 99.15, 99.83] 3 3
4 [96.71, 38.93, 53.02] 2 3
5 [88.72, 37.61, 39.61] 1 3
6 [38.93, 88.72, 37.31] 1 3
7 [88.72, 39.61, 35.71] 1 3
8 [97.44, 99.04, 88.56] 3 3
9 [0.14, 89.61, 39.95] 1 2
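If the column holds plain Python lists, a NumPy comprehension can be a reasonable alternative to explode, since it avoids building the long intermediate exploded frame. A rough sketch (count_over is a hypothetical helper reproducing the question's result column, i.e. the >=50 count with a >=20 fallback):
import numpy as np

def count_over(values, primary=50.0, fallback=20.0):
    # Count values >= primary; if there are none, count values >= fallback instead.
    arr = np.asarray(values, dtype=float)
    c1 = int((arr >= primary).sum())
    return c1 if c1 > 0 else int((arr >= fallback).sum())

df["result"] = [count_over(v) for v in df["data"]]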

Is there faster way to get values based on the linear regression model and append it to a new column in a DataFrame?

I wrote the code below to make a new column in my DataFrame so I can compare the actual values with the regressed values:
from sklearn.linear_model import LinearRegression

b = dfSemoga.loc[:, ['DoB', 'AA', 'logtime']]
y = dfSemoga.loc[:, 'logCO2'].values.reshape(-1, 1)
lr = LinearRegression().fit(b, y)
z = lr.coef_[0, 0]
j = lr.coef_[0, 1]
k = lr.coef_[0, 2]
c = lr.intercept_[0]
for i in range(0, len(dfSemoga)):
    dfSemoga.loc[i, 'EF CO2 Predict'] = (c + dfSemoga.loc[i, 'DoB']*z +
                                         dfSemoga.loc[i, 'logtime']*k +
                                         dfSemoga.loc[i, 'AA']*j)
So, I basically regress a column on three variables: 1) AA, 2) logtime, and 3) DoB. But in this code, to get the regressed values into a new column called dfSemoga['EF CO2 Predict'], I apply the coefficients manually, as shown in the for loop.
Is there any fancy one-liner code that I can write to make my work more efficient?
Without sample data I can't confirm, but you should just be able to do:
dfSemoga["EF CO2 Predict"] = c + (z * dfSemoga["DoB"]) + (k * dfSemoga["logtime"]) + (j * dfSemoga["AA"])
Demo:
In [4]: df
Out[4]:
a b
0 0 0
1 0 8
2 7 6
3 3 1
4 3 8
5 6 6
6 4 8
7 2 7
8 3 8
9 8 1
In [5]: df["c"] = 3 + 0.5 * df["a"] - 6 * df["b"]
In [6]: df
Out[6]:
a b c
0 0 0 3.0
1 0 8 -45.0
2 7 6 -29.5
3 3 1 -1.5
4 3 8 -43.5
5 6 6 -30.0
6 4 8 -43.0
7 2 7 -38.0
8 3 8 -43.5
9 8 1 1.0
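Alternatively, since the coefficients come from a fitted LinearRegression, the model's own predict method gives the same column without unpacking coef_ by hand. A sketch assuming the same dfSemoga columns as in the question:
from sklearn.linear_model import LinearRegression

X = dfSemoga[['DoB', 'AA', 'logtime']]
y = dfSemoga['logCO2']

lr = LinearRegression().fit(X, y)
dfSemoga['EF CO2 Predict'] = lr.predict(X)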

Drop rows if value in column changes

Assume I have the following pandas data frame:
my_class value
0 1 1
1 1 2
2 1 3
3 2 4
4 2 5
5 2 6
6 2 7
7 2 8
8 2 9
9 3 10
10 3 11
11 3 12
I want to identify the indices of "my_class" where the class changes and remove n rows after and before this index. The output of this example (with n=2) should look like:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
My approach:
# where class changes happen
s = df['my_class'].ne(df['my_class'].shift(-1).fillna(df['my_class']))
# mask with `bfill` and `ffill`
df[~(s.where(s).bfill(limit=1).ffill(limit=2).eq(1))]
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
One possible solution is to:
Make use of the fact that the index contains consecutive integers.
Find the index values where the class changes.
For each such index i, generate the sequence of indices from i-2 to i+1, and concatenate these sequences.
Retrieve the rows whose indices are not in this list.
The code to do it is:
ind = df[df['my_class'].diff().fillna(0, downcast='infer') == 1].index
df[~df.index.isin([item for sublist in
                   [range(i - 2, i + 2) for i in ind] for item in sublist])]
my_class = np.array([1] * 3 + [2] * 6 + [3] * 3)
cols = np.c_[my_class, np.arange(len(my_class)) + 1]
df = pd.DataFrame(cols, columns=['my_class', 'value'])

df['diff'] = df['my_class'].diff().fillna(0)
idx2drop = []
for i in df[df['diff'] == 1].index:
    idx2drop += range(i - 2, i + 2)
print(df.drop(idx2drop)[['my_class', 'value']])
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
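For larger frames, the same idea can be done without a Python loop by building all positions to drop in one NumPy step. A sketch assuming a default RangeIndex and n = 2:
import numpy as np

n = 2
vals = df['my_class'].to_numpy()
# positions where the class differs from the previous row
change_pos = np.flatnonzero(vals[1:] != vals[:-1]) + 1
# drop the n rows before each change and the n rows starting at it
drop_pos = (np.concatenate([np.arange(p - n, p + n) for p in change_pos])
            if len(change_pos) else np.array([], dtype=int))
drop_pos = np.unique(drop_pos[(drop_pos >= 0) & (drop_pos < len(df))])
print(df.drop(df.index[drop_pos]))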

Compute average of the pandas df conditioned on a parameter

I have the following df:
import numpy as np
import pandas as pd

a = []
for i in range(5):
    tmp_df = pd.DataFrame(np.random.random((10,4)))
    tmp_df['lvl'] = i
    a.append(tmp_df)
df = pd.concat(a, axis=0)
df =
0 1 2 3 lvl
0 0.928623 0.868600 0.854186 0.129116 0
1 0.667870 0.901285 0.539412 0.883890 0
2 0.384494 0.697995 0.242959 0.725847 0
3 0.993400 0.695436 0.596957 0.142975 0
4 0.518237 0.550585 0.426362 0.766760 0
5 0.359842 0.417702 0.873988 0.217259 0
6 0.820216 0.823426 0.585223 0.553131 0
7 0.492683 0.401155 0.479228 0.506862 0
..............................................
3 0.505096 0.426465 0.356006 0.584958 3
4 0.145472 0.558932 0.636995 0.318406 3
5 0.957969 0.068841 0.612658 0.184291 3
6 0.059908 0.298270 0.334564 0.738438 3
7 0.662056 0.074136 0.244039 0.848246 3
8 0.997610 0.043430 0.774946 0.097294 3
9 0.795873 0.977817 0.780772 0.849418 3
0 0.577173 0.430014 0.133300 0.760223 4
1 0.916126 0.623035 0.240492 0.638203 4
2 0.165028 0.626054 0.225580 0.356118 4
3 0.104375 0.137684 0.084631 0.987290 4
4 0.934663 0.835608 0.764334 0.651370 4
5 0.743265 0.072671 0.911947 0.925644 4
6 0.212196 0.587033 0.230939 0.994131 4
7 0.945275 0.238572 0.696123 0.536136 4
8 0.989021 0.073608 0.720132 0.254656 4
9 0.513966 0.666534 0.270577 0.055597 4
I am learning the neat pandas functionality and am thus wondering: what is the easiest way to compute the average along the lvl column?
What I mean is:
(df[df.lvl ==0 ] + df[df.lvl ==1 ] + df[df.lvl ==2 ] + df[df.lvl ==3 ] + df[df.lvl ==4 ]) / 5
The desired output should be a table of shape (10, 4), without the lvl column, where each element is the average of the 5 corresponding elements (with lvl = [0, 1, 2, 3, 4]). I hope that helps.
I think you need:
np.random.seed(456)
a = []
for i in range(5):
    tmp_df = pd.DataFrame(np.random.random((10,4)))
    tmp_df['lvl'] = i
    a.append(tmp_df)
df = pd.concat(a, axis=0)
#print (df)

df1 = (df[df.lvl == 0] + df[df.lvl == 1] +
       df[df.lvl == 2] + df[df.lvl == 3] +
       df[df.lvl == 4]) / 5
print (df1)
0 1 2 3 lvl
0 0.411557 0.520560 0.578900 0.541576 2
1 0.253469 0.655714 0.532784 0.620744 2
2 0.468099 0.576198 0.400485 0.333533 2
3 0.620207 0.367649 0.531639 0.475587 2
4 0.699554 0.548005 0.683745 0.457997 2
5 0.322487 0.316137 0.489660 0.362146 2
6 0.430058 0.159712 0.631610 0.641141 2
7 0.399944 0.511944 0.346402 0.754591 2
8 0.400190 0.373925 0.340727 0.407988 2
9 0.502879 0.399614 0.321710 0.715812 2
df = df.set_index('lvl')
df2 = df.groupby(df.groupby('lvl').cumcount()).mean()
print (df2)
0 1 2 3
0 0.411557 0.520560 0.578900 0.541576
1 0.253469 0.655714 0.532784 0.620744
2 0.468099 0.576198 0.400485 0.333533
3 0.620207 0.367649 0.531639 0.475587
4 0.699554 0.548005 0.683745 0.457997
5 0.322487 0.316137 0.489660 0.362146
6 0.430058 0.159712 0.631610 0.641141
7 0.399944 0.511944 0.346402 0.754591
8 0.400190 0.373925 0.340727 0.407988
9 0.502879 0.399614 0.321710 0.715812
EDIT:
If each subset of the DataFrame has an index from 0 to len(subset):
df2 = df.mean(level=0)
print (df2)
0 1 2 3 lvl
0 0.411557 0.520560 0.578900 0.541576 2
1 0.253469 0.655714 0.532784 0.620744 2
2 0.468099 0.576198 0.400485 0.333533 2
3 0.620207 0.367649 0.531639 0.475587 2
4 0.699554 0.548005 0.683745 0.457997 2
5 0.322487 0.316137 0.489660 0.362146 2
6 0.430058 0.159712 0.631610 0.641141 2
7 0.399944 0.511944 0.346402 0.754591 2
8 0.400190 0.373925 0.340727 0.407988 2
9 0.502879 0.399614 0.321710 0.715812 2
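Note that DataFrame.mean(level=0) was deprecated and later removed in newer pandas releases; the equivalent there is an explicit groupby on the index level:
df2 = df.groupby(level=0).mean()
print (df2)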
The groupby function is exactly what you want. It will group based on a condition, in this case where 'lvl' is the same, and then apply the mean function to the values for each column in that group.
df.groupby('lvl').mean()
It seems like you want to group by the index and take the average of all the columns except lvl, i.e.
df.groupby(df.index)[[0,1,2,3]].mean()
For a dataframe generated using
np.random.seed(456)
a = []
for i in range(5):
    tmp_df = pd.DataFrame(np.random.random((10,4)))
    tmp_df['lvl'] = i
    a.append(tmp_df)
df = pd.concat(a, axis=0)

df.groupby(df.index)[[0,1,2,3]].mean()
outputs:
0 1 2 3
0 0.411557 0.520560 0.578900 0.541576
1 0.253469 0.655714 0.532784 0.620744
2 0.468099 0.576198 0.400485 0.333533
3 0.620207 0.367649 0.531639 0.475587
4 0.699554 0.548005 0.683745 0.457997
5 0.322487 0.316137 0.489660 0.362146
6 0.430058 0.159712 0.631610 0.641141
7 0.399944 0.511944 0.346402 0.754591
8 0.400190 0.373925 0.340727 0.407988
9 0.502879 0.399614 0.321710 0.715812
which is identical to the output from
df.groupby(df.groupby('lvl').cumcount()).mean()
without resorting to double groupby.
IMO this is cleaner to read and, for a large DataFrame, will be much faster.

Combining index and value sorting with top-K selection

Say I have a dataframe with columns A, B, C, and data.
I would like to:
Convert it to a multi-index dataframe with indices A, B and C
Sort the rows by the indices A and B of this DataFrame.
Within each A B pair of the index, sort the rows (i.e. the C index) by the value on the column data.
Get the top 20 rows within each such A B pair, according to the previous sorting on data.
This shouldn't be hard, but I have tried all sorts of approaches, and none of them give me what I want. The following, for example, is close, but it gives me only values for the first group of A B indices.
temp = mdf.set_index(['A', 'B','C']).sort_index()
# Sorting by value and retrieving the top 20 entries:
func = lambda x: x.sort('data', ascending=False).head(20)
temp = temp.groupby(level=['A','B'],as_index=False).apply(func)
# Drop the dummy index (?) introduced in the line above
temp = temp.reset_index(level=0)['data']
Update:
def create_random_multi_index():
    df = pd.DataFrame({'A' : [np.random.random_integers(10) for x in xrange(500)],
                       'B' : [np.random.random_integers(10) for x in xrange(500)],
                       'C' : [np.random.random_integers(10) for x in xrange(500)],
                       'data' : randn(500)})
    return df
An example of what I am looking for (showing the top 3 elements; note how the data is sorted within each A-B pair):
data
A B C
1 1 10 2.057864
5 1.234252
7 0.235246
2 7 1.309126
6 0.450208
8 0.397360
2 2 2 1.609126
1 0.250208
4 0.597360
...
Not sure I 100% understand what you want, but I think this will do it. When you reset the index it stays in the same order. The key is sortlevel(): it sorts the levels lexicographically (and the remaining levels on ties). In 0.14 (coming soon) there is an option sort_remaining which you can play with, I think.
In [48]: np.random.seed(1234)
In [49]: df = pd.DataFrame({'A' : [np.random.random_integers(10) for x in xrange(500)],
....: 'B' : [np.random.random_integers(10) for x in xrange(500)],
....: 'C' : [np.random.random_integers(10) for x in xrange(500)],
....: 'data' : randn(500) })
First set the index, then sort it and reset.
Then groupby A,B and pull out the first 20 biggest elements.
df.set_index(['A','B','C']).sortlevel().reset_index().groupby(['A','B']).apply(
    lambda x: x.sort(columns='data', ascending=False).head(20)).set_index(['A','B','C'])
Out[8]:
data
A B C
1 1 1 0.959688
2 0.918230
2 0.731919
10 0.212463
1 0.103644
1 -0.035266
2 8 1.459579
8 1.277935
5 -0.075886
2 -0.684101
3 -0.928110
3 5 0.675987
4 0.065301
5 -0.800067
7 -1.349503
4 4 1.167308
8 1.148327
9 0.417590
6 -1.274146
10 -2.656304
5 2 -0.962994
1 -0.982679
6 2 1.410920
6 1.352527
10 0.510330
4 0.033275
1 -0.679686
10 -0.896797
1 -2.858669
7 8 -0.219342
8 -0.591054
2 -0.773227
1 -0.781850
3 -1.259089
10 -1.387992
10 -1.891734
8 7 1.578855
2 -0.498898
9 3 0.644277
8 0.572177
2 0.058431
9 -0.146912
4 -0.334690
10 9 0.795346
8 -0.137661
10 -1.335385
2 1 9 1.309405
3 0.328546
5 0.198422
1 -0.561974
3 -0.578069
2 5 0.645426
1 -0.138808
5 -0.400199
5 -0.513738
10 -0.667343
9 -1.983470
3 3 1.210882
6 0.894201
3 0.743652
...
[500 rows x 1 columns]
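For reference, sortlevel and DataFrame.sort were removed in later pandas versions; a roughly equivalent modern version of the same chain (a sketch, not verified against this exact data) would be:
(df.set_index(['A', 'B', 'C'])
   .sort_index()
   .reset_index()
   .groupby(['A', 'B'], group_keys=False)
   .apply(lambda g: g.sort_values('data', ascending=False).head(20))
   .set_index(['A', 'B', 'C']))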
Try this
df.sort('data', ascending=False).set_index('C').groupby(['A', 'B']).data.head(3)
It's not the most readable syntax, but it will get the job done:
A B C
1 1 9 1.380526
1 0.903524
7 -0.112363
2 2 0.284057
5 0.131392
1 0.111512
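With current pandas the same one-liner reads (DataFrame.sort was replaced by sort_values):
df.sort_values('data', ascending=False).set_index('C').groupby(['A', 'B']).data.head(3)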
