Categorizing a Pandas column with individual custom bins (thresholds) - python

I have a dataframe with some numeric values stored in column "value", accompanied by their respective categorical thresholds (warning levels in this case) stored in other columns (in my case "low", "middle", "high"):
value low middle high
0 179.69 17.42 88.87 239.85
1 2.58 17.81 93.37 236.58
2 1.21 0.05 0.01 0.91
3 1.66 0.20 0.32 4.57
4 3.54 0.04 0.04 0.71
5 5.97 0.16 0.17 2.55
6 5.39 0.86 1.62 9.01
7 1.20 0.03 0.01 0.31
8 3.19 0.08 0.01 0.45
9 0.02 0.03 0.01 0.10
10 3.98 0.18 0.05 0.83
11 134.51 78.63 136.86 478.27
12 254.53 83.73 146.33 486.65
13 15.36 86.07 13.74 185.16
14 85.10 86.12 13.74 185.16
15 15.12 1.37 6.09 30.12
I would like to know which category each value falls into (e.g. the first value would be middle, the second would be below_low since it's smaller than any of its thresholds, the third would be high, ... you get the idea). So here is the expected output:
value low middle high category
0 179.69 17.42 88.87 239.85 middle
1 2.58 17.81 93.37 236.58 below_low
2 1.21 0.05 0.01 0.91 high
3 1.66 0.20 0.32 4.57 middle
4 3.54 0.04 0.04 0.71 high
5 5.97 0.16 0.17 2.55 high
6 5.39 0.86 1.62 9.01 middle
7 1.20 0.03 0.01 0.31 high
8 3.19 0.08 0.01 0.45 high
9 0.02 0.03 0.01 0.10 middle
10 3.98 0.18 0.05 0.83 high
11 134.51 78.63 136.86 478.27 low
12 254.53 83.73 146.33 486.65 middle
13 15.36 86.07 13.74 185.16 middle
14 85.10 86.12 13.74 185.16 middle
15 15.12 1.37 6.09 30.12 middle
So far I use this ugly procedure of "manually" checking line by line, stopping at the first category (from highest to lowest) whose threshold is smaller than the current value:
df["category"]="below_low"
for i in df.index:
for cat in ["high","middle","low"]:
if df.loc[i,"value"]>df.loc[i,cat]:
df.loc[i,"category"]=cat
break
I am aware of the pd.cut() method, but I only know how to use it with a predefined generic list of thresholds. Can somebody tell me what I am missing?

You can use:
remove the value column
compare with lt (less than)
change the order of the columns
cumulatively sum the columns - the first True gets 1
compare with 1 by eq
mask = (df.drop('value', axis=1)
          .lt(df['value'], axis=0)
          .reindex(columns=['high', 'middle', 'low'])
          .cumsum(axis=1)
          .eq(1))
If all of the high, middle and low values in a row are False, a correction is necessary: I create a new column by inverting the mask and applying all.
mask['below_low'] = (~mask).all(axis=1)
print (mask)
high middle low below_low
0 False True False False
1 False False False True
2 True False False False
3 False True False False
4 True False False False
5 True False False False
6 False True False False
7 True False False False
8 True False False False
9 False True True False
10 True False False False
11 False False True False
12 False True False False
13 False True True False
14 False True True False
15 False True False False
Last, call DataFrame.idxmax:
df['category'] = mask.idxmax(axis=1)
print (df)
value low middle high category
0 179.69 17.42 88.87 239.85 middle
1 2.58 17.81 93.37 236.58 below_low
2 1.21 0.05 0.01 0.91 high
3 1.66 0.20 0.32 4.57 middle
4 3.54 0.04 0.04 0.71 high
5 5.97 0.16 0.17 2.55 high
6 5.39 0.86 1.62 9.01 middle
7 1.20 0.03 0.01 0.31 high
8 3.19 0.08 0.01 0.45 high
9 0.02 0.03 0.01 0.10 middle
10 3.98 0.18 0.05 0.83 high
11 134.51 78.63 136.86 478.27 low
12 254.53 83.73 146.33 486.65 middle
13 15.36 86.07 13.74 185.16 middle
14 85.10 86.12 13.74 185.16 middle
15 15.12 1.37 6.09 30.12 middle
Solution with multiple numpy.where, as pointed out by Paul H:
df['category'] = np.where(df['high'] < df['value'], 'high',
                 np.where(df['middle'] < df['value'], 'middle',
                 np.where(df['low'] < df['value'], 'low', 'below_low')))
print (df)
value low middle high category
0 179.69 17.42 88.87 239.85 middle
1 2.58 17.81 93.37 236.58 below_low
2 1.21 0.05 0.01 0.91 high
3 1.66 0.20 0.32 4.57 middle
4 3.54 0.04 0.04 0.71 high
5 5.97 0.16 0.17 2.55 high
6 5.39 0.86 1.62 9.01 middle
7 1.20 0.03 0.01 0.31 high
8 3.19 0.08 0.01 0.45 high
9 0.02 0.03 0.01 0.10 middle
10 3.98 0.18 0.05 0.83 high
11 134.51 78.63 136.86 478.27 low
12 254.53 83.73 146.33 486.65 middle
13 15.36 86.07 13.74 185.16 middle
14 85.10 86.12 13.74 185.16 middle
15 15.12 1.37 6.09 30.12 middle
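The nested calls can also be flattened with np.select, which evaluates a list of conditions in order and uses the label of the first one that is True (a minimal sketch, assuming the same column names as above):
import numpy as np

# conditions are checked top to bottom, mirroring the high -> middle -> low order
conditions = [df['high'] < df['value'],
              df['middle'] < df['value'],
              df['low'] < df['value']]
df['category'] = np.select(conditions, ['high', 'middle', 'low'],
                           default='below_low')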

In every other universe, you should use jezrael's classic vectorized approaches. However, if you're curious about the apply way of doing things, you could do:
In [702]: df.apply(lambda x: 'high' if x.value > x['high']
                   else 'middle' if x.value > x['middle']
                   else 'low' if x.value > x['low']
                   else 'below_low', axis=1)
Out[702]:
0 middle
1 below_low
2 high
3 middle
4 high
5 high
6 middle
7 high
8 high
9 middle
10 high
11 low
12 middle
13 middle
14 middle
15 middle
dtype: object

Related

How to list the specific countries in a df which have NaN values?

I have created df_nan below, which shows the sum of NaN values from the main df, i.e. how many there are in each specific column.
However, I want to create a new df, which has a column/index of countries, then another with the number of NaN values for the given country.
Country Number of NaN Values
Aruba 4
Finland 3
I feel like I have to use groupby to create something along the lines of the below, but .isna is not an attribute of groupby objects. Any help would be great, thanks!
df_nan2= df_nan.groupby(['Country']).isna().sum()
Current code
import pandas as pd
import seaborn as sns
import numpy as np
from scipy.stats import spearmanr
# given dataframe df
df = pd.read_csv('countries.csv')
df.drop(columns= ['Population (millions)', 'HDI', 'GDP per Capita','Fish Footprint','Fishing Water',
'Urban Land','Earths Required', 'Countries Required', 'Data Quality'], axis=1, inplace = True)
df_nan= df.isna().sum()
Head of main df
0 Afghanistan Middle East/Central Asia 0.30 0.20 0.08 0.18 0.79 0.24 0.20 0.02 0.50 -0.30
1 Albania Northern/Eastern Europe 0.78 0.22 0.25 0.87 2.21 0.55 0.21 0.29 1.18 -1.03
2 Algeria Africa 0.60 0.16 0.17 1.14 2.12 0.24 0.27 0.03 0.59 -1.53
3 Angola Africa 0.33 0.15 0.12 0.20 0.93 0.20 1.42 0.64 2.55 1.61
4 Antigua and Barbuda Latin America NaN NaN NaN NaN 5.38 NaN NaN NaN 0.94 -4.44
5 Argentina Latin America 0.78 0.79 0.29 1.08 3.14 2.64 1.86 0.66 6.92 3.78
6 Armenia Middle East/Central Asia 0.74 0.18 0.34 0.89 2.23 0.44 0.26 0.10 0.89 -1.35
7 Aruba Latin America NaN NaN NaN NaN 11.88 NaN NaN NaN 0.57 -11.31
8 Australia Asia-Pacific 2.68 0.63 0.89 4.85 9.31 5.42 5.81 2.01 16.57 7.26
9 Austria European Union 0.82 0.27 0.63 4.14 6.06 0.71 0.16 2.04 3.07 -3.00
10 Azerbaijan Middle East/Central Asia 0.66 0.22 0.11 1.25 2.31 0.46 0.20 0.11 0.85 -1.46
11 Bahamas Latin America 0.97 1.05 0.19 4.46 6.84 0.05 0.00 1.18 9.55 2.71
12 Bahrain Middle East/Central Asia 0.52 0.45 0.16 6.19 7.49 0.01 0.00 0.00 0.58 -6.91
13 Bangladesh Asia-Pacific 0.29 0.00 0.08 0.26 0.72 0.25 0.00 0.00 0.38 -0.35
14 Barbados Latin America 0.56 0.24 0.14 3.28 4.48 0.08 0.00 0.02 0.19 -4.29
15 Belarus Northern/Eastern Europe 1.32 0.12 0.91 2.57 5.09 1.52 0.30 1.71 3.64 -1.45
16 Belgium European Union 1.15 0.48 0.99 4.43 7.44 0.56 0.03 0.28 1.19 -6.25
17 Benin Africa 0.49 0.04 0.26 0.51 1.41 0.44 0.04 0.34 0.88 -0.53
18 Bermuda North America NaN NaN NaN NaN 5.77 NaN NaN NaN 0.13 -5.64
19 Bhutan Asia-Pacific 0.50 0.42 3.03 0.63 4.84 0.28 0.34 4.38 5.27 0.43
NaN head
Country 0
Region 0
Cropland Footprint 15
Grazing Footprint 15
Forest Footprint 15
Carbon Footprint 15
Total Ecological Footprint 0
Cropland 15
Grazing Land 15
Forest Land 15
Total Biocapacity 0
Biocapacity Deficit or Reserve 0
dtype: int64
Suppose you want to get the null count for each Country from the "Cropland Footprint" column; then you can use the following code -
Unique_Country = df['Country'].unique()
Col1 = 'Cropland Footprint'
NullCount = []
for i in Unique_Country:
    s = df[df['Country']==i][Col1].isnull().sum()
    NullCount.append(s)
df2 = pd.DataFrame({'Country': Unique_Country,
                    'Number of NaN Values': NullCount})
df2 = df2[df2['Number of NaN Values']!=0]
df2
Output -
Country Number of NaN Values
Antigua and Barbuda 1
Aruba 1
Bermuda 1
If you want the null count for another column, just change the value of the Col1 variable.
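A vectorized alternative (a minimal sketch, grouping the NaN indicator for one column by Country, in the spirit of the groupby attempt in the question):
# count NaNs in one column per country, then drop the zero rows
nan_counts = (df['Cropland Footprint'].isna()
              .groupby(df['Country'])
              .sum()
              .rename('Number of NaN Values')
              .reset_index())
nan_counts = nan_counts[nan_counts['Number of NaN Values'] != 0]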

Pandas rolling cumulative sum of across two dataframes

I'm looking to create a rolling grouped cumulative sum across two dataframes. I can get the result via iteration, but wanted to see if there was a more intelligent way.
I need the 5-row block of A to roll through the rows of B and accumulate. Think of it as a rolling balance with a block of contributions and rolling returns.
So, here's the calculation for C
A B
1 100.00 1 0.01 101.00
2 110.00 2 0.02 215.22 102.00
3 120.00 3 0.03 345.28 218.36 103.00
4 130.00 4 0.04 494.29 351.89 221.52 104.00
5 140.00 5 0.05 666.00 505.99 358.60 224.70 105.00
6 0.06 684.75 517.91 365.38 227.90 106.00
7 0.07 703.97 530.06 372.25 231.12
8 0.08 723.66 542.43 379.21
9 0.09 743.85 555.04
10 0.10 764.54
C Row 5
Beginning Balance Contribution Return Ending Balance
0.00 100.00 0.01 101.00
101.00 110.00 0.02 215.22
215.22 120.00 0.03 345.28
345.28 130.00 0.04 494.29
494.29 140.00 0.05 666.00
C Row 6
Beginning Balance Contribution Return Ending Balance
0.00 100.00 0.02 102.00
102.00 110.00 0.03 218.36
218.36 120.00 0.04 351.89
351.89 130.00 0.05 505.99
505.99 140.00 0.06 684.75
Here's what the source data looks like:
A B
1 100.00 1 0.01
2 110.00 2 0.02
3 120.00 3 0.03
4 130.00 4 0.04
5 140.00 5 0.05
6 0.06
7 0.07
8 0.08
9 0.09
10 0.10
Here is the desired result:
C
1 NaN
2 NaN
3 NaN
4 NaN
5 666.00
6 684.75
7 703.97
8 723.66
9 743.85
10 764.54
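One way to express this calculation (a minimal sketch, assuming A holds the contributions and B the returns as in the source data): each window compounds the five contributions through five consecutive returns via ending = (beginning + contribution) * (1 + return), and rolling().apply() slides that window down B.
import pandas as pd

A = pd.Series([100.0, 110.0, 120.0, 130.0, 140.0], index=range(1, 6))
B = pd.Series([0.01, 0.02, 0.03, 0.04, 0.05,
               0.06, 0.07, 0.08, 0.09, 0.10], index=range(1, 11))

def ending_balance(returns):
    # compound the fixed block of contributions through one window of returns
    balance = 0.0
    for contribution, r in zip(A.values, returns):
        balance = (balance + contribution) * (1 + r)
    return balance

C = B.rolling(len(A)).apply(ending_balance, raw=True)
# C is NaN for rows 1-4, then 666.00, 684.75, 703.97, ... as desired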

Get overall smallest elements' distribution in dataframe with sorted columns more efficiently

I have a dataframe with sorted columns, something like this:
df = pd.DataFrame({q: np.sort(np.random.randn(10).round(2)) for q in ['blue', 'green', 'red']})
blue green red
0 -2.15 -0.76 -2.62
1 -0.88 -0.62 -1.65
2 -0.77 -0.55 -1.51
3 -0.73 -0.17 -1.14
4 -0.06 -0.16 -0.75
5 -0.03 0.05 -0.08
6 0.06 0.38 0.37
7 0.41 0.76 1.04
8 0.56 0.89 1.16
9 0.97 2.94 1.79
What I want to know is how many of the n smallest elements in the whole frame are in each column. This is the only thing I came up with:
is_small = df.isin(np.partition(df.values.flatten(), n)[:n])
with n=10 it looks like this:
blue green red
0 True True True
1 True False True
2 True False True
3 True False True
4 False False True
5 False False False
6 False False False
7 False False False
8 False False False
9 False False False
Then by applying np.sum I get the number corresponding to each column.
I'm dissatisfied with this solution because it in no way utilizes the sortedness of the original data. All the data gets partitioned and all the data is then checked for whether it's in the partition. It seems wasteful, and I can't seem to figure out a better way.
You can compare against the largest of the n smallest values from the partition and then use idxmin to leverage the sorted nature -
# Find largest of n smallest numbers
N = (np.partition(df.values.flatten(), n)[:n]).max()
out = (df<=N).idxmin(axis=0)
Sample run -
In [152]: np.random.seed(0)
In [153]: df = pd.DataFrame({q: np.sort(np.random.randn(10).round(2)) \
for q in ['blue', 'green', 'red']})
In [154]: df
Out[154]:
blue green red
0 -0.98 -0.85 -2.55
1 -0.15 -0.21 -1.45
2 -0.10 0.12 -0.74
3 0.40 0.14 -0.19
4 0.41 0.31 0.05
5 0.95 0.33 0.65
6 0.98 0.44 0.86
7 1.76 0.76 1.47
8 1.87 1.45 1.53
9 2.24 1.49 2.27
In [198]: n = 5
In [199]: N = (np.partition(df.values.flatten(), n)[:n]).max()
In [200]: (df<=N).idxmin(axis=0)
Out[200]:
blue 1
green 1
red 3
dtype: int64
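Since every column is already sorted, the final comparison can also be done with a binary search per column; a minimal sketch with np.searchsorted (side='right' counts the elements <= N, and avoids idxmin's edge case when a whole column is <= N):
counts = {col: np.searchsorted(df[col].values, N, side='right')
          for col in df.columns}
# for the sample run above: {'blue': 1, 'green': 1, 'red': 3}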
Let's say you are looking at the 10 smallest; you can stack and take value_counts of the 10 smallest:
df.stack().nsmallest(10).index.get_level_values(1).value_counts()
You get
red 5
blue 4
green 1

Pandas round is not working for DataFrame

Round works on a single element but not on the DataFrame; I tried DataFrame.round() but it didn't work... any ideas? Thanks.
Here is the code:
print "Panda Version: ", pd.__version__
print "['5am'][0]: ", x3['5am'][0]
print "Round element: ", np.round(x3['5am'][0]*4) /4
print "Round Dataframe: \r\n", np.round(x3 * 4, decimals=2) / 4
df = np.round(x3 * 4, decimals=2) / 4
print "Round Dataframe Again: \r\n", df.round(2)
Got result:
Panda Version: 0.18.0
['5am'][0]: 0.279914529915
Round element: 0.25
Round Dataframe:
5am 6am 7am 8am 9am 10am 11am
Date
2016-07-11 0.279915 0.279915 2.85256 4.52778 6.23291 9.01496 8.53632
2016-07-12 0.339744 0.369658 2.67308 4.52778 5.00641 7.30983 6.98077
2016-07-13 0.399573 0.459402 2.61325 3.83974 5.48504 6.77137 5.24573
2016-07-14 0.339744 0.549145 2.64316 3.36111 5.66453 5.96368 7.87821
2016-07-15 0.309829 0.459402 2.55342 4.64744 4.46795 6.80128 6.17308
2016-07-16 0.25 0.369658 2.46368 2.67308 4.58761 6.35256 5.63462
2016-07-17 0.279915 0.369658 2.58333 2.91239 4.19872 5.51496 6.65171
Round Dataframe Again:
5am 6am 7am 8am 9am 10am 11am
Date
2016-07-11 0.279915 0.279915 2.85256 4.52778 6.23291 9.01496 8.53632
2016-07-12 0.339744 0.369658 2.67308 4.52778 5.00641 7.30983 6.98077
2016-07-13 0.399573 0.459402 2.61325 3.83974 5.48504 6.77137 5.24573
2016-07-14 0.339744 0.549145 2.64316 3.36111 5.66453 5.96368 7.87821
2016-07-15 0.309829 0.459402 2.55342 4.64744 4.46795 6.80128 6.17308
2016-07-16 0.25 0.369658 2.46368 2.67308 4.58761 6.35256 5.63462
2016-07-17 0.279915 0.369658 2.58333 2.91239 4.19872 5.51496 6.65171
Try casting to float type:
x3.astype(float).round(2)
It is as simple as this:
df['col_name'] = df['col_name'].astype(float).round(2)
Explanation of your code:
In [166]: np.round(df * 4, decimals=2)
Out[166]:
a b c d
0 0.11 0.45 1.65 3.38
1 3.97 2.90 1.89 3.42
2 1.46 0.79 3.00 1.44
3 3.48 2.33 0.81 1.02
4 1.03 0.65 1.94 2.92
5 1.88 2.21 0.59 0.39
6 0.08 2.09 4.00 1.02
7 2.86 0.71 3.56 0.57
8 1.23 1.38 3.47 0.03
9 3.09 1.10 1.12 3.31
In [167]: np.round(df * 4, decimals=2) / 4
Out[167]:
a b c d
0 0.0275 0.1125 0.4125 0.8450
1 0.9925 0.7250 0.4725 0.8550
2 0.3650 0.1975 0.7500 0.3600
3 0.8700 0.5825 0.2025 0.2550
4 0.2575 0.1625 0.4850 0.7300
5 0.4700 0.5525 0.1475 0.0975
6 0.0200 0.5225 1.0000 0.2550
7 0.7150 0.1775 0.8900 0.1425
8 0.3075 0.3450 0.8675 0.0075
9 0.7725 0.2750 0.2800 0.8275
In [168]: np.round(np.round(df * 4, decimals=2) / 4, 2)
Out[168]:
a b c d
0 0.03 0.11 0.41 0.84
1 0.99 0.72 0.47 0.86
2 0.36 0.20 0.75 0.36
3 0.87 0.58 0.20 0.26
4 0.26 0.16 0.48 0.73
5 0.47 0.55 0.15 0.10
6 0.02 0.52 1.00 0.26
7 0.72 0.18 0.89 0.14
8 0.31 0.34 0.87 0.01
9 0.77 0.28 0.28 0.83
This is working properly for me (pandas 0.18.1)
In [162]: df = pd.DataFrame(np.random.rand(10,4), columns=list('abcd'))
In [163]: df
Out[163]:
a b c d
0 0.028700 0.112959 0.412192 0.845663
1 0.991907 0.725550 0.472020 0.856240
2 0.365117 0.197468 0.750554 0.360272
3 0.870041 0.582081 0.203692 0.255915
4 0.257433 0.161543 0.483978 0.730548
5 0.470767 0.553341 0.146612 0.096358
6 0.020052 0.522482 0.999089 0.254312
7 0.714934 0.178061 0.889703 0.143701
8 0.308284 0.344552 0.868151 0.007825
9 0.771984 0.274245 0.280431 0.827999
In [164]: df.round(2)
Out[164]:
a b c d
0 0.03 0.11 0.41 0.85
1 0.99 0.73 0.47 0.86
2 0.37 0.20 0.75 0.36
3 0.87 0.58 0.20 0.26
4 0.26 0.16 0.48 0.73
5 0.47 0.55 0.15 0.10
6 0.02 0.52 1.00 0.25
7 0.71 0.18 0.89 0.14
8 0.31 0.34 0.87 0.01
9 0.77 0.27 0.28 0.83
Similar issue here: df.round(1) didn't round as expected (e.g. .400000000123), but df.astype('float64').round(1) worked. Significantly, the dtype of df was float32. Apparently round() doesn't work properly on float32. How is this behavior not a bug?
As I just found here,
"round does not modify in-place. Rather, it returns the dataframe
rounded."
It might be helpful to think of it as follows:
df.round(2) is doing the correct rounding operation, but you are not displaying the result or saving it anywhere.
Thus, df_final = df.round(2) will likely give you the expected behavior, instead of just df.round(2), because the result of the rounding operation is now saved to the df_final dataframe.
Additionally, it might be best to use df_final = df.round(2).copy() instead of simply df_final = df.round(2). I find that some things return unexpected results if I don't assign a copy of the old dataframe to the new dataframe.
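A minimal illustration of the difference:
df.round(2)             # rounds, but the result is discarded
df_final = df.round(2)  # rounds and keeps the result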
I've tried to reproduce your situation, and it seems to work nicely.
import pandas as pd
import numpy as np
from io import StringIO
s = """Date 5am 6am 7am 8am 9am 10am 11am
2016-07-11 0.279915 0.279915 2.85256 4.52778 6.23291 9.01496 8.53632
2016-07-12 0.339744 0.369658 2.67308 4.52778 5.00641 7.30983 6.98077
2016-07-13 0.399573 0.459402 2.61325 3.83974 5.48504 6.77137 5.24573
2016-07-14 0.339744 0.549145 2.64316 3.36111 5.66453 5.96368 7.87821
2016-07-15 0.309829 0.459402 2.55342 4.64744 4.46795 6.80128 6.17308
2016-07-16 0.25 0.369658 2.46368 2.67308 4.58761 6.35256 5.63462
2016-07-17 0.279915 0.369658 2.58333 2.91239 4.19872 5.51496 6.65171
"""
df = pd.read_table(StringIO(s), delim_whitespace=True)
df.set_index('Date').round(2)

Access single cell of pandas dataframe?

I have the following data with some missing holes. I've looked over the 'how to handle missing data' docs but can't find anything that applies in this situation. Here is the data:
Species GearUsed AverageFishWeight(lbs) NormalRange(lbs) Caught
0 BlackBullhead Gillnet 0.11 0.8-7.7 0.18
1 BlackCrappie Trapnet 6.22 0.7-3.4 0.30
2 NaN Gillnet 1.00 0.6-3.5 0.30
3 Bluegill Trapnet 11.56 6.1-46.6 0.14
4 NaN Gillnet 1.44 NaN 0.21
5 BrownBullhead Trapnet 0.11 0.4-2.1 1.01
6 NorthernPike Trapnet 0.22 NaN 4.32
7 NaN Gillnet 2.22 3.5-10.5 5.63
8 Pumpkinseed Trapnet 0.89 2.0-8.5 0.23
9 RockBass Trapnet 0.22 0.5-1.8 0.04
10 Walleye Trapnet 0.22 0.3-0.7 0.28
11 NaN Gillnet 1.56 1.3-5.0 2.54
12 WhiteSucker Trapnet 0.33 0.3-1.4 2.76
13 NaN Gillnet 1.78 0.5-2.7 1.32
14 YellowPerch Trapnet 1.33 0.5-3.3 0.14
15 NaN Gillnet 27.67 3.4-43.6 0.14
I need the NaNs in the Species column to just be the name above them; for example, row 2 would be BlackCrappie. I would like to iterate through the frame and manually specify the species name, but am not too sure how, and other answers also recommend against iterating through the dataframe in the first place.
How do I access each cell of the frame individually? Thanks!
PS the column names are incorrect, there is not a 27lb yellow perch. :)
Do you want to fill the missing values in other rows as well? Seems to be what fillna() is for:
In [83]:
print df.fillna(method='pad')
Species GearUsed AverageFishWeight(lbs) NormalRange(lbs) Caught
0 BlackBullhead Gillnet 0.11 0.8-7.7 0.18
1 BlackCrappie Trapnet 6.22 0.7-3.4 0.30
2 BlackCrappie Gillnet 1.00 0.6-3.5 0.30
3 Bluegill Trapnet 11.56 6.1-46.6 0.14
4 Bluegill Gillnet 1.44 6.1-46.6 0.21
5 BrownBullhead Trapnet 0.11 0.4-2.1 1.01
6 NorthernPike Trapnet 0.22 0.4-2.1 4.32
7 NorthernPike Gillnet 2.22 3.5-10.5 5.63
8 Pumpkinseed Trapnet 0.89 2.0-8.5 0.23
9 RockBass Trapnet 0.22 0.5-1.8 0.04
10 Walleye Trapnet 0.22 0.3-0.7 0.28
11 Walleye Gillnet 1.56 1.3-5.0 2.54
12 WhiteSucker Trapnet 0.33 0.3-1.4 2.76
13 WhiteSucker Gillnet 1.78 0.5-2.7 1.32
14 YellowPerch Trapnet 1.33 0.5-3.3 0.14
15 YellowPerch Gillnet 27.67 3.4-43.6 0.14
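If you only want the Species column filled (a minimal sketch; this leaves the NaNs in NormalRange(lbs) alone, since padding those would copy ranges from a different species):
df['Species'] = df['Species'].fillna(method='pad')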
