Understanding the cut command on a pandas DataFrame

The class variable in my DataFrame consists of a few distinct numbers; their value counts are:
5    681
6    638
7    199
4     53
8     18
3     10
I have seen the following command on a website:
bins = (2, 6.5, 8)
group_names = ['bad', 'good']
categories = pd.cut(df['quality'], bins, labels=group_names)
df['quality'] = categories
After that, the quality column contains only two categorical values: bad and good. I am interested in how exactly this works. If a number is between 2 and 6.5, is it bad and all the others good, or vice versa? Please explain this to me.

Consider:
import pandas as pd

df = pd.DataFrame({'score': range(10)})
bins = (2, 6.5, 8)
labels = ('bad', 'good')
df['quality'] = pd.cut(df['score'], bins, labels=labels)
print(df)
The output is:
score quality
0 0 NaN
1 1 NaN
2 2 NaN
3 3 bad
4 4 bad
5 5 bad
6 6 bad
7 7 good
8 8 good
9 9 NaN
There are 2 bins into which the score data is assigned:
(2, 6.5] and (6.5, 8]
The left end of each interval is exclusive and the right end is inclusive.
All numbers in (2, 6.5] evaluate to bad and those in (6.5, 8] evaluate to good.
Data points that fall outside both intervals get no value, hence NaN.
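If the opposite closure is wanted, pd.cut accepts right=False, which flips the intervals to left-inclusive; a small sketch reusing the example above:

```python
import pandas as pd

df = pd.DataFrame({'score': range(10)})
bins = (2, 6.5, 8)
labels = ('bad', 'good')

# right=False turns the bins into [2, 6.5) and [6.5, 8):
# now the left edge is inclusive and the right edge is exclusive.
df['quality'] = pd.cut(df['score'], bins, labels=labels, right=False)
print(df)
```

With right=False, a score of 2 now lands in bad, while 8 falls outside [6.5, 8) and becomes NaN.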


Automated way to get binary variables from a database

I am working with a database on dengue. Among its variables is "Cases", which gives the number of dengue cases in a given period. I want to fit a logistic regression model to these data, so the idea is to turn this integer variable into a binary one: 0 for places that had no dengue cases in the period, and 1 for places that had cases. Since there are 35628 rows, I want to do this in an automated way rather than by hand. I'm new to programming and am trying to implement it in R; if there is a package that does this, that helps a lot. Each neighborhood is coded as a number.
I appreciate any help and thank you very much.
neighborhood  Dates   Cases  precipitation  Temperature
0             Jan/14     10          149.6        33.25
1             Fev/14      0          254.0        30.10
2             Mar/14      6          150.0        25.40
3             Apr/14      0          244.1        32.50
4             May/14      3           44.3        33.20
R
Pick from among
dat$CasesBin1 <- (dat$Cases > 0)
dat$CasesBin2 <- +(dat$Cases > 0)
dat
# neighborhood Dates Cases precipitation Temperature CasesBin1 CasesBin2
# 1 0 Jan/14 10 149.6 33.25 TRUE 1
# 2 1 Fev/14 0 254.0 30.10 FALSE 0
# 3 2 Mar/14 6 150.0 25.40 TRUE 1
# 4 3 Apr/14 0 244.1 32.50 FALSE 0
# 5 4 May/14 3 44.3 33.20 TRUE 1
In R at least, most logistic regression tools I've used work fine with either integer (0/1) or logical, but you may need to verify with the tools you will use.
Data:
dat <- structure(list(neighborhood = 0:4, Dates = c("Jan/14", "Fev/14", "Mar/14", "Apr/14", "May/14"), Cases = c(10L, 0L, 6L, 0L, 3L), precipitation = c(149.6, 254, 150, 244.1, 44.3), Temperature = c(33.25, 30.1, 25.4, 32.5, 33.2)), class = "data.frame", row.names = c(NA, -5L))
python
In [13]: dat
Out[13]:
neighborhood Dates Cases precipitation Temperature
0 0 Jan/14 10 149.6 33.25
1 1 Fev/14 0 254.0 30.10
2 2 Mar/14 6 150.0 25.40
3 3 Apr/14 0 244.1 32.50
4 4 May/14 3 44.3 33.20
In [17]: dat['CasesBin1'] = dat['Cases'].apply(lambda x: (x > 0))
In [18]: dat['CasesBin2'] = dat['Cases'].apply(lambda x: int(x > 0))
In [19]: dat
Out[19]:
neighborhood Dates Cases ... Temperature CasesBin1 CasesBin2
0 0 Jan/14 10 ... 33.25 True 1
1 1 Fev/14 0 ... 30.10 False 0
2 2 Mar/14 6 ... 25.40 True 1
3 3 Apr/14 0 ... 32.50 False 0
4 4 May/14 3 ... 33.20 True 1
[5 rows x 7 columns]
Data:
In [11]: js
Out[11]: '[{"neighborhood":0,"Dates":"Jan/14","Cases":10,"precipitation":149.6,"Temperature":33.25},{"neighborhood":1,"Dates":"Fev/14","Cases":0,"precipitation":254,"Temperature":30.1},{"neighborhood":2,"Dates":"Mar/14","Cases":6,"precipitation":150,"Temperature":25.4},{"neighborhood":3,"Dates":"Apr/14","Cases":0,"precipitation":244.1,"Temperature":32.5},{"neighborhood":4,"Dates":"May/14","Cases":3,"precipitation":44.3,"Temperature":33.2}]'
In [12]: dat = pd.read_json(js)
Sorry, I didn't see that you wanted to implement it in the R language. Below is suggested code in Python.
Assuming that the table is in a DataFrame df, you could create a new column 'dengue_cases' with 0 when there are no cases, and 1 when there are cases
df['Cases'] = df['Cases'].astype('int') #to ensure the correct data type in column
df['dengue_cases'] = df['Cases'].apply(lambda x: 0 if x==0 else 1)
The above lines will create a new column. If you are replacing the original column use below line:
df['Cases'] = df['Cases'].apply(lambda x: 0 if x==0 else 1)
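Both apply versions work, but in pandas the comparison can be done on the whole column at once, which is more idiomatic and faster on 35628 rows; a sketch using a small stand-in for the real data:

```python
import pandas as pd

# hypothetical stand-in for the real Cases column
df = pd.DataFrame({'Cases': [10, 0, 6, 0, 3]})

# compare the whole column at once, then cast the boolean result to 0/1
df['dengue_cases'] = (df['Cases'] > 0).astype(int)
print(df['dengue_cases'].tolist())  # [1, 0, 1, 0, 1]
```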

Calculating the max and index of max within a section of array

Following the Stack Overflow post Elegantly calculate mean of first three values of a list, I have tweaked the code to find the maximum.
However, I also need to know the position (index) of each maximum.
The code below calculates the max of the first 3 numbers, then the max of the next 3 numbers, and so on.
For example, for the list [6 3 7 4 6 9 2 6 7 4 3 7 7 2 5 4 1 7 5 1], it takes the first 3 values 6, 3, 7 and outputs the max 7, then for the next 3 values 4, 6, 9 it outputs 9, and so on.
But I also want to know the position of each maximum, i.e. 7 is at position 2 and 9 at position 5, giving the final result [2,5,8,11,12,...]. Any ideas on how to calculate the indices? Thanks in advance.
import numpy as np
np.random.seed(42)
test_data = np.random.randint(low = 0, high = 10, size = 20)
maxval = [max(test_data[i:i+3]) for i in range(0,len(test_data),3)]
print(test_data)
print(maxval)
output: test_data : [6 3 7 4 6 9 2 6 7 4 3 7 7 2 5 4 1 7 5 1]
output: [7, 9, 7, 7, 7, 7, 5]
import numpy as np
np.random.seed(42)
test_data = np.random.randint(low = 0, high = 10, size = 20)
maxval = [max(test_data[i:i+3]) for i in range(0,len(test_data),3)]
index = [(np.argmax(test_data[i: i+3]) + i) for i in range(0,len(test_data),3)]
print(test_data)
print(maxval)
print(index)
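For what it's worth, the per-chunk loop can also be replaced with a reshape when you want to avoid Python-level iteration; this is a sketch that assumes the trailing partial chunk (here 2 elements) is handled separately, since 20 is not divisible by 3:

```python
import numpy as np

np.random.seed(42)
test_data = np.random.randint(low=0, high=10, size=20)

window = 3
n_full = (len(test_data) // window) * window  # elements covered by full windows

# argmax within each full window, then shift by each window's start offset
rel = test_data[:n_full].reshape(-1, window).argmax(axis=1)
index = (rel + np.arange(0, n_full, window)).tolist()

# append the argmax of the leftover partial window, if any
if n_full < len(test_data):
    index.append(int(np.argmax(test_data[n_full:])) + n_full)

print(index)
```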

Creating a DataFrame from several lists of lists

I need to build a DataFrame from 10 lists of lists. I did it manually, but it takes time. What is a better way to do it?
I tried doing it manually and it works fine (#1).
I tried code (#2) for better performance, but it returns only the last column.
#1
import pandas as pd
import numpy as np

a1T = [[7, 8, 9], [10, 11, 12], [13, 14, 15]]
a2T = [[1, 2, 3], [5, 0, 2], [3, 4, 5]]
print(a1T)
# Output: [[7, 8, 9], [10, 11, 12], [13, 14, 15]]
vis1 = np.array(a1T)
vis_1_1 = vis1.T
tmp2 = np.array(a2T)
tmp_2_1 = tmp2.T
X = np.column_stack([vis_1_1, tmp_2_1])
dataset_all = pd.DataFrame({"Visab1": X[:, 0], "Visab2": X[:, 1], "Visab3": X[:, 2], "Temp1": X[:, 3], "Temp2": X[:, 4], "Temp3": X[:, 5]})
print(dataset_all)
Output: Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
> Actually I have a varying number of columns in the DataFrame (500-1500), which is why I need auto-generated column names. The extra index (1, 2, 3) after the names Visab, Temp and so on is constant for every case. See the code below.
For better performance I tried (#2):
n = 3  # This is a varying parameter; it affects the number of columns in the table.
m = 2  # This is constant for every case; here 2, because we have "Visab", "Temp".
mlist = ('Visab', 'Temp')
nlist = [range(1, n)]
for j in range(1, n):
    for i in range(1, m):
        col = i + (j - 1) * n
        dataset_all = pd.DataFrame({mlist[j] + str(i): X[:, col]})
I expect output like
Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
but I get no result, only the error "expected an indented block".
Ok, so the number of columns n is the number of sublists in each list, right? You can measure that with len:
len(a1T)
#Output
3
I'll simplify the answer above so you don't need X and add automatic column-names creation:
my_lists = [a1T, a2T]
my_names = ["Visab", "Temp"]
dfs = []
for one_list, name in zip(my_lists, my_names):
    n_columns = len(one_list)
    col_names = [name + "_" + str(n) for n in range(n_columns)]
    df = pd.DataFrame(one_list).T
    df.columns = col_names
    dfs.append(df)
dataset_all = pd.concat(dfs, axis=1)
#Output
Visab_0 Visab_1 Visab_2 Temp_0 Temp_1 Temp_2
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
Now it's much clearer. So you have:
X=np.column_stack([vis_1_1, tmp_2_1])
Let's create a list with the names of the columns:
columns_names = ["Visab1","Visab2","Visab3","Temp1","Temp2","Temp3"]
Now you can directly make a dataframe like this:
dataset_all = pd.DataFrame(X,columns=columns_names)
#Output
Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
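Since the real DataFrame has 500-1500 columns, the names list itself can be generated instead of typed out; a sketch, assuming the column order produced by column_stack (all Visab columns first, then all Temp columns):

```python
import numpy as np
import pandas as pd

a1T = [[7, 8, 9], [10, 11, 12], [13, 14, 15]]
a2T = [[1, 2, 3], [5, 0, 2], [3, 4, 5]]
X = np.column_stack([np.array(a1T).T, np.array(a2T).T])

n = 3  # columns per name; in the real data this varies
# all Visab names first, then all Temp names, matching column_stack's order
columns_names = [f"{name}{i}" for name in ("Visab", "Temp") for i in range(1, n + 1)]

dataset_all = pd.DataFrame(X, columns=columns_names)
print(dataset_all)
```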

Making a bar chart to represent the number of occurrences in a Pandas Series

I was wondering if anyone could help me with how to make a bar chart to show the frequencies of values in a Pandas Series.
I start with a Pandas DataFrame of shape (2000, 7), and from there I extract the last column. The column is shape (2000,).
The entries in the Series that I mentioned vary from 0 to 17, each with different frequencies, and I tried to plot them using a bar chart but faced some difficulties. Here is my code:
# First, I counted the number of occurrences.
count = np.zeros(max(data_val))
for i in range(count.shape[0]):
    for j in range(data_val.shape[0]):
        if (i == data_val[j]):
            count[i] = count[i] + 1
'''
This gives us
count = array([192., 105., ... 19.])
'''
temp = np.arange(0, 18, 1)  # Array for the x-axis.
plt.bar(temp, count)
I am getting an error on the last line of code, saying that the objects cannot be broadcast to a single shape.
What I ultimately want is a bar chart where each bar corresponds to an integer value from 0 to 17, and the height of each bar (i.e. the y-axis) represents the frequencies.
Thank you.
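For reference, the broadcast error comes from np.zeros(max(data_val)): with labels 0-17 that allocates only 17 slots, while temp has 18 entries. A numpy-only fix, sketched on a small hypothetical stand-in for data_val, replaces the double loop with np.bincount:

```python
import numpy as np

# hypothetical stand-in for the (2000,) array of labels 0-17
data_val = np.array([0, 0, 1, 5, 17, 17, 17, 3])

# bincount tallies occurrences of each non-negative integer;
# minlength=18 pads unseen high labels with zeros so the result
# always lines up with temp = np.arange(18)
count = np.bincount(data_val, minlength=18)

temp = np.arange(18)
# plt.bar(temp, count) now works: both arrays have 18 entries
print(count)
```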
UPDATE
I decided to post the fixed code using the suggestions that people were kind enough to give below, just in case anybody facing similar issues will be able to see my revised code in the future.
data = pd.read_csv("./data/train.csv") # Original data is a (2000, 7) DataFrame
# data contains 6 feature columns and 1 target column.
# Separate the design matrix from the target labels.
X = data.iloc[:, :-1]
y = data['target']
'''
The next line of code uses pandas.Series.value_counts() on y in order to count
the number of occurrences for each label, and then proceeds to sort these according to
index (i.e. label).
You can also use pandas.DataFrame.sort_values() instead if you're interested in sorting
according to the number of frequencies rather than labels.
'''
y.value_counts().sort_index().plot.bar(x='Target Value', y='Number of Occurrences')
There was no need for the for loops once we used the methods built into the Pandas library.
The specific methods mentioned in the answers are pandas.Series.value_counts(), pandas.DataFrame.sort_index(), and pandas.DataFrame.plot.bar().
I believe you need value_counts with Series.plot.bar:
df = pd.DataFrame({
    'a': [4, 5, 4, 5, 5, 4],
    'b': [7, 8, 9, 4, 2, 3],
    'c': [1, 3, 5, 7, 1, 0],
    'd': [1, 1, 6, 1, 6, 5],
})
print(df)
a b c d
0 4 7 1 1
1 5 8 3 1
2 4 9 5 6
3 5 4 7 1
4 5 2 1 6
5 4 3 0 5
df['d'].value_counts(sort=False).plot.bar()
If some values may be missing and you need to set them to 0, add reindex:
df['d'].value_counts(sort=False).reindex(np.arange(18), fill_value=0).plot.bar()
Detail:
print (df['d'].value_counts(sort=False))
1 3
5 1
6 2
Name: d, dtype: int64
print (df['d'].value_counts(sort=False).reindex(np.arange(18), fill_value=0))
0 0
1 3
2 0
3 0
4 0
5 1
6 2
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
Name: d, dtype: int64
Here's an approach using Seaborn
import numpy as np
import pandas as pd
import seaborn as sns
s = pd.Series(np.random.choice(17, 10))
s
# 0 10
# 1 13
# 2 12
# 3 0
# 4 0
# 5 5
# 6 13
# 7 9
# 8 11
# 9 0
# dtype: int64
val, cnt = np.unique(s, return_counts=True)
val, cnt
# (array([ 0, 5, 9, 10, 11, 12, 13]), array([3, 1, 1, 1, 1, 1, 2]))
sns.barplot(val, cnt)

How can I convert an age-binning feature into numerical data?

I have created an agebin column from the age column. I have ranges of ages, but how can I convert them into a numerical data type? I want to check whether agebin is an important feature or not.
I tried following code for age binning:
traindata = data.assign(age_bins = pd.cut(data.age, 4, retbins=False, include_lowest=True))
data['agebin'] = traindata['age_bins']
data['agebin'].unique()
[[16.954, 28.5], (28.5, 40], (40, 51.5], (51.5, 63]]
Categories (4, object): [[16.954, 28.5] < (28.5, 40] < (40, 51.5] < (51.5, 63]]
What I tried :
data['enc_agebin'] = data.agebin.map({[16.954, 28.5]:1,(28.5, 40]:2,(40, 51.5]:3,(51.5, 63]:4})
I tried to map each range to a number, but I am getting a syntax error. Please suggest a good technique for converting agebin, which is categorical, to numerical data.
I think you need the labels parameter in cut:
data = pd.DataFrame({'age':[10,20,40,50,44,56,12,34,56]})
data['agebin'] = pd.cut(data.age,bins=4,labels=range(1, 5), retbins=False,include_lowest=True)
print (data)
age agebin
0 10 1
1 20 1
2 40 3
3 50 4
4 44 3
5 56 4
6 12 1
7 34 3
8 56 4
Or use labels=False; then the first bin is 0 and the last is 3 (like range(4)):
data['agebin'] = pd.cut(data.age, bins=4, labels=False, retbins=False, include_lowest=True)
print (data)
age agebin
0 10 0
1 20 0
2 40 2
3 50 3
4 44 2
5 56 3
6 12 0
7 34 2
8 56 3
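If the interval-labelled agebin column from the question already exists, it can also be converted to integers after the fact through the categorical's codes, without re-running cut; a sketch on the same toy data:

```python
import pandas as pd

data = pd.DataFrame({'age': [10, 20, 40, 50, 44, 56, 12, 34, 56]})
data['agebin'] = pd.cut(data.age, bins=4, include_lowest=True)  # interval labels

# .cat.codes numbers the ordered categories 0..3, first bin = 0
data['enc_agebin'] = data['agebin'].cat.codes
print(data['enc_agebin'].tolist())
```

This gives the same 0-3 encoding as labels=False.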
