Adding column names and values to statistic output in Python? - python

Background:
I'm currently developing some data profiling in SQL Server. This consists of calculating aggregate statistics on the values in targeted columns.
I'm using SQL for most of the heavy lifting, but calling Python for some of the statistics that SQL is poor at calculating. I'm leveraging the pandas package through SQL Server Machine Learning Services.
However, I'm currently developing this script in Visual Studio. The SQL portion is irrelevant other than as background.
Problem:
My issue is that when I call one of the Python statistics functions, it produces the output as a series with the labels seemingly not part of the data. I cannot access the labels at all. I need the values of these labels, and I need to normalize the data and insert a column with static values describing which calculation was performed on that row.
Constraints:
I will need to normalize each statistic so I can union the datasets and pass the values back to SQL for further processing. All output needs to accept dynamic schemas, so no hardcoding labels etc.
Attempted solutions:
I've tried explicitly coercing output to dataframes. This just results in a series with label "0".
I've also tried adding static values to the columns. This just adds the target column name as one of the inaccessible labels, and the intended static value as part of the series.
I've searched many times for a solution, and couldn't find anything relevant to the problem.
Code and results below. Using the iris dataset as an example.
###########################
## AGG STATS TEST SCRIPT
##
###########################
#LOAD MODULES
import pandas as pds
#GET SAMPLE DATASET
iris = pds.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
#CENTRAL TENDENCY
mode1 = iris.mode()
stat_mode = pds.melt(
mode1
)
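stat_median = iris.median(numeric_only=True) #numeric_only is required in pandas 2.x; older versions silently dropped the string column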
stat_median['STAT_NAME'] = 'STAT_MEDIAN' #Try to add a column with the value 'STAT_MEDIAN'
#AGGREGATE STATS
stat_describe = iris.describe()
#PRINT RESULTS
print(iris)
print(stat_median)
print(stat_describe)
###########################
## OUTPUT
##
###########################
>>> #PRINT RESULTS
... print(iris) #ORIGINAL DATASET
...
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
[150 rows x 5 columns]
>>> print(stat_median) #YOU CAN SEE THAT IT INSERTED COLUMN INTO ROW LABELS, VALUE INTO RESULTS SERIES
sepal_length 5.8
sepal_width 3
petal_length 4.35
petal_width 1.3
STAT_NAME STAT_MEDIAN
dtype: object
>>> print(stat_describe) #BASIC DESCRIPTIVE STATS, NEED TO LABEL THE STATISTIC NAMES TO UNPIVOT THIS
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
>>>
Any assistance is greatly appreciated. Thank you!

I figured it out. There's a function called reset_index that will convert the index to a column, and create a new numerical index.
stat_median = pds.DataFrame(stat_median)
stat_median.reset_index(inplace=True)
stat_median = stat_median.rename(columns={'index' : 'fieldname', 0: 'value'})
stat_median['stat_name'] = 'median'
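
The reset_index approach above can probably be generalized so that every statistic ends up in the same fieldname/value/stat_name layout and can be unioned without hardcoding column names. The following is a minimal sketch under some assumptions: the small inline DataFrame is only a stand-in for the iris data, and the helper names series_to_long and describe_to_long are hypothetical, not part of pandas.

```python
import pandas as pd

# Small inline stand-in for the iris data (a few rows only, for illustration).
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 6.7, 6.3],
    "sepal_width": [3.5, 3.0, 3.2, 3.0, 2.5],
    "species": ["setosa", "setosa", "setosa", "virginica", "virginica"],
})

def series_to_long(stat, stat_name):
    """Turn a one-value-per-column Series into fieldname/value/stat_name rows."""
    # rename_axis names the index so reset_index produces a 'fieldname' column.
    out = stat.rename_axis("fieldname").reset_index(name="value")
    out["stat_name"] = stat_name
    return out

def describe_to_long(df):
    """Unpivot describe() so each statistic/column pair becomes one row."""
    desc = df.describe().rename_axis("stat_name").reset_index()
    long = desc.melt(id_vars="stat_name", var_name="fieldname", value_name="value")
    return long[["fieldname", "value", "stat_name"]]

stat_median = series_to_long(iris.median(numeric_only=True), "median")
stat_describe = describe_to_long(iris)

# Both frames share the same three columns, so they can be stacked directly.
stats = pd.concat([stat_median, stat_describe], ignore_index=True)
print(stats)
```

Because the column names come from the index and from melt, nothing in the helpers depends on the schema of the input frame.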

Related

How to combine data replicates for PCA visualization

I have a dataset where each sample/row is a unique protein and that protein is quantified across 7 features/columns. This dataset includes thousands of proteins and will be classified by machine learning (Support Vector Machine). To give an example of the data:
Protein     Feature 1  Feature 2  Feature 3  Feature 4  Feature 5  Feature 6  Feature 7
Protein 1   10.0       8.7        5.4        28.0       7.9        11.3       5.3
Protein 2   6.5        9.3        4.8        2.7        12.3       14.2       0.7
...         ...        ...        ...        ...        ...        ...        ...
Protein N   8.0        6.8        4.9        6.2        10.0       19.3       4.8
In addition to this dataset, I also have 2 more replicates that are structured exactly the same way and contain the same proteins, for a total of 3 replicates. Normally, if I wanted to visualize one of these datasets, I could transform my 7 features using PCA and plot the first two principal components with each point/protein colored by its classification. However, is there a way to take my 3 replicates and get some sort of "consensus" PCA plot for them?
I've seen two possible solutions for handling this:
1. Average each feature for each protein to get a single dataset with N rows and 7 columns, then PCA transform and plot.
2. Concatenate the 3 replicates into a single dataset such that each row now has 7x3 columns, then PCA transform and plot.
To clarify what's being said in solution 2, let's call Feature 1 from replicate 1 Feature 1.1, Feature 1 from replicate 2 Feature 1.2, etc.:
Protein     Feature 1.1  ...  Feature 7.1  Feature 1.2  ...  Feature 7.2  Feature 1.3  ...  Feature 7.3
Protein 1   10.0         ...  5.3          8.4          ...  5.9          9.7          ...  5.2
Protein 2   6.5          ...  0.7          6.8          ...  0.8          6.3          ...  0.7
...         ...          ...  ...          ...          ...  ...          ...          ...  ...
Protein N   8.0          ...  4.8          7.9          ...  4.9          8.1          ...  4.7
What I'm looking for is if there's an accepted solution for such a problem or if there's a solution that's more statistically sound. Thanks in advance!
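
To make the two candidate layouts concrete, here is a minimal sketch of both. It does not settle which one is statistically preferable. Everything in it is an assumption for illustration: the replicates are synthetic random data, the helper pca_2d is hypothetical, PCA is done with a plain NumPy SVD on mean-centered data, and no feature scaling is applied.

```python
import numpy as np

# Synthetic stand-in data: 50 proteins x 7 features, measured in 3 noisy replicates.
rng = np.random.default_rng(0)
n_proteins, n_features = 50, 7
base = rng.normal(size=(n_proteins, n_features))
replicates = [base + rng.normal(scale=0.1, size=base.shape) for _ in range(3)]

def pca_2d(X):
    """Project rows of X onto the first two principal components (via SVD)."""
    Xc = X - X.mean(axis=0)                       # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                          # scores on PC1 and PC2

# Option 1: average the replicates feature by feature -> N rows x 7 columns.
averaged = np.mean(replicates, axis=0)
pcs_avg = pca_2d(averaged)

# Option 2: put the replicates side by side -> N rows x 21 columns.
concatenated = np.hstack(replicates)
pcs_cat = pca_2d(concatenated)

print(averaged.shape, concatenated.shape)
print(pcs_avg.shape, pcs_cat.shape)
```

Either result has one point per protein, so the points can be colored by classification exactly as in the single-replicate plot.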

Find the range of all columns (difference between maximum and minimum) while gracefully handling string columns

I need to find the range of every column in a dataset that contains several columns of numeric values and one column of string values.
Please find sample records from my data set below:
import seaborn as sns
iris = sns.load_dataset('iris')
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
The maximum and minimum of these columns are given by
sepal_length 7.9
sepal_width 4.4
petal_length 6.9
petal_width 2.5
species virginica
dtype: object
and
sepal_length 4.3
sepal_width 2
petal_length 1
petal_width 0.1
species setosa
dtype: object
...respectively. To find the range of all the columns I can use the below code:
iris.max() - iris.min()
But as the column 'species' has string values, the above code is throwing the below error:
TypeError: unsupported operand type(s) for -: 'str' and 'str'
If the above error occurs, I want to print the value as the
"{max string value}" - "{min string value}"
IOW, my expected output would be something like:
sepal_length 3.6
sepal_width 2.4
petal_length 5.9
petal_width 2.4
species virginica - setosa
How do I resolve this issue?
Handle the numeric and string columns separately. You can select these using df.select_dtypes. Finally, concat the result.
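import numpy as np
import pandas as pd
# Numeric columns: range = max - min (equivalently u.apply(np.ptp, axis=0))
u = iris.select_dtypes(include=[np.number])
U = u.max() - u.min()
# Non-numeric (string) columns: join the max and min values with ' - '
v = iris.select_dtypes(exclude=[np.number])
V = v.max() + ' - ' + v.min()
# Series.append was removed in pandas 2.0; pd.concat does the same job
pd.concat([U, V])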
sepal_length 3.6
sepal_width 2.4
petal_length 5.9
petal_width 2.4
species virginica - setosa
dtype: object

Preprocessing csv files to use with tflearn

My question is about preprocessing csv files before inputting them into a neural network.
I want to build a deep neural network for the famous iris dataset using tflearn in python 3.
Dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
I'm using tflearn to load the csv file. However, the classes column of my data set has words such as iris-setosa, iris-versicolor, iris-virginica.
Neural networks work only with numbers. So, I have to find a way to change the classes from words to numbers. Since it is a very small dataset, I can do it manually using Excel/text editor. I manually assigned numbers for different classes.
But, I can't possibly do it for every dataset I work with. So, I tried using pandas to perform one hot encoding.
preprocess_data = pd.read_csv(r"F:\Gautam\.....\Dataset\iris_data.csv")
preprocess_data = pd.get_dummies(preprocess_data)
But now, I can't use this piece of code:
data, labels = load_csv('filepath', categorical_labels=True,
n_classes=3)
'filepath' has to be a path to the csv file; it can't be a variable holding a dataframe like preprocess_data.
Original Dataset:
Sepal Length Sepal Width Petal Length Petal Width Class
89 5.5 2.5 4.0 1.3 iris-versicolor
85 6.0 3.4 4.5 1.6 iris-versicolor
31 5.4 3.4 1.5 0.4 iris-setosa
52 6.9 3.1 4.9 1.5 iris-versicolor
111 6.4 2.7 5.3 1.9 iris-virginica
Manually modified dataset:
Sepal Length Sepal Width Petal Length Petal Width Class
89 5.5 2.5 4.0 1.3 1
85 6.0 3.4 4.5 1.6 1
31 5.4 3.4 1.5 0.4 0
52 6.9 3.1 4.9 1.5 1
111 6.4 2.7 5.3 1.9 2
Here's my code which runs perfectly, but, I have modified the dataset manually.
import numpy as np
import pandas as pd
import tflearn
from tflearn.layers.core import input_data, fully_connected
from tflearn.layers.estimator import regression
from tflearn.data_utils import load_csv
data_source = r'F:\Gautam\.....\Dataset\iris_data.csv'
data, labels = load_csv(data_source, categorical_labels=True,
n_classes=3)
network = input_data(shape=[None, 4], name='InputLayer')
network = fully_connected(network, 9, activation='sigmoid', name='Hidden_Layer_1')
network = fully_connected(network, 3, activation='softmax', name='Output_Layer')
network = regression(network, batch_size=1, optimizer='sgd', learning_rate=0.2)
model = tflearn.DNN(network)
model.fit(data, labels, show_metric=True, run_id='iris_dataset', validation_set=0.1, n_epoch=2000)
I want to know if there's any other built-in function in tflearn (or in any other module, for that matter) that I can use to modify the value of my classes from words to numbers. I don't think manually modifying the datasets would be productive.
I'm a beginner in tflearn and neural networks also. Any help would be appreciated. Thanks.
Use label encoder from sklearn library:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
df = pd.read_csv('iris_data.csv', header=None)
df.columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class']
enc = LabelEncoder()
df['Class'] = enc.fit_transform(df['Class'])
print(df.head(5))
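If you want one-hot encoding, label-encode first and then one-hot encode (recent scikit-learn versions of OneHotEncoder also accept string categories directly):
enc = LabelEncoder()
enc_1 = OneHotEncoder()
df['Class'] = enc.fit_transform(df['Class'])
# OneHotEncoder expects 2D input of shape (n_samples, 1) and returns one column per class
one_hot = enc_1.fit_transform(df[['Class']]).toarray()
print(one_hot[:5])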
These encoders first sort the words in alphabetical order then assign them labels. If you want to see which label is assigned to which class, do:
for k in enc.classes_:
    print('name ::{}, label ::{}'.format(k, enc.transform([k])))
If you want to save this dataframe as a csv file, do:
df.to_csv('Processed_Irisdataset.csv',sep=',')
The simplest solution is to map the column using a dict of all possible values:
df['Class'] = df['Class'].map({'iris-versicolor': 1, 'iris-setosa': 0, 'iris-virginica': 2})
print (df)
Sepal Length Sepal Width Petal Length Petal Width Class
0 89 5.5 2.5 4.0 1.3 1
1 85 6.0 3.4 4.5 1.6 1
2 31 5.4 3.4 1.5 0.4 0
3 52 6.9 3.1 4.9 1.5 1
4 111 6.4 2.7 5.3 1.9 2
If you want to generate the dictionary from all unique values:
d = {v:k for k, v in enumerate(df['Class'].unique())}
print (d)
{'iris-versicolor': 0, 'iris-virginica': 2, 'iris-setosa': 1}
df['Class'] = df['Class'].map(d)
print (df)
Sepal Length Sepal Width Petal Length Petal Width Class
0 89 5.5 2.5 4.0 1.3 0
1 85 6.0 3.4 4.5 1.6 0
2 31 5.4 3.4 1.5 0.4 1
3 52 6.9 3.1 4.9 1.5 0
4 111 6.4 2.7 5.3 1.9 2
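
As a pandas-only alternative, the encoding step can be sketched with pd.factorize. This is a sketch under assumptions, not a tflearn feature: the inline CSV text reuses the five sample rows from the question in place of the real file, and the variable names are illustrative. With sort=True the classes are numbered in alphabetical order, which reproduces the manually assigned labels.

```python
import io
import pandas as pd

# The five sample rows from the question, inlined in place of the real CSV file.
csv_text = """5.5,2.5,4.0,1.3,iris-versicolor
6.0,3.4,4.5,1.6,iris-versicolor
5.4,3.4,1.5,0.4,iris-setosa
6.9,3.1,4.9,1.5,iris-versicolor
6.4,2.7,5.3,1.9,iris-virginica
"""
cols = ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width", "Class"]
df = pd.read_csv(io.StringIO(csv_text), header=None, names=cols)

# sort=True numbers the classes alphabetically instead of by first appearance.
codes, classes = pd.factorize(df["Class"], sort=True)
df["Class"] = codes

# Keep the name -> number mapping so predictions can be decoded later.
mapping = {name: code for code, name in enumerate(classes)}
print(mapping)
print(df)
```

The encoded frame can then be written back out with df.to_csv and the resulting file path passed to load_csv as before.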

Is it possible to add "range" (ie.max-min) to the pandas describe function in python?

Is it possible to add "range" (i.e. max - min) to the pandas describe function in Python?
I would like to get output like this:
sepal_length sepal_width
count 150 150
mean 5.843333 3.054
std 0.828066 0.433594
min 4.3 2
25% 5.1 2.8
50% 5.8 3
75% 6.4 3.3
max 7.9 4.4
Range 3.6 2.4
I think the simplest approach is to add a row computed by subtracting the min row from the max row, and wrap it in a function:
def describe_new(df):
    df1 = df.describe()
    df1.loc["range"] = df1.loc['max'] - df1.loc['min']
    return df1
print(describe_new(df))

Find the average for user-defined window in pandas

I have a pandas dataframe that has raw heart rate data with the index of time (in seconds).
I am trying to bin the data so that I can have the average of a user define window (e.g. 10s) - not a rolling average, just an average of 10s, then the 10s following, etc.
import pandas as pd
hr_raw = pd.read_csv('hr_data.csv', index_col='time')
print(hr_raw)
heart_rate
time
0.6 164.0
1.0 182.0
1.3 164.0
1.6 150.0
2.0 152.0
2.4 141.0
2.9 163.0
3.2 141.0
3.7 124.0
4.2 116.0
4.7 126.0
5.1 116.0
5.7 107.0
Using the example data above, I would like to be able to set a user defined window size (let's use 2 seconds) and produce a new dataframe that has index of 2sec increments and averages the 'heart_rate' values if the time falls into that window (and should continue to the end of the dataframe).
For example:
heart_rate
time
2.0 162.40
4.0 142.25
6.0 116.25
I can only seem to find methods to bin the data based on a predetermined number of bins (e.g. making a histogram) and this only returns the count/frequency.
thanks.
A groupby should do it.
df.groupby((df.index // 2 + 1) * 2).mean()
heart_rate
time
2.0 165.00
4.0 144.20
6.0 116.25
Note that the slight difference between our results arises because the upper bound of each window is excluded: a reading taken at 2.0s is counted in the 4.0s interval. This is how it is usually done; a similar solution with TimeGrouper (pd.Grouper in current pandas) will yield the same result.
As coldspeed pointed out, a reading at 2s is counted in the 4s window. If you need it in the 2s bucket instead, you can use np.ceil:
In [1038]: df.groupby(np.ceil(df.index/2)*2).mean()
Out[1038]:
heart_rate
time
2.0 162.40
4.0 142.25
6.0 116.25
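
The two groupby variants above can be wrapped in one helper with a user-defined window size. This is a minimal sketch: the function name window_mean and its include_upper flag are hypothetical, and the data is the sample from the question typed inline instead of read from hr_data.csv.

```python
import numpy as np
import pandas as pd

# Sample data from the question, inlined instead of reading hr_data.csv.
times = [0.6, 1.0, 1.3, 1.6, 2.0, 2.4, 2.9, 3.2, 3.7, 4.2, 4.7, 5.1, 5.7]
rates = [164.0, 182.0, 164.0, 150.0, 152.0, 141.0, 163.0,
         141.0, 124.0, 116.0, 126.0, 116.0, 107.0]
hr_raw = pd.DataFrame({"heart_rate": rates}, index=pd.Index(times, name="time"))

def window_mean(df, window, include_upper=True):
    """Average rows in consecutive windows, labeled by each window's upper edge."""
    idx = df.index.to_numpy()
    if include_upper:
        # Windows of the form (lo, hi]: a reading at 2.0 belongs to the 2.0 window.
        labels = np.ceil(idx / window) * window
    else:
        # Windows of the form [lo, hi): a reading at 2.0 belongs to the 4.0 window.
        labels = (np.floor(idx / window) + 1) * window
    return df.groupby(pd.Index(labels, name=df.index.name)).mean()

closed_upper = window_mean(hr_raw, 2)
open_upper = window_mean(hr_raw, 2, include_upper=False)
print(closed_upper)
print(open_upper)
```

With a 2-second window, include_upper=True matches the averages requested in the question, while include_upper=False matches the first groupby answer.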
