Arrange Pandas DataFrame to be used for scipy interpn - python

I have to perform an interpolation in 3 or 4 dimensions, starting from tabular data stored as a Pandas DataFrame. I have the following data stored in the variable df:
xm xA xl z
2.3 4.6 10.0 1.905
2.3 4.6 11.0 1.907
2.3 4.8 10.0 1.908
2.3 4.8 11.0 1.909
2.4 4.6 10.0 1.811
2.4 4.6 11.0 1.812
2.4 4.8 10.0 1.813
2.4 4.8 11.0 1.814
xm, xA, xl are the axes from which the grid should be drawn. The column z contains the values from which the interpolation is to be performed. Indeed, the regular grid I came up with is calculated as:
grid = np.meshgrid(*(df.xm,df.xA,df.xl))
Now my problem is how to turn the z column of the DataFrame into an np.array that can be passed to the SciPy function:
from scipy import interpolate
p0 = (xm0,xA0,xl0)
z0 = interpolate.interpn(grid, myarray, p0)

Thanks to SCKU for the hint on the z-column reshape. I was using
grid = np.meshgrid(*(df.xm, df.xA, df.xl))
following the example from the SciPy docs.
It was actually enough to pass the tuple of base axis arrays (the unique values of each axis, not the full columns) and reshape the z column to the grid shape:
xm = df.xm.unique()
xA = df.xA.unique()
xl = df.xl.unique()
z = df.z.values.reshape(len(xm), len(xA), len(xl))  # rows must be sorted by xm, xA, xl
p0 = (xm0, xA0, xl0)
val = interpolate.interpn((xm, xA, xl), z, p0)
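For reference, here is a minimal end-to-end sketch against the sample data above; it assumes the rows are sorted by xm, then xA, then xl (as in the question), and the query point p0 is hypothetical:
import numpy as np
import pandas as pd
from scipy import interpolate
# sample data from the question
df = pd.DataFrame({
    'xm': [2.3, 2.3, 2.3, 2.3, 2.4, 2.4, 2.4, 2.4],
    'xA': [4.6, 4.6, 4.8, 4.8, 4.6, 4.6, 4.8, 4.8],
    'xl': [10.0, 11.0, 10.0, 11.0, 10.0, 11.0, 10.0, 11.0],
    'z':  [1.905, 1.907, 1.908, 1.909, 1.811, 1.812, 1.813, 1.814],
})
# one 1-D array of unique values per axis
axes = tuple(df[c].unique() for c in ('xm', 'xA', 'xl'))
# reshape z onto the (2, 2, 2) grid implied by the axes
z = df['z'].values.reshape([len(a) for a in axes])
p0 = (2.35, 4.7, 10.5)  # hypothetical query point
val = interpolate.interpn(axes, z, p0)
print(val)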

Adding column names and values to statistic output in Python?

Background:
I'm currently developing some data profiling in SQL Server. This consists of calculating aggregate statistics on the values in targeted columns.
I'm using SQL for most of the heavy lifting, but calling Python for some of the statistics that SQL is poor at calculating. I'm leveraging the Pandas package through SQL Server Machine Learning Services.
However, I'm currently developing this script in Visual Studio; the SQL portion is irrelevant other than as background.
Problem:
My issue is that when I call one of the Python statistics functions, it produces the output as a Series whose labels are seemingly not part of the data; I cannot access the labels at all. I need the values of these labels, and I also need to normalize the data and insert a column of static values describing which calculation was performed on that row.
Constraints:
I will need to normalize each statistic so I can union the datasets and pass the values back to SQL for further processing. All output needs to accept dynamic schemas, so no hardcoding labels etc.
Attempted solutions:
I've tried explicitly coercing the output to a DataFrame. This just results in a series with the label "0".
I've also tried adding static values to the columns. This just adds the target column name as one of the inaccessible labels, and the intended static value as part of the series.
I've searched many times for a solution, and couldn't find anything relevant to the problem.
Code and results below. Using the iris dataset as an example.
###########################
## AGG STATS TEST SCRIPT
##
###########################
#LOAD MODULES
import pandas as pds
#GET SAMPLE DATASET
iris = pds.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
#CENTRAL TENDENCY
mode1 = iris.mode()
stat_mode = pds.melt(mode1)
stat_median = iris.median()
stat_median['STAT_NAME'] = 'STAT_MEDIAN' #Try to add a column with the value 'STAT_MEDIAN'
#AGGREGATE STATS
stat_describe = iris.describe()
#PRINT RESULTS
print(iris)
print(stat_median)
print(stat_describe)
###########################
## OUTPUT
##
###########################
>>> #PRINT RESULTS
... print(iris) #ORIGINAL DATASET
...
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
[150 rows x 5 columns]
>>> print(stat_median) #YOU CAN SEE THAT IT INSERTED COLUMN INTO ROW LABELS, VALUE INTO RESULTS SERIES
sepal_length 5.8
sepal_width 3
petal_length 4.35
petal_width 1.3
STAT_NAME STAT_MEDIAN
dtype: object
>>> print(stat_describe) #BASIC DESCRIPTIVE STATS, NEED TO LABEL THE STATISTIC NAMES TO UNPIVOT THIS
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
>>>
Any assistance is greatly appreciated. Thank you!
I figured it out. There's a function called reset_index that will convert the index to a column, and create a new numerical index.
stat_median = pds.DataFrame(stat_median)
stat_median.reset_index(inplace=True)
stat_median = stat_median.rename(columns={'index' : 'fieldname', 0: 'value'})
stat_median['stat_name'] = 'median'
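The same approach generalizes to describe(): reset_index exposes the statistic names as a column, and melt unpivots the value columns, giving one (stat_name, fieldname, value) row per cell. A sketch using the frames defined above:
stat_describe = iris.describe()
long_stats = stat_describe.reset_index().melt(
    id_vars='index',          # 'index' now holds count/mean/std/...
    var_name='fieldname',
    value_name='value'
).rename(columns={'index': 'stat_name'})
print(long_stats.head())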

rounding off columns to two decimal places [duplicate]

This question already has answers here:
Round each number in a Python pandas data frame by 2 decimals
How do I round off column two (beta) and column three (gama) to two decimal places for this dataframe?
df
alpha beta gama theta
4.615556 4.637778 4.415556 4.5
3.612727 3.616364 3.556364 5.5
2.608000 2.661333 2.680000 7.5
2.512500 2.550000 2.566250 8.0
You can use the .round method.
Here is an example program:
# Import pandas
import pandas as pd
# Example data from question
alpha = [4.615556, 3.612727, 2.608000, 2.512500]
beta = [4.637778, 3.616364, 2.661333, 2.550000]
gamma = [4.415556, 3.556364, 2.680000, 2.566250]
theta = [4.5, 5.5, 7.5, 8.0]
# Build dataframe
df = pd.DataFrame({'alpha':alpha, 'beta':beta, 'gamma':gamma, 'theta':theta})
# Print it out
print(df)
# Use round function
df[['beta', 'gamma']] = df[['beta', 'gamma']].round(2)
# Show results
print(df)
Yields:
alpha beta gamma theta
0 4.615556 4.637778 4.415556 4.5
1 3.612727 3.616364 3.556364 5.5
2 2.608000 2.661333 2.680000 7.5
3 2.512500 2.550000 2.566250 8.0
alpha beta gamma theta
0 4.615556 4.64 4.42 4.5
1 3.612727 3.62 3.56 5.5
2 2.608000 2.66 2.68 7.5
3 2.512500 2.55 2.57 8.0
df[['beta','gama']] = df[['beta','gama']].round(2)
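Alternatively, DataFrame.round accepts a dict mapping column names to the number of decimals, leaving the remaining columns untouched:
df = df.round({'beta': 2, 'gama': 2})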

One hot encoding error python machine learning

I am working with categorical variables in machine learning. Here is a sample of my data:
age,gender,height,class,label
25,m,43,A,0
35,f,45,B,1
12,m,36,C,0
14,f,42,A,0
There are two categorical variables, gender and class. I have used the LabelEncoder technique.
My code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
df=pd.read_csv('test.csv')
X=df.drop(['label'],1)
y=np.array(df['label'])
data=X.iloc[:,:].values
lben = LabelEncoder()
data[:,1] = lben.fit_transform(data[:,1])
data[:,3] = lben.fit_transform(data[:,3])
onehotencoder = OneHotEncoder(categorical_features=[1])
data = onehotencoder.fit_transform(data).toarray()
onehotencoder = OneHotEncoder(categorical_features=[3])
data = onehotencoder.fit_transform(data).toarray()
print(data.shape)
np.savetxt('data.csv',data,fmt='%s')
The data.csv looks like this:
0.0 0.0 1.0 0.0 0.0 1.0 25.0 0.0
0.0 0.0 0.0 1.0 1.0 0.0 35.0 1.0
1.0 0.0 0.0 0.0 0.0 1.0 12.0 2.0
0.0 1.0 0.0 0.0 1.0 0.0 14.0 0.0
I am unable to understand why the output looks like this, i.e. where is the value of the 'height' column? Also, data.shape is (4, 8) instead of the expected (4, 7), i.e. gender represented by 2 columns, class by 3, plus the 'age' and 'height' features.
Are you sure that you need LabelEncoder + OneHotEncoder? There is a much simpler method (it does not allow advanced procedures, but so far you seem to be working on the basics):
import pandas as pd
import numpy as np
df=pd.read_csv('test.csv')
X=df.drop(['label'],1)
y=np.array(df['label'])
data = pd.get_dummies(X)
The problem with the current code is that after you have done the first OHE:
onehotencoder = OneHotEncoder(categorical_features=[1])
data = onehotencoder.fit_transform(data).toarray()
the columns get shifted, and column 3 is in fact the original height column rather than the label-encoded class column. So change the second encoder to use column 4 and you will get what you want.
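For what it's worth, the categorical_features argument was deprecated and later removed from scikit-learn. A minimal sketch of the modern equivalent uses ColumnTransformer; the column names are assumed to match the question's test.csv, and the sparse_output argument is named sparse in scikit-learn versions before 1.2:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
df = pd.read_csv('test.csv')
X = df.drop(['label'], axis=1)
y = df['label'].values
# one-hot encode gender and class; pass age and height through unchanged
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(sparse_output=False), ['gender', 'class'])],
    remainder='passthrough')
data = ct.fit_transform(X)
print(data.shape)  # (4, 7): 2 gender columns + 3 class columns + age + height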

How can I manage units in pandas data?

I'm trying to figure out if there is a good way to manage units in my pandas data. For example, I have a DataFrame that looks like this:
length (m) width (m) thickness (cm)
0 1.2 3.4 5.6
1 7.8 9.0 1.2
2 3.4 5.6 7.8
Currently, the measurement units are encoded in column names. Downsides include:
column selection is awkward -- df['width (m)'] vs. df['width']
things will likely break if the units of my source data change
If I wanted to strip the units out of the column names, is there somewhere else that the information could be stored?
There isn't any great way to do this right now; see the GitHub issue here for some discussion.
As a quick hack, you could do something like this, maintaining a separate dict with the units.
In [3]: units = {}
In [5]: newcols = []
   ...: for col in df:
   ...:     name, unit = col.split(' ')
   ...:     units[name] = unit
   ...:     newcols.append(name)
In [6]: df.columns = newcols
In [7]: df
Out[7]:
length width thickness
0 1.2 3.4 5.6
1 7.8 9.0 1.2
2 3.4 5.6 7.8
In [8]: units['length']
Out[8]: '(m)'
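Going the other way, e.g. before handing the data back to its source, the labeled columns can be rebuilt from the same dict (a sketch using the names defined above):
df.columns = ['{} {}'.format(name, units[name]) for name in df.columns]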
I was searching for this too. Here is what pint and the (experimental) pint_pandas are capable of today:
import pandas as pd
import pint
import pint_pandas
ureg = pint.UnitRegistry()
ureg.Unit.default_format = "~P"
pint_pandas.PintType.ureg.default_format = "~P"
df = pd.DataFrame({
    "length": pd.Series([1.2, 7.8, 3.4], dtype="pint[m]"),
    "width": pd.Series([3.4, 9.0, 5.6], dtype="pint[m]"),
    "thickness": pd.Series([5.6, 1.2, 7.8], dtype="pint[cm]"),
})
print(df.pint.dequantify())
length width thickness
unit m m cm
0 1.2 3.4 5.6
1 7.8 9.0 1.2
2 3.4 5.6 7.8
df['width'] = df['width'].pint.to("inch")
print(df.pint.dequantify())
length width thickness
unit m in cm
0 1.2 133.858268 5.6
1 7.8 354.330709 1.2
2 3.4 220.472441 7.8
Here are some other options:
pandas-units-extension: janpipek/pandas-units-extension: Units extension array for pandas based on astropy
pint-pandas: hgrecco/pint-pandas: Pandas support for pint
You can also extend pandas yourself, following the Extending pandas documentation.

Adding calculated columns to the Dataframe in pandas

There is a large csv file imported. Below is the output, where Flavor_Score and Overall_Score are the results of applying df.groupby('beer_name').mean() across a multitude of testers. I would like to add a Std Deviation column for each of Flavor_Score and Overall_Score, to the right of the mean column. The function is clear, but how do I add such a column for display? Of course, I could generate an array and append it (right?), but that seems a cumbersome way.
Beer_name Beer_Style Flavor_Score Overall_Score
Coors Light 2.0 3.0
Sam Adams Dark 4.0 4.5
Becks Light 3.5 3.5
Guinness Dark 2.0 2.2
Heineken Light 3.5 3.7
You could use
df.groupby('Beer_name').agg(['mean','std'])
This computes the mean and the std for each group.
For example,
import numpy as np
import pandas as pd
np.random.seed(2015)
N = 100
beers = ['Coors', 'Sam Adams', 'Becks', 'Guinness', 'Heineken']
style = ['Light', 'Dark', 'Light', 'Dark', 'Light']
df = pd.DataFrame({'Beer_name': np.random.choice(beers, N),
                   'Flavor_Score': np.random.uniform(0, 10, N),
                   'Overall_Score': np.random.uniform(0, 10, N)})
df['Beer_Style'] = df['Beer_name'].map(dict(zip(beers, style)))
print(df.groupby('Beer_name').agg(['mean','std']))
yields
Flavor_Score Overall_Score
mean std mean std
Beer_name
Becks 5.779266 3.033939 6.995177 2.697787
Coors 6.521966 2.008911 4.066374 3.070217
Guinness 4.836690 2.644291 5.577085 2.466997
Heineken 4.622213 3.108812 6.372361 2.904932
Sam Adams 5.443279 3.311825 4.697961 3.164757
groupby.agg([fun1, fun2]) computes any number of functions in one step:
from random import choice, random
import pandas as pd
import numpy as np
beers = ['Coors', 'Sam Adams', 'Becks', 'Guinness', 'Heineken']
styles = ['Light', 'Dark']
def generate():
    for i in range(100):
        yield dict(beer=choice(beers), style=choice(styles),
                   flavor_score=random()*10.0,
                   overall_score=random()*10.0)
pd.options.display.float_format = ' {:,.1f} '.format
df = pd.DataFrame(generate())
print(df.groupby(['beer', 'style']).agg([np.mean, np.std]))
=>
flavor_score overall_score
mean std mean std
beer style
Becks Dark 7.1 3.6 1.9 1.6
Light 4.7 2.4 2.0 1.0
Coors Dark 5.5 3.2 2.6 1.1
Light 5.3 2.5 1.9 1.1
Guinness Dark 3.3 1.4 2.1 1.1
Light 4.7 3.6 2.2 1.1
Heineken Dark 4.4 3.0 2.7 1.0
Light 6.0 2.3 2.1 1.3
Sam Adams Dark 3.4 3.0 1.7 1.2
Light 5.2 3.6 1.6 1.3
What if I need to apply a user-defined function to just the flavor_score column? Let's say I want to subtract 0.5 from the flavor_score mean for all rows, except for Heineken, where I want to add 0.25 instead.
grouped[grouped.beer != 'Heineken']['flavor_score']['mean'] - 0.5
grouped[grouped.beer == 'Heineken']['flavor_score']['mean'] + 0.25
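The two selections above only compute the adjusted values without assigning them back, and after the groupby the beer names live in the index rather than in a beer column. A working sketch, assuming grouped = df.groupby(['beer', 'style']).agg([np.mean, np.std]) from the answer above:
import numpy as np
# +0.25 for Heineken rows, -0.5 for everyone else
adjust = np.where(grouped.index.get_level_values('beer') == 'Heineken', 0.25, -0.5)
grouped[('flavor_score', 'mean')] = grouped[('flavor_score', 'mean')] + adjust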
