Using column name as a new attribute in pandas - python

I have the following data structure:

Date        Agric  Food
01/01/1990  1.3    0.9
01/02/1990  1.2    0.9

I would like to convert it into the format

Date        Sector  Beta
01/01/1990  Agric   1.3
01/02/1990  Agric   1.2
01/01/1990  Food    0.9
01/02/1990  Food    0.9

While I am sure I can do this in a complicated way, is there a way of doing it in a few lines of code?

Using pd.DataFrame.melt
df.melt('Date', var_name='Sector', value_name='Beta')
Date Sector Beta
0 01/01/1990 Agric 1.3
1 01/02/1990 Agric 1.2
2 01/01/1990 Food 0.9
3 01/02/1990 Food 0.9
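For completeness, a minimal runnable sketch; the frame construction below is only an assumption that reproduces the question's sample data:

import pandas as pd

# Rebuild the sample frame from the question (values copied from the post)
df = pd.DataFrame({
    'Date':  ['01/01/1990', '01/02/1990'],
    'Agric': [1.3, 1.2],
    'Food':  [0.9, 0.9],
})

# 'Date' stays as the identifier; every other column becomes a (Sector, Beta) row
long_df = df.melt('Date', var_name='Sector', value_name='Beta')
print(long_df)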

Use set_index and stack:
df.set_index('Date').rename_axis('Sector', axis=1).stack()\
  .reset_index(name='Beta')
Output:
Date Sector Beta
0 01/01/1990 Agric 1.3
1 01/01/1990 Food 0.9
2 01/02/1990 Agric 1.2
3 01/02/1990 Food 0.9
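If you want the same Sector-then-Date ordering as the melt output, one option (just a sketch) is to sort the result afterwards:

out = (df.set_index('Date')
         .rename_axis('Sector', axis=1)
         .stack()
         .reset_index(name='Beta')
         .sort_values(['Sector', 'Date'])   # match the melt ordering
         .reset_index(drop=True))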

Or you can use lreshape (here df2 is the original wide frame from the question):
df = pd.lreshape(df2, {'Date': ['Date', 'Date'], 'Beta': ['Agric', 'Food']})
df['Sector'] = sorted(df2.columns.tolist()[1:3] * 2)
Out[654]:
Date Beta Sector
0 01/01/1990 1.3 Agric
1 01/02/1990 1.2 Agric
2 01/01/1990 0.9 Food
3 01/02/1990 0.9 Food
In case you have more columns (say 48), build the lists from the column names instead of typing them out; with the two sample columns this looks like
df = pd.lreshape(df2, {'Date': ['Date'] * 2, 'Beta': df2.columns.tolist()[1:3]})
df['Sector'] = sorted(df2.columns.tolist()[1:3] * 2)
Also, for the Sector column, it is safer to create it with
import itertools
list(itertools.chain.from_iterable(itertools.repeat(x, 2) for x in df2.columns.tolist()[1:3]))
EDIT: Because lreshape is undocumented. As Ted Petrou put it: "It's best to use available DataFrame methods if possible and then, if none are available, use documented functions. pandas is constantly looking to improve its API, and calling undocumented, old and experimental functions like lreshape for anything is unwarranted. Furthermore, this problem is a very straightforward use case for melt or stack. It is a bad precedent for those new to pandas to come to Stack Overflow and find upvoted answers with lreshape."
Also, if you want to know more about this, you can check it on GitHub.
Below is a method using pd.wide_to_long:
dict1 = {'Agric': 'A_Agric', 'Food': 'A_Food'}
df2 = df.rename(columns=dict1)
pd.wide_to_long(df2.reset_index(), ['A'], i='Date', j='Sector', sep='_', suffix='.+')\
  .reset_index().drop('index', axis=1).rename(columns={'A': 'Beta'})
Out[2149]:
Date Sector Beta
0 01/01/1990 Agric 1.3
1 01/02/1990 Agric 1.2
2 01/01/1990 Food 0.9
3 01/02/1990 Food 0.9

Related

Pandas data frame multiplication where the data frames are of different matrix

df1 is from an Excel file with the columns below:

Currency       Net Original  Net USD  COGS
USD            1.5           1.2       2.1
USD            1.3           2.1       1.2
USD            1.1           2.3      -1.1
Peso Mexicano  1.6           2.2       2.1
Step 1: Need to derive conversion rate column 'Conv' where 'Currency' is 'Peso Mexicano'
#Filter "Peso Mexicano" currency & take it as a separate data frame (df2)
df2 = df1[df1['Currency']== "Peso Mexicano"]
Step 2:
#Next use formula to get the "Conversion Rate" from df2 using formula
df2['Conv']= (df2['Net USD']/df2['Net Original'])
#Output 1.37
#Multiply the filtered result 'Conv' with 'COGS' column to get the desired result
df1['Inv'] = (df2['Conv']*df1['COGS'])*-1
display(df1)
However, the result shows NaN in the 'Inv' column wherever the currency is 'USD'.
Expected output:
Currency       Net Original  Net USD  COGS   Inv
USD            1.5           1.2       2.1    1.87
USD            1.3           2.1       1.2    0.64
USD            1.1           2.3      -1.1   -2.50
Peso Mexicano  1.6           2.2       2.1    1.87
You need to aggregate your conv computation into a scalar, even if there is only one value (I took the mean here).
Here is a working code:
df2 = df1[df1['Currency'] == "Peso Mexicano"]
conv = (df2['Net USD'] / df2['Net Original']).mean()
df1['Inv'] = conv * df1['COGS'] - 1
output:
Currency Net Original Net USD COGS Inv
0 USD 1.5 1.2 2.1 1.8875
1 USD 1.3 2.1 1.2 0.6500
2 USD 1.1 2.3 -1.1 -2.5125
3 Peso Mexicano 1.6 2.2 2.1 1.8875
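For reference, a self-contained sketch of this answer using the sample values from the question (the literal numbers below are assumptions copied from the post):

import pandas as pd

df1 = pd.DataFrame({
    'Currency':     ['USD', 'USD', 'USD', 'Peso Mexicano'],
    'Net Original': [1.5, 1.3, 1.1, 1.6],
    'Net USD':      [1.2, 2.1, 2.3, 2.2],
    'COGS':         [2.1, 1.2, -1.1, 2.1],
})

# Derive the conversion rate from the 'Peso Mexicano' rows only,
# collapse it to a scalar with .mean(), then apply it to every row
peso = df1[df1['Currency'] == 'Peso Mexicano']
conv = (peso['Net USD'] / peso['Net Original']).mean()   # 1.375
df1['Inv'] = conv * df1['COGS'] - 1
print(df1)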

Adding column names and values to statistic output in Python?

Background:
I'm currently developing some data profiling in SQL Server. This consists of calculating aggregate statistics on the values in targeted columns.
I'm using SQL for most of the heavy lifting, but calling Python for some of the statistics that SQL is poor at calculating. I'm leveraging the Pandas package through SQL Server Machine Language Services.
However, I'm currently developing this script in Visual Studio. The SQL portion is irrelevant other than as background.
Problem:
My issue is that when I call one of the Python statistics functions, it produces the output as a series with the labels seemingly not part of the data. I cannot access the labels at all. I need the values of these labels, and I need to normalize the data and insert a column with static values describing which calculation was performed on that row.
Constraints:
I will need to normalize each statistic so I can union the datasets and pass the values back to SQL for further processing. All output needs to accept dynamic schemas, so no hardcoding labels etc.
Attempted solutions:
I've tried explicitly coercing output to dataframes. This just results in a series with label "0".
I've also tried adding static values to the columns. This just adds the target column name as one of the inaccessible labels, and the intended static value as part of the series.
I've searched many times for a solution, and couldn't find anything relevant to the problem.
Code and results below. Using the iris dataset as an example.
###########################
## AGG STATS TEST SCRIPT
##
###########################
#LOAD MODULES
import pandas as pds
#GET SAMPLE DATASET
iris = pds.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
#CENTRAL TENDENCY
mode1 = iris.mode()
stat_mode = pds.melt(mode1)
stat_median = iris.median()
stat_median['STAT_NAME'] = 'STAT_MEDIAN' #Try to add a column with the value 'STAT_MEDIAN'
#AGGREGATE STATS
stat_describe = iris.describe()
#PRINT RESULTS
print(iris)
print(stat_median)
print(stat_describe)
###########################
## OUTPUT
##
###########################
>>> #PRINT RESULTS
... print(iris) #ORIGINAL DATASET
...
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
[150 rows x 5 columns]
>>> print(stat_median) #YOU CAN SEE THAT IT INSERTED COLUMN INTO ROW LABELS, VALUE INTO RESULTS SERIES
sepal_length 5.8
sepal_width 3
petal_length 4.35
petal_width 1.3
STAT_NAME STAT_MEDIAN
dtype: object
>>> print(stat_describe) #BASIC DESCRIPTIVE STATS, NEED TO LABEL THE STATISTIC NAMES TO UNPIVOT THIS
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
>>>
Any assistance is greatly appreciated. Thank you!
I figured it out. There's a function called reset_index that will convert the index to a column, and create a new numerical index.
stat_median = pds.DataFrame(stat_median)
stat_median.reset_index(inplace=True)
stat_median = stat_median.rename(columns={'index' : 'fieldname', 0: 'value'})
stat_median['stat_name'] = 'median'
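The same idea extends to the describe() table: stacking it labels both the field name and the statistic name, which covers the unpivoting mentioned in the output comments above. A sketch, assuming the iris frame from the question:

stat_describe = iris.describe()              # statistics as the index, fields as columns
long_stats = stat_describe.stack().reset_index()   # one row per (statistic, field) pair
long_stats.columns = ['stat_name', 'fieldname', 'value']
print(long_stats.head())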

Adding calculated columns to the Dataframe in pandas

There is a large CSV file imported. Below is the output, where Flavor_Score and Overall_Score are the results of applying df.groupby('Beer_name').mean() across a multitude of testers. I would like to add a standard-deviation column for each of Flavor_Score and Overall_Score, to the right of the mean column. The function itself is clear, but how do I add such a column for display? Of course, I could generate an array and append it, but that seems like a cumbersome way to do it.
Beer_name Beer_Style Flavor_Score Overall_Score
Coors Light 2.0 3.0
Sam Adams Dark 4.0 4.5
Becks Light 3.5 3.5
Guinness Dark 2.0 2.2
Heineken Light 3.5 3.7
You could use
df.groupby('Beer_name').agg(['mean','std'])
This computes the mean and the std for each group.
For example,
import numpy as np
import pandas as pd
np.random.seed(2015)
N = 100
beers = ['Coors', 'Sam Adams', 'Becks', 'Guinness', 'Heineken']
style = ['Light', 'Dark', 'Light', 'Dark', 'Light']
df = pd.DataFrame({'Beer_name': np.random.choice(beers, N),
                   'Flavor_Score': np.random.uniform(0, 10, N),
                   'Overall_Score': np.random.uniform(0, 10, N)})
df['Beer_Style'] = df['Beer_name'].map(dict(zip(beers, style)))
print(df.groupby('Beer_name').agg(['mean','std']))
yields
Flavor_Score Overall_Score
mean std mean std
Beer_name
Becks 5.779266 3.033939 6.995177 2.697787
Coors 6.521966 2.008911 4.066374 3.070217
Guinness 4.836690 2.644291 5.577085 2.466997
Heineken 4.622213 3.108812 6.372361 2.904932
Sam Adams 5.443279 3.311825 4.697961 3.164757
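If you would rather have std as a plain column sitting next to the mean (instead of a MultiIndex column level), one option, sketched here, is to flatten the column index after agg:

agg = df.groupby('Beer_name')[['Flavor_Score', 'Overall_Score']].agg(['mean', 'std'])

# Flatten the (column, statistic) MultiIndex into names like 'Flavor_Score_mean'
agg.columns = ['_'.join(col) for col in agg.columns]
print(agg.reset_index())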
groupby.agg([fun1, fun2]) computes any number of functions in one step:
from random import choice, random
import pandas as pd
import numpy as np

beers = ['Coors', 'Sam Adams', 'Becks', 'Guinness', 'Heineken']
styles = ['Light', 'Dark']

def generate():
    for i in range(100):
        yield dict(beer=choice(beers), style=choice(styles),
                   flavor_score=random() * 10.0,
                   overall_score=random() * 10.0)

pd.options.display.float_format = ' {:,.1f} '.format
df = pd.DataFrame(generate())
print(df.groupby(['beer', 'style']).agg([np.mean, np.std]))
=>
flavor_score overall_score
mean std mean std
beer style
Becks Dark 7.1 3.6 1.9 1.6
Light 4.7 2.4 2.0 1.0
Coors Dark 5.5 3.2 2.6 1.1
Light 5.3 2.5 1.9 1.1
Guinness Dark 3.3 1.4 2.1 1.1
Light 4.7 3.6 2.2 1.1
Heineken Dark 4.4 3.0 2.7 1.0
Light 6.0 2.3 2.1 1.3
Sam Adams Dark 3.4 3.0 1.7 1.2
Light 5.2 3.6 1.6 1.3
What if I need to apply a user-defined function to just the flavor_score column? Let's say I want to subtract 0.5 from flavor_score for all rows, except for Heineken, for which I want to add 0.25 (a sketch follows after the snippet below).
grouped[grouped.beer != 'Heineken']['flavor_score']['mean'] - 0.5
grouped[grouped.beer == 'Heineken']['flavor_score']['mean'] + 0.25
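One way to apply that kind of conditional tweak, sketched against the df built in the answer above (lowercase beer / flavor_score column names as generated there; the +0.25 and -0.5 offsets come from the comment):

import numpy as np

grouped = df.groupby('beer')[['flavor_score', 'overall_score']].agg(['mean', 'std'])

# Per-group offset: +0.25 for Heineken, -0.5 for every other beer
offset = np.where(grouped.index == 'Heineken', 0.25, -0.5)
grouped[('flavor_score', 'mean')] = grouped[('flavor_score', 'mean')] + offset
print(grouped)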

Wrong decimal calculations with pandas

I have a data frame (df) in pandas with four columns, and I want a new column to represent the mean of these four columns: df['mean'] = df.mean(1)
1 2 3 4 mean
NaN NaN NaN NaN NaN
5.9 5.4 2.4 3.2 4.225
0.6 0.7 0.7 0.7 0.675
2.5 1.6 1.5 1.2 1.700
0.4 0.4 0.4 0.4 0.400
So far so good. But when I save the results to a csv file this is what I found:
5.9,5.4,2.4,3.2,4.2250000000000005
0.6,0.7,0.7,0.7,0.6749999999999999
2.5,1.6,1.5,1.2,1.7
0.4,0.4,0.4,0.4,0.4
I guess I can force the format in the mean column, but any idea why this is happening?
I am using winpython with python 3.3.2 and pandas 0.11.0
You could use the float_format parameter:
import pandas as pd
import io
content = '''\
1 2 3 4 mean
NaN NaN NaN NaN NaN
5.9 5.4 2.4 3.2 4.225
0.6 0.7 0.7 0.7 0.675
2.5 1.6 1.5 1.2 1.700
0.4 0.4 0.4 0.4 0.400'''
df = pd.read_table(io.StringIO(content), sep=r'\s+')
df.to_csv('/tmp/test.csv', float_format='%g', index=False)
yields
1,2,3,4,mean
,,,,
5.9,5.4,2.4,3.2,4.225
0.6,0.7,0.7,0.7,0.675
2.5,1.6,1.5,1.2,1.7
0.4,0.4,0.4,0.4,0.4
The answers seem correct. Floating point numbers cannot be perfectly represented on our systems. There are bound to be some differences. Read The Floating Point Guide.
>>> a = 5.9+5.4+2.4+3.2
>>> a / 4
4.2250000000000005
As you said, you could always format the results if you want to get only a fixed number of points after the decimal.
>>> "{:.3f}".format(a/4)
'4.225'
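If you would rather fix the stored values than just the CSV formatting, rounding the column before writing is another option (a sketch; the 3-decimal choice here is arbitrary):

# Round the computed column itself instead of (or in addition to) formatting on write
df['mean'] = df['mean'].round(3)
df.to_csv('test.csv', index=False)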

How to group data by time

I am trying to find a way to group data daily.
This is an example of my data set.
Dates Price1 Price 2
2002-10-15 11:17:03pm 0.6 5.0
2002-10-15 11:20:04pm 1.4 2.4
2002-10-15 11:22:12pm 4.1 9.1
2002-10-16 12:21:03pm 1.6 1.4
2002-10-16 12:22:03pm 7.7 3.7
Yeah, I would definitely use pandas for this. The trickiest part is just figuring out the datetime parser for pandas to use to load in the data. After that, it's just a resampling of the resulting DataFrame.
In [62]: parse = lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %I:%M:%S%p')
In [63]: dframe = pandas.read_table("data.txt", delimiter=",", index_col=0, parse_dates=True, date_parser=parse)
In [64]: print dframe
Price1 Price 2
Dates
2002-10-15 23:17:03 0.6 5.0
2002-10-15 23:20:04 1.4 2.4
2002-10-15 23:22:12 4.1 9.1
2002-10-16 12:21:03 1.6 1.4
2002-10-16 12:22:03 7.7 3.7
In [78]: means = dframe.resample("D", how='mean', label='left')
In [79]: print means
Price1 Price 2
Dates
2002-10-15 2.033333 5.50
2002-10-16 4.650000 2.55
where data.txt:
Dates , Price1 , Price 2
2002-10-15 11:17:03pm, 0.6 , 5.0
2002-10-15 11:20:04pm, 1.4 , 2.4
2002-10-15 11:22:12pm, 4.1 , 9.1
2002-10-16 12:21:03pm, 1.6 , 1.4
2002-10-16 12:22:03pm, 7.7 , 3.7
From pandas documentation: http://pandas.pydata.org/pandas-docs/stable/pandas.pdf
# 72 hours starting with midnight Jan 1st, 2011
In [1073]: rng = date_range('1/1/2011', periods=72, freq='H')
Use
data.groupby(data['Dates'].map(lambda x: x.day))
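Note that .map(lambda x: x.day) groups by day-of-month, so the 15th of different months would fall into the same bucket. On a current pandas, a sketch of grouping by calendar day instead (assuming the parsed frame from the accepted answer, where 'Dates' is the datetime index):

daily_means = dframe.resample('D').mean()   # modern spelling of resample("D", how='mean')

# Or, with 'Dates' as an ordinary datetime column rather than the index:
# daily_means = data.groupby(pd.Grouper(key='Dates', freq='D')).mean()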
