Pandas data frame multiplication where the data frames have different shapes - python

df1 is from an Excel file with the columns below:
Currency       Net Original  Net USD  COGS
USD            1.5           1.2       2.1
USD            1.3           2.1       1.2
USD            1.1           2.3      -1.1
Peso Mexicano  1.6           2.2       2.1
Step 1: Derive a conversion-rate column 'Conv' where 'Currency' is 'Peso Mexicano'
#Filter "Peso Mexicano" currency & take it as a separate data frame (df2)
df2 = df1[df1['Currency']== "Peso Mexicano"]
Step 2:
#Next, compute the conversion rate from df2
df2['Conv']= (df2['Net USD']/df2['Net Original'])
#Output 1.37
#Multiply the filtered result 'Conv' with 'COGS' column to get the desired result
df1['Inv'] = (df2['Conv']*df1['COGS'])*-1
display(df1)
However, the result shows NaN in the 'Inv' column wherever the currency is 'USD'.
Expected output:

Currency       Net Original  Net USD  COGS   Inv
USD            1.5           1.2       2.1   1.87
USD            1.3           2.1       1.2   0.64
USD            1.1           2.3      -1.1  -2.50
Peso Mexicano  1.6           2.2       2.1   1.87

You needed to aggregate your conv computation so it becomes a single scalar, even if there is only one value (I took the mean here).
Here is a working code:
df2 = df1[df1['Currency'] == "Peso Mexicano"]
conv = (df2['Net USD']/df2['Net Original']).mean()
df1['Inv'] = conv*df1['COGS']-1
output:
Currency Net Original Net USD COGS Inv
0 USD 1.5 1.2 2.1 1.8875
1 USD 1.3 2.1 1.2 0.6500
2 USD 1.1 2.3 -1.1 -2.5125
3 Peso Mexicano 1.6 2.2 2.1 1.8875
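
For context on why the original attempt produced NaN: df2['Conv'] only exists on the 'Peso Mexicano' row, so multiplying it by df1['COGS'] aligns on the row index and leaves NaN on every USD row. Reducing the rate to a scalar, as in the answer, broadcasts it to all rows. A minimal self-contained sketch of that fix (the DataFrame literal is just reconstructed sample data from the question):

import pandas as pd

df1 = pd.DataFrame({
    'Currency': ['USD', 'USD', 'USD', 'Peso Mexicano'],
    'Net Original': [1.5, 1.3, 1.1, 1.6],
    'Net USD': [1.2, 2.1, 2.3, 2.2],
    'COGS': [2.1, 1.2, -1.1, 2.1],
})

# Compute the rate as a scalar so it broadcasts to every row
# instead of aligning on the filtered index
peso = df1[df1['Currency'] == 'Peso Mexicano']
conv = (peso['Net USD'] / peso['Net Original']).mean()   # 1.375

df1['Inv'] = conv * df1['COGS'] - 1
print(df1)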

Related

Using pandas to get max value per row and column header

I have a data frame and I am looking to get the max value for each row and the column header for the column where the max value is located and return a new dataframe. In reality my data frame has over 50 columns and over 30,000 rows:
df1:
ID Tis RNA DNA Prot Node Exv
AB 1.4 2.3 0.0 0.3 2.4 4.4
NJ 2.2 3.4 2.1 0.0 0.0 0.2
KL 0.0 0.0 0.0 0.0 0.0 0.0
JC 5.2 4.4 2.1 5.4 3.4 2.3
So the ideal output looks like this:
df2:
ID
AB Exv 4.4
NJ RNA 3.4
KL N/A N/A
JC Prot 5.4
I have tried the following without any success:
df2 = df1.max(axis=1)
result.index = df1.idxmax(axis=1)
also tried:
df2=pd.Series(df1.columns[np.argmax(df1.values,axis=1)])
final=pd.DataFrame(df1.lookup(s.index,s),s)
I have looked at other posts but still can't seem to solve this.
Any help would be great
If ID is the index, use DataFrame.agg and replace all-zero rows with missing values:
df = df1.agg(['idxmax','max'], axis=1).mask(lambda x: x['max'].eq(0))
print (df)
idxmax max
AB Exv 4.4
NJ RNA 3.4
KL NaN NaN
JC Prot 5.4
If ID is a column:
df = df1.set_index('ID').agg(['idxmax','max'], axis=1).mask(lambda x: x['max'].eq(0))
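For reference, a self-contained run of the approach above on the sample data (the frame below is simply reconstructed from the question):

import pandas as pd

df1 = pd.DataFrame({
    'Tis':  [1.4, 2.2, 0.0, 5.2],
    'RNA':  [2.3, 3.4, 0.0, 4.4],
    'DNA':  [0.0, 2.1, 0.0, 2.1],
    'Prot': [0.3, 0.0, 0.0, 5.4],
    'Node': [2.4, 0.0, 0.0, 3.4],
    'Exv':  [4.4, 0.2, 0.0, 2.3],
}, index=['AB', 'NJ', 'KL', 'JC'])

# idxmax returns the column name of each row's maximum, max returns its value;
# mask() blanks out rows whose maximum is 0 (the all-zero rows)
df2 = df1.agg(['idxmax', 'max'], axis=1).mask(lambda x: x['max'].eq(0))
print(df2)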

Index - Match using Pandas

I have the following 2 data frames:
df1 = pd.DataFrame({
    'dates': ['02-Jan', '03-Jan', '30-Jan'],
    'currency': ['aud', 'gbp', 'eur'],
    'amount': [100, 330, 500]
})
df2 = pd.DataFrame({
    'dates': ['01-Jan', '02-Jan', '03-Jan', '30-Jan'],
    'aud': [0.72, 0.73, 0.74, 0.71],
    'gbp': [1.29, 1.30, 1.4, 1.26],
    'eur': [1.15, 1.16, 1.17, 1.18]
})
I want to look up the value in df2 at the intersection of df1.dates and df1.currency. For example: looking up the prevalent 'aud' exchange rate on '02-Jan'.
In Excel this can be solved with the INDEX + MATCH functionality. What would be the best way to replicate it in pandas?
Desired Output: add a new column 'price'
dates currency amount price
02-Jan aud 100 0.73
03-Jan gbp 330 1.4
30-Jan eur 500 1.18
The best equivalent of INDEX MATCH is DataFrame.lookup:
df2 = df2.set_index('dates')
df1['price'] = df2.lookup(df1['dates'], df1['currency'])
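Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so on current versions an alternative is needed. A minimal sketch using a stacked (date, currency) Series instead, starting from the original df2 (i.e. before the set_index call above):

import pandas as pd

# Long-form rate table: MultiIndex (dates, currency) -> rate
rates = df2.set_index('dates').stack()

# Pull the rate for each (date, currency) pair in df1
keys = pd.MultiIndex.from_arrays([df1['dates'], df1['currency']])
df1['price'] = rates.reindex(keys).to_numpy()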
Reshaping your df2 makes it a lot easier to do a straightforward merge:
In [42]: df2.set_index("dates").unstack().to_frame("value")
Out[42]:
value
dates
aud 01-Jan 0.72
02-Jan 0.73
03-Jan 0.74
30-Jan 0.71
gbp 01-Jan 1.29
02-Jan 1.30
03-Jan 1.40
30-Jan 1.26
eur 01-Jan 1.15
02-Jan 1.16
03-Jan 1.17
30-Jan 1.18
In this form, you just need to match the df1 fields with df2's new index as such:
In [43]: df1.merge(df2.set_index("dates").unstack().to_frame("value"), left_on=["currency", "dates"], right_index=True)
Out[43]:
dates currency amount value
0 02-Jan aud 100 0.73
1 03-Jan gbp 330 1.40
You can also left merge it if you don't want to lose missing data (I had to fix your df1 a little for this):
In [44]: df1.merge(df2.set_index("dates").unstack().to_frame("value"), left_on=["currency", "dates"], right_index=True, how="left")
Out[44]:
dates currency amount value
0 02-Jan aud 100 0.73
1 03-Jan gbp 330 1.40
2 04-Jan eur 500 NaN

Plot all columns from two separate dataframes by group in Python

I have two dataframes with lots of columns and rows with the same structure as the examples below:
df1:
Time Group Var1 Var2 Var3
1/1/2016 A 0.1 1.1 2.1
2/1/2016 A 0.2 1.2 2.2
1/1/2016 B 3.5 4.5 5.5
2/1/2016 B 3.6 4.6 5.6
df2:
Time Group Var1 Var2 Var3
1/1/2016 A 0.3 1.3 2.3
2/1/2016 A 0.4 1.4 2.4
1/1/2016 B 3.7 4.7 5.7
2/1/2016 B 3.8 4.8 5.8
I would like to write code that creates one plot per column, with Time as the x-axis, for each group, plotting columns with the same name from each dataframe on the same plot.
I was able to write code that does this, but without varying by group:
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt

df1_agg = df1.groupby(by=['Time']).sum()
df2_agg = df2.groupby(by=['Time']).sum()

def plots_all_columns(col_names, filename):
    with PdfPages(filename) as pdf:
        for i in col_names:
            plt.figure()
            df1_agg[i].plot(label="df1", legend=True, title=i)
            df2_agg[i].plot(label="df2", legend=True)
            pdf.savefig()
    plt.close('all')
How could I do the same as above but keeping the Group dimension in my dataframe? I have lots of groups so I would need a separate plot for each group category.
Thank you.
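One possible way to keep the Group dimension (a sketch under the assumptions above, not from the original thread): group both frames by ['Group', 'Time'] and loop over the group keys, producing one page per (group, column) pair.

import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

def plots_by_group(col_names, filename):
    # One aggregated frame per source, indexed by (Group, Time)
    df1_g = df1.groupby(['Group', 'Time']).sum()
    df2_g = df2.groupby(['Group', 'Time']).sum()
    groups = df1_g.index.get_level_values('Group').unique()

    with PdfPages(filename) as pdf:
        for grp in groups:
            for col in col_names:
                plt.figure()
                # .xs(grp) selects one group's rows, leaving Time as the x-axis
                df1_g.xs(grp)[col].plot(label='df1', legend=True,
                                        title=f'{col} - group {grp}')
                df2_g.xs(grp)[col].plot(label='df2', legend=True)
                pdf.savefig()
                plt.close()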

Using column name as a new attribute in pandas

I have the following data structure
Date Agric Food
01/01/1990 1.3 0.9
01/02/1990 1.2 0.9
I would like to convert it into the format
Date Sector Beta
01/01/1990 Agric 1.3
01/02/1990 Agric 1.2
01/01/1990 Food 0.9
01/02/1990 Food 0.9
While I am sure I can do this in a complicated way, is there a way of doing this in a few lines of code?
Using pd.DataFrame.melt
df.melt('Date', var_name='Sector', value_name='Beta')
Date Sector Beta
0 01/01/1990 Agric 1.3
1 01/02/1990 Agric 1.2
2 01/01/1990 Food 0.9
3 01/02/1990 Food 0.9
Use set_index and stack:
df.set_index('Date').rename_axis('Sector',axis=1).stack()\
.reset_index(name='Beta')
Output:
Date Sector Beta
0 01/01/1990 Agric 1.3
1 01/01/1990 Food 0.9
2 01/02/1990 Agric 1.2
3 01/02/1990 Food 0.9
Or you can use lreshape (df2 below refers to the original wide frame):
df=pd.lreshape(df2, {'Date': ["Date","Date"], 'Beta': ['Agric', 'Food']})
df['Sector']=sorted(df2.columns.tolist()[1:3]*2)
Out[654]:
Date Beta Sector
0 01/01/1990 1.3 Agric
1 01/02/1990 1.2 Agric
2 01/01/1990 0.9 Food
3 01/02/1990 0.9 Food
In case you have 48 columns
df=pd.lreshape(df2, {'Date':['Date']*2, 'Beta': df2.columns.tolist()[1:3]})
df['Sector']=sorted(df2.columns.tolist()[1:3]*2)
Also, for the Sector column, it is safer to create it with:
import itertools
list(itertools.chain.from_iterable(itertools.repeat(x, 2) for x in df2.columns.tolist()[1:3]))
EDIT: Because lreshape is undocumented (as per Ted Petrou: it's best to use available DataFrame methods if possible and then, if none are available, use documented functions. pandas is constantly looking to improve its API, and calling undocumented, old and experimental functions like lreshape for anything is unwarranted. Furthermore, this problem is a very straightforward use case for melt or stack. It is a bad precedent to set for those new to pandas to come to Stack Overflow and find upvoted answers with lreshape.)
Also, if you want to know more about this, you can check the discussion on GitHub.
Below is a method using pd.wide_to_long:
dict1 = {'Agric':'A_Agric','Food':'A_Food'}
df2 = df.rename(columns=dict1)
pd.wide_to_long(df2.reset_index(), ['A'], i='Date', j='Sector', sep='_', suffix='.+')\
    .reset_index().drop('index', axis=1).rename(columns={'A': 'Beta'})
Out[2149]:
Date Sector Beta
0 01/01/1990 Agric 1.3
1 01/02/1990 Agric 1.2
2 01/01/1990 Food 0.9
3 01/02/1990 Food 0.9

How to group data by time

I am trying to find a way to group data daily.
This is an example of my data set.
Dates Price1 Price 2
2002-10-15 11:17:03pm 0.6 5.0
2002-10-15 11:20:04pm 1.4 2.4
2002-10-15 11:22:12pm 4.1 9.1
2002-10-16 12:21:03pm 1.6 1.4
2002-10-16 12:22:03pm 7.7 3.7
Yeah, I would definitely use pandas for this. The trickiest part is just figuring out the datetime parser for pandas to use to load in the data. After that, it's just a resampling of the resulting DataFrame.
In [62]: parse = lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %I:%M:%S%p')
In [63]: dframe = pandas.read_table("data.txt", delimiter=",", index_col=0, parse_dates=True, date_parser=parse)
In [64]: print dframe
Price1 Price 2
Dates
2002-10-15 23:17:03 0.6 5.0
2002-10-15 23:20:04 1.4 2.4
2002-10-15 23:22:12 4.1 9.1
2002-10-16 12:21:03 1.6 1.4
2002-10-16 12:22:03 7.7 3.7
In [78]: means = dframe.resample("D", how='mean', label='left')
In [79]: print means
Price1 Price 2
Dates
2002-10-15 2.033333 5.50
2002-10-16 4.650000 2.55
where data.txt:
Dates , Price1 , Price 2
2002-10-15 11:17:03pm, 0.6 , 5.0
2002-10-15 11:20:04pm, 1.4 , 2.4
2002-10-15 11:22:12pm, 4.1 , 9.1
2002-10-16 12:21:03pm, 1.6 , 1.4
2002-10-16 12:22:03pm, 7.7 , 3.7
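On recent pandas versions the 'how' argument to resample has been removed and the Python 2 print statements above no longer apply; a rough modern equivalent of the same steps (a sketch assuming pandas >= 2.0, where read_table accepts date_format, and the same data.txt as above):

import pandas as pd

# Parse the 'Dates' column as the index using the same format as the lambda above
dframe = pd.read_table('data.txt', delimiter=',', index_col=0,
                       parse_dates=True,
                       date_format='%Y-%m-%d %I:%M:%S%p')

# Daily means; the 'how' keyword is gone, so call .mean() on the resampler
daily_means = dframe.resample('D').mean()
print(daily_means)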
From pandas documentation: http://pandas.pydata.org/pandas-docs/stable/pandas.pdf
# 72 hours starting with midnight Jan 1st, 2011
In [1073]: rng = date_range('1/1/2011', periods=72, freq='H')
Use
data.groupby(data['dates'].map(lambda x: x.day))
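A caveat on the groupby answer above: mapping to x.day groups by day-of-month, so rows with the same day number from different months would fall into one group. If 'dates' is a parsed datetime column, grouping by the full calendar date (or with pd.Grouper) avoids that; a small sketch, assuming the same data and column name as above:

import pandas as pd

# Group by the full calendar date rather than the day-of-month
daily = data.groupby(data['dates'].dt.date).mean(numeric_only=True)

# Or, equivalently, using a Grouper on the datetime column
daily = data.groupby(pd.Grouper(key='dates', freq='D')).mean()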
