Index - Match using Pandas

I have the following 2 data frames:
df1 = pd.DataFrame({
'dates': ['02-Jan','03-Jan','30-Jan'],
'currency': ['aud','gbp','eur'],
'amount': [100,330,500]
})
df2 = pd.DataFrame({
'dates': ['01-Jan','02-Jan','03-Jan','30-Jan'],
'aud': [0.72,0.73,0.74,0.71],
'gbp': [1.29,1.30,1.4,1.26],
'eur': [1.15,1.16,1.17,1.18]
})
I want to look up, for each row of df1, the value in df2 at the intersection of that row's date and currency. For example: looking up the prevailing 'aud' exchange rate on '02-Jan'.
In Excel this can be solved with the INDEX + MATCH functions. What is the best way to replicate that in pandas?
Desired Output: add a new column 'price'
dates currency amount price
02-Jan aud 100 0.73
03-Jan gbp 330 1.4
30-Jan eur 500 1.18

The best equivalent of INDEX MATCH is DataFrame.lookup:
df2 = df2.set_index('dates')
df1['price'] = df2.lookup(df1['dates'], df1['currency'])
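Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions, a minimal equivalent sketch (following the positional-indexing pattern the deprecation note suggests) is:
rates = df2.set_index('dates')
rows = rates.index.get_indexer(df1['dates'])      # integer row positions for each date
cols = rates.columns.get_indexer(df1['currency']) # integer column positions for each currency
df1['price'] = rates.to_numpy()[rows, cols]       # paired positional lookup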

Reshaping your df2 makes it a lot easier to do a straightforward merge:
In [42]: df2.set_index("dates").unstack().to_frame("value")
Out[42]:
value
dates
aud 01-Jan 0.72
02-Jan 0.73
03-Jan 0.74
30-Jan 0.71
gbp 01-Jan 1.29
02-Jan 1.30
03-Jan 1.40
30-Jan 1.26
eur 01-Jan 1.15
02-Jan 1.16
03-Jan 1.17
30-Jan 1.18
In this form, you just need to match the df1 fields against df2's new index, like so:
In [43]: df1.merge(df2.set_index("dates").unstack().to_frame("value"), left_on=["currency", "dates"], right_index=True)
Out[43]:
dates currency amount value
0 02-Jan aud 100 0.73
1 03-Jan gbp 330 1.40
You can also left-merge if you don't want to lose rows with no match (I had to tweak your df1 a little for this, giving the eur row a date that is missing from df2):
In [44]: df1.merge(df2.set_index("dates").unstack().to_frame("value"), left_on=["currency", "dates"], right_index=True, how="left")
Out[44]:
dates currency amount value
0 02-Jan aud 100 0.73
1 03-Jan gbp 330 1.40
2 04-Jan eur 500 NaN

transpose multiple columns in a pandas dataframe

AD AP AR MD MS iS AS
0 169.88 0.00 50.50 814.0 57.3 32.3 43.230
1 12.54 0.01 84.75 93.0 51.3 36.6 43.850
2 321.38 0.00 65.08 986.0 56.7 28.9 42.070
I would like to change the dataframe above to a transposed version where for each column, the values are put in a single row, so e.g. for columns AD and AP, it will look like this
d1_AD d2_AD d3_AD d1_AP d2_AP d3_AP
169.88 12.54 321.38 0.00 0.01 0.00
I can do a transpose, but how do I get the column names and output structure like above?
NOTE: The output is truncated for legibility but the actual output should include all the other columns like AR MD MS iS AS
We can rename to give the index the correct form, then stack and sort_index, then collapse the MultiIndex, and finally to_frame and transpose:
new_df = df.rename(lambda x: f'd{x + 1}').stack().sort_index(level=1)
new_df.index = new_df.index.map('_'.join)
new_df = new_df.to_frame().transpose()
Input df:
df = pd.DataFrame({
'AD': [169.88, 12.54, 321.38], 'AP': [0.0, 0.01, 0.0],
'AR': [50.5, 84.75, 65.08], 'MD': [814.0, 93.0, 986.0],
'MS': [57.3, 51.3, 56.7], 'iS': [32.3, 36.6, 28.9],
'AS': [43.23, 43.85, 42.07]
})
new_df:
d1_AD d2_AD d3_AD d1_AP d2_AP ... d2_MS d3_MS d1_iS d2_iS d3_iS
0 169.88 12.54 321.38 0.0 0.01 ... 51.3 56.7 32.3 36.6 28.9
[1 rows x 21 columns]
If lexicographic sorting gives the wrong order (with 10 or more rows, 'd10' sorts before 'd2'), we can wait to convert the MultiIndex to strings until after sort_index:
new_df = df.stack().sort_index(level=1)  # sort by column name (level 1); integer row labels keep numeric order
new_df.index = new_df.index.map(lambda x: f'd{x[0]+1}_{x[1]}')
new_df = new_df.to_frame().transpose()
Larger frame:
df = pd.concat([df] * 4, ignore_index=True)
Truncated output:
d1_AD d2_AD d3_AD d4_AD d5_AD ... d8_iS d9_iS d10_iS d11_iS d12_iS
0 169.88 12.54 321.38 169.88 12.54 ... 36.6 28.9 32.3 36.6 28.9
[1 rows x 84 columns]
If the columns need to be in the same order as df, use melt with ignore_index=False so the original row labels are kept (no need to recalculate groups) and melt handles the ordering:
new_df = df.melt(value_name=0, ignore_index=False)
new_df = new_df[[0]].set_axis(
# Create the new index
'd' + (new_df.index + 1).astype(str) + '_' + new_df['variable']
).transpose()
Truncated output on the larger frame:
d1_AD d2_AD d3_AD d4_AD d5_AD ... d8_AS d9_AS d10_AS d11_AS d12_AS
0 169.88 12.54 321.38 169.88 12.54 ... 43.85 42.07 43.23 43.85 42.07
[1 rows x 84 columns]
You could try melt and set_index with groupby:
x = df.melt().set_index('variable').rename_axis(index=None).T.set_axis([0])
x.set_axis(x.columns + x.columns.to_series().groupby(level=0).transform('cumcount').add(1).astype(str), axis=1)
AD1 AD2 AD3 AP1 AP2 AP3 AR1 AR2 AR3 ... MS1 MS2 MS3 iS1 iS2 iS3 AS1 AS2 AS3
0 169.88 12.54 321.38 0.0 0.01 0.0 50.5 84.75 65.08 ... 57.3 51.3 56.7 32.3 36.6 28.9 43.23 43.85 42.07
[1 rows x 21 columns]

Filter out the columns having the same value across all rows

This is my dataset:
df = {'brand_no':['BH 1', 'BH 2', 'BH 5', 'BH 7', 'BH 6'],
'1240000601_min':[5.87,5.87,5.87,5.87,np.nan],
'1240000601_max':[8.87,7.47,10.1,1.9,10.8],
'1240000603_min':[5.87,np.nan,6.5,2.0,7.8],
'1240000603_max':[8.57,7.47,10.2,1.0,10.2],
'1240000604_min':[5.87,5.67,6.9,1.0,7.8],
'1240000604_max':[8.87,8.87,8.87,np.nan,8.87],
'1240000605_min':[15.87,15.67,16.9,1.0,17.8],
'1240000605_max':[18.11,17.47,20.1,1.9,22.6],
'1240000606_min':[8.12,8.12,np.nan,8.12,np.nan],
'1240000606_max;':[np.nan,7.47,10.1,1.9,np.nan]}
# Create DataFrame
df = pd.DataFrame(df)
# Print the output.
df
As you can see, some columns stay the same in every row except for NaN (the data is sparse, so it contains NaN as well). I want to drop these columns whose value is the same across all rows (in this case columns 1240000601_min, 1240000604_max and 1240000606_min).
Desired output:
As we can see here, all the columns with the same value across all rows are dropped. Please help me get this.
You can use something like this:
columns = [column for column in df.columns if df[column].nunique()==1]
df = df.drop(columns=columns)
df.nunique() drops nans by default, so you don't have to worry about that.
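A quick illustration of that default (a minimal sketch):
s = pd.Series([5.87, 5.87, np.nan])
s.nunique()               # 1 -- NaN is ignored
s.nunique(dropna=False)   # 2 -- NaN counted as a value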
Try this:
df_cleared = df.loc[:, df.nunique() > 1]
You can use df.nunique() to count the unique items in each column, and filter for those greater than 1 with .gt(1). This forms a boolean mask over the columns. Then pass that mask as the second argument of .loc to filter the columns:
df_cleaned = df.loc[:, df.nunique().gt(1)]
Result:
print(df_cleaned)
brand_no 1240000601_max 1240000603_min 1240000603_max 1240000604_min 1240000605_min 1240000605_max 1240000606_max;
0 BH 1 8.87 5.87 8.57 5.87 15.87 18.11 NaN
1 BH 2 7.47 NaN 7.47 5.67 15.67 17.47 7.47
2 BH 5 10.10 6.50 10.20 6.90 16.90 20.10 10.10
3 BH 7 1.90 2.00 1.00 1.00 1.00 1.90 1.90
4 BH 6 10.80 7.80 10.20 7.80 17.80 22.60 NaN

python / pandas - Find common columns between two dataframes, and create another one with same columns showing their difference

My version of pandas is:
pd.__version__
'0.25.3'
I have two dataframes, below is a sample, with the majority of the columns being the same across the two dataframes. I am trying to find the common columns, and create a new dataframe with all the common columns that shows their difference in values.
A sample from c_r dataframe:
Comp_name EOL - CL Per $ Access - CL Per $ Total Impact - CL Per $
Nike -0.02 -0.39 -0.01
Nike -0.02 -0.39 -0.02
Adidas -0.02 -0.39 -0.01
Adidas -0.02 -0.39 -0.02
A sample from x dataframe:
Comp_name EOL - CL Per $ Access - CL Per $ Total Impact - CL Per $
Nike -0.02 -0.39 0.05
Nike -0.02 -0.39 0.03
Adidas -0.02 -0.39 0.08
Adidas -0.02 -0.39 0.08
new_df (to have the same column names and show the difference), i.e.:
EOL - CL Per $ - Diff Access - CL Per $ - Diff Total Impact - CL Per $ - Diff
-0.00 -0.00 -0.06
-0.00 -0.00 -0.05
-0.00 -0.00 -0.09
-0.00 -0.00 -0.10
I have tried the following - please see the marked line for where the error occurs:
new_df = pd.DataFrame()
for i in c_r:
    for j in x:
        if c_r[i].dtype != object and x[j].dtype != object:
            if i == j:
                ## THE ISSUE IS IN THE LINE BELOW ##
                new_df[i+'-Diff'] = (c_r[i]) - (x[j])
            else:
                pass
but for some reason I get back only 1 row of values.
Any ideas why my code does not work? How can I achieve the resulting dataframe, including the initial Comp_name column?
Thanks all.
Have you tried using intersection (or symmetric_difference, for the non-common columns)? i.e.
a = dataframe2.columns.intersection(dataframe1.columns)
print(a)
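Building on that, a minimal sketch that subtracts the common numeric columns and keeps Comp_name. It assumes, as in your samples, that the two frames are row-aligned; note that pandas subtraction aligns on the row index, so differing indexes would produce NaN rows, which may be why you got back only one row:
common = c_r.columns.intersection(x.columns)
num = [c for c in common if c_r[c].dtype != object and x[c].dtype != object]
# reset_index(drop=True) makes the subtraction positional rather than index-aligned
new_df = (c_r[num].reset_index(drop=True) - x[num].reset_index(drop=True)).add_suffix(' - Diff')
new_df.insert(0, 'Comp_name', c_r['Comp_name'].to_numpy())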
I think I understand the problem now; here is a small example:
import pandas as pd
d = {'col1': [-0.02 , -0.02 ,-0.02 ,-0.02 ], 'col2': [-0.39, -0.39, -0.39, -0.39],'col3': [-0.01,-0.02,-0.01,-0.02]}
d1 = {'col1': [-0.02 , -0.02 ,-0.02 ,-0.02 ], 'col2': [-0.39, -0.39, -0.39, -0.39],'col3': [0.05,0.03,0.06,0.04]}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d1)
df = df.apply(pd.to_numeric, errors='coerce')
df2 = df2.apply(pd.to_numeric, errors='coerce')
print(df)
print(df2)
col1 = df.col1 - df2.col1
col2 = df.col2 - df2.col2
col3 = df.col3 - df2.col3
dfnew = pd.concat([col1, col2,col3], axis=1)
print(type(col1))
print(dfnew)

Can't replace string symbol "-" in dataframe on jupyter notebook

I import data from url = ("http://finviz.com/quote.ashx?t=" + symbol.lower()) and got this table:
P/B P/E Forward P/E PEG Debt/Eq EPS (ttm) Dividend % ROE \
AMZN 18.73 92.45 56.23 2.09 1.21 16.25 - 26.70%
GOOG 4.24 38.86 - 2.55 - 26.65 - -
PG 4.47 22.67 19.47 3.45 0.61 4.05 3.12% 18.80%
KO 11.04 30.26 21.36 4.50 2.45 1.57 3.29% 15.10%
IBM 5.24 9.28 8.17 9.67 2.37 12.25 5.52% 30.90%
ROI EPS Q/Q Insider Own
AMZN 3.50% 1026.20% 16.20%
GOOG - 36.50% 5.74%
PG 13.10% 15.50% 0.10%
KO 12.50% 56.80% 0.10%
IBM 17.40% 0.70% 0.10%
Then I tried to convert the strings to floats:
df = df[(df['P/E'].astype(float)<20) & (df['P/B'].astype(float) < 3)]
and got "ValueError: could not convert string to float:"
I think the values like 0.70% and the "-" sign are the problem.
I tried:
df.replace("-","0")
df.replace('-', 0)
df.replace('-', nan)
But nothing works.
You may need to assign it back
df=df.replace("-","0")
And I recommend to_numeric
df['P/E']=pd.to_numeric(df['P/E'],errors = 'coerce')
df['P/B']=pd.to_numeric(df['P/B'],errors = 'coerce')
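Note that to_numeric with errors='coerce' will also turn percentage strings such as '0.70%' into NaN. If you want those columns numeric as well, one option (a sketch, assuming the percentage columns hold plain strings) is to strip the sign first:
for col in ['Dividend %', 'ROE', 'ROI', 'EPS Q/Q', 'Insider Own']:
    df[col] = pd.to_numeric(df[col].str.rstrip('%'), errors='coerce')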
You should use numpy:
import numpy as np
then apply the replacement:
df = df.replace('-', np.nan)
Next, change the datatype:
df['Forward P/E'] = df['Forward P/E'].astype(float)
Lastly, you can test if the datatype is float64.
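For example, a minimal check:
print(df['Forward P/E'].dtype)  # expect float64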

Multiply all columns of a multi-indexed DataFrame by appropriate values in a Series

I feel like this one should be obvious, but I'm a bit stuck.
I have a DataFrame (df) with a 3-level MultiIndex on the rows. One of the levels of the MultiIndex is ccy and represents the currency that denominates the information contained in that row. Each row has 3 columns of data.
I would like to convert all of the data to be denominated in a reference currency (say USD). To do this, I have a series (forex) that contains foreign exchange rates for the relevant currencies.
So the goal is simple: multiply all the data in each row of df by the value of forex that corresponds to the ccy entry of the index of that row in df.
The mechanical setup looks like this:
import pandas as pd
import numpy as np
import itertools
np.random.seed(0)
tuples = list(itertools.product(
list('abd'),
['one', 'two', 'three'],
['USD', 'EUR', 'GBP']
))
np.random.shuffle(tuples)
idx = pd.MultiIndex.from_tuples(tuples[:-10], names=['letter', 'number', 'ccy'])
df = pd.DataFrame(np.random.randn(len(idx), 3), index=idx,
columns=['val_1', 'val_2', 'val_3'])
forex = pd.Series({'USD': 1.0,
'EUR': 1.3,
'GBP': 1.7})
I can get what I need by running:
df.apply(lambda col: col.mul(forex, level='ccy'), axis=0)
But it seems weird to me that I would need to use pd.DataFrame.apply in such a simple case. I would have expected the following syntax (or something very much like it) to work:
df.mul(forex, level='ccy', axis=0)
but that gives me:
ValueError: cannot reindex from a duplicate axis
Clearly the apply method isn't a disaster. But it just seems weird that I couldn't figure out the syntax for doing this directly across all the columns with mul. Is there a more direct way to handle this? If not, is there an intuitive reason the mul syntax shouldn't be enhanced to work this way?
This now works in master/0.14. See the issue: https://github.com/pydata/pandas/pull/6682
In [11]: df.mul(forex,level='ccy',axis=0)
Out[11]:
val_1 val_2 val_3
letter number ccy
a one GBP -2.172854 2.443530 -0.132098
d three USD 1.089630 0.096543 1.418667
b two GBP 1.986064 1.610216 1.845328
three GBP 4.049782 -0.690240 0.452957
a two GBP -2.304713 -0.193974 -1.435192
b one GBP 1.199589 -0.677936 -1.406234
d two GBP -0.706766 -0.891671 1.382272
b two EUR -0.298026 2.810233 -1.244011
d one EUR 0.087504 0.268448 -0.593946
GBP -1.801959 1.045427 2.430423
b three EUR -0.275538 -0.104438 0.527017
a one EUR 0.154189 1.630738 1.844833
b one EUR -0.967013 -3.272668 -1.959225
d three GBP 1.953429 -2.029083 1.939772
EUR 1.962279 1.388108 -0.892566
a three GBP 0.025285 -0.638632 -0.064980
USD 0.367974 -0.044724 -0.302375
[17 rows x 3 columns]
Here is another way to do it (also requires master/0.14):
In [127]: df = df.sortlevel()
In [128]: df
Out[128]:
val_1 val_2 val_3
letter number ccy
a one EUR 0.118607 1.254414 1.419102
GBP -1.278149 1.437371 -0.077705
three GBP 0.014873 -0.375666 -0.038224
USD 0.367974 -0.044724 -0.302375
two GBP -1.355714 -0.114103 -0.844231
b one EUR -0.743856 -2.517437 -1.507096
GBP 0.705641 -0.398786 -0.827197
three EUR -0.211952 -0.080337 0.405398
GBP 2.382224 -0.406024 0.266445
two EUR -0.229251 2.161717 -0.956931
GBP 1.168273 0.947186 1.085487
d one EUR 0.067311 0.206499 -0.456881
GBP -1.059976 0.614957 1.429661
three EUR 1.509445 1.067775 -0.686589
GBP 1.149076 -1.193578 1.141042
USD 1.089630 0.096543 1.418667
two GBP -0.415745 -0.524512 0.813101
[17 rows x 3 columns]
idx = pd.IndexSlice
In [129]: pd.concat([ df.loc[idx[:,:,x],:]*v for x,v in forex.iteritems() ])
Out[129]:
val_1 val_2 val_3
letter number ccy
a one EUR 0.154189 1.630738 1.844833
b one EUR -0.967013 -3.272668 -1.959225
three EUR -0.275538 -0.104438 0.527017
two EUR -0.298026 2.810233 -1.244011
d one EUR 0.087504 0.268448 -0.593946
three EUR 1.962279 1.388108 -0.892566
a one GBP -2.172854 2.443530 -0.132098
three GBP 0.025285 -0.638632 -0.064980
two GBP -2.304713 -0.193974 -1.435192
b one GBP 1.199589 -0.677936 -1.406234
three GBP 4.049782 -0.690240 0.452957
two GBP 1.986064 1.610216 1.845328
d one GBP -1.801959 1.045427 2.430423
three GBP 1.953429 -2.029083 1.939772
two GBP -0.706766 -0.891671 1.382272
a three USD 0.367974 -0.044724 -0.302375
d three USD 1.089630 0.096543 1.418667
[17 rows x 3 columns]
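On modern pandas the same concat trick still works; sortlevel has since been replaced by sort_index and Series.iteritems by Series.items, so an equivalent sketch is:
df = df.sort_index()
idx = pd.IndexSlice
result = pd.concat([df.loc[idx[:, :, ccy], :] * rate for ccy, rate in forex.items()])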
Here's another way, via merging:
In [36]: f = forex.to_frame('value')
In [37]: f.index.name = 'ccy'
In [38]: pd.merge(df.reset_index(),f.reset_index(),on='ccy')
Out[38]:
letter number ccy val_1 val_2 val_3 value
0 a one GBP -1.278149 1.437371 -0.077705 1.7
1 b two GBP 1.168273 0.947186 1.085487 1.7
2 b three GBP 2.382224 -0.406024 0.266445 1.7
3 a two GBP -1.355714 -0.114103 -0.844231 1.7
4 b one GBP 0.705641 -0.398786 -0.827197 1.7
5 d two GBP -0.415745 -0.524512 0.813101 1.7
6 d one GBP -1.059976 0.614957 1.429661 1.7
7 d three GBP 1.149076 -1.193578 1.141042 1.7
8 a three GBP 0.014873 -0.375666 -0.038224 1.7
9 d three USD 1.089630 0.096543 1.418667 1.0
10 a three USD 0.367974 -0.044724 -0.302375 1.0
11 b two EUR -0.229251 2.161717 -0.956931 1.3
12 d one EUR 0.067311 0.206499 -0.456881 1.3
13 b three EUR -0.211952 -0.080337 0.405398 1.3
14 a one EUR 0.118607 1.254414 1.419102 1.3
15 b one EUR -0.743856 -2.517437 -1.507096 1.3
16 d three EUR 1.509445 1.067775 -0.686589 1.3
[17 rows x 7 columns]
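To complete the conversion from that merged frame, multiply the value columns by the rate (a sketch; m is a hypothetical name for the merge result above):
m = pd.merge(df.reset_index(), f.reset_index(), on='ccy')
m[['val_1', 'val_2', 'val_3']] = m[['val_1', 'val_2', 'val_3']].mul(m['value'], axis=0)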
