rsuffix for merging data in pandas - python

I have multiple dataframes with the same columns but different values. They look like this:
Product 1 Dataframe
Here's the code that generated them:
import pandas as pd
d1 = {"Year":[2018,2019,2020],"Quantity": [10,20,30], "Price": [100,200,300]}
df_product1 = pd.DataFrame(data=d1)
d2 = {"Year":[2018,2019,2020],"Quantity": [20,20,50], "Price": [120,110,380]}
df_product2 = pd.DataFrame(data=d2)
d3 = {"Year":[2018,2019,2020],"Quantity": [40,20,70], "Price": [1000,140,380]}
df_product3 = pd.DataFrame(data=d3)
I merge two dataframes and specify suffixes like so:
df_total = df_product1.merge(df_product2,on="Year", suffixes = ("_Product1","_Product2"))
And I get
First Merged Dataframe
However, when I merge another dataframe to the result above using:
df_total = df_total.merge(df_product3,on="Year", suffixes = ("_Product","_Product3"))
I get
Final Merged Dataframe
Where there is no suffix for the third product.
I would like the last two columns of the dataframe to be Quantity_Product3, Price_Product3 instead of just Quantity and Price.
Let me know if it is possible or if I need to approach the problem from a completely different angle.

Why you don't get the result you want
It's explained in the docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
suffixes : list-like, default is ("_x", "_y")
A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. Pass a value of None instead of a string to indicate that the column name from left or right should be left as-is, with no suffix. At least one of the values must not be None.
Suffixes are added to overlapping column names.
See this example: suffixes are added to column b, because both dataframes have a column b, but not to columns a and c, which each appear in only one of the two dataframes.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(columns=['a', 'b'], data=np.random.rand(10, 2))
df2 = pd.DataFrame(columns=['b', 'c'], data=np.random.rand(10, 2), index=np.arange(5, 15))
# equivalent to an inner join on the indices
out = pd.merge(df1, df2, how='inner', left_index=True, right_index=True)
A crude solution
Why don't you just rename the columns manually? It's not elegant, but it is effective.
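For instance, a minimal sketch of that route, using the frames from the question (the new labels are just illustrative):
df_total = df_total.merge(df_product3, on="Year")
df_total = df_total.rename(columns={"Quantity": "Quantity_Product3", "Price": "Price_Product3"})
After the second merge, only the incoming df_product3 columns still carry the bare names, so renaming just those two gives the Quantity_Product3 and Price_Product3 labels you asked for.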
A possible alternative
The table you are trying to build looks like a pivot. I would look into normalising all your dataframes, concatenating them, then running a pivot on the result.
Depending on your case this may be more convoluted, and it could well be overkill. I mention it because I want to bring your attention to the concepts of pivoting/unpivoting (stacking/unstacking/normalising) data.
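Applied to the product frames above, that pipeline could look like this (a sketch; the "Product1"/"Product2"/"Product3" labels are chosen here for illustration):
frames = {"Product1": df_product1, "Product2": df_product2, "Product3": df_product3}
# unpivot each frame into long format: one row per (Year, metric, product)
long = pd.concat(
    df.melt(id_vars="Year", var_name="metric", value_name="value").assign(product=name)
    for name, df in frames.items()
)
# pivot back to wide format, with product-qualified column names
wide = long.pivot_table(index="Year", columns=["metric", "product"], values="value")
wide.columns = [f"{metric}_{product}" for metric, product in wide.columns]
This yields one row per Year and columns like Quantity_Product1 and Price_Product3, with no suffix bookkeeping at merge time.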
The code below takes a df that looks similar to yours and normalises it with stack(). For simpler cases you can use pandas.melt(), as sketched above. I don't have the exact data of your example, but this should be a good starting point.
import numpy as np
import pandas as pd

def random_dates(start, end, n, unit='D', seed=None):
    # draw n random dates between start and end
    ndays = (end - start).days + 1
    return start + pd.to_timedelta(
        np.random.randint(0, ndays, n), unit=unit
    )

df = pd.DataFrame()
mysize = 20
df['key'] = np.arange(0, mysize)
df['A_value'] = np.random.randint(0, 10000, mysize)
df['A_date'] = random_dates(pd.to_datetime('2010-01-01'), pd.to_datetime('2019-01-01'), mysize)
df['B_value'] = np.random.randint(-5000, 5000, mysize)
df['B_date'] = random_dates(pd.to_datetime('2005-01-01'), pd.to_datetime('2015-01-01'), mysize)
df['C_value'] = np.random.randint(-10000, 10000, mysize)
df['C_date'] = random_dates(pd.to_datetime('2000-01-01'), pd.to_datetime('2019-01-01'), mysize)

# move the A_/B_/C_ columns into rows
df2 = df.set_index('key', drop=True, verify_integrity=True)
df2 = df2.stack().reset_index()
df2.columns = ['key', 'rawsource', 'rawvalue']

# split 'A_value' into source 'A' and metric 'value'
df2['source'] = df2['rawsource'].str[0:1]
df2['metric'] = df2['rawsource'].str[2:]
df2 = df2.drop(['rawsource'], axis=1)

# pivot back to one row per (key, source)
df_piv = (df2.pivot_table(index=['key', 'source'], columns='metric',
                          values='rawvalue', aggfunc='first')
             .reset_index()
             .rename_axis(None, axis=1))
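For this particular frame, the stack step above could also be written with pandas.melt, which skips the set_index/stack dance (a sketch):
df2 = df.melt(id_vars='key', var_name='rawsource', value_name='rawvalue')
The source/metric split and the pivot then proceed exactly as above.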

Related

Concatenate two dataframes with different row indices

I want to concatenate two data frames of the same length, by adding a column to the first one (df).
But because certain df rows are being filtered, it seems the index isn't matching.
import io
import pandas as pd

# 'uploaded' comes from a file-upload helper, e.g. google.colab's files.upload()
df = pd.read_csv(io.StringIO(uploaded['customer.csv'].decode('utf-8')), sep=";")
df["Margin"] = df["Sales"] - df["Cost"]
df = df.loc[df["Margin"] > -100000]
df = df.loc[df["Sales"] > 1000]
df.reindex()  # note: without arguments this returns a new object and changes nothing in place
df
This returns a dataframe whose index still has the gaps left by the filtering. So this operation:
customerCluster = pd.concat([df, clusters], axis = 1, ignore_index= True)
print(customerCluster)
Is returning rows padded with NaN wherever the two indices don't line up.
So I've tried reindex, and the argument ignore_index=True, as you can see in the code snippets above.
Thanks for all the answers. If anyone encounters the same problem, the solution I found was this:
customerID = df["CustomerID"]
customerID = customerID.reset_index(drop=True)
df = df.reset_index(drop=True)
So, basically, the indexes of both data frames are now matching, thus:
customerCluster = pd.concat((customerID, clusters), axis = 1)
This will correctly concatenate the two data frames.
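The same fix applies if you keep the whole filtered dataframe rather than a single column; a minimal sketch (assuming clusters already carries a default 0..n-1 index):
df = df.reset_index(drop=True)
customerCluster = pd.concat([df, clusters], axis=1)
pd.concat with axis=1 aligns rows by index label, so both frames need matching indices for the rows to line up.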

Populating a column based off of values in another column

Hi I am working with pandas to manipulate some lab data. I currently have a data frame with 5 columns.
The first three columns (Analyte, CAS NO(1), and Value(1)) are in the correct order.
The last two columns (CAS NO(2) and Value(2)) are not.
Is there a way to align CAS NO(2) and Value(2) with the first three columns, based on matching CAS numbers (i.e. CAS NO(2) = CAS NO(1))?
I am new to python and pandas. Thank you for your help.
You can reorder the columns by reassigning the df variable to a slice of itself, indexed on a list whose entries are the column names in question.
colidx = ['Analyte', 'CAS NO(1)', 'CAS NO(2)']
df = df[colidx]
Better to provide input data in text format so we can copy-paste it. I understand your question like this: you need to sort the last two columns together, so that CAS NO(2) matches CAS NO(1).
Since CAS NO(2) = CAS NO(1), you then do not need the duplicated CAS NO(2) column, right?
Split off the last two columns, index them by CAS number, convert that to a dict, and use the dict to map new values.
# Split off the last two columns and index them by CAS number.
df_tmp = df[['CAS NO(2)', 'Value(2)']]
df_tmp = df_tmp.set_index('CAS NO(2)')
# Keep only the first three columns of the original dataframe.
df = df[['Analyte', 'CAS NO(1)', 'Value(1)']]
# Now copy CAS NO(1) to CAS NO(2).
df['CAS NO(2)'] = df['CAS NO(1)']
# Now create the Value(2) column by looking each CAS number up in the dict.
df['Value(2)'] = df['CAS NO(1)'].map(df_tmp.to_dict()['Value(2)'])
Try the following:
import pandas as pd
import numpy as np
#create an example of your table
list_CASNo1 = ['71-43-2', '100-41-4', np.nan, '1634-04-4']
list_Val1 = [np.nan] * len(list_CASNo1)
list_CASNo2 = [np.nan, np.nan, np.nan, '100-41-4']
list_Val2 = [np.nan, np.nan, np.nan, '18']
df = pd.DataFrame(zip(list_CASNo1, list_Val1, list_CASNo2, list_Val2),
                  columns=['CASNo(1)', 'Value(1)', 'CAS NO(2)', 'Value(2)'],
                  index=['Benzene', 'Ethylbenzene', 'Gasoline Range Organics', 'Methyl-tert-butyl ether'])
#split the data to two dataframes
df1 = df[['CASNo(1)','Value(1)']]
df2 = df[['CAS NO(2)','Value(2)']]
#merge df2 into df1 based on the specified columns;
#reset_index and set_index ensure that df_adjusted
#ends up with the same index as df1
df_adjusted = df1.reset_index().merge(df2.dropna(),
                                      how='left',
                                      left_on='CASNo(1)',
                                      right_on='CAS NO(2)').set_index('index')
but be careful with duplicates in your key columns: each duplicate match multiplies the rows in the merged result.
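A defensive tweak along those lines (hypothetical, layered on the answer's code): deduplicate the key column before merging, so each left row matches at most once.
df2 = df2.dropna().drop_duplicates(subset='CAS NO(2)')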

Merging multiple pandas time-series dfs on a DATE index, where the dfs are contained in a python dictionary

I have a python dictionary that contains CLOSE prices for several stocks, stock indices, fixed income instruments and currencies (AAPL, AORD, etc.), using a DATE index. The different DFs in the dictionary have different lengths, i.e. some time series are longer than others. All the DFs have the same field, i.e. 'CLOSE'.
The length of the dictionary is variable. How can I merge all the DFs into a single one by DATE index, using as lsuffix the partial name of the file I am reading? (For example, the AAPL_CLOSE.csv file has a DATE and a CLOSE field, but to differentiate it from the other 'CLOSE' columns in the merged DF, its name should be AAPL_CLOSE.)
This is what I have:
import glob

asset_name = []
files_to_test = glob.glob('*_CLOSE*')
for name in files_to_test:
    asset_name.append(name.rsplit('_', 1)[0])
Which returns:
asset_name = ['AAPL', 'AORD', 'EURODOLLAR1MONTH', 'NGETF', 'USDBRL']
files_to_test = ['AAPL_CLOSE.csv',
'AORD_CLOSE.csv',
'EURODOLLAR1MONTH_CLOSE.csv',
'NGETF_CLOSE.csv',
'USDBRL_CLOSE.csv']
Then:
asset_dict = {}
for name, file in zip(asset_name, files_to_test):
    asset_dict[name] = pd.read_csv(file, index_col='DATE', parse_dates=True)
This is the little function I would like to generalize, to create a big merge of all the DFs in the dictionary by DATE, using as lsuffix the corresponding elements of asset_name.
merged = asset_dict['AAPL'].join(asset_dict['AORD'], how = 'right', lsuffix ='_AAPL')
The DFs will have a lot of N/A due to the mismatch of lengths, but I will deal with that later.
After not getting any answers, I found a solution that works, although there might be better ones. This is what I did:
asset_dict = {}
for name, file in zip(asset_name, files_to_test):
    asset_dict[name] = pd.read_csv(file, index_col='DATE', parse_dates=True)
    asset_dict[name].sort_index(ascending=True, inplace=True)
Pandas can concatenate multiple dfs (at once, not one by one) contained in a dictionary 'straight out of the box', without much tweaking, by specifying the axis and other parameters.
df = pd.concat(asset_dict, axis = 1)
The resulting df is a multi-index df, which is a problem for me. Also, the time series for stock prices are all of different lengths, which creates a lot of NaNs. I solved both problems with this:
df.columns = df.columns.droplevel(1)
df.dropna(inplace = True)
Now, the columns of my df are these:
['AAPL', 'AORD', 'EURODOLLAR1MONTH', 'NGETF', 'USDBRL']
But since I wanted them to follow the 'STOCK_CLOSE' format, I do this:
old_columns = df.columns.tolist()
new_columns = []
for name in old_columns:
    new_columns.append(name + '_CLOSE')
df.columns = new_columns
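For what it's worth, pandas has a one-liner equivalent to that loop:
df = df.add_suffix('_CLOSE')
add_suffix appends the given string to every column label, which is exactly the STOCK_CLOSE shape wanted here.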

Pandas difference between dataframes on column values

I couldn't find a way to have a dataframe that has the difference of 2 dataframes based on a column. So basically:
dfA:
ID  val
1   test
2   other test

dfB:
ID  val
2   other test

I want to have a dfC that holds the difference dfA - dfB, based on column ID:

dfC:
ID  val
1   test
Merge the dataframes on ID:
dfMerged = dfA.merge(dfB, on='ID', how='outer')  # merge defaults to an inner join, so request 'outer' explicitly
In the merged dataframe, name collisions are avoided by using the suffixes _x and _y to denote the left and right source dataframes.
So you'll end up with (most likely) val_x and val_y. Compare these columns however you want to. For example:
dfMerged['x_y_test'] = dfMerged.val_y == dfMerged.val_x
# gives you a column with a comparison of val_x and val_y
Use this as a mask to get to the desired dfC in your question.
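For instance, with the sample frames above, the left-only rows fall out of the outer merge wherever the right-hand side is NaN (a sketch; the column names assume the default _x/_y suffixes):
mask = dfMerged['val_y'].isna()
dfC = dfMerged.loc[mask, ['ID', 'val_x']].rename(columns={'val_x': 'val'})
If val can legitimately be NaN, merge with indicator=True instead and filter on _merge == 'left_only'.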
Does this work for you?
dfC = dfA[~dfA["ID"].isin(dfB["ID"])]
How about this: it keeps the rows whose ID appears in only one of the two frames, which equals dfA - dfB here because dfB is a subset of dfA:
dfC = pd.concat([dfA, dfB]).drop_duplicates(subset="ID", keep=False)

Iteratively concatenate pandas dataframe with multiindex

I am iteratively processing a couple of "groups" and I would like to concatenate them into a single dataframe, with every group identified by a 2nd-level index.
This:
print(pd.concat([df1, df2, df3], keys=["A", "B", "C"]))
was suggested to me - but it doesn't play well with iteration.
I am currently doing
data_all = pd.DataFrame([])
for a in a_list:
group = some.function(a, etc)
group = group.set_index(['CoI'], append=True, drop=True)
group = group.reorder_levels(['CoI','oldindex'])
data_all = pd.concat([data_all, group], ignore_index=False)
But the last line totally destroys my multi-index and I cannot reconstruct it.
Can you give me a hand?
You should be able to just make data_all a list and concatenate once at the end:
data_all = []
for a in a_list:
    group = some.function(a, etc)
    group = group.set_index(['CoI'], append=True, drop=True)
    group = group.reorder_levels(['CoI','oldindex'])
    data_all.append(group)
data_all = pd.concat(data_all, ignore_index=False)
Also keep in mind that pandas' concat works with iterators. Something like yield group may be more efficient than appending to a list each time. I haven't profiled it though!
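A minimal sketch of that idea, reusing the placeholder some.function from the question:
def iter_groups(a_list):
    # generator: yields one reindexed group at a time
    for a in a_list:
        group = some.function(a, etc)
        group = group.set_index(['CoI'], append=True, drop=True)
        yield group.reorder_levels(['CoI', 'oldindex'])

data_all = pd.concat(iter_groups(a_list), ignore_index=False)
pd.concat accepts any iterable of frames, so the generator can be passed in directly.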
