How to specify hierarchical columns in Pandas merge? - python

After a serious misconception of how on works in join (spoiler: very different from on in merge), this is my example code.
import pandas as pd
index1 = pd.MultiIndex.from_product([["variables"], ["number", "fruit"]])
df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=index1)
index2 = pd.MultiIndex.from_product([["variables"], ["fruit", "color"]])
df2 = pd.DataFrame([["banana", "yellow"]], columns=index2)
print(df1.merge(df2, on="fruit", how="left"))
I get a KeyError. How do I correctly reference variables.fruit here?
To understand what I am after, consider the same problem without a multi index:
import pandas as pd
df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=["number", "fruit"])
df2 = pd.DataFrame([["banana", "yellow"]], columns=["fruit", "color"])
# this is obviously incorrect as it uses indexes on `df1` as well as `df2`:
print(df1.join(df2, rsuffix="_"))
# this is *also* incorrect, although I initially thought it should work, but it uses the index on `df2`:
print(df1.join(df2, on="fruit", rsuffix="_"))
# this is correct:
print(df1.merge(df2, on="fruit", how="left"))
The expected and wanted result is this:
  number   fruit   color
0    one   apple     NaN
1    two  banana  yellow
How do I get the same when fruit is part of a multi index?

I think I understand what you are trying to accomplish now, and I don't think join is going to get you there. Both DataFrame.join and DataFrame.merge make a call to pandas.core.reshape.merge.merge, but using DataFrame.merge gives you more control over what defaults are applied.
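As a rough illustration of the difference, using the flat (non-hierarchical) frames from the question: join with on= matches the caller's column against the other frame's index, so it only lines up with merge once df2 is indexed by fruit. This is a minimal sketch, and the set_index step is my addition, not something from the question:

```python
import pandas as pd

df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=["number", "fruit"])
df2 = pd.DataFrame([["banana", "yellow"]], columns=["fruit", "color"])

# merge matches the "fruit" column of df1 against the "fruit" column of df2
merged = df1.merge(df2, on="fruit", how="left")

# join matches the "fruit" column of df1 against the *index* of the other frame,
# so df2 has to be indexed by "fruit" first (assumption: "fruit" is unique in df2)
joined = df1.join(df2.set_index("fruit"), on="fruit")

print(merged)
print(joined)
```

Both calls should give the same three columns, which is why merge is usually the less surprising choice when the key is an ordinary column on both sides.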
In your case, you can reference the columns to join on via a list of tuples, where the elements of each tuple are the levels of the multi-indexed column. I.e. to use the variables / fruit column, you can pass [('variables', 'fruit')].
Using tuples is how you index into multi-index columns (and row indices). You need to wrap it in a list because merge operations can be performed using multiple columns, or multiple multi-indexed columns, like a JOIN statement in SQL. Passing a single string is just a convenience case that gets wrapped in a list for you.
Since you are only joining on 1 column, it is a list of a single tuple.
import pandas as pd
index1 = pd.MultiIndex.from_product([["variables"], ["number", "fruit"]])
df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=index1)
index2 = pd.MultiIndex.from_product([["variables"], ["fruit", "color"]])
df2 = pd.DataFrame([["banana", "yellow"]], columns=index2)
df1.merge(df2, how='left', on=[('variables', 'fruit')])
# returns:
  variables
     number   fruit   color
0       one   apple     NaN
1       two  banana  yellow
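The same idea extends to several key columns: each key is one tuple, and the list holds as many tuples as there are keys. The sketch below is only an illustration; the extra number column in df2 is an assumption added so that there are two keys to join on.

```python
import pandas as pd

cols1 = pd.MultiIndex.from_product([["variables"], ["number", "fruit"]])
df1 = pd.DataFrame([["one", "apple"], ["two", "banana"]], columns=cols1)

# Assumption for illustration: df2 also carries a "number" column
cols2 = pd.MultiIndex.from_product([["variables"], ["number", "fruit", "color"]])
df2 = pd.DataFrame([["two", "banana", "yellow"]], columns=cols2)

# A tuple addresses a single column across all levels of the column MultiIndex
fruit = df1[("variables", "fruit")]

# One tuple per key column, all wrapped in a list
keys = [("variables", "number"), ("variables", "fruit")]
merged = df1.merge(df2, how="left", on=keys)
print(merged)
```

Passing a bare tuple instead of a list would be read as two separate keys ("variables" and "fruit"), which is why the list wrapper matters even for a single key.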

Related

Joining two dataframes then combining data in fields with same name using Pandas

I have seven dataframes with hundreds of rows each (don't ask) that I need to combine on a column. I know how to use the inner join functionality in Pandas. What I need help with is that there are instances where these seven dataframes have columns with the same names. In those instances, I would like to combine the data therein and delimit it with a semicolon.
For example, if Row 1 in DF1 through DF7 have the same identifier, I would like Col1 in each dataframe (given they have the same name) to be combined to read:
dfdata1; dfdata2; ...;dfdata7
In cases where a column name is unique, I'd like it to appear in the final combined dataframe.
I've included a simple example
import pandas as pd
data1 = pd.DataFrame([['Banana', 'Sally', 'CA'], ['Apple', 'Gretta', 'MN'], ['Orange', 'Samantha', 'NV']],
columns=['Product', 'Cashier', 'State'])
data2 = pd.DataFrame([['Shirt', '', 'CA', None], ['Shoe', 'Trish', 'MN', None], ['Socks', 'Paula', 'NM', 'Hourly']],
columns=['Product', 'Cashier', 'State', 'Type'])
This yields two dataframes:
If we were to use an outer merge on state:
pd.merge(data1,data2,on='State',how='outer')
What I want is something more like this:
Is this doable in pandas, or will I have to merge the first two, combine the columns with the same names, then combine THAT with the third one, and so on? I'm trying to be as efficient as possible.
Instead of merging, concatenate
# concatenate and groupby to join the strings
df = pd.concat([data1, data2]).groupby('State', as_index=False).agg(lambda x: '; '.join(el for el in x if pd.notna(el)))
print(df)
  State        Product        Cashier    Type
0    CA  Banana; Shirt        Sally;
1    MN    Apple; Shoe  Gretta; Trish
2    NM          Socks          Paula  Hourly
3    NV         Orange       Samantha
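Since the question mentions seven dataframes, the same concatenate-then-group idea can be wrapped in a small helper that takes any number of frames. This is only a sketch: combine_frames is a hypothetical name, and the three tiny frames are made-up data, not the asker's.

```python
import pandas as pd

def combine_frames(frames, key):
    """Stack the frames, then join the non-missing values of each column per key."""
    stacked = pd.concat(frames, ignore_index=True)
    return stacked.groupby(key, as_index=False).agg(
        lambda col: "; ".join(str(v) for v in col if pd.notna(v))
    )

# Made-up example data (assumption): columns overlap only partially
data1 = pd.DataFrame({"State": ["CA", "MN"], "Product": ["Banana", "Apple"]})
data2 = pd.DataFrame({"State": ["CA", "NM"], "Product": ["Shirt", "Socks"],
                      "Type": [None, "Hourly"]})
data3 = pd.DataFrame({"State": ["MN"], "Product": ["Shoe"]})

combined = combine_frames([data1, data2, data3], key="State")
print(combined)
```

Columns that exist in only one frame survive the concat as mostly-missing columns, so they still show up in the result, and groups with no value for them end up with an empty string.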

How to get Python function to update original df [duplicate]

I have a large data frame df and a small data frame df_right with 2 columns a and b. I want to do a simple left join / lookup on a without copying df.
I came up with this code, but I am not sure how robust it is:
dtmp = pd.merge(df[['a']], df_right, on = 'a', how = "left") #one col left join
df['b'] = dtmp['b'].values
I know it certainly fails when there are duplicated keys: pandas left join - why more results?
Is there better way to do this?
Related:
Outer merging two data frames in place in pandas
What are the exact downsides of copy=False in DataFrame.merge()?
You are almost there.
There are 4 cases to consider:
1. Neither df nor df_right has duplicated keys
2. Only df has duplicated keys
3. Only df_right has duplicated keys
4. Both df and df_right have duplicated keys
Your code fails in cases 3 and 4 because the merge then returns more rows than df has. To make it work, you need to choose which information to drop from df_right before merging, so that every merge falls under case 1 or 2.
For example, if you wish to keep "first" values for each duplicated key in df_right, the following code works for all 4 cases above.
dtmp = pd.merge(df[['a']], df_right.drop_duplicates('a', keep='first'), on='a', how='left')
df['b'] = dtmp['b'].values
Alternatively, if column 'b' of df_right consists of numeric values and you wish to have summary statistic:
dtmp = pd.merge(df[['a']], df_right.groupby('a').mean().reset_index(drop=False), on='a', how='left')
df['b'] = dtmp['b'].values
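To make the failure mode concrete, here is a small sketch with made-up data in which df_right has a duplicated key (case 3). It also uses the validate argument of pd.merge, which is not part of the code above, to show how the problem can be detected instead of silently producing extra rows.

```python
import pandas as pd

# Made-up data (assumption): key 2 appears twice in df_right
df = pd.DataFrame({"a": [1, 2, 2, 3]})
df_right = pd.DataFrame({"a": [1, 2, 2], "b": ["x", "y", "z"]})

# Naive left merge: every left row with a == 2 matches two right rows
naive = pd.merge(df[["a"]], df_right, on="a", how="left")
print(len(df), len(naive))  # the merged frame is longer than df

# validate="many_to_one" makes pandas raise instead of adding rows
try:
    pd.merge(df[["a"]], df_right, on="a", how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

# Dropping duplicates in df_right first keeps the row count of df
deduped = df_right.drop_duplicates("a", keep="first")
dtmp = pd.merge(df[["a"]], deduped, on="a", how="left")
df["b"] = dtmp["b"].values
print(df)
```

Assigning through .values relies on the left merge keeping the row order of df, so checking the lengths (or using validate) before the assignment is a cheap safeguard.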

How to merge two pandas dataframes on column of sets

I have columns in two dataframes representing interacting partners in a biological system, so if gene_A interacts with gene_B, the entry in column 'gene_pair' would be {gene_A, gene_B}. I want to do an inner join, but trying:
pd.merge(df1, df2, how='inner', on=['gene_pair'])
throws the error
TypeError: type object argument after * must be a sequence, not itertools.imap
I need to merge on the unordered pair, so as far as I can tell I can't merge on a combination of two individual columns with gene names. Is there another way to achieve this merge?
Some example dfs:
gene_pairs1 = [
set(['gene_A','gene_B']),
set(['gene_A','gene_C']),
set(['gene_D','gene_A'])
]
df1 = pd.DataFrame({'r_name': ['r1','r2','r3'], 'gene_pair': gene_pairs1})
gene_pairs2 = [
set(['gene_A','gene_B']),
set(['gene_F','gene_A']),
set(['gene_C','gene_A'])
]
df2 = pd.DataFrame({'function': ['f1','f2','f3'], 'gene_pair': gene_pairs2})
pd.merge(df1,df2,how='inner',on=['gene_pair'])
and I would like entry 'r1' to line up with 'f1' and 'r2' to line up with 'f3'.
Pretty simple in the end: I used frozenset, rather than set.
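For reference, a sketch of what that looks like with the example data from the question. The reasoning, as far as I can tell, is that a frozenset is hashable and compares equal regardless of element order, which is what the merge needs from a key; a plain set is not hashable.

```python
import pandas as pd

df1 = pd.DataFrame({
    "r_name": ["r1", "r2", "r3"],
    "gene_pair": [
        frozenset(["gene_A", "gene_B"]),
        frozenset(["gene_A", "gene_C"]),
        frozenset(["gene_D", "gene_A"]),
    ],
})
df2 = pd.DataFrame({
    "function": ["f1", "f2", "f3"],
    "gene_pair": [
        frozenset(["gene_A", "gene_B"]),
        frozenset(["gene_F", "gene_A"]),
        frozenset(["gene_C", "gene_A"]),
    ],
})

# The unordered pairs match even though the genes were listed in a different order
merged = pd.merge(df1, df2, how="inner", on="gene_pair")
print(merged[["r_name", "function"]])
```

If the data already holds plain sets, something like df["gene_pair"].apply(frozenset) should convert the column before merging.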
I suggest you add an extra ID column for each pair and then join on that.
For example:
# build an order-independent key from the sorted gene names
df2['gp'] = df2.gene_pair.apply(lambda x: '|'.join(sorted(x)))
df1['gp'] = df1.gene_pair.apply(lambda x: '|'.join(sorted(x)))
pd.merge(df1, df2[['function','gp']],how='inner',on=['gp']).drop('gp', axis=1)

Why does referencing a concatenated pandas dataframe return multiple entries?

When I create a dataframe using concat like this:
import pandas as pd
dfa = pd.DataFrame({'a':[1],'b':[2]})
dfb = pd.DataFrame({'a':[3],'b':[4]})
dfc = pd.concat([dfa,dfb])
And I try to reference like I would for any other DataFrame I get the following result:
>>> dfc['a'][0]
0 1
0 3
Name: a, dtype: int64
I would expect my concatenated DataFrame to behave like a normal DataFrame and return the integer that I want like this simple DataFrame does:
>>> dfa['a'][0]
1
I am just a beginner, is there a simple explanation for why the same call is returning an entire DataFrame and not the single entry that I want? Or, even better, an easy way to get my concatenated DataFrame to respond like a normal DataFrame when I try to reference it? Or should I be using something other than concat?
You've mistaken what normal behavior is. dfc['a'][0] is a label lookup, so it matches every row whose index value is 0. There are two such rows because you concatenated two dataframes that each have an index value of 0.
To select by position instead, use iloc:
dfc['a'].iloc[0]
or you could have constructed dfc like
dfc = pd.concat([dfa,dfb], ignore_index=True)
dfc['a'][0]
Both returning
1
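A short sketch that puts the three lookups side by side, using the frames from the question. It uses .loc for the label lookup to make the intent explicit; the variable names are mine.

```python
import pandas as pd

dfa = pd.DataFrame({"a": [1], "b": [2]})
dfb = pd.DataFrame({"a": [3], "b": [4]})

dfc = pd.concat([dfa, dfb])
print(dfc.index.tolist())  # both rows keep their original label 0

by_label = dfc["a"].loc[0]      # label lookup: a Series with every row labelled 0
by_position = dfc["a"].iloc[0]  # positional lookup: only the first element

# With ignore_index=True the result gets a fresh 0..n-1 index,
# so label and position coincide again
dfc_fresh = pd.concat([dfa, dfb], ignore_index=True)
print(dfc_fresh["a"].loc[1])
```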
EDITED (thx piRSquared's comment)
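Use append() instead of pd.concat() (note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this only works on older versions):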
dfc = dfa.append(dfb, ignore_index=True)
dfc['a'][0]
1

pandas appending df1 to df2 get 0s/NaNs in result

I have 2 dataframes. df1 comprises a Series of values.
df1 = pd.DataFrame({'winnings': cumsums_winnings_s, 'returns':cumsums_returns_s, 'spent': cumsums_spent_s, 'runs': cumsums_runs_s, 'wins': cumsums_wins_s, 'expected': cumsums_expected_s}, columns=["winnings", "returns", "runs", "wins", "expected"])
df2 runs each row through a function which takes 3 columns and produces a result for each row - specialSauce
df2= pd.DataFrame(list(map(lambda w,r,e: doStuff(w,r,e), df1['wins'], df1['runs'], df1['expected'])), columns=["specialSauce"])
print(df2.append(df1))
produces all the df1 columns but NaN for the df1 (and vice versa if df1/df2 switched in append)
So the problem I have is how to append these 2 dataframes correctly.
As I understand things, your issue seems to be related to the fact that you get NaN's in the result DataFrame.
The reason for this is that you are trying to .append() one dataframe to the other while they don't have the same columns.
df2 has one extra column, the one created with map() and doStuff, while df1 does not have that column. When you append one pd.DataFrame to the other, the result has all the columns of both pd.DataFrame objects. Naturally, you will have some NaN's for ['specialSauce'] since this column does not exist in df1.
This would be the same if you were to use pd.concat(); both methods do the same thing in this case. The one thing that you could do to bring the result closer to your desired result is use the ignore_index flag like this:
>> df2.append(df1, ignore_index=True)
This would at least give you a 'fresh' index for the result pd.DataFrame.
EDIT
If what you're looking for is to "append" the result of doStuff to the end of your existing df, in the form of a new column (['specialSauce']), then what you'll have to do is use pd.concat() like this:
>> pd.concat([df1, df2], axis=1)
This will return the result pd.DataFrame as you want it.
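To see the difference between the two directions, here is a minimal sketch with made-up numbers (the column names follow the question, the values are assumptions). Stacking along rows produces the NaN blocks described above, while axis=1 places the frames side by side.

```python
import pandas as pd

# Made-up values (assumption); only the column names come from the question
df1 = pd.DataFrame({"wins": [1, 2], "runs": [3, 4]})
df2 = pd.DataFrame({"specialSauce": [4, 6]})

# Row-wise (default axis=0): rows are stacked, missing columns are filled with NaN
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)

# Column-wise (axis=1): rows are aligned on the index, columns sit side by side
side_by_side = pd.concat([df1, df2], axis=1)
print(side_by_side)
```

The column-wise version only lines up as expected when both frames share the same index, which is the case here because both use the default 0..n-1 index.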
If you had a pd.Series to add to the columns of df1 then you'd need to add it like this:
>> df1['specialSauce'] = <'specialSauce values'>
I hope that helps, if not please rephrase the description of what you're after.
Ok, there are a couple of things going on here. You've left code out and I had to fill in the gaps. For example you did not define doStuff, so I had to.
doStuff = lambda w, r, e: w + r + e
With that defined, your code does not run. I had to guess what you were trying to do. I'm guessing that you want to have an additional column called 'specialSauce' adjacent to your other columns.
So, this is how I set it up and solved the problem.
Setup and Solution
import pandas as pd
import numpy as np
np.random.seed(314)
df = pd.DataFrame(np.random.randn(100, 6),
columns=["winnings", "returns",
"spent", "runs",
"wins", "expected"]).cumsum()
doStuff = lambda w, r, e: w + r + e
df['specialSauce'] = df[['wins', 'runs', 'expected']].apply(lambda x: doStuff(*x), axis=1)
print(df.head())
winnings returns spent runs wins expected specialSauce
0 0.166085 0.781964 0.852285 -0.707071 -0.931657 0.886661 -0.752067
1 -0.055704 1.163688 0.079710 0.155916 -1.212917 -0.045265 -1.102266
2 -0.554241 1.928014 0.271214 -0.462848 0.452802 1.692924 1.682878
3 0.627985 3.047389 -1.594841 -1.099262 -0.308115 4.356977 2.949601
4 0.796156 3.228755 -0.273482 -0.661442 -0.111355 2.827409 2.054611
Also
You tried to use pd.DataFrame.append(). Per the linked documentation, it attaches the DataFrame specified as the argument to the end of the DataFrame that is being appended to. You would have wanted to use pd.concat().
