`pandas.merge` not recognising same index - python

I have two DataFrames with overlapping columns but identical indexes and I want to combine them. I feel like this should be straightforward, but I have worked through so many examples and SO questions and it's not working, and the behaviour also doesn't seem consistent with other examples.
import pandas as pd
# create test data
df = pd.DataFrame({'gen1': [1, 0, 0, 1, 1], 'gen3': [1, 0, 0, 1, 0], 'gen4': [0, 1, 1, 0, 1]}, index = ['a', 'b', 'c', 'd', 'e'])
df1 = pd.DataFrame({'gen1': [1, 0, 0, 1, 1], 'gen2': [0, 1, 1, 1, 1], 'gen3': [1, 0, 0, 1, 0]}, index = ['a', 'b', 'c', 'd', 'e'])
In [1]: df
Out[1]:
gen1 gen3 gen4
a 1 1 0
b 0 0 1
c 0 0 1
d 1 1 0
e 1 0 1
In [2]: df1
Out[2]:
gen1 gen2 gen3
a 1 0 1
b 0 1 0
c 0 1 0
d 1 1 1
e 1 1 0
After working through all the examples here (https://pandas.pydata.org/pandas-docs/stable/merging.html) I'm convinced I have found the right examples (the first and second merge examples). The second example is this:
In [43]: result = pd.merge(left, right, on=['key1', 'key2'])
In their example they have two DataFrames (left and right) that have overlapping columns and identical indexes, and their resulting DataFrame has one version of each column and the original indexes. But this is not what happens when I do that:
# get the intersection of columns (I need this to be general)
In [3]: column_intersection = list(set(df).intersection(set(df1)))
In [4]: pd.merge(df, df1, on=column_intersection)
Out[4]:
gen1 gen3 gen4 gen2
0 1 1 0 0
1 1 1 0 1
2 1 1 0 0
3 1 1 0 1
4 0 0 1 1
5 0 0 1 1
6 0 0 1 1
7 0 0 1 1
8 1 0 1 1
Here we see that merge has not noticed that the indexes are the same! I have fiddled around with the options but cannot get the result I want.
A similar but different question was asked here: "How to keep index when using pandas merge", but I don't really understand the answers and so can't relate them to my problem.
Points for this specific example:
Index will always be identical.
Columns with the same name will always have identical entries (i.e. they are duplicates).
It would be great to have a solution for this specific problem, but I would also really like to understand it, because I find myself spending a lot of time combining DataFrames. I love pandas and in general find it very intuitive, but I just can't seem to get comfortable with anything other than trivial combinations of DataFrames.

Starting with v0.23, you can specify an index name for the join key, if you have one:
df.index.name = df1.index.name = 'idx'
df.merge(df1, on=list(set(df).intersection(set(df1)) | {'idx'}))
gen1 gen3 gen4 gen2
idx
a 1 1 0 0
b 0 0 1 1
c 0 0 1 1
d 1 1 0 1
e 1 0 1 1
The assumption here is that your actual DataFrames do not have exactly the same values in overlapping columns. If they do, then your question is one of concatenation, and you can use pd.concat for that:
c = list(set(df).intersection(set(df1)))
pd.concat([df1, df.drop(columns=c)], axis=1)
gen1 gen2 gen3 gen4
a 1 0 1 0
b 0 1 0 1
c 0 1 0 1
d 1 1 1 0
e 1 1 0 1
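Since the overlapping columns here are exact duplicates, another approach worth knowing is combine_first, which aligns on both the index and the columns. A minimal sketch (my addition, not part of the original answer):
# combine_first aligns df and df1 on index and columns;
# values from df take priority, df1 only fills what df lacks (here: gen2)
out = df.combine_first(df1)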

In this special case, you can use assign.
Columns in df take priority, but all other columns in df1 are included.
df1.assign(**df)
gen1 gen2 gen3 gen4
a 1 0 1 0
b 0 1 0 1
c 0 1 0 1
d 1 1 1 0
e 1 1 0 1
**df unpacks df in a dictionary context. This unpacking delivers keyword arguments to assign, with the column names as the keywords and the columns as the arguments.
It is the same as
df1.assign(gen1=df.gen1, gen3=df.gen3, gen4=df.gen4)
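For completeness, merge can also join on the index directly via left_index/right_index, which sidesteps the need for an index name. A sketch, assuming the overlapping columns are exact duplicates (the suffix-and-drop step is my own illustration, not from the original answers):
# join row-by-row on the index; duplicate columns from df1 get a '_dup' suffix
merged = df.merge(df1, left_index=True, right_index=True, suffixes=('', '_dup'))
# since the overlapping columns are identical, the suffixed copies can be dropped
merged = merged.drop(columns=[c for c in merged.columns if c.endswith('_dup')])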

Related

Split a column into multiple columns that has value as list

I have a problem splitting a column into multiple columns.
Column B contains lists of values.
I want to split the values of column B into separate columns, one per distinct value, where each cell holds the number of occurrences of that value in the row's list.
input:
A B
a [1, 2]
b [3, 4, 5]
c [1, 5]
expected output:
A 1 2 3 4 5
a 1 1 0 0 0
b 0 0 1 1 1
c 1 0 0 0 1
You can explode the column of lists and use crosstab:
df2 = df.explode('B')
out = pd.crosstab(df2['A'], df2['B']).reset_index().rename_axis(columns=None)
output:
A 1 2 3 4 5
0 a 1 1 0 0 0
1 b 0 0 1 1 1
2 c 1 0 0 0 1
used input:
df = pd.DataFrame({'A': list('abc'), 'B': [[1,2], [3,4,5], [1,5]]})
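An alternative sketch (my addition, using the same input): get_dummies on the exploded column, summed back to one row per original index, gives the same occurrence counts:
df2 = df.explode('B')
# one dummy row per list element, then collapse back to one row per original row
counts = pd.get_dummies(df2['B']).groupby(level=0).sum()
out = df[['A']].join(counts)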

Pandas: How to pivot rows into columns

I am looking for a way to pivot around 600 columns into rows. Here's a sample with only 4 of those columns (good, bad, ok, Horrible):
df:
RecordID good bad ok Horrible
A 0 0 1 0
B 1 0 0 1
desired output:
RecordID Column Value
A Good 0
A Bad 0
A Ok 1
A Horrible 0
B Good 1
B Bad 0
B Ok 0
B Horrible 1
You can use the melt function:
(df.melt(id_vars='RecordID', var_name='Column', value_name='Value')
   .sort_values('RecordID')
   .reset_index(drop=True)
)
Output:
RecordID Column Value
0 A good 0
1 A bad 0
2 A ok 1
3 A Horrible 0
4 B good 1
5 B bad 0
6 B ok 0
7 B Horrible 1
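Note that the desired output keeps the original column order within each RecordID group. sort_values does not guarantee a stable sort by default, so if that ordering matters you can request one explicitly (a minimal tweak of the snippet above):
(df.melt(id_vars='RecordID', var_name='Column', value_name='Value')
   .sort_values('RecordID', kind='stable')  # stable sort preserves column order per record
   .reset_index(drop=True)
)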
You can use .stack() as follows. Using .stack() is preferable because it naturally produces rows already sorted in RecordID order, so you don't waste processing time sorting again; this is especially important when you have a large number of columns.
df = (df.set_index('RecordID')
        .stack()
        .reset_index()
        .rename(columns={'level_1': 'Column', 0: 'Value'}))
Output:
RecordID Column Value
0 A good 0
1 A bad 0
2 A ok 1
3 A Horrible 0
4 B good 1
5 B bad 0
6 B ok 0
7 B Horrible 1
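A slightly tidier variant of the same idea (my sketch): name the axes up front so no rename of 'level_1' and 0 is needed afterwards:
out = (df.set_index('RecordID')
         .rename_axis(columns='Column')  # name the columns axis before stacking
         .stack()
         .rename('Value')                # name the resulting Series
         .reset_index())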
Adding the DataFrame:
import pandas as pd
import numpy as np
data2 = {'RecordID': ['a', 'b', 'c'],
         'good': [0, 1, 1],
         'bad': [0, 0, 1],
         'horrible': [0, 1, 1],
         'ok': [1, 0, 0]}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data2)
Melt data:
https://pandas.pydata.org/docs/reference/api/pandas.melt.html
melted = df.melt(id_vars='RecordID', var_name='Column', value_name='Value')
melted
Optionally, group by for sum or mean values:
df2 = melted.groupby('Column')['Value'].sum()
df2

How to conditionally add one hot vector to a Pandas DataFrame

I have the following Pandas DataFrame in Python:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3], [3, 2, 1], [2, 1, 1]]),
                  columns=['a', 'b', 'c'])
df
It looks as the following when you output it:
a b c
0 1 2 3
1 3 2 1
2 2 1 1
I need to add 3 new columns, as column "d", column "e", and column "f".
Values in each new column will be determined based on the values of column "b" and column "c".
In a given row:
If the value of column "b" is bigger than the value of column "c", columns [d, e, f] will have the values [1, 0, 0].
If the value of column "b" is equal to the value of column "c", columns [d, e, f] will have the values [0, 1, 0].
If the value of column "b" is smaller than the value of column "c", columns [d, e, f] will have the values [0, 0, 1].
After this operation, the DataFrame needs to look as the following:
a b c d e f
0 1 2 3 0 0 1 # Since b smaller than c
1 3 2 1 1 0 0 # Since b bigger than c
2 2 1 1 0 1 0 # Since b = c
My original DataFrame is much bigger than the one in this example.
Is there a good way of doing this in Python without looping through the DataFrame?
You can use np.where to create a condition vector and str.get_dummies to create the dummies:
df['vec'] = np.where(df.b>df.c, 'd', np.where(df.b == df.c, 'e', 'f'))
df = df.assign(**df['vec'].str.get_dummies()).drop(columns='vec')
a b c d e f
0 1 2 3 0 0 1
1 3 2 1 1 0 0
2 2 1 1 0 1 0
Let us try np.sign with get_dummies, where -1 means c<b, 0 means c=b, and 1 means c>b:
df=df.join(np.sign(df.eval('c-b')).map({-1:'d',0:'e',1:'f'}).astype(str).str.get_dummies())
df
Out[29]:
a b c d e f
0 1 2 3 0 0 1
1 3 2 1 1 0 0
2 2 1 1 0 1 0
You can simply harness the Boolean conditions you've already specified:
df["d"] = np.where(df.b > df.c, 1, 0)
df["e"] = np.where(df.b == df.c, 1, 0)
df["f"] = np.where(df.b < df.c, 1, 0)

Creating a string from pandas column and row data

I am interested in generating a string composed of pandas row and column data. Given the following pandas DataFrame, I am interested only in generating a string from columns with positive values:
index A B C
1 0 1 2
2 0 0 3
3 0 0 0
4 1 0 0
I would like to create a new column that holds a string listing which columns in that row were positive, then drop the original columns and the rows with no positives:
index Positives
1 B-1, C-2
2 C-3
4 A-1
Here is one way using pd.DataFrame.apply + pd.Series.apply:
df = pd.DataFrame([[1, 0, 1, 2], [2, 0, 0, 3], [3, 0, 0, 0], [4, 1, 0, 0]],
                  columns=['index', 'A', 'B', 'C'])

def formatter(x):
    x = x[x > 0]  # keep only the positive entries (the 'index' entry is always positive)
    return (x.index[1:].astype(str) + '-' + x[1:].astype(str))  # [1:] skips the 'index' entry

df['Positives'] = df.apply(formatter, axis=1).apply(', '.join)
print(df)
index A B C Positives
0 1 0 1 2 B-1, C-2
1 2 0 0 3 C-3
2 3 0 0 0
3 4 1 0 0 A-1
If you need to filter out zero-length strings, you can use the fact that empty strings evaluate to False with bool:
res = df[df['Positives'].astype(bool)]
print(res)
index A B C Positives
0 1 0 1 2 B-1, C-2
1 2 0 0 3 C-3
3 4 1 0 0 A-1
I'd replace the zeros with np.nan to remove the entries you don't care about, then stack. Then form the strings you want and use groupby.apply(list):
import numpy as np
df = df.set_index('index') # if 'index' is not your index.
stacked = df.replace(0, np.nan).stack().reset_index()
stacked['Positives'] = stacked['level_1'] + '-' + stacked[0].astype(int).astype('str')
stacked = stacked.groupby('index').Positives.apply(list).reset_index()
stacked is now:
index Positives
0 1 [B-1, C-2]
1 2 [C-3]
2 4 [A-1]
Or if you just want one string and not a list, change the last line:
stacked.groupby('index').Positives.apply(lambda x: ', '.join(list(x))).reset_index()
# index Positives
#0 1 B-1, C-2
#1 2 C-3
#2 4 A-1
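A simpler row-wise sketch (my addition, assuming 'index' is still a regular column as in the first answer): build each string with an f-string per positive cell, then reuse the empty-string filter from the first answer:
def positives(row):
    # join 'column-value' pairs for the positive cells only
    return ', '.join(f'{col}-{val}' for col, val in row.items() if val > 0)

df['Positives'] = df.drop(columns='index').apply(positives, axis=1)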

Use a list to conditionally fill a new column based on values in multiple columns

I am trying to populate a new column within a pandas DataFrame by using values from several columns. The original columns are either 0 or 1, with exactly a single 1 per row. The new column's value is determined by which of the columns ['A', 'B', 'C', 'D'] holds the 1, using new_col = [1, 3, 7, 10] as shown below (a 1 in A means new_col = 1; a 1 in B means new_col = 3, etc.).
df
A B C D
1 1 0 0 0
2 0 0 1 0
3 0 0 0 1
4 0 1 0 0
The new df should look like this.
df
A B C D new_col
1 1 0 0 0 1
2 0 0 1 0 7
3 0 0 0 1 10
4 0 1 0 0 3
I've tried to use map, loc, and where but can't seem to formulate an efficient way to get it done. I've looked at a few similar posts, but none of them show how to use multiple columns conditionally to fill a new column based on a list.
I can think of a few ways, mostly involving argmax or idxmax, to get either an ndarray or a Series which we can use to fill the column.
We could drop down to numpy, find the maximum locations (where the 1s are) and use those to index into an array version of new_col:
In [148]: np.take(new_col,np.argmax(df.values,1))
Out[148]: array([ 1, 7, 10, 3])
We could make a Series with new_col as the values and the columns as the index, and index into that with idxmax:
In [116]: pd.Series(new_col, index=df.columns).loc[df.idxmax(1)].values
Out[116]: array([ 1, 7, 10, 3])
We could use get_indexer to turn the column idxmax results into integer offsets we can use with new_col:
In [117]: np.array(new_col)[df.columns.get_indexer(df.idxmax(axis=1))]
Out[117]: array([ 1, 7, 10, 3])
Or (and this seems very wasteful) we could make a new frame with the new columns and use idxmax directly:
In [118]: pd.DataFrame(df.values, columns=new_col).idxmax(1)
Out[118]:
0 1
1 7
2 10
3 3
dtype: int64
It's not the most elegant solution, but for me it beats the if/elif/elif loop:
d = {'A': 1, 'B': 3, 'C': 7, 'D': 10}

def new_col(row):
    k = row[row == 1].index.tolist()[0]  # the single column holding a 1
    return d[k]

df['new_col'] = df.apply(new_col, axis=1)
Output:
A B C D new_col
1 1 0 0 0 1
2 0 0 1 0 7
3 0 0 0 1 10
4 0 1 0 0 3
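Since each row is exactly one-hot, a matrix product is another minimal sketch (my addition, assuming numpy is imported as np):
import numpy as np
# the dot product of a one-hot row with new_col picks out the matching value
new_col = [1, 3, 7, 10]
df['new_col'] = df[['A', 'B', 'C', 'D']].to_numpy() @ np.array(new_col)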
