Here are my two DataFrames:
index = pd.MultiIndex.from_product([['a','b'],[1,2]],names=['one','two'])
df = pd.DataFrame({'col':[10,20,30,40]}, index = index)
df
         col
one two
a   1     10
    2     20
b   1     30
    2     40
index_1 = pd.MultiIndex.from_product([['a','b'],[1.,2],['abc','mno','xyz']], names = ['one','two','three'])
temp = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10,11,12]}, index = index_1)
temp
               col1
one two three
a   1.0 abc       1
        mno       2
        xyz       3
    2.0 abc       4
        mno       5
        xyz       6
b   1.0 abc       7
        mno       8
        xyz       9
    2.0 abc      10
        mno      11
        xyz      12
How can I merge the two of them?
I have tried this:
pd.merge(left = temp, right = df, left_on = temp.index.levels[0], right_on = df.index.levels[0])
but it does not work:
KeyError: "Index([u'a', u'b'], dtype='object', name=u'one') not in index"
If I convert the index into columns through reset_index(), then the merge works. However, I wish to achieve this while preserving the index structure.
My desired output keeps temp's three-level index, with df's col column added alongside col1.
Method 1: reset_index + merge
df.reset_index().merge(temp.reset_index()).set_index(index_1.names)
Method 2: join with a partial reset_index
df.join(temp.reset_index('three')).set_index('three', append=True)
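Method 2 can be checked end to end with a minimal runnable sketch. It rebuilds the two sample frames, assuming integer values for the 'two' level in both frames (the question's temp used floats for that level, which can complicate index alignment):

```python
import pandas as pd

# Rebuild the two sample frames from the question
index = pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=['one', 'two'])
df = pd.DataFrame({'col': [10, 20, 30, 40]}, index=index)

index_1 = pd.MultiIndex.from_product(
    [['a', 'b'], [1, 2], ['abc', 'mno', 'xyz']],
    names=['one', 'two', 'three'])
temp = pd.DataFrame({'col1': range(1, 13)}, index=index_1)

# Temporarily move 'three' out of the index so both frames share the
# ('one', 'two') index levels, join on them, then restore 'three'
out = df.join(temp.reset_index('three')).set_index('three', append=True)
print(out)
```

The join aligns on the two shared levels and repeats each row of df for every matching 'three' label, so the full three-level index structure is preserved.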
I was working with an inner join using concat in pandas, with the two DataFrames below:
df1 = pd.DataFrame([['a',1],['b',2]], columns=['letter','number'])
df3 = pd.DataFrame([['c',3,'cat'],['d',4,'dog']],
columns=['letter','number','animal'])
pd.concat([df1,df3], join='inner')
The output is below:
letter number
0 a 1
1 b 2
0 c 3
1 d 4
But after using axis=1, the output is as below:
pd.concat([df1,df3], join='inner', axis=1)
letter number letter number animal
0 a 1 c 3 cat
1 b 2 d 4 dog
Why is it showing the animal column when doing an inner join with axis=1?
In pandas.concat(), the axis argument defines whether to concatenate the DataFrames along the index or along the columns:
axis=0: along the index (default)
axis=1: along the columns
When you concatenated df1 and df3 with the default axis=0, pandas stacked the rows and, because of join='inner', kept only the columns common to both, so the output is
letter number
0 a 1
1 b 2
0 c 3
1 d 4
But when you used axis=1, pandas aligned the frames on the index and combined the columns side by side; that's why the output is
letter number letter number animal
0 a 1 c 3 cat
1 b 2 d 4 dog
EDIT:
You asked: "But an inner join only joins on the same columns, right? Then why is it showing the animal column?"
With axis=1, the join applies to the index labels, not the columns, and right now both DataFrames have the same two index labels, so nothing is dropped.
To illustrate, I have added another row to df3.
Let's suppose df3 is
0 1 2
0 c 3 cat
1 d 4 dog
2 e 5 bird
Now, If you concat the df1 and df3
pd.concat([df1,df3], join='inner', axis=1)
letter number 0 1 2
0 a 1 c 3 cat
1 b 2 d 4 dog
pd.concat([df1,df3], join='outer', axis=1)
letter number 0 1 2
0 a 1.0 c 3 cat
1 b 2.0 d 4 dog
2 NaN NaN e 5 bird
As you can see, with the inner join only index labels 0 and 1 appear in the output, but with the outer join all index labels appear, with NaN filling the missing values.
The default value of axis is 0, so in the first concat call the concatenation happens along the rows. When you set axis=1, the operation is similar to
df1.merge( df3, how="inner", left_index=True, right_index=True)
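A quick sketch rebuilding df1 and df3 from the question makes the comparison concrete. Note one difference: merge suffixes the duplicated letter and number column names with _x/_y, while concat keeps both copies as-is, so the two results are row-identical but not column-name-identical:

```python
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
                   columns=['letter', 'number', 'animal'])

# axis=1 with join='inner' keeps only index labels present in both frames
by_concat = pd.concat([df1, df3], join='inner', axis=1)

# The row selection is the same as an inner merge on the index
by_merge = df1.merge(df3, how='inner', left_index=True, right_index=True)

print(by_concat)
print(by_merge)
```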
I have a DataFrame and a list:
df = pd.read_csv('aa.csv')
temp = ['1','2','3','4','5','6','7']
My DataFrame has only 3 rows, and I am adding temp as a new column:
df['temp'] = pd.Series(temp)
But in the final df I only get the first 3 values of temp; the rest are dropped. Is there any way to add a list of a larger (or smaller) size as a new column to the DataFrame?
Thanks
Use DataFrame.reindex to create rows filled with missing values before creating the new column:
df = pd.read_csv('aa.csv')
temp = ['1','2','3','4','5','6','7']
df = df.reindex(range(len(temp)))
df['temp'] = pd.Series(temp)
Sample:
df = pd.DataFrame({'A': [1,2,3]})
print(df)
A
0 1
1 2
2 3
temp = ['1','2','3','4','5','6','7']
df = df.reindex(range(len(temp)))
df['temp']=pd.Series(temp)
print (df)
A temp
0 1.0 1
1 2.0 2
2 3.0 3
3 NaN 4
4 NaN 5
5 NaN 6
6 NaN 7
Or use concat with a Series, specifying name for the new column name:
s = pd.Series(temp, name='temp')
df = pd.concat([df, s], axis=1)
Similarly:
s = pd.Series(temp)
df = pd.concat([df, s.rename('temp')], axis=1)
print (df)
A temp
0 1.0 1
1 2.0 2
2 3.0 3
3 NaN 4
4 NaN 5
5 NaN 6
6 NaN 7
I want to add two columns from two different DataFrames, on the condition that the name in column a is the same:
import pandas as pd
df1 = pd.DataFrame([("Apple",2),("Litchi",4),("Orange",6)], columns=['a','b'])
df2 = pd.DataFrame([("Apple",200),("Orange",400),("Litchi",600)], columns=['a','c'])
Now I want to add columns b and c where the name in a is the same.
I tried df1['b+c'] = df1['b'] + df2['c'], but it simply adds b and c positionally, so the result comes out as
a b b+c
0 Apple 2 202
1 Litchi 4 404
2 Orange 6 606
but I want
a b+c
0 Apple 202
1 Litchi 604
2 Orange 406
I guess I have to use isin, but I am not sure how.
Columns b and c are aligned by index values in the sum operation, so it is necessary to first create the index from column a with DataFrame.set_index:
s1 = df1.set_index('a')['b']
s2 = df2.set_index('a')['c']
df1 = (s1+s2).reset_index(name='b+c')
print (df1)
a b+c
0 Apple 202
1 Litchi 604
2 Orange 406
EDIT: If you need the original value for unmatched keys, use Series.add with the parameter fill_value=0:
df2 = pd.DataFrame([("Apple",200),("Apple",400),("Litchi",600)], columns=['a','c'])
print (df2)
a c
0 Apple 200
1 Apple 400
2 Litchi 600
s1 = df1.set_index('a')['b']
s2 = df2.set_index('a')['c']
df1 = s1.add(s2, fill_value=0).reset_index(name='b+c')
print (df1)
a b+c
0 Apple 202.0
1 Apple 402.0
2 Litchi 604.0
3 Orange 6.0
I have a pandas dataframe df like this, say
ID activity date
1 A 4
1 B 8
1 A 12
1 C 12
2 B 9
2 A 10
3 A 3
3 D 4
and I would like to return a table that counts the number of occurrences of the activities in a given list, say l = ['A', 'B'] in this case; then
ID activity(count)_A activity(count)_B
1 2 1
2 1 1
3 1 0
is what I need.
What is the quickest way to do that, ideally without a for loop?
Thanks!
Edit: I know the pivot functions can do this kind of job, but in my case I have many more activity types than the ones I actually need to count in the list l. Is it still optimal to use pivot?
You can use isin with boolean indexing as a first step and then pivot. The fastest should be groupby + size + unstack, then pivot_table, and last crosstab, but it is best to test each solution with real data:
df2 = (df[df['activity'].isin(['A','B'])]
.groupby(['ID','activity'])
.size()
.unstack(fill_value=0)
.add_prefix('activity(count)_')
.reset_index()
.rename_axis(None, axis=1))
print (df2)
ID activity(count)_A activity(count)_B
0 1 2 1
1 2 1 1
2 3 1 0
Or:
df1 = df[df['activity'].isin(['A','B'])]
df2 = (pd.crosstab(df1['ID'], df1['activity'])
.add_prefix('activity(count)_')
.reset_index()
.rename_axis(None, axis=1))
Or:
df2 = (df[df['activity'].isin(['A','B'])]
.pivot_table(index='ID', columns='activity', aggfunc='size', fill_value=0)
.add_prefix('activity(count)_')
.reset_index()
.rename_axis(None, axis=1))
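All three variants should agree on the question's sample data; a quick sketch rebuilding the frame and comparing the groupby and crosstab versions:

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 3, 3],
                   'activity': ['A', 'B', 'A', 'C', 'B', 'A', 'A', 'D'],
                   'date': [4, 8, 12, 12, 9, 10, 3, 4]})
l = ['A', 'B']

# Filter first so only the activities of interest are counted
filtered = df[df['activity'].isin(l)]

g = (filtered.groupby(['ID', 'activity'])
     .size()
     .unstack(fill_value=0)
     .add_prefix('activity(count)_')
     .reset_index()
     .rename_axis(None, axis=1))

c = (pd.crosstab(filtered['ID'], filtered['activity'])
     .add_prefix('activity(count)_')
     .reset_index()
     .rename_axis(None, axis=1))

print(g)
```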
I believe grouping on both keys, df.groupby(['ID', 'activity']).size().unstack(fill_value=0),
should do as you expect.
Just aggregate with Counter and use the pd.DataFrame default constructor:
from collections import Counter
agg_ = df.groupby('ID')['activity'].agg(Counter).tolist()
ndf = pd.DataFrame(agg_)
A B C D
0 2 1.0 1.0 NaN
1 1 1.0 NaN NaN
2 1 NaN NaN 1.0
If you have l = ['A', 'B'], just filter
ndf[l]
A B
0 2 1.0
1 1 1.0
2 1 NaN
To be concrete, say we have two DataFrames:
df1:
date A
12/1/14 3
12/2/14 NaN
12/3/14 2
12/2/14 NaN
12/4/14 NaN
12/6/14 5
df2:
B
12/2/14 20
12/4/14 30
I want to do a kind of left outer join to fill the missing values in df1 and generate
df3:
date A
12/1/14 3
12/2/14 20
12/3/14 2
12/2/14 20
12/4/14 30
12/6/14 5
Is there an efficient way to do it?
You can use combine_first (only the column names need to match, therefore I first rename column B in df2):
In [8]: df2 = df2.rename(columns={'B':'A'})
In [9]: df1.combine_first(df2)
Out[9]:
A
12/1/14 3
12/2/14 20
12/2/14 20
12/3/14 2
12/4/14 30
12/6/14 5
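Since the In/Out session above omits the setup, here is a self-contained sketch. It assumes the dates form the index of both frames, and drops the duplicated 12/2/14 row from the question for simplicity:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A': [3, np.nan, 2, np.nan, 5]},
                   index=['12/1/14', '12/2/14', '12/3/14',
                          '12/4/14', '12/6/14'])
df2 = pd.DataFrame({'B': [20, 30]}, index=['12/2/14', '12/4/14'])

# combine_first matches on both index and column labels,
# so rename B to A before filling the gaps in df1
df3 = df1.combine_first(df2.rename(columns={'B': 'A'}))
print(df3)
```

Values from df1 take priority; df2 only fills positions where df1 is NaN, which is exactly the "left outer join to fill missing values" the question asks for.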