Augment DataFrame index - python

I want to copy a column ('b') from one DataFrame (df2) into another (df1). Both DataFrames use the same index column, but df2's index extends a bit further and is missing some of df1's labels.
This is the current behaviour:
>>> import pandas as pd
>>> pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
a b
0 1 4
1 2 5
2 3 6
>>>
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df1 = df.set_index(['a'])
>>> df1
b
a
1 4
2 5
3 6
>>> dg = pd.DataFrame({'a': [3, 4, 5], 'b': [7, 8, 9]})
>>> dg
a b
0 3 7
1 4 8
2 5 9
>>> df2 = dg.set_index('a')
>>> df2
b
a
3 7
4 8
5 9
>>> df1['b'] = df2['b']
>>> df1
b
a
1 NaN
2 NaN
3 7.0
When I call df1['b'] = df2['b'], the rows of df1 whose labels are not in df2 become NaN, and the labels of df2 that aren't in df1 are not carried over into df1.
Is there any way to change this behaviour so that the resulting DataFrame is the below?
>>> df1
b
a
1 4
2 5
3 7
4 8
5 9

This is a use case for combine_first. It will prioritize the calling dataframe and fill in any missing values with the second. It will also concatenate rows from the second data frame that don't have labels in the first.
df2.combine_first(df1)
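As a minimal sketch, here is that call run against the frames from the question (rebuilt here so the snippet runs on its own). Because df2 is the caller, its values win where the labels overlap and the remaining rows come from df1:

```python
import pandas as pd

# Rebuild the two frames from the question, both indexed by 'a'.
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}).set_index('a')
df2 = pd.DataFrame({'a': [3, 4, 5], 'b': [7, 8, 9]}).set_index('a')

# df2 is the caller, so its values win on shared labels (a == 3);
# labels found only in df1 (a == 1, 2) are filled in from df1.
result = df2.combine_first(df1)
print(result)
```

Depending on the pandas version, the 'b' column may come back as float rather than int, since the alignment step introduces NaN along the way.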

Another option is to reindex() df2 on the union of both indexes and then fill the missing values from df1:
df2 = df2.reindex(df1.index.union(df2.index))
df2['b'] = df2['b'].fillna(df1['b'])
df2
# b
#a
#1 4.0
#2 5.0
#3 7.0
#4 8.0
#5 9.0

Related

How to merge dataframes on one column while aligning the other columns in common

Consider two DataFrames:
>>> import numpy as np
>>> import pandas as pd
>>> df1 = pd.DataFrame({'key': [1, 2, 3, 4, 5],
...                     'bar': ['w', 'x', 'y', 'z', 'h'],
...                     'foo': ['A', 'B', 'C', 'D', 'E']})
>>> df2 = pd.DataFrame({'key': [1, 2, 3, 8, 9, 10],
...                     'foo': [np.nan, np.nan, np.nan, 'I', 'J', 'K']})
Imagine we want to join the DataFrames on 'key' so that only the keys in df1 are returned, plus those keys in df2 that are greater than 8. You can do this by
first doing a left join via df3 = pd.merge(df1,df2,on='key',how='left')
Then, doing an outer join with a slice of df2 via df4 = pd.merge(df3,df2.loc[df2['key']>8],on='key',how='outer')
However, rather than aligning the 'foo' columns from each DataFrame, each 'foo' column is added to df4 as a separate column, with suffixes to distinguish them. Several more lines of code are then needed to combine the three 'foo' columns into a single 'foo' column. Is there a more concise way to do this?
EDIT:
I guess my example belies the true question. Let's use these DataFrames:
>>> df1 = pd.DataFrame({'key': [1, 2, 3, 4, 5],
'bar': ['w','x','y','z','h'],
'foo': [np.nan, np.nan, 'C', 'D','E'],})
>>> df2 = pd.DataFrame({'key': [1, 2, 3, 8, 9, 10],
'foo': ['A', 'B', np.nan, 'I','J','K']})
If I use a left and then outer join as described above, I will get this...
key bar foo_x foo_y foo
0 1 w NaN A NaN
1 2 x NaN B NaN
2 3 y C NaN NaN
3 4 z D NaN NaN
4 5 h E NaN NaN
5 9 NaN NaN NaN J
6 10 NaN NaN NaN K
Because combining the three 'foo' columns will require many lines of code, I am wondering if there is a more concise way to do all this. That is, merge the two DataFrames and combine the 'foo' columns such that the returned DataFrame is this:
key bar foo
0 1 w A
1 2 x B
2 3 y C
3 4 z D
4 5 h E
5 9 NaN J
6 10 NaN K
Let's try concat and groupby:
(pd.concat((df1, df2.query('key>8')))
.groupby('key',as_index=False).first()
)
Output:
key foo bar
0 1 A w
1 2 B x
2 3 C y
3 4 D z
4 5 E h
5 9 J NaN
6 10 K NaN
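Note that with the frames from the EDIT, filtering df2 down to key > 8 drops the rows that carry 'A' and 'B', so 'foo' would stay NaN for keys 1 and 2. A possible variation, sketched here under the assumption that df2 rows matching a key in df1 should also contribute values (the name wanted is only illustrative), is:

```python
import numpy as np
import pandas as pd

# Frames from the EDIT in the question.
df1 = pd.DataFrame({'key': [1, 2, 3, 4, 5],
                    'bar': ['w', 'x', 'y', 'z', 'h'],
                    'foo': [np.nan, np.nan, 'C', 'D', 'E']})
df2 = pd.DataFrame({'key': [1, 2, 3, 8, 9, 10],
                    'foo': ['A', 'B', np.nan, 'I', 'J', 'K']})

# Keep df2 rows whose key also appears in df1, plus the rows with key > 8.
wanted = df2[df2['key'].isin(df1['key']) | (df2['key'] > 8)]

# Stack both frames, then take the first non-null value
# of every column within each key.
merged = (pd.concat([df1, wanted])
            .groupby('key', as_index=False)
            .first())
print(merged)
```

This relies on groupby(...).first() skipping null values, so a NaN in df1 is replaced by the value from df2 for the same key.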

sum values in different rows and columns dataframe python

My Data Frame
A B C D
2 3 4 5
1 4 5 6
5 6 7 8
How do I add values from different rows and different columns? For example:
Column A Row 2 with Column B row 1
Column A Row 3 with Column B row 2
Similarly for all rows
If you only need to do this with two columns (and I've understood your question correctly), I think you can use the shift function.
Your data frame (pandas?) is something like:
d = {'A': [2, 1, 5], 'B': [3, 4, 6], 'C': [4, 5, 7], 'D':[5, 6, 8]}
df = pd.DataFrame(data=d)
So, it's possible to create a new Series with column B shifted down by one row:
df2 = df['B'].shift(1)
which gives:
0 NaN
1 3.0
2 4.0
Name: B, dtype: float64
and then, merge this new data with the previous df and, for example, sum the values:
df = df.join(df2, rsuffix='shift')
df['out'] = df['A'] + df['Bshift']
The final output is in out column:
A B C D Bshift out
0 2 3 4 5 NaN NaN
1 1 4 5 6 3.0 4.0
2 5 6 7 8 4.0 9.0
But it's only an intuition, I'm not sure about your question!
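The same result can be sketched without the intermediate join, by adding the shifted column directly. This assumes, as above, that each value of A should be added to the value of B from the previous row:

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 5], 'B': [3, 4, 6],
                   'C': [4, 5, 7], 'D': [5, 6, 8]})

# Add each value of A to the value of B from the previous row.
# The first row has no previous row, so its result is NaN.
df['out'] = df['A'] + df['B'].shift(1)
print(df)
```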

update column value of pandas groupby().last()

Given dataframe:
dfd = pd.DataFrame({'A': [1, 1, 2,2,3,3],
'B': [4, 5, 6,7,8,9],
'C':['a','b','c','c','d','e']
})
I can find the last C value of each A group by using
dfd.groupby('A').last()['C']
However, I want to update those C values to np.nan, and I don't know how to do that. An approach such as:
def replace(df):
    df['C']=np.nan
    return replace
dfd.groupby('A').last().apply(lambda dfd: replace(dfd))
does not work.
I want the result like:
dfd_result= pd.DataFrame({'A': [1, 1, 2,2,3,3],
'B': [4, 5, 6,7,8,9],
'C':['a',np.nan,'c',np.nan,'d',np.nan]
})
IIUC, you need loc. Get the index of the last row in each group using tail:
In [1145]: dfd.loc[dfd.groupby('A')['C'].tail(1).index, 'C'] = np.nan
In [1146]: dfd
Out[1146]:
A B C
0 1 4 a
1 1 5 NaN
2 2 6 c
3 2 7 NaN
4 3 8 d
5 3 9 NaN
dfd.loc[dfd.groupby('A').tail(1).index, 'C'] = np.nan should be fine too.
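As an alternative sketch that avoids groupby altogether, a boolean mask built with duplicated can mark the last row of each A value. The name is_last is only illustrative:

```python
import numpy as np
import pandas as pd

dfd = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3],
                    'B': [4, 5, 6, 7, 8, 9],
                    'C': ['a', 'b', 'c', 'c', 'd', 'e']})

# duplicated(keep='last') is False only for the last row of each A value,
# so negating it gives a mask that selects exactly those rows.
is_last = ~dfd.duplicated('A', keep='last')
dfd.loc[is_last, 'C'] = np.nan
print(dfd)
```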

How do I transpose dataframe in pandas without index?

Pretty sure this is very simple.
I am reading a csv file and have the dataframe:
Attribute A B C
a 1 4 7
b 2 5 8
c 3 6 9
I want to do a transpose to get
Attribute a b c
A 1 2 3
B 4 5 6
C 7 8 9
However, when I do df.T, it results in
0 1 2
Attribute a b c
A 1 2 3
B 4 5 6
C 7 8 9
How do I get rid of the indexes on top?
You can set the index to your first column (or, in general, the column you want to use as the index) in your dataframe first, then transpose the dataframe. For example, if the column you want to use as the index is 'Attribute', you can do:
df.set_index('Attribute',inplace=True)
df.transpose()
Or
df.set_index('Attribute').T
It works for me:
>>> data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
>>> df = pd.DataFrame(data, index=['a', 'b', 'c'])
>>> df.T
a b c
A 1 2 3
B 4 5 6
C 7 8 9
If your index column 'Attribute' is really set as the index before the transpose, then the top row after the transpose is not the first row of data but a title row. If you don't like it, I would first drop the index, then rename them as columns after the transpose.
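One way to get rid of that title row, sketched here with a small stand-in for the CSV data from the question, is to clear the name of the column axis after the transpose:

```python
import pandas as pd

# Stand-in for the frame read from the CSV file in the question.
df = pd.DataFrame({'Attribute': ['a', 'b', 'c'],
                   'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

out = df.set_index('Attribute').T

# After the transpose, 'Attribute' is the name of the column axis,
# which pandas prints as an extra title row. Clearing it removes that row.
out.columns.name = None
print(out)
```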

pandas rearrange dataframe to have all values in ascending order per every column independently

The title should say it all, I want to turn this DataFrame:
A NaN 4 3
B 2 1 4
C 3 4 2
D 4 2 8
into this DataFrame:
A 2 1 2
B 3 2 3
C 4 4 4
D NaN 4 8
And I want to do it in a nice manner. The ugly solution would be to take every column and form a new DataFrame.
To test, use:
d = {'one':[None, 2, 3, 4],
'two':[4, 1, 4, 2],
'three':[3, 4, 6, 8],}
df = pd.DataFrame(d, index = list('ABCD'))
The desired sort ignores the index values, so the operation appears to be more like a NumPy operation than a Pandas one:
import pandas as pd
d = {'one':[None, 2, 3, 4],
'two':[4, 1, 4, 2],
'three':[3, 4, 6, 8],}
df = pd.DataFrame(d, index = list('ABCD'))
# one three two
# A NaN 3 4
# B 2 4 1
# C 3 6 4
# D 4 8 2
arr = df.values
arr.sort(axis=0)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(df)
yields
one three two
A 2 3 1
B 3 4 2
C 4 6 4
D NaN 8 4
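A variation on the same idea, sketched with np.sort so that it works on a sorted copy and does not depend on the array returned by df.values being writable:

```python
import numpy as np
import pandas as pd

d = {'one': [None, 2, 3, 4],
     'two': [4, 1, 4, 2],
     'three': [3, 4, 6, 8]}
df = pd.DataFrame(d, index=list('ABCD'))

# np.sort returns a sorted copy; axis=0 sorts each column independently,
# and NaN values are placed at the end of their column.
sorted_df = pd.DataFrame(np.sort(df.to_numpy(dtype=float), axis=0),
                         index=df.index, columns=df.columns)
print(sorted_df)
```

Converting to float is an assumption that suits this all-numeric example; the index labels are kept as they were and only the values move.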
