Assign new column using a set of sub-columns - python

I have a dataframe with a column 'name' of the form ['A','B','C','A','B','B', ...] and a set of arrays, one per name: say array_A = [0, 1, 2, ...], array_B = [3, 1, 0, ...], array_C, etc.
I want to create a new column 'value' by assigning array_A where the row name in the dataframe is 'A', and similarly for 'B' and 'C'.
The call df['value'] = np.where(df['name']=='A', array_A, df['value']) won't do it: it would overwrite the values for other names, and array_A generally has a different length than the dataframe, so it raises a dimensionality error.
For example:
df = pd.DataFrame({'name': ['A', 'B', 'A', 'A', 'B']})
arrays = {'A': np.array([0, 1, 2]),
          'B': np.array([3, 1])}
Desired output:
  name  value
0    A      0
1    B      3
2    A      1
3    A      2
4    B      1

You can use a for loop with a dictionary:
arrays = {'A': np.array([0, 1, 2]),
          'B': np.array([3, 1])}
df = pd.DataFrame({'name': ['A', 'B', 'A', 'A', 'B']})
for k, v in arrays.items():
    df.loc[df['name'] == k, 'value'] = v
df['value'] = df['value'].astype(int)
print(df)
  name  value
0    A      0
1    B      3
2    A      1
3    A      2
4    B      1
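A vectorized alternative (a sketch, assuming each array holds exactly one entry per matching row, in order) is to index each array by the row's position within its own name group:

```python
import numpy as np
import pandas as pd

arrays = {'A': np.array([0, 1, 2]),
          'B': np.array([3, 1])}
df = pd.DataFrame({'name': ['A', 'B', 'A', 'A', 'B']})

# Position of each row within its own name group: 0, 0, 1, 2, 1
pos = df.groupby('name').cumcount()

# Pick the matching array entry for every row
df['value'] = [arrays[n][i] for n, i in zip(df['name'], pos)]
print(df)
```

This avoids the repeated boolean masking of the loop, at the cost of a Python-level comprehension for the final lookup.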

Related

Filling column of dataframe based on 'groups' of values of another column

I am trying to fill values of a column based on the value of another column. Suppose I have the following dataframe:
import pandas as pd
import numpy as np

data = {'A': [4, 4, 5, 6],
        'B': ['a', np.nan, np.nan, 'd']}
df = pd.DataFrame(data)
And I would like to fill column B, but only where column A equals 4. In other words, all rows that share the same value in column A should end up with the same value in column B (by filling it in).
Thus, the desired output should be:
data = {'A': [4, 4, 5, 6],
        'B': ['a', 'a', np.nan, 'd']}
df = pd.DataFrame(data)
I am aware of the fillna method, but a plain forward fill gives the wrong output, as the third row also gets the value 'a' assigned:
df['B'] = df['B'].fillna(method="ffill")
data = {'A': [4, 4, 5, 6],
'B': ['a', 'a', 'a', 'd']}
df = pd.DataFrame(data)
How can I get the desired output?
Try this:
df['B'] = df.groupby('A')['B'].ffill()
Output:
>>> df
A B
0 4 a
1 4 a
2 5 NaN
3 6 d
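A runnable version of that answer, put together here for illustration: the forward fill runs separately inside each group of A, so the lone A == 5 row has no earlier value in its group and stays NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [4, 4, 5, 6],
                   'B': ['a', np.nan, np.nan, 'd']})

# ffill is applied independently within each group of A, so row 2
# (the only row with A == 5) has nothing to fill from and stays NaN
df['B'] = df.groupby('A')['B'].ffill()
print(df)
```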

column is not getting dropped

Why is column A not being dropped from the train, valid, and test data frames?
import pandas as pd
train = pd.DataFrame({'A': [0, 1, 2, 3, 4],'B': [5, 6, 7, 8, 9],'C': ['a', 'b', 'c', 'd', 'e']})
test = pd.DataFrame({'A': [0, 1, 2, 3, 4],'B': [5, 6, 7, 8, 9],'C': ['a', 'b', 'c', 'd', 'e']})
valid = pd.DataFrame({'A': [0, 1, 2, 3, 4],'B': [5, 6, 7, 8, 9],'C': ['a', 'b', 'c', 'd', 'e']})
for df in [train, valid, test]:
    df = df.drop(['A'], axis=1)
print('A' in train.columns)
print('A' in test.columns)
print('A' in valid.columns)
#True
#True
#True
You can use the inplace=True parameter, because DataFrame.drop also works in place:
for df in [train, valid, test]:
    df.drop(['A'], axis=1, inplace=True)
print('A' in train.columns)
False
print('A' in test.columns)
False
print('A' in valid.columns)
False
The reason the column is not removed is that df = df.drop(...) rebinds the loop variable to a new DataFrame without assigning it back, so the original DataFrames are not changed.
Another idea is to create a list of DataFrames and assign each changed DataFrame back:
L = [train, valid, test]
for i in range(len(L)):
    L[i] = L[i].drop(['A'], axis=1)
print(L)
[ B C
0 5 a
1 6 b
2 7 c
3 8 d
4 9 e, B C
0 5 a
1 6 b
2 7 c
3 8 d
4 9 e, B C
0 5 a
1 6 b
2 7 c
3 8 d
4 9 e]
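A third option, not in the original answer but a common idiom, is to rebind all three names at once with unpacking, which avoids both in-place mutation and the intermediate list:

```python
import pandas as pd

train = pd.DataFrame({'A': [0, 1], 'B': [5, 6]})
valid = pd.DataFrame({'A': [2, 3], 'B': [7, 8]})
test = pd.DataFrame({'A': [4, 5], 'B': [9, 0]})

# Drop the column from each frame and rebind the original names
train, valid, test = (d.drop(columns=['A']) for d in (train, valid, test))
print('A' in train.columns, 'A' in valid.columns, 'A' in test.columns)
```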

Sum the values in a pandas column based on the items in another column

How can I sum the values in column 'Two' based on the items in column 'One' in a pandas dataframe?
df = pd.DataFrame({'One': ['A', 'B', 'A', 'B'], 'Two': [1, 5, 3, 4]})
Out[1]:
One Two
0 A 1
1 B 5
2 A 3
3 B 4
Expected output should be:
A 4
B 9
You need to group by the first column and sum on the second.
df.groupby('One', as_index=False).sum()
One Two
0 A 4
1 B 9
The trick is to use the pandas built-in .groupby(COLUMN_NAME) and then call .sum() on the resulting object:
import pandas as pd
df = pd.DataFrame({'One': ['A', 'B', 'A', 'B'], 'Two': [1, 5, 3, 4]})
groups = df.groupby('One').sum()
print(groups.head())
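If only the per-key totals are needed, a minor variation (not in the answers above) is to select the column before summing, which returns a plain Series keyed by group label:

```python
import pandas as pd

df = pd.DataFrame({'One': ['A', 'B', 'A', 'B'], 'Two': [1, 5, 3, 4]})

# Selecting 'Two' after groupby makes .sum() yield a Series, not a DataFrame
totals = df.groupby('One')['Two'].sum()
print(totals.to_dict())  # {'A': 4, 'B': 9}
```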

Pandas duplicated indexes still shows correct elements

I have a pandas DataFrame like this:
test = pd.DataFrame({'score1': pd.Series(['a', 'b', 'c', 'd', 'e']), 'score2': pd.Series(['b', 'a', 'k', 'n', 'c'])})
Output:
score1 score2
0 a b
1 b a
2 c k
3 d n
4 e c
I then split the score1 and score2 columns and concatenate them together:
In [283]: frame1 = test[['score1']]
frame2 = test[['score2']]
frame2 = frame2.rename(columns={'score2': 'score1'})
test = pd.concat([frame1, frame2])
test
Out[283]:
score1
0 a
1 b
2 c
3 d
4 e
0 b
1 a
2 k
3 n
4 c
Notice the duplicate indexes. Now if I do a groupby and then retrieve a group using get_group(), pandas is still able to retrieve the elements with the correct index, even though the indexes are duplicated!
In [283]: groups = test.groupby('score1')
groups.get_group('a')  # Get group with key a
Out[283]:
score1
0 a
1 a
In [283]: groups.get_group('b')  # Get group with key b
Out[283]:
score1
1 b
0 b
I understand that pandas uses an inverted index data structure for storing the groups, which looks like this:
In [284]: groups.groups
Out[284]: {'a': [0, 1], 'b': [1, 0], 'c': [2, 4], 'd': [3], 'e': [4], 'k': [2], 'n': [3]}
If both a and b are stored at index 0, how does pandas show me the elements correctly when I do get_group()?
This digs into the internals (i.e., don't rely on this API!), but the way it works currently is that there is a Grouping object which stores the groups in terms of positions, rather than index labels.
In [25]: gb = test.groupby('score1')
In [26]: gb.grouper
Out[26]: <pandas.core.groupby.BaseGrouper at 0x4162b70>
In [27]: gb.grouper.groupings
Out[27]: [Grouping(score1)]
In [28]: gb.grouper.groupings[0]
Out[28]: Grouping(score1)
In [29]: gb.grouper.groupings[0].indices
Out[29]:
{'a': array([0, 6], dtype=int64),
'b': array([1, 5], dtype=int64),
'c': array([2, 9], dtype=int64),
'd': array([3], dtype=int64),
'e': array([4], dtype=int64),
'k': array([7], dtype=int64),
'n': array([8], dtype=int64)}
See here for where it's actually implemented.
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L2091
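The positional bookkeeping can also be seen through the GroupBy.indices attribute, without reaching into the grouper. A small sketch with a deliberately duplicated index (exact internals vary across pandas versions):

```python
import pandas as pd

# Two rows share index label 0 and two share label 1
test = pd.DataFrame({'score1': ['a', 'b', 'b', 'a']}, index=[0, 1, 0, 1])
gb = test.groupby('score1')

# indices maps each key to integer positions, not index labels,
# so duplicated labels are never ambiguous
print(gb.indices)
```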

Pandas keep column after multiple aggregations

I'm trying to do multiple aggregations over a pandas dataframe; the problem is that I want to keep the column over which I aggregate.
df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
df3.groupby('X', as_index=False).agg('sum')
X Y
0 A 4
1 B 6
That's good, but what I want is multiple aggregations, like this:
df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
df3.groupby('X', as_index=False).agg(['sum', 'mean'])
It gives me
Y
sum mean
X
A 4 2
B 6 3
But I want this
X Y
sum mean
0 A 4 2
1 B 6 3
To move X from the index to a column use reset_index:
In [4]: df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
In [5]: df3.groupby('X', as_index=False).agg(['sum', 'mean']).reset_index()
Out[5]:
X Y
sum mean
0 A 4 2
1 B 6 3
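A way to sidestep the MultiIndex columns entirely (a sketch, assuming only Y is being aggregated) is to select the column before .agg, which yields flat sum and mean columns:

```python
import pandas as pd

df3 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})

# Selecting 'Y' first makes agg return flat columns instead of a MultiIndex
out = df3.groupby('X')['Y'].agg(['sum', 'mean']).reset_index()
print(out)
```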
