I have a pandas dataframe like so:
import pandas as pd
df = pd.DataFrame({
'id':[1,2,3,4,5,6],
'a':[1,2,3,4,5,6],
'b':['a', 'b', 'c', 'd', 'e', 'f']
})
And I would like to replace values in columns a and b with constants given by a dictionary like so:
fills = dict(
a = 1,
b = 'a'
)
to obtain a result like this:
id a b
0 1 1 a
1 2 1 a
2 3 1 a
3 4 1 a
4 5 1 a
5 6 1 a
Obviously, I can do:
for column in fills:
df.loc[:, column] = fills[column]
To get the desired results of:
id a b
0 1 1 a
1 2 1 a
2 3 1 a
3 4 1 a
4 5 1 a
5 6 1 a
But is there perhaps some pandas function that would let me pass the dictionary as an argument and do this replacement without writing a Python loop?
If the column names are strings rather than numbers, you can unpack the dictionary into DataFrame.assign:
df = df.assign(**fills)
print (df)
id a b
0 1 1 a
1 2 1 a
2 3 1 a
3 4 1 a
4 5 1 a
5 6 1 a
General solution, which also works when the column labels are not strings:
fills = {'a':4, 5:3}
for k, v in fills.items():
df[k] = v
print (df)
id a b 5
0 1 4 a 3
1 2 4 b 3
2 3 4 c 3
3 4 4 d 3
4 5 4 e 3
5 6 4 f 3
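The two approaches above can be combined in one small helper. This is only a sketch: the name fill_columns is hypothetical, and it assumes that falling back to the loop whenever a key is not a string is acceptable, since assign(**fills) only accepts string keywords.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'a': [1, 2, 3, 4, 5, 6],
    'b': ['a', 'b', 'c', 'd', 'e', 'f'],
})

def fill_columns(frame, fills):
    """Return a copy of frame with every column named in fills set to a constant."""
    # assign(**fills) needs string keys, because they are passed as keyword arguments
    if all(isinstance(key, str) for key in fills):
        return frame.assign(**fills)
    # otherwise fall back to plain column assignment on a copy
    out = frame.copy()
    for key, value in fills.items():
        out[key] = value
    return out

filled = fill_columns(df, {'a': 1, 'b': 'a'})   # string keys: uses assign
mixed = fill_columns(df, {'a': 4, 5: 3})        # integer key: uses the loop
print(filled)
print(mixed)
```

Both calls leave the original df untouched, which the in-place loop from the question does not.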
I'm new to Python.
My code:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[2,3,2,2,2],
'B':[1,5,5,1,1],
'C':[1,6,6,2,1],
'D':[1,2,3,1,1]})
df
dataframe:
A B C D
0 2 1 1 1
1 3 5 6 2
2 2 5 6 3
3 2 1 2 1
4 2 1 1 1
I want to delete every row whose values in column B and column C both match an earlier row, keeping only the first such row.
For example:
rows 0 and 4 have the same B and C, so delete row 4;
rows 1 and 2 have the same B and C, so delete row 2.
Use drop_duplicates on the 'B' and 'C' columns (subset=['B', 'C']) and keep the first occurrence (keep='first'):
>>> df.drop_duplicates(subset=['B', 'C'], keep='first')
A B C D
0 2 1 1 1
1 3 5 6 2
3 2 1 2 1
keep='first' is the default option so you don't have to set it.
You can do something like:
df.groupby(['B', 'C']).head(1)
This takes the first element from each group:
A B C D
0 2 1 1 1
1 3 5 6 2
3 2 1 2 1
Or:
>>> df[~df[['B', 'C']].duplicated()]
A B C D
0 2 1 1 1
1 3 5 6 2
3 2 1 2 1
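As a quick check, the three approaches above can be run side by side on the frame from the question. This is a minimal sketch; the variable names by_drop, by_group and by_mask are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 2, 2, 2],
                   'B': [1, 5, 5, 1, 1],
                   'C': [1, 6, 6, 2, 1],
                   'D': [1, 2, 3, 1, 1]})

# keep='first' is the default, so it is omitted here
by_drop = df.drop_duplicates(subset=['B', 'C'])
# first row of every (B, C) group, in original row order
by_group = df.groupby(['B', 'C']).head(1)
# boolean mask: True for rows that repeat an earlier (B, C) pair
by_mask = df[~df.duplicated(subset=['B', 'C'])]

print(by_drop)
print(by_drop.equals(by_group) and by_drop.equals(by_mask))
```

All three keep the original index labels, so call reset_index(drop=True) afterwards if a fresh 0..n-1 index is wanted.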
I'm trying to add another line of headers above the headers in a pandas dataframe.
Turning this:
import pandas as pd
df = pd.DataFrame(data={'A': range(5), 'B': range(5), 'C':range(5)})
print(df)
A B C
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
into this (for instance):
D E
A B C
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
Having both A and B under D, and C under E. This doesn't seem like something that would be hard with pandas, yet I can't seem to find the answer. How do you do this?
With some help from @Nk03, you can create a mapping of existing column labels to D and E, then create a MultiIndex:
map_dict = {'A': 'D', 'B': 'D' ,'C':'E'}
df.columns = pd.MultiIndex.from_tuples(zip(df.columns.map(map_dict), df.columns))
D E
A B C
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
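The same idea can be written with MultiIndex.from_arrays, which takes one list per header level. This is a sketch under the same map_dict as above; the names top and under_d are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(data={'A': range(5), 'B': range(5), 'C': range(5)})

map_dict = {'A': 'D', 'B': 'D', 'C': 'E'}
# one entry per existing column: the new top-level header it belongs under
top = [map_dict[col] for col in df.columns]
df.columns = pd.MultiIndex.from_arrays([top, df.columns])

# selecting a top-level label returns the columns grouped under it
under_d = df['D']
print(df)
print(under_d)
```

Once the columns are a MultiIndex, a single column is addressed with a tuple such as df[('E', 'C')].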
Let's say we have the following dataframe. To get a running count of consecutive 1's, you could use the code below it.
col
0 0
1 1
2 1
3 1
4 0
5 0
6 1
7 1
8 0
9 1
10 1
11 1
12 1
13 0
14 1
15 1
df['col'].groupby(df['col'].diff().ne(0).cumsum()).cumsum()
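To make the one-liner easier to follow, here it is split into its two steps on the data shown above. The intermediate names run_id and running are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col': [0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]})

# a new run starts wherever the value differs from the previous row;
# the cumulative sum of those change points gives every run its own label
run_id = df['col'].diff().ne(0).cumsum()

# cumulative sum inside each run: counts up along runs of 1's, stays 0 in runs of 0's
running = df['col'].groupby(run_id).cumsum()

print(pd.DataFrame({'col': df['col'], 'run_id': run_id, 'running': running}))
```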
But the problem I see is when you need to use groupby with an id field. If we add an id field to the dataframe (below), it becomes more complicated. We can no longer use the solution above.
id col
0 B 0
1 B 1
2 B 1
3 B 1
4 A 0
5 A 0
6 B 1
7 B 1
8 B 0
9 B 1
10 B 1
11 A 1
12 A 1
13 A 0
14 A 1
15 A 1
When presented with this issue, I've seen the suggestion to make a helper series to use in the groupby, like this:
s = df['col'].eq(0).groupby(df['id']).cumsum()
df['col'].groupby([df['id'],s]).cumsum()
Which works, but the problem is that the first group contains the first row, which does not fit the criteria. This usually isn't a problem, but it is if we wanted to find the count. Replacing cumsum() at the end of the last groupby() with .transform('count') would actually give us 6 instead of 5 for the count of consecutive 1's in the first B group.
The only solution I can come up with for this problem is the following code:
df['col'].groupby([df['id'],df.groupby('id')['col'].transform(lambda x: x.diff().ne(0).astype(int).cumsum())]).transform('count')
Expected output:
0 1
1 5
2 5
3 5
4 2
5 2
6 5
7 5
8 1
9 2
10 2
11 2
12 2
13 1
14 2
15 2
This works, but uses transform() twice, which I heard isn't the fastest. It is the only solution I can think of that uses diff().ne(0) to get the "real" groups.
Indexes 1, 2, 3, 6 and 7 all have id B and the same value in the 'col' column, so the count is not reset and they are all part of the same group.
Can this be done without using multiple .transform()?
The following code uses only one .transform() and relies on sorting the index to get the correct counts.
The original index is kept, so the final result can be reindexed back to the original order.
Use cum_counts['cum_counts'] to get the exact desired output, without the other column.
import pandas as pd
# test data as shown in OP
df = pd.DataFrame({'id': ['B', 'B', 'B', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A'], 'col': [0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]})
# reset the index, then set the index and sort
df = df.reset_index().set_index(['index', 'id']).sort_index(level=1)
col
index id
4 A 0
5 A 0
11 A 1
12 A 1
13 A 0
14 A 1
15 A 1
0 B 0
1 B 1
2 B 1
3 B 1
6 B 1
7 B 1
8 B 0
9 B 1
10 B 1
# label each run of equal consecutive values: cumulative sum of the change points
g = df.col.ne(df.col.shift()).cumsum()
# use g to groupby and use only 1 transform to get the counts
cum_counts = df['col'].groupby(g).transform('count').reset_index(level=1, name='cum_counts').sort_index()
id cum_counts
index
0 B 1
1 B 5
2 B 5
3 B 5
4 A 2
5 A 2
6 B 5
7 B 5
8 B 1
9 B 2
10 B 2
11 A 2
12 A 2
13 A 1
14 A 2
15 A 2
After looking at @TrentonMcKinney's solution, I came up with:
df = df.sort_values(['id'], kind='mergesort')  # stable sort keeps the original row order within each id
grp =(df[['id','col']] != df[['id','col']].shift()).any(axis=1).cumsum()
df['count'] = df.groupby(grp)['id'].transform('count')
df.sort_index()
Output:
id col count
0 B 0 1
1 B 1 5
2 B 1 5
3 B 1 5
4 A 0 2
5 A 0 2
6 B 1 5
7 B 1 5
8 B 0 1
9 B 1 2
10 B 1 2
11 A 1 2
12 A 1 2
13 A 0 1
14 A 1 2
15 A 1 2
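For comparison, the runs can also be labelled per id without sorting, still with a single transform. This is only a sketch of an alternative, not one of the answers above; the names starts, run_id and counts are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['B', 'B', 'B', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A'],
    'col': [0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1],
})

# difference to the previous row *of the same id*; non-zero (or NaN) marks a new run
starts = df.groupby('id')['col'].diff().ne(0)
# number the runs separately inside every id
run_id = starts.groupby(df['id']).cumsum()
# one transform: size of the (id, run) group each row belongs to
counts = df.groupby(['id', run_id])['col'].transform('count')

print(counts.tolist())
```

The result stays aligned with the original index, so no reindexing step is needed afterwards.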
IIUC, is this what you want?
grp = (df[['id', 'col']] != df[['id', 'col']].shift()).any(axis = 1).cumsum()
df['count'] = df.groupby(grp)['id'].transform('count')
df
Output:
id col count
0 B 0 1
1 B 1 3
2 B 1 3
3 B 1 3
4 A 0 2
5 A 0 2
6 B 1 2
7 B 1 2
8 B 0 1
9 B 1 2
10 B 1 2
11 A 1 2
12 A 1 2
13 A 0 1
14 A 1 2
15 A 1 2
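The two outputs above differ only because of the sort. A small sketch, using a hypothetical helper named run_counts, shows that the same grouping logic gives 3 for rows 1 to 3 in the original row order and 5 once rows of the same id are made adjacent with a stable sort.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['B', 'B', 'B', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A'],
    'col': [0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1],
})

def run_counts(frame):
    """Size of each run of identical (id, col) rows, in the frame's current row order."""
    keys = frame[['id', 'col']]
    grp = (keys != keys.shift()).any(axis=1).cumsum()
    return frame.groupby(grp)['id'].transform('count')

# original row order: the A rows at index 4 and 5 split the B run
unsorted_counts = run_counts(df)
# stable sort by id first, then restore the original order of the result
sorted_counts = run_counts(df.sort_values('id', kind='mergesort')).sort_index()

print(unsorted_counts.tolist())
print(sorted_counts.tolist())
```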
My problem is as follows:
I have a dataframe df with 5 columns, say ('A', 'B', 'C', 'D', 'E').
I want to combine these columns in sets, say GP1 = ['A', 'B', 'D'] and GP2 = ['C', 'E'], and create two new columns from them:
df['Group1'] = df[GP1].min(axis=1)
df['Group2'] = df[GP2].max(axis=1)
However, depending on the data, column 'A' (or 'D' or 'B', or all of them) may be missing from the first set, and column 'C' or 'E' (or both) may be missing from the second set.
So I want the code to check whether any column of the first or second set is missing, create the new 'Group1' or 'Group2' column only if all columns of that group exist, and skip creating it otherwise.
How can I achieve that? I tried for loops, but the logic became complicated.
An example where all the columns in both sets are present:
df_in
A B C D E
1 2 3 4 5
2 4 6 2 3
1 0 2 4 2
df_out
A B C D E Group1 Group2
1 2 3 4 5 1 5
2 4 6 2 3 2 6
1 0 2 4 2 0 2
An example where column E from the second group is missing:
df_in
A B C D
1 2 3 4
2 4 6 2
1 0 2 4
df_out
A B C D Group1
1 2 3 4 1
2 4 6 2 2
1 0 2 4 0
When both A and D are missing from set 1 (only B is present from set/group 1):
df_in
B C E
2 3 5
4 6 3
0 2 2
df_out
B C E Group2
2 3 5 5
4 6 3 6
0 2 2 2
The case where A is missing from set 1 and C is missing from set 2:
df_in
B D E
2 4 5
4 2 3
0 4 2
df_out
B D E
2 4 5
4 2 3
0 4 2
Any help in this direction will be immensely appreciated. Thanks
Here you go, I think you can use this:
df_out = (df_in.assign(Group1=df_in.reindex(gp1, axis=1).dropna().min(axis=1),
Group2=df_in.reindex(gp2, axis=1).dropna().max(axis=1))
.dropna(axis=1, how='all'))
MCVE:
import pandas as pd
df_in = pd.read_clipboard() # read df_in copied from the question above
print(df_in)
# A B C D E
# 0 1 2 3 4 5
# 1 2 4 6 2 3
# 2 1 0 2 4 2
gp1 = ['A','B','D']
gp2 = ['C','E']
df_out = (df_in.assign(Group1=df_in.reindex(gp1, axis=1).dropna().min(axis=1),
Group2=df_in.reindex(gp2, axis=1).dropna().max(axis=1))
.dropna(axis=1, how='all'))
print(df_out)
# A B C D E Group1 Group2
# 0 1 2 3 4 5 1 5
# 1 2 4 6 2 3 2 6
# 2 1 0 2 4 2 0 2
df_in_copy=df_in.copy() #make a copy to reuse later
df_in = df_in.drop('E', axis=1) #Drop Col E
print(df_in)
# A B C D
# 0 1 2 3 4
# 1 2 4 6 2
# 2 1 0 2 4
df_out = (df_in.assign(Group1=df_in.reindex(gp1, axis=1).dropna().min(axis=1),
Group2=df_in.reindex(gp2, axis=1).dropna().max(axis=1))
.dropna(axis=1, how='all'))
print(df_out)
# A B C D Group1
# 0 1 2 3 4 1
# 1 2 4 6 2 2
# 2 1 0 2 4 0
df_in = df_in_copy.copy() # restore the original frame from the saved copy
df_in = df_in.drop(['A','D'], axis=1) #Drop Columns A and D
print(df_in)
# B C E
# 0 2 3 5
# 1 4 6 3
# 2 0 2 2
df_out = (df_in.assign(Group1=df_in.reindex(gp1, axis=1).dropna().min(axis=1),
Group2=df_in.reindex(gp2, axis=1).dropna().max(axis=1))
.dropna(axis=1, how='all'))
print(df_out)
# B C E Group2
# 0 2 3 5 5
# 1 4 6 3 6
# 2 0 2 2 2
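A more explicit alternative is to test for the columns before computing anything. This is only a sketch, not the answer's method: add_group and add_groups are hypothetical helper names, and it assumes Group1 is the row-wise minimum and Group2 the row-wise maximum, as in the question.

```python
import pandas as pd

gp1 = ['A', 'B', 'D']
gp2 = ['C', 'E']

def add_group(frame, name, cols, how):
    """Add column `name` (row-wise `how` over `cols`) only if every column exists."""
    if set(cols).issubset(frame.columns):
        return frame.assign(**{name: frame[cols].agg(how, axis=1)})
    return frame  # at least one column is missing: skip this group

def add_groups(frame):
    frame = add_group(frame, 'Group1', gp1, 'min')
    return add_group(frame, 'Group2', gp2, 'max')

df_in = pd.DataFrame({'A': [1, 2, 1], 'B': [2, 4, 0], 'C': [3, 6, 2],
                      'D': [4, 2, 4], 'E': [5, 3, 2]})

full = add_groups(df_in)                                # both groups created
no_e = add_groups(df_in.drop(columns='E'))              # only Group1
no_a_d = add_groups(df_in.drop(columns=['A', 'D']))     # only Group2
no_a_c = add_groups(df_in.drop(columns=['A', 'C']))     # neither
print(full)
```

Unlike the reindex/dropna chain, this check does not depend on NaN handling, so rows that legitimately contain NaN are not affected.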
When using the drop method of a pandas.DataFrame, it accepts lists of column names but not tuples, despite the documentation saying that "list-like" arguments are acceptable. Am I reading the documentation incorrectly? I would expect my MWE to work.
MWE
import pandas as pd
df = pd.DataFrame({k: range(5) for k in list('abcd')})
df.drop(['a', 'c'], axis=1) # Works
df.drop(('a', 'c'), axis=1) # Errors
Versions - Using Python 2.7.12, Pandas 0.20.3.
The problem is that pandas uses tuples to select labels in a MultiIndex:
import numpy as np
import pandas as pd
np.random.seed(345)
mux = pd.MultiIndex.from_arrays([list('abcde'), list('cdefg')])
df = pd.DataFrame(np.random.randint(10, size=(4,5)), columns=mux)
print (df)
a b c d e
c d e f g
0 8 0 3 9 8
1 4 3 4 1 7
2 4 0 9 6 3
3 8 0 3 1 5
df1 = df.drop(('a', 'c'), axis=1)
print (df1)
b c d e
d e f g
0 0 3 9 8
1 3 4 1 7
2 0 9 6 3
3 0 3 1 5
The tuple selects the same column when indexing the original df:
s = df[('a', 'c')]
print (s)
0 8
1 4
2 4
3 8
Name: (a, c), dtype: int32
Pandas treats tuples as multi-index values, so try this instead:
In [330]: df.drop(list(('a', 'c')), axis=1)
Out[330]:
b d
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
Here is an example of deleting rows (axis=0, the default) in a multi-index DataFrame:
In [342]: x = df.set_index(np.arange(len(df), 0, -1), append=True)
In [343]: x
Out[343]:
a b c d
0 5 0 0 0 0
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
4 1 4 4 4 4
In [344]: x.drop((0,5))
Out[344]:
a b c d
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
4 1 4 4 4 4
In [345]: x.drop([(0,5), (4,1)])
Out[345]:
a b c d
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
So when you specify a tuple, pandas treats it as a multi-index label.
I used this to delete a column whose label is a tuple:
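The distinction can be summarised in one short sketch: convert a tuple of names to a list for flat columns, and reserve tuples for MultiIndex labels. The frames flat and multi and the other variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# flat columns: a tuple of names must be turned into a list first
flat = pd.DataFrame({k: range(5) for k in 'abcd'})
cols = ('a', 'c')
flat_dropped = flat.drop(list(cols), axis=1)

# MultiIndex columns: a tuple is a single label, one entry per level
mux = pd.MultiIndex.from_arrays([list('abcde'), list('cdefg')])
multi = pd.DataFrame(np.zeros((4, 5), dtype=int), columns=mux)
one_dropped = multi.drop(('a', 'c'), axis=1)                  # drops one column
two_dropped = multi.drop([('a', 'c'), ('b', 'd')], axis=1)    # list of tuples drops two

print(list(flat_dropped.columns))
print(list(one_dropped.columns))
print(list(two_dropped.columns))
```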
del df3[('val1', 'val2')]
and it got deleted.