Adding a row to each index on a multi-indexed dataframe - python

I have a multi-indexed dataframe, and for every value of the outermost index I want to add one more row, in which the two inner indices are set to a specific string (the same string for every value of the outer index). The other values of that row can be empty or anything else.
I tried creating a separate dataframe using groupby and appending it, but I can't get the indices to work.
For example, for the dataframe:
Index1 Index2 Index3 val
A d 1 a
A d 2 b
A e 3 c
A e 4 d
B f 5 e
B f 6 f
B g 7 g
C h 8 h
C h 9 i
C i 10 j
I would like to get:
Index1 Index2 Index3 val
A d 1 a
A d 2 b
A e 3 c
A e 4 d
A StringA StringA <any value>
B f 5 e
B f 6 f
B g 7 g
B StringA StringA <any value>
C h 8 h
C h 9 i
C i 10 j
C StringA StringA <any value>

If I understand correctly (IIUC):
s = pd.DataFrame({'Index1': df.Index1.unique(),
                  'Index2': df.Index1.radd('String').unique(),
                  'Index3': df.Index1.radd('String').unique(),
                  'val': [1] * df.Index1.nunique()})
pd.concat([df.reset_index(), s]).sort_values('Index1').set_index(['Index1', 'Index2', 'Index3'])
(Note that radd prepends 'String' to each Index1 value, so each group gets its own marker, 'StringA', 'StringB', ..., as the output below shows.)
Out[301]:
Index1 Index2 Index3 val
0 A d 1 a
1 A d 2 b
2 A e 3 c
3 A e 4 d
0 A StringA StringA 1
4 B f 5 e
5 B f 6 f
6 B g 7 g
1 B StringB StringB 1
7 C h 8 h
8 C h 9 i
9 C i 10 j
2 C StringC StringC 1
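For the exact desired output above, with the same marker string in every group, here is a minimal self-contained sketch along the lines of the groupby-and-append attempt from the question (the marker 'StringA' and the empty val are placeholders):

import pandas as pd

df = pd.DataFrame({'Index1': list('AAAABBBCCC'),
                   'Index2': list('ddeeffghhi'),
                   'Index3': range(1, 11),
                   'val': list('abcdefghij')}).set_index(['Index1', 'Index2', 'Index3'])

# Append one marker row per outer-index group, keeping the MultiIndex names intact.
parts = []
for key, grp in df.groupby(level='Index1', sort=False):
    marker = pd.DataFrame({'val': ''},
                          index=pd.MultiIndex.from_tuples([(key, 'StringA', 'StringA')],
                                                          names=df.index.names))
    parts.append(pd.concat([grp, marker]))
out = pd.concat(parts)
print(out)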

You can unstack, assign, then stack back:
new_df = df.unstack(level=(-1, -2))
# you can pass a Series here instead of a scalar
new_df[('val', 'StringA', 'StringA')] = 'ABC'
# stack drops all-NaN rows by default (dropna=True), so the spurious
# index combinations created by unstack disappear again
new_df.stack(level=(-1, -2))
Output:
                        val
Index1 Index2  Index3
A      d       1          a
               2          b
       e       3          c
               4          d
       StringA StringA  ABC
B      f       5          e
               6          f
       g       7          g
       StringA StringA  ABC
C      h       8          h
               9          i
       i       10         j
       StringA StringA  ABC

Or, using groupby().last() to build one marker row per group:
import numpy as np

markers = df.groupby(df['Index1'], as_index=False).last()
markers[['Index2', 'Index3', 'val']] = ['StringA', 'StringA', np.nan]
df = pd.concat([df, markers]).sort_values(['Index1', 'Index3']).reset_index()
print(df)
Output:
index Index1 Index2 Index3 val
0 0 A d 1 a
1 1 A d 2 b
2 2 A e 3 c
3 3 A e 4 d
4 0 A StringA StringA NaN
5 4 B f 5 e
6 5 B f 6 f
7 6 B g 7 g
8 1 B StringA StringA NaN
9 7 C h 8 h
10 8 C h 9 i
11 9 C i 10 j
12 2 C StringA StringA NaN

Related

In pandas, how do I merge column 2's value into column 1 when both columns are object dtype and only a few values in column 1 are null?

The data frame has two columns:
df1
col1 col2
A    A
B    A
B    A
C    B
     C
     D
     E
E    E
F    G
     G
     H
     H
Both columns are object dtype; I'm trying to fill column 1 with the value from column 2 wherever column 1 is null.
How can I apply this to a large dataset? I'm a beginner with pandas, trying to learn all the tricks here.
Expected output:
col1 col2
A    A
B    A
B    A
C    B
C    C
D    D
E    E
E    E
F    G
G    G
H    H
H    H
You can also use np.where; the condition is evaluated elementwise, and an empty string is falsy:
df['col1'] = np.where(df['col1'], df['col1'], df['col2'])
Or combine_first, after first converting the empty strings to proper null values:
df['col1'] = df['col1'].replace('', np.nan).combine_first(df['col2'])
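For reference, here is a self-contained version of the np.where one-liner on the question's data; this is a sketch that assumes the missing values really are empty strings:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'B', 'C', '', '', '', 'E', 'F', '', '', ''],
                   'col2': ['A', 'A', 'A', 'B', 'C', 'D', 'E', 'E', 'G', 'G', 'H', 'H']})

# np.where evaluates truthiness elementwise: '' is falsy and gets replaced.
# A real NaN is truthy, so the replace-then-combine_first route is safer there.
df['col1'] = np.where(df['col1'], df['col1'], df['col2'])
print(df)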
If empty values are missing values:
print (df)
  col1 col2
0    A    A
1    B    B
2  NaN    C
3  NaN    D
4  NaN    E
5    E    E
6    F    F
7  NaN    G
8  NaN    H
9  NaN    H
df['col1'] = df['col1'].fillna(df['col2'])
print (df)
  col1 col2
0    A    A
1    B    B
2    C    C
3    D    D
4    E    E
5    E    E
6    F    F
7    G    G
8    H    H
9    H    H
If empty values are empty strings:
print (df)
  col1 col2
0    A    A
1    B    B
2         C
3         D
4         E
5    E    E
6    F    F
7         G
8         H
9         H
df['col1'] = df['col1'].mask(df['col1'] == '', df['col2'])
# thanks U10-Forward
df['col1'] = df['col1'].replace('', np.nan).fillna(df['col2'])
print (df)
  col1 col2
0    A    A
1    B    B
2    C    C
3    D    D
4    E    E
5    E    E
6    F    F
7    G    G
8    H    H
9    H    H

Display pandas columns - wide to long

Say I have a row of column headers, and associated values in a Pandas Dataframe:
print(df)
A B C D E F G H I J K
1 2 3 4 5 6 7 8 9 10 11
How do I go about displaying them like the following:
print(df)
A B C D E
1 2 3 4 5
F G H I J
6 7 8 9 10
K
11
custom function
import numpy as np

def new_repr(self):
    g = self.groupby(np.arange(self.shape[1]) // 5, axis=1)
    return '\n\n'.join([d.to_string() for _, d in g])
print(new_repr(df))
A B C D E
0 1 2 3 4 5
F G H I J
0 6 7 8 9 10
K
0 11
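Note that groupby(..., axis=1) is deprecated in recent pandas releases; here is a sketch of the same idea that slices the columns directly instead (the chunk size of 5 mirrors the function above):

import pandas as pd

df = pd.DataFrame([list(range(1, 12))], columns=list('ABCDEFGHIJK'))

def new_repr(frame, chunk=5):
    # Render the frame in blocks of `chunk` columns, one block per paragraph.
    pieces = (frame.iloc[:, i:i + chunk] for i in range(0, frame.shape[1], chunk))
    return '\n\n'.join(piece.to_string() for piece in pieces)

print(new_repr(df))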
pd.set_option('display.width', 20)
pd.set_option('display.expand_frame_repr', True)
df
A B C D E \
0 1 2 3 4 5
F G H I J \
0 6 7 8 9 10
K
0 11

Separate pandas df by repeating range in a column

Problem:
I'm trying to split a pandas data frame by the repeating ranges in column A. My data and desired output are below. The ranges in column A are always increasing and never skip values; however, they start and stop arbitrarily.
Data:
import pandas as pd

data = {"A": [1, 2, 3, 2, 3, 4, 3, 4, 5, 6],
        "B": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "k"]}
df = pd.DataFrame(data)
df
A B
0 1 a
1 2 b
2 3 c
3 2 d
4 3 e
5 4 f
6 3 g
7 4 h
8 5 i
9 6 k
Desired output:
df1
A B
0 1 a
1 2 b
2 3 c
df2
A B
0 2 d
1 3 e
2 4 f
df3
A B
0 3 g
1 4 h
2 5 i
3 6 k
Thanks for any advice!
Answer timings:
from timeit import default_timer as timer

start = timer()
for x, y in df.groupby(df.A.diff().ne(1).cumsum()):
    print(y)
end = timer()
aa = end - start

start = timer()
s = (df.A.diff() != 1).cumsum()
g = df.groupby(s)
for _, g_ in g:
    print(g_)
end = timer()
bb = end - start

start = timer()
[*(d for _, d in df.groupby(df.A.diff().ne(1).cumsum()))]
print(*(d for _, d in df.groupby(df.A.diff().ne(1).cumsum())), sep='\n\n')
end = timer()
cc = end - start

print(aa, bb, cc)
0.0176649530000077 0.018132143000002543 0.018715283999995336
Create the groupby key by using diff and cumsum:
for x, y in df.groupby(df.A.diff().ne(1).cumsum()):
    print(y)
A B
0 1 a
1 2 b
2 3 c
A B
3 2 d
4 3 e
5 4 f
A B
6 3 g
7 4 h
8 5 i
9 6 k
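To see why this works, it may help to print the key itself: diff() equals 1 while a run continues, ne(1) flags every restart (including the first row, whose diff is NaN), and cumsum() turns the flags into group labels. A quick self-contained check:

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 2, 3, 4, 3, 4, 5, 6],
                   "B": list("abcdefghik")})

key = df.A.diff().ne(1).cumsum()
print(key.tolist())  # [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]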
Just group by the difference:
s = (df.A.diff() != 1).cumsum()
g = df.groupby(s)
for _, g_ in g:
    print(g_)
Outputs
A B
0 1 a
1 2 b
2 3 c
A B
3 2 d
4 3 e
5 4 f
A B
6 3 g
7 4 h
8 5 i
9 6 k
One-liner
because that's important
[*(d for _, d in df.groupby(df.A.diff().ne(1).cumsum()))]
Print it
print(*(d for _, d in df.groupby(df.A.diff().ne(1).cumsum())), sep='\n\n')
A B
0 1 a
1 2 b
2 3 c
A B
3 2 d
4 3 e
5 4 f
A B
6 3 g
7 4 h
8 5 i
9 6 k
Assign it
df1, df2, df3 = (d for _, d in df.groupby(df.A.diff().ne(1).cumsum()))

How to find out the number of redundant rows in a data frame and report this number as a separate column

Having a data frame that looks like this:
Col1 Col2 Col3
A B C
C D E
F G H
A B C
A H K
A B C
F G H
A B C
I need to find each repeated pattern, count it, and report the count in an extra column; the output would then be:
Col1 Col2 Col3 Count
A B C 4
C D E 1
F G H 2
A B C 4
A H K 1
A B C 4
F G H 2
A B C 4
The idea I have is to compare the size of the original data frame with its size after dropping duplicates (using df.drop_duplicates), but I wonder if there is a nicer way?
Use groupby and transform
In [4241]: df['Count'] = df.groupby(['Col1', 'Col2', 'Col3'])['Col1'].transform('size')
In [4242]: df
Out[4242]:
Col1 Col2 Col3 Count
0 A B C 4
1 C D E 1
2 F G H 2
3 A B C 4
4 A H K 1
5 A B C 4
6 F G H 2
7 A B C 4
Or, alternatively, use merge:
In [4256]: df.merge(df.groupby(['Col1', 'Col2', 'Col3']).size().reset_index(name='Count'),
how='left')
Out[4256]:
Col1 Col2 Col3 Count
0 A B C 4
1 C D E 1
2 F G H 2
3 A B C 4
4 A H K 1
5 A B C 4
6 F G H 2
7 A B C 4
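For completeness, here is a self-contained version of the transform approach on the question's data; transform('size') broadcasts each group's row count back onto every member row, which is why no merge step is needed:

import pandas as pd

df = pd.DataFrame({'Col1': list('ACFAAAFA'),
                   'Col2': list('BDGBHBGB'),
                   'Col3': list('CEHCKCHC')})

# Group on all three columns and broadcast each group's size back row by row.
df['Count'] = df.groupby(['Col1', 'Col2', 'Col3'])['Col1'].transform('size')
print(df)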

Replace values in a column, with certain values from another, ignoring any 'nan' entries

I have the following pandas dataframe:
A B C D
2 a 1 F
4 b 2 G
6 b 3 nan
1 c 4 G
5 c 5 nan
7 d 6 H
I want to replace the values in column B with the values in column D, while doing nothing for the NaN entries in column D.
Desired output:
A B C D
2 F 1 F
4 G 2 G
6 b 3 nan
1 G 4 G
5 c 5 nan
7 H 6 H
You can mask the rows of interest using a boolean mask and pass this to loc so only those rows are overwritten:
In [3]:
df.loc[df['D'].notnull(), 'B'] = df['D']
df
Out[3]:
A B C D
0 2 F 1 F
1 4 G 2 G
2 6 b 3 NaN
3 1 G 4 G
4 5 c 5 NaN
5 7 H 6 H
See the docs on boolean indexing and notnull
A few alternative solutions:
In [72]: df['B'] = df['D'].combine_first(df['B'])
In [73]: df
Out[73]:
A B C D
0 2 F 1 F
1 4 G 2 G
2 6 b 3 NaN
3 1 G 4 G
4 5 c 5 NaN
5 7 H 6 H
or:
df['B'] = df['D'].fillna(df['B'])
or:
df['B'] = df['D'].mask(df['D'].isnull(), df['B'])
or:
df['B'] = df['D'].where(df['D'].notnull(), df['B'])
or:
df['B'] = np.where(df['D'].notnull(), df['D'], df['B'])
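All of these variants express the same rule: take D where it is not null, otherwise keep B. A self-contained sketch using the question's data and the where variant:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 4, 6, 1, 5, 7],
                   'B': list('abbccd'),
                   'C': range(1, 7),
                   'D': ['F', 'G', np.nan, 'G', np.nan, 'H']})

# Keep D where it exists, fall back to B elsewhere.
df['B'] = df['D'].where(df['D'].notnull(), df['B'])
print(df)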
