I have a multi-indexed dataframe, and for every value of the outermost index I want to append one extra row in which the two inner indices are set to a specific string (the same string for every group). The other values of that row can be empty or anything else.
I tried creating a separate dataframe using groupby and appending it, but I can't get the indices to work.
For example, for the dataframe:
Index1 Index2 Index3 val
A d 1 a
A d 2 b
A e 3 c
A e 4 d
B f 5 e
B f 6 f
B g 7 g
C h 8 h
C h 9 i
C i 10 j
I would like to get:
Index1 Index2 Index3 val
A d 1 a
A d 2 b
A e 3 c
A e 4 d
A StringA StringA <any value>
B f 5 e
B f 6 f
B g 7 g
B StringA StringA <any value>
C h 8 h
C h 9 i
C i 10 j
C StringA StringA <any value>
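For reference, a minimal sketch that reconstructs the example frame (the answers below variously treat Index1-Index3 as ordinary columns or as index levels; set_index(['Index1', 'Index2', 'Index3']) converts between the two):
import pandas as pd

df = pd.DataFrame({'Index1': list('AAAABBBCCC'),
                   'Index2': list('ddeeffghhi'),
                   'Index3': range(1, 11),
                   'val': list('abcdefghij')})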
IIUC (if I understand correctly):
s = pd.DataFrame({'Index1': df.Index1.unique(),
                  'Index2': df.Index1.radd('String').unique(),
                  'Index3': df.Index1.radd('String').unique(),
                  'val': [1] * df.Index1.nunique()})
# Index1-Index3 are assumed to be regular columns here
pd.concat([df, s]).sort_values('Index1').set_index(['Index1', 'Index2', 'Index3'])
Out[301]:
Index1 Index2 Index3 val
0 A d 1 a
1 A d 2 b
2 A e 3 c
3 A e 4 d
0 A StringA StringA 1
4 B f 5 e
5 B f 6 f
6 B g 7 g
1 B StringB StringB 1
7 C h 8 h
8 C h 9 i
9 C i 10 j
2 C StringC StringC 1
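Note that radd prepends 'String' to each Index1 value, so the marker differs per group (StringA, StringB, StringC). If you want one identical string for every group, as the question asks, a small variation (scalars broadcast against the array column):
s = pd.DataFrame({'Index1': df.Index1.unique(),
                  'Index2': 'StringA',
                  'Index3': 'StringA',
                  'val': 1})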
You can unstack, assign, stack:
new_df = df.unstack(level=(-1,-2))
# you can pass a series here
new_df[('val','StringA','StringA')] = 'ABC'
new_df.stack(level=(-1,-2))
Output:
val
Index1 Index2 Index3
A d 1 a
2 b
e 3 c
4 d
StringA StringA ABC
B f 5 e
6 f
g 7 g
StringA StringA ABC
C h 8 h
9 i
i 10 j
StringA StringA ABC
Or try using:
import numpy as np

groupby = df.groupby(df['Index1'], as_index=False).last()
groupby[['Index2', 'Index3', 'val']] = ['StringA', 'StringA', np.nan]
# note: Index3 now mixes integers and the marker string; on Python 3,
# sorting such a mixed object column can raise a TypeError (cast to str first if it does)
df = pd.concat([df, groupby]).sort_values(['Index1', 'Index3']).reset_index()
print(df)
Output:
index Index1 Index2 Index3 val
0 0 A d 1 a
1 1 A d 2 b
2 2 A e 3 c
3 3 A e 4 d
4 0 A StringA StringA NaN
5 4 B f 5 e
6 5 B f 6 f
7 6 B g 7 g
8 1 B StringA StringA NaN
9 7 C h 8 h
10 8 C h 9 i
11 9 C i 10 j
12 2 C StringA StringA NaN
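The leftover index column comes from reset_index(); passing drop=True discards it, a small variation:
df = pd.concat([df, groupby]).sort_values(['Index1', 'Index3']).reset_index(drop=True)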
Related
I have a data frame with two columns:
df1
col1 col2
A    A
B    A
B    A
C    B
     C
     D
     E
E    E
F    G
     G
     H
     H
Both columns are object dtype; I'm trying to fill column 1 with the value from column 2 wherever column 1 is null.
How can I apply this to a large dataset? I'm a beginner with pandas, trying to learn all the tricks here.
Expected output:
col1 col2
A A
B A
B A
C B
C C
D D
E E
E E
F G
G G
H H
H H
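A reproducible sketch of this input, assuming the blanks are empty strings (the answers below cover the NaN case as well):
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'B', 'C', '', '', '', 'E', 'F', '', '', ''],
                   'col2': ['A', 'A', 'A', 'B', 'C', 'D', 'E', 'E', 'G', 'G', 'H', 'H']})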
You can also use np.where. Note that it relies on element truthiness, so it treats empty strings in col1 (but not NaN) as missing:
df['col1'] = np.where(df['col1'], df['col1'], df['col2'])
Or use combine_first, after first converting the empty strings to proper null values:
df['col1'] = df['col1'].replace('', np.nan).combine_first(df['col2'])
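For illustration, a minimal runnable sketch of the np.where approach on a small frame with empty strings:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', '', 'F'], 'col2': ['A', 'C', 'G']})

# empty strings are falsy, so np.where falls back to col2 there
df['col1'] = np.where(df['col1'], df['col1'], df['col2'])
print(df['col1'].tolist())  # ['A', 'C', 'F']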
If the empty values are missing values (NaN):
print (df)
col1 col2
0 A A
1 B B
2 NaN C
3 NaN D
4 NaN E
5 E E
6 F F
7 NaN G
8 NaN H
9 NaN H
df['col1'] = df['col1'].fillna(df['col2'])
print (df)
col1 col2
0 A A
1 B B
2 C C
3 D D
4 E E
5 E E
6 F F
7 G G
8 H H
9 H H
If the empty values are empty strings:
print (df)
col1 col2
0 A A
1 B B
2 C
3 D
4 E
5 E E
6 F F
7 G
8 H
9 H
df['col1'] = df['col1'].mask(df['col1'] == '', df['col2'])
# or via replace + fillna (thanks U10-Forward):
df['col1'] = df['col1'].replace('', np.nan).fillna(df['col2'])
print (df)
col1 col2
0 A A
1 B B
2 C C
3 D D
4 E E
5 E E
6 F F
7 G G
8 H H
9 H H
Say I have a row of column headers, and associated values, in a pandas DataFrame:
print(df)
A B C D E F G H I J K
1 2 3 4 5 6 7 8 9 10 11
How do I go about displaying them like the following?
print(df)
A B C D E
1 2 3 4 5
F G H I J
6 7 8 9 10
K
11
Custom function:
import numpy as np

def new_repr(self):
    # split the columns into chunks of five and render each chunk separately
    # (groupby(..., axis=1) is deprecated in recent pandas; transpose and group rows if needed)
    g = self.groupby(np.arange(self.shape[1]) // 5, axis=1)
    return '\n\n'.join([d.to_string() for _, d in g])
print(new_repr(df))
A B C D E
0 1 2 3 4 5
F G H I J
0 6 7 8 9 10
K
0 11
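If you want this wrapping applied every time a frame is displayed, one option is to monkey-patch it in (use with care, as it changes display behaviour globally):
pd.DataFrame.__repr__ = new_repr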
Alternatively, shrink the display width so pandas wraps the frame by itself:
pd.set_option('display.width', 20)
pd.set_option('display.expand_frame_repr', True)
df
A B C D E \
0 1 2 3 4 5
F G H I J \
0 6 7 8 9 10
K
0 11
Problem:
I'm trying to split a pandas data frame by the repeating ranges in column A. My data and desired output are below. The ranges in column A are always increasing and never skip values; however, they start and stop at arbitrary points.
Data:
import pandas as pd
dict = {"A": [1,2,3,2,3,4,3,4,5,6],
"B": ["a","b","c","d","e","f","g","h","i","k"]}
df = pd.DataFrame(dict)
df
A B
0 1 a
1 2 b
2 3 c
3 2 d
4 3 e
5 4 f
6 3 g
7 4 h
8 5 i
9 6 k
Desired output:
df1
A B
0 1 a
1 2 b
2 3 c
df2
A B
0 2 d
1 3 e
2 4 f
df3
A B
0 3 g
1 4 h
2 5 i
3 6 k
Thanks for the advice!
Timings for the three answers below (each loop includes print I/O, so treat the numbers as rough):
from timeit import default_timer as timer
start = timer()
for x ,y in df.groupby(df.A.diff().ne(1).cumsum()):
print(y)
end = timer()
aa = end - start
start = timer()
s = (df.A.diff() != 1).cumsum()
g = df.groupby(s)
for _,g_ in g:
print(g_)
end = timer()
bb = end - start
start = timer()
[*(d for _, d in df.groupby(df.A.diff().ne(1).cumsum()))]
print(*(d for _, d in df.groupby(df.A.diff().ne(1).cumsum())), sep='\n\n')
end = timer()
cc = end - start
print(aa,bb,cc)
0.0176649530000077 0.018132143000002543 0.018715283999995336
Create the groupby key using diff and cumsum:
for x ,y in df.groupby(df.A.diff().ne(1).cumsum()):
print(y)
A B
0 1 a
1 2 b
2 3 c
A B
3 2 d
4 3 e
5 4 f
A B
6 3 g
7 4 h
8 5 i
9 6 k
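To see why this works, it helps to print the key itself: diff flags every row where the step from the previous value is not 1 (i.e. a new run begins), and cumsum turns those flags into run labels:
print(df.A.diff().ne(1).cumsum().tolist())
# [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]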
Just group by the cumulative sum of where the difference is not 1:
s = (df.A.diff() != 1).cumsum()
g = df.groupby(s)
for _,g_ in g:
print(g_)
Outputs
A B
0 1 a
1 2 b
2 3 c
A B
3 2 d
4 3 e
5 4 f
A B
6 3 g
7 4 h
8 5 i
9 6 k
One-liner (because that's important):
[*(d for _, d in df.groupby(df.A.diff().ne(1).cumsum()))]
Print it
print(*(d for _, d in df.groupby(df.A.diff().ne(1).cumsum())), sep='\n\n')
A B
0 1 a
1 2 b
2 3 c
A B
3 2 d
4 3 e
5 4 f
A B
6 3 g
7 4 h
8 5 i
9 6 k
Assign it
df1, df2, df3 = (d for _, d in df.groupby(df.A.diff().ne(1).cumsum()))
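If the number of runs is not known in advance, collecting the pieces into a dict avoids hard-coding the unpacking; reset_index(drop=True) gives each piece the fresh index shown in the desired output:
parts = {k: d.reset_index(drop=True)
         for k, d in df.groupby(df.A.diff().ne(1).cumsum())}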
Having a data frame that looks like this:
Col1 Col2 Col3
A B C
C D E
F G H
A B C
A H K
A B C
F G H
A B C
I need to find each repeated pattern, count it, and report the count in an extra column; the output would then be:
Col1 Col2 Col3 Count
A B C 4
C D E 1
F G H 2
A B C 4
A H K 1
A B C 4
F G H 2
A B C 4
The idea I have is to compare the size of the original data frame with its size after dropping duplicates via df.drop_duplicates, but I wonder if there is a nicer way?
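A reproducible sketch of this input:
import pandas as pd

df = pd.DataFrame({'Col1': list('ACFAAAFA'),
                   'Col2': list('BDGBHBGB'),
                   'Col3': list('CEHCKCHC')})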
Use groupby and transform, which broadcasts each group's size back onto every row:
In [4241]: df['Count'] = df.groupby(['Col1', 'Col2', 'Col3'])['Col1'].transform('size')
In [4242]: df
Out[4242]:
Col1 Col2 Col3 Count
0 A B C 4
1 C D E 1
2 F G H 2
3 A B C 4
4 A H K 1
5 A B C 4
6 F G H 2
7 A B C 4
Or, alternatively, use merge:
In [4256]: df.merge(df.groupby(['Col1', 'Col2', 'Col3']).size().reset_index(name='Count'),
how='left')
Out[4256]:
Col1 Col2 Col3 Count
0 A B C 4
1 C D E 1
2 F G H 2
3 A B C 4
4 A H K 1
5 A B C 4
6 F G H 2
7 A B C 4
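A third option along the same lines: build the counts once with DataFrame.value_counts (available in pandas >= 1.1) and join them back on the three columns, a sketch:
counts = df.value_counts(['Col1', 'Col2', 'Col3']).rename('Count')
df = df.join(counts, on=['Col1', 'Col2', 'Col3'])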
I have the following pandas dataframe:
A B C D
2 a 1 F
4 b 2 G
6 b 3 nan
1 c 4 G
5 c 5 nan
7 d 6 H
I want to replace the values in column B with the values in column D, while doing nothing for rows where column D is NaN.
Desired output:
A B C D
2 F 1 F
4 G 2 G
6 b 3 nan
1 G 4 G
5 c 5 nan
7 H 6 H
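A reproducible sketch of this input:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 4, 6, 1, 5, 7],
                   'B': list('abbccd'),
                   'C': [1, 2, 3, 4, 5, 6],
                   'D': ['F', 'G', np.nan, 'G', np.nan, 'H']})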
You can build a boolean mask of the rows of interest and pass it to loc so that only those rows are overwritten:
In [3]:
df.loc[df['D'].notnull(), 'B'] = df['D']
df
Out[3]:
A B C D
0 2 F 1 F
1 4 G 2 G
2 6 b 3 NaN
3 1 G 4 G
4 5 c 5 NaN
5 7 H 6 H
See the docs on boolean indexing and notnull
A few alternative solutions:
In [72]: df['B'] = df['D'].combine_first(df['B'])
In [73]: df
Out[73]:
A B C D
0 2 F 1 F
1 4 G 2 G
2 6 b 3 NaN
3 1 G 4 G
4 5 c 5 NaN
5 7 H 6 H
or:
df['B'] = df['D'].fillna(df['B'])
or:
df['B'] = df['D'].mask(df['D'].isnull(), df['B'])
or:
df['B'] = df['D'].where(df['D'].notnull(), df['B'])
or:
df['B'] = np.where(df['D'].notnull(), df['D'], df['B'])