Problem:
I'm trying to split a pandas data frame by the repeating ranges in column A. My data and output are as follows. The ranges in columns A are always increasing and do not skip values. The values in column A do start and stop arbitrarily, however.
Data:
import pandas as pd
dict = {"A": [1,2,3,2,3,4,3,4,5,6],
"B": ["a","b","c","d","e","f","g","h","i","k"]}
df = pd.DataFrame(dict)
df
A B
0 1 a
1 2 b
2 3 c
3 2 d
4 3 e
5 4 f
6 3 g
7 4 h
8 5 i
9 6 k
Desired ouptut:
df1
A B
0 1 a
1 2 b
2 3 c
df2
A B
0 2 d
1 3 e
2 4 f
df3
A B
0 3 g
1 4 h
2 5 i
3 6 k
Thanks for advice!
Answer times:
from timeit import default_timer as timer
start = timer()
for x ,y in df.groupby(df.A.diff().ne(1).cumsum()):
print(y)
end = timer()
aa = end - start
start = timer()
s = (df.A.diff() != 1).cumsum()
g = df.groupby(s)
for _,g_ in g:
print(g_)
end = timer()
bb = end - start
start = timer()
[*(d for _, d in df.groupby(df.A.diff().ne(1).cumsum()))]
print(*(d for _, d in df.groupby(df.A.diff().ne(1).cumsum())), sep='\n\n')
end = timer()
cc = end - start
print(aa,bb,cc)
0.0176649530000077 0.018132143000002543 0.018715283999995336
Create the groupby key by using diff and cumsum
for x ,y in df.groupby(df.A.diff().ne(1).cumsum()):
print(y)
A B
0 1 a
1 2 b
2 3 c
A B
3 2 d
4 3 e
5 4 f
A B
6 3 g
7 4 h
8 5 i
9 6 k
Just groupby by the difference
s = (df.A.diff() != 1).cumsum()
g = df.groupby(s)
for _,g_ in g:
print(g_)
Outputs
A B
0 1 a
1 2 b
2 3 c
A B
3 2 d
4 3 e
5 4 f
A B
6 3 g
7 4 h
8 5 i
9 6 k
One-liner
because that's important
[*(d for _, d in df.groupby(df.A.diff().ne(1).cumsum()))]
Print it
print(*(d for _, d in df.groupby(df.A.diff().ne(1).cumsum())), sep='\n\n')
A B
0 1 a
1 2 b
2 3 c
A B
3 2 d
4 3 e
5 4 f
A B
6 3 g
7 4 h
8 5 i
9 6 k
Assign it
df1, df2, df3 = (d for _, d in df.groupby(df.A.diff().ne(1).cumsum()))
Related
I want to groupby and drop groups if it satisfies two conditions (the shape is 3 and column A doesn't contain zeros).
My df
ID value
A 3
A 2
A 0
B 1
B 1
C 3
C 3
C 4
D 0
D 5
D 5
E 6
E 7
E 7
F 3
F 2
my desired df would be
ID value
A 3
A 2
A 0
B 1
B 1
D 0
D 5
D 5
F 3
F 2
You can use boolean indexing with groupby operations:
g = df['value'].eq(0).groupby(df['ID'])
# group contains a 0
m1 = g.transform('any')
# group doesn't have size 3
m2 = g.transform('size').ne(3)
# keep if any of the condition above is met
# this is equivalent to dropping if contains 0 AND size 3
out = df[m1|m2]
Output:
ID value
0 A 3
1 A 2
2 A 0
3 B 1
4 B 1
8 D 0
9 D 5
10 D 5
14 F 3
15 F 2
Given dataframe:
df = pd.DataFrame({'a':[1,2,4,5,6,8],
'b':[5,6,4,8,9,6],
'c':[6,3,3,7,8,4],
'd':[1,2,3,8,7,3],
'e':[3,2,4,4,6,2],
'f':[3,2,6,4,5,5]})
I want to divide/split df several parts (into 2,3,4.. n parts)
Desired output:
df1 =
a b c d e f
0 1 5 6 1 3 3
1 2 6 3 2 2 2
df2 =
a b c d e f
2 4 4 3 3 4 6
3 5 8 7 8 4 4
df3 =
a b c d e f
4 6 9 8 7 6 5
5 8 6 4 3 2 5
UPDATED
Real data has not equal dividable size!
real data 4351 rows × 3 columns
Use qcut to split. How you want to store it after is up to you
import pandas as pd
gp = df.groupby(pd.qcut(range(df.shape[0]), 3)) # N = 3
d = {f'df{i+1}': x[1] for i, x in enumerate(gp)}
d['df1']
# a b c d e f
#0 1 5 6 1 3 3
#1 2 6 3 2 2 2
Assuming your DataFrame can be evenly divided into n chunks:
n = 3
dfs = [df.loc[i] for i in np.split(df.index, n)]
dfs is a list containing 3 dataframes.
Say I have a row of column headers, and associated values in a Pandas Dataframe:
print df
A B C D E F G H I J K
1 2 3 4 5 6 7 8 9 10 11
how do I go about displaying them like the following:
print df
A B C D E
1 2 3 4 5
F G H I J
6 7 8 9 10
K
11
custom function
def new_repr(self):
g = self.groupby(np.arange(self.shape[1]) // 5, axis=1)
return '\n\n'.join([d.to_string() for _, d in g])
print(new_repr(df))
A B C D E
0 1 2 3 4 5
F G H I J
0 6 7 8 9 10
K
0 11
pd.set_option('display.width', 20)
pd.set_option('display.expand_frame_repr', True)
df
A B C D E \
0 1 2 3 4 5
F G H I J \
0 6 7 8 9 10
K
0 11
I am working on a huge dataframe and trying to create a new column, based on a condition in another column. Right now, I have a big while-loop and this calculation takes too much time, is there an easier way to do it?
With lambda for example?:
def promo(dataframe, a):
i=0
while i < len(dataframe)-1:
i=i+1
if dataframe.iloc[i-1,5] >= a:
dataframe.iloc[i-1,6] = 1
else:
dataframe.iloc[i-1,6] = 0
return dataframe
Don't use loops in pandas, they are slow compared to a vectorized solution - convert boolean mask to integers by astype True, False are converted to 1, 0:
dataframe = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':list('aaabbb'),
'F':[5,3,6,9,2,4],
'G':[5,3,6,9,2,4]
})
a = 5
dataframe['new'] = (dataframe.iloc[:,5] >= a).astype(int)
print (dataframe)
A B C D E F G new
0 a 4 7 1 a 5 5 1
1 b 5 8 3 a 3 3 0
2 c 4 9 5 a 6 6 1
3 d 5 4 7 b 9 9 1
4 e 5 2 1 b 2 2 0
5 f 4 3 0 b 4 4 0
If you want to overwrite the 7th column:
a = 5
dataframe.iloc[:,6] = (dataframe.iloc[:,5] >= a).astype(int)
print (dataframe)
A B C D E F G
0 a 4 7 1 a 5 1
1 b 5 8 3 a 3 0
2 c 4 9 5 a 6 1
3 d 5 4 7 b 9 1
4 e 5 2 1 b 2 0
5 f 4 3 0 b 4 0
Sorry I have a very simple question. So I have two dataframes that look like
Dataframe 1:
columns: a b c d e f g h
Dataframe 2:
columns: e ef
I'm trying to join Dataframe 2 on Dataframe 1 at column e, which should yield
columns: a b c d e ef g h
or
columns: a b c d e f g h ef
However:
df1.merge(df2, how = 'inner', on = 'e') yields a blank dataframe when I print it out.
'outer' merge only extends the dataframe vertically (like using an append function).
Would appreciate some help thank you!
You need same dtypes of columns for join, so need converting:
#convert string column to int
df1['e'] = df1['e'].astype(int)
#inner is default value, so can be omit
df1.merge(df2, on = 'e')
Sample:
df1 = pd.DataFrame({'a':list('abcdef'),
'b':[4,5,4,5,5,4],
'c':[7,8,9,4,2,3],
'd':[1,3,5,7,1,0],
'e':['5','3','6','9','2','4'],
'f':list('aaabbb'),
'g':[1,3,5,7,1,0]})
print (df1)
a b c d e f g
0 a 4 7 1 5 a 1
1 b 5 8 3 3 a 3
2 c 4 9 5 6 a 5
3 d 5 4 7 9 b 7
4 e 5 2 1 2 b 1
5 f 4 3 0 4 b 0
df2 = pd.DataFrame({'ef':[10,30,50,70,10,100],
'e':[5,3,6,9,0,7]})
print (df2)
e ef
0 5 10
1 3 30
2 6 50
3 9 70
4 0 10
5 7 100
df1['e'] = df1['e'].astype(int)
df = df1.merge(df2, on = 'e')
print (df)
a b c d e f g ef
0 a 4 7 1 5 a 1 10
1 b 5 8 3 3 a 3 30
2 c 4 9 5 6 a 5 50
3 d 5 4 7 9 b 7 70
Instead of
df1.merge(...)
try:
pd.merge(left=df1, right=df2, on ='e', how='inner')
You can do it like this:
def mergeDfs(df1,df2):
newDf = dict()
dfList = []
for i in df1:
l = len(i)
row = []
for j in range(l):
row.append(df1[i][j])
newDf[i] = row
dfList.append(i)
for i in df2:
l = len(i)
row = []
if i not in dfList:
for j in range(l):
row.append(df2[i][j])
newDf[i] = row
df = pd.DataFrame(newDf)
return df