Creating a column by addition of two adjacent rows with a condition

Create column E, filled from column C. If D is < 10, E should combine the C value of the earlier row with that of the current row.
This is my Input DataSet:
I,A,B,C,D
1,P,100+,L,15
2,P,100+,M,9
3,P,100+,N,15
4,P,100+,O,15
5,Q,100+,L,2
6,Q,100+,M,15
7,Q,100+,N,3
8,Q,100+,O,15
I tried using some for loops, but I think the shift or append functions could do this more directly. However, I am getting value errors when using the shift function.
Desired Output:
I,A,B,C,D,E
1,P,100+,L,15,L
2,P,100+,M,9,M+N
3,P,100+,N,15,M+N
4,P,100+,O,15,O
5,Q,100+,L,2,L+O
6,Q,100+,M,15,M+N
7,Q,100+,N,3,M+N
8,Q,100+,O,15,L+O
I am trying to work out column E as given in the desired output table above.

Using np.where and Series.shift:
# Where D < 10, take C from the next row (index + 1); otherwise keep C
df['E'] = np.where(df['D'] < 10, df['C'].shift(-1), df['C'])
# Join C and E wherever they differ
df['E'] = df.apply(lambda x: x.C + '+' + x.E if x.C != x.E else x.C, axis=1)
# Helper column: E of the previous row
df['F'] = df['E'].shift(1)
# Copy the joined value down to the row after each match
df['E'] = df.apply(lambda x: x.F if '+' in str(x.F) else x.E, axis=1)
df = df.drop(columns='F')
Output
I A B C D E
0 1 P 100+ L 15 L
1 2 P 100+ M 9 M+N
2 3 P 100+ N 15 M+N
3 4 P 100+ O 15 O
4 5 Q 100+ L 2 L+M
5 6 Q 100+ M 15 L+M
6 7 Q 100+ N 3 N+O
7 8 Q 100+ O 15 N+O

The idea is to create helper groups: keep the index values only where the mask is true using Series.where, forward-fill just one missing value, then set the new column with numpy.where, GroupBy.transform and join:
m = df['D'].lt(10)
g = df.index.to_series().where(m).ffill(limit=1)
df['E'] = np.where(g.notna(), df['C'].groupby(g.fillna(-1)).transform('+'.join), df['C'])
print (df)
I A B C D E
0 1 P 100+ L 15 L
1 2 P 100+ M 9 M+N
2 3 P 100+ N 15 M+N
3 4 P 100+ O 15 O
4 5 Q 100+ L 2 L+M
5 6 Q 100+ M 15 L+M
6 7 Q 100+ N 3 N+O
7 8 Q 100+ O 15 N+O

Related

Is there a good way to apply a function cumulatively to a pandas series of strings?

I have a Pandas data frame like this
x y
0 0 a
1 0 b
2 0 c
3 0 d
4 1 e
5 1 f
6 1 g
7 1 h
What I want to do is, for each value of x, create a series which cumulatively concatenates the strings that have already appeared in y for that value of x. In other words, I want to get a Pandas series like this:
0
1 a,
2 a,b,
3 a,b,c,
4
5 e,
6 e,f,
7 e,f,g,
I can do it using a double for loop:
dat = pd.DataFrame({'x': [0, 0, 0, 0, 1, 1, 1, 1],
                    'y': ['a','b','c','d','e','f','g','h']})
z = dat['x'].copy()
for i in range(dat.shape[0]):
    z[i] = ''
    for j in range(i):
        if dat['x'][j] == dat['x'][i]:
            z[i] += dat['y'][j] + ","
but I was wondering whether there is a quicker way. It seems that pandas expanding().apply() doesn't work for strings, and this is an open issue. Is there perhaps an efficient way of doing it that doesn't involve apply?

You can do this with shift and np.cumsum in a custom function:
def myfun(x):
    y = x.shift()
    return np.cumsum(y.fillna('').add(',').mask(y.isna(),'')).str[:-1]
df.groupby("x")['y'].apply(myfun)
0
1 a
2 a,b
3 a,b,c
4
5 e
6 e,f
7 e,f,g
Name: y, dtype: object

We can group the dataframe by x, then for each group apply cumsum and shift to column y, and write the result into a new column cum_y in dat:
dat['cum_y'] = ''
for _, g in dat.groupby('x'):
    dat['cum_y'].update(g['y'].add(',').cumsum().shift().str[:-1])
>>> dat
x y cum_y
0 0 a
1 0 b a
2 0 c a,b
3 0 d a,b,c
4 1 e
5 1 f e
6 1 g e,f
7 1 h e,f,g

Use GroupBy.transform with a lambda function that applies Series.shift, appends a comma and takes the cumulative sum, then remove the trailing separator:
f = lambda x: (x.shift(fill_value='') + ',').cumsum()
dat['z'] = dat.groupby('x')['y'].transform(f).str.strip(',')
print (dat)
x y z
0 0 a
1 0 b a
2 0 c a,b
3 0 d a,b,c
4 1 e
5 1 f e
6 1 g e,f
7 1 h e,f,g
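For comparison, here is a sketch that does the accumulation in plain Python inside transform, so it does not rely on cumsum working on strings. The helper name prior_joined is made up for this example. Like the other approaches, the total work grows with the length of the output strings, so it is not expected to be faster on long groups.

```python
import pandas as pd

dat = pd.DataFrame({'x': [0, 0, 0, 0, 1, 1, 1, 1],
                    'y': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})

def prior_joined(s):
    """For each row, join the values that appeared earlier in the same group."""
    out, seen = [], []
    for value in s:
        out.append(','.join(seen))  # everything seen before this row
        seen.append(value)
    return pd.Series(out, index=s.index)

dat['z'] = dat.groupby('x')['y'].transform(prior_joined)
print(dat)
```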

I would try using lists here, though I am unsure about the efficiency:
df.assign(y=df['y'].apply(lambda x: [x])).groupby('x')['y'].transform(
    lambda x: x.cumsum()).str.join(',')
It gives (each row includes its own value):
0 a
1 a,b
2 a,b,c
3 a,b,c,d
4 e
5 e,f
6 e,f,g
7 e,f,g,h
Name: y, dtype: object

You can also do:
(df['y'].apply(list)
.groupby(df['x'])
.transform(lambda x: x.cumsum().shift(fill_value=''))
.str.join(',')
)
Output:
0
1 a
2 a,b
3 a,b,c
4
5 e
6 e,f
7 e,f,g
Name: y, dtype: object

pd.DataFrame.update puts the result at the top of the dataframe

Let's say I have two dataframes like this:
n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'],'z':['0','0','0','0','0']}
nf = pd.DataFrame(n)
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(m)
I want to update the zeroes in the z column of nf with the values from the z column of mf, but only in the rows whose key in column x matches.
When I call
nf.update(mf)
I get
x y z
b 1 10
d 2 100
e 3 1000
d 4 0
e 5 0
instead of the desired output
x y z
a 1 0
b 2 10
c 3 0
d 4 100
e 5 1000

update aligns on the index, so both dataframes need to be indexed by x first. Here is how you can do it:
n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'],'z':['0','0','0','0','0']}
nf = pd.DataFrame(n).set_index('x')
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(m).set_index('x')
nf.update(mf)
nf = nf.reset_index()
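If you would rather not change the index of nf, a lookup with Series.map is another option. This is a sketch of a different technique from the update call above, and it assumes the x values in mf are unique; the name lookup is illustrative.

```python
import pandas as pd

nf = pd.DataFrame({'x': ['a', 'b', 'c', 'd', 'e'],
                   'y': ['1', '2', '3', '4', '5'],
                   'z': ['0', '0', '0', '0', '0']})
mf = pd.DataFrame({'x': ['b', 'd', 'e'],
                   'z': ['10', '100', '1000']})

# Series keyed by x, used as a lookup table (assumes unique x in mf)
lookup = mf.set_index('x')['z']

# Rows without a match come back as NaN and keep their original z
nf['z'] = nf['x'].map(lookup).fillna(nf['z'])
print(nf)
```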

Sorting a dataframe by another

I have an initial dataframe X:
x y z w
0 1 a b c
1 1 d e f
2 0 g h i
3 0 k l m
4 -1 n o p
5 -1 q r s
6 -1 t v à
with many columns and rows (this is a toy example). After applying some Machine Learning procedures, I get back a similar dataframe, but with the -1s changed to 0s or 1s and the rows sorted in a different way; for example:
x y z w
4 1 n o p
0 1 a b c
6 0 t v à
1 1 d e f
2 0 g h i
5 0 q r s
3 0 k l m
How can I sort the second dataframe in the same order as the first one? For example, like
x y z w
0 1 a b c
1 1 d e f
2 0 g h i
3 0 k l m
4 1 n o p
5 0 q r s
6 0 t v à

If you can't trust just sorting the indexes (e.g. if the first df's indexes are not sorted, or if you have something other than RangeIndex), just use loc
df2.loc[df.index]
x y z w
0 1 a b c
1 1 d e f
2 0 g h i
3 0 k l m
4 1 n o p
5 0 q r s
6 0 t v à

Use:
df.sort_index(inplace=True)
This restores the order simply by sorting on the index.
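The two answers only agree when the first dataframe's index is already sorted. A small sketch with made-up data (index labels 10, 30, 20 are chosen just to show the difference):

```python
import pandas as pd

# First frame: its index is deliberately NOT in sorted order
df = pd.DataFrame({'y': ['a', 'b', 'c']}, index=[10, 30, 20])
# Second frame: same labels, shuffled
df2 = pd.DataFrame({'y': ['c', 'a', 'b']}, index=[20, 10, 30])

by_loc = df2.loc[df.index]   # follows the order of df's index
by_sort = df2.sort_index()   # follows sorted label order instead

print(by_loc.index.tolist())   # [10, 30, 20]
print(by_sort.index.tolist())  # [10, 20, 30]
```

So sort_index is enough for a default RangeIndex, while loc with the first frame's index works for any ordering as long as every label exists in both frames.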

Adding a column to a dataframe after every nth column

I have a dataframe of 9,000 columns and 100 rows. I want to insert a column after every 3rd column such that its value is equal to 50 for all rows.
Existing DataFrame
0 1 2 3 4 5 6 7 8 9....9000
0 a b c d e f g h i j ....x
1 k l m n o p q r s t ....x
.
.
100 u v w x y z aa bb cc....x
Desired DataFrame
0 1 2 3 4 5 6 7 8 9....12000
0 a b c 50 d e f 50 g h i j ....x
1 k l m 50 n o p 50 q r s t ....x
.
.
100 u v w 50 x y z 50 aa bb cc....x

Create a new DataFrame from every 3rd column label, add .5 to those labels so they sort into the right position, and join it to the original with concat:
df.columns = np.arange(len(df.columns))
df1 = pd.DataFrame(50, index=df.index, columns= df.columns[2::3] + .5)
df2 = pd.concat([df, df1], axis=1).sort_index(axis=1)
df2.columns = np.arange(len(df2.columns))
print (df2)
0 1 2 3 4 5 6 7 8 9 10 11 12
0 a b c 50 d e f 50 g h i 50 j
1 k l m 50 n o p 50 q r s 50 t

NumPy
# How many columns to group
x = 3
# Get the shape of things
a = df.to_numpy()
m, n = a.shape
k = n // x
# Get only a multiple of x columns and reshape
b = a[:, :k * x].reshape(m, k, x)
# Get the other columns missed by b
c = a[:, k * x:]
# array of 50's that we'll append to the last dimension
_50 = np.ones((m, k, 1), np.int64) * 50
# append 50's and reshape back to 2D
d = np.append(b, _50, axis=2).reshape(m, k * (x + 1))
# Create DataFrame while appending the missing bit
pd.DataFrame(np.append(d, c, axis=1))
0 1 2 3 4 5 6 7 8 9 10 11 12
0 a b c 50 d e f 50 g h i 50 j
1 k l m 50 n o p 50 q r s 50 t
Setup
df = pd.DataFrame(np.reshape([*'abcdefghijklmnopqrst'], (2, -1)))

So here is one solution
s=pd.concat([y.assign(new=50) for x, y in df.groupby(np.arange(df.shape[1])//3,axis=1)],axis=1)
s.columns=np.arange(s.shape[1])
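As far as I know, DataFrame.groupby(..., axis=1) is deprecated in recent pandas releases, so here is a sketch of the same block-by-block idea using iloc slices and concat instead. The names n, pieces and the placeholder column name 'filler' are assumptions for this example; it uses the same setup frame as the NumPy answer.

```python
import numpy as np
import pandas as pd

# Same toy setup as above: 2 rows, 10 columns
df = pd.DataFrame(np.reshape(list('abcdefghijklmnopqrst'), (2, -1)))

n = 3  # insert a constant column after every n-th column
pieces = []
for start in range(0, df.shape[1], n):
    block = df.iloc[:, start:start + n]
    pieces.append(block)
    if block.shape[1] == n:  # only after a complete block of n columns
        pieces.append(pd.Series(50, index=df.index, name='filler'))

out = pd.concat(pieces, axis=1)
out.columns = range(out.shape[1])  # renumber columns 0..k
print(out)
```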

pandas apply and applymap functions are taking long time to run on large dataset

I apply two functions to a dataframe:
res = df.apply(lambda x:pd.Series(list(x)))
res = res.applymap(lambda x: x.strip('"') if isinstance(x, str) else x)
Update: the dataframe has almost 700,000 rows, and this takes a long time to run.
How can I reduce the running time?
Sample data :
A
----------
0 [1,4,3,c]
1 [t,g,h,j]
2 [d,g,e,w]
3 [f,i,j,h]
4 [m,z,s,e]
5 [q,f,d,s]
output:
A B C D E
-------------------------
0 [1,4,3,c] 1 4 3 c
1 [t,g,h,j] t g h j
2 [d,g,e,w] d g e w
3 [f,i,j,h] f i j h
4 [m,z,s,e] m z s e
5 [q,f,d,s] q f d s
The line res = df.apply(lambda x:pd.Series(list(x))) takes the items from each list and fills them one by one into separate columns, as shown above. There will be almost 38 columns.

I think:
res = df.apply(lambda x:pd.Series(list(x)))
should be changed to:
df1 = pd.DataFrame(df['A'].values.tolist())
print (df1)
0 1 2 3
0 1 4 3 c
1 t g h j
2 d g e w
3 f i j h
4 m z s e
5 q f d s
Second, if the columns do not mix numeric and string values:
cols = res.select_dtypes(object).columns
res[cols] = res[cols].apply(lambda x: x.str.strip('"'))
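If the columns may mix numbers and strings, one option is to strip the quotes while the data is still plain Python lists, then build the frame once. This is only a sketch with made-up sample rows; rows and res are illustrative names, and whether it is faster on 700,000 rows would need measuring.

```python
import pandas as pd

# Toy data: a column of lists with some quoted strings and one number
df = pd.DataFrame({'A': [[1, '"4"', '3', 'c'],
                         ['t', 'g', '"h"', 'j']]})

# Strip surrounding double quotes from string items only, before building the frame
rows = [[v.strip('"') if isinstance(v, str) else v for v in row]
        for row in df['A']]

# One column per list position, aligned to the original index
res = pd.DataFrame(rows, index=df.index)
print(res)
```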
