I have an initial dataframe X:
x y z w
0 1 a b c
1 1 d e f
2 0 g h i
3 0 k l m
4 -1 n o p
5 -1 q r s
6 -1 t v à
with many columns and rows (this is a toy example). After applying some Machine Learning procedures, I get back a similar dataframe, but with the -1s changed to 0s or 1s and the rows sorted in a different way; for example:
x y z w
4 1 n o p
0 1 a b c
6 0 t v à
1 1 d e f
2 0 g h i
5 0 q r s
3 0 k l m
How can I sort the second dataframe into the same order as the first one? For example, like
x y z w
0 1 a b c
1 1 d e f
2 0 g h i
3 0 k l m
4 1 n o p
5 0 q r s
6 0 t v à
If you can't trust simply sorting the indexes (e.g. if the first df's index is not sorted, or if you have something other than a RangeIndex), use loc:
df2.loc[df.index]
x y z w
0 1 a b c
1 1 d e f
2 0 g h i
3 0 k l m
4 1 n o p
5 0 q r s
6 0 t v à
Use:
df.sort_index(inplace=True)
It restores the original order simply by sorting on the index.
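A minimal sketch of both approaches on a small shuffled frame (the data here is a stand-in for the toy example above):

import pandas as pd

df = pd.DataFrame({'y': [1, 1, 0], 'z': list('abc')})   # original order
df2 = df.sample(frac=1, random_state=0)                 # shuffled, like the ML output

print(df2.loc[df.index])   # reorder by the original index
print(df2.sort_index())    # equivalent here, and it doesn't mutate df2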
I have a Pandas dataframe like this:
x y
0 0 a
1 0 b
2 0 c
3 0 d
4 1 e
5 1 f
6 1 g
7 1 h
What I want to do is, for each value of x, create a series that cumulatively concatenates the strings that have already appeared in y for that value of x. In other words, I want to get a Pandas series like this:
0
1 a,
2 a,b,
3 a,b,c,
4
5 e,
6 e,f,
7 e,f,g,
I can do it using a double for loop:
dat = pd.DataFrame({'x': [0, 0, 0, 0, 1, 1, 1, 1],
                    'y': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
z = dat['x'].copy()
for i in range(dat.shape[0]):
    z[i] = ''
    for j in range(i):
        if dat['x'][j] == dat['x'][i]:
            z[i] += dat['y'][j] + ","
but I was wondering whether there is a quicker way. It seems that pandas expanding().apply() doesn't work for strings (it is an open issue), but perhaps there is an efficient way of doing it that doesn't involve apply?
You can do it with shift and np.cumsum in a custom function:
def myfun(x):
    y = x.shift()
    return np.cumsum(y.fillna('').add(',').mask(y.isna(), '')).str[:-1]

dat.groupby("x")['y'].apply(myfun)
0
1 a
2 a,b
3 a,b,c
4
5 e
6 e,f
7 e,f,g
Name: y, dtype: object
We can group the dataframe by x; then, for each group, cumulatively sum and shift the column y and update the values in a new column cum_y in dat:
dat['cum_y'] = ''
for _, g in dat.groupby('x'):
    dat['cum_y'].update(g['y'].add(',').cumsum().shift().str[:-1])
>>> dat
x y cum_y
0 0 a
1 0 b a
2 0 c a,b
3 0 d a,b,c
4 1 e
5 1 f e
6 1 g e,f
7 1 h e,f,g
Use GroupBy.transform with a lambda function that applies Series.shift, appends ',', takes the cumulative sum, and finally strips the trailing separator:
f = lambda x: (x.shift(fill_value='') + ',').cumsum()
dat['z'] = dat.groupby('x')['y'].transform(f).str.strip(',')
print(dat)
x y z
0 0 a
1 0 b a
2 0 c a,b
3 0 d a,b,c
4 1 e
5 1 f e
6 1 g e,f
7 1 h e,f,g
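To see why the shift-then-cumsum trick works, here is a minimal sketch of the intermediate values for a single group (the series s is hypothetical, standing in for dat['y'] where x == 0):

import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd'])
shifted = s.shift(fill_value='') + ','   # [',', 'a,', 'b,', 'c,']
summed = shifted.cumsum()                # [',', ',a,', ',a,b,', ',a,b,c,']
print(summed.str.strip(','))             # ['', 'a', 'a,b', 'a,b,c']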
I would try to use lists here. I'm unsure about the efficiency, though...
dat.assign(y=dat['y'].apply(lambda x: [x])).groupby('x')['y'].transform(
    lambda x: x.cumsum()).str.join(',')
It gives the following (note that this variant includes the current row; the next answer shifts it out):
0 a
1 a,b
2 a,b,c
3 a,b,c,d
4 e
5 e,f
6 e,f,g
7 e,f,g,h
Name: y, dtype: object
You can also do:
(dat['y'].apply(list)
         .groupby(dat['x'])
         .transform(lambda x: x.cumsum().shift(fill_value=''))
         .str.join(',')
)
Output:
0
1 a
2 a,b
3 a,b,c
4
5 e
6 e,f
7 e,f,g
Name: y, dtype: object
I have a dataframe like this
import pandas as pd
df1 = pd.DataFrame({
    'key': list('AAABBC'),
    'prop1': list('xyzuuy'),
    'prop2': list('mnbnbb')
})
key prop1 prop2
0 A x m
1 A y n
2 A z b
3 B u n
4 B u b
5 C y b
and a dictionary like this (user input):
d = {
    'A': 2,
    'B': 1,
    'C': 3,
}
The keys of d refer to entries in the column key of df1, and the values indicate how often the rows of df1 that belong to the respective key should be present: 1 means that nothing has to be done, 2 means all rows should be copied once, and 3 that they should be copied twice.
For the example above, the expected output looks as follows:
key prop1 prop2
0 A x m
1 A y n
2 A z b
3 B u n
4 B u b
5 C y b
6 A x m # <-- copied, copy 1
7 A y n # <-- copied, copy 1
8 A z b # <-- copied, copy 1
9 C y b # <-- copied, copy 1
10 C y b # <-- copied, copy 2
So, the rows that belong to A have been copied once and added to df1, nothing had to be done about the rows that belong to B, and the rows that belong to C have been copied twice and were also added to df1.
I currently implement this as follows:
dfs_to_add = []
for el, val in d.items():
    if val > 1:
        _temp_df = pd.concat(
            [df1[df1['key'] == el]] * (val - 1)
        )
        dfs_to_add.append(_temp_df)
df_to_add = pd.concat(dfs_to_add)
df_final = pd.concat([df1, df_to_add]).reset_index(drop=True)
which gives me the desired output.
The code is rather ugly; does anyone see a more straightforward option to get to the same output?
The order is important, so in the case of A, I would need
0 A x m
1 A y n
2 A z b
0 A x m
1 A y n
2 A z b
and not
0 A x m
0 A x m
1 A y n
1 A y n
2 A z b
2 A z b
We can use concat + groupby:
df = pd.concat([pd.concat([y] * d.get(x)) for x, y in df1.groupby('key')])
key prop1 prop2
0 A x m
1 A y n
2 A z b
0 A x m
1 A y n
2 A z b
3 B u n
4 B u b
5 C y b
5 C y b
5 C y b
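As the output shows, the copies keep their original index labels; if a clean RangeIndex like in the expected output is wanted, a reset can follow (a minimal sketch of the same line):

df = pd.concat(
    [pd.concat([y] * d.get(x)) for x, y in df1.groupby('key')]
).reset_index(drop=True)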
One way, using Index.repeat with loc[] and Series.map:
m = df1.set_index('key', append=True)
out = m.loc[m.index.repeat(df1['key'].map(d))].reset_index('key')
print(out)
key prop1 prop2
0 A x m
0 A x m
1 A y n
1 A y n
2 A z b
2 A z b
3 B u n
4 B u b
5 C y b
5 C y b
5 C y b
You can try repeat:
df1.loc[df1.index.repeat(df1['key'].map(d))]
Output:
key prop1 prop2
0 A x m
0 A x m
1 A y n
1 A y n
2 A z b
2 A z b
3 B u n
4 B u b
5 C y b
5 C y b
5 C y b
If order is not important, use one of the other solutions.
If order is important, get the indices of the repeated values, repeat them with loc and append to the original:
idx = [x for k, v in d.items() for x in df1.index[df1['key'] == k].repeat(v - 1)]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df1, df1.loc[idx]], ignore_index=True)
print (df)
key prop1 prop2
0 A x m
1 A y n
2 A z b
3 B u n
4 B u b
5 C y b
6 A x m
7 A y n
8 A z b
9 C y b
10 C y b
Using DataFrame.merge and np.repeat:
df = df1.merge(
    pd.Series(np.repeat(list(d.keys()), list(d.values())), name='key'),
    on='key')
Result:
# print(df)
key prop1 prop2
0 A x m
1 A x m
2 A y n
3 A y n
4 A z b
5 A z b
6 B u n
7 B u b
8 C y b
9 C y b
10 C y b
I have a dataframe of 9,000 columns and 100 rows. I want to insert a column after every 3rd column such that its value is equal to 50 for all rows.
Existing DataFrame
0 1 2 3 4 5 6 7 8 9....9000
0 a b c d e f g h i j ....x
1 k l m n o p q r s t ....x
.
.
100 u v w x y z aa bb cc....x
Desired DataFrame
0 1 2 3 4 5 6 7 8 9....12000
0 a b c 50 d e f 50 g h i j ....x
1 k l m 50 n o p 50 q r s t ....x
.
.
100 u v w 50 x y z 50 aa bb cc....x
Create a new DataFrame by indexing every 3rd column, add .5 to the column labels so they sort into the right positions, and join to the original with concat:
df.columns = np.arange(len(df.columns))
df1 = pd.DataFrame(50, index=df.index, columns=df.columns[2::3] + .5)
df2 = pd.concat([df, df1], axis=1).sort_index(axis=1)
df2.columns = np.arange(len(df2.columns))
print(df2)
0 1 2 3 4 5 6 7 8 9 10 11 12
0 a b c 50 d e f 50 g h i 50 j
1 k l m 50 n o p 50 q r s 50 t
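A quick sanity check at the stated scale (a sketch with dummy numeric values; the real frame holds strings):

import numpy as np
import pandas as pd

big = pd.DataFrame(np.zeros((100, 9000)))
fifty = pd.DataFrame(50, index=big.index, columns=big.columns[2::3] + .5)
out = pd.concat([big, fifty], axis=1).sort_index(axis=1)
out.columns = np.arange(len(out.columns))
print(out.shape)   # (100, 12000): 9000 original columns + 3000 inserted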
Numpy
# How many columns to group
x = 3
# Get the shape of things
a = df.to_numpy()
m, n = a.shape
k = n // x
# Get only a multiple of x columns and reshape
b = a[:, :k * x].reshape(m, k, x)
# Get the other columns missed by b
c = a[:, k * x:]
# array of 50's that we'll append to the last dimension
_50 = np.ones((m, k, 1), np.int64) * 50
# append 50's and reshape back to 2D
d = np.append(b, _50, axis=2).reshape(m, k * (x + 1))
# Create DataFrame while appending the missing bit
pd.DataFrame(np.append(d, c, axis=1))
0 1 2 3 4 5 6 7 8 9 10 11 12
0 a b c 50 d e f 50 g h i 50 j
1 k l m 50 n o p 50 q r s 50 t
Setup
df = pd.DataFrame(np.reshape([*'abcdefghijklmnopqrst'], (2, -1)))
So here is one solution:
s = pd.concat(
    [y.assign(new=50) for x, y in df.groupby(np.arange(df.shape[1]) // 3, axis=1)],
    axis=1)
s.columns = np.arange(s.shape[1])
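One caveat (assuming the 2×10 setup above): assign appends new to every column group, including the leftover single-column group, so column 9 also gets a trailing 50, unlike the NumPy answer:

print(s.shape)                   # (2, 14), vs. (2, 13) from the NumPy answer
print(s.iloc[0, -2:].tolist())   # ['j', 50]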
I have two functions applied on a dataframe:
res = df.apply(lambda x: pd.Series(list(x)))
res = res.applymap(lambda x: x.strip('"') if isinstance(x, str) else x)
Update: the dataframe has almost 700,000 rows, and this takes a long time to run.
How can I reduce the running time?
Sample data :
A
----------
0 [1,4,3,c]
1 [t,g,h,j]
2 [d,g,e,w]
3 [f,i,j,h]
4 [m,z,s,e]
5 [q,f,d,s]
output:
A B C D E
-------------------------
0 [1,4,3,c] 1 4 3 c
1 [t,g,h,j] t g h j
2 [d,g,e,w] d g e w
3 [f,i,j,h] f i j h
4 [m,z,s,e] m z s e
5 [q,f,d,s] q f d s
The line res = df.apply(lambda x: pd.Series(list(x))) takes the items from each list and fills them one by one into separate columns, as shown above. There will be almost 38 columns.
I think:
res = df.apply(lambda x:pd.Series(list(x)))
should be changed to:
df1 = pd.DataFrame(df['A'].values.tolist())
print (df1)
0 1 2 3
0 1 4 3 c
1 t g h j
2 d g e w
3 f i j h
4 m z s e
5 q f d s
And for the second line, if the columns don't mix numeric and string values:
cols = res.select_dtypes(object).columns
res[cols] = res[cols].apply(lambda x: x.str.strip('"'))
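Putting both pieces together on the sample data (a minimal sketch, assuming column A holds lists of strings):

import pandas as pd

df = pd.DataFrame({'A': [['1', '4', '3', 'c'], ['t', 'g', 'h', 'j']]})

# build all columns in one shot instead of a row-wise apply
res = pd.DataFrame(df['A'].values.tolist())

# strip quotes column-wise, only on string columns
cols = res.select_dtypes(object).columns
res[cols] = res[cols].apply(lambda x: x.str.strip('"'))
print(res)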
Setup
import numpy as np
import pandas as pd
from string import ascii_uppercase

df = pd.DataFrame(np.array(list(ascii_uppercase[:25])).reshape(5, 5))
df
0 1 2 3 4
0 A B C D E
1 F G H I J
2 K L M N O
3 P Q R S T
4 U V W X Y
Question
How do I concatenate the strings along the off diagonals?
Expected Result
0 A
1 FB
2 KGC
3 PLHD
4 UQMIE
5 VRNJ
6 WSO
7 XT
8 Y
dtype: object
What I Tried
df.unstack().groupby(sum).sum()
This works fine, but @Zero's answer is far faster.
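For context on why that one-liner works (a sketch, assuming the 5×5 setup above): unstack yields a Series indexed by (column, row) tuples, cells on the same anti-diagonal share the same label sum, and grouping by the built-in sum collects exactly those cells:

s = df.unstack()
print(s.index[:6].tolist())   # [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (1, 0)]
print(s.groupby(sum).sum())   # joins the strings within each anti-diagonal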
You could do
In [1766]: arr = df.values[::-1, :] # or np.flipud(df.values)
In [1767]: N = arr.shape[0]
In [1768]: [''.join(arr.diagonal(i)) for i in range(-N+1, N)]
Out[1768]: ['A', 'FB', 'KGC', 'PLHD', 'UQMIE', 'VRNJ', 'WSO', 'XT', 'Y']
In [1769]: pd.Series([''.join(arr.diagonal(i)) for i in range(-N+1, N)])
Out[1769]:
0 A
1 FB
2 KGC
3 PLHD
4 UQMIE
5 VRNJ
6 WSO
7 XT
8 Y
dtype: object
You may also do arr.diagonal(i).sum() but ''.join is more explicit.