How to concatenate rows side by side in pandas - python

I want to combine every five rows of a dataset into a single row.
I have 700 rows, and I want to combine them five at a time.
A B C D E F G
1 10,11,12,13,14,15,16
2 17,18,19,20,21,22,23
3 24,25,26,27,28,29,30
4 31,32,33,34,35,36,37
5 38,39,40,41,42,43,44
.
.
.
.
.
700
After combining the first five rows, my first row should look like this:
A B C D E F G A B C D E F G A B C D E F G A B C D E F G A B C D E F G
1 10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44

If you can guarantee that the total number of rows you have is a multiple of 5, dipping into numpy will be the most efficient way to solve this problem:
import numpy as np
import pandas as pd
data = np.arange(70).reshape(-1, 7)
df = pd.DataFrame(data, columns=[*'ABCDEFG'])
print(df)
A B C D E F G
0 0 1 2 3 4 5 6
1 7 8 9 10 11 12 13
2 14 15 16 17 18 19 20
3 21 22 23 24 25 26 27
4 28 29 30 31 32 33 34
5 35 36 37 38 39 40 41
6 42 43 44 45 46 47 48
7 49 50 51 52 53 54 55
8 56 57 58 59 60 61 62
9 63 64 65 66 67 68 69
out = pd.DataFrame(
df.to_numpy().reshape(-1, df.shape[1] * 5),
columns=[*df.columns] * 5
)
print(out)
A B C D E F G A B C D E F ... B C D E F G A B C D E F G
0 0 1 2 3 4 5 6 7 8 9 10 11 12 ... 22 23 24 25 26 27 28 29 30 31 32 33 34
1 35 36 37 38 39 40 41 42 43 44 45 46 47 ... 57 58 59 60 61 62 63 64 65 66 67 68 69
[2 rows x 35 columns]
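If the row count is not a multiple of 5, the reshape above raises an error. A minimal sketch of one way around that, assuming it is acceptable to pad the last group with NaN (which forces a float dtype). The helper name combine_rows is hypothetical and not part of the answer above:

```python
import numpy as np
import pandas as pd

def combine_rows(df, n=5):
    """Lay every n consecutive rows side by side, padding the last group with NaN."""
    values = df.to_numpy(dtype=float)      # float so that NaN padding is possible
    n_rows, n_cols = values.shape
    pad = (-n_rows) % n                    # rows missing from the last group
    if pad:
        values = np.vstack([values, np.full((pad, n_cols), np.nan)])
    return pd.DataFrame(values.reshape(-1, n_cols * n),
                        columns=list(df.columns) * n)

# 12 rows: two full groups of 5 plus a partial group of 2
df = pd.DataFrame(np.arange(84).reshape(-1, 7), columns=list("ABCDEFG"))
out = combine_rows(df)
print(out.shape)
```

With 12 input rows this gives 3 output rows of 35 columns, where the last row holds 14 real values followed by NaN.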

You can do:
n = 5  # number of rows to combine
cols = df.columns.tolist() * n  # column labels repeated once per combined row
chunks = [df[i:i+n].reset_index(drop=True) for i in range(0, len(df), n)]
df2 = pd.concat([pd.DataFrame(chunk.stack()).T for chunk in chunks])
df2.columns = cols
df2.reset_index(drop=True, inplace=True)

See if this helps answer your question.
unstack turns the columns into rows, giving a Series indexed by (column name, row number). reset_index turns that Series into a dataframe, and set_index('level_0') makes the original column names the index, so that after transposing they become the column labels. Note that the values come out grouped by column (all of A, then all of B, and so on) rather than row by row.
df.unstack().reset_index().set_index('level_0')[[0]].T
level_0 A A A A A B B B B B ... F F F F F G G G G G
0 10 17 24 31 38 11 18 25 32 39 ... 15 22 29 36 43 16 23 30 37 44

The easiest way is to convert your dataframe to a numpy array, reshape it, then cast it back to a new dataframe. This assumes the number of rows is a multiple of 5.
Edit:
data = df  # your dataframe
new_dataframe = pd.DataFrame(data.to_numpy().reshape(len(data)//5, -1), columns=np.tile(data.columns, 5))

Stacking and unstacking data in pandas
Data in tables are often presented in multiple ways. Long form ("tidy data") refers to data that are stacked in a couple of columns, one of which holds categorical indicators for the values. In contrast, wide form ("unstacked data") is where each category has its own column.
In your example, you present the wide form of data, and you're trying to get it into long form. DataFrame.melt, DataFrame.groupby, DataFrame.pivot, DataFrame.stack, DataFrame.unstack, and DataFrame.reset_index are the methods that help convert between these forms.
Start with your original dataframe:
df = pd.DataFrame({
'A' : [10, 17, 24, 31, 38],
'B' : [11, 18, 25, 32, 39],
'C' : [12, 19, 26, 33, 40],
'D' : [13, 20, 27, 34, 41],
'E' : [14, 21, 28, 35, 42],
'F' : [15, 22, 29, 36, 43],
'G' : [16, 23, 30, 37, 44]})
A B C D E F G
0 10 11 12 13 14 15 16
1 17 18 19 20 21 22 23
2 24 25 26 27 28 29 30
3 31 32 33 34 35 36 37
4 38 39 40 41 42 43 44
Use DataFrame.melt to convert it to long form, then sort to get the order you requested. Passing ignore_index=False keeps the original row index, which helps us get back to wide form later.
melted_df = df.melt(ignore_index=False).sort_values(by='value')
variable value
0 A 10
0 B 11
0 C 12
0 D 13
0 E 14
0 F 15
0 G 16
1 A 17
1 B 18
...
Use groupby, unstack, and reset_index to convert it back to wide form. This is often the more difficult direction, because you have to group by the right combination of index and stacked variable before unstacking and resetting the index.
(melted_df
.reset_index() # puts the index values into a column called 'index'
.groupby(['index','variable']) # groups by the index and the variable
['value'] # selects the value column in each group
.first() # there is only one item per group, so just take it
.unstack() # moves the last level of the multi-index ('variable') into the columns
.reset_index() # turns 'index' back into a regular column
.set_index('index') # and sets it as the index again
)
A B C D E F G
0 10 11 12 13 14 15 16
1 17 18 19 20 21 22 23
2 24 25 26 27 28 29 30
3 31 32 33 34 35 36 37
4 38 39 40 41 42 43 44
This stuff can be quite difficult and takes trial and error. I would recommend making smaller versions of your problems and experimenting with them, so you can figure out how the functions work.
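For the trip back to wide form, DataFrame.pivot may be a shorter route than groupby plus unstack. A minimal sketch on a reduced three-column version of the frame, assuming each (index, variable) pair occurs exactly once, which pivot requires. The names long_df and wide_df are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [10, 17, 24],
    "B": [11, 18, 25],
    "C": [12, 19, 26],
})

# Long form: one row per (original row, column) pair; the original index is kept
# and then moved into a regular column named 'index'.
long_df = df.melt(ignore_index=False).reset_index()

# Back to wide form: 'index' becomes the row labels, 'variable' the column labels.
wide_df = long_df.pivot(index="index", columns="variable", values="value")
print(wide_df)
```

The result carries the names 'index' and 'variable' on its axes; rename_axis can drop them if an exact match with the original frame is needed.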

Try this: use np.arange() with floor division to assign a group number to every 5 rows, then build a new dataframe from the groups. This should work even if the number of rows in your df is not divisible by 5.
l = 5
(df.groupby(np.arange(len(df.index))//l)
.apply(lambda x: pd.DataFrame([x.to_numpy().ravel()]))
.set_axis(df.columns.tolist() * l,axis=1)
.reset_index(drop=True))
or
(df.groupby(np.arange(len(df.index))//5)
.apply(lambda x: x.reset_index(drop=True).stack())
.unstack(level=[1,2])
.droplevel(0,axis=1))
Output:
A B C D E F G A B C ... E F G A B C D E F G
0 9 0 3 2 6 2 9 1 7 5 ... 2 5 9 5 4 9 7 3 8 9
1 9 5 0 8 1 5 8 7 7 7 ... 6 3 5 5 2 3 9 7 5 6

Related

Pandas sum multi-index columns with same name

I know that I can sum columns with:
df["name1"]+df["name2"]
But how does summing work when the two column names are the same?
Given the following CSV:
,,College 1,,,,,,,,,,,,College 2,,,,,,,,,,,,College 3,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,Music,,,,Geography,,,,Business,,,,Mathematics,,,,Biology,,,,Geography,,,,Business,,,,Biology,,,,Technology,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D,F,P,M,D
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Year 1,M,0,5,7,9,2,18,5,10,4,9,6,2,4,14,18,11,10,19,18,20,3,17,19,13,4,9,6,2,0,10,11,14,4,12,12,5
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,F,0,13,14,11,0,6,8,6,2,12,14,9,9,17,12,18,6,17,16,14,0,4,2,5,2,12,14,9,10,11,18,20,0,5,7,8
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Year 2,M,5,10,6,6,1,20,5,18,4,9,6,2,10,13,15,19,2,18,16,13,1,19,5,12,4,9,6,2,1,13,15,18,3,19,8,16
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,F,1,11,14,15,0,9,9,2,2,12,14,9,7,17,18,14,9,18,13,14,0,9,2,10,2,12,14,9,0,17,19,19,0,4,6,4
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Evening,M,4,10,6,5,3,13,19,5,4,9,6,2,8,17,10,18,3,11,20,11,4,18,17,20,4,9,6,2,8,12,16,13,4,19,18,7
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,F,4,12,12,13,0,9,3,8,2,12,14,9,0,18,11,18,1,13,13,10,0,6,2,8,2,12,14,9,9,16,20,13,0,10,5,6
I can clean the file and setup a multi-index with pandas and numpy:
df = pd.read_csv("CollegeGrades2.csv", index_col=[0,1], header=[0,1,2], skiprows=lambda x: x%2 == 1)
df.columns = pd.MultiIndex.from_frame(df.columns.to_frame().apply(lambda x: np.where(x.str.contains('Unnamed'), np.nan, x)).ffill())
df.index = pd.MultiIndex.from_frame(df.index.to_frame().ffill())
df.groupby(level=0, sort=False).sum()
However my issue is that I want to total the subjects e.g. College 1 Geography + College 3 Geography and display them in the following output:
I have tried separating them out into different data frames, summing them, and then concatenating them, but in doing so I lose the headings. For example:
music = df2["College 1", "Music"]
geography = df2["College 1", "Geography"] + df2["College 3", "Geography"]
pd.concat([music,geography], axis=1).groupby(level=0, sort=False).sum()
How can I sum the subjects while maintaining my desired output? Any help would be appreciated.
Thank you.
You can also group by the column:
df.groupby(level=[1, 2], axis=1).sum().groupby(level=0).sum()
Result:
1 Biology Business Geography Mathematics Music Technology
2 D F M P D F M P D F M P D F M P D F M P D F M P
0
Evening 47 21 69 52 22 12 40 42 41 7 41 46 36 8 21 35 18 8 18 22 13 4 23 29
Year 1 68 26 63 57 22 12 40 42 34 5 34 45 29 13 30 31 20 0 21 18 13 4 19 17
Year 2 64 12 63 66 22 12 40 42 42 2 21 57 33 17 33 30 21 6 20 21 20 3 14 23
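Recent pandas releases deprecate groupby(..., axis=1). A minimal sketch of the same idea that avoids it by transposing, grouping on the index, and transposing back. The frame toy is a hypothetical cut-down version of the CSV (four columns, F grade only), not the full data:

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples([
    ("College 1", "Music", "F"),
    ("College 1", "Geography", "F"),
    ("College 3", "Geography", "F"),
    ("College 3", "Business", "F"),
])
index = pd.MultiIndex.from_tuples([
    ("Year 1", "M"), ("Year 1", "F"),
    ("Year 2", "M"), ("Year 2", "F"),
])
toy = pd.DataFrame(
    [[0, 2, 3, 4],
     [0, 0, 0, 2],
     [5, 1, 1, 4],
     [1, 0, 0, 2]],
    index=index, columns=columns,
)

# Sum columns that share the same (subject, grade), ignoring the college level.
by_subject = toy.T.groupby(level=[1, 2]).sum().T

# Then total the M and F rows within each year.
totals = by_subject.groupby(level=0).sum()
print(totals)
```

Here the two Geography columns are merged into one, so Year 1 Geography F is 2 + 3 + 0 + 0 = 5.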

Pandas: adding several columns to a dataframe in a single line

I come from an R background and am wondering if there is a single line of code to add several new columns to an existing dataframe in pandas, just like in dplyr. I have this code:
import pandas as pd
df = pd.DataFrame({'a': range(1, 11)})
df['b'] = range(11, 21)
df['c'] = range(21, 31)
df['d'] = range(31, 41)
df['e'] = range(41, 51)
Is there a way to add all of these columns to df in one line?
An example of what I want in R would be:
library(dplyr)
df <- data.frame('a' = 1:10)
df <- df %>% mutate(b = 11:20, c = 21:30, d = 31:40, e = 41:50)
There is assign:
df.assign(b=range(11,21), c=range(21,31), d=range(31,41))
Things are even easier when you have a dictionary:
# assume you get this from somewhere else
val_dict = {'b': range(11,21), 'c':range(21,31)}
df.assign(**val_dict)
Note that the second approach is required when a column name is not a valid keyword argument, for example a name containing spaces such as 'a b'.
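A minimal sketch of the dictionary form with a column name that contains a space. The name new_cols and its values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": range(1, 11)})

# 'a b' cannot be written as a keyword argument, so the names go in as dict keys.
new_cols = {"b": range(11, 21), "a b": range(21, 31)}

# assign returns a new dataframe and leaves df itself unchanged.
out = df.assign(**new_cols)
print(out.columns.tolist())
```

Because assign does not modify df in place, the result has to be bound to a name (or back to df) to be kept.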
As others have noted, you could build them all in the original construction of the dataframe, but if you need to add multiple columns at a later point, you can add them all in a single multiple-assignment statement:
df['b'], df['c'], df['d'], df['e'] = range(11, 21), range(21,31), range(31,41), range(41,51)
df = pd.DataFrame( {c: range(x, y) for c,x,y in [(chr(97+x), x*10+1, x*10+11) for x in range(5)]})
>>> df
a b c d e
0 1 11 21 31 41
1 2 12 22 32 42
2 3 13 23 33 43
3 4 14 24 34 44
4 5 15 25 35 45
5 6 16 26 36 46
6 7 17 27 37 47
7 8 18 28 38 48
8 9 19 29 39 49
9 10 20 30 40 50
Or to add to an existing dataframe:
df = pd.DataFrame({'a': range(1,11)})
df = pd.concat([df, pd.DataFrame( {c: range(x, y) for c,x,y in [(chr(97+x), x*10+1, x*10+11) for x in range(1, 5)]})], axis=1)
Check out this:
>>> from datar.all import f, tibble, mutate
>>> df = tibble(a = f[1:10])
>>> df >> mutate(b = f[11:20], c = f[21:30], d = f[31:40], e = f[41:50])
a b c d e
<int64> <int64> <int64> <int64> <int64>
0 1 11 21 31 41
1 2 12 22 32 42
2 3 13 23 33 43
3 4 14 24 34 44
4 5 15 25 35 45
5 6 16 26 36 46
6 7 17 27 37 47
7 8 18 28 38 48
8 9 19 29 39 49
9 10 20 30 40 50
I am the author of the datar package.
You just pass all the data and associated column name into pd.DataFrame just as you did with column 'a', separate with commas.
Like this:
df = pd.DataFrame({'a': range(1, 11), 'b' : range(11, 21)})

Iterating over dataframe and replace with values from another dataframe

I have 2 dataframes, df1 and df2, and df2 holds the min and max values for the corresponding columns.
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.randint(0,50,size=(10, 5)), columns=list('ABCDE'))
df2 = pd.DataFrame(np.array([[5,3,4,7,2],[30,20,30,40,50]]),columns=list('ABCDE'))
I would like to iterate through df1 and replace the cell values with those of df2 when the df1 cell value is below/above the respective columns' min/max values.
First, don't loop or iterate in pandas when a better, vectorized solution exists, as it does here.
Use numpy.select with broadcasting to set values by condition:
np.random.seed(123)
df1 = pd.DataFrame(np.random.randint(0,50,size=(10, 5)), columns=list('ABCDE'))
df2 = pd.DataFrame(np.array([[5,3,4,7,2],[30,20,30,40,50]]),columns=list('ABCDE'))
print (df1)
A B C D E
0 45 2 28 34 38
1 17 19 42 22 33
2 32 49 47 9 32
3 46 32 47 25 19
4 14 36 32 16 4
5 49 3 2 20 39
6 2 20 47 48 7
7 41 35 28 38 33
8 21 30 27 34 33
print (df2)
A B C D E
0 5 3 4 7 2
1 30 20 30 40 50
#for pandas below 0.24 change .to_numpy() to .values
min1 = df2.loc[0].to_numpy()
max1 = df2.loc[1].to_numpy()
arr = df1.to_numpy()
df = pd.DataFrame(np.select([arr < min1, arr > max1], [min1, max1], arr),
index=df1.index,
columns=df1.columns)
print (df)
A B C D E
0 30 3 28 34 38
1 17 19 30 22 33
2 30 20 30 9 32
3 30 20 30 25 19
4 14 20 30 16 4
5 30 3 4 20 39
6 5 20 30 40 7
7 30 20 28 38 33
8 21 20 27 34 33
9 12 20 4 40 5
Another, better solution uses numpy.clip:
df = pd.DataFrame(np.clip(arr, min1, max1), index=df1.index, columns=df1.columns)
print (df)
A B C D E
0 30 3 28 34 38
1 17 19 30 22 33
2 30 20 30 9 32
3 30 20 30 25 19
4 14 20 30 16 4
5 30 3 4 20 39
6 5 20 30 40 7
7 30 20 28 38 33
8 21 20 27 34 33
9 12 20 4 40 5
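The same clipping can probably be done without leaving pandas, using DataFrame.clip with per-column bounds. A minimal sketch, assuming row 0 of df2 holds the minimums and row 1 the maximums as in the question. The random data here uses a different generator and seed than the answer above, so the values will differ:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.integers(0, 50, size=(10, 5)), columns=list("ABCDE"))
df2 = pd.DataFrame([[5, 3, 4, 7, 2], [30, 20, 30, 40, 50]], columns=list("ABCDE"))

# axis=1 aligns the lower/upper Series with the column labels of df1.
clipped = df1.clip(lower=df2.loc[0], upper=df2.loc[1], axis=1)
print(clipped)
```

Aligning on column labels means the result does not depend on df1 and df2 listing their columns in the same order.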

Create a column with periodically repeated values in pandas

I have a sample data frame df with one column:
Cost
30
49
98
10
37
20
10
48
70
20
30
40
50
29
90
39
30
29
50
40
and a list: id_list = ["A","B","C","D"] which is a list with 4 different id types. I would like to create a new column in the data frame where the first 5 cost values will be "A" the next 5 cost values will be "B" .... and the last 5 cost values will be "D". Therefore, I want to repeat the elements of the id_list 5 times and my new df will be like this:
Cost ID
30 A
49 A
98 A
10 A
37 A
20 B
10 B
48 B
70 B
20 B
30 C
40 C
50 C
29 C
90 C
39 D
30 D
29 D
50 D
40 D
My actual data frame has many rows and the actual id_list has many elements.
The row-number is multiple of 5 so there will be an exact fill in the final data frame.
In general I know how to add a column with specific values in pandas data frame
but I don't know how to do this with the repeated values.
Could you suggest how can I do this in python?
Thanks in advance for any help
There is a function in numpy for this, repeat:
df['New']=np.repeat(id_list,5)
df
Out[23]:
Cost New
0 30 A
1 49 A
2 98 A
3 10 A
4 37 A
5 20 B
6 10 B
7 48 B
8 70 B
9 20 B
10 30 C
11 40 C
12 50 C
13 29 C
14 90 C
15 39 D
16 30 D
17 29 D
18 50 D
19 40 D
Numpy free v1
df.assign(ID=sum(zip(*[id_list] * 5), tuple()))
Cost ID
0 30 A
1 49 A
2 98 A
3 10 A
4 37 A
5 20 B
6 10 B
7 48 B
8 70 B
9 20 B
10 30 C
11 40 C
12 50 C
13 29 C
14 90 C
15 39 D
16 30 D
17 29 D
18 50 D
19 40 D
Numpy free v2
df.assign(ID=[x for x in id_list for _ in range(5)])
I would suggest something like this, which takes advantage of the [item]*n => [item, item, item, ...] list repetition that Python provides:
labels = ['label1', 'label2', 'label3']
num = 5
repeated = []
for i in labels:
    repeated.extend([i]*num)
You can then add the column to your dataframe.
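Another way to build the repeated column is to compute each row's block number with floor division and use it to index the list. This is a minimal sketch, assuming id_list has at least one entry per block of 5 rows (otherwise the indexing raises IndexError). The name block is illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Cost": [30, 49, 98, 10, 37, 20, 10, 48, 70, 20,
                            30, 40, 50, 29, 90, 39, 30, 29, 50, 40]})
id_list = ["A", "B", "C", "D"]
block = 5  # number of consecutive rows that share one id

# Row position // block gives 0 for rows 0-4, 1 for rows 5-9, and so on.
positions = np.arange(len(df)) // block
df["ID"] = np.asarray(id_list)[positions]
print(df)
```

Unlike np.repeat(id_list, 5), this form does not need the row count to be an exact multiple of the block size.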

Fastest way to sort each row in a pandas dataframe

I need to find the quickest way to sort each row in a dataframe with millions of rows and around a hundred columns.
So something like this:
A B C D
3 4 8 1
9 2 7 2
Needs to become:
A B C D
8 4 3 1
9 7 2 2
Right now I'm applying sort to each row and building up a new dataframe row by row. I'm also doing a couple of extra, less important things to each row (hence why I'm using pandas and not numpy). Could it be quicker to instead create a list of lists and then build the new dataframe at once? Or do I need to go cython?
I think I would do this in numpy:
In [11]: a = df.values
In [12]: a.sort(axis=1) # no ascending argument
In [13]: a = a[:, ::-1] # so reverse
In [14]: a
Out[14]:
array([[8, 4, 3, 1],
[9, 7, 2, 2]])
In [15]: pd.DataFrame(a, df.index, df.columns)
Out[15]:
A B C D
0 8 4 3 1
1 9 7 2 2
I had thought this might work, but it sorts the columns:
In [21]: df.sort(axis=1, ascending=False)
Out[21]:
D C B A
0 1 8 4 3
1 2 7 2 9
Ah, pandas raises:
In [22]: df.sort(df.columns, axis=1, ascending=False)
ValueError: When sorting by column, axis must be 0 (rows)
To add to the answer given by @Andy-Hayden: to do this in place on the whole frame... I'm not really sure why this works, but it does. There seems to be no control over the sort order.
In [97]: A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
In [98]: A
Out[98]:
one two three four five
0 22 63 72 46 49
1 43 30 69 33 25
2 93 24 21 56 39
3 3 57 52 11 74
In [99]: A.values.sort
Out[99]: <function ndarray.sort>
In [100]: A
Out[100]:
one two three four five
0 22 63 72 46 49
1 43 30 69 33 25
2 93 24 21 56 39
3 3 57 52 11 74
In [101]: A.values.sort()
In [102]: A
Out[102]:
one two three four five
0 22 46 49 63 72
1 25 30 33 43 69
2 21 24 39 56 93
3 3 11 52 57 74
In [103]: A = A.iloc[:,::-1]
In [104]: A
Out[104]:
five four three two one
0 72 63 49 46 22
1 69 43 33 30 25
2 93 56 39 24 21
3 74 57 52 11 3
I hope someone can explain the why of this, just happy that it works 8)
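A likely explanation, offered as an assumption rather than a guarantee: when all columns share one dtype, .values can return a view of the frame's underlying array, so sorting that array in place also changes the frame. Whether a view is returned depends on the pandas version and its copy-on-write settings, so a sketch that does not rely on it may be safer. The data below is randomly generated for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
A = pd.DataFrame(rng.integers(0, 100, size=(4, 5)),
                 columns=["one", "two", "three", "four", "five"])

# np.sort returns a sorted copy (ascending along each row); reversing the
# column order of that copy gives descending rows.
desc = np.sort(A.to_numpy(), axis=1)[:, ::-1]

# Build a new frame with the original labels instead of mutating A.
B = pd.DataFrame(desc, index=A.index, columns=A.columns)
print(B)
```

This keeps the column labels in their original positions, whereas A.iloc[:, ::-1] also reverses the labels.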
You could use DataFrame.apply.
Eg:
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
print (A)
one two three four five
0 2 75 44 53 46
1 18 51 73 80 66
2 35 91 86 44 25
3 60 97 57 33 79
A = A.apply(np.sort, axis = 1, raw=True) # raw=True passes each row as an ndarray, so the result stays a DataFrame
print(A)
one two three four five
0 2 44 46 53 75
1 18 51 66 73 80
2 25 35 44 86 91
3 33 57 60 79 97
Since you want it in descending order, you can simply multiply the dataframe by -1, sort it, and multiply by -1 again.
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
A = A * -1
A = A.apply(np.sort, axis = 1, raw=True)
A = A * -1
Instead of using the pd.DataFrame constructor, an easier way to assign the sorted values back is to select the columns with double brackets:
Original dataframe:
A B C D
3 4 8 1
9 2 7 2
df[['A', 'B', 'C', 'D']] = np.sort(df)[:, ::-1]
A B C D
0 8 4 3 1
1 9 7 2 2
This way you can also sort a part of the columns:
df[['B', 'C']] = np.sort(df[['B', 'C']])[:, ::-1]
A B C D
0 3 8 4 1
1 9 7 2 2
One could try this approach to preserve the integrity of the df:
import pandas as pd
import numpy as np
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
print (A)
print(type(A))
one two three four five
0 85 27 64 50 55
1 3 90 65 22 8
2 0 7 64 66 82
3 58 21 42 27 30
<class 'pandas.core.frame.DataFrame'>
B = A.apply(lambda x: np.sort(x), axis=1, raw=True)
print(B)
print(type(B))
one two three four five
0 27 50 55 64 85
1 3 8 22 65 90
2 0 7 64 66 82
3 21 27 30 42 58
<class 'pandas.core.frame.DataFrame'>
