Shuffle DataFrame rows except the first row - python

I am trying to randomize all rows in a data frame except for the first. I would like for the first row to always appear first, and the remaining rows can be in any randomized order.
My data frame is:
df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
Any suggestions as to how I can approach this?

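Try this, which keeps the first row (df[:1]) and appends the remaining rows in random order (sample(frac=1) returns all of them, shuffled):
df = pd.concat([df[:1], df[1:].sample(frac=1)]).reset_index(drop=True)
Test: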
In [38]: df
Out[38]:
a b c d e
0 2.070074 2.216060 -0.015823 0.686516 -0.738393
1 -1.213517 0.994057 0.634805 0.517844 -0.128375
2 0.937532 0.814923 -0.231120 1.970019 1.438927
3 1.499967 0.105707 1.255207 0.929084 -3.359826
4 0.418702 -0.894226 -1.088968 0.631398 0.152026
5 1.214119 -0.122633 0.983818 -0.445202 -0.807955
6 0.252078 -0.258703 -0.445209 -0.179094 1.180077
7 1.428827 -0.569009 -0.718485 0.161108 1.300349
8 -1.403100 2.154548 -0.492264 -0.544538 -0.061745
9 0.468671 0.004839 -0.738240 -0.385624 -0.532640
In [39]: df = pd.concat([df[:1], df[1:].sample(frac=1)]).reset_index(drop=True)
In [40]: df
Out[40]:
a b c d e
0 2.070074 2.216060 -0.015823 0.686516 -0.738393
1 0.468671 0.004839 -0.738240 -0.385624 -0.532640
2 0.418702 -0.894226 -1.088968 0.631398 0.152026
3 -1.213517 0.994057 0.634805 0.517844 -0.128375
4 1.428827 -0.569009 -0.718485 0.161108 1.300349
5 0.937532 0.814923 -0.231120 1.970019 1.438927
6 0.252078 -0.258703 -0.445209 -0.179094 1.180077
7 1.499967 0.105707 1.255207 0.929084 -3.359826
8 -1.403100 2.154548 -0.492264 -0.544538 -0.061745
9 1.214119 -0.122633 0.983818 -0.445202 -0.807955

Use numpy's shuffle
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(100).reshape(20, 5), columns=list('ABCDE'))
np.random.shuffle(df.values[1:, :])
print(df)
A B C D E
0 0 1 2 3 4
1 55 56 57 58 59
2 10 11 12 13 14
3 80 81 82 83 84
4 90 91 92 93 94
5 70 71 72 73 74
6 25 26 27 28 29
7 40 41 42 43 44
8 65 66 67 68 69
9 5 6 7 8 9
10 45 46 47 48 49
11 85 86 87 88 89
12 15 16 17 18 19
13 30 31 32 33 34
14 60 61 62 63 64
15 20 21 22 23 24
16 35 36 37 38 39
17 95 96 97 98 99
18 75 76 77 78 79
19 50 51 52 53 54
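np.random.shuffle shuffles an ndarray in place. When all columns share one dtype, the values attribute can return a view of the DataFrame's underlying ndarray, so shuffling that view reorders the DataFrame itself. To shuffle all but the first row, operate on the array slice [1:, :]. Note that this relies on values returning a view rather than a copy: with mixed column dtypes, or with copy-on-write enabled in newer pandas versions, df may be left unchanged or the call may raise an error.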
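If you would rather not depend on in-place modification of df.values, one alternative is to shuffle the row positions and reindex with iloc. This is only a sketch: the seed and the names rng, order and shuffled are my own choices for illustration, not part of either answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(100).reshape(20, 5), columns=list("ABCDE"))

rng = np.random.default_rng(0)  # seed chosen only to make the run repeatable

# Positions 1..n-1 in random order, with position 0 pinned at the front.
order = np.concatenate(([0], rng.permutation(np.arange(1, len(df)))))

# Build a new frame in that order; the original df is left untouched.
shuffled = df.iloc[order].reset_index(drop=True)
print(shuffled.head())
```

Because this builds a new DataFrame, it should behave the same whether or not the columns share a dtype.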

Related

Place data from a Pandas DF into a Grid or Template

I have a process whose end product is a pandas DataFrame. The output, which varies in both content and length, is structured like this example:
9 80340796
10 80340797
11 80340798
12 80340799
13 80340800
14 80340801
15 80340802
16 80340803
17 80340804
18 80340805
19 80340806
20 80340807
21 80340808
22 80340809
23 80340810
24 80340811
25 80340812
26 80340813
27 80340814
28 80340815
29 80340816
30 80340817
31 80340818
32 80340819
33 80340820
34 80340821
35 80340822
36 80340823
37 80340824
38 80340825
39 80340826
40 80340827
41 80340828
42 80340829
43 80340830
44 80340831
45 80340832
46 80340833
I need to get the numbers in the second column above, into the following grid format based on the numbers in the first column above.
1 2 3 4 5 6 7 8 9 10 11 12
A 1 9 17 25 33 41 49 57 65 73 81 89
B 2 10 18 26 34 42 50 58 66 74 82 90
C 3 11 19 27 35 43 51 59 67 75 83 91
D 4 12 20 28 36 44 52 60 68 76 84 92
E 5 13 21 29 37 45 53 61 69 77 85 93
F 6 14 22 30 38 46 54 62 70 78 86 94
G 7 15 23 31 39 47 55 63 71 79 87 95
H 8 16 24 32 40 48 56 64 72 80 88 96
So the end result in this example would be
Any advice on how to go about this would be much appreciated. I've been asked for this by a colleague, so the data is easy to read for their team (as it matches the layout of a physical test) but I have no idea how to produce it.
A pandas pivot table can do what you want, but first you have to create two auxiliary columns: one determining which column of the grid each value goes in, and another determining which row. You can compute them as shown in the following example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'num': list(range(9, 28)), 'val': list(range(80001, 80020))})
max_rows = 8  # number of rows in the grid (A to H)
df['row'] = (df['num'] - 1) % max_rows
df['col'] = np.ceil(df['num'] / max_rows).astype(int)
df.pivot_table(values=['val'], columns=['col'], index=['row'])
val
col 2 3 4
row
0 80001.0 80009.0 80017.0
1 80002.0 80010.0 80018.0
2 80003.0 80011.0 80019.0
3 80004.0 80012.0 NaN
4 80005.0 80013.0 NaN
5 80006.0 80014.0 NaN
6 80007.0 80015.0 NaN
7 80008.0 80016.0 NaN
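To get closer to the layout in the question (rows labelled A to H, columns 1 to 12, empty cells where no value exists), the pivot can be followed by a reindex. The sketch below assumes the 8 x 12 grid from the question; the names n_rows, n_cols and grid are mine.

```python
import string

import pandas as pd

df = pd.DataFrame({"num": list(range(9, 28)), "val": list(range(80001, 80020))})

n_rows, n_cols = 8, 12  # grid shape taken from the question (A-H by 1-12)

# Position 1 goes to A1, 2 to B1, ..., 8 to H1, 9 to A2, and so on.
df["row"] = [string.ascii_uppercase[i] for i in (df["num"] - 1) % n_rows]
df["col"] = (df["num"] - 1) // n_rows + 1

grid = (
    df.pivot(index="row", columns="col", values="val")
      # Force the full grid so unused rows and columns still appear (as NaN).
      .reindex(index=list(string.ascii_uppercase[:n_rows]),
               columns=range(1, n_cols + 1))
)
print(grid)
```

Cells with no matching position stay NaN, so the values are shown as floats; grid.fillna("") is one way to blank them out for display.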

Rolling average across several columns and rows

import random
import pandas as pd
df = pd.DataFrame({"A": random.sample(range(1, 100), 10),
"B":random.sample(range(1, 100), 10),
"C":random.sample(range(1, 100), 10)})
df["D"]="need_to_calc"
df
I need the value of Column D, Row 9 to equal the average of the block of cells from rows 6 through 8 across columns A through C. I want to do this for all rows.
I am not sure how to do this in a single pythonic action. Instead I have hacky temporary columns and ugly nonsense.
Is there a cleaner way to define this column without temporary tables?
You can do it like this:
means = df.rolling(3).mean().shift(1)
df['D'] = (means['A'] + means['B'] + means['C'])/3
Output:
A B C D
0 43 57 15 NaN
1 86 34 68 NaN
2 40 12 78 NaN
3 97 24 54 48.111111
4 90 42 10 54.777778
5 34 54 98 49.666667
6 98 36 31 55.888889
7 16 5 24 54.777778
8 35 53 67 44.000000
9 80 66 37 40.555556
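Alternatively, sum across the columns first, then take a rolling sum over 3 rows and divide by the 9 cells in each block: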
df["D"]= (df.sum(axis=1).rolling(window=3, min_periods=3).sum()/9).shift(1)
Example:
A B C D
0 62 89 12 need_to_calc
1 44 13 63 need_to_calc
2 28 21 54 need_to_calc
3 93 93 4 need_to_calc
4 95 84 42 need_to_calc
5 68 68 35 need_to_calc
6 3 92 56 need_to_calc
7 13 88 83 need_to_calc
8 22 37 23 need_to_calc
9 64 58 5 need_to_calc
Output:
A B C D
0 62 89 12 NaN
1 44 13 63 NaN
2 28 21 54 NaN
3 93 93 4 42.888889
4 95 84 42 45.888889
5 68 68 35 57.111111
6 3 92 56 64.666667
7 13 88 83 60.333333
8 22 37 23 56.222222
9 64 58 5 46.333333
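As a sanity check, both approaches can be compared against a brute-force mean of each 3 x 3 block. This sketch reuses the example data from the second answer; the names d1, d2 and d3 are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [62, 44, 28, 93, 95, 68, 3, 13, 22, 64],
    "B": [89, 13, 21, 93, 84, 68, 92, 88, 37, 58],
    "C": [12, 63, 54, 4, 42, 35, 56, 83, 23, 5],
})
cols = ["A", "B", "C"]

# Approach 1: per-column rolling means, then averaged across the columns.
d1 = df[cols].rolling(3).mean().shift(1).mean(axis=1)

# Approach 2: row sums, rolling sum over 3 rows, divided by the 9 cells.
d2 = (df[cols].sum(axis=1).rolling(3).sum() / 9).shift(1)

# Brute force: mean of the block formed by the three preceding rows.
d3 = pd.Series(
    [df[cols].iloc[i - 3:i].to_numpy().mean() if i >= 3 else np.nan
     for i in range(len(df))]
)

print(pd.DataFrame({"d1": d1, "d2": d2, "d3": d3}))
```

The two answers agree here because every block has the same number of cells per column; if some cells were missing, a mean of column means and a mean over all cells could differ.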

Split a Pandas Dataframe into multiple Dataframes based on Triangular Number Series

I have a DataFrame (df) and I need to split it into n number of Dataframes based on the column numbers. But, it has to follow the Triangular Series pattern:
df1 = df[[0]]
df2 = df[[1,2]]
df3 = df[[3,4,5]]
df4 = df[[6,7,8,9]]
etc.
Consider the dataframe df
df = pd.DataFrame(
np.arange(100).reshape(10, 10),
columns=list('ABCDEFGHIJ')
)
df
A B C D E F G H I J
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
2 20 21 22 23 24 25 26 27 28 29
3 30 31 32 33 34 35 36 37 38 39
4 40 41 42 43 44 45 46 47 48 49
5 50 51 52 53 54 55 56 57 58 59
6 60 61 62 63 64 65 66 67 68 69
7 70 71 72 73 74 75 76 77 78 79
8 80 81 82 83 84 85 86 87 88 89
9 90 91 92 93 94 95 96 97 98 99
i_s, j_s = np.arange(4).cumsum(), np.arange(1, 5).cumsum()
df1, df2, df3, df4 = [
df.iloc[:, i:j] for i, j in zip(i_s, j_s)
]
Verify
pd.concat(dict(enumerate([df.iloc[:, i:j] for i, j in zip(i_s, j_s)])), axis=1)
0 1 2 3
A B C D E F G H I J
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
2 20 21 22 23 24 25 26 27 28 29
3 30 31 32 33 34 35 36 37 38 39
4 40 41 42 43 44 45 46 47 48 49
5 50 51 52 53 54 55 56 57 58 59
6 60 61 62 63 64 65 66 67 68 69
7 70 71 72 73 74 75 76 77 78 79
8 80 81 82 83 84 85 86 87 88 89
9 90 91 92 93 94 95 96 97 98 99
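First get the boundaries of the triangular number series, then use them to slice the DataFrame's columns:
n = len(df.columns)
res = []             # (begin, end) column positions of each block
begin, size = 0, 1
while begin < n:
    end = min(begin + size, n)   # the last block may be shorter
    res.append((begin, end))
    begin, size = end, size + 1
dfs = [df.iloc[:, b:e] for b, e in res]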

Retrieving all the rows from a csv file and plotting

I need to retrieve the rows from a csv file generated from the function:
import pandas as pd
def your_func(row):
    # magnitude of the momentum vector divided by the mass
    return (row['x-momentum']**2 + row['y-momentum']**2 + row['z-momentum']**2)**0.5 / row['mass']
columns_to_keep = ['#time', 'x-momentum', 'y-momentum', 'z-momentum', 'mass']
dataframe = pd.read_csv("./flash.csv", usecols=columns_to_keep)
dataframe['mean_velocity'] = dataframe.apply(your_func, axis=1)
print(dataframe)
I got rows up until 29 s, then the output skipped to the last few lines. I also need to plot column 2 against column 1.
You can adjust the pd.options.display.max_rows option. It only controls how many rows are printed, not what is stored, so your plots will still contain all your data.
demo:
In [25]: df = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=list('ABC'))
In [26]: df
Out[26]:
A B C
0 93 76 5
1 33 70 12
2 50 52 26
3 88 98 85
4 90 93 92
5 66 10 58
6 82 43 39
7 17 20 91
8 47 90 33
9 44 30 26
In [27]: pd.options.display.max_rows = 4
Now it'll display 4 rows at most
In [36]: df
Out[36]:
A B C
0 93 76 5
1 33 70 12
.. .. .. ..
8 47 90 33
9 44 30 26
[10 rows x 3 columns]
but it'll plot all your data
In [37]: df.plot.bar()
Out[37]: <matplotlib.axes._subplots.AxesSubplot at 0x49e2d68>
In [38]: pd.options.display.max_rows = 60
In [39]: df
Out[39]:
A B C
0 93 76 5
1 33 70 12
2 50 52 26
3 88 98 85
4 90 93 92
5 66 10 58
6 82 43 39
7 17 20 91
8 47 90 33
9 44 30 26
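For the original question, the velocity column can also be computed without apply, and the display limit lifted only for a single print. The sketch below uses a small in-memory CSV as a hypothetical stand-in for flash.csv (the numbers are made up for illustration); only the column names come from the question.

```python
import io

import pandas as pd

# Hypothetical stand-in for ./flash.csv; the values are invented.
csv_text = """#time,x-momentum,y-momentum,z-momentum,mass
0.0,3.0,4.0,0.0,1.0
1.0,0.0,6.0,8.0,2.0
2.0,2.0,3.0,6.0,7.0
"""
dataframe = pd.read_csv(io.StringIO(csv_text))

# Vectorised version of your_func: |momentum| / mass for every row at once.
momentum = dataframe[["x-momentum", "y-momentum", "z-momentum"]]
dataframe["mean_velocity"] = (momentum ** 2).sum(axis=1) ** 0.5 / dataframe["mass"]

# Lift the row limit for this print only; the global option is unchanged.
with pd.option_context("display.max_rows", None):
    print(dataframe)
```

With matplotlib installed, dataframe.plot(x="#time", y="mean_velocity") should then draw the computed column against time using every row, regardless of the display setting.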

pandas - get a fraction of each multi-index level label rows

I have an XY problem. My setup is as follows - I have a dataframe with multi-index of 2 levels. I want to split it to two dataframes, taking only a fraction of rows from each label in the first level. For example:
df = pd.DataFrame({'a':[1, 1, 1, 1, 7, 7, 10, 10, 10, 10, 10, 10, 10], 'b': np.random.randint(0, 100, 13), 'c':np.random.randint(0, 100, 13)}).set_index(['a', 'b'])
df
Out[13]:
c
a b
1 86 83
1 37
57 64
53 5
7 4 66
13 49
10 61 0
32 84
97 59
69 98
25 52
17 31
37 95
So let's say the fraction is 0.5, I want to split it to two dataframes:
c
a b
1 86 83
1 37
7 4 66
10 61 0
32 84
97 59
69 98
c
a b
1 57 64
53 5
7 13 49
10 25 52
17 31
37 95
I thought about doing (df.groupby(level = 0).count() * 0.5).astype(int) to get the limit on which to "slice" the dataframe. Then, if only I had a way to add a running distance such as this:
c r
a b
1 38 36 0
6 47 1
57 6 2
55 45 3
7 7 51 0
90 96 1
10 59 75 0
27 16 1
58 7 2
79 51 3
58 77 4
63 48 5
87 60 6
I could join the limits and this df and filter with a boolean condition. Any suggestions on either problem? (splitting a fraction of rows or adding a level-aware running index)
This turns out to be pretty trivial with groupby:
In [36]: df.groupby(level=0).apply(lambda x:x.head(int(x.shape[0] * 0.5))).reset_index(level=0, drop=True)
Out[36]:
c
a b
1 86 83
1 37
7 4 66
10 61 0
32 84
97 59
Also getting the running index per group:
In [33]: df.groupby(level=0).cumcount()
Out[33]:
a b
1 38 0
6 1
57 2
55 3
7 7 0
90 1
10 59 0
27 1
58 2
79 3
58 4
63 5
87 6
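Putting the two pieces together, the running index can be compared with a per-group limit to produce both halves at once, as the question proposed. This is a sketch using the fixed values from the question's first table rather than random data; frac, position, limit, first and second are names I picked.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 1, 1, 7, 7, 10, 10, 10, 10, 10, 10, 10],
    "b": [86, 1, 57, 53, 4, 13, 61, 32, 97, 69, 25, 17, 37],
    "c": [83, 37, 64, 5, 66, 49, 0, 84, 59, 98, 52, 31, 95],
}).set_index(["a", "b"])

frac = 0.5
groups = df.groupby(level=0)

# Running position of each row within its first-level group: 0, 1, 2, ...
position = groups.cumcount().to_numpy()

# Number of rows to keep per group, broadcast back to every row.
limit = (groups["c"].transform("size") * frac).astype(int).to_numpy()

mask = position < limit
first, second = df[mask], df[~mask]
print(first)
print(second)
```

Note that int() truncates, so the group with 7 rows contributes 3 rows to the first frame and 4 to the second; rounding up instead would give the 4/3 split shown in the question.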
