import random
import pandas as pd

df = pd.DataFrame({"A": random.sample(range(1, 100), 10),
                   "B": random.sample(range(1, 100), 10),
                   "C": random.sample(range(1, 100), 10)})
df["D"] = "need_to_calc"
df
I need the value of Column D, Row 9 to equal the average of the block of cells from rows 6 through 8 across columns A through C. I want to do this for all rows.
I am not sure how to do this in a single pythonic action. Instead I have hacky temporary columns and ugly nonsense.
Is there a cleaner way to define this column without temporary tables?
You can do it like this:
means = df[['A', 'B', 'C']].rolling(3).mean().shift(1)  # select the numeric columns; 'D' holds strings
df['D'] = (means['A'] + means['B'] + means['C']) / 3
Output:
A B C D
0 43 57 15 NaN
1 86 34 68 NaN
2 40 12 78 NaN
3 97 24 54 48.111111
4 90 42 10 54.777778
5 34 54 98 49.666667
6 98 36 31 55.888889
7 16 5 24 54.777778
8 35 53 67 44.000000
9 80 66 37 40.555556
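Since the mean of the 3×3 block equals the mean of the three per-row means, the same result can be had in one expression (a minimal equivalent sketch, assuming A, B and C are the value columns):
df['D'] = df[['A', 'B', 'C']].mean(axis=1).rolling(3).mean().shift(1)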
You can also do it this way; the rolling 3-row sum of the row sums covers 3 × 3 = 9 cells, hence the division by 9:
df['D'] = (df[['A', 'B', 'C']].sum(axis=1).rolling(window=3, min_periods=3).sum() / 9).shift(1)
Example input:
A B C D
0 62 89 12 need_to_calc
1 44 13 63 need_to_calc
2 28 21 54 need_to_calc
3 93 93 4 need_to_calc
4 95 84 42 need_to_calc
5 68 68 35 need_to_calc
6 3 92 56 need_to_calc
7 13 88 83 need_to_calc
8 22 37 23 need_to_calc
9 64 58 5 need_to_calc
Output:
A B C D
0 62 89 12 NaN
1 44 13 63 NaN
2 28 21 54 NaN
3 93 93 4 42.888889
4 95 84 42 45.888889
5 68 68 35 57.111111
6 3 92 56 64.666667
7 13 88 83 60.333333
8 22 37 23 56.222222
9 64 58 5 46.333333
I have the following DataFrame:
model_year cylinders mpg
0 70 4 25.285714
1 70 6 20.500000
2 70 8 14.111111
3 71 4 27.461538
4 71 6 18.000000
5 71 8 13.428571
6 72 3 19.000000
7 72 4 23.428571
8 72 8 13.615385
9 73 3 18.000000
10 73 4 22.727273
11 73 6 19.000000
12 73 8 13.200000
13 74 4 27.800000
14 74 6 17.857143
15 74 8 14.200000
16 75 4 25.250000
17 75 6 17.583333
18 75 8 15.666667
19 76 4 26.766667
20 76 6 20.000000
21 76 8 14.666667
22 77 3 21.500000
23 77 4 29.107143
24 77 6 19.500000
25 77 8 16.000000
26 78 4 29.576471
27 78 5 20.300000
28 78 6 19.066667
29 78 8 19.050000
30 79 4 31.525000
31 79 5 25.400000
32 79 6 22.950000
33 79 8 18.630000
34 80 3 23.700000
35 80 4 34.612000
36 80 5 36.400000
37 80 6 25.900000
38 81 4 32.814286
39 81 6 23.428571
40 81 8 26.600000
41 82 4 32.071429
42 82 6 28.333333
I want to select rows that fulfill the following condition:
For each model_year, select the row with the minimal value of cylinders in that year.
So, for instance, for model years = 70, 71, 72 and 73 I want to get:
model_year cylinders mpg
0 70 4 25.285714
3 71 4 27.461538
6 72 3 19.000000
9 73 3 18.000000
My most advanced attempt consisted of this:
I converted the model_year and cylinders columns into a MultiIndex on the DataFrame.
Using (among others) the groupby method, I obtained a MultiIndex object of the rows I'd like to select.
However, I couldn't find a way to select rows using that MultiIndex object.
For reference, the MultiIndex I obtained is:
MultiIndex([(70, 4),
(71, 4),
(72, 3),
(73, 3),
(74, 4),
(75, 4),
(76, 4),
(77, 3),
(78, 4),
(79, 4),
(80, 3),
(81, 4),
(82, 4)],
names=['model_year', 'cylinders'])
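For reference, with that MultiIndex in hand you could select the rows by setting those two columns as the index and using .loc (a sketch, assuming the MultiIndex above is stored in a variable idx):
out = df.set_index(['model_year', 'cylinders']).loc[idx].reset_index()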
I think a simpler solution would actually be to use groupby + transform:
selected = df[df['cylinders'] == df.groupby('model_year')['cylinders'].transform('min')]
Output:
>>> selected
model_year cylinders mpg
0 70 4 25.285714
3 71 4 27.461538
6 72 3 19.000000
9 73 3 18.000000
13 74 4 27.800000
16 75 4 25.250000
19 76 4 26.766667
22 77 3 21.500000
26 78 4 29.576471
30 79 4 31.525000
34 80 3 23.700000
38 81 4 32.814286
41 82 4 32.071429
(Note that if multiple rows within a group are tied for the minimum, they will all be included in the output.)
You could use groupby + idxmin to get, for each group, the index of the row with the minimal value, and then select those rows with .loc:
out = df.loc[df.groupby('model_year')['cylinders'].idxmin()]
Output:
model_year cylinders mpg
0 70 4 25.285714
3 71 4 27.461538
6 72 3 19.000000
9 73 3 18.000000
13 74 4 27.800000
16 75 4 25.250000
19 76 4 26.766667
22 77 3 21.500000
26 78 4 29.576471
30 79 4 31.525000
34 80 3 23.700000
38 81 4 32.814286
41 82 4 32.071429
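Note that idxmin returns only the first matching index per group, so unlike the transform approach this keeps exactly one row per year even when the minimum is tied.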
You can also just sort by cylinders and drop duplicate years; after an ascending sort, keeping the first occurrence per year keeps the minimum:
out = df.sort_values('cylinders').drop_duplicates('model_year').sort_index()
I have a process whose end product is a pandas DataFrame. The output, which is variable in terms of data and length, is structured like this example:
9 80340796
10 80340797
11 80340798
12 80340799
13 80340800
14 80340801
15 80340802
16 80340803
17 80340804
18 80340805
19 80340806
20 80340807
21 80340808
22 80340809
23 80340810
24 80340811
25 80340812
26 80340813
27 80340814
28 80340815
29 80340816
30 80340817
31 80340818
32 80340819
33 80340820
34 80340821
35 80340822
36 80340823
37 80340824
38 80340825
39 80340826
40 80340827
41 80340828
42 80340829
43 80340830
44 80340831
45 80340832
46 80340833
I need to get the numbers in the second column above, into the following grid format based on the numbers in the first column above.
1 2 3 4 5 6 7 8 9 10 11 12
A 1 9 17 25 33 41 49 57 65 73 81 89
B 2 10 18 26 34 42 50 58 66 74 82 90
C 3 11 19 27 35 43 51 59 67 75 83 91
D 4 12 20 28 36 44 52 60 68 76 84 92
E 5 13 21 29 37 45 53 61 69 77 85 93
F 6 14 22 30 38 46 54 62 70 78 86 94
G 7 15 23 31 39 47 55 63 71 79 87 95
H 8 16 24 32 40 48 56 64 72 80 88 96
So the end result in this example would be the values above laid out in that grid.
Any advice on how to go about this would be much appreciated. I've been asked for this by a colleague so that the data is easy for their team to read (it matches the layout of a physical test), but I have no idea how to produce it.
A pandas pivot table can do what you want, but first you have to create two auxiliary columns: one determining which grid column each value goes into, and another determining the row. You can compute them as shown in the following example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'num': list(range(9, 28)), 'val': list(range(80001, 80020))})
max_rows = 8
df['row'] = (df['num'] - 1) % max_rows                  # 0-based row within the grid
df['col'] = np.ceil(df['num'] / max_rows).astype(int)   # 1-based grid column
df.pivot_table(values=['val'], columns=['col'], index=['row'])
val
col 2 3 4
row
0 80001.0 80009.0 80017.0
1 80002.0 80010.0 80018.0
2 80003.0 80011.0 80019.0
3 80004.0 80012.0 NaN
4 80005.0 80013.0 NaN
5 80006.0 80014.0 NaN
6 80007.0 80015.0 NaN
7 80008.0 80016.0 NaN
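If you also want the letter row labels from the question (A, B, C, ...), one option is to map the integer row index to letters afterwards; a sketch, assuming the pivot result is stored in a variable out:
out = df.pivot_table(values='val', columns='col', index='row')
out.index = [chr(ord('A') + i) for i in out.index]  # 0 -> 'A', 1 -> 'B', ...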
I want to convert N columns into one Series. How can I do it efficiently?
Input:
0 1 2 3
0 64 98 47 58
1 80 94 81 46
2 18 43 79 84
3 57 35 81 31
Expected Output:
0 64
1 80
2 18
3 57
4 98
5 94
6 43
7 35
8 47
9 81
10 79
11 81
12 58
13 46
14 84
15 31
dtype: int64
So far I tried:
print(df[0].append(df[1]).append(df[2]).append(df[3]).reset_index(drop=True))
I'm not satisfied with my solution; moreover, it won't work for a dynamic number of columns. Please help me find a better approach.
You can use unstack:
pd.Series(df.unstack().values)
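Note that df.unstack() already returns a Series (indexed by column label, then row label, in column-major order), so an equivalent spelling simply drops that MultiIndex:
df.unstack().reset_index(drop=True)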
You can use numpy's flatten with Fortran (column-major) order:
pd.Series(df.values.flatten(order='F'))
Output:
0 64
1 80
2 18
3 57
4 98
5 94
6 43
7 35
8 47
9 81
10 79
11 81
12 58
13 46
14 84
15 31
dtype: int64
Here's yet another short one.
>>> pd.Series(df.values.ravel(order='F'))
0 64
1 80
2 18
3 57
4 98
5 94
6 43
7 35
8 47
9 81
10 79
11 81
12 58
13 46
14 84
15 31
dtype: int64
You can also use the Series constructor with the transposed .values attribute:
pd.Series(df.values.T.flatten())
Output:
0 64
1 80
2 18
3 57
4 98
5 94
6 43
7 35
8 47
9 81
10 79
11 81
12 58
13 46
14 84
15 31
dtype: int64
Use pd.melt():
df.melt()['value']
Output
0 64
1 80
2 18
3 57
4 98
5 94
6 43
7 35
8 47
9 81
10 79
11 81
12 58
13 46
14 84
15 31
Name: value, dtype: int64
You can also transpose and stack:
df.T.stack().reset_index(drop=True)
Out:
0 64
1 80
2 18
3 57
4 98
5 94
6 43
7 35
8 47
9 81
10 79
11 81
12 58
13 46
14 84
15 31
dtype: int64
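One caveat with the stack-based version: in older pandas, stack() drops NaN values by default, so on a frame with missing data it can return fewer elements than the other approaches.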
I have a DataFrame (df) and I need to split it into n DataFrames based on the column numbers, following the triangular series pattern:
df1 = df[[0]]
df2 = df[[1,2]]
df3 = df[[3,4,5]]
df4 = df[[6,7,8,9]]
etc.
Consider the dataframe df
df = pd.DataFrame(
np.arange(100).reshape(10, 10),
columns=list('ABCDEFGHIJ')
)
df
A B C D E F G H I J
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
2 20 21 22 23 24 25 26 27 28 29
3 30 31 32 33 34 35 36 37 38 39
4 40 41 42 43 44 45 46 47 48 49
5 50 51 52 53 54 55 56 57 58 59
6 60 61 62 63 64 65 66 67 68 69
7 70 71 72 73 74 75 76 77 78 79
8 80 81 82 83 84 85 86 87 88 89
9 90 91 92 93 94 95 96 97 98 99
i_s, j_s = np.arange(4).cumsum(), np.arange(1, 5).cumsum()  # slice starts [0, 1, 3, 6], stops [1, 3, 6, 10]
df1, df2, df3, df4 = [
df.iloc[:, i:j] for i, j in zip(i_s, j_s)
]
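The same idea generalizes to any number of splits, since the slice boundaries are just consecutive triangular numbers; a sketch for k splits:
k = 4
bounds = np.arange(k + 1).cumsum()  # [0, 1, 3, 6, 10]
dfs = [df.iloc[:, b:e] for b, e in zip(bounds[:-1], bounds[1:])]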
Verify
pd.concat(dict(enumerate([df.iloc[:, i:j] for i, j in zip(i_s, j_s)])), axis=1)
0 1 2 3
A B C D E F G H I J
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
2 20 21 22 23 24 25 26 27 28 29
3 30 31 32 33 34 35 36 37 38 39
4 40 41 42 43 44 45 46 47 48 49
5 50 51 52 53 54 55 56 57 58 59
6 60 61 62 63 64 65 66 67 68 69
7 70 71 72 73 74 75 76 77 78 79
8 80 81 82 83 84 85 86 87 88 89
9 90 91 92 93 94 95 96 97 98 99
First generate the triangular number series, then use it to slice the DataFrame:
n = len(df.columns)
res = []
i = 1
end = 0
while end < n:
    begin = end
    end = i * (i + 1) // 2   # triangular numbers: 1, 3, 6, 10, ...
    res.append((begin, end))
    i += 1
dfs = [df.iloc[:, begin:end] for begin, end in res]
I am trying to randomize all rows in a data frame except for the first. I would like for the first row to always appear first, and the remaining rows can be in any randomized order.
My data frame is:
df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
Any suggestions as to how I can approach this?
try this:
df = pd.concat([df[:1], df[1:].sample(frac=1)]).reset_index(drop=True)
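Here df[:1] is the untouched first row, df[1:].sample(frac=1) returns the remaining rows in random order, and reset_index(drop=True) renumbers the result.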
test:
In [38]: df
Out[38]:
a b c d e
0 2.070074 2.216060 -0.015823 0.686516 -0.738393
1 -1.213517 0.994057 0.634805 0.517844 -0.128375
2 0.937532 0.814923 -0.231120 1.970019 1.438927
3 1.499967 0.105707 1.255207 0.929084 -3.359826
4 0.418702 -0.894226 -1.088968 0.631398 0.152026
5 1.214119 -0.122633 0.983818 -0.445202 -0.807955
6 0.252078 -0.258703 -0.445209 -0.179094 1.180077
7 1.428827 -0.569009 -0.718485 0.161108 1.300349
8 -1.403100 2.154548 -0.492264 -0.544538 -0.061745
9 0.468671 0.004839 -0.738240 -0.385624 -0.532640
In [39]: df = pd.concat([df[:1], df[1:].sample(frac=1)]).reset_index(drop=True)
In [40]: df
Out[40]:
a b c d e
0 2.070074 2.216060 -0.015823 0.686516 -0.738393
1 0.468671 0.004839 -0.738240 -0.385624 -0.532640
2 0.418702 -0.894226 -1.088968 0.631398 0.152026
3 -1.213517 0.994057 0.634805 0.517844 -0.128375
4 1.428827 -0.569009 -0.718485 0.161108 1.300349
5 0.937532 0.814923 -0.231120 1.970019 1.438927
6 0.252078 -0.258703 -0.445209 -0.179094 1.180077
7 1.499967 0.105707 1.255207 0.929084 -3.359826
8 -1.403100 2.154548 -0.492264 -0.544538 -0.061745
9 1.214119 -0.122633 0.983818 -0.445202 -0.807955
Use numpy's shuffle:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(100).reshape(20, 5), columns=list('ABCDE'))
values = df.values                  # the underlying ndarray (may be a copy in modern pandas)
np.random.shuffle(values[1:, :])    # shuffle every row except the first, in place
df.iloc[:] = values                 # write the shuffled data back
print(df)
A B C D E
0 0 1 2 3 4
1 55 56 57 58 59
2 10 11 12 13 14
3 80 81 82 83 84
4 90 91 92 93 94
5 70 71 72 73 74
6 25 26 27 28 29
7 40 41 42 43 44
8 65 66 67 68 69
9 5 6 7 8 9
10 45 46 47 48 49
11 85 86 87 88 89
12 15 16 17 18 19
13 30 31 32 33 34
14 60 61 62 63 64
15 20 21 22 23 24
16 35 36 37 38 39
17 95 96 97 98 99
18 75 76 77 78 79
19 50 51 52 53 54
np.random.shuffle shuffles an ndarray in place. In older pandas, a single-dtype DataFrame was a thin wrapper around one ndarray, and the values attribute exposed it as a view, so shuffling that array modified the DataFrame directly; in newer versions, values may return a copy, which is why the snippet above assigns the shuffled array back via df.iloc[:]. To shuffle all but the first row, operate on the array slice [1:, :].
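An index-based variant that doesn't rely on .values being a view (a sketch of the same idea):
order = np.concatenate(([0], np.random.permutation(np.arange(1, len(df)))))
df = df.iloc[order].reset_index(drop=True)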