Pivoting a numpy array by using pandas [duplicate] - python

How do I convert a list of lists to a pandas DataFrame?
It is not in the form of columns but instead in the form of rows.
#!/usr/bin/env python
from random import randrange
import pandas
data = [[[randrange(0,100) for j in range(0, 12)] for y in range(0, 12)] for x in range(0, 5)]
print(data)
df = pandas.DataFrame(data[0], columns=['B','P','F','I','FP','BP','2','M','3','1','I','L'])
print(df)
For example:
data[0][0] == [64, 73, 76, 64, 61, 32, 36, 94, 81, 49, 94, 48]
I want each inner list like this to be shown as a column, not as a row.
Currently it shows something like this:
B P F I FP BP 2 M 3 1 I L
0 64 73 76 64 61 32 36 94 81 49 94 48
1 57 58 69 46 34 66 15 24 20 49 25 98
2 99 61 73 69 21 33 78 31 16 11 77 71
3 41 1 55 34 97 64 98 9 42 77 95 41
4 36 50 54 27 74 0 8 59 27 54 6 90
5 74 72 75 30 62 42 90 26 13 49 74 9
6 41 92 11 38 24 48 34 74 50 10 42 9
7 77 9 77 63 23 5 50 66 49 5 66 98
8 90 66 97 16 39 55 38 4 33 52 64 5
9 18 14 62 87 54 38 29 10 66 18 15 86
10 60 89 57 28 18 68 11 29 94 34 37 59
11 78 67 93 18 14 28 64 11 77 79 94 66
I want the rows and columns to be switched. Moreover, how do I do this for all 5 main lists?
This is how I want the output to look, with the other columns also filled in:
B P F I FP BP 2 M 3 1 I L
0 64
1 73
2 76
3 64
4 61
5 32
6 36
7 94
8 81
9 49
10 94
11 48
However, df.transpose() won't help.

This is what I came up with:
data = [[[randrange(0,100) for j in range(0, 12)] for y in range(0, 12)] for x in range(0, 5)]
print(data)
df = pandas.DataFrame(data[0], columns=['B','P','F','I','FP','BP','2','M','3','1','I','L'])
print(df)
df1 = df.transpose()
df1.columns = ['B','P','F','I','FP','BP','2','M','3','1','I','L']
print(df1)

import numpy
# x is the index of one of the 5 main lists (0 to 4)
df = pandas.DataFrame(numpy.asarray(data[x]).T.tolist(),
                      columns=['B','P','F','I','FP','BP','2','M','3','1','I','L'])
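To cover all 5 main lists, one possible sketch is to swap the last two axes of the whole nested list at once and build one DataFrame per main list. The name frames is made up for this illustration; the column labels are the ones from the question (pandas accepts the duplicated 'I' label).

```python
from random import randrange

import numpy as np
import pandas as pd

labels = ['B', 'P', 'F', 'I', 'FP', 'BP', '2', 'M', '3', '1', 'I', 'L']
data = [[[randrange(0, 100) for j in range(12)] for y in range(12)] for x in range(5)]

# Swap the last two axes so that each inner list becomes a column.
transposed = np.asarray(data).transpose(0, 2, 1)

# One DataFrame per main list; `frames` is a hypothetical name.
frames = [pd.DataFrame(block, columns=labels) for block in transposed]

print(frames[0])
```

With this layout, the first column of frames[x] holds data[x][0], the second column holds data[x][1], and so on.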

Related

Saving result from for loop to different columns

I am trying to run a nested loop in which I want the output to be saved in four different columns. Let C1R1 be the value I want in the first column first row, C2R2 the one I want in the second column second row, etc. What I have come up with so far gives me a list where the output is saved like this:
['C1R1', 'C2R1', 'C3R1', 'C4R1']. This is the code I am using:
dfs1 = []
for i in range(24):
    pd = (data_json2['data']['Rows'][i])
    for j in range(4):
        pd1 = pd['Columns'][j]['Value']
        dfs1.append(pd1)
What could be a good way to achieve this?
EDIT: This is what I want to achieve:
Column 1 Column 2 Column 3 Column 4
0 0 24 48 72
1 1 25 49 73
2 2 26 50 74
3 3 27 51 75
4 4 28 52 76
5 5 29 53 77
6 6 30 54 78
7 7 31 55 79
8 8 32 56 80
9 9 33 57 81
10 10 34 58 82
11 11 35 59 83
12 12 36 60 84
13 13 37 61 85
14 14 38 62 86
15 15 39 63 87
16 16 40 64 88
17 17 41 65 89
18 18 42 66 90
19 19 43 67 91
20 20 44 68 92
21 21 45 69 93
22 22 46 70 94
23 23 47 71 95
While this is what I get now:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
Thank you.
Try:
import pandas as pd
def get_dataframe(num_cols=4, num_values=24):
    # Row c holds c, c + num_values, c + 2 * num_values, ...
    return pd.DataFrame(
        ([v * num_values + c for v in range(num_cols)] for c in range(num_values)),
        columns=[f"Column {c}" for c in range(1, num_cols + 1)],
    )
df = get_dataframe()
print(df)
Prints:
Column 1 Column 2 Column 3 Column 4
0 0 24 48 72
1 1 25 49 73
2 2 26 50 74
3 3 27 51 75
4 4 28 52 76
5 5 29 53 77
6 6 30 54 78
7 7 31 55 79
8 8 32 56 80
9 9 33 57 81
10 10 34 58 82
11 11 35 59 83
12 12 36 60 84
13 13 37 61 85
14 14 38 62 86
15 15 39 63 87
16 16 40 64 88
17 17 41 65 89
18 18 42 66 90
19 19 43 67 91
20 20 44 68 92
21 21 45 69 93
22 22 46 70 94
23 23 47 71 95
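If the values have to come from the JSON in the question rather than from a formula, a minimal sketch is to collect one inner list per row instead of one flat list. The data_json2 below is a made-up stand-in that only mirrors the keys used in the question ('data', 'Rows', 'Columns', 'Value'); the real structure may differ.

```python
import pandas as pd

# Hypothetical stand-in for data_json2, mirroring the keys used in the question.
data_json2 = {
    "data": {
        "Rows": [
            {"Columns": [{"Value": col * 24 + row} for col in range(4)]}
            for row in range(24)
        ]
    }
}

# One inner list per row, so each row keeps its four values together.
rows = [
    [cell["Value"] for cell in row["Columns"][:4]]
    for row in data_json2["data"]["Rows"][:24]
]

df = pd.DataFrame(rows, columns=[f"Column {c}" for c in range(1, 5)])
print(df.head())
```

Building the nested list first avoids having to reshape a flat list afterwards.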

Is numpy's setdiff1d broken?

To select data for training and validation in my machine learning projects, I usually use numpy's masking functionality. So a typical recurring block of code to select the indices for validation and training data looks like this:
import numpy as np
validation_split = 0.2
all_idx = np.arange(0,100000)
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)))
idxTrain = np.setdiff1d(all_idx, idxValid)
Now the following should always be true:
len(all_idx) == len(idxValid)+len(idxTrain)
Unfortunately, I found out that somehow this is not always the case. As I increase the number of elements that are chosen from the all_idx array, the resulting numbers do not add up properly. Here is another standalone example, which breaks as soon as I increase the number of randomly chosen validation indices above 1000:
import numpy as np
all_idx = np.arange(0,100000)
idxValid = np.random.choice(all_idx, 1000)
idxTrain = np.setdiff1d(all_idx, idxValid)
print(len(all_idx), len(idxValid), len(idxTrain))
This results in -> 100000, 1000, 99005
I am confused?! Please try yourself. I would be glad to understand this.
idxValid = np.random.choice(all_idx, 10, replace=False)
Careful, you need to indicate that you don't want to have duplicates in idxValid. To do so, you just have to add replace=False in np.random.choice. From the documentation:
replace : boolean, optional
Whether the sample is with or without replacement
Consider the following example:
all_idx = np.arange(0, 100)
print(all_idx)
>>> [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99]
Now if you print out your validation dataset:
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)))
print(idxValid)
>>> [31 57 55 45 26 25 55 76 33 69 49 90 46 14 18 30 89 73 47 82]
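You can actually observe that there are duplicates in the resulting set (55 appears twice). np.setdiff1d removes each distinct value only once, so idxTrain is longer than expected and thus
len(all_idx) == len(idxValid)+len(idxTrain)
would not evaluate to True.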
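A quick way to check this explanation is to count the distinct values in the validation sample. The sketch below uses a seeded generator from np.random.default_rng (my choice, only to make the run repeatable) and snake_case names that are not from the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only for a repeatable sketch
all_idx = np.arange(0, 100000)
idx_valid = rng.choice(all_idx, 1000)  # replace=True is the default
idx_train = np.setdiff1d(all_idx, idx_valid)

# Number of distinct indices that were actually drawn.
n_unique = len(np.unique(idx_valid))
print(len(idx_valid), n_unique, len(idx_train))

# The counts add up once duplicates are collapsed, so this prints True.
print(len(all_idx) == n_unique + len(idx_train))
```

So setdiff1d behaves correctly; the missing elements are exactly the repeated draws.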
What you need to do is make sure that np.random.choice samples without replacement by passing replace=False:
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)), replace=False)
Now the results should be as expected:
import numpy as np
validation_split = 0.2
all_idx = np.arange(0, 100)
print(all_idx)
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)), replace=False)
print(idxValid)
idxTrain = np.setdiff1d(all_idx, idxValid)
print(idxTrain)
print(len(all_idx) == len(idxValid)+len(idxTrain))
and the output is:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99]
[12 85 96 64 48 21 55 56 80 42 11 92 54 77 49 36 28 31 70 66]
[ 0 1 2 3 4 5 6 7 8 9 10 13 14 15 16 17 18 19 20 22 23 24 25 26
27 29 30 32 33 34 35 37 38 39 40 41 43 44 45 46 47 50 51 52 53 57 58 59
60 61 62 63 65 67 68 69 71 72 73 74 75 76 78 79 81 82 83 84 86 87 88 89
90 91 93 94 95 97 98 99]
True
Consider using train_test_split from scikit-learn which is straight-forward:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
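If you would rather stay with plain numpy, another option (a swapped-in technique, not taken from the answers above) is to shuffle the indices once and slice the result. The two parts are then disjoint by construction. The seed and the snake_case names are arbitrary choices for this sketch.

```python
import numpy as np

validation_split = 0.2
rng = np.random.default_rng(0)  # seed is an arbitrary choice for repeatability

all_idx = np.arange(0, 100000)
shuffled = rng.permutation(all_idx)

# The first 20% of the shuffled indices go to validation, the rest to training.
n_valid = int(validation_split * len(all_idx))
idx_valid, idx_train = shuffled[:n_valid], shuffled[n_valid:]

print(len(all_idx) == len(idx_valid) + len(idx_train))
```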

How to write this code in an optimal (pythonic) way?

I have the following code in R and I need to write it efficiently in Python using pandas. I wrote a version, but it takes a long time to run.
1) Can someone confirm that the Python code is equivalent to the R code?
2) How can I write it in a Pythonic (optimal) way?
In R:
for (i in 1:dim(df1)[1])
    df1$column1[i] <- sum(df2[i,4:33])
In Python:
for i in range(df1.shape[0]):
    df1['column1'][i] = df2.iloc[i,3:34].sum()
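These are two ways to make the replacement. Note that R's 4:33 is 1-based and inclusive (30 columns), which corresponds to 3:33 in iloc, not 3:34 as in the question:
df1['column1'] = df2.iloc[:, 3:33].sum(axis=1)
OR
df1.loc[:, 'column1'] = df2.iloc[:, 3:33].sum(axis=1)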
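To check the equivalence on a small case, here is a rough sketch with made-up stand-ins for df1 and df2 (random integers, 40 columns). It assumes R's 4:33 is the intended range; since R indexing is 1-based and inclusive, that is 30 columns, i.e. positions 3:33 in iloc.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df1 and df2 from the question.
rng = np.random.default_rng(0)
df2 = pd.DataFrame(rng.integers(0, 100, size=(5, 40)))
df1 = pd.DataFrame(index=df2.index)

# Vectorized row sums over the 30 columns at positions 3..32.
df1["column1"] = df2.iloc[:, 3:33].sum(axis=1)

# Row-by-row loop, for comparison only.
looped = [int(df2.iloc[i, 3:33].sum()) for i in range(df2.shape[0])]
print(df1["column1"].tolist() == looped)
```

The vectorized version computes all row sums in one call instead of once per row, which is where the speedup comes from.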
Use vectorized operations:
>>> df = pd.DataFrame(np.random.randint(0, 100, (10, 15)), columns=list('abcdefghijklmno'))
>>> df
a b c d e f g h i j k l m n o
0 71 93 12 32 17 23 35 57 26 89 4 29 28 83 30
1 98 78 75 0 61 81 8 17 93 71 48 47 72 52 11
2 13 62 93 48 31 23 42 66 77 99 59 1 40 72 87
3 7 5 5 43 83 19 59 36 18 96 50 60 46 45 54
4 32 69 93 6 7 12 15 49 29 11 37 83 75 97 84
5 52 53 43 61 93 85 91 99 65 62 35 89 55 77 62
6 44 7 41 56 40 11 39 91 87 46 95 48 30 75 16
7 93 15 63 23 14 20 7 33 29 31 41 40 82 0 16
8 46 63 59 59 81 51 34 41 89 68 20 64 95 70 74
9 33 58 49 91 51 46 43 83 37 53 47 32 42 12 59
Then simply:
>>> df['column1'] = df.iloc[:, 3:8].sum(axis=1)
>>> df
a b c d e f g h i j k l m n o column1
0 71 93 12 32 17 23 35 57 26 89 4 29 28 83 30 164
1 98 78 75 0 61 81 8 17 93 71 48 47 72 52 11 167
2 13 62 93 48 31 23 42 66 77 99 59 1 40 72 87 210
3 7 5 5 43 83 19 59 36 18 96 50 60 46 45 54 240
4 32 69 93 6 7 12 15 49 29 11 37 83 75 97 84 89
5 52 53 43 61 93 85 91 99 65 62 35 89 55 77 62 429
6 44 7 41 56 40 11 39 91 87 46 95 48 30 75 16 237
7 93 15 63 23 14 20 7 33 29 31 41 40 82 0 16 97
8 46 63 59 59 81 51 34 41 89 68 20 64 95 70 74 266
9 33 58 49 91 51 46 43 83 37 53 47 32 42 12 59 314
>>>

How to create a pandas DataFrame in which a specific column always has a value greater than another column, using np.random.randint

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
print(df)
I want column 'A' always to have a value greater than column 'B'.
df.A, df.B = df[['A', 'B']].max(axis=1), df[['A', 'B']].min(axis=1)
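A small self-contained check of the max/min idea, as a sketch: the names hi and lo and the seeded generator are my own additions. Note that when both random values happen to be equal, this yields A equal to B rather than strictly greater.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list("ABCD"))

# Compute both results before assigning, so B is taken from the original A.
hi = df[["A", "B"]].max(axis=1)
lo = df[["A", "B"]].min(axis=1)
df["A"], df["B"] = hi, lo

# Every row now satisfies A >= B, so this prints True.
print((df["A"] >= df["B"]).all())
```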
Try this:
newdf = df.apply(lambda x: x if x[0]>x[1] else [*x[:2][::-1],*x[2:]],axis=1)
print(newdf)
Output:
A B C D
0 85 14 22 85
1 62 54 20 1
2 82 78 48 59
3 81 59 54 39
4 92 12 79 44
5 69 64 8 11
6 49 34 48 69
7 68 28 80 27
8 72 17 2 40
9 26 15 49 62
10 29 2 86 12
11 69 7 32 99
12 39 35 65 32
13 45 36 36 12
14 54 21 29 79
15 91 82 35 80
16 67 16 4 37
17 94 82 93 37
18 64 18 2 15
19 13 11 28 82
20 78 9 93 45
21 72 41 16 33
22 92 71 62 69
23 87 79 71 11
24 31 14 8 24
25 85 27 43 3
26 82 34 14 52
27 41 32 39 48
28 13 12 24 86
29 96 17 14 80
.. .. .. .. ..
70 17 13 20 91
71 26 7 57 96
72 41 0 24 58
73 98 68 90 13
74 88 35 81 56
75 65 43 70 86
76 82 81 44 68
77 97 45 23 66
78 81 45 78 48
79 62 24 43 62
80 43 13 42 49
81 97 28 75 45
82 3 0 54 40
83 57 46 16 38
84 87 46 35 13
85 41 13 78 89
86 62 36 94 23
87 84 35 69 93
88 63 18 39 3
89 45 42 30 6
90 81 8 49 82
91 28 28 11 47
92 97 81 49 92
93 86 24 82 40
94 76 72 30 51
95 93 92 1 69
96 97 76 38 81
97 87 49 26 64
98 98 25 93 55
99 57 2 87 10
[100 rows x 4 columns]
You can apply it to any number of columns.
import numpy as np
import pandas as pd
#np.random.seed(1)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
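# We are just sorting the values of each row in descending order.
# This relies on df.values being a writable view of the data, which may not hold in every pandas version.
df.values[:,::-1].sort()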
print(df)
It gives following output:
A B C D
0 72 37 12 9
1 79 75 64 5
2 76 71 16 1
3 50 25 20 6
4 84 28 18 11
5 68 50 29 14
6 96 94 87 87
7 86 13 9 7
8 63 61 57 22
9 81 60 1 0
10 88 47 13 8
11 72 71 30 3
12 70 57 49 21
13 68 43 24 3
14 80 76 52 26
15 82 64 41 15
16 98 87 68 25
17 26 25 22 7
18 67 27 23 9
19 83 57 38 37
20 34 32 10 8
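If the in-place trick does not work in your setup, a sketch that avoids modifying df.values directly is to sort a copy with np.sort and build a new DataFrame from it. The name df_sorted and the seed are assumptions of this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list("ABCD"))

# np.sort returns an ascending copy of each row; reversing the columns makes it descending.
sorted_desc = np.sort(df.to_numpy(), axis=1)[:, ::-1]
df_sorted = pd.DataFrame(sorted_desc, columns=df.columns, index=df.index)

print(df_sorted.head())
```

As with the in-place version, this reorders the values within each row, so columns C and D change as well.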

Split a Pandas Dataframe into multiple Dataframes based on Triangular Number Series

I have a DataFrame (df) and I need to split it into n DataFrames based on column position, following the triangular number series pattern:
df1 = df[[0]]
df2 = df[[1,2]]
df3 = df[[3,4,5]]
df4 = df[[6,7,8,9]]
etc.
Consider the dataframe df:
df = pd.DataFrame(
    np.arange(100).reshape(10, 10),
    columns=list('ABCDEFGHIJ')
)
df
A B C D E F G H I J
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
2 20 21 22 23 24 25 26 27 28 29
3 30 31 32 33 34 35 36 37 38 39
4 40 41 42 43 44 45 46 47 48 49
5 50 51 52 53 54 55 56 57 58 59
6 60 61 62 63 64 65 66 67 68 69
7 70 71 72 73 74 75 76 77 78 79
8 80 81 82 83 84 85 86 87 88 89
9 90 91 92 93 94 95 96 97 98 99
i_s, j_s = np.arange(4).cumsum(), np.arange(1, 5).cumsum()
df1, df2, df3, df4 = [
    df.iloc[:, i:j] for i, j in zip(i_s, j_s)
]
Verify:
pd.concat(dict(enumerate([df.iloc[:, i:j] for i, j in zip(i_s, j_s)])), axis=1)
0 1 2 3
A B C D E F G H I J
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
2 20 21 22 23 24 25 26 27 28 29
3 30 31 32 33 34 35 36 37 38 39
4 40 41 42 43 44 45 46 47 48 49
5 50 51 52 53 54 55 56 57 58 59
6 60 61 62 63 64 65 66 67 68 69
7 70 71 72 73 74 75 76 77 78 79
8 80 81 82 83 84 85 86 87 88 89
9 90 91 92 93 94 95 96 97 98 99
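The cumsum idea can be extended to a frame with any number of columns. The following is only a sketch: the names bounds and parts are made up here, and the last block is simply cut short when the column count is not a triangular number.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(100).reshape(10, 10), columns=list("ABCDEFGHIJ"))

n = df.shape[1]
# Triangular numbers 0, 1, 3, 6, 10, ... are the block boundaries.
bounds = np.arange(n + 1).cumsum()
# Keep the boundaries that fall inside the frame and close the last block at n.
bounds = bounds[bounds < n].tolist() + [n]

parts = [df.iloc[:, i:j] for i, j in zip(bounds[:-1], bounds[1:])]
print([list(p.columns) for p in parts])
# [['A'], ['B', 'C'], ['D', 'E', 'F'], ['G', 'H', 'I', 'J']]
```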
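First get the triangular number series, then apply it to the dataframe:
n = len(df.columns)
end = 0
i = 1
res = []
while end < n:
    begin = end
    end = min(i * (i + 1) // 2, n)  # i-th triangular number, capped at n
    res.append((begin, end))
    i += 1
# Slice the columns by position for each (begin, end) pair
dfs = [df.iloc[:, begin:end] for begin, end in res]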
