graph_tool: how to stop the all_circuits function from blocking my script - python

I'm learning Python and experimenting with the graph_tool module.
Since the function all_circuits can take a long time to compute all the cycles, is there a way to stop it (for example after "X" seconds, or once the iterator has produced a certain number of items) and continue executing the script?
Thanks

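This is quite simple, actually. The function all_circuits() returns an iterator over all circuits. Therefore, if you want to stop early, all you need to do is break out of the iteration:
from graph_tool import collection
from graph_tool.topology import all_circuits
g = collection.ns["football"]
for i, c in enumerate(all_circuits(g)):
    if i > 10:
        # print the circuit reached at this point, then stop iterating
        print(c)
        break
which prints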
[ 0 1 25 24 11 10 5 4 9 8 7 6 2 3 26 12 13 15
14 38 18 19 29 30 35 34 31 32 21 20 17 16 23 22 47 46
49 48 44 45 33 37 36 43 42 57 56 27 62 61 54 39 60 59
58 63 64 100 99 89 88 83 53 52 40 41 67 68 50 28 69 70
65 66 75 76 95 87 86 80 79 55 94 82 81 72 74 73 110 114
104 93]
and stops.
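The same early-exit idea can be extended to a time budget. The following is a minimal sketch, not part of graph_tool: take_until is a hypothetical helper, and slow_counter is a stand-in generator used in place of all_circuits(g) so that the example runs without graph_tool installed. Note that the deadline is only checked between items, so it cannot interrupt a single step of the iterator that itself takes a long time.

```python
import itertools
import time


def take_until(iterator, max_items=None, max_seconds=None):
    """Collect items until a count or time budget is used up (hypothetical helper).

    max_items is assumed to be a positive integer when given.
    """
    deadline = None if max_seconds is None else time.monotonic() + max_seconds
    results = []
    for item in iterator:
        results.append(item)
        # Stop once enough items have been collected.
        if max_items is not None and len(results) >= max_items:
            break
        # Stop once the time budget is exhausted (checked between items only).
        if deadline is not None and time.monotonic() >= deadline:
            break
    return results


def slow_counter():
    # Stand-in for all_circuits(g): an endless iterator.
    n = 0
    while True:
        yield n
        n += 1


first_five = take_until(slow_counter(), max_items=5)
timed = take_until(slow_counter(), max_seconds=0.05)
# For a pure count limit, itertools.islice does the same job.
first_three = list(itertools.islice(slow_counter(), 3))
print(first_five, len(timed), first_three)
```

With graph_tool available, the stand-in would be replaced by all_circuits(g), for example take_until(all_circuits(g), max_seconds=5).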


issubset method gives a different result than subSet in superSet - Error in Python 3.x

Why doesn't the issubset method of sets in Python 3.x return the same result as subSet in superSet?
Logically it looks correct to me, but the console gives an unexpected result.
It works fine with short sets, but with large sets (subSet in superSet) gives wrong results.
def isStrictSuperset(superSet, subSet):
    strictSuperset = False
    # condition1 = subSet.issubset(superSet)  # why is this different from the following condition?
    condition1 = subSet in superSet  # Error! this line gives the incorrect result
    condition2 = superSet != subSet
    if condition1 and condition2:
        strictSuperset = True
    return strictSuperset  # whether superSet is a strict superset or not
if __name__ == "__main__":
    # lists of strings
    superSet = input().split(' ')
    subSet = input().split(" ")
    # convert the lists of strings to sets of integers
    superSet = set(int(x) for x in superSet)
    subSet = set(int(x) for x in subSet)
    # output
    print(isStrictSuperset(superSet, subSet))
input:
51 28 10 61 99 31 55 7 88 48 18 80 18 36 49 21 36 1 49 53 11 78 46 87 82 28 76 50 89 31 14 81 87 39 3 69 26 18 85 18 23 43 75 5 64 47 34 19 2 54 92 45 79 80 59 16 75 80 55 24 56 74 76 31 22 74 20 93 79 81 12 57 21 79 65 32 57 37 47 84 82 28 72 15 53 50 86 58 83 88 3 44 76 63 32 14 13 38 29 70 38 4 71 15 45 4 94 24 46 6 95 48 15 82 92 62 6 67 38 20 60 78 37 84 32 39 51 88 13 99 6 3 64 37 83 68 18 51 98 37 11 48 63 97 30 90 73 44 63 25 78 12 25 91 36 38 59 12 36 51 58 61 82 91 31 41 36 99 28 50 28 64 22 56 26 39 75 53 8 41 94 86 35 69 48 17 80 32 12 29 2 33 51 79 58 74 91 46 6 54 66 0 75 60 30 95 57 36 70 32 83 1 88 27 57 2 67 28 18 51 61 16 40 79 96 78 27 72 85 45 73 12 89 31 11 24 42 94 22 84 1 67 8 62 80 77 81 58 1 6 63 30 64 37 44 60 11 14 68 28 81 86 30 17 81 14 30 44 64 89 7 94 89 13 59 88 34 42 6 51 10 19 66 91 46 22 41 34 98 4 26 90 84 90 44 90 84 13 36 6 97 21 30 52 46 15 83 89 45 83 33 11 3 18 6 82 17 23 13 91 27 39 76 11 86 12 97 64 51 48 84 35 66 15 48 32 99 11 18 93 11 85 71 63 57 76 1 80 45 19 7 39 80 70 78 3 17 51 14 99 47 83 17 82 23 59 59 41 77 22 7 35 22 98 59 90 80 72 60 67 22 75 3 99 18 81 47 48 18 98 18 37 47 65 98 86 82 5 30 87 25 17 97 60 93 33 99 89 62 98 40 27 70 57 49 93 46 11 38 94 43 75 61 75 55 45 26 9 84 89 40 87 14 61 31 99 53 6 83 55 15 95 46 8 58 73 58 57 9 7 49 21 31 88 31 32 61 30 19 69 78 33 3 0 70 73 40 91 91 96 72 79 0 41 91 51 10 80 50 77 30 38 1 85 56 90 78 36 31 0 82 12 95 28 1 65 72 75 89 54
81 79 97 20 68 23 19 12 53 86 26 36 4 64 10 43 12 75 98 30 12 33 27 1 32 68 64 49 99 10 16 9 7 47 23 29 30 94 57 25 38 15 57 33 79 28 45 98 20 50 34 93 6 14 9 29 56 13 44 67 5 23 32 38 78 20 55 35 25 91 64 10 47 32 97 44 85 65 87 36 91 88 78 6 48 86 67 56 44 18 98 39 10 80 47 65 49 98 63 21
output: False
expected output: True
subset in superset checks whether subset is an element of superset; i.e., it checks ∈, not ⊆.
You can simply use < to check whether a set is a proper subset of another: https://docs.python.org/3/library/stdtypes.html#frozenset.issubset
print({1, 2} in {1,2,3}) # False
print({1, 2} < {1,2,3}) # True
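As a minimal sketch of how the function from the question could be written with the comparison operators (the name is_strict_superset is only illustrative): the > operator on sets tests for a proper superset, so the separate inequality check is no longer needed.

```python
def is_strict_superset(super_set, sub_set):
    # '>' is True only if every element of sub_set is in super_set
    # and the two sets are not equal.
    return super_set > sub_set


print(is_strict_superset({1, 2, 3}, {1, 2}))     # True
print(is_strict_superset({1, 2, 3}, {1, 2, 3}))  # False: equal sets
print(is_strict_superset({1, 2, 3}, {1, 4}))     # False: 4 is missing

# Membership only succeeds when the set itself is stored as an element:
print({1, 2} in {1, 2, 3})                       # False
print(frozenset({1, 2}) in {frozenset({1, 2}), 3})  # True
```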

Changing column names in pandas

I'm trying to change the names of the columns in a pandas dataframe. I use Python 3.7. I have 30 columns numbered 0-29 and I want to rename them to 1-30. I know it's a silly question, but I'm trying to do it in as few lines as possible and couldn't find anything efficient online. Can anyone please help me?
Thank you
If you have dataframe like this:
0 1 2 3
0 a d e f
1 b g h i
2 c j k l
Then you can do:
df.columns = df.columns.astype(int) + 1
print(df)
Prints:
1 2 3 4
0 a d e f
1 b g h i
2 c j k l
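Another way is to recreate the column index with RangeIndex:
df.columns = pd.RangeIndex(1, len(df.columns)+1)
FYI, you can read the documentation about Int64Index and RangeIndex: RangeIndex is a memory-optimized special case of Int64Index (in pandas 2.0 and later, of an Index with an integer dtype) that is limited to representing monotonic ranges.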
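For comparison, here is a small sketch of two non-destructive variants on the 4-column example frame from above. It assumes integer column labels; the variable names renamed and shifted are only illustrative. DataFrame.rename with a callable returns a new frame and leaves the original untouched, while assigning a range to .columns relabels a copy in place.

```python
import pandas as pd

# Same shape as the example above: default integer column labels 0..3.
df = pd.DataFrame([["a", "d", "e", "f"],
                   ["b", "g", "h", "i"],
                   ["c", "j", "k", "l"]])

# Variant 1: rename() with a callable builds a new frame; df keeps labels 0..3.
renamed = df.rename(columns=lambda c: c + 1)

# Variant 2: assign a fresh range of labels to a copy.
shifted = df.copy()
shifted.columns = range(1, shifted.shape[1] + 1)

print(list(df.columns))       # [0, 1, 2, 3]
print(list(renamed.columns))  # [1, 2, 3, 4]
print(list(shifted.columns))  # [1, 2, 3, 4]
```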
You can also assign the new labels directly. I believe you will find it short and simple enough.
df.columns = list(range(1, 31))
Alternatively, you can use a list comprehension to rename your dataframe columns:
df.columns = [i + 1 for i in df.columns]
You can use the approach below.
Sample Data:
Create random sample data with 30 columns. The columns get the default RangeIndex, which starts at 0 with step=1; we can shift it to get the desired labels.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 30)))
print(df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
0 37 87 10 94 76 42 94 80 2 54 98 18 27 32 94 41 97 61 22 87 67 43 12 49 67 92 69 52 78 49
1 80 77 64 81 91 36 46 83 54 25 55 5 4 57 68 59 36 94 79 14 27 7 36 37 15 3 9 32 50 95
2 58 91 87 59 60 65 90 97 55 48 11 62 76 28 89 99 78 60 92 25 93 35 41 69 88 19 85 18 56 52
3 50 5 80 32 42 96 89 62 77 89 72 8 1 3 52 92 71 95 42 18 9 76 5 53 56 18 17 5 3 40
4 37 92 30 45 14 15 96 29 0 45 59 59 82 51 78 30 25 95 50 22 34 12 24 59 63 5 75 15 85 95
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
95 49 58 9 18 44 48 15 74 76 70 81 88 36 32 35 96 93 95 2 69 20 40 22 19 55 92 33 45 20 82
96 75 15 65 77 4 2 45 16 42 25 12 47 35 64 3 89 47 68 59 52 82 37 67 32 64 62 7 81 79 42
97 7 95 21 52 42 84 0 85 0 2 16 97 45 56 30 15 33 49 82 60 51 29 3 37 51 8 65 73 55 56
98 69 66 25 61 85 50 76 27 51 44 46 53 56 67 20 15 5 77 54 18 18 48 34 2 89 84 55 26 19 4
99 41 63 23 46 33 78 86 32 4 9 13 40 13 17 22 78 60 96 56 3 30 78 65 66 15 43 98 79 10 23
[100 rows x 30 columns]
print(df.columns)
RangeIndex(start=0, stop=30, step=1) <-- default behaviour
Solution :
We can change the default RangeIndex to start=1 as follows in order to get the result you desired.
df.columns = df.columns+1
print(df.columns)
RangeIndex(start=1, stop=31, step=1)
print(df)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
0 37 87 10 94 76 42 94 80 2 54 98 18 27 32 94 41 97 61 22 87 67 43 12 49 67 92 69 52 78 49
1 80 77 64 81 91 36 46 83 54 25 55 5 4 57 68 59 36 94 79 14 27 7 36 37 15 3 9 32 50 95
2 58 91 87 59 60 65 90 97 55 48 11 62 76 28 89 99 78 60 92 25 93 35 41 69 88 19 85 18 56 52
3 50 5 80 32 42 96 89 62 77 89 72 8 1 3 52 92 71 95 42 18 9 76 5 53 56 18 17 5 3 40
4 37 92 30 45 14 15 96 29 0 45 59 59 82 51 78 30 25 95 50 22 34 12 24 59 63 5 75 15 85 95
.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
95 49 58 9 18 44 48 15 74 76 70 81 88 36 32 35 96 93 95 2 69 20 40 22 19 55 92 33 45 20 82
96 75 15 65 77 4 2 45 16 42 25 12 47 35 64 3 89 47 68 59 52 82 37 67 32 64 62 7 81 79 42
97 7 95 21 52 42 84 0 85 0 2 16 97 45 56 30 15 33 49 82 60 51 29 3 37 51 8 65 73 55 56
98 69 66 25 61 85 50 76 27 51 44 46 53 56 67 20 15 5 77 54 18 18 48 34 2 89 84 55 26 19 4
99 41 63 23 46 33 78 86 32 4 9 13 40 13 17 22 78 60 96 56 3 30 78 65 66 15 43 98 79 10 23
[100 rows x 30 columns]
For more details, you can look at help(df.columns):
| start : int (default: 0), or other RangeIndex instance
| If int and "stop" is not given, interpreted as "stop" instead.
| stop : int (default: 0)
| step : int (default: 1)
| name : object, optional
| Name to be stored in the index.
| copy : bool, default False
| Unused, accepted for homogeneity with other index types.
|
| Attributes
| ----------
| start
| stop
| step
|
| Methods
| -------
| from_range

Project Euler problem 11 in Python - Row by row iterations not working

In order to solve problem 11, I have sought to implement 4 loops. Each of the 4 loops iterates in a different direction, so for example the first loop (which I will use to demonstrate my issue below) starts vertically from the top left of the grid. The logic of the loop is to go through the top row and then move down a row and follow the same multiplication pattern. After 16 iterations there are no more combinations of numbers and so the loop stops.
In order to test whether or not the function works, I want to print a list of all the iterations to ensure that it prints 360 unique numbers. The idea being that I can then alter the code to start with figure = 0, and with each iteration I can check to see if the number produced is bigger than the current value for figure. If it is, then figure is replaced with the value of that iteration.
My issue is that the output of my code is the same list of 20 numbers 16 times. Any help with this one would be highly appreciated! I know that there are many ways of doing this, and that I can look up the answers, but I want to get my own logic/solution working before I look at any answers, and this is the main blocker at the moment.
#code starts here
twenmat = [20*20 matrix]
newlist = []
figure = 0
for items in twenmat:
    for x in range(0,20):
        y = 0
        newlist.append(twenmat[0+y][x]*twenmat[1+y][x]*twenmat[2+y][x]*twenmat[3+y][x])
        y = y + 1
        if y == 16:
            break
print(newlist)
#end of script
Rather than manipulating individual coordinates, you could just shift the matrix by 1, 2 and 3 in each direction and multiply the shifted matrices cell by cell. Record the maximum of these products as you go through the 4 directions (right, down, down-right, up-right):
data =\
"""08 02 22 97 38 15 00 40 00 75 04 05 07 78 52 12 50 77 91 08
49 49 99 40 17 81 18 57 60 87 17 40 98 43 69 48 04 56 62 00
81 49 31 73 55 79 14 29 93 71 40 67 53 88 30 03 49 13 36 65
52 70 95 23 04 60 11 42 69 24 68 56 01 32 56 71 37 02 36 91
22 31 16 71 51 67 63 89 41 92 36 54 22 40 40 28 66 33 13 80
24 47 32 60 99 03 45 02 44 75 33 53 78 36 84 20 35 17 12 50
32 98 81 28 64 23 67 10 26 38 40 67 59 54 70 66 18 38 64 70
67 26 20 68 02 62 12 20 95 63 94 39 63 08 40 91 66 49 94 21
24 55 58 05 66 73 99 26 97 17 78 78 96 83 14 88 34 89 63 72
21 36 23 09 75 00 76 44 20 45 35 14 00 61 33 97 34 31 33 95
78 17 53 28 22 75 31 67 15 94 03 80 04 62 16 14 09 53 56 92
16 39 05 42 96 35 31 47 55 58 88 24 00 17 54 24 36 29 85 57
86 56 00 48 35 71 89 07 05 44 44 37 44 60 21 58 51 54 17 58
19 80 81 68 05 94 47 69 28 73 92 13 86 52 17 77 04 89 55 40
04 52 08 83 97 35 99 16 07 97 57 32 16 26 26 79 33 27 98 66
88 36 68 87 57 62 20 72 03 46 33 67 46 55 12 32 63 93 53 69
04 42 16 73 38 25 39 11 24 94 72 18 08 46 29 32 40 62 76 36
20 69 36 41 72 30 23 88 34 62 99 69 82 67 59 85 74 04 36 16
20 73 35 29 78 31 90 01 74 31 49 71 48 86 81 16 23 57 05 54
01 70 54 71 83 51 54 69 16 92 33 48 61 43 52 01 89 19 67 48"""
M = [ [*map(int,line.split())] for line in data.split("\n") ]
...
# shift the matrix by a positive or negative amount vertically and horizontally
# empty positions are filled with 1 so that the products aren't impacted
def shift(m,v,h):
    if v<0 : m = [[1]*len(m)]*-v + m[:v]
    else   : m = m[v:] + [[1]*len(m)]*v
    if h<0 : m = [ [1]*-h + r[:h] for r in m ]
    else   : m = [ r[h:] + [1]*h for r in m ]
    return m
# base matrix multiplied cell by cell with 3 shifted versions ...
maxProd = 0
for dv,dh in [(0,1),(1,0),(1,1),(-1,1)]:
    m = M # start with non-shifted values
    for i in range(1,4):
        # multiply by each shifted copy, cell by cell
        m = [ [a*b for a,b in zip(r0,r1)]
              for r0,r1 in zip(m,shift(M,dv*i,dh*i)) ]
    # record maximum of all resulting products
    maxProd = max(maxProd,max((max(row) for row in m)))
print(maxProd) # 70600674
To illustrate this shifting process, let's look at the 3 shifted versions going down-right on the main diagonal (offset: 1,1):
shifted by 1:
49 99 40 17 81 18 57 60 87 17 40 98 43 69 48 4 56 62 0 1
49 31 73 55 79 14 29 93 71 40 67 53 88 30 3 49 13 36 65 1
70 95 23 4 60 11 42 69 24 68 56 1 32 56 71 37 2 36 91 1
31 16 71 51 67 63 89 41 92 36 54 22 40 40 28 66 33 13 80 1
47 32 60 99 3 45 2 44 75 33 53 78 36 84 20 35 17 12 50 1
98 81 28 64 23 67 10 26 38 40 67 59 54 70 66 18 38 64 70 1
26 20 68 2 62 12 20 95 63 94 39 63 8 40 91 66 49 94 21 1
55 58 5 66 73 99 26 97 17 78 78 96 83 14 88 34 89 63 72 1
36 23 9 75 0 76 44 20 45 35 14 0 61 33 97 34 31 33 95 1
17 53 28 22 75 31 67 15 94 3 80 4 62 16 14 9 53 56 92 1
39 5 42 96 35 31 47 55 58 88 24 0 17 54 24 36 29 85 57 1
56 0 48 35 71 89 7 5 44 44 37 44 60 21 58 51 54 17 58 1
80 81 68 5 94 47 69 28 73 92 13 86 52 17 77 4 89 55 40 1
52 8 83 97 35 99 16 7 97 57 32 16 26 26 79 33 27 98 66 1
36 68 87 57 62 20 72 3 46 33 67 46 55 12 32 63 93 53 69 1
42 16 73 38 25 39 11 24 94 72 18 8 46 29 32 40 62 76 36 1
69 36 41 72 30 23 88 34 62 99 69 82 67 59 85 74 4 36 16 1
73 35 29 78 31 90 1 74 31 49 71 48 86 81 16 23 57 5 54 1
70 54 71 83 51 54 69 16 92 33 48 61 43 52 1 89 19 67 48 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
shifted by 2:
31 73 55 79 14 29 93 71 40 67 53 88 30 3 49 13 36 65 1 1
95 23 4 60 11 42 69 24 68 56 1 32 56 71 37 2 36 91 1 1
16 71 51 67 63 89 41 92 36 54 22 40 40 28 66 33 13 80 1 1
32 60 99 3 45 2 44 75 33 53 78 36 84 20 35 17 12 50 1 1
81 28 64 23 67 10 26 38 40 67 59 54 70 66 18 38 64 70 1 1
20 68 2 62 12 20 95 63 94 39 63 8 40 91 66 49 94 21 1 1
58 5 66 73 99 26 97 17 78 78 96 83 14 88 34 89 63 72 1 1
23 9 75 0 76 44 20 45 35 14 0 61 33 97 34 31 33 95 1 1
53 28 22 75 31 67 15 94 3 80 4 62 16 14 9 53 56 92 1 1
5 42 96 35 31 47 55 58 88 24 0 17 54 24 36 29 85 57 1 1
0 48 35 71 89 7 5 44 44 37 44 60 21 58 51 54 17 58 1 1
81 68 5 94 47 69 28 73 92 13 86 52 17 77 4 89 55 40 1 1
8 83 97 35 99 16 7 97 57 32 16 26 26 79 33 27 98 66 1 1
68 87 57 62 20 72 3 46 33 67 46 55 12 32 63 93 53 69 1 1
16 73 38 25 39 11 24 94 72 18 8 46 29 32 40 62 76 36 1 1
36 41 72 30 23 88 34 62 99 69 82 67 59 85 74 4 36 16 1 1
35 29 78 31 90 1 74 31 49 71 48 86 81 16 23 57 5 54 1 1
54 71 83 51 54 69 16 92 33 48 61 43 52 1 89 19 67 48 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
shifted by 3:
23 4 60 11 42 69 24 68 56 1 32 56 71 37 2 36 91 1 1 1
71 51 67 63 89 41 92 36 54 22 40 40 28 66 33 13 80 1 1 1
60 99 3 45 2 44 75 33 53 78 36 84 20 35 17 12 50 1 1 1
28 64 23 67 10 26 38 40 67 59 54 70 66 18 38 64 70 1 1 1
68 2 62 12 20 95 63 94 39 63 8 40 91 66 49 94 21 1 1 1
5 66 73 99 26 97 17 78 78 96 83 14 88 34 89 63 72 1 1 1
9 75 0 76 44 20 45 35 14 0 61 33 97 34 31 33 95 1 1 1
28 22 75 31 67 15 94 3 80 4 62 16 14 9 53 56 92 1 1 1
42 96 35 31 47 55 58 88 24 0 17 54 24 36 29 85 57 1 1 1
48 35 71 89 7 5 44 44 37 44 60 21 58 51 54 17 58 1 1 1
68 5 94 47 69 28 73 92 13 86 52 17 77 4 89 55 40 1 1 1
83 97 35 99 16 7 97 57 32 16 26 26 79 33 27 98 66 1 1 1
87 57 62 20 72 3 46 33 67 46 55 12 32 63 93 53 69 1 1 1
73 38 25 39 11 24 94 72 18 8 46 29 32 40 62 76 36 1 1 1
41 72 30 23 88 34 62 99 69 82 67 59 85 74 4 36 16 1 1 1
29 78 31 90 1 74 31 49 71 48 86 81 16 23 57 5 54 1 1 1
71 83 51 54 69 16 92 33 48 61 43 52 1 89 19 67 48 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Each number is moved to the next position diagonally, so the product of the cells at a given position (in the original and the 3 shifted matrices) corresponds to the 4 values going down-right along the diagonal that starts at that position.
We do this for all directions to get the maximum product.
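For readers who prefer explicit coordinates, here is a sketch of the direct-indexing alternative (a different technique from the shifting approach above). The function name max_line_product and the small 4x4 grid are illustrative only. The key difference from the loop in the question, which appears to reset y to 0 on every pass, is that the starting row and column are both driven by their own range().

```python
def max_line_product(grid, run=4):
    """Largest product of `run` adjacent cells in a straight line (illustrative helper)."""
    rows, cols = len(grid), len(grid[0])
    best = 0
    # Directions as (row step, column step): right, down, down-right, up-right.
    for dr, dc in [(0, 1), (1, 0), (1, 1), (-1, 1)]:
        for r in range(rows):
            for c in range(cols):
                # Skip starting cells whose run would leave the grid.
                end_r = r + dr * (run - 1)
                end_c = c + dc * (run - 1)
                if not (0 <= end_r < rows and 0 <= end_c < cols):
                    continue
                product = 1
                for k in range(run):
                    product *= grid[r + dr * k][c + dc * k]
                best = max(best, product)
    return best


small = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 1, 2, 3],
    [4, 5, 6, 7],
]
print(max_line_product(small))  # 1680, from the row 5 * 6 * 7 * 8
```

Called on the 20x20 matrix M built above, the same function should reproduce the result of the shifting approach.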

Is numpy's setdiff1d broken?

To select data for training and validation in my machine learning projects, I usually use numpy's masking functionality. So a typical recurring block of code to select the indices for training and validation data looks like this:
import numpy as np
validation_split = 0.2
all_idx = np.arange(0,100000)
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)))
idxTrain = np.setdiff1d(all_idx, idxValid)
Now the following should always be true:
len(all_idx) == len(idxValid)+len(idxTrain)
Unfortunately, I found out that this is not always the case. As I increase the number of elements chosen from the all_idx array, the resulting numbers do not add up properly. Here is another standalone example, which breaks as soon as I increase the number of randomly chosen validation indices above 1000:
import numpy as np
all_idx = np.arange(0,100000)
idxValid = np.random.choice(all_idx, 1000)
idxTrain = np.setdiff1d(all_idx, idxValid)
print(len(all_idx), len(idxValid), len(idxTrain))
This results in: 100000 1000 99005
I am confused. Please try it yourself. I would be glad to understand this.
idxValid = np.random.choice(all_idx, 10, replace=False)
Careful: you need to indicate that you don't want duplicates in idxValid. To do so, you just have to add replace=False to the np.random.choice call. From the documentation:
replace : boolean, optional
    Whether the sample is with or without replacement
Consider the following example:
all_idx = np.arange(0, 100)
print(all_idx)
>>> [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99]
Now if you print out your validation dataset:
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)))
print(idxValid)
>>> [31 57 55 45 26 25 55 76 33 69 49 90 46 14 18 30 89 73 47 82]
You can see that there are duplicates in the resulting array (55 appears twice). np.setdiff1d removes each duplicated value from all_idx only once, and thus
len(all_idx) == len(idxValid)+len(idxTrain)
does not evaluate to True.
What you need to do is make sure that np.random.choice samples without replacement by passing replace=False:
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)), replace=False)
Now the results should be as expected:
import numpy as np
validation_split = 0.2
all_idx = np.arange(0, 100)
print(all_idx)
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)), replace=False)
print(idxValid)
idxTrain = np.setdiff1d(all_idx, idxValid)
print(idxTrain)
print(len(all_idx) == len(idxValid)+len(idxTrain))
and the output is:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99]
[12 85 96 64 48 21 55 56 80 42 11 92 54 77 49 36 28 31 70 66]
[ 0 1 2 3 4 5 6 7 8 9 10 13 14 15 16 17 18 19 20 22 23 24 25 26
27 29 30 32 33 34 35 37 38 39 40 41 43 44 45 46 47 50 51 52 53 57 58 59
60 61 62 63 65 67 68 69 71 72 73 74 75 76 78 79 81 82 83 84 86 87 88 89
90 91 93 94 95 97 98 99]
True
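The bookkeeping can also be checked directly. The sketch below assumes NumPy's Generator API (np.random.default_rng) with an arbitrary seed, rather than the np.random.choice call used above. It shows that with replacement the lengths still add up once duplicates are counted only once, and that slicing a permutation is another way to obtain a disjoint split.

```python
import numpy as np

rng = np.random.default_rng(0)  # the seed is arbitrary, chosen for repeatability
all_idx = np.arange(100)

# Sampling WITH replacement (the default) may return repeated indices.
idx_valid_dup = rng.choice(all_idx, 20)
idx_train_dup = np.setdiff1d(all_idx, idx_valid_dup)
n_unique = len(np.unique(idx_valid_dup))
# Counting each validation index once makes the sizes add up again.
print(n_unique + len(idx_train_dup) == len(all_idx))

# Alternative: shuffle once and slice, which is disjoint by construction.
perm = rng.permutation(all_idx)
idx_valid, idx_train = perm[:20], perm[20:]
print(len(idx_valid) + len(idx_train) == len(all_idx))
```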
Consider using train_test_split from scikit-learn which is straight-forward:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)

How to write this code in an optimal (pythonic) way?

I have the following code in R and I need to write it efficiently in Python using pandas. I wrote a version, but it takes a long time to run.
1) Can someone confirm that this is equivalent to the R code?
2) How can I write it in a Pythonic (optimal) way?
in R
for (i in 1:dim(df1)[1])
    df1$column1[i] <- sum(df2[i,4:33])
in Python
for i in range(df1.shape[0]):
    df1['column1'][i] = df2.iloc[i,3:34].sum()
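Not quite: R's 4:33 is 1-based and inclusive (columns 4 to 33), which corresponds to the positional slice 3:33 in pandas, not 3:34. With that adjusted, these are two ways to make the replacement without a loop:
df1['column1'] = df2.iloc[:, 3:33].sum(axis=1)
OR
df1.loc[:, 'column1'] = df2.iloc[:, 3:33].sum(axis=1)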
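To make the index translation concrete, here is a small sketch on made-up data (the 6-column frame and its values are purely illustrative). An R range a:b is 1-based and includes both ends, so it maps to the pandas positional slice (a-1):b; R's 2:4 becomes iloc[:, 1:4].

```python
import pandas as pd

# Made-up data: 2 rows, 6 columns.
df2 = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                    [10, 20, 30, 40, 50, 60]])
df1 = pd.DataFrame(index=df2.index)

# R: sum(df2[i, 2:4]) covers columns 2, 3 and 4 (1-based, inclusive).
# pandas: positions 1, 2 and 3, i.e. the half-open slice 1:4.
df1["column1"] = df2.iloc[:, 1:4].sum(axis=1)

# The row-by-row loop gives the same numbers, only more slowly.
loop_sums = [df2.iloc[i, 1:4].sum() for i in range(len(df2))]

print(df1["column1"].tolist())  # [9, 90]
print(loop_sums == df1["column1"].tolist())
```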
Use vectorized operations:
>>> df = pd.DataFrame(np.random.randint(0, 100, (10, 15)), columns=list('abcdefghijklmno'))
>>> df
a b c d e f g h i j k l m n o
0 71 93 12 32 17 23 35 57 26 89 4 29 28 83 30
1 98 78 75 0 61 81 8 17 93 71 48 47 72 52 11
2 13 62 93 48 31 23 42 66 77 99 59 1 40 72 87
3 7 5 5 43 83 19 59 36 18 96 50 60 46 45 54
4 32 69 93 6 7 12 15 49 29 11 37 83 75 97 84
5 52 53 43 61 93 85 91 99 65 62 35 89 55 77 62
6 44 7 41 56 40 11 39 91 87 46 95 48 30 75 16
7 93 15 63 23 14 20 7 33 29 31 41 40 82 0 16
8 46 63 59 59 81 51 34 41 89 68 20 64 95 70 74
9 33 58 49 91 51 46 43 83 37 53 47 32 42 12 59
Then simply:
>>> df['column1'] = df.iloc[:, 3:8].sum(axis=1)
>>> df
a b c d e f g h i j k l m n o column1
0 71 93 12 32 17 23 35 57 26 89 4 29 28 83 30 164
1 98 78 75 0 61 81 8 17 93 71 48 47 72 52 11 167
2 13 62 93 48 31 23 42 66 77 99 59 1 40 72 87 210
3 7 5 5 43 83 19 59 36 18 96 50 60 46 45 54 240
4 32 69 93 6 7 12 15 49 29 11 37 83 75 97 84 89
5 52 53 43 61 93 85 91 99 65 62 35 89 55 77 62 429
6 44 7 41 56 40 11 39 91 87 46 95 48 30 75 16 237
7 93 15 63 23 14 20 7 33 29 31 41 40 82 0 16 97
8 46 63 59 59 81 51 34 41 89 68 20 64 95 70 74 266
9 33 58 49 91 51 46 43 83 37 53 47 32 42 12 59 314
>>>
