nested for loops seem to be overlapping - Python

I'm writing a program that has two nested for loops. I wrote my idea down on paper, drew a diagram for it, and thought my logic was fine, but when I executed it I got something totally different from what I wanted.
Then I created a small version of my program with only those two loops, but the problem remains. So here is the smaller version I was testing:
import numpy

lopt = 0
dwl = 0.04
Dt = 0.01
for i in numpy.arange(0, 5):
    for t in numpy.arange(lopt, dwl, Dt): # Inner loop (time)
        print 't = ', t, 'row = ', i
    lopt = t + 0.001
the output I got was
t = 0.0 row = 0
t = 0.01 row = 0
t = 0.02 row = 0
t = 0.03 row = 0
t = 0.031 row = 1
t = 0.032 row = 2
t = 0.033 row = 3
t = 0.034 row = 4
but what I want it to be is
t = 0.0 row = 0
t = 0.01 row = 0
t = 0.02 row = 0
t = 0.03 row = 0
t = 0.04 row = 0
t = 0.041 row = 1
t = 0.051 row = 1
t = 0.061 row = 1
t = 0.071 row = 1
t = 0.081 row = 1
t = 0.082 row = 2
t = 0.092 row = 2
t = 0.102 row = 2
t = 0.112 row = 2
t = 0.122 row = 2
t = 0.123 row = 3
t = 0.133 row = 3
t = 0.143 row = 3
t = 0.153 row = 3
t = 0.163 row = 3
t = 0.164 row = 4
t = 0.174 row = 4
t = 0.184 row = 4
t = 0.194 row = 4
t = 0.204 row = 4
t = 0.205 row = 5
t = 0.215 row = 5
t = 0.225 row = 5
t = 0.265 row = 5
t = 0.275 row = 5
My logic is:
The outer loop starts with 0, then the inner loop goes from 0 to 0.04; then we come back to the final statement of the outer loop, which increments the variable lopt by 0.001, and execute one step of the outer loop again. Now we run the inner loop one more time, starting at 0.041 and going to 0.081. This keeps repeating until the outer loop has executed 6 times.
Another question I have is: at least in my current output the inner loop seems to execute successfully on the first pass, but it only goes up to 0.03. Since my range goes from 0 to 0.04 with an increment of 0.01, shouldn't the loop run 0, 0.01, 0.02, 0.03, 0.04? Also, for the outer loop, shouldn't it run 0, 1, 2, 3, 4, 5?
I'm sorry if I'm bugging you with this question, but I honestly drew like 8 diagrams for this problem and also traced it by hand, and I think it should run as intended. I feel really dumb, because when I did it by hand I just couldn't catch any flaw in the logic.

Firstly, the range returned by numpy.arange is not inclusive of the stop value, so

for i in numpy.arange(0, 5):
    print i

will print 0, 1, 2, 3 and 4, but not 5. In order to get it to print the 5, you want to increase the stop by the step, which is 1 in the case of the outer loop (by default) and 0.01 in the case of the inner one.
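This endpoint behaviour is easy to check on its own (a minimal numpy sketch, separate from the question's loops):

```python
import numpy as np

# numpy.arange excludes the stop value, just like the built-in range()
print(list(np.arange(0, 5)))      # [0, 1, 2, 3, 4] -- the 5 is missing

# adding one step (1, the default) to the stop brings the endpoint back
print(list(np.arange(0, 5 + 1)))  # [0, 1, 2, 3, 4, 5]
```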
After the first execution of the inner loop, lopt is 0.031 (the last t, 0.03, plus 0.001), which means that the arguments to the inner numpy.arange are (0.031, 0.04, 0.01), which results in only one iteration. What you really want is

lopt = 0
dwl = 0.04
Dt = 0.01
for i in numpy.arange(0, 5):
    for t in numpy.arange(lopt, dwl + lopt + Dt, Dt):
        print 't = ', t, 'row = ', i
    lopt = t + 0.001

because that way the second evaluation of that inner numpy.arange will have arguments of (0.041, 0.091, 0.01), the third will have (0.082, 0.132, 0.01), and so on.
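Putting the two fixes together, here is a Python 3 sketch of the corrected loops (variable names from the question; the exact floating-point values may differ slightly in the last digits, so the values are rounded for display):

```python
import numpy as np

lopt = 0.0
dwl = 0.04
Dt = 0.01

rows = []
for i in np.arange(0, 5):
    # stop at dwl + lopt + Dt so the inner endpoint is included
    for t in np.arange(lopt, dwl + lopt + Dt, Dt):
        rows.append((round(t, 3), int(i)))
    lopt = t + 0.001  # next window starts just past the last t

for t, i in rows[:6]:
    print('t =', t, 'row =', i)
```

The first window now reaches 0.04, and the second window starts at 0.041, as the question intended.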

Related

How to extract list of columns from a csv file and create a new file with 2 new columns of calculated values

I have a csv file from which I want to extract a list of columns. After doing that, I would like to create a new csv file with the columns extracted from the previous one, plus two new columns, Function1 and Function2, that consist of the sum of the points for each client from time 0.001 s to 0.005 s and from 0.006 s to 0.01 s. So I have two problems.
This is the first .csv file; except for the time column, the other columns contain points assigned to the subjects named in the header (just to clarify):
time |client1_points| client2_points| client3_points| server1_points|server2_points
0.001
0.002
0.003
0.004
0.005
0.006
0.007
0.008
0.009
0.01
So, to filter only the columns that I want, I thought about a script like this:
import math
import pandas as pd
import matplotlib as plt
df = pd.read_csv('test.csv', sep=',')
print(df.columns) # check that the columns are parsed correctly
selected_columns = [col for col in df.columns if ["time, client"] in col]
df_filtered = df[selected_columns]
df_filtered.to_csv('new_testfile.csv')
Unfortunately, I don't know why this doesn't work; I'm not able to make the list ["time, client"] do what I want.
The next problem is that I want to create two functions, "Function1" and "Function2":
Function1 must calculate the sum of (client1_points + client2_points + client3_points) from time 0.001 to 0.005
Function2 must calculate the sum of (client1_points + client2_points + client3_points) from time 0.006 to 0.01
The values calculated by these two functions must be written directly into the new .csv that was created.
Any ideas to solve these problems?
I would like the resulting .csv sheet to have these columns:
time |client1_points| client2_points| client3_points| Func1| Func2
0.001 1 2 1 4 0
0.002 1 1 2 4 0
0.003 2 3 9 14 0
0.004 2 5 4 11 0
0.005 4 9 8 21 0
0.006 8 1 0 0 9
0.007 4 1 0 0 5
0.008 7 0 0 0 7
0.009 1 0 9 0 10
0.01 2 1 0 0 3
Thanks a lot to everyone.
First of all, you can use this code to keep only the desired columns:
columns = ['time'] + [el for el in df.columns if 'client' in el]
df = df[columns]
Then you have to perform two separate calculations (you could also do it in one step, but it may be slower) to find the sums:
df.loc[df['time'] <= 0.005, 'Func1'] = df['client1_points'] + df['client2_points'] + df['client3_points']
df.loc[df['time'] > 0.005, 'Func2'] = df['client1_points'] + df['client2_points'] + df['client3_points']
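The `.loc` assignments above leave NaN in the rows each mask excludes, so a `fillna(0)` afterwards reproduces the 0s in the desired table. A runnable sketch (the numbers are the hypothetical points from the question's example table):

```python
import pandas as pd

# hypothetical data matching the desired output table in the question
df = pd.DataFrame({
    'time': [0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009, 0.01],
    'client1_points': [1, 1, 2, 2, 4, 8, 4, 7, 1, 2],
    'client2_points': [2, 1, 3, 5, 9, 1, 1, 0, 0, 1],
    'client3_points': [1, 2, 9, 4, 8, 0, 0, 0, 9, 0],
})

clients = df['client1_points'] + df['client2_points'] + df['client3_points']
df.loc[df['time'] <= 0.005, 'Func1'] = clients
df.loc[df['time'] > 0.005, 'Func2'] = clients

# the rows each mask excluded are NaN; fill them with 0 as in the desired table
df[['Func1', 'Func2']] = df[['Func1', 'Func2']].fillna(0)
print(df)
```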

How to sample data from the proximity of existing data?

I have data for xor as below:

x  y  z  x ^ y ^ z
0  0  1  1
0  1  0  1
1  0  0  1
1  1  1  1

I kept only the rows that make the xor of all three equal to 1.
I want to generate synthetic data around the already available data, within some range, uniformly at random. The above table can be thought of as seed data. An example of the expected table is as follows:

x     y     z     x ^ y ^ z
0.1   0.3   0.8   0.9
0.25  0.87  0.03  0.99
0.79  0.09  0.28  0.82
0.97  0.76  0.91  0.89

The above table is sampled with a range of 0 to 0.3 for the value 0 and a range of 0.7 to 1 for the value 1.
I want to achieve this using pytorch.
For a problem such as this, you can synthesise the data completely without using a reference, because it has a simple solution. For a zero (0-0.3) you can use the torch.rand function to generate uniformly random data in 0-1 and scale it; for a one (0.7-1) you can do the same and just offset it. Note that for x ^ y ^ z to equal 1, z must be the negation of x XOR y (as in every row of the seed table), so z is derived from the x and y bits rather than drawn independently:

N = 4
p = 0.5 # change this to bias your outputs
x_is_1 = torch.rand(N) > p # decide if x is going to be 1 or 0
y_is_1 = torch.rand(N) > p # decide if y is going to be 1 or 0
x = torch.rand(N) * 0.3
x = torch.where(x_is_1, x + 0.7, x)
y = torch.rand(N) * 0.3
y = torch.where(y_is_1, y + 0.7, y)
z = (~torch.logical_xor(x_is_1, y_is_1)).float() # z = NOT (x XOR y)
triple_xor = 1 - torch.rand(z.shape) * 0.3
print(torch.stack([x, y, z, triple_xor]).T)
# x      y      z      x^y^z
tensor([[0.2615, 0.7676, 0.0000, 0.8832],
        [0.9895, 0.0370, 0.0000, 0.9796],
        [0.1406, 0.9203, 0.0000, 0.9646],
        [0.1799, 0.9722, 0.0000, 0.9327]])
Or, to treat your data as the basis (for more complex data), there is a preprocessing technique known as Gaussian noise injection, which seems to be what you're after. Or you can just define a function and call it a bunch of times (the parameter is named spread here to avoid shadowing the built-in range used below):

def add_noise(x, y, z, triple_xor, spread=0.3):
    def proc(dat, spread):
        # map values near 1 into (1-spread, 1] and values near 0 into [0, spread)
        return torch.where(dat > 0.5,
                           torch.rand(dat.shape) * spread + 1 - spread,
                           torch.rand(dat.shape) * spread)
    return proc(x, spread), proc(y, spread), proc(z, spread), proc(triple_xor, spread)

gen_new_data = torch.cat([torch.stack(add_noise(x, y, z, triple_xor)).T for _ in range(5)])
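A quick sanity check for generated rows (plain Python, no torch needed; is_valid_row is a hypothetical helper, not part of either snippet above): each value should sit in one of the two bands, and rounding every value back to a bit should satisfy x ^ y ^ z = 1.

```python
def is_valid_row(x, y, z, triple_xor):
    # every value must sit in the "0" band [0, 0.3] or the "1" band [0.7, 1]
    vals = (x, y, z, triple_xor)
    in_band = all(0 <= v <= 0.3 or 0.7 <= v <= 1 for v in vals)
    # rounding each value back to a bit must satisfy x ^ y ^ z = 1
    bits = [1 if v > 0.5 else 0 for v in vals]
    return in_band and (bits[0] ^ bits[1] ^ bits[2]) == bits[3]

# rows taken from the expected table in the question
print(is_valid_row(0.1, 0.3, 0.8, 0.9))      # True
print(is_valid_row(0.25, 0.87, 0.03, 0.99))  # True
```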

Calculate running total based off original value in pandas

I wish to take an initial value of 1000, multiply it by the first value in the 'Change' column, then take that result and multiply it by the second value in the 'Change' column, and so on.
I could do this by using a loop as follows
import numpy as np
import pandas as pd

changes = [0.97, 1.02, 1.1, 0.88, 1.01]
df = pd.DataFrame()
df['Change'] = changes
df['Total'] = np.nan
df['Total'][0] = 1000 * df['Change'][0]
for i in range(1, len(df)):
    df['Total'][i] = df['Total'][i-1] * df['Change'][i]
Output:
Change Total
0 0.97 970.000000
1 1.02 989.400000
2 1.10 1088.340000
3 0.88 957.739200
4 1.01 967.316592
But this will be too slow for a large dataset. Is there any way to do this without loops?
Thanks
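One loop-free way to express this (a sketch, not from the original thread) is `Series.cumprod`: the running product of the change factors, scaled by the initial value, reproduces the output above.

```python
import pandas as pd

changes = [0.97, 1.02, 1.1, 0.88, 1.01]
df = pd.DataFrame({'Change': changes})

# running product of the change factors, scaled by the initial 1000
df['Total'] = 1000 * df['Change'].cumprod()
print(df)
```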

Python: How to change same numbers in a Series/Column to other values?

I am trying to change the values of a very long column (about 1 million entries) in a data frame. I have something like
####ID_Orig
3452
3452
3452
6543
6543
...
I want something like
####ID_new
0
0
0
1
1
...
At the moment I'm doing this:
j = 0
for i in range(0, 1199531):
    if data.ID_orig[i] == data.ID_orig[i+1]:
        data.ID_orig[i] = j
    else:
        data.ID_orig[i] = j
        j = j + 1
Which takes ages... Is there a faster way to do this?
I don't know what values ID_orig has or how often a single value comes up.
Use factorize, but note that if a group of values repeats later, its output values are set to the same number as before.
Another solution, comparing shifted values with ne (!=) and taking the cumsum, is more general: it always creates new values, even for repeating group values:
df['ID_new1'] = pd.factorize(df['ID_Orig'])[0]
df['ID_new2'] = df['ID_Orig'].ne(df['ID_Orig'].shift()).cumsum() - 1
print(df)
ID_Orig ID_new1 ID_new2
0 3452 0 0
1 3452 0 0
2 3452 0 0
3 6543 1 1
4 6543 1 1
5 100 2 2
6 100 2 2
7 6543 1 3 <-repeating group
8 6543 1 3 <-repeating group
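A runnable check of both methods, using the same sample IDs as the table above:

```python
import pandas as pd

df = pd.DataFrame({'ID_Orig': [3452, 3452, 3452, 6543, 6543, 100, 100, 6543, 6543]})

# factorize: the repeated 6543 group maps back to the same number (1)
df['ID_new1'] = pd.factorize(df['ID_Orig'])[0]

# shift/ne/cumsum: the repeated 6543 group gets a brand-new number (3)
df['ID_new2'] = df['ID_Orig'].ne(df['ID_Orig'].shift()).cumsum() - 1
print(df)
```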
You can do this …
import collections

l1 = [3452, 3452, 3452, 6543, 6543]
c = collections.Counter(l1)
l2 = list(c.items())
l3 = []
for i, t in enumerate(l2):
    for x in range(t[1]):
        l3.append(i)
for x in l3:
    print(x)
This is the output:
0
0
0
1
1
You can use the following approach, in which duplicate ids in the original column get the same new ids. The implementation drops duplicates from the column, assigns a different number to each unique id to form the new ids, and then merges these new ids back into the original dataset:
import numpy as np
import pandas as pd
from time import time
num_rows = 119953
input_data = np.random.randint(1199531, size=(num_rows,1))
data = pd.DataFrame(input_data)
data.columns = ["ID_orig"]
data2 = pd.DataFrame(input_data)
data2.columns = ["ID_orig"]
t0 = time()
j = 0
for i in range(0, num_rows-1):
    if data.ID_orig[i] == data.ID_orig[i+1]:
        data.ID_orig[i] = j
    else:
        data.ID_orig[i] = j
        j = j + 1
t1 = time()
id_new = data2.loc[:,"ID_orig"].drop_duplicates().reset_index().drop("index", axis=1)
id_new.reset_index(inplace=True)
id_new.columns = ["id_new"] + id_new.columns[1:].values.tolist()
data2 = data2.merge(id_new, on="ID_orig")
t2 = time()
print("Previous: ", round(t1-t0, 2), " seconds")
print("Current : ", round(t2-t1, 2), " seconds")
The output of the above program using only 119k rows is
Previous: 12.16 seconds
Current : 0.06 seconds
The runtime difference increases even more as the number of rows is increased.
EDIT
Using the same number of rows:
>>> print("Previous: ", round(t1-t0, 2))
Previous: 11.7
>>> print("Current : ", round(t2-t1, 2))
Current : 0.06
>>> print("jezrael's answer : ", round(t3-t2, 2))
jezrael's answer : 0.02

Adding list elements to empty list

I'm trying to take the first element of a list created within a loop and add it to an empty list.
Easy, right? But my code is not working...
The empty list is index.
The list I'm pulling the numbers from is data.
import pandas as pd
import numpy as np

ff = open("/Users/me/Documents/ff_monthly.txt", "r")
data = list(ff)
f = []
index = []
for i in range(1, len(data)):
    t = data[i].split()
    index.append(int(t[0]))
    for j in range(1, 5):
        k = float(t[j])
        f.append(k / 100)
n = len(f)
f1 = np.reshape(f, [n / 4, 4])
ff = pd.DataFrame(f1, index=index, columns=['Mkt_Rf', 'SMB', 'HML', 'Rf'])
ff.to_pickle("ffMonthly.pickle")
print ff.head()
When I create the list t in the loop, I've checked that it is being created correctly:
len(t) = 5
print t[0] = 192607
print t = ['192607', '2.95', '-2.54', '-2.65', '0.22']
The code:
index.append(t[0])
It should add the 1st element of the list t to index ... correct?
However, I get this error:
IndexError: list index out of range
What am I missing here?
Edit:
Posted the entire code above.
Here are the first few rows of "ff_monthly.txt"
Mkt-RF SMB HML RF
192607 2.95 -2.54 -2.65 0.22
192608 2.64 -1.22 4.25 0.25
192609 0.37 -1.21 -0.39 0.23
192610 -3.24 -0.09 0.26 0.32
192611 2.55 -0.15 -0.54 0.31
192612 2.62 -0.13 -0.08 0.28
Check if there is an empty line in your txt file,
because "".split() will return an empty list, which makes t[0] raise the IndexError.
Quick and ugly:

for i in range(1, len(data)):
    t = data[i].split()
    if t:
        index.append(int(t[0]))
        for j in range(1, 5):
            k = float(t[j])
            f.append(k / 100)
Reason:
You have empty lines at the end of the file. Reproduced locally.
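As an alternative (a sketch, not from the original answer): pandas can read the whitespace-delimited file directly, and read_csv skips blank lines by default, which sidesteps the IndexError entirely. Here io.StringIO stands in for the real file:

```python
import io
import pandas as pd

# stand-in for ff_monthly.txt, trailing blank line included
sample = """Mkt-RF SMB HML RF
192607 2.95 -2.54 -2.65 0.22
192608 2.64 -1.22 4.25 0.25

"""

# read_csv skips blank lines by default (skip_blank_lines=True)
ff = pd.read_csv(io.StringIO(sample), sep=r'\s+', skiprows=1, header=None,
                 names=['date', 'Mkt_Rf', 'SMB', 'HML', 'Rf'], index_col='date')
ff = ff / 100  # same scaling as the loop above
```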
