How to calculate one row based on another row in pandas - Python

[picture of the data]
Sorry for the inconvenience of the picture of the data!
I have this data, and I am trying to calculate EMA_20 for each row based on the EMA_20 of the row before it.
Example: calculate EMA_20 at index 1003 based on EMA_20 at index 1004. I tried using vectorization for speed, but I don't know how to refer to the row at a given index:
def vec_EMA(data, indicator=20):
    K = 2 / (indicator + 1)
    if data['index'].values[0] == len(data) - 1:
        return data["close"] * K + data["SMA_" + str(indicator)] * (1 - K)
    return data["close"] * K + data["EMA_20"][data.index + 1] * (1 - K)

new_data['EMA_20'] = vec_EMA(new_data)
The result looks like the one in the picture, but it is not exactly what I am trying to do.
The expected output is:
EMA_20 at index 1003 = data['close'] at index 1003 * K + EMA_20 at index 1004 * (1 - K), where K = 2/(20+1)
The result should be 47.13531746031746, not 39.158333.

Instead of trying to update the dataframe directly, building a list and finally returning the list from the function would be an easier approach:
def vec_EMA(df, indicator=20):
    EMA_20_list = []
    K = 2 / (indicator + 1)
    # The rows are in reverse chronological order (index 0 is the newest),
    # so walk the index backwards, starting from the oldest row.
    for index in df.index[::-1]:
        if index == len(df) - 1:
            # Seed the oldest row with the SMA.
            value = df.loc[index, 'close'] * K + (1 - K) * df.loc[index, "SMA_" + str(indicator)]
        else:
            # EMA_20_list[-1] holds the value computed for the previous (older) row.
            value = df.loc[index, 'close'] * K + (1 - K) * EMA_20_list[-1]
        EMA_20_list.append(value)
    # The list was built oldest-row-first; reverse it to match the dataframe order.
    return EMA_20_list[::-1]

df['EMA_20'] = vec_EMA(df)
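For comparison, here is a vectorized sketch using pandas' built-in ewm, assuming the rows are stored newest-first as in the question. Note that ewm(span=20, adjust=False) uses the same K = 2/(20+1) but seeds the recursion with the first close rather than with SMA_20, so the earliest values will not match the SMA-seeded variant exactly:

# Reverse to chronological order, apply ewm, then reverse back.
df['EMA_20_vec'] = df['close'][::-1].ewm(span=20, adjust=False).mean()[::-1]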

Question on calculation speed difference between two essentially identical pieces of code

Here are two pieces of code which handle the same data and return essentially the same result.
1.
for j in range(np.shape(I)[0]):
    if j % int(np.shape(I)[0] / 10) == 0:
        print(str(j / np.shape(I)[0] * 100) + '% ........... is done')
    for k in range(np.shape(I)[1]):
        for i in range(np.shape(I)[1]):
            if abs(time_resc_array[j, k] - time_tar[i]) < t_toler:
                I_pp[i] = I_pp[i] + I[j, k]
                count[i] = count[i] + 1
                norm[i] = norm[i] + 1
                break
Here, I and time_resc_array are 290×10000 numpy arrays, and count, I_pp, norm, and time_tar are length-290 numpy arrays.
2.
trial = int(n_rep * N / 10)
freq11 = freq1 * 10**6
average = 100 * 10**6
tau = np.zeros(trial)
pp_seq = int(n_rep2 * (t_unit2 * 10**-9) * 10 * average)
Narray = np.arange(0, pp_seq)
pp_tk = Narray * 1 / (10 * average)  # divide 1 period of average freq by 10
pp_data = np.zeros(pp_seq)
pp_cnt = np.zeros(pp_seq)
Narray = np.arange(1, n_rep2 + 1)
oper_tk = Narray * (t_unit2 * 10**-9)
for i in range(0, A):
    if i % int(trial) == 0:
        print(str(i / trial * 100) + '.......... % is done')
    ptr = i % n_rep2
    tau[i] = oper_tk[ptr] * freq11[i // n_rep2][i % n_rep2] / average
    for j in range(0, pp_seq):
        if ptr == 0:
            break
        elif tau[i] < pp_tk[j]:
            pp_data[j] += I[i // n_rep2][i % n_rep2]
            pp_cnt[j] += 1
            break
where freq1 and I are 290×10000 arrays. The first code is approximately 4-5 times slower than the second one, and I don't grasp the reason. Could somebody please help me understand what I am doing wrong in the first one?
P.S. The second code is not mine, so it may be deleted sooner or later.
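A note on the likely cause: the first snippet rescans time_tar from the start for every element of I in pure-Python loops, while the second does at most a couple of comparisons per element before breaking out. If time_tar is sorted in ascending order, the whole matching loop of the first snippet could be replaced by one vectorized pass; a hedged sketch (an assumption-laden rewrite, not code from the question):

import numpy as np

x = time_resc_array.ravel()
vals = I.ravel()

# For each sample, find the first index i with time_tar[i] >= x - t_toler;
# since time_tar is assumed sorted, this is the first candidate in tolerance.
idx = np.searchsorted(time_tar, x - t_toler)
idx = np.clip(idx, 0, len(time_tar) - 1)
ok = np.abs(time_tar[idx] - x) < t_toler

# np.add.at accumulates correctly even when indices repeat.
np.add.at(I_pp, idx[ok], vals[ok])
np.add.at(count, idx[ok], 1)
np.add.at(norm, idx[ok], 1)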

Any tips to improve performance when using nested loops with Python?

I had this exercise where I receive a list of integers and have to find how many pairs sum to a multiple of 60.
Example:
input: list01 = [10,90,50,40,30]
result = 2
explanation: 10 + 50, 90 + 30
Example 2:
input: list02 = [60,60,60]
result = 3
explanation: list02[0] + list02[1], list02[0] + list02[2], list02[1] + list02[2]
Seems pretty easy, so here is my code:
def getPairCount(numbers):
    total = 0
    cont = 0
    for n in numbers:
        cont += 1
        for n2 in numbers[cont:]:
            if (n + n2) % 60 == 0:
                total += 1
    return total
It works; however, for a big input with over 100k numbers it takes too long to run, and I need it to run in under 8 seconds. Any tips on how to solve this, whether with a library I'm unaware of or by solving it without a nested loop?
Here's a simple solution that should be extremely fast (it runs in O(n) time). It makes use of the following observation: We only care about each value mod 60. E.g. 23 and 143 are effectively the same.
So rather than making an O(n**2) nested pass over the list, we instead count how many of each value we have, mod 60, so each value we count is in the range 0 - 59.
Once we have the counts, we can consider the pairs that sum to 0 or 60. The pairs that work are:
0 + 0
1 + 59
2 + 58
...
29 + 31
30 + 30
After this, the order reverses, so we only want to count each pair once.
There are two cases where the two values are the same: 0 + 0 and 30 + 30. For each of these, the number of pairs is (count * (count - 1)) // 2. Note that this works when count is 0 or 1, since in both cases we're multiplying by zero.
If the two values are different, then the number of pairs is simply the product of their counts.
Here's the code:
def getPairCount(numbers):
    # Count how many of each value we have, mod 60
    count_list = [0] * 60
    for n in numbers:
        n2 = n % 60
        count_list[n2] += 1
    # Now find the total
    total = 0
    c0 = count_list[0]
    c30 = count_list[30]
    total += (c0 * (c0 - 1)) // 2
    total += (c30 * (c30 - 1)) // 2
    for i in range(1, 30):
        j = 60 - i
        total += count_list[i] * count_list[j]
    return total
This runs in O(n) time, due to the initial one-time pass we make over the list of input values. The loop at the end is just iterating from 1 through 29 and isn't nested, so it should run almost instantly.
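A quick check against the two examples from the question:

print(getPairCount([10, 90, 50, 40, 30]))  # 2  (10 + 50 and 90 + 30)
print(getPairCount([60, 60, 60]))          # 3  (all three pairs sum to 120)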
Below is a translation of Tom Karzes's answer, but using numpy. I benchmarked it, and it is only faster if the input is already a numpy array, not a list. I still want to include it here because it nicely shows how loops in Python can be one-liners in numpy.
import numpy as np

def get_pairs_count(numbers, /):
    # Count how many of each value we have, modulo 60.
    # np.bincount(..., minlength=60) guarantees one count per residue, even
    # for residues that never occur (np.unique would skip those and misalign
    # the indexing below).
    numbers_mod60 = np.mod(numbers, 60)
    counts = np.bincount(numbers_mod60, minlength=60)
    # Now find the total.
    total = 0
    c0 = counts[0]
    c30 = counts[30]
    total += (c0 * (c0 - 1)) // 2
    total += (c30 * (c30 - 1)) // 2
    total += np.dot(counts[1:30:+1], counts[59:30:-1])  # Notice the slicing indices used.
    return total
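The same check for the numpy version:

import numpy as np
print(get_pairs_count(np.array([10, 90, 50, 40, 30])))  # 2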

Making a permanent change in a dataframe using Python pandas

I would like to convert my dataframe from one format of values (X:XX:XX:XX) to another (X.X seconds).
Here is what my dataframe looks like:
Start End
0 0:00:00:00
1 0:00:00:00 0:07:37:80
2 0:08:08:56 0:08:10:08
3 0:08:13:40
4 0:08:14:00 0:08:14:84
And I would like to transform it into seconds, something like this:
Start End
0 0.0
1 0.0 457.80
2 488.56 490.08
3 493.40
4 494.0 494.84
To do that I did:
i = 0
j = 0
while j < 10:
    while i < 10:
        if data.iloc[i, j] != "":
            Value = (int(data.iloc[i, j][0]) * 3600) + (int(data.iloc[i, j][2:4]) * 60) + int(data.iloc[i, j][5:7]) + (int(data.iloc[i, j][8:10]) / 100)
            NewValue = data.iloc[:, j].replace([data.iloc[i, j]], Value)
            i += 1
        else:
            NewValue = data.iloc[:, j].replace([data.iloc[i, j]], "")
            i += 1
        data.update(NewValue)
    i = 0
    j += 1
But I failed to replace the new values in my old dataframe in a permanent way: when I do
print(data)
I still get my old dataframe in the wrong format.
Could someone help me? I have tried so hard!
Thank you so much!
You are using pandas.DataFrame.update, which requires a pandas dataframe as an argument. See the Example part of the update function documentation to really understand what update does: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
If I may suggest a more idiomatic solution: you can directly map a function over all values of a pandas Series:
def parse_timestring(s):
    if s == "":
        return s
    else:
        # Weird to use centiseconds and not milliseconds.
        # l is a list with [hour, minute, second, cs].
        l = [int(nbr) for nbr in s.split(":")]
        return sum([a * b for a, b in zip(l, (3600, 60, 1, 0.01))])

df["Start"] = df["Start"].map(parse_timestring)
You can remove the if ... else ... from parse_timestring if you first replace all empty strings in your dataframe with NaN values, using df = df.replace("", numpy.nan), and then use df["Start"] = df["Start"].map(parse_timestring, na_action='ignore').
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html
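A quick sanity check of parse_timestring against values from the question:

print(parse_timestring("0:07:37:80"))  # 457.8
print(parse_timestring("0:08:08:56"))  # 488.56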
The datetime library is made to deal with such data. You should also use the apply function of pandas to avoid iterating over the dataframe like that.
You could proceed as follows:
from datetime import datetime, timedelta

def to_seconds(date):
    if date == "":
        return date
    # The format is hours:minutes:seconds:centiseconds.
    comp = date.split(':')
    delta = (datetime.strptime(':'.join(comp[:3]), "%H:%M:%S") - datetime(1900, 1, 1)) + timedelta(milliseconds=int(comp[3]) * 10)
    return delta.total_seconds()

data['Start'] = data['Start'].apply(to_seconds)
data['End'] = data['End'].apply(to_seconds)
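A quick check against a value from the question:

print(to_seconds("0:08:10:08"))  # 490.08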
Thank you so much for your help. Your method works. I also found a method using loops.
To summarize, my general problem was that I had an ugly csv file that I wanted to transform into a csv usable for statistics, and I wanted to use Python to do it.
My csv file looked like this:
MiceID = 1 Beginning End Type of behavior
0 0:00:00:00 Video start
1 0:00:01:36 grooming type 1
2 0:00:03:18 grooming type 2
3 0:00:06:73 0:00:08:16 grooming type 1
So in my ugly csv file, I wrote only the moment of the beginning of each behavior type, without the end, when the different types of behaviors directly followed each other; I wrote the moment of the end of a behavior only when the mouse stopped grooming altogether, which allowed me to separate sequences of grooming. But this type of csv was not usable for easily doing statistics.
So I wanted to 1) transform all my values into seconds to have a correct format, 2) fill the gaps in the End column (a gap has to be filled with the following Beginning value, as the end of a specific behavior in a sequence is the beginning of the following one), 3) create columns for the duration of each behavior, and finally 4) fill this new column with the durations.
My question was about the first step, but here is the code for each step separately:
step 1: transform the values into a good format
import pandas as pd
import numpy as np

data = pd.read_csv("D:/Python/TestPythonTraitementDonnéesExcel/RawDataBatch2et3.csv", engine="python")
data.replace(np.nan, "", inplace=True)
i = 0
j = 0
while j < len(data.columns):
    while i < len(data.index):
        if ":" in data.iloc[i, j]:
            Value = str((int(data.iloc[i, j][0]) * 3600) + (int(data.iloc[i, j][2:4]) * 60) + int(data.iloc[i, j][5:7]) + (int(data.iloc[i, j][8:10]) / 100))
            data = data.replace([data.iloc[i, j]], Value)
            data.update(data)
            i += 1
        else:
            i += 1
    i = 0
    j += 1
print(data)
step 2: fill the gaps
i = 0
j = 2
while j < len(data.columns):
    while i < len(data.index) - 1:
        if data.iloc[i, j] == "":
            data.iloc[i, j] = data.iloc[i + 1, j - 1]
            data.update(data)
            i += 1
        elif np.all(data.iloc[i:len(data.index), j] == ""):
            break
        else:
            i += 1
    i = 0
    j += 4
print(data)
step 3: create a new column for each mouse:
j = 1
k = 0
while k < len(data.columns) - 1:
    k = (j * 4) + (j - 1)
    data.insert(k, "Duree{}".format(k), "")
    data.update(data)
    j += 1
print(data)
step 4: fill the duration columns
j = 4
i = 0
while j < len(data.columns):
    while i < len(data.index):
        if data.iloc[i, j - 2] != "":
            data.iloc[i, j] = str(float(data.iloc[i, j - 2]) - float(data.iloc[i, j - 3]))
            data.update(data)
            i += 1
        else:
            break
    i = 0
    j += 5
print(data)
And of course, export my new usable dataframe:
data.to_csv(r"D:/Python/TestPythonTraitementDonnéesExcel/FichierPropre.csv", index = False, header = True)
Here are the transformations:
[pictures: the dataframe before step 1 and after each of steps 1-4]

Knapsack problem (optimized version doesn't work correctly)

I am working on Python code to solve the knapsack problem.
Here is my code:
import time

start_time = time.time()

# Reading the data:
values = []
weights = []
test = []
with open("test.txt") as file:
    W, size = map(int, next(file).strip().split())
    for line in file:
        value, weight = map(int, line.strip().split())
        values.append(int(value))
        weights.append(int(weight))
weights = [0] + weights
values = [0] + values

# Knapsack algorithm:
hash_table = {}
for x in range(0, W + 1):
    hash_table[(0, x)] = 0
for i in range(1, size + 1):
    for x in range(0, W + 1):
        if weights[i] > x:
            hash_table[(i, x)] = hash_table[i - 1, x]
        else:
            hash_table[(i, x)] = max(hash_table[i - 1, x], hash_table[i - 1, x - weights[i]] + values[i])

print("--- %s seconds ---" % (time.time() - start_time))
This code works correctly, but on big files my program crashes due to RAM issues. So I decided to change the following part:
for i in range(1, size + 1):
    for x in range(0, W + 1):
        if weights[i] > x:
            hash_table[(1, x)] = hash_table[0, x]
            #hash_table[(0, x)] = hash_table[1, x]
        else:
            hash_table[(1, x)] = max(hash_table[0, x], hash_table[0, x - weights[i]] + values[i])
        hash_table[(0, x)] = hash_table[(1, x)]
As you can see, instead of using n rows I am using only two, copying the second row into the first in order to recreate the line hash_table[(i,x)] = hash_table[i - 1,x]; this should solve the RAM issues. But unfortunately it gives me a wrong result.
I have used the following test case:
190 6
50 56
50 59
64 80
46 64
50 75
5 17
I should get a total value of 150 and a total weight of 190 using 3 items:
an item with value 50 and weight 75,
an item with value 50 and weight 59,
and an item with value 50 and weight 56.
More test cases: https://people.sc.fsu.edu/~jburkardt/datasets/knapsack_01/knapsack_01.html
The problem here is that you need to reset all the values once per iteration over i (not inside the loop over x), and the reset itself needs the x index, so you can use another loop:
for i in range(1, size + 1):
    for x in range(0, W + 1):
        if weights[i] > x:
            hash_table[(1, x)] = hash_table[0, x]
        else:
            hash_table[(1, x)] = max(hash_table[0, x], hash_table[0, x - weights[i]] + values[i])
    for x in range(0, W + 1):  # Make sure to reset after working on item i
        hash_table[(0, x)] = hash_table[(1, x)]
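As an aside, a common further simplification of the two-row scheme is to keep a single 1-D row and iterate the capacity in reverse, which removes the reset loop entirely; a minimal sketch:

def knapsack(W, values, weights):
    dp = [0] * (W + 1)
    for v, w in zip(values, weights):
        # Go from high capacity to low so dp[x - w] still refers to the
        # previous item's row, i.e. each item is used at most once.
        for x in range(W, w - 1, -1):
            dp[x] = max(dp[x], dp[x - w] + v)
    return dp[W]

# Test case from the question; the expected optimal value is 150.
print(knapsack(190, [50, 50, 64, 46, 50, 5], [56, 59, 80, 64, 75, 17]))  # 150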

Looping over lists in Python, indexing (basic bootstrap)

Given the following two lists:
dates = [1,2,3,4,5]
rates = [0.0154, 0.0169, 0.0179, 0.0187, 0.0194]
I would like to generate a list
df = []
of the same length as dates and rates (indices 0 to 4 = 5 elements) in 'pure' Python (without NumPy), as an exercise.
df[i] would be equal to:
df[0] = 1 / (1 + rates[0])
df[1] = (1 - df[0] * rates[1]) / (1 + rates[1])
...
df[4] = (1 - (df[0] + df[1] + ... + df[3]) * rates[4]) / (1 + rates[4])
I was trying:
df = []
df.append(1 + rates[0])  #create df[0]
for date in enumerate(dates, start=1):
    running_sum_vec = 0
    for i in enumerate(rates, start=1):
        running_sum_vec += df[i] * rates[i]
    df[i] = (1 - running_sum_vec) / (1 + rates[i])
return df
but I am getting a TypeError: list indices must be integers. Thank you.
The enumerate method returns two values, the index and the value:
>>> x = ['a', 'b', 'a']
>>> for y_count, y in enumerate(x):
...     print('index: {}, value: {}'.format(y_count, y))
...
index: 0, value: a
index: 1, value: b
index: 2, value: a
It's because of for i in enumerate(rates, start = 1):. enumerate generates tuples of the index and the object in the list. You should do something like:
for i, rate in enumerate(rates, start=1):
    running_sum_vec += df[i] * rate
You'll need to fix the other loop (for date in enumerate...) as well.
You also need to move df[i] = (1 - running_sum_vec) / (1 + rates[i]) back into the loop (currently it will only set the last value), and change it to append, since currently it tries to set an index that is out of bounds.
Not sure if this is what you want:
total = 0  # running sum of df so far (renamed from sum, which shadows the built-in)
df = []
for ind, val in enumerate(dates):
    df.append((1 - (total * rates[ind])) / (1 + rates[ind]))
    total += df[ind]
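Running this on the dates and rates from the question produces five discount factors; the first one is 1 / (1 + 0.0154), roughly 0.9848, matching the formula for df[0] above:

print(df[0])  # ~0.9848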
enumerate returns both the index and the entry. So, assuming the lists contain numbers, your code can be:
df = []
df.append(1 / (1 + rates[0]))  # create df[0]
for i, rate in enumerate(rates[1:], start=1):
    running_sum_vec = sum(df) * rate  # (df[0] + ... + df[i-1]) * rates[i]
    df.append((1 - running_sum_vec) / (1 + rate))
Although I'm almost positive there's a way with list comprehension. I'll have to think about it for a bit.
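For the record, one loop-free formulation: since df[i] = (1 - S * rates[i]) / (1 + rates[i]), where S is the running sum of df so far, the running sum itself satisfies S_new = (S + 1) / (1 + rates[i]), and itertools.accumulate can compute that directly (a sketch; initial= requires Python 3.8+):

from itertools import accumulate

# partial_sums[k] = df[0] + ... + df[k-1]; the recurrence follows from
# S_i = S_{i-1} + (1 - S_{i-1} * r_i) / (1 + r_i) = (S_{i-1} + 1) / (1 + r_i).
partial_sums = list(accumulate(rates, lambda s, r: (s + 1) / (1 + r), initial=0))
df = [b - a for a, b in zip(partial_sums, partial_sums[1:])]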
