Optimizing while loop in Python - python

I have a piece of code that takes forever to run. Does anybody know how to optimize it?
The purpose of the formula is to make a column that does the following: when 'action' != 0, if 'PX_LAST'<'ma', populate 'buy_sell' with -1, if 'PX_LAST'>'ma', populate 'buy_sell' with 1; in the other cases, do not populate 'buy_sell' with new values.
Fyi - Column 'action' is populated with either 0 or 1
#create column
df_zinc['buy_sell'] = 0
index = 0
while index < df_zinc.shape[0]:
if df_zinc['action'][index] != 0:
continue
if df_zinc['PX_LAST'][index]<df_zinc['ma'][index]:
df_zinc.loc[index,'buy_sell'] = -1
elif df_zinc['PX_LAST'][index]>df_zinc['ma'][index]:
df_zinc.loc[index,'buy_sell'] = 1
else:
index = index + 1

I think you need:
import numpy as np
mask1 = df_zinc['action'] != 0
mask2 = df_zinc['PX_LAST'] < df_zinc['ma']
mask3 = df_zinc['PX_LAST'] > df_zinc['ma']
df_zinc['buy_sell'] = np.select([mask1 & mask2, mask1 & mask3], [-1,1], 0)

Related

Making permanent change in a dataframe using python pandas

I would like to convert y dataframe from one format (X:XX:XX:XX) of values to another (X.X seconds)
Here is my dataframe looks like:
Start End
0 0:00:00:00
1 0:00:00:00 0:07:37:80
2 0:08:08:56 0:08:10:08
3 0:08:13:40
4 0:08:14:00 0:08:14:84
And I would like to transform it in seconds, something like that
Start End
0 0.0
1 0.0 457.80
2 488.56 490.80
3 493.40
4 494.0 494.84
To do that I did:
i = 0
j = 0
while j < 10:
while i < 10:
if data.iloc[i, j] != "":
Value = (int(data.iloc[i, j][0]) * 3600) + (int(data.iloc[i, j][2:4]) *60) + int(data.iloc[i, j][5:7]) + (int(data.iloc[i, j][8: 10])/100)
NewValue = data.iloc[:, j].replace([data.iloc[i, j]], Value)
i += 1
else:
NewValue = data.iloc[:, j].replace([data.iloc[i, j]], "")
i += 1
data.update(NewValue)
i = 0
j += 1
But I failed to replace the new values in my oldest dataframe in a permament way, when I do:
print(data)
I still get my old data frame in the wrong format.
Some one could hep me? I tried so hard!
Thank you so so much!
You are using pandas.DataFrame.update that requires a pandas dataframe as an argument. See the Example part of the update function documentation to really understand what update does https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
If I may suggest a more idiomatic solution; you can directly map a function to all values of a pandas Series
def parse_timestring(s):
if s == "":
return s
else:
# weird to use centiseconds and not milliseconds
# l is a list with [hour, minute, second, cs]
l = [int(nbr) for nbr in s.split(":")]
return sum([a*b for a,b in zip(l, (3600, 60, 1, 0.01))])
df["Start"] = df["Start"].map(parse_timestring)
You can remove the if ... else ... from parse_timestring if you replace all empty string with nan values in your dataframe with df = df.replace("", numpy.nan) then use df["Start"] = df["Start"].map(parse_timestring, na_action='ignore')
see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html
The datetimelibrary is made to deal with such data. You should also use the apply function of pandas to avoid iterating on the dataframe like that.
You should proceed as follow :
from datetime import datetime, timedelta
def to_seconds(date):
comp = date.split(':')
delta = (datetime.strptime(':'.join(comp[1:]),"%H:%M:%S") - datetime(1900, 1, 1)) + timedelta(days=int(comp[0]))
return delta.total_seconds()
data['Start'] = data['Start'].apply(to_seconds)
data['End'] = data['End'].apply(to_seconds)
Thank you so much for your help.
Your method was working. I also found a method using loop:
To summarize, my general problem was that I had an ugly csv file that I wanted to transform is a csv usable for doing statistics, and to do that I wanted to use python.
my csv file was like:
MiceID = 1 Beginning End Type of behavior
0 0:00:00:00 Video start
1 0:00:01:36 grooming type 1
2 0:00:03:18 grooming type 2
3 0:00:06:73 0:00:08:16 grooming type 1
So in my ugly csv file I was writing only the moment of the begining of the behavior type without the end when the different types of behaviors directly followed each other, and I was writing the moment of the end of the behavior when the mice stopped to make any grooming, that allowed me to separate sequences of grooming. But this type of csv was not usable for easily making statistics.
So I wanted 1) transform all my value in seconds to have a correct format, 2) then I wanted to fill the gap in the end colonne (a gap has to be fill with the following begining value, as the end of a specific behavior in a sequence is the begining of the following), 3) then I wanted to create columns corresponding to the duration of each behavior, and finally 4) to fill this new column with the duration.
My questionning was about the first step, but I put here the code for each step separately:
step 1: transform the values in a good format
import pandas as pd
import numpy as np
data = pd.read_csv("D:/Python/TestPythonTraitementDonnéesExcel/RawDataBatch2et3.csv", engine = "python")
data.replace(np.nan, "", inplace = True)
i = 0
j = 0
while j < len(data.columns):
while i < len(data.index):
if (":" in data.iloc[i, j]) == True:
Value = str((int(data.iloc[i, j][0]) * 3600) + (int(data.iloc[i, j][2:4]) *60) + int(data.iloc[i, j][5:7]) + (int(data.iloc[i, j][8: 10])/100))
data = data.replace([data.iloc[i, j]], Value)
data.update(data)
i += 1
else:
i += 1
i = 0
j += 1
print(data)
step 2: fill the gaps
i = 0
j = 2
while j < len(data.columns):
while i < len(data.index) - 1:
if data.iloc[i, j] == "":
data.iloc[i, j] = data.iloc[i + 1, j - 1]
data.update(data)
i += 1
elif np.all(data.iloc[i:len(data.index), j] == ""):
break
else:
i += 1
i = 0
j += 4
print(data)
step 3: create a new colunm for each mice:
j = 1
k = 0
while k < len(data.columns) - 1:
k = (j * 4) + (j - 1)
data.insert(k, "Duree{}".format(k), "")
data.update(data)
j += 1
print(data)
step 3: fill the gaps
j = 4
i = 0
while j < len(data.columns):
while i < len(data.index):
if data.iloc[i, j - 2] != "":
data.iloc[i, j] = str(float(data.iloc[i, j - 2]) - float(data.iloc[i, j - 3]))
data.update(data)
i += 1
else:
break
i = 0
j += 5
print(data)
And of course, export my new usable dataframe
data.to_csv(r"D:/Python/TestPythonTraitementDonnéesExcel/FichierPropre.csv", index = False, header = True)
here are the transformations:
click on the links for the pictures
before step1
after step 1
after step 2
after step 3
after step 4

Compare three columns and choose the highest

I have a dataset that looks like the image below,
and my goal is compare the three last rows and choose the highest each time.
I have four new variables: empty = 0, cancel = 0, release = 0, undertermined = 0
for index 0, the cancelCount is the highest, therefore cancel += 1. The undetermined is increased only if the three rows are the same.
Here is my failed code sample:
empty = 0
cancel = 0
release = 0
undetermined = 0
if (df["emptyCount"] > df["cancelcount"]) & (df["emptyCount"] > df["releaseCount"]):
empty += 1
elif (df["cancelcount"] > df["emptyCount"]) & (df["cancelcount"] > df["releaseCount"]):
cancel += 1
elif (df["releasecount"] > df["emptyCount"]) & (df["releasecount"] > df["emptyCount"]):
release += 1
else:
undetermined += 1
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Fist we find the undetermined rows
equal = (df['emptyCount'] == df['cancelcount']) | (df['cancelount'] == df['releaseCount'])
Then we find the max column of the determined rows
max_arg = df.loc[~equal, ['emptyCount', 'cancelcount', 'releaseCount']].idxmax(axis=1)
And count them
undetermined = equal.sum()
empty = (max_arg == 'emptyCount').sum()
cancel = (max_arg == 'cancelcount').sum()
release = (max_arg == 'releaseCount').sum()
In general, you should avoid looping. Here's an example of vectorized code that does what you need:
# data of intereset
s = df[['emptyCount', 'cancelCount', 'releaseCount']]
# maximum by rows
max_vals = s.max(1)
# those are equal to max values:
equal_max = df.eq(max_vals, axis='rows').astype(int)
# If there are single maximum along the rows:
single_max = equal_max.sum(1)==1
# The values:
equal_max.mul(single_max, axis='rows').sum()
Output would be a series that looks like this:
emmptyCount count1
cancelCount count2
releaseCount count3
dtype: int64
import pandas as pd
import numpy as np
class thing(object):
def __init__(self):
self.value = 0
empty , cancel , release , undetermined = [thing() for i in range(4)]
dictt = { 0 : empty, 1 : cancel , 2 : release , 3 : undetermined }
df = pd.DataFrame({
'emptyCount': [2,4,5,7,3],
'cancelCount': [3,7,8,11,2],
'releaseCount': [2,0,0,5,3],
})
for i in range(1,4):
series = df.iloc[-4+i]
for j in range(len(series)):
if series[j] == series.max():
dictt[j].value +=1
cancel.value
A small script to get the maximum values:
import numpy as np
emptyCount = [2,4,5,7,3]
cancelCount = [3,7,8,11,2]
releaseCount = [2,0,0,5,3]
# Here we use np.where to count instances where there is more than one index with the max value.
# np.where returns a tuple, so we flatten it using "for n in m"
count = [n for z in zip(emptyCount, cancelCount, releaseCount) for m in np.where(np.array(z) == max(z)) for n in m]
empty = count.count(0) # 1
cancel = count.count(1) # 4
release = count.count(2) # 1

if I want to make new column based on the old columns range?

I want to make a new column with old column's date range
df['block']= np.where((df['transacted_date']> '2016-06-01') & (df['transacted_date']< '2016-09-01') ,0,'None')
df['block']= np.where((df['transacted_date']> '2016-09-01') & (df['transacted_date']< '2016-12-01') ,1,'None')
is there way to do this in if elif statement?
try using np.select
m1 = (df['transacted_date'] > '2016-06-01') & (df['transacted_date'] < '2016-09-01')
m2 = (df['transacted_date'] > '2016-09-01') &( df['transacted_date'] < '2016-12-01')
df['block'] = np.select(condlist=[m1,m2],
choicelist=[0,1],
default=None)
Use numpy.select with Series.between:
m1 = df['transacted_date'].between('2016-06-01', '2016-09-01', inclusive = False)
m2 = df['transacted_date'].between('2016-09-01', '2016-12-01', inclusive = False)
df['block'] = np.select([m1,m2], [0,1], default=None)
If need if-else solution:
def f(x):
if (x > pd.Timestamp('2016-06-01')) and (x < pd.Timestamp('2016-09-01')):
return 0
elif (x > pd.Timestamp('2016-09-01')) and (x < pd.Timestamp('2016-12-01')):
return 1
else:
return None
df['block']=df['transacted_date'].apply(f)
If need more general solution use cut with numpy.where, because cut cannot create None or NaN labels:
b = pd.to_datetime([pd.Timestamp.min,'2016-06-01','2016-09-01','2016-12-01',pd.Timestamp.max])
s = pd.cut(df['transacted_date'], bins=b, labels=[-2, 0, 1, -1])
df['block1'] = np.where(s.astype(int) >= 0, s, np.nan)

How to form another colum in a pd.DataFrame out of different variables

I'm trying to make a new boolean variable by an if-statement with multiple conditions in other variables. But so far my many tries do not even work with variable as parameter.
head of used columns in data frame
I would really appreciate if anyone of you can see the Problem, I already searched for two days the whole World Wide Web. But as beginner I couldn't find the solution yet.
amount = df4['AnzZahlungIDAD']
time = df4['DLZ_SCHDATSCHL']
Erstr = df4['Schadenwert']
Zahlges = df4['zahlgesbrut']
timequantil = time.quantile(.2)
diff = (Erstr-Zahlges)/Erstr*100
diffrange = [(diff <=15) & (diff >= -15)]
special = df4[['Taxatoreneinsatz', 'Belegpruefereinsatz_rel', 'IntSVKZ', 'ExtTechSVKZ']]
First Method with list comprehension
label = []
label = [True if (amount[i] <= 1) & (time[i] <= timequantil) & (diff == diffrange) & (special == 'N') else False for i in label]
label
Second Method with iterrows()
df4['label'] = pd.Series([])
df4['label'] = [True if (row[amount] <= 1) & (row[time] <= timequantil) & (row[diff] == diffrange) & (row[special] == 'N') else False for row in df4.iterrows()]
df4['label']
3rd Method with Lambda function
df4.loc[:,'label'] = '1'
df4['label'] = df4['label'].apply([lambda c: True if (c[amount] <= 1) & (c[time] <= timequantil) & (c[diff] == diffrange) & (c[special]) == 'N' else False for c in df4['label']], axis = 0)
df4['label'].value_counts()
I expected that I get a varialbe "Label" in my dataframe df4 that is whether True or False.
Fewer tries gave me only all values = False or all = True even if I used only a single Parameter, which is impossible by the data.
First Method runs fine but Outputs: []
Second Method gives me following error: TypeError: tuple indices must be integers or slices, not Series
Third Method does not load at all.
IIUC, try this
time = df4['DLZ_SCHDATSCHL']
Erstr = df4['Schadenwert']
Zahlges = df4['zahlgesbrut']
# timequantil = time.quantile(.2)
diff = (Erstr-Zahlges)/Erstr*100
df4['label'] = (df4['AnzZahlungIDAD'] <= 1) & (time <= time.quantile(.2)) & (diff <=15) & (diff >= -15) & (df['Belegpruefereinsatz_rel'] =='N') & (df['Taxatoreneinsatz'] =='N') & (df['ExtTechSVKZ'] =='N') & (df['IntSVKZ'] =='N')
Given your dataset i got following output
Anz dlz sch zal taxa bel int ext label
0 2 82 200 253.80 N N N J False
1 2 82 200 253.80 N N N J False
2 1 153 200 323.68 N J N N False
3 1 153 200 323.68 N J N N False
4 1 191 500 1252.12 N J N N False
Note: Don't mind the abbreviations used in column name

Pythonic way to simplify multiply 2 pandas columns with condition

I am looking for a way how I can simplify below examples:
self.df[TARGET_NAME] = self.df.apply(lambda row: 1 if row['WINNER'] == 1 and row['WINNER_OVER_2_5'] == 1 else 0, axis=1)
like:
self.df[TARGET_NAME] = self.df[(self.df.WINNER == 1)] &
self.df[(self.df.WINNER_OVER_2_5 == 1)] # not it's not correct
and more complex as below
df["PROFIT"] = np.where((df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"] == 0),
df['MATCH_HOME'] * df['HOME_STAKE'],
np.where((dfml[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"] == 1),
df['MATCH_DRAW'] * df['DRAW_STAKE'],
np.where((df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"] == 2),
df['MATCH_AWAY'] * df['AWAY_STAKE'],
-0))).astype(float)
IIUC you can use isin:
print df
WINNER WINNER_OVER_2_5
0 1 0
1 1 1
2 0 2
df['TARGET_NAME'] = np.where((df.WINNER.isin([1]) & df.WINNER_OVER_2_5.isin([1])),1,0)
print df
WINNER WINNER_OVER_2_5 TARGET_NAME
0 1 0 0
1 1 1 1
2 0 2 0
EDIT (untested, because no data):
df["PROFIT"] = np.where((df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"].isin([0])),
df['MATCH_HOME'] * df['HOME_STAKE'],
np.where((dfml[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"].isin([1])),
df['MATCH_DRAW'] * df['DRAW_STAKE'],
np.where((df[TARGET_NAME] == df["PREDICTED"]) & (df["PREDICTED"].isin([2])),
df['MATCH_AWAY'] * df['AWAY_STAKE'],
0))).astype(float)
I am looking for a way how I can simplify below examples:
I guess you're looking for a simpler syntax. How about this:
df['MATCH'] = matches(df, values=(0,1), WINNER=1, WINNER_OVER_2_5=1)
Note that values= is optional and takes any tuple(false-value, true-value), defaulting to (False, True).
To get there it takes a bit of magic. Essentially this builds a truth table by chaining the conditions and transforming the result into the values as specified. It ends up doing the same thing as your lambda, just in a generic way.
def matches(df, values=None, **kwargs):
values = values or (False, True)
flt = None
for var, value in kwargs.iteritems():
t = (df[var] == value)
flt = (flt & t) if flt is not None else t
flt = flt.apply(lambda t : values[t])
return flt
maybe you can try with boolean attribute
df= pd.DataFrame({'a':[1,0,1,0],'b' :[1,1,0,np.nan]})
df['NEW']= ((df['a']==1 ) & (df['b']==1)).astype(int).fillna(0)

Categories

Resources