I have a dataset that looks like the image below,
and my goal is to compare the last three columns of each row and choose the highest each time.
I have four new variables: empty = 0, cancel = 0, release = 0, undetermined = 0.
For index 0, cancelCount is the highest, therefore cancel += 1. undetermined is increased only if the three values are the same.
Here is my failed code sample:
empty = 0
cancel = 0
release = 0
undetermined = 0
if (df["emptyCount"] > df["cancelcount"]) & (df["emptyCount"] > df["releaseCount"]):
    empty += 1
elif (df["cancelcount"] > df["emptyCount"]) & (df["cancelcount"] > df["releaseCount"]):
    cancel += 1
elif (df["releaseCount"] > df["emptyCount"]) & (df["releaseCount"] > df["cancelcount"]):
    release += 1
else:
    undetermined += 1
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
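The error comes from `if` needing a single True/False while a comparison between two columns produces one boolean per row. A minimal sketch of that behaviour, assuming a made-up two-row frame (the data is illustrative, not taken from the question):

```python
import pandas as pd

# Illustrative data only; the column names follow the question.
df = pd.DataFrame({"emptyCount": [2, 5], "cancelCount": [3, 1]})

mask = df["emptyCount"] > df["cancelCount"]  # element-wise: one boolean per row

try:
    if mask:  # asks for the truth value of a whole Series
        pass
    raised = False
except ValueError:
    raised = True

# Reduce the mask explicitly instead, e.g. count the rows where it holds.
rows_where_empty_wins = int(mask.sum())
```

Here `raised` ends up True, and `rows_where_empty_wins` is 1 because only the second row satisfies the comparison.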
First we find the undetermined rows
equal = (df['emptyCount'] == df['cancelcount']) | (df['cancelcount'] == df['releaseCount'])
Then we find the max column of the determined rows
max_arg = df.loc[~equal, ['emptyCount', 'cancelcount', 'releaseCount']].idxmax(axis=1)
And count them
undetermined = equal.sum()
empty = (max_arg == 'emptyCount').sum()
cancel = (max_arg == 'cancelcount').sum()
release = (max_arg == 'releaseCount').sum()
In general, you should avoid looping. Here's an example of vectorized code that does what you need:
# data of interest
s = df[['emptyCount', 'cancelCount', 'releaseCount']]
# maximum by rows
max_vals = s.max(1)
# those are equal to max values:
equal_max = s.eq(max_vals, axis='rows').astype(int)
# If there is a single maximum along the rows:
single_max = equal_max.sum(1)==1
# The values:
equal_max.mul(single_max, axis='rows').sum()
Output would be a series that looks like this:
emptyCount count1
cancelCount count2
releaseCount count3
dtype: int64
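To make the placeholder counts concrete, here is a small worked sketch of the same idea on five sample rows (the same numbers another answer in this thread uses). It treats any tie for the row maximum as undetermined, which is an assumption; the question only mentions the case where all three values are equal.

```python
import pandas as pd

# Sample data, assumed for illustration.
df = pd.DataFrame({
    "emptyCount": [2, 4, 5, 7, 3],
    "cancelCount": [3, 7, 8, 11, 2],
    "releaseCount": [2, 0, 0, 5, 3],
})

s = df[["emptyCount", "cancelCount", "releaseCount"]]
is_max = s.eq(s.max(axis=1), axis=0)     # True where a cell equals its row maximum
single_max = is_max.sum(axis=1) == 1     # rows with exactly one winner

counts = is_max[single_max].sum()        # wins per column, tied rows excluded
undetermined = int((~single_max).sum())  # rows where the maximum is shared
```

With this data `counts` is 0 for emptyCount, 4 for cancelCount and 0 for releaseCount, and `undetermined` is 1 (the last row, where emptyCount and releaseCount are both 3).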
import pandas as pd
import numpy as np

class thing(object):
    def __init__(self):
        self.value = 0

empty, cancel, release, undetermined = [thing() for i in range(4)]
dictt = {0: empty, 1: cancel, 2: release, 3: undetermined}
df = pd.DataFrame({
    'emptyCount': [2,4,5,7,3],
    'cancelCount': [3,7,8,11,2],
    'releaseCount': [2,0,0,5,3],
})
# Look at the last three rows and credit every column that holds the row maximum
for i in range(1,4):
    series = df.iloc[-4+i]
    for j in range(len(series)):
        if series.iloc[j] == series.max():
            dictt[j].value += 1
cancel.value
A small script to get the maximum values:
import numpy as np
emptyCount = [2,4,5,7,3]
cancelCount = [3,7,8,11,2]
releaseCount = [2,0,0,5,3]
# Here we use np.where to count instances where there is more than one index with the max value.
# np.where returns a tuple, so we flatten it using "for n in m"
count = [n for z in zip(emptyCount, cancelCount, releaseCount) for m in np.where(np.array(z) == max(z)) for n in m]
empty = count.count(0) # 1
cancel = count.count(1) # 4
release = count.count(2) # 1
I would like to take a list of unknown size, containing ndarrays, where each ndarray could have any dimension and size independent of the others, and replace values at random spots in this entire data structure.
I can create an index for a random spot by doing this:
number_of_weights = 0
for w in weights:
    number_of_weights += w.size  # size is an attribute of ndarray, not a method
The problem is how I would go about inserting without having to recursively check that I am at the last dimension while adding to a counter until it is greater than the index and decrementing another counter to know where in the last dimension I am inserting.
I found out about a solution that uses the ravel() function.
import copy
import random

def get_row_and_index(weights, index):
    # Map a global flat index to (position in the list, flat index inside that array)
    index_const = index
    row = 0
    count = weights[row].size - 1
    while count < index_const:
        index -= weights[row].size
        row += 1
        count += weights[row].size
    return row, index

def mutate_weights(weights, n_mutations):
    new_weights = copy.deepcopy(weights)
    number_of_weights = 0
    a = 0
    b = 0
    for i in new_weights:
        number_of_weights += i.size
        a = min(a, i.min())
        b = max(b, i.max())
    n_mutations = min(number_of_weights, n_mutations)
    for i in range(n_mutations):
        index = random.randrange(0, number_of_weights)
        row, index = get_row_and_index(new_weights, index)
        new_weight = random.uniform(a, b)
        # ravel() returns a view for contiguous arrays, so this writes into new_weights[row]
        flat_row = new_weights[row].ravel()
        flat_row[index] = new_weight
    return new_weights
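One caveat with the ravel() approach: the write only reaches the original array when ravel() can return a view, which is the case for C-contiguous arrays. A small sketch of the difference, using throwaway arrays that are not part of the code above:

```python
import numpy as np

w = np.zeros((2, 3))   # C-contiguous array
flat = w.ravel()       # a view: shares memory with w
flat[4] = 1.0          # flat index 4 is position (1, 1)

t = w.T                # transposed, no longer C-contiguous
flat_t = t.ravel()     # NumPy has to copy here
flat_t[0] = 9.0        # changes the copy only, not t or w

# ndarray.flat writes through regardless of memory layout
t.flat[0] = 5.0
```

After this runs, `w[1, 1]` is 1.0 and `w[0, 0]` is 5.0, while the 9.0 never reaches `w`. Since deepcopy of a freshly created array normally yields a contiguous array, the code above should be fine in the common case; `.flat` is an option if the layout is not guaranteed.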
I am already using Tensorflow, so I came up with a solution that uses it.
The idea was to find which ndarray in the list holds the index, reshape that multidimensional ndarray into a flat array, do the replacement, reshape it back to its original shape, and then replace its old copy in the list.
Please note that this is not as efficient as it could be due to multiple iterations of index keeping and replacements. If it becomes an issue, the indices could be pre-computed, and the ndarrays replaced a max of once instead of every iteration of the loop below.
import random
import tensorflow as tf

def mutate(weights, n_mutations):
    number_of_weights = 0
    a = 0
    b = 0
    for i in weights:
        number_of_weights += i.size
        a = min(a, i.min())
        b = max(b, i.max())
    n_mutations = min(number_of_weights, n_mutations)
    for i in range(n_mutations):
        index = random.randrange(0, number_of_weights)
        index_const = index
        # Walk the list until the array containing the global index is found
        row = 0
        count = weights[row].size - 1
        while count < index_const:
            index -= weights[row].size
            row += 1
            count += weights[row].size
        new_weight = random.uniform(a, b)
        shape = weights[row].shape
        flat_row = tf.reshape(weights[row], [-1]).numpy()
        flat_row[index] = new_weight
        new_row = tf.reshape(flat_row, shape).numpy()
        weights[row] = new_row
    return weights
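The pre-computation mentioned above could look roughly like the following NumPy-only sketch. The function name `mutate_precomputed` and the `rng` parameter are made up for illustration. Unlike the loop above, it picks distinct positions and draws new values between the actual minimum and maximum of the weights, so treat it as a variation under those assumptions rather than a drop-in replacement.

```python
import numpy as np

def mutate_precomputed(weights, n_mutations, rng=None):
    """Return mutated copies of `weights`; each array is written to at most once."""
    rng = np.random.default_rng() if rng is None else rng
    sizes = np.array([w.size for w in weights])
    ends = np.cumsum(sizes)                  # exclusive end of each array in the global flat index
    total = int(ends[-1])
    low = min(w.min() for w in weights)
    high = max(w.max() for w in weights)

    n_mutations = min(n_mutations, total)
    picks = rng.choice(total, size=n_mutations, replace=False)  # distinct global positions
    rows = np.searchsorted(ends, picks, side="right")           # which array each pick falls in
    offsets = picks - (ends[rows] - sizes[rows])                # position inside that array

    mutated = [w.copy() for w in weights]
    for row in np.unique(rows):
        sel = rows == row
        flat = mutated[row].reshape(-1)      # view of the fresh contiguous copy
        flat[offsets[sel]] = rng.uniform(low, high, size=int(sel.sum()))
    return mutated
```

The cumulative sizes play the role of the running `count` in the loop version: `np.searchsorted` finds the array for every index in one call, so no per-mutation walk over the list is needed.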
I need to write some code for a pandas DataFrame. The values in the ID column are checked sequentially to see whether consecutive rows share the same ID. Three situations arise:
Case 1: If the ID is not the same as the one in the next row, write "unique" in the Comment column.
Case 2: If the ID is the same as the one in the next row but different from the one after that, write "ring" in the Comment column.
Case 3: If the ID is the same in several following rows, write "multi" in the Comment column.
This is repeated until every row of the ID column has been processed.
import pandas as pd
df = pd.read_csv('History-s.csv')
a = len(df['ID'])
c = 0
while a != 0:
    c += 1
    while df['ID'][i] == df['ID'][i + 1]:
        if c == 2:
            if df['Nod 1'][i] == df['Nod 2'][i + 1]:
                df['Comment'][i] = "Ring"
                df['Comment'][i + 1] = "Ring"
            else:
                df['Comment'][i] = "Multi"
                df['Comment'][i + 1] = "Multi"
        elif c > 2:
            df['Comment'][i] = "Multi"
            df['Comment'][i + 1] = "Multi"
        i += 1
    else:
        df['Comment'][i] = "Unique"
    a = a -1
print(df, '\n')
Data is like this:
Data
After coding data frame should be like this:
Result
From the input dataframe you have provided, my first impression was that, since you are checking the next line in a while loop, you are strictly considering just the next line, for example:
ID  value  comment
1   2      MULTI
1   3      RING
3   4      UNIQUE
But if that is not the case, you can simply use pandas groupby function.
def func(df):
    if len(df) > 2:
        df['comment'] = 'MULTI'
    elif len(df) == 2:
        df['comment'] = 'RING'
    else:
        df['comment'] = 'UNIQUE'
    return df

# group_keys=False keeps the original flat index in recent pandas versions
df = df.groupby(['ID'], group_keys=False).apply(func)
Output:
ID value comment
0 1 2 RING
1 1 3 RING
2 3 4 UNIQUE
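The same labelling can probably be done without apply, by computing each row's group size and mapping it to a label. This is a sketch on made-up data (the real file's columns are not shown), and like the groupby answer it counts all rows with the same ID, whether or not they are adjacent.

```python
import numpy as np
import pandas as pd

# Illustrative data; the real file's contents are an assumption.
df = pd.DataFrame({"ID": [1, 1, 3, 4, 4, 4], "value": [2, 3, 4, 5, 6, 7]})

group_size = df.groupby("ID")["ID"].transform("size")  # size of each row's ID group
df["comment"] = np.select(
    [group_size > 2, group_size == 2],
    ["MULTI", "RING"],
    default="UNIQUE",
)
```

For this frame the comments come out as RING, RING, UNIQUE, MULTI, MULTI, MULTI.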
df = pd.DataFrame({
'label':[f"subj_{i}" for i in range(28)],
'data':[i for i in range(1, 14)] + [1,0,0,0,2] + [0,0,0,0,0,0,0,0,0,0]
})
I have a dataset something like this.
I want to cut it where the longest run of 0s occurs, so I want to cut at index 18 while leaving indices 14-16 intact. So far I've tried things like:
# Counters
cad_recorder = 0
new_index = []
for i,row in tqdm(temp_df.iterrows()):
    if row['cadence'] == 0:
        cad_recorder += 1
        new_index.append(i)
But obviously that won't work, since the indices will be rewritten at each occurrence of zero.
I also tried a dictionary, but I'm not sure how to compare previous and next values using iterrows.
I also took the rolling mean for X rows at a time, and if it's zero then I got an index. But then I got stuck at actually inferring the range of indices, or finding the longest sequence of zeroes.
Edit: A friend of mine suggested the following logic, which gave the same results as #shubham-sharma. The poster's solution is much more pythonic and elegant.
def find_longest_zeroes(df):
    '''
    Finds the index at which the longest repetition of <1 values begins
    '''
    current_length = 0
    max_length = 0
    start_idx = 0
    max_idx = 0
    for i in range(len(df['data'])):
        if df.iloc[i,9] <= 1:
            if current_length == 0:
                start_idx = i
            current_length += 1
            if current_length > max_length:
                max_length = current_length
                max_idx = start_idx
        else:
            current_length = 0
    return max_idx
The code I went with following #shubham-sharma's solution:
cut_us_sof = {}
og_df_sof = pd.DataFrame()
cut_df_sof = pd.DataFrame()
for lab in df['label'].unique():
    temp_df = df[df['label'] == lab].reset_index(drop=True)
    mask = temp_df['data'] <= 1 # some values in actual dataset were 0.0000001
    counts = temp_df[mask].groupby((~mask).cumsum()).transform('count')['data']
    idx = counts.idxmax()
    # my dataset's trailing zeroes are usually after 200th index. But I also didn't want to remove trailing zeroes < 500 in length
    if (idx > 2000) & (counts.loc[idx] > 500):
        cut_us_sof[lab] = idx
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        og_df_sof = pd.concat([og_df_sof, temp_df])
        cut_df_sof = pd.concat([cut_df_sof, temp_df.iloc[:idx,:]])
We can use boolean masking and cumsum to identify the blocks of zeros, then groupby and transform these blocks using count, followed by idxmax, to get the starting index of the block with the most consecutive zeros:
m = df['data'].eq(0)
idx = m[m].groupby((~m).cumsum()).transform('count').idxmax()
print(idx)
18
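For completeness, here is the same technique spelled out step by step on the sample frame from the question, followed by the cut itself. The intermediate names (`block_id`, `run_len`, `trimmed`) are just illustrative, and slicing with `iloc[:idx]` assumes the default 0..n-1 index.

```python
import pandas as pd

df = pd.DataFrame({
    "label": [f"subj_{i}" for i in range(28)],
    "data": list(range(1, 14)) + [1, 0, 0, 0, 2] + [0] * 10,
})

m = df["data"].eq(0)
block_id = (~m).cumsum()                                # constant within each run of zeros
run_len = m[m].groupby(block_id[m]).transform("count")  # run length, repeated on every zero row
idx = run_len.idxmax()                                  # first row of the longest run

trimmed = df.iloc[:idx]                                 # keep everything before that run
```

`idx` is 18 here, so `trimmed` keeps rows 0 to 17, including the short run of zeros at indices 14-16.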
I would like to convert my dataframe from one format of values (X:XX:XX:XX) to another (X.X seconds).
Here is what my dataframe looks like:
Start End
0 0:00:00:00
1 0:00:00:00 0:07:37:80
2 0:08:08:56 0:08:10:08
3 0:08:13:40
4 0:08:14:00 0:08:14:84
And I would like to transform it into seconds, something like this:
Start End
0 0.0
1 0.0 457.80
2 488.56 490.08
3 493.40
4 494.0 494.84
To do that I did:
i = 0
j = 0
while j < 10:
    while i < 10:
        if data.iloc[i, j] != "":
            Value = (int(data.iloc[i, j][0]) * 3600) + (int(data.iloc[i, j][2:4]) *60) + int(data.iloc[i, j][5:7]) + (int(data.iloc[i, j][8: 10])/100)
            NewValue = data.iloc[:, j].replace([data.iloc[i, j]], Value)
            i += 1
        else:
            NewValue = data.iloc[:, j].replace([data.iloc[i, j]], "")
            i += 1
        data.update(NewValue)
    i = 0
    j += 1
But I failed to replace the values in my original dataframe permanently. When I do:
print(data)
I still get my old data frame in the wrong format.
Could someone help me? I tried so hard!
Thank you so so much!
You are using pandas.DataFrame.update that requires a pandas dataframe as an argument. See the Example part of the update function documentation to really understand what update does https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
If I may suggest a more idiomatic solution; you can directly map a function to all values of a pandas Series
def parse_timestring(s):
    if s == "":
        return s
    else:
        # weird to use centiseconds and not milliseconds
        # l is a list with [hour, minute, second, cs]
        l = [int(nbr) for nbr in s.split(":")]
        return sum([a*b for a,b in zip(l, (3600, 60, 1, 0.01))])

df["Start"] = df["Start"].map(parse_timestring)
You can remove the if ... else ... from parse_timestring if you replace all empty string with nan values in your dataframe with df = df.replace("", numpy.nan) then use df["Start"] = df["Start"].map(parse_timestring, na_action='ignore')
see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html
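A short sketch of the NaN-based variant, using three End values from the question plus two blanks. The final rounding to two decimals is an addition to keep floating-point noise out of the result.

```python
import numpy as np
import pandas as pd

def parse_timestring(s):
    # Fields are hours:minutes:seconds:centiseconds
    parts = [int(p) for p in s.split(":")]
    return sum(a * b for a, b in zip(parts, (3600, 60, 1, 0.01)))

# Sample values from the question's "End" column; "" marks a missing cell.
end = pd.Series(["", "0:07:37:80", "0:08:10:08", "", "0:08:14:84"])

# Blank cells become NaN and are skipped by map thanks to na_action="ignore".
end_seconds = end.replace("", np.nan).map(parse_timestring, na_action="ignore").round(2)
```

The non-missing results are 457.8, 490.08 and 494.84 seconds; the two blank cells stay NaN.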
The datetime library is made to deal with such data. You should also use the apply function of pandas to avoid iterating over the dataframe like that.
You should proceed as follows:
from datetime import datetime, timedelta

def to_seconds(date):
    # Fields are hours:minutes:seconds:centiseconds
    comp = date.split(':')
    delta = (datetime.strptime(':'.join(comp[:3]), "%H:%M:%S") - datetime(1900, 1, 1)) + timedelta(milliseconds=10 * int(comp[3]))
    return delta.total_seconds()

data['Start'] = data['Start'].apply(to_seconds)
data['End'] = data['End'].apply(to_seconds)
Thank you so much for your help.
Your method worked. I also found a method using loops:
To summarize, my general problem was that I had an ugly csv file that I wanted to transform into a csv usable for doing statistics, and I wanted to use Python to do that.
my csv file was like:
MiceID = 1 Beginning End Type of behavior
0 0:00:00:00 Video start
1 0:00:01:36 grooming type 1
2 0:00:03:18 grooming type 2
3 0:00:06:73 0:00:08:16 grooming type 1
So in my ugly csv file I wrote only the moment each behavior type began, without its end, when different types of behavior directly followed each other; I wrote the end of a behavior only when the mouse stopped grooming altogether, which allowed me to separate grooming sequences. But this type of csv was not easy to use for statistics.
So I wanted to 1) transform all my values into seconds to have a correct format, 2) fill the gaps in the End column (a gap has to be filled with the following Beginning value, since the end of a behavior within a sequence is the beginning of the next one), 3) create columns for the duration of each behavior, and finally 4) fill these new columns with the durations.
My question was about the first step, but I put the code for each step here separately:
step 1: transform the values in a good format
import pandas as pd
import numpy as np

data = pd.read_csv("D:/Python/TestPythonTraitementDonnéesExcel/RawDataBatch2et3.csv", engine = "python")
data.replace(np.nan, "", inplace = True)
i = 0
j = 0
while j < len(data.columns):
    while i < len(data.index):
        if (":" in data.iloc[i, j]) == True:
            Value = str((int(data.iloc[i, j][0]) * 3600) + (int(data.iloc[i, j][2:4]) *60) + int(data.iloc[i, j][5:7]) + (int(data.iloc[i, j][8: 10])/100))
            data = data.replace([data.iloc[i, j]], Value)
            data.update(data)
            i += 1
        else:
            i += 1
    i = 0
    j += 1
print(data)
step 2: fill the gaps
i = 0
j = 2
while j < len(data.columns):
    while i < len(data.index) - 1:
        if data.iloc[i, j] == "":
            data.iloc[i, j] = data.iloc[i + 1, j - 1]
            data.update(data)
            i += 1
        elif np.all(data.iloc[i:len(data.index), j] == ""):
            break
        else:
            i += 1
    i = 0
    j += 4
print(data)
step 3: create a new column for each mouse:
j = 1
k = 0
while k < len(data.columns) - 1:
    k = (j * 4) + (j - 1)
    data.insert(k, "Duree{}".format(k), "")
    data.update(data)
    j += 1
print(data)
step 4: fill the new columns with the durations
j = 4
i = 0
while j < len(data.columns):
    while i < len(data.index):
        if data.iloc[i, j - 2] != "":
            data.iloc[i, j] = str(float(data.iloc[i, j - 2]) - float(data.iloc[i, j - 3]))
            data.update(data)
            i += 1
        else:
            break
    i = 0
    j += 5
print(data)
And of course, export my new usable dataframe
data.to_csv(r"D:/Python/TestPythonTraitementDonnéesExcel/FichierPropre.csv", index = False, header = True)
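For comparison, the four steps can likely be written without explicit loops. The sketch below handles a single mouse's Beginning/End block, using the values from the sample table; the helper name `to_seconds` and the `Duration` column name are made up for illustration, and the real file with several mice side by side would need this applied to each block.

```python
import pandas as pd

def to_seconds(col):
    """Convert 'H:MM:SS:cc' strings to float seconds; blanks become NaN."""
    parts = col.str.extract(r"(\d+):(\d+):(\d+):(\d+)").astype(float)
    return parts[0] * 3600 + parts[1] * 60 + parts[2] + parts[3] / 100

# One mouse's block, with values copied from the sample table.
data = pd.DataFrame({
    "Beginning": ["0:00:00:00", "0:00:01:36", "0:00:03:18", "0:00:06:73"],
    "End": ["", "", "", "0:00:08:16"],
})

begin = to_seconds(data["Beginning"])              # step 1
end = to_seconds(data["End"])
end = end.fillna(begin.shift(-1))                  # step 2: missing end = next beginning
data = data.assign(Beginning=begin, End=end,
                   Duration=(end - begin).round(2))  # steps 3 and 4
```

With these four rows the durations come out as 1.36, 1.82, 3.55 and 1.43 seconds.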
here are the transformations:
click on the links for the pictures
before step1
after step 1
after step 2
after step 3
after step 4
I have a piece of code that takes forever to run. Does anybody know how to optimize it?
The purpose of the formula is to make a column that does the following: when 'action' != 0, if 'PX_LAST'<'ma', populate 'buy_sell' with -1, if 'PX_LAST'>'ma', populate 'buy_sell' with 1; in the other cases, do not populate 'buy_sell' with new values.
Fyi - Column 'action' is populated with either 0 or 1
#create column
df_zinc['buy_sell'] = 0
index = 0
while index < df_zinc.shape[0]:
    if df_zinc['action'][index] != 0:
        continue
    if df_zinc['PX_LAST'][index]<df_zinc['ma'][index]:
        df_zinc.loc[index,'buy_sell'] = -1
    elif df_zinc['PX_LAST'][index]>df_zinc['ma'][index]:
        df_zinc.loc[index,'buy_sell'] = 1
    else:
        index = index + 1
I think you need:
import numpy as np
mask1 = df_zinc['action'] != 0
mask2 = df_zinc['PX_LAST'] < df_zinc['ma']
mask3 = df_zinc['PX_LAST'] > df_zinc['ma']
df_zinc['buy_sell'] = np.select([mask1 & mask2, mask1 & mask3], [-1,1], 0)
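To see what np.select produces, here is a small self-contained run on made-up prices (the numbers are purely illustrative, and the mask names are renamed for readability):

```python
import numpy as np
import pandas as pd

# Made-up prices purely for illustration.
df_zinc = pd.DataFrame({
    "action":  [0, 1, 1, 1, 0],
    "PX_LAST": [10.0, 9.0, 12.0, 11.0, 8.0],
    "ma":      [11.0, 10.0, 11.0, 11.0, 9.0],
})

active = df_zinc["action"] != 0
below = df_zinc["PX_LAST"] < df_zinc["ma"]
above = df_zinc["PX_LAST"] > df_zinc["ma"]

# Conditions are checked in order; rows matching none keep the default 0.
df_zinc["buy_sell"] = np.select([active & below, active & above], [-1, 1], default=0)
```

The resulting column is 0, -1, 1, 0, 0: rows with action 0 and the row where PX_LAST equals ma are left at the default.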