Order data depending on multiple range criteria - python

I am trying to order data using multiple ranges. Let's suppose I have some data in a tt array:
import numpy as np
n = 50
b = 20
r = 3
tt = np.array([[[3]*r]*b]*n)
and other values in a list:
z = (np.arange(0, 5, 0.1)).tolist()
Now I need to sort the data from tt depending on the ranges in z: the first range is between 0 and 1, the next between 1 and 2, then between 2 and 3, and so on.
So far I have tried to build an array with the length of each range and use those lengths to cut data from tt. It looks something like this:
za = []
za2 = []
za3 = []
za4 = []
za5 = []
za6 = []
za7 = []
for y in range(50):
    if 0 <= int(z[y]) < 1:
        za.append(z[y])
        zi = np.array([len(za)])
    if 1 <= int(z[y]) < 2:
        za2.append(z[y])
        zi2 = np.array([len(za2)])
    if 2 <= int(z[y]) < 3:
        za3.append(z[y])
        zi3 = np.array([len(za3)])
    if 3 <= int(z[y]) < 4:
        za4.append(z[y])
        zi4 = np.array([len(za4)])
    if 4 <= int(z[y]) < 5:
        za5.append(z[y])
        zi5 = np.array([len(za5)])
    if 5 <= int(z[y]) < 6:
        za6.append(z[y])
        zi6 = np.array([len(za6)])
    if 6 <= int(z[y]) < 7:
        za7.append(z[y])
        zi7 = np.array([len(za7)])
till = np.concatenate((np.array(zi), np.array(zi2), np.array(zi3), np.array(zi4), np.array(zi5), np.array(zi6), np.array(zi7)))
ttn = []
for p in range(50):
    # if hour_lenght[p] != []
    tt_h = np.empty(np.shape(tt[0:till[p], :, :]))
    tt_h[:] = np.nan
    tt_h = tt_h[np.newaxis, :, :, :]
    tt_h[np.newaxis, :, :, :] = tt[0:till[p], :, :]
    ttn.append(tt_h)
As you can guess, I get the error "name 'zi6' is not defined", since there is no data in that range. But at least it does the job for the parts that do exist :D. However, if I include an else statement after the if and do something like:
for y in range(50):
    if 0 <= int(z[y]) < 1:
        za.append(z[y])
        zi = np.array([int(len(za))])
    else:
        zi = np.array([np.nan])
my initial zi from the first part gets overwritten with nan.
I should also point out that the ultimate goal is to load multiple files that have a shape similar to tt (the last two dimensions are always the same while the first one changes), e.g.:
tt.shape
(50, 20, 3)
and some other tt2 has shape:
tt2.shape
(55, 20, 3)
with a z2 that has values between 5 and 9.
z2 = (np.arange(5,9,0.1)).tolist()
So in the end I should end up with an array ttn where
ttn[0] is filled with values from tt in the range between 0 and 1,
ttn[1] is filled with values from tt between 1 and 2, and so on.
I would very much appreciate suggestions and possible solutions to this issue.
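For what it's worth, NumPy can do this binning without one list per range: np.floor turns each z value into its range index, and a boolean mask then selects the matching slices of tt. A minimal sketch using the arrays from the question (the way empty ranges are handled, as empty arrays, is an assumption on my part):

```python
import numpy as np

# Reproduce the example data from the question.
n, b, r = 50, 20, 3
tt = np.array([[[3] * r] * b] * n)   # shape (50, 20, 3)
z = np.arange(0, 5, 0.1)             # one z value per slice along tt's first axis

# Bin each z value into integer ranges [0,1), [1,2), ...
bins = np.floor(z).astype(int)

# Group the slices of tt by bin; an empty range simply yields an empty array.
ttn = [tt[bins == k] for k in range(bins.max() + 1)]

for k, chunk in enumerate(ttn):
    print(k, chunk.shape)  # each chunk keeps the trailing (20, 3) shape
```

Because the mask only works on the first axis, the same two lines apply unchanged to a tt2 of shape (55, 20, 3) with its own z2.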

Related

List is being sorted, but when asked to pass the items to another list it just randomly transfers the items

In my code, when I transfer the items of the sorted_resistances list (which contains several items read from a file named file.txt) to the blocks_A[x] list, they are not transferred in order, which should be ascending (file.txt only contains numbers). The rest of the program should be easy to follow, but in short, the objective is to pass 12 elements, in order, from the sorted_resistances list to each of the 29 sublists of the blocks_A list.
blocks_B = []
y = 0
while y < 29:
    y = y + 1
    block_y = []
    blocks_B.append(block_y)
blocks_A = []
y = 0
while y < 29:
    y = y + 1
    block_y = []
    blocks_A.append(block_y)
with open("file.txt") as file_in:
    list_of_resistances = []
    for line in file_in:
        list_of_resistances.append(int(line))
sorted_resistances = sorted(list_of_resistances)
x = 0
while len(sorted_resistances) > 0:
    for y in sorted_resistances:
        blocks_A[x].append(y)
        blocks_A[x].sort()
        sorted_resistances.remove(y)
        if len(blocks_A[x]) == 12:
            x = x + 1
print(blocks_A)
y = 0
z = -1
while y < len(list_of_resistances):
    y = y + 1
    z = z + 1
    list_of_resistances[z] = y
print(blocks_B)
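One observation: sorted_resistances.remove(y) inside for y in sorted_resistances mutates the list while iterating over it, which skips elements; that is the usual cause of items appearing out of order. A simpler sketch that sorts once and slices consecutive fixed-size blocks (the sample values are made up for the demo):

```python
# Made-up values standing in for the numbers read from file.txt.
values = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0, 11, 10, 23, 13]
sorted_values = sorted(values)

# Slice the sorted list into consecutive blocks of 12; no mutation
# during iteration, so order is preserved by construction.
block_size = 12
blocks = [sorted_values[i:i + block_size]
          for i in range(0, len(sorted_values), block_size)]

print(blocks)  # ascending blocks; the last one may be shorter
```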

Making permanent change in a dataframe using python pandas

I would like to convert my dataframe from one format of values (X:XX:XX:XX) to another (X.X seconds).
Here is what my dataframe looks like:
Start End
0 0:00:00:00
1 0:00:00:00 0:07:37:80
2 0:08:08:56 0:08:10:08
3 0:08:13:40
4 0:08:14:00 0:08:14:84
And I would like to transform it in seconds, something like that
Start End
0 0.0
1 0.0 457.80
2 488.56 490.80
3 493.40
4 494.0 494.84
To do that I did:
i = 0
j = 0
while j < 10:
    while i < 10:
        if data.iloc[i, j] != "":
            Value = (int(data.iloc[i, j][0]) * 3600) + (int(data.iloc[i, j][2:4]) * 60) + int(data.iloc[i, j][5:7]) + (int(data.iloc[i, j][8:10]) / 100)
            NewValue = data.iloc[:, j].replace([data.iloc[i, j]], Value)
            i += 1
        else:
            NewValue = data.iloc[:, j].replace([data.iloc[i, j]], "")
            i += 1
        data.update(NewValue)
    i = 0
    j += 1
But I failed to replace the new values in my old dataframe in a permanent way; when I do:
print(data)
I still get my old dataframe in the wrong format.
Could someone help me? I have tried so hard!
Thank you so much!
You are using pandas.DataFrame.update, which requires a pandas dataframe as an argument. See the Examples part of the update documentation to really understand what update does: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
If I may suggest a more idiomatic solution: you can directly map a function to all values of a pandas Series.
def parse_timestring(s):
    if s == "":
        return s
    else:
        # weird to use centiseconds and not milliseconds
        # l is a list with [hour, minute, second, cs]
        l = [int(nbr) for nbr in s.split(":")]
        return sum([a * b for a, b in zip(l, (3600, 60, 1, 0.01))])

df["Start"] = df["Start"].map(parse_timestring)
You can remove the if ... else ... from parse_timestring if you first replace all empty strings in your dataframe with nan values, using df = df.replace("", numpy.nan), and then use df["Start"] = df["Start"].map(parse_timestring, na_action='ignore').
see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html
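For completeness, the replace-then-map variant described above could look like this (the sample frame is made up for the demo; only the "Start" column name comes from the question):

```python
import numpy as np
import pandas as pd

def parse_timestring(s):
    # fields are [hour, minute, second, centisecond]
    parts = [int(nbr) for nbr in s.split(":")]
    return sum(a * b for a, b in zip(parts, (3600, 60, 1, 0.01)))

# Made-up sample data in the question's H:MM:SS:CC format.
df = pd.DataFrame({"Start": ["0:00:00:00", "", "0:08:08:56"]})

# Empty strings become NaN, which na_action='ignore' then skips.
df["Start"] = df["Start"].replace("", np.nan)
df["Start"] = df["Start"].map(parse_timestring, na_action="ignore")
```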
The datetime library is made to deal with such data. You should also use the apply function of pandas to avoid iterating over the dataframe like that.
You should proceed as follows:
from datetime import timedelta

def to_seconds(date):
    # fields are H:MM:SS:CC (the last field is centiseconds, which
    # strptime cannot parse directly, so convert each field ourselves)
    h, m, s, cs = (int(c) for c in date.split(':'))
    return timedelta(hours=h, minutes=m, seconds=s, milliseconds=cs * 10).total_seconds()

data['Start'] = data['Start'].apply(to_seconds)
data['End'] = data['End'].apply(to_seconds)
Thank you so much for your help.
Your method worked. I also found a method using loops.
To summarize, my general problem was that I had an ugly csv file that I wanted to transform into a csv usable for statistics, and I wanted to use python to do it.
My csv file was like:
MiceID = 1 Beginning End Type of behavior
0 0:00:00:00 Video start
1 0:00:01:36 grooming type 1
2 0:00:03:18 grooming type 2
3 0:00:06:73 0:00:08:16 grooming type 1
So in my ugly csv file, when the different types of behaviors directly followed each other I wrote only the moment each behavior type began, without its end, and I wrote the moment the behavior ended only when the mice stopped grooming entirely; that allowed me to separate sequences of grooming. But this type of csv was not usable for easily making statistics.
So I wanted to 1) transform all my values into seconds to have a correct format, 2) fill the gaps in the End column (a gap has to be filled with the following Beginning value, as the end of a specific behavior in a sequence is the beginning of the following one), 3) create columns for the duration of each behavior, and finally 4) fill this new column with the durations.
My question was about the first step, but I put here the code for each step separately:
step 1: transform the values into a good format
import pandas as pd
import numpy as np

data = pd.read_csv("D:/Python/TestPythonTraitementDonnéesExcel/RawDataBatch2et3.csv", engine="python")
data.replace(np.nan, "", inplace=True)

i = 0
j = 0
while j < len(data.columns):
    while i < len(data.index):
        if ":" in data.iloc[i, j]:
            Value = str((int(data.iloc[i, j][0]) * 3600) + (int(data.iloc[i, j][2:4]) * 60) + int(data.iloc[i, j][5:7]) + (int(data.iloc[i, j][8:10]) / 100))
            data = data.replace([data.iloc[i, j]], Value)
            data.update(data)
            i += 1
        else:
            i += 1
    i = 0
    j += 1
print(data)
step 2: fill the gaps
i = 0
j = 2
while j < len(data.columns):
    while i < len(data.index) - 1:
        if data.iloc[i, j] == "":
            data.iloc[i, j] = data.iloc[i + 1, j - 1]
            data.update(data)
            i += 1
        elif np.all(data.iloc[i:len(data.index), j] == ""):
            break
        else:
            i += 1
    i = 0
    j += 4
print(data)
step 3: create a new column for each mouse:
j = 1
k = 0
while k < len(data.columns) - 1:
    k = (j * 4) + (j - 1)
    data.insert(k, "Duree{}".format(k), "")
    data.update(data)
    j += 1
print(data)
step 4: fill the durations
j = 4
i = 0
while j < len(data.columns):
    while i < len(data.index):
        if data.iloc[i, j - 2] != "":
            data.iloc[i, j] = str(float(data.iloc[i, j - 2]) - float(data.iloc[i, j - 3]))
            data.update(data)
            i += 1
        else:
            break
    i = 0
    j += 5
print(data)
And of course, export my new usable dataframe:
data.to_csv(r"D:/Python/TestPythonTraitementDonnéesExcel/FichierPropre.csv", index = False, header = True)
(Screenshots showing the dataframe before step 1 and after each of the four steps were linked in the original post.)

How to iterate through each element in a 2D array in python?

I have a 2d numpy array of size 768 x 1024 which contains all the class values of a segmented image.
I have detected pedestrians/vehicles within this array and have the top-left and bottom-right coordinates of the bounding box, say (381,254) and (387,257).
(381,254) (381,255) ............... (381,257)
(382,254)
.
.
.
(387,254) .................................(387,257)
Each cell under those coordinates have a specific class value (numbers from 1 to 22). The ones that interest me are '4' and '10' which indicates that the bounding box contains a pedestrian or vehicle respectively.
How do I iterate through each element individually (all the elements in row 381 from column 254 to 257 then onto the next row and so on till the bottom right coordinate (387,257)) and check if that particular cell contains the number 4 or 10?
I tried using nested for loop but I'm not able to figure out the logic.
x_1 = 381
x_2 = 387
y_1 = 254
y_2 = 257

ROW = []
COL = []
four = 0
ten = 0
other = 0

for rows in range(x_1, x_2):
    ROW.append(rows)
    for cols in range(y_1, y_2):
        COL.append(cols)
        if array[rows][cols] == 4:
            four += 1
        elif array[rows][cols] == 10:
            ten += 1
        else:
            print('random number')
            other += 1
Any help would be appreciated! Thanks.
Try using this instead (note that range excludes its stop value, so use x_2 + 1 and y_2 + 1 to include the bottom-right corner of the bounding box):
x_1 = 381
x_2 = 387
y_1 = 254
y_2 = 257

four = 0
ten = 0
other = 0

for rows in range(x_1, x_2 + 1):
    for cols in range(y_1, y_2 + 1):
        if array[rows][cols] == 4:
            four += 1
        elif array[rows][cols] == 10:
            ten += 1
        else:
            other += 1
It checks every cell in the box and counts whether array[rows][cols] contains the number 4 or 10; anything else is counted in other.
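If array is a NumPy array, the counting can also be done without loops by slicing the bounding box and using boolean masks. A sketch under the assumption that both corner coordinates are inclusive (the segmentation map here is randomly generated for the demo):

```python
import numpy as np

# Made-up segmentation map with class values 1..22, where 4 marks a
# pedestrian and 10 a vehicle, as in the question.
rng = np.random.default_rng(0)
array = rng.integers(1, 23, size=(768, 1024))

# Bounding box corners from the question; +1 makes the slice inclusive.
x_1, x_2, y_1, y_2 = 381, 387, 254, 257
box = array[x_1:x_2 + 1, y_1:y_2 + 1]

four = int(np.count_nonzero(box == 4))
ten = int(np.count_nonzero(box == 10))
other = box.size - four - ten

print(four, ten, other)  # counts over the 7 x 4 = 28 cells in the box
```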

How to optimize an O(N*M) to be O(n**2)?

I am trying to solve USACO's Milking Cows problem. The problem statement is here: https://train.usaco.org/usacoprob2?S=milk2&a=n3lMlotUxJ1
Given a series of intervals in the form of a 2d array, I have to find the longest interval of continuous milking and the longest interval in which no milking occurs.
Ex. given the array [[500,1200],[200,900],[100,1200]], the longest continuous interval would be 1100, and the longest interval without milking would be 0, as there are no rest periods.
I have tried looking at whether using a dictionary would decrease run times, but I haven't had much success.
f = open('milk2.in', 'r')
w = open('milk2.out', 'w')

# getting the input
farmers = int(f.readline().strip())
schedule = []
for i in range(farmers):
    schedule.append(f.readline().strip().split())

# schedule = data
minvalue = 0
maxvalue = 0

# getting the minimums and maximums of the data
for time in range(farmers):
    schedule[time][0] = int(schedule[time][0])
    schedule[time][1] = int(schedule[time][1])
    if (minvalue == 0):
        minvalue = schedule[time][0]
    if (maxvalue == 0):
        maxvalue = schedule[time][1]
    minvalue = min(schedule[time][0], minvalue)
    maxvalue = max(schedule[time][1], maxvalue)

filled_thistime = 0
filled_max = 0
empty_max = 0
empty_thistime = 0

# goes through all the possible points between the minimum and the maximum
for point in range(minvalue, maxvalue):
    isfilled = False
    # goes through all the data for each point value in order to find the best values
    for check in range(farmers):
        if point >= schedule[check][0] and point < schedule[check][1]:
            filled_thistime += 1
            empty_thistime = 0
            isfilled = True
            break
    if isfilled == False:
        filled_thistime = 0
        empty_thistime += 1
    if (filled_max < filled_thistime):
        filled_max = filled_thistime
    if (empty_max < empty_thistime):
        empty_max = empty_thistime

print(filled_max)
print(empty_max)
if (filled_max < filled_thistime):
    filled_max = filled_thistime
w.write(str(filled_max) + " " + str(empty_max) + "\n")
f.close()
w.close()
The program works fine, but I need to decrease the time it takes to run.
A less pretty but more efficient approach would be to solve this like a free list, though it is a bit more tricky since the ranges can overlap. This method only requires looping through the input list a single time.
def insert(start, end):
    for existing in times:
        existing_start, existing_end = existing
        # New time is a subset of existing time
        if start >= existing_start and end <= existing_end:
            return
        # New time ends during existing time
        elif end >= existing_start and end <= existing_end:
            times.remove(existing)
            return insert(start, existing_end)
        # New time starts during existing time
        elif start >= existing_start and start <= existing_end:
            # existing[1] = max(existing_end, end)
            times.remove(existing)
            return insert(existing_start, end)
        # New time is a superset of existing time
        elif start <= existing_start and end >= existing_end:
            times.remove(existing)
            return insert(start, end)
    times.append([start, end])

data = [
    [500, 1200],
    [200, 900],
    [100, 1200],
]

times = [data[0]]
for start, end in data[1:]:
    insert(start, end)

longest_milk = 0
longest_gap = 0
for i, time in enumerate(times):
    duration = time[1] - time[0]
    if duration > longest_milk:
        longest_milk = duration
    if i != len(times) - 1 and times[i + 1][0] - times[i][1] > longest_gap:
        longest_gap = times[i + 1][0] - times[i][1]

print(longest_milk, longest_gap)
As stated in the comments, if the input is sorted the complexity could be O(n); if that's not the case we need to sort it first and the complexity is O(n log n):
lst = [[300, 1000],
       [700, 1200],
       [1500, 2100]]

from itertools import groupby

longest_milking = 0
longest_idle = 0

l = sorted(lst, key=lambda k: k[0])
for v, g in groupby(zip(l[::1], l[1::1]), lambda k: k[1][0] <= k[0][1]):
    l = [*g][0]
    if v:
        mn, mx = min(i[0] for i in l), max(i[1] for i in l)
        if mx - mn > longest_milking:
            longest_milking = mx - mn
    else:
        mx = max((i2[0] - i1[1] for i1, i2 in zip(l[::1], l[1::1])))
        if mx > longest_idle:
            longest_idle = mx

# corner case, N=1 (only one interval)
if len(lst) == 1:
    longest_milking = lst[0][1] - lst[0][0]

print(longest_milking)
print(longest_idle)
Prints:
900
300
For input:
lst = [ [500,1200],
[200,900],
[100,1200] ]
Prints:
1100
0
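For comparison, the classic sort-and-merge-intervals approach also runs in O(n log n), needs no recursion, and reproduces both example results:

```python
def longest_intervals(intervals):
    """Return (longest continuous milking, longest gap) for a list of [start, end]."""
    intervals = sorted(intervals)             # sort by start time: O(n log n)
    merged = [list(intervals[0])]
    for start, end in intervals[1:]:
        if start <= merged[-1][1]:            # overlaps or touches the last merged interval
            merged[-1][1] = max(merged[-1][1], end)
        else:                                 # disjoint: a gap begins here
            merged.append([start, end])
    longest_milk = max(e - s for s, e in merged)
    longest_gap = max((merged[i + 1][0] - merged[i][1]
                       for i in range(len(merged) - 1)), default=0)
    return longest_milk, longest_gap

print(longest_intervals([[500, 1200], [200, 900], [100, 1200]]))   # (1100, 0)
print(longest_intervals([[300, 1000], [700, 1200], [1500, 2100]])) # (900, 300)
```

The `default=0` covers the single-interval corner case mentioned above without a separate check.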

Pandas dataframe not updating all columns

I am running the following test code to map violations to nearby BuildingIDs by "NearVicinity" and "MidVicinity". The results come out unexpected and I am not sure what I am missing in my code.
The results seem to have correctly updated the 'TicketIssuedDT', 'NearVicinity', and 'MidVicinity' columns; however, the 'BuildingID' and 'X' columns only map correctly for result['X'][0] and result['BuildingID'][0]. All remaining 999 rows have 0 for 'BuildingID' and 'X'.
result = pd.DataFrame(np.zeros((1000, 5)), columns=['BuildingID', 'TicketIssuedDT', 'NearVicinity', 'MidVicinity', 'X'])

z = 0
for i in range(0, 10):
    # for i, j in dataframe2.iterrows():
    dataframe2Lat = dataframe2['Latitude'][i]
    dataframe2Long = dataframe2['Longitude'][i]
    for x in range(0, 11102):
        # for x, y in dataframe1.iterrows():
        dist = (math.fabs(dataframe2Long - dataframe1['Longitude'][x]) + math.fabs(dataframe2Lat - dataframe1['Latitude'][x]))
        if dist < .02:
            result['X'][z] = x
            result['BuildingID'][z] = dataframe1['BuildingID'][x]
            result['TicketIssuedDT'][z] = dataframe2['TicketIssuedDT'][i]
            result['MidVicinity'][z] = 1
            if dist < .007:
                result['NearVicinity'][z] = 1
            else:
                result['NearVicinity'][z] = 0
            z += 1
    print(i)
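One likely culprit here is chained indexing: result['X'][z] = x assigns through an intermediate Series that, depending on the pandas version, may be a copy rather than a view of result, so some assignments silently don't stick. Writing through .loc[row, column] assigns on the frame itself. A minimal sketch (column names from the question; the loop values are made up for the demo):

```python
import numpy as np
import pandas as pd

result = pd.DataFrame(np.zeros((5, 2)), columns=['BuildingID', 'X'])

# Chained indexing (result['X'][z] = ...) may write to a temporary copy;
# .loc[row, column] always writes into the frame itself.
for z in range(5):
    result.loc[z, 'X'] = z
    result.loc[z, 'BuildingID'] = 100 + z

print(result['X'].tolist())
print(result['BuildingID'].tolist())
```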
