Why is my code running so slowly? (Nested for loops)

I recently wrote some code that produces the output I want, but it takes an eternity to run. I have 700k customer rows to iterate through for 435 warehouses, and processing 100k rows took 3 hours.
I know this question may be vague, but I'm not sure why it is running so slowly. I suspect it is due to my nested for loops, but the data won't load unless I chunk it.
import pandas as pd
import geopy.distance
dfware = pd.read_csv('dfware.csv', encoding = "ISO-8859-1")
dfcust = pd.read_csv(r'dfcust.csv', encoding = "ISO-8859-1")
ppmwinarray = []
#Chunk size to load in
csize=10 ** 3
Bigoutput = []
y=0
for dfcust in pd.read_csv(r'dfcust.csv', encoding = "ISO-8859-1", chunksize = csize):
    #For all columns (index) and rows, in datacust, iterate through those rows.
    y+=1
    print(y)
    ppmwinarray = []
    z=0
    for index,row in dfcust.iterrows():
        #Assign the Lattitude variable in the row titled 'lat' to lat1
        lat1 = row['Lat']
        # Assign the longitude variable in the row titled 'Long' to lon1
        lon1 = row['Lon']
        dlist=[]
        dindex=[]
        print(z)
        z+=1
        for index2, row2 in dfware.iterrows():
            y+=1
            lat2 = row2['Lat']
            lon2 = row2['Lon']
            coords_1 = [lat1, lon1]
            coords_2 = [lat2, lon2]
            distance = geopy.distance.distance(coords_1, coords_2).miles
            if distance > 300:
                distance = 0
            else:
                distance = distance
            dlist.append(distance)
            d_i = ((300-distance)/300)
            if d_i != 1:
                d_i=d_i
            else:
                d_i=0
            dindex.append(d_i)
        sumdi = sum(dindex)
        if sumdi == 0:
            sumdi = 1
        #Defining last 13 as PPM Index
        ppmdindex =(dindex[-13:])
        #Calculating their independent win chance
        IndependentWinChance=[ x/ sumdi for x in ppmdindex]
        #Store in an array
        ppmarray = IndependentWinChance
        #Summing independent chances to get sum chance
        sumppmWinChance = sum(ppmarray)
        #Appending the sum of all distance indexes
        ppmarray.append(sumdi)
        #Appending the sum of ppm win chance
        ppmarray.append(sumppmWinChance)
        ppmwinarray.append(ppmarray)
    Bigoutput.extend(ppmwinarray)
Bigoutputdf = pd.DataFrame(Bigoutput)
Bigoutputdf.to_csv('customers1234.csv')
exit()
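Most of the runtime here likely comes from calling geopy.distance.distance once per customer-warehouse pair (700k × 435 is roughly 300 million calls) from inside iterrows(). A minimal sketch of a vectorized alternative follows. It assumes the haversine great-circle formula is accurate enough for a 300-mile cutoff, which makes it an approximation of geopy's geodesic distance rather than a drop-in replacement. The names haversine_matrix and distance_index, the Earth-radius constant, and the sample coordinates are illustrative and not part of the original code.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumed constant


def haversine_matrix(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between every point in set 1 and every point in set 2.

    Returns an array of shape (len(lat1), len(lat2)).
    """
    # Column vector for set 1, row vector for set 2, so broadcasting builds the full grid.
    lat1 = np.radians(np.asarray(lat1, dtype=float))[:, None]
    lon1 = np.radians(np.asarray(lon1, dtype=float))[:, None]
    lat2 = np.radians(np.asarray(lat2, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(lon2, dtype=float))[None, :]
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # Guard against rounding pushing a slightly above 1.
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(np.minimum(1.0, a)))


def distance_index(dist, cutoff=300.0):
    """Same rule as the loop: (cutoff - d) / cutoff, but 0 beyond the cutoff or at d == 0."""
    d_i = (cutoff - dist) / cutoff
    return np.where((dist > cutoff) | (dist == 0), 0.0, d_i)


# Tiny demonstration with made-up coordinates (2 customers, 3 warehouses).
cust_lat, cust_lon = [0.0, 40.0], [0.0, -75.0]
ware_lat, ware_lon = [0.0, 0.0, 40.0], [0.0, 1.0, -75.0]
dist = haversine_matrix(cust_lat, cust_lon, ware_lat, ware_lon)
d_index = distance_index(dist)
sumdi = d_index.sum(axis=1)  # one total per customer, like sum(dindex) in the loop
```

With the DataFrames from the question, this would be called once per chunk as haversine_matrix(dfcust['Lat'], dfcust['Lon'], dfware['Lat'], dfware['Lon']), giving one 1000 × 435 array per chunk instead of 435,000 separate distance calls.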

Related

How to Maintain Segmented Cell Numbering On Moving Tissue

The overall goal of the project is to identify cells that are undergoing a wave-like event and determine if cell area changes after the event occurs (red arrow). The raw file is a video of live frog epithelial tissue that has been exported frame-by-frame into multiple .tiff files. The image below is just one of the many frames used.
(https://i.stack.imgur.com/iFfYR.jpg)
Using Cellpose we have already been able to segment each frame of the video and determine the centroids. The image below is saved as .npy file as indicated in the code.
(https://i.stack.imgur.com/7XP0j.jpg)
We then go on to number each cell and calculate the area across all the frames we used.
(https://i.stack.imgur.com/xX7pY.png) (https://i.stack.imgur.com/JcvPj.png)
As you might be able to tell, this is where we run into issues. Since we are using live epithelial tissue, there is movement in the original video file. As new cells come into frame throughout the raw video, the numbering changes completely. This makes the compiled data virtually useless, as we cannot tell which cell is the same between frames (see image). Another (less pressing) issue is that the wave event is almost always identified as a cell by the Cellpose segmentation program, regardless of how strict the threshold is.
(https://i.stack.imgur.com/zmvI6.png)
def graphData(filepath, oneImage, maxDist, minDist, savePath, length):
    # graph data for all files in a folder
    if oneImage == False:
        for file in os.listdir(filepath):
            filename = os.fsdecode(file)
            temp = filepath + filename
            if filename.endswith(".npy"):
                a = len(filename)-7
                save = savePath+filename[:a]+".tif"
                graphData(temp,True,maxDist,minDist,save,length)
    else:
        #get data from .npy file
        data = np.load(filepath, allow_pickle = True).item()
        masks = data['masks']
        properties = ('label', 'centroid')
        prop_dict = regionprops_table(data['masks'], properties=properties)
        centroidsDF = pd.DataFrame(prop_dict)
        outlines = data['outlines']
        results = []
        n = 0
        #find cell intersections
        for x in range(len(masks)-2):
            for y in range(len(masks[0])-2):
                count = 0
                temp = [0,masks[x][y]]
                if masks[x][y] != n and x != 0 and y != 0:
                    for a in range(5):
                        for b in range(5):
                            if masks[x-2+a][y-2+b] not in temp:
                                count += 1
                                temp.append(masks[x-2+a][y-2+b])
                if count >= 2:
                    results.append([x,y,count+1])
        #remove intersections near the edge of the image
        for a in results:
            if a[0] < 20 or a[1] < 20:
                results.remove(a)
        #remove cell intersections that are too close (based on minDist)
        for p in range(5):
            for a in results:
                nearest = 99999
                for b in results:
                    dist = abs(math.sqrt((abs(a[0]-b[0])**2)+(abs(a[1]-b[1])**2)))
                    if dist <= nearest and dist != 0:
                        nearest = dist
                for b in results:
                    dist = abs(math.sqrt((abs(a[0]-b[0])**2)+(abs(a[1]-b[1])**2)))
                    if dist <= minDist and dist == nearest:
                        results.remove(b)
        #combine cell intersections that are too close (based on maxDist)
        for p in range(5):
            for a in results:
                nearest = 99999
                for b in results:
                    dist = abs(math.sqrt((abs(a[0]-b[0])**2)+(abs(a[1]-b[1])**2)))
                    if dist <= nearest and dist != 0:
                        nearest = dist
                for b in results:
                    dist = abs(math.sqrt((abs(a[0]-b[0])**2)+(abs(a[1]-b[1])**2)))
                    if dist == nearest and dist <= maxDist:
                        avgX = (a[0]+b[0])/2
                        avgY = (a[1]+b[1])/2
                        Tcount = a[2]
                        if dist >= minDist:
                            Tcount = round((a[2]+b[2]+2)/2)
                        for i in range(len(results)-1):
                            if results[i]==a:
                                results[i]= [avgX, avgY, Tcount]
                        results.remove(b)
        plt.clf()
        # graph masks and outlines
        for i in range(len(np.unique(masks))):
            if i != 0:
                plt.scatter(np.where(masks==i)[0],np.where(masks==i)[1],s=0.0001,c='0.9')
        for i in range(len(np.unique(masks))):
            if i != 0:
                plt.scatter(np.where(outlines==i)[0],np.where(outlines==i)[1],s=0.001,c='0.4')
        xs = [x[0] for x in results]
        ys = [x[1] for x in results]
        zs = [x[2] for x in results]
        plt.scatter(xs,ys,s=.05,c='k')
        #graph number of intersecting cells
        #for a in range(len(xs)):
        #plt.text(xs[a]+1,ys[a]+1,zs[a],size=3,c='k')
        save = savePath+os.path.basename(filepath)[:len(os.path.basename(filepath))-3]+"txt"
        areas = []
        for x in range(len(np.unique(data['masks']))-1):
            count = 0
            for a in range(len(masks)):
                for b in range(len(masks[0])):
                    if masks[a][b] == x:
                        count+=1
            areas.append([x,count])
        areas.pop(0)
        count = 0
        for a in areas:
            count += a[1]
        count /= (len(np.unique(data['masks']))-1)
        if(exists(save)):
            os.remove(save)
        with open(save, 'w') as f:
            f.write("average cell area: "+ str(round(count/length,5)))
            f.write("\n\nnumber of cells: "+ str(len(np.unique(data['masks']))-1))
            f.write("\n\narea of each cell (cell # and area): ")
            for a in areas:
                f.write("\n"+str(a[0])+", "+str(a[1]/length))
        for a in range(len(centroidsDF)):
            plt.text(centroidsDF['centroid-0'][a],centroidsDF['centroid-1'][a],centroidsDF['label'][a],size=3)
        plt.savefig(savePath, dpi = 1440)
Is there more efficient code that would help complete our project, or does the movement of the tissue make this too difficult?
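On the efficiency side, the per-label pixel counts can be computed without Python loops, and one common way to keep numbering stable is to match each frame's centroids to those of the previous frame. The following is only a sketch under assumptions: cell_areas and match_centroids are hypothetical helpers, the greedy nearest-centroid matching assumes cells move less than max_dist pixels between frames, and none of it has been tried on real Cellpose output.

```python
import numpy as np


def cell_areas(masks):
    """Return {label: pixel_count} for every non-zero label in a label image."""
    labels, counts = np.unique(masks, return_counts=True)
    return {int(l): int(c) for l, c in zip(labels, counts) if l != 0}


def match_centroids(prev, curr, max_dist=20.0):
    """Greedily pair current centroids with previous ones, closest pairs first.

    Returns {index_in_curr: index_in_prev}. Current cells left unmatched are
    meant to be treated by the caller as new cells that need a fresh ID.
    """
    prev = np.asarray(prev, dtype=float).reshape(-1, 2)
    curr = np.asarray(curr, dtype=float).reshape(-1, 2)
    # Pairwise Euclidean distances, shape (len(curr), len(prev)).
    dist = np.linalg.norm(curr[:, None, :] - prev[None, :, :], axis=2)
    matches, used_prev = {}, set()
    for flat in np.argsort(dist, axis=None):
        c, p = (int(v) for v in np.unravel_index(flat, dist.shape))
        if dist[c, p] > max_dist:
            break  # all remaining pairs are even farther apart
        if c not in matches and p not in used_prev:
            matches[c] = p
            used_prev.add(p)
    return matches


# Made-up example: two cells drift slightly and a third enters the frame.
masks = np.array([[0, 1, 1], [2, 2, 2], [0, 0, 1]])
areas = cell_areas(masks)
prev_centroids = [(10.0, 10.0), (50.0, 50.0)]
curr_centroids = [(52.0, 51.0), (11.0, 9.0), (200.0, 200.0)]
matches = match_centroids(prev_centroids, curr_centroids)
```

Here the third current centroid has no match, so it would receive a new ID while the other two inherit the IDs of the cells they were matched to. Greedy matching is a simplification; tissue that moves a lot between frames may need a global assignment or a drift correction first.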

Eliminating Certain Values in Dataframe

Initial Data
import numpy as np
import pandas as pd
d = {'RedVal':[1,1.1,2,1.5,1.7,2,1,1.1,2,1,1.1,2,2.6,2.5,2.4,2.5], 'GreenVal':[1,1.1,1.1,1,1.1,1.7,1,1.1,1.5,1,1.9,3,2.8,2.7,2.6,2.5],'Frame':[0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3],'Particle':[0,0,0,0,2,2,2,2,3,3,3,3,4,4,4,4] }
testframe = pd.DataFrame(data=d)
testframe
framenot = 2 #set how many frames you would like to get initial ratio for
ratarray = [] #initialize blank ratio array
# sort_values returns a new frame, so the result has to be assigned
testframe = testframe.sort_values(by =[ 'Particle', 'Frame'])
for particle in range(0,5):
    if(testframe['Particle']== particle).any() == False:
        particle = particle + 1
    else:
        newframe = testframe.loc[(testframe['Frame']<= framenot) & (testframe['Particle'] == particle)]
        #print(particle)
        for i in range(framenot):
            #print(i)
            GVal = newframe['GreenVal'].values[i]
            RVal = newframe['RedVal'].values[i]
            ratio = RVal/GVal
            #print(RVal)
            #print(GVal)
            #print(ratio)
            ratarray.append(ratio)
            i+=1
        #print(ratarray)
        particle+=1
ratarray = np.array(ratarray)
avgRatios = np.average(ratarray.reshape(-1,framenot), axis = 1)
stdRatios = np.std(ratarray.reshape(-1,framenot), axis = 1)
print(avgRatios) #array with average ratios over set frames starting from initial particle
print(stdRatios)
So far I have code that gives the average and standard deviation of each particle's Red/Green ratio over frames 0 and 1. Now I want to compare this average ratio to the ratio for the next x frames and eliminate particles whose subsequent ratios fall outside avg + 2 stdev. I'm not quite sure how to do this. Any help is appreciated.
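One possible way to do the comparison is with a groupby, sketched below. It makes two assumptions that are not stated in the question: "outside avg + 2 stdev" is read as a two-sided band (mean ± 2 standard deviations), and every frame after the baseline frames is checked. The function name filter_particles and its parameters are illustrative.

```python
import pandas as pd


def filter_particles(df, n_baseline=2, n_sigma=2.0):
    """Drop particles whose later Red/Green ratios leave mean ± n_sigma * std of the baseline."""
    df = df.assign(Ratio=df["RedVal"] / df["GreenVal"])
    baseline = df[df["Frame"] < n_baseline].groupby("Particle")["Ratio"]
    mean = baseline.mean()
    std = baseline.std(ddof=0)  # ddof=0 matches np.std as used in the question
    # Look up each row's band limits by its particle number.
    upper = df["Particle"].map(mean + n_sigma * std)
    lower = df["Particle"].map(mean - n_sigma * std)
    later = df["Frame"] >= n_baseline
    outside = later & ((df["Ratio"] > upper) | (df["Ratio"] < lower))
    rejected = df.loc[outside, "Particle"].unique()
    return df[~df["Particle"].isin(rejected)]


d = {'RedVal': [1, 1.1, 2, 1.5, 1.7, 2, 1, 1.1, 2, 1, 1.1, 2, 2.6, 2.5, 2.4, 2.5],
     'GreenVal': [1, 1.1, 1.1, 1, 1.1, 1.7, 1, 1.1, 1.5, 1, 1.9, 3, 2.8, 2.7, 2.6, 2.5],
     'Frame': [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
     'Particle': [0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4]}
testframe = pd.DataFrame(data=d)
kept = filter_particles(testframe)
```

With only two baseline frames the standard deviation is a very rough estimate, so the band can be extremely narrow (or zero-width when both baseline ratios are equal); a larger n_baseline may be worth considering.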

K-Means clustering multidimensional data with a heatmap

I have been trying to implement k-means clustering with a heatmap, but have been unsuccessful.
Here is the initial data set:
https://raw.githubusercontent.com/gsprint23/cpts215/master/progassignments/files/simple.csv
And here is my code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import math
import random
#%matplotlib inline
def truncate(f, n):
    return math.floor(f * 10 ** n) / 10 ** n
def chooseCenter(data, centers):
    length = data.shape
    cent = []
    while len(cent) < centers :
        x = random.randrange(0,length[0])
        y = random.randrange(0,length[1])
        if data.iloc[x][y] not in cent:
            d = truncate(data.iloc[x][y],2)
            cent.append(d)
    return cent
def distance(val, center):
    return math.sqrt((val- center)**2)
def getDistances(centers, data):
    length = data.shape
    dist = []
    for i in range(length[0]):
        for j in range(length[1]):
            y = []
            for k in range(len(centers)):
                val = distance(data.iloc[i][j], centers[k])
                y.append(truncate(val,3))
            dist.append(y)
    return dist
def findClosest(data, dist):
    close = data.copy()
    length = close.shape
    indexes = []
    for i in range(len(dist)):
        pt = min(dist[i])
        idx = dist[i].index(pt)
        indexes.append(idx)
    #print(indexes)
    length = data.shape
    n = np.array(indexes)
    n = pd.DataFrame(np.reshape(n, (length[0],length[1])))
    #reshape this data frame into the same shape as the data
    #keep running the find closest until there is no change
    #try heatmap on this?
    #this should cluster it, but to make sure test it
    #might need to do some tweaking to this
    return n
    # for i in range(length[0]):
    # for j in range(length[1]):
    # print('dist[i]', dist[j])
    # pt = min(dist[j])
    # print(pt)
    # idx = dist[j].index(pt)
    # close.iloc[i][j] = int(idx)
    #return close
def computeNewCenter(data, close):
    d = dict()
    for i in range(len(close)):
        for j in range(len(close[0])):
            d[close.iloc[i][j]] = []
    for i in range(len(data)):
        for j in range(len(data[0])):
            if close.iloc[i][j] in d:
                d[close.iloc[i][j]].append(data.iloc[i][j])
    newCenters = []
    for key, value in d.items():
        m = np.mean(value)
        newCenters.append(truncate(m, 3))
    return newCenters
    # lst = [[] * numcenters]
    # for i in range(len(close)):
    # for j in range(len(close[0])):
    # if close.iloc[i][j]
def main():
    data = np.array(pd.read_csv('https://raw.githubusercontent.com/gsprint23/cpts215/master/progassignments/files/simple.csv', header=None))
    data = data.T
    #print(data)
    df = pd.DataFrame(data[1:], columns=data[0], dtype=float).T
    df = df.iloc[::-1]
    # print(df)
    # print(df.iloc[1][9])
    # print(df)
    # print(df.iloc[0][1])
    # heatmap = plt.pcolor(df, cmap=plt.cm.bwr)
    # plt.colorbar(heatmap)
    c = chooseCenter(df, 3)
    print(c)
    #print(len(c))
    dist = getDistances(c, df)
    #print(dist)
    y = findClosest(df, dist)
    # q = []
    # for i in range(len(c)):
    # q.append([])
    # #print(q)
    j = computeNewCenter(df, y)
    #print(j)
    length = df.shape
    oldFrame = pd.DataFrame(np.ndarray((length[0],length[1])))
    oldFrame = oldFrame.fillna(0)
    ct=0
    while y.equals(oldFrame) == False:
        ct+=1
        oldFrame = y.copy()
        c = computeNewCenter(df, oldFrame)
        #print(c)
        dist = getDistances(c, df)
        #print(dist)
        y = findClosest(df, dist)
        #print(y)
    #plt.pcolor(df, cmap=plt.cm.bwr)
    l = []
    for i in range(len(y)):
        for j in range(len(y[0])):
            if y.iloc[i][j] == 1:
                l.append(df.iloc[i][j])
    for i in range(len(y)):
        for j in range(len(y[0])):
            if y.iloc[i][j] == 2:
                l.append(df.iloc[i][j])
    for i in range(len(y)):
        for j in range(len(y[0])):
            if y.iloc[i][j] == 0:
                l.append(df.iloc[i][j])
    l = np.ndarray((length[0],length[1]))
    l = pd.DataFrame(l)
    print(l)
    hm = plt.pcolor(l, cmap=plt.cm.bwr)
    plt.colorbar(hm)
    # print(y)
    # print(c)
    # print(ct)
    #plt.pcolor(y, cmap=plt.cm.bwr)
if __name__ == '__main__':
    main()
My line of thinking was this:
First, randomly choose the centers.
Then create a list of lists for each point for the distance to each center.
Find the index of the minimum distance for each point for each center.
Create a data frame of the same size as the data set and fill each index for each element with the index of the center the point is closest to.
Recompute the center by taking the mean of the points with the same center index
Repeat this process multiple times until the index data frame does not change.
Create a new data frame and add the points which have the same center point close together in the frame.
Then create the heatmap.
This did not seem to work though.
Am I on the right track, or am I completely off? If I am on the right track, which parts would I need to change to fix the issue? If not, could you please point me in the right direction?
Here is a comparison of the maps:
The first one is the one my program generated while the second is the way it is supposed to look.
I know my problem lies in some part of the k-means clustering algorithm, and my guess is it is either in the reassignment stage where you reassign the points to the centroids and calculate the new centroids or in the stopping condition in that the algorithm does not run long enough. Also in the back of my head, something tells me that I am not doing this as efficiently as I could have and that I am missing something key. I have watched several videos on K-means clustering and understand it conceptually, I'm just having a hard time implementing it.
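The steps described above can be sketched compactly with NumPy. Two details in the listing look like probable causes of the wrong heatmap: the list l that collects the grouped values is overwritten by np.ndarray(...) just before plotting, so the plot shows an uninitialized array, and computeNewCenter appears to return centers in the order labels are first encountered rather than by label index, so center k may not belong to label k on the next pass. The sketch below is a minimal one-dimensional version with hypothetical names (kmeans_1d, init_centers) and made-up data; it is not the assignment's reference solution.

```python
import numpy as np


def kmeans_1d(values, init_centers, max_iter=100):
    """Cluster scalar values; returns (labels, centers) with labels[i] indexing centers."""
    values = np.asarray(values, dtype=float).ravel()
    centers = np.asarray(init_centers, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every value.
        new_labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Update step: center k is always the mean of the points labelled k.
        for k in range(len(centers)):
            members = values[labels == k]
            if members.size:  # leave an empty cluster's center where it is
                centers[k] = members.mean()
    return labels, centers


# Made-up data with three obvious groups and hand-picked starting centers.
values = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 10.0, 10.1, 9.9]
labels, centers = kmeans_1d(values, init_centers=[0.0, 6.0, 12.0])
# Sorting by label groups same-cluster values together, e.g. before reshaping for a heatmap.
ordered = np.asarray(values)[np.argsort(labels, kind="stable")]
```

For the DataFrame in the question, values could be df.to_numpy().ravel() and the labels reshaped back with labels.reshape(df.shape). Random starting centers can converge to a poor grouping, so running from several starts and keeping the best result is a common precaution.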

Python: passing coordinates from list to function

I am using some code from a workshop to extract data from netCDF files by the coordinates closest to my specified coordinates. When using just one set of coordinates I am able to extract the values I need without trouble as below:
import numpy as np
import netCDF4
from math import pi
from numpy import cos, sin
def tunnel_fast(latvar,lonvar,lat0,lon0):
    '''
    Find closest point in a set of (lat,lon) points to specified point
    latvar - 2D latitude variable from an open netCDF dataset
    lonvar - 2D longitude variable from an open netCDF dataset
    lat0,lon0 - query point
    Returns iy,ix such that the square of the tunnel distance
    between (latval[iy,ix],lonval[iy,ix]) and (lat0,lon0)
    is minimum.
    '''
    rad_factor = pi/180.0 # for trigonometry, need angles in radians
    # Read latitude and longitude from file into numpy arrays
    latvals = latvar[:] * rad_factor
    lonvals = lonvar[:] * rad_factor
    ny,nx = latvals.shape
    lat0_rad = lat0 * rad_factor
    lon0_rad = lon0 * rad_factor
    # Compute numpy arrays for all values, no loops
    clat,clon = cos(latvals),cos(lonvals)
    slat,slon = sin(latvals),sin(lonvals)
    delX = cos(lat0_rad)*cos(lon0_rad) - clat*clon
    delY = cos(lat0_rad)*sin(lon0_rad) - clat*slon
    delZ = sin(lat0_rad) - slat;
    dist_sq = delX**2 + delY**2 + delZ**2
    minindex_1d = dist_sq.argmin() # 1D index of minimum element
    iy_min,ix_min = np.unravel_index(minindex_1d, latvals.shape)
    return iy_min,ix_min
ncfile = netCDF4.Dataset('E:\wind_level2_1.nc', 'r')
latvar = ncfile.variables['latitude']
lonvar = ncfile.variables['longitude']
#_________GG turbine_________GAD10 Latitude 51.735516, GAD10 Longitude 1.942656
iy,ix = tunnel_fast(latvar, lonvar, 51.735516, 1.942656)
print('Closest lat lon:', latvar[iy,ix], lonvar[iy,ix])
refLAT=latvar[iy,ix]
refLON = lonvar[iy,ix]
#try to find the data for this location
SARwind = ncfile.variables['sar_wind'][:,:]
ModelWind = ncfile.variables['model_speed'][:,:]
print 'iy,ix' #appears to be the index of the value of Lat,lon
print SARwind[iy,ix]
ncfile.close()
Now I am trying to loop through a text file containing coordinates, coord_list, to extract each set of coordinates, find the data, then move to the next set of coordinates in the list. This code works on its own, as below:
import csv
from decimal import Decimal
with open('Turbine_locs_no_header.csv','rb') as f:
    reader = csv.reader(f)
    #coord_list = list(reader)
    coord_list = [reader]
end_row = len(coord_list)
lon_ind=1
lat_ind=2
for row in range(0, end_row-1):#end_row - 1 due to the 0 index
    turbine_lat = coord_list[row][lat_ind]
    turbine_lon = coord_list[row][lon_ind]
    turbine_lat = [Decimal(turbine_lat)]
    print 'lat',turbine_lat, 'lon',turbine_lon, row
However, I want to pass coordinates from the text file to this part of the original code, iy,ix = tunnel_fast(latvar, lonvar, 51.94341, 1.922094888), replacing the numbers with variables: iy, ix = tunnel_fast(latvar, lonvar, turbine_lat, turbine_lon). When I try to combine the two pieces of code by creating a function get_coordinates, I get the following errors:
  File "C:/Users/mm/test_nc_bycoords_GG_turbines_AGW.py", line 65, in <module>
    get_coordinates(coord_list, latvar, lonvar)
  File "C:/Users/mm/test_nc_bycoords_GG_turbines_AGW.py", line 51, in get_coordinates
    iy, ix = tunnel_fast(latvar, lonvar, turbine_lat, turbine_lon)
  File "C:/Users/mm/test_nc_bycoords_GG_turbines_AGW.py", line 27, in tunnel_fast
    lat0_rad = lat0 * rad_factor
TypeError: can't multiply sequence by non-int of type 'float'
I thought this was because turbine_lat and turbine_lon are list items and so cannot be used, but this doesn't seem to be connected to the errors. I know this code needs more work anyway, but if anyone could help me spot where I am going wrong, that would be very helpful. My attempt to combine the two pieces of code is below.
import numpy as np
import netCDF4
from math import pi
from numpy import cos, sin
import csv
# edited from https://github.com/Unidata/unidata-python-workshop/blob/a56daa50d7b343c7debe93968683613642d6b9f7/notebooks/netcdf-by-coordinates.ipynb
def tunnel_fast(latvar,lonvar,lat0,lon0):
    '''
    Find closest point in a set of (lat,lon) points to specified point
    latvar - 2D latitude variable from an open netCDF dataset
    lonvar - 2D longitude variable from an open netCDF dataset
    lat0,lon0 - query point
    Returns iy,ix such that the square of the tunnel distance
    between (latval[iy,ix],lonval[iy,ix]) and (lat0,lon0)
    is minimum.
    '''
    rad_factor = pi/180.0 # for trigonometry, need angles in radians
    # Read latitude and longitude from file into numpy arrays
    latvals = latvar[:] * rad_factor
    lonvals = lonvar[:] * rad_factor
    ny,nx = latvals.shape
    lat0_rad = lat0 * rad_factor
    lon0_rad = lon0 * rad_factor
    # Compute numpy arrays for all values, no loops
    clat,clon = cos(latvals),cos(lonvals)
    slat,slon = sin(latvals),sin(lonvals)
    delX = cos(lat0_rad)*cos(lon0_rad) - clat*clon
    delY = cos(lat0_rad)*sin(lon0_rad) - clat*slon
    delZ = sin(lat0_rad) - slat;
    dist_sq = delX**2 + delY**2 + delZ**2
    minindex_1d = dist_sq.argmin() # 1D index of minimum element
    iy_min,ix_min = np.unravel_index(minindex_1d, latvals.shape)
    return iy_min,ix_min
#________________my edits___________________________________________________
def get_coordinates(coord_list, latvar, lonvar):
    "this takes coordinates from a .csv and assigns them to variables"
    end_row = len(coord_list)
    lon_ind=1
    lat_ind=2
    for row in range(0, end_row-1):#end_row - 1 due to the 0 index
        turbine_lat = coord_list[row][lat_ind]
        turbine_lon = coord_list[row][lon_ind]
        iy, ix = tunnel_fast(latvar, lonvar, turbine_lat, turbine_lon)
        print('Closest lat lon:', latvar[iy, ix], lonvar[iy, ix])
#________________________________________________________________________________________________________________________
ncfile = netCDF4.Dataset('NOGAPS_wind_level2_1.nc', 'r')
latvar = ncfile.variables['latitude']
lonvar = ncfile.variables['longitude']
#____added in to pass to get coordinates function
with open('Turbine_locs_no_header.csv','rb') as f:
    reader = csv.reader(f)
    coord_list = list(reader)
#_________take latitude from coordinateas function
get_coordinates(coord_list, latvar, lonvar)
#iy,ix = tunnel_fast(latvar, lonvar, turbine_lat, turbine_lon)#get these from the 'assign_coordinates_fromlist.py
#print('Closest lat lon:', latvar[iy,ix], lonvar[iy,ix])
SARwind = ncfile.variables['sar_wind'][:,:]
ModelWind = ncfile.variables['model_speed'][:,:]
print 'iy,ix' #appears to be the index of the value of Lat,lon
print SARwind[iy,ix]
ncfile.close()
When I try to convert

You can unpack an argument list using *args (see the docs). In your case you could do tunnel_fast(latvar, lonvar, *coord_list[row]). You need to make sure that the order of arguments in coord_list[row] is correct and if coord_list[row] contains more than the two values then you need to slice it appropriately.

Thanks to help from a_guest.
It was a simple problem of lat0 and lon0 being passed as <type 'str'> to tunnel_fast when it requires <type 'float'>. This comes from reading coord_list with csv.reader, which returns every field as a string.
with open('Turbine_locs_no_header.csv','rb') as f:
    reader = csv.reader(f)
    coord_list = list(reader)
The workaround I used was to convert lat0 and lon0 to floats at the beginning of tunnel_fast
lat0 = float(lat0)
lon0 = float(lon0)
I am sure there is a more elegant way to do this, but it works.
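One arguably tidier option is to convert the fields to float while reading the file, so that tunnel_fast never sees strings. The sketch below is written for Python 3 (the listings above are Python 2) and uses a hypothetical helper, read_coordinates; the column positions follow lon_ind=1 and lat_ind=2 from the question, and the sample rows and their turbine names are made up to stand in for Turbine_locs_no_header.csv.

```python
import csv
import io


def read_coordinates(f, lat_ind=2, lon_ind=1):
    """Yield (lat, lon) as floats from an open CSV file; csv.reader itself yields strings."""
    for row in csv.reader(f):
        if row:  # skip blank lines
            yield float(row[lat_ind]), float(row[lon_ind])


# Stand-in for the real file; with a file on disk use open(path, newline='') in Python 3.
sample = io.StringIO("GAD10,1.942656,51.735516\nGAD11,1.922094888,51.94341\n")
coords = list(read_coordinates(sample))
```

Each pair can then be passed on as tunnel_fast(latvar, lonvar, lat, lon), or unpacked with tunnel_fast(latvar, lonvar, *pair). Iterating over the rows directly also avoids range(0, end_row-1), which stops one row short of the end of the list.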

GPS downsampling

My goal is to downsample my input data to one point every 100 m and also get the first and last line.
My problem is that I get far fewer lines than I should when I downsample, and I don't know how to get the last line.
I hope this is clear enough to understand.
To make this
Line 20130904_0848.nmea
$GPGGA,111936.00,5849.37538,N,01739.88263,,E,2,09,00.9,00004.43,M,0024.87,M,007,0734*42
$GPGGA,111936.00,5849.37548,N,01739.88240,,E,2,09,00.9,00004.43,M,0024.87,M,007,0734*44
$GPGGA,111936.00,5849.37556,N,01739.88216,,E,2,09,00.9,00004.43,M,0024.87,M,007,0734*48
$GPGGA,111936.00,5849.37569,N,01739.88193,,E,2,09,00.9,00004.43,M,0024.87,M,007,0734*4a
$GPGGA,111936.00,5849.37581,N,01739.88171,,E,2,09,00.9,00004.43,M,0024.87,M,007,0734*40
$GPGGA,111936.00,5849.69118,N,01739.89674,,E,2,09,00.9,00004.43,M,0024.87,M,007,0734*4c
EOL
Line 20130904_0926.nmea
$GPGGA,111936.00,5849.67569,N,01739.98426,,E,2,09,00.9,00004.43,M,0024.87,M,007,0734*45
$GPGGA,111936.00,5849.67593,N,01739.98453,,E,2,09,00.9,00004.43,M,0024.87,M,007,0734*42
$GPGGA,111936.00,5849.67616,N,01739.98479,,E,2,09,00.9,00004.43,M,0024.87,M,007,0734*44
....
Look like this
Line 20081002-1119.nmea
58.853952 13.309779 0.00
58.853907 13.310688 101.15
58.853858 13.311593 100.72
58.853811 13.312498 100.62
58.853764 13.313402 100.59
58.853752 13.313660 28.70
EOL
Line 20081002-1119.nmea
58.853952 13.309779 0.00
58.853907 13.310688 101.15
58.853858 13.311593 100.72
58.853811 13.312498 100.62
58.853764 13.313402 100.59
...
This is my code so far
from math import sin, cos, sqrt, atan2, radians
coord=[]
coord1=None
def distance(coord1,coord2): #Haversin
    lat1,lon1=coord1
    lat2,lon2=coord2
    dlat = radians(lat2-lat1)
    dlon = radians(lon2-lon1)
    a = sin(dlat/2) * sin(dlat/2)
    + cos(radians(lat1))*cos(radians(lat2))*sin(dlon/2)*sin(dlon/2)
    c = 2 *atan2(sqrt(a),sqrt(1-a))
    s = (6367*c)*1000 #meter
    return s
# with open as data will close itself after reading each line. so you don't need to close it yourself
with open('asko_nav_2013.nmea', 'r') as indata: #making a indata and outdata, r stands for reading(readcapabilities
    with open('asko_nav_out.txt', 'w') as outdata: #w stands for write write to a new file(open for writing-you can change things)
        while True:
            line = indata.readline()
            if not line:
                break
            if line.startswith('EOL'): #if the line starts with EOL(end of line) it writes it in the output
                outdata.writelines("EOL")
                coord1=None
            elif line.startswith('Line'):
                LineID=line
                outdata.writelines('\n%s' %LineID)
            elif line.startswith('$GPGGA'): #when the fist line starts with $GPGGA it splits the columns
                data=line.split(",") #the for loop reads the file line by line
                # Importing only coordinates from asko input file (Row 2 and 4)
                # Converting the coordinates from DDMM.MMMM to DD.DDDDDD
                LAT=(data[2])
                LAT_D=LAT[0:2]
                LATID=float(LAT_D)
                LAT_M=LAT[2:]
                LATM=float(LAT_M)
                LATIM = float(LATM) / 60.0
                latitude=(LATID + LATIM)
                LON=(data[4])
                LON_D=LON[1:3]
                LONGD=float(LON_D)
                LON_M=LON[3:]
                LONM=float(LON_M)
                LONGM = float(LONM) / 60.0
                longitude=(LONGD + LONGM)
                if coord1 is None:
                    # The first time through the loop "coord1" is None
                    outdata.writelines('%0.6f\t%0.6f\t%s \n'%(latitude,longitude,0))
                    coord1=(latitude,longitude)
                else:
                    coord2=(latitude,longitude)
                    dist=distance(coord1,coord2)
                    if dist <100:
                        continue
                    outdata.writelines('%0.6f\t%0.6f\t%f\n' % (latitude,longitude,dist))
                    coord1=coord2

Your code can do with a little bit of reorganising to make it clearer. You need to add an additional write whenever EOL is seen for the case where the distance is under 100m:
from math import sin, cos, sqrt, atan2, radians
def distance(coord1, coord2): # Haversine
    lat1,lon1=coord1
    lat2,lon2=coord2
    dlat = radians(lat2-lat1)
    dlon = radians(lon2-lon1)
    a = sin(dlat/2) * sin(dlat/2)
    + cos(radians(lat1))*cos(radians(lat2))*sin(dlon/2)*sin(dlon/2)
    c = 2 *atan2(sqrt(a),sqrt(1-a))
    s = (6367*c)*1000 #meter
    return s
def get_coordinates(data):
    # Importing only coordinates from asko input file (fields 2 and 4)
    # Converting the coordinates from DDMM.MMMM to DD.DDDDDD
    LAT = (data[2])
    LAT_D = LAT[0:2]
    LATID = float(LAT_D)
    LAT_M = LAT[2:]
    LATM = float(LAT_M)
    LATIM = float(LATM) / 60.0
    latitude = (LATID + LATIM)
    LON = (data[4])
    LON_D = LON[1:3]
    LONGD = float(LON_D)
    LON_M = LON[3:]
    LONM = float(LON_M)
    LONGM = float(LONM) / 60.0
    longitude = (LONGD + LONGM)
    return (latitude, longitude)
coord1 = None
# The with statement closes both files when the block ends, so you don't need to close them yourself
with open('asko_nav_2013.nmea', 'r') as indata, open('asko_nav_out.txt', 'w') as outdata:
    for line in indata: # the for loop reads the file line by line
        if line.startswith('EOL'): # EOL marks the end of a section; write the last point if it was skipped
            if dist < 100:
                outdata.write('%0.6f\t%0.6f\t%f\n' % (latitude, longitude, dist))
            outdata.write("\nEOL\n")
            coord1 = None # Reset the first coordinate
        elif line.startswith('Line'):
            outdata.write('\n%s' % line)
        elif line.startswith('$GPGGA'): # a $GPGGA line holds a position; split it into fields
            data=line.split(",")
            latitude, longitude = get_coordinates(data)
            if coord1:
                coord2 = (latitude, longitude)
                dist = distance(coord1, coord2)
                if dist >= 100:
                    outdata.write('%0.6f\t%0.6f\t%f\n' % (latitude, longitude, dist))
                    coord1 = coord2
            else:
                # The first time through the loop "coord1" is None
                outdata.write('%0.6f\t%0.6f\t0.0 \n' % (latitude, longitude))
                coord1 = (latitude, longitude)
For your given input, this produces the following output file:
Line 20130904_0848.nmea
58.822923 17.664710 0.0
58.828186 17.664946 584.888514
EOL
Line 20130904_0926.nmea
58.827928 17.666404 0.0
58.827936 17.666413 0.870480
EOL
You also need to reset coord1 whenever EOL is detected to make sure 0 is displayed again for the first entry.
It is a bit difficult to see if this completely solves matters as your sample data does not seem to tally with your expected output.

Addressing the second issue concerning fewer result lines than expected: You are providing too little information about the nature of your problem and the input data you are processing. Sampling your input "for every 100m" could mean something different if your input data is sampled from a trajectory travelled by a moving object, especially if the motion is not purely linear.
Imagine that your input describes coordinates obtained by measuring GPS coordinates in regular intervals while moving along a circle with radius smaller than, say, 15m. Then no matter how many data points your input provides, the output for your proposed solution will never be longer than two lines, because no two points along that curve can have an absolute distance greater than 100m. This might explain why you are seeing fewer lines in the output than expected.
If you mean to sample the input at every 100m travelled, you would have to sum over all distances between input samples since the last point sampled for output and use that instead of dist. Modifying Martin's reorganised code, it could be done like this (some lines omitted for brevity):
coord1 = None
coord_last = None # holds coordinate of last input sample
dist = 0.0 # total distance travelled since coord1
# [...]
with open('asko_nav_2013.nmea', 'r') as indata, open('asko_nav_out.txt', 'w') as outdata:
    for line in indata:
        # [...]
            if coord1:
                coord2 = (latitude, longitude)
                delta = distance(coord_last, coord2)
                dist += delta
                coord_last = coord2
                if dist >= 100:
                    outdata.write('%0.6f\t%0.6f\t%f\n' % (latitude, longitude, dist))
                    coord1 = coord2
                    dist = 0.0
            else:
                # The first time through the loop "coord1" is None
                outdata.write('%0.6f\t%0.6f\t0.0 \n' % (latitude, longitude))
                coord1 = (latitude, longitude)
                coord_last = coord1
                dist = 0.0
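One further detail may be worth checking. In both distance listings, as they appear here, the line beginning with + cos(...) is a separate statement: Python evaluates it and discards the result, so a holds only the latitude term and east-west movement contributes nothing to the distance. Wrapping the whole expression in parentheses keeps both terms together. The sketch below combines that with the cumulative-distance idea and with writing the last point; haversine_m and downsample are hypothetical names, the points are made up, and the parsing of the NMEA file is left out.

```python
from math import atan2, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6367 * 1000  # same radius as the listings above, in metres


def haversine_m(coord1, coord2):
    """Great-circle distance in metres between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = coord1
    lat2, lon2 = coord2
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    # The parentheses keep the latitude and longitude terms in one expression.
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return EARTH_RADIUS_M * 2 * atan2(sqrt(a), sqrt(1 - a))


def downsample(points, step=100.0):
    """Keep the first point, one point per `step` metres travelled, and the last point.

    Returns a list of ((lat, lon), metres_travelled_since_previous_kept_point).
    """
    kept = []
    travelled = 0.0
    prev = None
    for point in points:
        if prev is None:
            kept.append((point, 0.0))  # first point of the section
        else:
            travelled += haversine_m(prev, point)
            if travelled >= step:
                kept.append((point, travelled))
                travelled = 0.0
        prev = point
    if prev is not None and travelled > 0.0:
        kept.append((prev, travelled))  # last point was not written yet
    return kept


# Made-up track along the equator, one sample roughly every 55.6 m.
points = [(0.0, 0.0), (0.0, 0.0005), (0.0, 0.001), (0.0, 0.0015)]
kept = downsample(points)
```

Calling downsample once per Line ... EOL section would reproduce the reset of coord1 at each EOL.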
