I have the following case:
I need to get the time of a feature in a CSV file and compare it with the time of the pictures taken by someone.
Then I need to find 2 (or fewer) matches: I will assign to a feature the first two pictures I find within a 2-minute interval of that feature's time.
I managed to create two dictionaries with the details: feature_hours contains the id and time of each feature; photo_hours contains the photo_path and time of each photo.
sorted_features and sorted_photo are two lists holding the sorted items of the two dictionaries.
The problem is that in the output CSV file I only have 84 rows completed and some are blank. The feature CSV file has 199 features. I think I incremented j too often, but I need a clear look from a pro, because I cannot figure it out.
Here is the code:
j=1
sheet1.write(0,71,"id")
sheet1.write(0,72,"feature_time")
sheet1.write(0,73,"Picture1")
sheet1.write(0,74,"Picture_time")
sheet1.write(0,75,"Picture2")
sheet1.write(0,76,"Picture_time")

def write_first_picture():
    sheet1.write(j,71,feature_time[0])
    sheet1.write(j,72,feature_time[1])
    sheet1.write(j,73,photo_time[0])
    sheet1.write(j,74,photo_time[1])

def write_second_picture():
    sheet1.write(j-1,75,photo_time[0])
    sheet1.write(j-1,76,photo_time[1])

def write_pictures():
    if i==1:
        write_first_picture()
    elif i==2:
        write_second_picture()

for feature_time in sorted_features:
    i=0
    for photo_time in sorted_photo:
        if i<2:
            if feature_time[1][0]==photo_time[1][0]:
                if feature_time[1][1]==photo_time[1][1]:
                    if feature_time[1][2]<photo_time[1][2]:
                        i=i+1
                        write_pictures()
                        j=j+1
                elif int(feature_time[1][1])+1==photo_time[1][1]:
                    i=i+1
                    write_pictures()
                    j=j+1
                elif int(feature_time[1][1])+2==photo_time[1][1]:
                    i=i+1
                    write_pictures()
                    j=j+1
            elif int(feature_time[1][0])+1==photo_time[1][0]:
                if feature_time[1][1]>=58:
                    if photo_time[1][1]<=02:
                        i = i+1
                        write_pictures()
                        j=j+1
Edit: Here are examples of the two lists:
Features list: [('-70', ('10', '27', '03')), ('-73', ('10', '29', '50'))]
Photo list: [('20160801_125133-1151969393.jpg', ('12', '52', '04')), ('20160801_125211342753906.jpg', ('12', '52', '16'))]
There is a csv module for Python to help load these files. You could also sort the results to make the checks more efficient and short-circuit them. I cannot really tell what the i and j variables are meant to represent, but I am pretty sure you can do something like the following:
import csv

def hmstoseconds(hhmmss):
    # 60 * 60 seconds in an hour, 60 seconds in a min, 1 second in a second
    # int() is needed because the CSV fields are read as strings
    return sum(int(x)*y for x, y in zip(hhmmss, (60*60, 60, 1)))

features = []
# features model looks like tuple(ID, (HH, MM, SS))
with open("features.csv") as f:
    reader = csv.reader(f)
    # assuming each row is ID, HH, MM, SS
    features = [(row[0], tuple(row[1:4])) for row in reader]

photos = []
# photos model looks like tuple(filename, (HH, MM, SS))
with open("photos.csv") as f:
    reader = csv.reader(f)
    # assuming each row is filename, HH, MM, SS
    photos = [(row[0], tuple(row[1:4])) for row in reader]

for feature in features:
    for photo in photos:
        # convert HH, MM, SS to seconds and find within 2 min (60s * 2)
        # .. todo:: instead of nested for loops, we could use filter()
        if abs(hmstoseconds(feature[1]) - hmstoseconds(photo[1])) <= (60 * 2):
            # the photo was taken within 2 min of the feature
            <here, write a photo>
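As mentioned above, sorting lets you short-circuit the scan instead of checking every pair. A minimal sketch with the standard bisect module, assuming the tuple(ID, (HH, MM, SS)) model and the hmstoseconds() helper above (photos_sorted, keys and candidates are my names, not yours):

import bisect

# photos sorted by time-in-seconds, with keys precomputed for bisect
photos_sorted = sorted(photos, key=lambda p: hmstoseconds(p[1]))
keys = [hmstoseconds(p[1]) for p in photos_sorted]

for feature in features:
    t = hmstoseconds(feature[1])
    lo = bisect.bisect_left(keys, t - 120)   # earliest photo within 2 min
    hi = bisect.bisect_right(keys, t + 120)  # latest photo within 2 min
    candidates = photos_sorted[lo:hi][:2]    # keep at most the first two matches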
In order to make this more maintainable/readable, you could also use namedtuples to better represent the data models:
import csv
from collections import namedtuple

# model definitions to help with readability/maintenance
# if the order of the indices changes or we add more fields, we just need to
# change them directly here instead of tracking the indexes everywhere
Feature = namedtuple("feature", "id, date")
Photo = namedtuple("photo", "file, date")

def hmstoseconds(hhmmss):
    # 60 * 60 seconds in an hour, 60 seconds in a min, 1 second in a second
    return sum(int(x)*y for x, y in zip(hhmmss, (60*60, 60, 1)))

def within_two_min(date1, date2):
    # convert HH, MM, SS to seconds for both dates
    # return whether the absolute difference between them is within 2 min (60s * 2)
    return abs(hmstoseconds(date1) - hmstoseconds(date2)) <= 60 * 2

if __name__ == '__main__':
    # using main here means we avoid any nasty global variables
    # and only execute this code when this file is run directly
    features = []
    with open("features.csv") as f:
        reader = csv.reader(f)
        # assuming each row is id, HH, MM, SS
        features = [Feature(row[0], tuple(row[1:4])) for row in reader]

    photos = []
    with open("photos.csv") as f:
        reader = csv.reader(f)
        # assuming each row is filename, HH, MM, SS
        photos = [Photo(row[0], tuple(row[1:4])) for row in reader]

    for feature in features:
        for photo in photos:
            # .. todo:: instead of nested for loops, we could use filter()
            if within_two_min(feature.date, photo.date):
                <here, write a photo>
Hopefully this gets you moving in the right direction. I don't fully understand what you were trying to do with i and j and the first/second "write_picture" stuff, but I hope this clarifies scope and access in Python.
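One concrete note on that scope issue: your write_* helpers read i and j from the enclosing scope, which is what makes the increments so hard to trace. Passing the row index and the data in explicitly avoids that; a minimal sketch against your sheet1 layout (row_idx is a hypothetical name):

def write_first_picture(row_idx, feature_time, photo_time):
    # write id, feature time, first picture and its time on a fresh row
    sheet1.write(row_idx, 71, feature_time[0])
    sheet1.write(row_idx, 72, feature_time[1])
    sheet1.write(row_idx, 73, photo_time[0])
    sheet1.write(row_idx, 74, photo_time[1])

def write_second_picture(row_idx, photo_time):
    # the second picture goes on the same row as the first one
    sheet1.write(row_idx, 75, photo_time[0])
    sheet1.write(row_idx, 76, photo_time[1])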
I have a Python script which calculates tree heights based on distance and angle from the ground; however, despite the script running with no errors, my heights column is left empty. Also, I don't want to use pandas, and I would like to keep the 'with open' method if possible, before anyone suggests going about it a different way. Any help would be great, thanks. It seems that the whole script runs fine and does everything I need it to until the "for row in csvread:" block.
This is my current script:
#!/usr/bin/env python3
# Import any modules needed
import sys
import csv
import math
import os
import itertools

# Extract command line arguments, remove file extension and attach to output_filename
input_filename1 = sys.argv[1]
input_filename2 = os.path.splitext(input_filename1)[0]
filenames = (input_filename2, "treeheights.csv")
output_filename = "".join(filenames)

def TreeHeight(degrees, distance):
    """
    This function calculates the heights of trees given distance
    of each tree from its base and angle to its top, using the
    trigonometric formula.
    """
    radians = math.radians(degrees)
    height = distance * math.tan(radians)
    print("Tree height is:", height)
    return height

def main(argv):
    with open(input_filename1, 'r') as f:
        with open(output_filename, 'w') as g:
            csvread = csv.reader(f)
            print(csvread)
            csvwrite = csv.writer(g)
            header = csvread.__next__()
            header.append("Height.m")
            csvwrite.writerow(header)
            # Populating the output csv with the input data
            csvwrite.writerows(itertools.islice(csvread, 0, 121))
            for row in csvread:
                height = TreeHeight(csvread[:,2], csvread[:,1])
                row.append(height)
                csvwrite.writerow(row)
    return 0

if __name__ == "__main__":
    status = main(sys.argv)
    sys.exit(status)
Looking at your code, I think you're mostly there, but are a little confused on reading/writing rows:
# Populating the output csv with the input data
csvwrite.writerows(itertools.islice(csvread, 0, 121))
for row in csvread:
    height = TreeHeight(csvread[:,2], csvread[:,1])
    row.append(height)
    csvwrite.writerow(row)
It looks like you're reading rows 1 through 121 and writing them to your new file. Then, you're trying to iterate over your CSV reader in a second pass, compute the height, tack that computed value onto the end of the row, and write it to your CSV in a complete second pass.
If that's true, then you need to understand that CSV reader and writer are not designed to work "left-to-right" like that: read-write these columns, then read-write these columns... nope.
They both work "top-down", processing rows.
I propose, to get this working: iterate every row in one loop, and for every row:
read the values you need from row to compute the height
get the computed height
add the new computed value to the original row
write
...
header = next(csvread)
header.append("Height.m")
csvwrite.writerow(header)

for row in csvread:
    degrees = float(row[1])   # second column for degrees?
    distance = float(row[0])  # first column for distance?
    height = TreeHeight(degrees, distance)
    row.append(height)
    csvwrite.writerow(row)
Some changes I made:
I replaced header = csvread.__next__() with header = next(csvread). Calling things that start with _ or __ is generally discouraged, at least in the standard library. next(<iterator>) is the built-in function that allows you to properly and safely advance through <iterator>.
Added float() conversion to textual values as read from CSV
Also, as far as I can tell, the [:,2]/[:,1] is not valid subscripting for a csv reader (that comma syntax is NumPy-style indexing). You didn't get any errors because the reader was already done/exhausted from the islice() call, so your program never actually stepped into the for row in csvread: loop.
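You can see that exhaustion effect with any iterator (a generic illustration, not your file):

import itertools

it = iter(range(5))
print(list(itertools.islice(it, 0, 5)))  # [0, 1, 2, 3, 4] -- consumes the iterator
print(list(it))                          # [] -- nothing left for a second pass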
Hi, I am trying to learn higher-order functions (HOFs) in Python. I understand their simple uses for reduce, map and filter. But here I need to create a tuple of the stations the bikes came from and went to, with the number of events at those stations as the second value. The commented-out code is this done with normal functions (I left it as a dictionary, but that's easy to convert to a tuple).
But I've been racking my brain for a while and can't get it to work using HOFs. My idea right now is to somehow use map to go through the csvReader and add to the dictionary. For some reason I can't understand what to do here. Any help understanding how to use these functions properly would be appreciated.
import csv

#def stations(reader):
#    Stations = {}
#    for line in reader:
#        startstation = line['start_station_name']
#        endstation = line['end_station_name']
#        Stations[startstation] = Stations.get(startstation, 0) + 1
#        Stations[endstation] = Stations.get(endstation, 0) + 1
#    return Stations

Stations = {}
def line_list(x):
    l = x['start_station_name']
    l2 = x['end_station_name']
    Stations[l] = Stations.get(l, 0) + 1
    Stations[l2] = Stations.get(l2, 0) + 1
    return dict(l,l2)

with open('citibike.csv', 'r') as fi:
    reader = csv.DictReader(fi)
    #for line in reader:
    output = list(map(line_list,reader))
    #print(listmap)
    #output1[:10]
    print(output)
list(map(...)) creates a list of results, not a dictionary.
If you want to fill in a dictionary, you can use reduce(), using the dictionary as the accumulator.
import csv
from functools import reduce

def line_list(Stations, x):
    l = x['start_station_name']
    l2 = x['end_station_name']
    Stations[l] = Stations.get(l, 0) + 1
    Stations[l2] = Stations.get(l2, 0) + 1
    return Stations

with open('citibike.csv', 'r') as fi:
    reader = csv.DictReader(fi)
    result = reduce(line_list, reader, {})

print(result)
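As a design note, the same tally can also be written with collections.Counter from the standard library; it isn't an HOF exercise, but it is the idiomatic tool for counting (a sketch under the same citibike.csv assumption):

import csv
from collections import Counter

with open('citibike.csv', 'r') as fi:
    reader = csv.DictReader(fi)
    result = Counter()
    for line in reader:
        # each row contributes one event at the start and one at the end station
        result.update([line['start_station_name'], line['end_station_name']])

print(result)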
Introduction
Firstly, thank you so much for taking the time to look at my question and code. I know my coding needs improving, but as with all new things it takes time to perfect.
Background
I am making use of different functions to do the following:
Import multiple files (3 for now). Each file contains time, pressure and void columns.
Store the lists in dictionaries, e.g. all the pressures in one pressure dictionary.
Filter the pressure data and ensure that I still have the corresponding number of time data points (for each file).
I call these functions in the main function.
Problem
Everything runs perfectly until I run the time loop in the FilterData function for a second time. However, I did check that I can access all three different lists for pressure and time in the dictionaries. Why do I get 'index out of range' at the line t=timeFROMtimes[i] when the function runs a second time? This is the output I get for the code.
Code
#IMPORT LIBRARIES
import glob
import pandas as pd
import matplotlib.pyplot as plt, numpy as np
from scipy.interpolate import spline

#INPUT PARAMETERS
max_iter = 3 #maximum number of iterations
i = 0 #starting number of iteration
tcounter = 0
number_of_files = 3 #number of files in directory
sourcePath = 'C:\\Users\\*' #not displaying actual directory
list_of_source_files = glob.glob(sourcePath + '/*.TXT')
pressures = {} #initialize a dictionary
times = {} #initialize a dictionary

#GET SPECIFIC FILE & PRESSURE, TIME VALUES
def Get_source_files(list_of_source_files,i):
    #print('Get_source_files')
    with open(list_of_source_files[i]) as source_file:
        print("file entered:",i+1)
        lst = []
        for line in source_file:
            lst.append([ float(x) for x in line.split()])
        time = np.array([ x[0] for x in lst]) #first column in file, made into an array
        void = np.array([ x[1] for x in lst]) #second column in file, made into an array
        pressure = (np.array([ x[2] for x in lst]))*-1 #etc. & change the sign of the imported pressure data
    return pressure,time

#SAVE THE TIME AND PRESSURE IN DICTIONARIES
def SaveInDictionary (Pressure,time,i):
    #print('SaveInDictionary')
    pressures[i]=[Pressure]
    times[i]=[time]
    return pressures,times

#FILTER PRESSURE DATA AND ADJUST TIME
def FilterData (pressureFROMpressures,timeFROMtimes,i):
    print('FilterData')
    t = timeFROMtimes[i]
    p = pressureFROMpressures[i]
    data_points = int((t[0]-t[-1])/-0.02) #determine nr of data points per column of the imported file
    filtered_pressure = [] #create an empty list for the filtered pressure
    pcounter = 0 #initiate a counter
    tcounter = 0
    time_new = [0,0.02] #these initial values are needed for the for-loop
    for j in range(data_points):
        if p[j]<8: #filter out all garbage pressure data points above 8 bar
            if p[j]>0:
                filtered_pressure.append(p[j]) #save all the new pressure values in filtered_pressure
                pcounter += 1
    for i in range(pcounter-1):
        time_new[0]=0
        time_new[i+1]=time_new[i]+0.02 #add 0.02 to the previous value
        time_new.append(time) #append the time list
        tcounter += 1 #increment the counter
    del time_new[-1] #somehow a list of the entire time is formed at the end that should be removed
    return filtered_pressure, time_new

#MAIN!!
P = list()
while (i <= number_of_files and i != max_iter):
    pressure,time = Get_source_files(list_of_source_files,i) #return pressure and time from specific file
    pressures, times = SaveInDictionary (pressure,time,i) #save pressure and time in the dictionaries
    i += 1 #increment i to loop

print('P1=',pressures[0])
print('P2=',pressures[1])
print('P3=',pressures[2])
print('t1=',times[0])
print('t2=',times[1])
print('t3=',times[2])

#I took this out of the main function to check my error:
for i in range(2):
    filtered_pressure,changed_time = FilterData(pressures[i],times[i],i)
    finalP, finalT = SaveInDictionary (filtered_pressure,changed_time,i) #save pressure and time in the dictionaries
In the loop you took out of the main function, you already index into times:

#I took this out of the main function to check my error:
for i in range(2):
    filtered_pressure,changed_time = FilterData(pressures[i],times[i],i)
    # Here you call FilterData with times[i]

but in the FilterData function you index the passed variable timeFROMtimes with i again:

def FilterData (pressureFROMpressures,timeFROMtimes,i):
    print('FilterData')
    t=timeFROMtimes[i] # <-- right here

This seems odd. Try removing one of the index operators ([i]), and do the same for pressures.
Edit
As @strubbly pointed out, in your SaveInDictionary function you assign the value [Pressure]. The square brackets denote a list with one element, the numpy array Pressure. This can be seen in your error message, by the opening bracket before the arrays when you print P1-3 and t1-3.
To save your numpy array directly, lose the brackets:
def save_in_dictionary(pressure_array, time, i):
    pressures[i] = pressure_array
    times[i] = time
    return pressures, times
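Putting both fixes together, the call site would look something like this (a sketch, not your full filtering logic; filter_data is a hypothetical renaming):

def filter_data(pressure_array, time_array):
    # the caller already indexed with [i], so no second [i] in here
    # (time_array kept to mirror the original signature)
    filtered_pressure = [p for p in pressure_array if 0 < p < 8]  # same 0-8 bar filter
    time_new = [0.02 * k for k in range(len(filtered_pressure))]  # rebuilt 0.02 s time axis
    return filtered_pressure, time_new

for i in range(3):
    filtered_pressure, changed_time = filter_data(pressures[i], times[i])
    save_in_dictionary(filtered_pressure, changed_time, i)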
I need to process over 10 million spectroscopic data sets. The data is structured like this: there are around 1000 .fits files (.fits is a data storage format), each containing around 600-1000 spectra with around 4500 elements each (so each file yields a ~1000*4500 matrix). That means each spectrum is going to be read around 10 times (or each file around 10,000 times) if I loop over the 10 million entries. Although the same spectrum is read around 10 times, it is not duplicate work, because each time I extract different segments of the same spectrum. With the help of @Paul Panzer, I already avoid reading the same file multiple times.
I have a catalog file which contains all the information I need, like the coordinates x, y, the radius r, the strength s, etc. The catalog also contains the information to target which file I am going to read (identified by n1, n2) and which spectra in that file I am going to use (identified by n3).
The code I have now is:
import numpy as np
from itertools import izip
import itertools
import fitsio

x = []
y = []
r = []
s = []
n1 = []
n2 = []
n3 = []
with open('spectra_ID.dat') as file_ID, open('catalog.txt') as file_c:
    for line1, line2 in izip(file_ID,file_c):
        parts1 = line1.split()
        parts2 = line2.split()
        n1.append(int(parts1[0]))
        n2.append(int(parts1[1]))
        n3.append(int(parts1[2]))
        x.append(float(parts2[0]))
        y.append(float(parts2[1]))
        r.append(float(parts2[2]))
        s.append(float(parts2[3]))

def data_analysis(n_galaxies):
    n_num = 0
    data = np.zeros((n_galaxies), dtype=[('spec','f4',(200)),('x','f8'),('y','f8'),('r','f8'),('s','f8')])
    idx = np.lexsort((n3,n2,n1))
    for kk,gg in itertools.groupby(zip(idx, n1[idx], n2[idx]), lambda x: x[1:]):
        filename = "../../data/" + str(kk[0]) + "/spPlate-" + str(kk[0]) + "-" + str(kk[1]) + ".fits"
        fits_spectra = fitsio.FITS(filename)
        fluxx = fits_spectra[0].read()
        n_element = fluxx.shape[1]
        hdu = fits_spectra[0].read_header()
        wave_start = hdu['CRVAL1']
        logwave = wave_start + 0.0001 * np.arange(n_element)
        wavegrid = np.power(10,logwave)
        for ss, plate1, mjd1 in gg:
            if n_num % 1000000 == 0:
                print n_num
            n3new = n3[ss]-1
            flux = fluxx[n3new]
            ### following is my data reduction of individual spectra, I will skip here
            ### After all my analysis, I have the data storage as below:
            data['spec'][n_num] = flux_intplt
            data['x'][n_num] = x[ss]
            data['y'][n_num] = y[ss]
            data['r'][n_num] = r[ss]
            data['s'][n_num] = s[ss]
            n_num += 1
    print n_num
    data_output = FITS('./analyzedDATA/data_ALL.fits','rw')
    data_output.write(data)
I understand that multiprocessing needs me to remove one loop and pass the index to the function instead. However, there are two loops in my function and they are highly correlated, so I do not know how to approach this. Since the most time-consuming part of this code is reading files from disk, multiprocessing would need to take full advantage of the cores to read multiple files at once. Could anyone shed some light on this?
Get rid of global vars; you can't use global vars with processes.
Merge your multiple global vars into one container class or dict, assigning the different segments of the same spectra into one data set.
Move your global with open(... into a def ....
Separate data_output into its own def ....
Try first, without multiprocessing, this concept:
for line1, line2 in izip(file_ID, file_c):
    data_set = create data set from (line1, line2)
    result = data_analysis(data_set)
    data_output.write(result)
Consider using 2 processes: one for file reading and one for file writing.
Use multiprocessing.Pool(processes=n) for data_analysis.
Communicate between the processes using multiprocessing.Manager().Queue().
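A minimal sketch of that layout (all names hypothetical; data_analysis here is a stand-in for your reduction): a single writer process drains a queue while a Pool runs the analysis:

import multiprocessing as mp

def data_analysis(data_set):
    # stand-in for the real per-spectrum reduction
    return data_set

def writer(queue, path):
    # single writer process: drain the queue until the sentinel arrives
    with open(path, "w") as out:
        while True:
            result = queue.get()
            if result is None:
                break
            out.write(str(result) + "\n")

if __name__ == "__main__":
    manager = mp.Manager()
    queue = manager.Queue()
    w = mp.Process(target=writer, args=(queue, "data_ALL.txt"))
    w.start()

    data_sets = range(10)  # stand-in for data sets built from the two catalog files
    pool = mp.Pool(processes=4)
    for result in pool.imap_unordered(data_analysis, data_sets):
        queue.put(result)
    pool.close()
    pool.join()

    queue.put(None)  # sentinel: tell the writer to stop
    w.join()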
Context:
I have a large CSV file (30 MB now, but later it may be over a gigabyte, with 185 lines) that has to be searched for some value (each element of times) chunk by chunk (chunks of 6 values of the CSV) by rows, and if found, written to another file. That is: get one element from the sorted times deque, search another deque (i.e. rdr = deque(reder)) 6 elements of rdr at a time, and if found, write to the file and continue with the next element of the times deque.
Problem:
I have already written code that does the work perfectly, but it is very slow (8 hrs). I want better performance. I thought of multiprocessing, but I am not getting anywhere, and thus seek help. I used a function ddd that gets all its arguments from the calling scope except times1, which I pass explicitly.
Code I tried:
# imports implied by the snippet
import ast
import csv
from collections import deque
from itertools import islice
from multiprocessing import Pool

dim = [0,76,'1.040000',1,1,'1.000000']+min_max_ret(X,Y)
times = deque(sorted(list(timestep),key=lambda x:ast.literal_eval(x)))

def ddd(times1): # outfl, rdr, acc_ret, FR_XY, width, length are all taken from the calling scope
    for tim in times1:
        time = ['{0:.6f}'.format(ast.literal_eval(tim)/1000.000000)]
        outfl.writelines([u'2 ********* TIMESTEP']+['\n']+time+['\n'])
        for index,line in enumerate(rdr):
            if index!=0:
                cnt = 8
                for counter in [qq for qq in [line[jj:jj+6] for jj in range(8,len(line),6)] if len(qq)==6]:
                    counter = map(unicode.strip,counter)
                    if counter[5]==tim:
                        cr_id = line[0]
                        acc = '{0:.6f}'.format(acc_ret(counter[3], counter[4]))
                        car_ltlng = map(unicode.strip,[line[cnt],line[cnt+1],line[cnt+6],line[cnt+7]])
                        xy = FR_XY(*car_ltlng)
                        data = [3]+[cr_id]+[1,1]+xy+[length,width]+[counter[2]]+[acc]
                        outfl.writelines([unicode(ww).strip()+'\n' for ww in data])
                        cnt+=6
        print "Time is %s is completed"%tim

with open(r"C:\my_output_ascii_14Dec.trj",'w') as outfl:
    with open(fl,'r') as inf:
        reder = csv.reader(inf,delimiter=';')
        rdr = deque(reder)
        outfl.writelines([str(w)+'\n' for w in dim])
        p = Pool(5)
        p.map(ddd,times) #[[xx for xx in islice(times,ii,ii+10)] for ii in range(0,len(times))])
Sample csv content:
car_id; car_type; entry_gate; entry_time(ms); exit_gate; exit_time(ms); traveled_dist(m); avg_speed(m/s); trajectory(x[m];y[m];speed[m/s];a_tangential[ms-2];a_lateral[ms-2];timestamp[ms];)
24; Bus; 25; 4300.00; 26; 48520.00; 118.47; 2.678999; 509552.78; 5039855.59; 10.0740; 0.4290; 0.2012; 0.0000; 509552.97; 5039855.57; 10.0821; 0.3853; 0.2183; 20.0000; 509553.17; 5039855.55; 10.0886; 0.2636; 0.2356; 40.0000; 509553.37; 5039855.53; 10.0927; 0.1420; 0.2532; 60.0000; 509553.57; 5039855.51; 10.0943; 0.0203; 0.2710; 80.0000; 509553.76; 5039855.48; 10.0935; -0.1014; 0.2890; 100.0000; 509553.96; 5039855.46; 10.0902; -0.2231; 0.3073; 120.0000; 509554.16; 5039855.44; 10.0846; -0.3448; 0.3257; 140.0000; 509554.36; 5039855.42; 10.0765; -0.4665; 0.3444; 160.0000; 509554.56; 5039855.40; 10.0659; -0.5881; 0.3633; 180.0000; 509554.76; 5039855.37; 10.0529; -0.7098; 0.3823; 200.0000; 509554.96; 5039855.35; 10.0375; -0.8315; 0.4016; 220.0000; 509555.17; 5039855.33; 10.0197; -0.9532; 0.4211; 240.0000; 509555.37;
A partial CSV file is available here.