live sensor reading store in a data file - python

I have an acceleration sensor that continuously outputs readings at 400 Hz (like [0.21511 0.1451 0.2122]). I want to store them and post-process them. Right now I am only able to store the first reading, not all of them.
How can I make this work?
from altimu10v5.lsm6ds33 import LSM6DS33
from time import sleep
import numpy as np
lsm6ds33 = LSM6DS33()
accel = lsm6ds33.get_accelerometer_g_forces()
while True:
    DataOut = np.column_stack(accel)
    np.savetxt('output.dat', np.expand_dims(accel, axis=0), fmt='%2.2f %2.2f %2.2f')

The actual problem is that you are calling get_accelerometer_g_forces() only once.
Just move it inside the while loop:
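f = open('output.dat', 'ab')  # open once, in append mode
while True:
    accel = lsm6ds33.get_accelerometer_g_forces()  # read a fresh sample each pass
    DataOut = np.column_stack(accel)
    np.savetxt(f, np.expand_dims(accel, axis=0), fmt='%2.2f %2.2f %2.2f')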
Here is a reference: How to write a numpy array to a csv file?

Make sure that reading the data happens inside the loop!
You don't need numpy here yet:
with open("output.dat", "w") as f:  # open once; reopening with "w" inside the loop would overwrite the file
    while True:
        f.write("%.5f, %.5f, %.5f\n" % tuple(lsm6ds33.get_accelerometer_g_forces()))
Note that there is no condition to stop outputting the data.
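To make the stop condition concrete, here is a minimal sketch that records a fixed number of samples and writes them in batches. The names record_samples, fake_sensor and flush_every are hypothetical, and fake_sensor only stands in for the real get_accelerometer_g_forces() call so that the example runs without hardware.

```python
import os
import tempfile


def record_samples(read_sample, path, n_samples, flush_every=100):
    """Append n_samples readings to path, one comma-separated line each."""
    buffered = []
    written = 0
    with open(path, "a") as f:  # open once, outside the sampling loop
        for _ in range(n_samples):
            x, y, z = read_sample()
            buffered.append("%.5f, %.5f, %.5f\n" % (x, y, z))
            if len(buffered) >= flush_every:
                f.writelines(buffered)  # one write call per batch
                written += len(buffered)
                buffered.clear()
        f.writelines(buffered)  # write whatever is left over
        written += len(buffered)
    return written


def fake_sensor():
    # Hypothetical stand-in for lsm6ds33.get_accelerometer_g_forces()
    return (0.21511, 0.1451, 0.2122)


out_path = os.path.join(tempfile.mkdtemp(), "output.dat")
count = record_samples(fake_sensor, out_path, n_samples=250)
```

With the real sensor you would pass lsm6ds33.get_accelerometer_g_forces as read_sample. Whether batching is needed at 400 Hz depends on the hardware, so treat flush_every as a tuning knob rather than a requirement.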


Losing some samples while writing to file in Python

I'm continuously getting readings from an ADC in Python, but while writing them to a file I lose some samples because of a small delay. Is there a way I could avoid losing these samples (I'm sampling at 100 Hz)?
I'm using multithreading, but in the process of writing and cleaning the list used to write the data to a file, I always lose some samples. The code is copied here as I have written it and all advice is welcome.
Thanks in advance.
import threading
import time
from random import randint
import os
from datetime import datetime
import ADS1256
import RPi.GPIO as GPIO
import sys
import os
import csv
#adc_reading function reads adc values and writes a list continuously.
def adc_reading():
global value_list
while True:
#function to create a new file every 60 seconds with the values gathered in adc_reading()
def cronometro():
global value_list
while True:
while diferencia<=contador:
#write_to_file() function writes the values gathered in adc_reading() to a file every 60 seconds.
def write_to_file(lista):
with open(nombre_archivo, 'w') as f:
# using csv.writer method from CSV package
write = csv.writer(f)
escritor = threading.Thread(target=adc_reading)
temporizador = threading.Thread(target=cronometro)
At 100 Hz, I have to wonder whether the write operation really takes longer than 10 ms. You could probably do both operations in the same loop: collect the data in a buffer and write it (about 6000 values) once every 60 seconds, without incurring more than a few milliseconds of delay:
import time
import ADS1256
import csv
ADC = ADS1256.ADS1256()
def adc_reading():
    buffer = []
    contador = 60
    while True:
        check = inicio = time.time()
        while check - inicio <= contador:
            adc_value = ADC.ADS1256_GetAll()
            buffer.append([(check := time.time()), *adc_value[1:4]])
        nombre_archivo = str(int(check)) + ".finish"
        with open(nombre_archivo, 'w') as f:
            write = csv.writer(f)
            write.writerows(buffer)
        buffer = []
if __name__ == '__main__':
    adc_reading()
If you do need them to run in parallel (slow computer, other circumstances), you shouldn't use threads, but processes from multiprocessing.
The two threads won't run in parallel, they will alternate. You could run the data collection in a separate process and collect data from that from the main process.
Here's an example of doing this with some toy code, I think it's easy to see how to adjust for your case:
from multiprocessing import SimpleQueue, Process
from random import randint
from time import sleep, time
def generate_signals(q: SimpleQueue):
c = 0
while True:
sleep(0.01) # about 100 Hz
q.put((c, randint(1, 42)))
c += 1
def write_signals(q: SimpleQueue):
delay = 3 # 3 seconds for demo, 60 works as well
while True:
start = time()
while (check := time()) - start < delay:
values = []
while not q.empty():
with open(f'{str(int(check))}.finish', 'w') as f:
if __name__ == "__main__":
q = SimpleQueue()
generator = Process(target=generate_signals, args=((q),))
writer = Process(target=write_signals, args=((q),))
writer.join(timeout=10) # run for no more than 10 seconds, enough for demo
Edit: added a counter, to show that no values are missed.
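If you would rather keep the threaded layout from the question, the samples are most likely lost between copying the shared list and clearing it. The sketch below swaps in a different technique from the multiprocessing answer above: two threads and a thread-safe queue.Queue, so nothing is ever copied and then cleared. The names producer, consumer and the batch size are hypothetical, and the "ADC read" is a dummy value.

```python
import queue
import threading


def producer(q, n_samples):
    """Stand-in for the ADC loop: puts (counter, value) pairs on the queue."""
    for c in range(n_samples):
        q.put((c, c * 2))  # dummy value instead of a real ADC read
    q.put(None)  # sentinel: tells the consumer there is no more data


def consumer(q, batch_size, batches):
    """Collects samples into batches; each batch would become one file."""
    batch = []
    while True:
        item = q.get()  # blocks until a sample is available
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []  # start a new list; the queue itself is never cleared
    if batch:
        batches.append(batch)


sample_queue = queue.Queue()
batches = []
reader = threading.Thread(target=producer, args=(sample_queue, 1000))
writer = threading.Thread(target=consumer, args=(sample_queue, 300, batches))
reader.start()
writer.start()
reader.join()
writer.join()

# The counters show whether any sample went missing.
counters = [c for batch in batches for c, _ in batch]
```

Every sample passes through the queue exactly once, so the counters come out gapless. This says nothing about timing jitter in the acquisition thread itself, which is the reason the answer above suggests separate processes.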

Continuously reading and plotting a CSV file using Python

this is my first time asking something here, so I hope I am asking the following question the "correct way". If not, please let me know, and I will give more information.
I am using one Python script, to read and write 4000Hz of serial data to a CSV file.
The structure of the CSV file is as follows: (this example shows the beginning of the file)
Time of mSure Calibration: 24.10.2020 20:03:14.462654
Calibration Data - AICC: 833.95; AICERT: 2109; AVCC: 0.00; AVCERT: 0
Sampling Frequency: 4000Hz
24.10.2020 20:03:16.495828,-0.00032,7e-05,-0.0,0.0
24.10.2020 20:03:16.496078,0.001424,7e-05,0.0,0.0
24.10.2020 20:03:16.496328,9.6e-05,7e-05,0.0,0.0
24.10.2020 20:03:16.496578,-0.000912,7e-05,-0.0,0.0
Data will be written to this CSV as long as the script reading serial data is active. Thus, this might become a huge file at some time. (Data is written in chunks of 8000 rows = every two seconds)
Here is my problem: I want to plot this data live. For example, update the plot each time data is written to the CSV file. The plotting shall be done from another script than the script reading and writing the serial data.
What is working: 1. Creating the CSV file. 2. Plotting a finished CSV file using another script - actually pretty well :-)
I have this script for plotting:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Data Computation Software for TeensyDAQ - Reads and computes CSV-File"""
# region imports
import getopt
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import pathlib
from scipy.signal import argrelextrema
import sys
# endregion
# region globals
inputfile = ''
outputfile = ''
# endregion
# region functions
def main(argv):
"""Main application"""
# region define variables
global inputfile
global outputfile
inputfile = str(pathlib.Path(__file__).parent.absolute(
outputfile = str(pathlib.Path(__file__).parent.absolute(
# endregion
# region read system arguments
opts, args = getopt.getopt(
argv, "hi:o:", ["infile=", "outfile="])
except getopt.GetoptError:
print(' -i <inputfile> -o <outputfile>')
for opt, arg in opts:
if opt == '-h':
print(' -i <inputfile> -o <outputfile>')
elif opt in ("-i", "--infile"):
inputfile = str(pathlib.Path(
elif opt in ("-o", "--outfile"):
outputfile = str(pathlib.Path(
# endregion
# region read csv
colTypes = {'timestamp': 'str',
'instantaneousCurrent': 'float',
'instantaneousVoltage': 'float',
'activePowerCalculated': 'float',
'activePower': 'float',
'apparentPower': 'float',
'fundReactivePower': 'float'
cols = list(colTypes.keys())
df = pd.read_csv(inputfile, usecols=cols, dtype=colTypes,
parse_dates=True, dayfirst=True, skiprows=3)
df['timestamp'] = pd.to_datetime(
df['timestamp'], utc=True, format='%d.%m.%Y %H:%M:%S.%f')
df.insert(loc=0, column='tick', value=np.arange(len(df)))
# endregion
# region plot data
fig, axes = plt.subplots(nrows=6, ncols=1, sharex=True, figsize=(16,8))
df['instantaneousCurrent'].plot(ax=axes[0], color='red'); axes[0].set_title('Momentanstrom'); axes[0].set_ylabel('A',rotation=0)
df['instantaneousVoltage'].plot(ax=axes[1], color='blue'); axes[1].set_title('Momentanspannung'); axes[1].set_ylabel('V',rotation=0)
df['activePowerCalculated'].plot(ax=axes[2], color='green'); axes[2].set_title('Momentanleistung ungefiltert'); axes[2].set_ylabel('W',rotation=0)
df['activePower'].plot(ax=axes[3], color='brown'); axes[3].set_title('Momentanleistung'); axes[3].set_ylabel('W',rotation=0)
df['apparentPower'].plot(ax=axes[4], color='brown'); axes[4].set_title('Scheinleistung'); axes[4].set_ylabel('VA',rotation=0)
df['fundReactivePower'].plot(ax=axes[5], color='brown'); axes[5].set_title('Blindleitsung'); axes[5].set_ylabel('VAr',rotation=0); axes[5].set_xlabel('microseconds since start')
# endregion
# endregion
if __name__ == "__main__":
My thoughts on how to solve my problem:
Modify my plotting script to continuously read the CSV file and plot using the animation function of matplotlib.
Using some sort of streaming functionality to read the CSV in a stream. I have read about the streamz library, but I have no idea how I could use it.
Any help is highly appreciated!
Kind regards,
EDIT 31.10.2020:
Since I don't know how long it typically takes to get help, I am adding more information that may lead to helpful comments.
I wrote this script to write data continuously to a CSV file, which emulates my real script without the need for external hardware: (Random data is produced and CSV-formatted using a timer. Each time there are 50 new rows, the data is written to a CSV file)
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import csv
from random import randrange
import time
import threading
import pathlib
from datetime import datetime, timedelta
datarows = list()
datarowsToWrite = list()
outputfile = str(pathlib.Path(__file__).parent.absolute().resolve()) + "\\noFilenameProvided.csv"
sampleCount = 0
def startBatchWriteThread():
global outputfile
global datarows
global datarowsToWrite
datarowsToWrite = datarows[:]
thread = threading.Thread(target=batchWriteData,args=(outputfile, datarowsToWrite))
def batchWriteData(file, data):
print("Items to write: " + str(len(data)))
with open(file, 'a+') as f:
for item in data:
f.write("%s\n" % item)
def generateDatarows():
global sampleCount
timer1 = threading.Timer(0.001, generateDatarows)
timer1.daemon = True
datarow = datetime.now().strftime("%d.%m.%Y %H:%M:%S.%f")[:] + "," + str(randrange(10)) + "," + str(randrange(10)) + "," + str(randrange(10)) + "," + str(randrange(10)) + "," + str(randrange(10)) + "," + str(randrange(10))
sampleCount += 1
datarows.append("row 1")
datarows.append("row 2")
datarows.append("row 3")
while True:
if len(datarows) == 50:
except KeyboardInterrupt:
print("Shutting down, writing the rest of the buffer.")
batchWriteData(outputfile, datarows)
print("Done, writing " + outputfile)
The script from my initial post can then plot the data from the CSV file.
I need to plot the data as it is written to the CSV file to see the data more or less live.
Hope this makes my problem more understandable.
For the Googlers: I could not find a way to achieve my goal as described in the question.
However, if you are trying to plot live data, coming with high speed over serial comms (4000Hz in my case), I recommend designing your application as a single program with multiple processes.
The problem in my particular case was that when I tried to plot and process the incoming data simultaneously in the same thread/task/process/whatever, my serial receive rate dropped from 4 kHz to 100 Hz. I resolved the problem by using multiprocessing and passing data between the processes with the quick_queue module.
I ended up with a program that receives data from a Teensy via serial communication at 4 kHz. The incoming data is buffered into blocks of 4000 samples, each block is pushed to the plotting process, and the block is additionally written to a CSV file in a separate thread.
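For readers who still want to follow a growing CSV file from a second script, one possible building block is a reader that remembers its byte offset and returns only the complete lines appended since the last call. This is a sketch under assumptions, not what the program described above does: read_new_rows is a hypothetical helper, and it assumes the writer only ever appends to the file.

```python
import os
import tempfile


def read_new_rows(path, offset):
    """Return (rows, new_offset) for complete lines appended after offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Keep only complete lines; a partially written last line is left
    # for the next call by not advancing the offset past it.
    end = data.rfind(b"\n") + 1
    complete = data[:end]
    rows = [line.decode("utf-8").split(",")
            for line in complete.splitlines() if line]
    return rows, offset + end


csv_path = os.path.join(tempfile.mkdtemp(), "live.csv")

# Simulate the writer: two full rows and one row cut off mid-write.
with open(csv_path, "w", newline="") as f:
    f.write("t0,1,2\nt1,3,4\nt2,5")
first, pos = read_new_rows(csv_path, 0)

# The writer finishes the partial row and appends another one.
with open(csv_path, "a", newline="") as f:
    f.write(",6\nt3,7,8\n")
second, pos = read_new_rows(csv_path, pos)
```

A plotting script could call read_new_rows from a timer or from a matplotlib animation callback and append the returned rows to its plot data. At 4000 Hz the author's experience above suggests that passing data between processes directly scales better than polling a file.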

H5Py and storage

I am writing some code which needs to save a very large numpy array to memory. The numpy array is so large in fact that I cannot load it all into memory at once. But I can calculate the array in chunks. I.e. my code looks something like:
for i in np.arange(numberOfChunks):
myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = #... do some calculation
As I can't load myArray into memory all at once, I want to save it to a file one "chunk" at a time. i.e. I want to do something like this:
for i in np.arange(numberOfChunks):
myArrayChunk = #... do some calculation to obtain chunk
saveToFile(myArrayChunk, indicesInFile=[(i*chunkSize):(i*(chunkSize+1)),:,:], filename)
I understand this can be done with h5py but I am a little confused how to do this. My current understanding is that I can do this:
import h5py
# Make the file
h5py_file = h5py.File(filename, "a")
# Tell it we are going to store a dataset
myArray = h5py_file.create_dataset("myArray", myArrayDimensions, compression="gzip")
for i in np.arange(numberOfChunks):
myArrayChunk = #... do some calculation to obtain chunk
myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = myArrayChunk
But this is where I become a little confused. I have read that if you index a h5py datatype like I did when I wrote myArray[(i*chunkSize):(i*(chunkSize+1)),:,:], then this part of myArray has now been read into memory. So surely, by the end of my loop above, have I not still got the whole of myArray in memory now? How has this saved my memory?
Similarly, later on, I would like to read in my file back in one chunk at a time, doing further calculation. i.e. I would like to do something like:
import h5py
# Read in the file
h5py_file = h5py.File(filename, "a")
# Read in myArray
myArray = h5py_file['myArray']
for i in np.arange(numberOfChunks):
# Read in chunk
myArrayChunk = myArray[(i*chunkSize):(i*(chunkSize+1)),:,:]
# ... Do some calculation on myArrayChunk
But by the end of this loop is the whole of myArray now in memory? I am a little confused by when myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] is in memory and when it isn't. Please could someone explain this.
You have the basic idea. Take care when saying "save to memory". NumPy arrays are saved in memory (RAM). HDF5 data is saved on disk (not to memory/RAM!), then accessed (memory used depends on how you access). In the first step you are creating and writing data in chunks to the disk. In the second step you are accessing data from disk in chunks. Working example provided at the end.
When reading data with h5py there are 2 ways to read the data:
This returns a NumPy array:
myArrayNP = myArray[:,:,:]
This returns a h5py dataset object that operates like a NumPy array:
myArrayDS = myArray
The difference: h5py dataset objects are not read into memory all at once. You can then slice them as needed. Continuing from above, this is a valid operation to get a subset of the data:
myArrayChunkNP = myArrayDS[(i*chunkSize):((i+1)*chunkSize),:,:]
My example also corrects 1 small error in your chunksize increment equation.
You had:
myArray[(i*chunkSize):(i*(chunkSize+1)),:,:] = myArrayChunk
You want:
myArray[(i*chunkSize):((i+1)*chunkSize),:,:] = myArrayChunk
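To see why the increment matters, this small sketch prints the (start, stop) pair that each formula produces for three chunks of size 4. The helper name chunk_bounds is only for illustration.

```python
def chunk_bounds(i, chunk_size):
    """Start and stop index of chunk i when chunks are laid end to end."""
    return i * chunk_size, (i + 1) * chunk_size


chunk_size = 4

# Corrected formula: stop = (i + 1) * chunk_size
correct = [chunk_bounds(i, chunk_size) for i in range(3)]

# Formula from the question: stop = i * (chunk_size + 1)
wrong = [(i * chunk_size, i * (chunk_size + 1)) for i in range(3)]

print(correct)  # [(0, 4), (4, 8), (8, 12)]
print(wrong)    # [(0, 0), (4, 5), (8, 10)]
```

With the original formula the first slice is empty and the later slices are shorter than chunkSize, so assigning a full chunk to them would not fit.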
Working Example (writes and reads):
import h5py
import numpy as np
# Make the file
with h5py.File("SO_61173314.h5", "w") as h5w:
    numberOfChunks = 3
    chunkSize = 4
    print( 'WRITING %d chunks with w/ chunkSize=%d ' % (numberOfChunks,chunkSize) )
    # Write dataset to disk
    h5Array = h5w.create_dataset("myArray", (numberOfChunks*chunkSize,2,2), compression="gzip")
    for i in range(numberOfChunks):
        h5ArrayChunk = np.random.random(chunkSize*2*2).reshape(chunkSize,2,2)
        print (h5ArrayChunk)
        h5Array[(i*chunkSize):((i+1)*chunkSize),:,:] = h5ArrayChunk
with h5py.File("SO_61173314.h5", "r") as h5r:
    print( '\nREADING %d chunks with w/ chunkSize=%d\n' % (numberOfChunks,chunkSize) )
    # Access myArray dataset - Note: This is NOT a NumPy array
    myArray = h5r['myArray']
    for i in range(numberOfChunks):
        # Read a chunk into memory (as a NumPy array)
        myArrayChunk = myArray[(i*chunkSize):((i+1)*chunkSize),:,:]
        # ... Do some calculation on myArrayChunk
        print (myArrayChunk)

How do I improve the speed of this parser using python?

I am currently parsing historic delay data from a public transport network in Sweden. I have ~5700 files (one from every 15 seconds) from the 27th of January containing momentary delay data for vehicles on active trips in the network. It's, unfortunately, a lot of overhead / duplicate data, so I want to parse out the relevant stuff to do visualizations on it.
However, when I try to parse and filter out the relevant delay data on a trip level using the script below it performs really slow. It has been running for over 1,5 hours now (on my 2019 Macbook Pro 15') and isn't finished yet.
How can I optimize / improve this python parser?
Or should I reduce the number of files, i.e. the frequency of the data collection, for this task?
Thank you so much in advance. 💗
from google.transit import gtfs_realtime_pb2
import gzip
import os
import datetime
import csv
import numpy as np
directory = '../data/tripu/27/'
datapoints = np.zeros((0,3), int)
read_trips = set()
# Loop through all files in directory
for filename in os.listdir(directory)[::3]:
# Uncompress and parse protobuff-file using gtfs_realtime_pb2
with gzip.open(directory + filename, 'rb') as file:
response = file.read()
feed = gtfs_realtime_pb2.FeedMessage()
print("Filename: " + filename, "Total entities: " + str(len(feed.entity)))
for trip in feed.entity:
if trip.trip_update.trip.trip_id not in read_trips:
if len(trip.trip_update.stop_time_update) == len(stopsOnTrip[trip.trip_update.trip.trip_id]):
print("\t","Adding delays for",len(trip.trip_update.stop_time_update),"stops, on trip_id",trip.trip_update.trip.trip_id)
for i, stop_time_update in enumerate(trip.trip_update.stop_time_update[:-1]):
# Store the delay data point (arrival difference of two ascending nodes)
delay = int(trip.trip_update.stop_time_update[i+1].arrival.delay-trip.trip_update.stop_time_update[i].arrival.delay)
# Store contextual metadata (timestamp and edgeID) for the unique delay data point
ts = int(trip.trip_update.stop_time_update[i+1].arrival.time)
key = int(str(trip.trip_update.stop_time_update[i].stop_id) + str(trip.trip_update.stop_time_update[i+1].stop_id))
# Append data to numpy array
datapoints = np.append(datapoints, np.array([[key,ts,delay]]), axis=0)
except KeyError:
except OSError:
I suspect the problem here is repeatedly calling np.append to add a new row to a numpy array. Because the size of a numpy array is fixed when it is created, np.append() must create a new array, which means that it has to copy the previous array. On each loop, the array is bigger and so all these copies add a quadratic factor to your execution time. This becomes significant when the array is quite big (which apparently it is in your application).
As an alternative, you could just create an ordinary Python list of tuples, and then if necessary convert that to a complete numpy array at the end.
That is (only the modified lines):
datapoints = []
# ...
# ...
npdata = np.array(datapoints, dtype=int)
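As a self-contained illustration of the two approaches (with made-up rows rather than the GTFS data), the sketch below builds the same array once via a Python list and once via repeated np.append. The function names and row values are hypothetical.

```python
import numpy as np


def build_with_list(n):
    """Collect rows in a Python list, convert to an array once at the end."""
    rows = []
    for i in range(n):
        rows.append((i, 1000 + i, i % 7))  # appending to a list is amortized O(1)
    return np.array(rows, dtype=int)


def build_with_append(n):
    """Grow the array row by row; every np.append copies the whole array."""
    data = np.zeros((0, 3), int)
    for i in range(n):
        data = np.append(data, np.array([[i, 1000 + i, i % 7]]), axis=0)
    return data


fast = build_with_list(200)
slow = build_with_append(200)
```

Both functions return identical arrays. The difference is only in how much copying happens along the way, which is what grows quadratically in the np.append version.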
I still think the parse routine is your bottleneck (even if it did come from Google), but all those '.'s were killing me! (And they do slow down performance somewhat.) Also, I converted your i, i+1 iterating to using two iterators zipping through the list of updates, this is a little more advanced style of working through a list. Plus the cur/next_update names helped me keep straight when you wanted to reference one vs. the other. Finally, I remove the trailing "else: continue", since you are at the end of the for loop anyway.
for trip in feed.entity:
this_trip_update = trip.trip_update
this_trip_id = this_trip_update.trip.trip_id
if this_trip_id not in read_trips:
if len(this_trip_update.stop_time_update) == len(stopsOnTrip[this_trip_id]):
print("\t", "Adding delays for", len(this_trip_update.stop_time_update), "stops, on trip_id",
this_trip_id)
# create two iterators to walk through the list of updates
cur_updates = iter(this_trip_update.stop_time_update)
nxt_updates = iter(this_trip_update.stop_time_update)
# advance the nxt_updates iter so it is one ahead of cur_updates
next(nxt_updates)
for cur_update, next_update in zip(cur_updates, nxt_updates):
# Store the delay data point (arrival difference of two ascending nodes)
delay = int(next_update.arrival.delay - cur_update.arrival.delay)
# Store contextual metadata (timestamp and edgeID) for the unique delay data point
ts = int(next_update.arrival.time)
key = "{}/{}".format(cur_update.stop_id, next_update.stop_id)
# Append data to numpy array
datapoints = np.append(datapoints, np.array([[key, ts, delay]]), axis=0)
except KeyError:
This code should be equivalent to what you posted, and I don't really expect major performance gains either, but perhaps this will be more maintainable when you come back to look at it in 6 months.
(This probably is more appropriate for CodeReview, but I hardly ever go there.)
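The two-iterator idea can be tried in isolation on a plain list. This is a minimal sketch with made-up delay values. consecutive_pairs is a hypothetical helper, and it uses itertools.islice to skip the first element instead of calling next() on a second iterator.

```python
from itertools import islice


def consecutive_pairs(items):
    """Yield (current, next) pairs from a sequence.

    items must be a sequence (something that can be iterated twice),
    not a one-shot iterator.
    """
    return zip(items, islice(items, 1, None))


# Made-up arrival delays (in seconds) at four consecutive stops.
delays = [0, 30, 45, 40]

# Difference in delay between each stop and the one before it.
diffs = [nxt - cur for cur, nxt in consecutive_pairs(delays)]
print(diffs)  # [30, 15, -5]
```

zip stops at the shorter input, so the last element is never paired with anything, which matches the [:-1] slice in the original loop.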

search a 2GB WAV file for dropouts using wave module

What is the best way to analyze a 2 GB WAV file (1 kHz tone) for audio dropouts using the wave module? I tried the script below:
import wave
file1 = wave.open("testdropout.wav", "r")
file2 = open("silence.log", "w")
for i in xrange(file1.getnframes()):
    frame = file1.readframes(i)
    zero = True
    for j in xrange(len(frame)):
        # check if amplitude is greater than 0
        # the ord() function converts the hex values to integers
        if ord(frame[j]) > 0:
            zero = False
    if zero:
        print >> file2, 'dropout at second %s' % (file1.tell()/file1.getframerate())
I haven't used the wave module before, but file1.readframes(i) looks like it reads 1 frame when you're at the first frame, 2 frames at the second, 10 frames at the tenth, and so on. A 2 GB CD-quality file has hundreds of millions of frames, so by the time you're at frame 100,000 you're reading 100,000 frames per call, and each pass through the loop gets slower as well?
And from my comment: in Python 2, range() builds the full list in memory first and xrange() doesn't, but not using range at all helps even more.
And push the looping down into the lower layers with any() to make the code shorter, and possibly faster:
import wave
file1 = wave.open("testdropout.wav", "r")
file2 = open("silence.log", "w")
chunksize = file1.getframerate()
chunk = file1.readframes(chunksize)
while chunk:
    if not any(ord(sample) for sample in chunk):
        print >> file2, 'dropout at second %s' % (file1.tell()/chunksize)
    chunk = file1.readframes(chunksize)
This should read the file in 1-second chunks.
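For Python 3, the same one-second-chunk idea might look like the sketch below. It is a rough adaptation, not the code above: find_silent_seconds and write_test_tone are hypothetical helpers, the test file is generated on the fly, and the all-zero check assumes 16-bit PCM, where digital silence is stored as zero bytes (8-bit WAV data is unsigned and idles at 128 instead).

```python
import math
import os
import struct
import tempfile
import wave


def find_silent_seconds(path):
    """Return the indices of one-second chunks that contain only zero bytes."""
    silent = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        second = 0
        while True:
            chunk = wav.readframes(rate)  # one second of audio per read
            if not chunk:
                break
            if not any(chunk):  # in Python 3, bytes iterate as integers
                silent.append(second)
            second += 1
    return silent


def write_test_tone(path, rate=8000, seconds=3, silent_second=1):
    """Write a mono 16-bit 1 kHz tone with one fully silent second."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        for s in range(seconds):
            if s == silent_second:
                samples = [0] * rate
            else:
                samples = [int(10000 * math.sin(2 * math.pi * 1000 * n / rate))
                           for n in range(rate)]
            wav.writeframes(struct.pack("<%dh" % rate, *samples))


wav_path = os.path.join(tempfile.mkdtemp(), "testdropout.wav")
write_test_tone(wav_path)
dropouts = find_silent_seconds(wav_path)
```

Only one second of audio is held in memory at a time, so the file size does not matter. A dropout shorter than a second, or one that straddles a chunk boundary, would need a smaller chunk size or a per-sample threshold.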
I think a simple solution to this would be to consider that the frame rate on audio files is pretty high. A sample file on my computer happens to have a framerate of 8,000. That means for every second of audio, I have 8,000 samples. If you have missing audio, I'm sure it will exist across multiple frames within a second, so you can essentially reduce your comparisons as drastically as your standards would allow. If I were you, I would try iterating over every 1,000th sample instead of every single sample in the audio file. That basically means it will examine every 1/8th of a second of audio to see if it's dead. Not as precise, but hopefully it will get the job done.
import wave
file1 = wave.open("testdropout.wav", "r")
file2 = open("silence.log", "w")
for i in range(file1.getnframes()):
    frame = file1.readframes(i)
    zero = True
    for j in range(0, len(frame), 1000):
        # check if amplitude is greater than 0
        # the ord() function converts the hex values to integers
        if ord(frame[j]) > 0:
            zero = False
    if zero:
        print >> file2, 'dropout at second %s' % (file1.tell()/file1.getframerate())
At the moment, you're reading the entire file into memory, which is not ideal. If you look at the methods available for a "Wave_read" object, one of them is setpos(pos), which sets the position of the file pointer to pos. If you update this position, you should be able to only keep the frame you want in memory at any given time, preventing errors. Below is a rough outline:
import wave
file1 = wave.open("testdropout.wav", "r")
file2 = open("silence.log", "w")
def scan_frame(frame):
    for j in range(len(frame)):
        # check if amplitude is less than or equal to 0
        # It makes more sense here to check for the desired case (low amplitude)
        # rather than breaking at higher amplitudes
        if ord(frame[j]) <= 0:
            return True
for i in range(file1.getnframes()):
    frame = file1.readframes(1) # only read the frame at the current file position
    zero = scan_frame(frame)
    if zero:
        print >> file2, 'dropout at second %s' % (file1.tell()/file1.getframerate())
    pos = file1.tell() # States current file position
    file1.setpos(pos + len(frame)) # or pos + 1, or whatever a single unit in a wave
                                   # file is, I'm not entirely sure
Hope this can help!

