Python csv multiprocessing load to dictionary or list

I'm moving my algorithms from MATLAB to Python, and I'm stuck on the parallel processing part.
I need to process a very large number of CSVs (1 to 1M files), each with a large number of rows (10k to 10M) and 5 independent data columns.
I already have code that does this, but only on one processor; loading the CSVs into a dictionary in RAM takes about 30 min (~1k CSVs of ~100k rows each).
The file names are in a list loaded from a CSV (this part is already done):
Amp Freq Offset PW FileName
3 10000.0 1.5 1e-08 FlexOut_20140814_221948.csv
3 10000.0 1.5 1.1e-08 FlexOut_20140814_222000.csv
3 10000.0 1.5 1.2e-08 FlexOut_20140814_222012.csv
...
And each CSV has this form (example: FlexOut_20140815_013804.csv):
# TDC characterization output file , compress
# TDC time : Fri Aug 15 01:38:04 2014
#- Event index number
#- Channel from 0 to 15
#- Pulse width [ps] (1 ns precision)
#- Time stamp rising edge [ps] (500 ps precision)
#- Time stamp falling edge [ps] (500 ps precision)
##Event Channel Pwidth TSrise TSfall
0 6 1003500 42955273671237500 42955273672241000
1 6 1003500 42955273771239000 42955273772242500
2 6 1003500 42955273871241000 42955273872244500
...
I'm looking for something like MATLAB's 'parfor' that takes a name from the list, opens the file, and puts the data into a list of dictionaries.
It's a list because there is an order to the files (PW), but in the examples I've found it seems more complicated to preserve that order, so I'll first put the data in a dictionary and arrange it into a list afterwards.
For now I'm starting from the multiprocessing examples on the web:
Writing to dictionary of objects in parallel
I will post updates when I have a piece of "working" code.
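For reference, a minimal sketch of such a parfor-style loader with multiprocessing.Pool might look like this (the helper name load_one and the 'file_list.csv' path are placeholders, not from the original post); pool.map returns its results in the same order as the input list, so the PW ordering of the file list is preserved:

import multiprocessing as mp
import pandas as pd

def load_one(filename):
    # Skip the '#' comment/header lines and name the five data columns.
    df = pd.read_csv(filename, comment='#', sep=r'\s+', header=None,
                     names=['Event', 'Channel', 'Pwidth', 'TSrise', 'TSfall'])
    return df.to_dict('list')   # or return the DataFrame itself

if __name__ == '__main__':
    # 'file_list.csv' stands in for the already-loaded file list shown above.
    file_table = pd.read_csv('file_list.csv', sep=r'\s+')
    filenames = file_table['FileName'].tolist()
    with mp.Pool() as pool:
        results = pool.map(load_one, filenames)   # one dict per file, in list order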

Related

Mapping data frame descriptions based on values of multiple columns

I need to generate a mapping dataframe with each unique code and the description I want prioritised, but I need to do it based on a set of prioritisation options. So, for example, the starting dataframe might look like this:
Filename TB Period Company Code Desc. Amount
0 3 - Foxtrot... Prior TB FOXTROT FOXTROT__1000 98 100
1 3 - Foxtrot... Prior TB FOXTROT FOXTROT__1000 7 200
2 3 - Foxtrot... Opening TB FOXTROT FOXTROT__1000 ZX -100
3 3 - Foxtrot... Closing TB FOXTROT FOXTROT__1000 29 -200
4 3 - Foxtrot... Prior TB FOXTROT FOXTROT__1001 BA 100
5 3 - Foxtrot... Opening TB FOXTROT FOXTROT__1001 9 200
6 3 - Foxtrot... Closing TB FOXTROT FOXTROT__1001 ARC -100
7 3 - Foxtrot... Closing TB FOXTROT FOXTROT__1001 86 -200
The options I have for prioritisation of descriptions are:
First, search for viable options by Period: for example Closing first, then Opening if none are found, then Prior.
If multiple descriptions are in the prioritised period, prioritise either longest or first instance.
So for example, if I wanted prioritisation of Closing, then Opening, then Prior, with longest string, I should get a mapping dataframe that looks like this:
Code New Desc.
FOXTROT__1000 29
FOXTROT__1001 ARC
Just for context, I have a fairly simple way to do all this in tkinter, but it's dependent on generating a GUI of inconsistent codes and comboboxes of their descriptions, which is then used to generate a mapping dataframe.
The issue is that for large volumes (>1,000 and up to 30,000 inconsistent codes) it becomes impractical to generate a GUI, so for large volumes I need a way to auto-generate the mapping dataframe directly from the initial data whilst circumventing tkinter entirely.
import numpy as np
import pandas as pd

# df is the starting dataframe shown above.
# Create a new column which shows the hierarchy given the value of Period
df['NewFilterColumn'] = np.where(df['Period'] == 'Closing', 1,
                         np.where(df['Period'] == 'Opening', 2,
                          np.where(df['Period'] == 'Prior', 3, None)))

df = df.sort_values(by=['NewFilterColumn', 'Code', 'Desc.'], ascending=True, axis=0)
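A possible sketch of the prioritisation itself (not the asker's code; the column names 'Period', 'Code' and 'Desc.' follow the snippet above, with 'Period' assumed to hold Closing/Opening/Prior, and the example values come from the question):

import pandas as pd

df = pd.DataFrame({
    'Period': ['Prior', 'Prior', 'Opening', 'Closing',
               'Prior', 'Opening', 'Closing', 'Closing'],
    'Code':   ['FOXTROT__1000'] * 4 + ['FOXTROT__1001'] * 4,
    'Desc.':  ['98', '7', 'ZX', '29', 'BA', '9', 'ARC', '86'],
})

# Rank the periods: Closing beats Opening beats Prior.
priority = {'Closing': 1, 'Opening': 2, 'Prior': 3}
df['rank'] = df['Period'].map(priority)

def pick(group):
    # Keep only rows from the best-ranked period present for this code,
    # then take the longest description.
    best = group[group['rank'] == group['rank'].min()]
    return best.loc[best['Desc.'].str.len().idxmax(), 'Desc.']

mapping = df.groupby('Code').apply(pick).rename('New Desc.').reset_index()
print(mapping)   # FOXTROT__1000 -> 29, FOXTROT__1001 -> ARC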

ParserError: Expected 2 fields in line 32, saw 4

I'm having trouble parsing a txt file (see here: File).
Here's my code:
import pandas as pd
objectname = r"path"
df = pd.read_csv(objectname, engine = 'python', sep='\t', header=None)
Unfortunately it does not work. Since this question has been asked several times, I tried lots of the proposed solutions (most of them can be found here: Possible solutions).
However, nothing did the trick for me. For instance, when I use
sep='delimiter'
the dataframe is created, but everything ends up in a single column.
When I use
error_bad_lines=False
the rows I'm interested in are simply skipped.
The only way it works is when I first open the txt file, copy the content, paste it into google sheets, save the file as CSV and then open the dataframe.
I guess another workaround would be to use
df = pd.read_csv(objectname, engine = 'python', sep = 'delimiter', header=None)
in combination with the split function (Split function).
Is there any suggestion on how to make this work without having to convert the file or use the split function? I'm using Python 3 and Windows 10.
Any help is appreciated.
Your file has tab separators but is not a TSV. The file is a mixture of metadata, followed by a "standard" TSV, followed by more metadata. Therefore, I found tackling the metadata as a separate task from loading the data to be useful.
Here's what I did to extract the metadata lines:
with open('example.txt', 'r') as file_handle:
    file_content = file_handle.read().split('\n')

for index, line in enumerate(file_content):
    if index < 21 or index > 37:
        print(index, line.split('\t'))
Note that the lines denoting the start and stop of metadata (21 and 37 in my example) are specific to the file. I've provided the trimmed data I used below (based on your linked file).
Separately, I loaded the TSV into Pandas using
import pandas as pd

df = pd.read_csv('example.txt', engine='python',
                 sep='\t', error_bad_lines=False, header=None,
                 skiprows=list(range(21)) + list(range(37, 89)))
Again, I skipped the metadata at the start of the file and at the end of the file.
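If the start and stop lines differ between files, a possible generalization (a sketch of my own, not part of the original approach) is to locate the data block by its markers: the numeric rows begin right after the XYDATA line and end at the first line that does not start with a digit. decimal=',' is added because the file uses comma decimal separators.

import pandas as pd

with open('example.txt') as fh:
    lines = fh.read().split('\n')

# First data row is the line right after 'XYDATA'; scan forward until the
# rows stop looking numeric (the '##### Extended Information' block).
start = next(i for i, line in enumerate(lines) if line.startswith('XYDATA')) + 1
end = start
while end < len(lines) and lines[end].strip() and lines[end].strip()[0].isdigit():
    end += 1

df = pd.read_csv('example.txt', sep='\t', header=None, decimal=',',
                 skiprows=start, nrows=end - start)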
Here's the file I experimented with. I've trimmed the extra data to reduce line count.
TITLE Test123
DATA TYPE
ORIGIN JASCO
OWNER
DATE 19/03/28
TIME 16:39:44
SPECTROMETER/DATA SYSTEM
LOCALE 1031
RESOLUTION
DELTAX -0,5
XUNITS NANOMETERS
YUNITS CD [mdeg]
Y2UNITS HT [V]
Y3UNITS ABSORBANCE
FIRSTX 300,0000
LASTX 190,0000
NPOINTS 221
FIRSTY -0,78961
MAXY 37,26262
MINY -53,38971
XYDATA
300,0000 -0,789606 182,198 -0,0205245
299,5000 -0,691644 182,461 -0,0181217
299,0000 -0,700976 182,801 -0,0136756
298,5000 -0,614708 182,799 -0,0131957
298,0000 -0,422611 182,783 -0,0130073
195,0000 26,6231 997,498 4,7258
194,5000 -17,3049 997,574 4,6864
194,0000 16,0387 997,765 4,63967
193,5000 -14,4049 997,967 4,58593
193,0000 -0,277261 998,025 4,52411
192,5000 -29,6098 998,047 4,45244
192,0000 -11,5786 998,097 4,36608
191,5000 34,0505 998,282 4,27376
191,0000 28,2325 998,314 4,1701
190,5000 -13,232 998,336 4,05036
190,0000 -47,023 998,419 3,91883
##### Extended Information
[Comments]
Sample name X
Comment
User
Division
Company RWTH Aachen
[Detailed Information]
Creation date 28.03.2019 16:39
Data array type Linear data array * 3
Horizontal axis Wavelength [nm]
Vertical axis(1) CD [mdeg]
Vertical axis(2) HT [V]
Vertical axis(3) Abs
Start 300 nm
End 190 nm
Data interval 0,5 nm
Data points 221
[Measurement Information]
Instrument name CD-Photometer
Model name J-1100
Serial No. A001361635
Detector Standard PMT
Lock-in amp. X mode
HT volt Auto
Accessory PTC-514
Accessory S/N A000161648
Temperature 18.63 C
Control sonsor Holder
Monitor sensor Holder
Measurement date 28.03.2019 16:39
Overload detect 203
Photometric mode CD, HT, Abs
Measure range 300 - 190 nm
Data pitch 0.5 nm
CD scale 2000 mdeg/1.0 dOD
FL scale 200 mdeg/1.0 dOD
D.I.T. 0.5 sec
Bandwidth 1.00 nm
Start mode Immediately
Scanning speed 200 nm/min
Baseline correction Baseline
Shutter control Auto
Accumulations 3

Reading direct access binary file format in Python

Background:
A binary file is read on a Linux machine using the following Fortran code:
      parameter(nx=720, ny=360, nday=365)
c
      dimension tmax(nx,ny,nday),nmax(nx,ny,nday)
      dimension tmin(nx,ny,nday),nmin(nx,ny,nday)
c
      open(10,
     &file='FILE',
     &access='direct',recl=nx*ny*4)
c
      do k=1,nday
        read(10,rec=(k-1)*4+1)((tmax(i,j,k),i=1,nx),j=1,ny)
        read(10,rec=(k-1)*4+2)((nmax(i,j,k),i=1,nx),j=1,ny)
        read(10,rec=(k-1)*4+3)((tmin(i,j,k),i=1,nx),j=1,ny)
        read(10,rec=(k-1)*4+4)((nmin(i,j,k),i=1,nx),j=1,ny)
      end do
File Details:
options little_endian
title global daily analysis (grid box mean, the grid shown is the center of the grid box)
undef -999.0
xdef 720 linear 0.25 0.50
ydef 360 linear -89.75 0.50
zdef 1 linear 1 1
tdef 365 linear 01jan2015 1dy
vars 4
tmax 1 00 daily maximum temperature (C)
nmax 1 00 number of reports for maximum temperature (C)
tmin 1 00 daily minimum temperature (C)
nmin 1 00 number of reports for minimum temperature (C)
ENDVARS
Attempts at a solution:
I am trying to parse this into an array in Python using the following code (purposely leaving out two of the variables):
import gzip
import numpy

tmax, tmin = [], []
x = 1

with gzip.open("/FILE.gz", "rb") as infile:
    # read the whole file as little-endian 4-byte floats
    data = numpy.frombuffer(infile.read(), dtype=numpy.dtype('<f4'), count=-1)

while x <= len(data) / 4:
    tmax.append(data[(x-1)*4])
    tmin.append(data[(x-1)*4 + 2])
    x += 1

data_full = zip(tmax, tmin)
When testing some records, the data does not seem to line up with sample records read from the file with Fortran. I have also tried dtype=numpy.float32, with no success. It does seem as though I am reading in the correct number of observations, though. I was also using struct before I learned the file was created with Fortran; that was not working either.
There are similar questions out here, some of which have answers that I have tried adapting with no luck.
UPDATE
I am a little bit closer after trying out this code:
# Define numpy variables and empty arrays
nx = 720     # number of lon
ny = 360     # number of lat
nday = 0     # iterate up to 364 (or 365 for leap year)
tmax = numpy.empty([0], dtype='<f', order='F')
tmin = numpy.empty([0], dtype='<f', order='F')

# Parse the data into numpy arrays, shifting records as the date increments
while nday < 365:
    tmax = numpy.append(tmax, data[(nx*ny)*nday:(nx*ny)*(nday + 1)].reshape((nx,ny), order='F'))
    tmin = numpy.append(tmin, data[(nx*ny)*(nday + 2):(nx*ny)*(nday + 3)].reshape((nx,ny), order='F'))
    nday += 1
I get the correct data for the first day, but for the second day I get all zeros, the third day the max is lower than the min, and so on.
While the exact format of Fortran binary files is compiler dependent, in all cases I'm aware of, direct access files (files opened with access='direct' as in this question) do not have any record markers between records. Each record is of a fixed size, as given by the recl= specifier in the OPEN statement. That is, record N starts at offset (N - 1) * RECL bytes in the file.
One portability gotcha is that the unit of recl= is in terms of file storage units. For most compilers, the file storage unit is an 8-bit octet (as recommended in recent versions of the Fortran standard), but for the Intel Fortran compiler, recl= is in units of 32 bits; there is a command-line option -assume byterecl which can be used to make Intel Fortran match most other compilers.
So in the example given here, and assuming an 8-bit file storage unit, your recl would be 1036800 bytes.
Further, looking at the code, it seems to assume the arrays are of a 4-byte type (e.g. integer or single precision real). So if it's single precision real, and the file has been created as little endian, then the numpy dtype <f4 that you have used seems to be the correct choice.
Now, getting back to the Intel Fortran compiler gotcha: if the file has been created by ifort without -assume byterecl, then the data you want will be in the first quarter of each record, with the rest being padding (all zeros, or maybe even random data?). Then you'll have to do some extra gymnastics to extract the correct data in Python and not the padding. It should be easy to check this from the size of the file: is it nx * ny * 4 * nday * 4 bytes, or nx * ny * 4 * nday * 4 * 4 bytes?
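As an illustration of that layout (a sketch under the assumptions above: byte-sized storage units, little-endian 4-byte reals, and an uncompressed file; 'FILE' is the placeholder name from the question), record N can be read by seeking straight to its offset:

import numpy as np

nx, ny = 720, 360
recl = nx * ny * 4                 # bytes per direct-access record

def read_record(f, rec):
    # Record `rec` (1-based, as in Fortran) starts at offset (rec - 1) * recl.
    f.seek((rec - 1) * recl)
    flat = np.frombuffer(f.read(recl), dtype='<f4')
    return flat.reshape((nx, ny), order='F')   # Fortran column-major order

with open('FILE', 'rb') as f:
    tmax_day1 = read_record(f, 1)   # each day holds 4 records: tmax, nmax, tmin, nmin
    tmin_day1 = read_record(f, 3)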
After the update in my question, I realized I had an error in how I was looping. I of course spotted this about 10 minutes after issuing a bounty, ah well.
The error was in using the day to iterate through the records. This does not work because it advances only once per loop, not pushing the reads far enough through the file, which is why some minimums were higher than the maximums. The new code is:
nday = 0
rm = 0    # record index into the file
while nday < 365:
    tmax = numpy.append(tmax, data[(nx*ny)*rm:(nx*ny)*(rm + 1)].reshape((nx,ny), order='F'))
    rm = rm + 2
    tmin = numpy.append(tmin, data[(nx*ny)*rm:(nx*ny)*(rm + 1)].reshape((nx,ny), order='F'))
    rm = rm + 2
    nday += 1
This used a Record Mover (or rm as I call it) to move the records the appropriate amount. That was all it needed.

How to calculate the presence time of student during a class session with image processing

I am trying to calculate the total presence time of students using face recognition, such that at the end of the class I can get two things: 1) the total time a student was present, and 2) from which time to which time he was present, and the same for when he was not present (e.g. 9:00-9:20 present, 9:20-9:22 not present, 9:22-9:42 present).
This is the way I am doing it.
In a 40-minute class, a Python file runs every 2 minutes for 40 seconds.
Each time the file runs it stores the IDs of the students that are present in a list and saves it to the DB. I made totalClassTime/2 columns in the table, since the file runs every 2 minutes. At the end of the class (after 40 minutes) it reads the data from the DB, calculates the total presence time, and saves that to the DB as well.
Is there a better way to do all this, so that I don't have to create classTime/2 columns in the table? Another ambiguity arises:
If for a student we get this data from the DB:
9:00  9:02  9:04  9:06  9:08  9:10  9:12  9:14  9:16 ...
p     p     -     p     -     p     p     p     p    ...
When calculating the total presence time it will add the time from 9:00 to 9:02, then it will consider 9:02-9:04 as absence time, and the same for 9:04-9:06; however, the student might have been present between 9:04 and 9:06. I have searched a lot but couldn't find a way to calculate the presence time accurately.
You could store each observation in a row instead of a column. Such a table looks like this:
classId | studentId | observationTime | present
------------------------------------------------
   1    |     1     |      9:00       |    p
   1    |     1     |      9:02       |    p
   1    |     1     |      9:04       |    -
   1    |     1     |      9:06       |    p
   1    |     1     |      9:08       |    -
   1    |     1     |      9:10       |    p
   ...
Then to evaluate a student's presence time all rows containing observations of this student in the particular class can be selected and ordered by time. This can be achieved with a select statement similar to this one:
SELECT observationTime, present FROM observations WHERE classID='1' AND studentID='1' ORDER BY observationTime
Now, you can simply iterate over the result set of this query and calculate the presence times as you did before.
Your problem with the student having an unclear presence state between 9:04 and 9:06 can be solved by defining the time frame for which an observation is considered valid.
You have already split your class into two-minute frames (from 9:00 to 9:02, from 9:02 to 9:04, and so on). Now you can say that the 9:00 observation is valid for the time frame from 9:00 to 9:02, the 9:02 observation is valid for the time slot from 9:02 to 9:04, and so on. This lets you interpret the data from your example unambiguously: the 9:04 observation is valid for the time from 9:04 to 9:06. As the student was not observed at 9:04, he is considered absent in this slot. At the next observation at 9:06 he is present, so we consider him to be in the class from 9:06 to 9:08.
Obviously the student was not really away the full time between 9:04 and 9:06 unless he magically materialized in his seat exactly at 9:06. But as we only look at the class every two minutes, we can only account for the student's presence at a two-minute resolution.
You are basically taking a sample of the state of the class at a point in time every two minutes and assuming it represents the state of the class for the whole two minutes.
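To make that interpretation concrete, here is a small Python sketch (the observation values come from the example above; everything else, such as reading them as a list of tuples, is hypothetical) that folds the ordered observations into presence/absence intervals and a total presence time, treating each observation as valid for the two-minute slot that follows it:

from datetime import datetime, timedelta

SLOT = timedelta(minutes=2)

# e.g. the rows returned by the SELECT above, already ordered by time
rows = [('9:00', 'p'), ('9:02', 'p'), ('9:04', '-'), ('9:06', 'p'),
        ('9:08', '-'), ('9:10', 'p')]

intervals = []                 # list of (start, end, present?) tuples
total_present = timedelta()

for time_str, present in rows:
    start = datetime.strptime(time_str, '%H:%M')
    end = start + SLOT
    is_present = (present == 'p')
    if is_present:
        total_present += SLOT
    # extend the previous interval when the presence state did not change
    if intervals and intervals[-1][2] == is_present and intervals[-1][1] == start:
        intervals[-1] = (intervals[-1][0], end, is_present)
    else:
        intervals.append((start, end, is_present))

for start, end, is_present in intervals:
    print(f"{start:%H:%M}-{end:%H:%M} ({'Present' if is_present else 'Not present'})")
print("Total presence:", total_present)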

Add two Pandas Series or DataFrame objects in-place?

I have a dataset where we record the electrical power demand from each individual appliance in the home. The dataset is quite large (2 years of data; 1 sample every 6 seconds; 50 appliances). The data is in a compressed HDF file.
We need to add the power demand for every appliance to get the total aggregate power demand over time. Each individual meter might have a different start and end time.
The naive approach (using a simple model of our data) is to do something like this:
import numpy as np
import pandas as pd

LENGTH = 2**25
N = 30
cumulator = pd.Series()
for i in range(N):
    # change the index for each new_entry to mimic the fact
    # that our appliance meters have different start and end times
    new_entry = pd.Series(1, index=np.arange(i, LENGTH+i))
    cumulator = cumulator.add(new_entry, fill_value=0)
This works fine for small amounts of data. It also works OK with large amounts of data as long as every new_entry has exactly the same index.
But, with large amounts of data, where each new_entry has a different start and end index, Python quickly gobbles up all the available RAM. I suspect this is a memory fragmentation issue. If I use multiprocessing to fire up a new process for each meter (to load the meter's data from disk, load the cumulator from disk, do the addition in memory, then save the cumulator back to disk, and exit the process) then we have fine memory behaviour but, of course, all that disk IO slows us down a lot.
So, I think what I want is an in-place Pandas add function. The plan would be to initialise cumulator to have an index which is the union of all the meters' indices, then allocate memory once for that cumulator. Hence no more fragmentation issues.
I have tried two approaches but neither is satisfactory.
I tried using numpy.add to allow me to set the out argument:
# Allocate enough space for the cumulator
cumulator = pd.Series(0, index=np.arange(0, LENGTH+N))
for i in range(N):
    new_entry = pd.Series(1, index=np.arange(i, LENGTH+i))
    cumulator, aligned_new_entry = cumulator.align(new_entry, copy=False, fill_value=0)
    del new_entry
    np.add(cumulator.values, aligned_new_entry.values, out=cumulator.values)
    del aligned_new_entry
But this gobbles up all my RAM too and doesn't seem to do the addition. If I change the penultimate line to cumulator.values = np.add(cumulator.values, aligned_new_entry.values, out=cumulator.values) then I get an error about not being able to assign to cumulator.values.
This second approach appears to have the correct memory behaviour but is far too slow to run:
for i in range(N):
    new_entry = pd.Series(1, index=np.arange(i, LENGTH+i))
    for index in cumulator.index:
        try:
            cumulator[index] += new_entry[index]
        except KeyError:
            pass
I suppose I could write this function in Cython. But I'd rather not have to do that.
So: is there any way to do an 'inplace add' in Pandas?
Update
In response to comments below, here is a toy example of our meter data and the sum we want. All values are watts.
time meter1 meter2 meter3 sum
09:00:00 10 10
09:00:06 10 20 30
09:00:12 10 20 30
09:00:18 10 20 30 50
09:00:24 10 20 30 50
09:00:30 10 30 40
If you want to see more details then here's the file format description of our data logger, and here's the 4TByte archive of our entire dataset.
After messing around a lot with multiprocessing, I think I've found a fairly simple and efficient way to do an in-place add without using multiprocessing:
import numpy as np
import pandas as pd

LENGTH = 2**26
N = 10
DTYPE = np.int

# Allocate memory *once* for a Series which will hold our cumulator
cumulator = pd.Series(0, index=np.arange(0, N+LENGTH), dtype=DTYPE)

# Get a numpy array from the Series' buffer
cumulator_arr = np.frombuffer(cumulator.data, dtype=DTYPE)

# Create lots of dummy data. Each new_entry has a different start
# and end index.
for i in range(N):
    new_entry = pd.Series(1, index=np.arange(i, LENGTH+i), dtype=DTYPE)
    aligned_new_entry = np.pad(new_entry.values, pad_width=((i, N-i)),
                               mode='constant', constant_values=((0, 0)))
    # np.pad could be replaced by new_entry.reindex(index, fill_value=0)
    # but np.pad is faster and more memory efficient than reindex
    del new_entry
    np.add(cumulator_arr, aligned_new_entry, out=cumulator_arr)
    del aligned_new_entry

del cumulator_arr
print cumulator.head(N*2)
which prints:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 10
11 10
12 10
13 10
14 10
15 10
16 10
17 10
18 10
19 10
Assuming that your dataframe looks something like:
df.index.names == ['time']
df.columns == ['meter1', 'meter2', ..., 'meterN']
then all you need to do is:
df['total'] = df.fillna(0).sum(axis=1)
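For illustration, a tiny made-up frame in that shape (NaN where a meter has not yet started or has already stopped; the numbers are not from the question) shows what the one-liner produces:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'meter1': [10, 10, 10, np.nan],
     'meter2': [np.nan, 20, 20, 20],
     'meter3': [np.nan, np.nan, 30, 30]},
    index=['09:00:00', '09:00:06', '09:00:12', '09:00:18'])
df.index.name = 'time'

df['total'] = df.fillna(0).sum(axis=1)   # 10, 30, 60, 50
print(df)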
