Append simulation data using HDF5 - python

I currently run a simulation several times and want to save the results of these simulations so that they can be used for visualizations.
The simulation is run 100 times, and each run generates about 1 million data points (i.e. one value per episode for 1 million episodes), which I now want to store efficiently. The goal is then, for each episode, to average the value across all 100 simulations.
My main file looks like this:
import h5py

# Defining the test simulation environment
def test_simulation():
    environment = environment(
        periods=1000000,
        parameter_x=...,
        parameter_y=...,
    )
    # Defining the simulation
    environment.simulation()
    # Save simulation data
    hf = h5py.File('runs/simulation_runs.h5', 'a')
    hf.create_dataset('data', data=environment.value_history, compression='gzip', chunks=True)
    hf.close()

# Run the simulation 100 times
for i in range(100):
    print(f'--- Iteration {i} ---')
    test_simulation()
The value_history is generated within game(), i.e. the values are continuously appended to an empty list according to:
def simulation(self):
    for episode in range(periods):
        value = doSomething()
        self.value_history.append(value)
Now I get the following error message when going to the next simulation:
ValueError: Unable to create dataset (name already exists)
I am aware that the current code keeps trying to create a dataset with the same name and raises an error because that dataset already exists. Now I am looking to reopen the file created in the first simulation, append the data from the next simulation, and save it again.

The example below shows how to pull all these ideas together. It creates 2 files:
1. Create one resizable dataset with the maxshape() parameter on the first loop, then use dataset.resize() on subsequent loops -- output is simulation_runs1.h5
2. Create a unique dataset for each simulation -- output is simulation_runs2.h5
I created a simple 100x100 NumPy array of random values for the "simulation data" and ran the simulation 10 times. Those sizes are variables, so you can increase them to determine which method is better (faster) for your data. You may also discover memory limitations when saving 1M data points for 1M time periods.
Note 1: If you can't save all the data in system memory, you can incrementally save simulation results to the H5 file. It's just a little more complicated.
Note 2: I added a mode variable to control whether a new file is created for the first simulation (i==0) or the existing file is opened in append mode for subsequent simulations.
import h5py
import numpy as np
# Create some pseudo-test data
def test_simulation(i):
    periods = 100
    times = 100
    # Define the simulation with some random data
    val_hist = np.random.random(periods*times).reshape(periods, times)
    a0, a1 = val_hist.shape[0], val_hist.shape[1]

    if i == 0:
        mode = 'w'
    else:
        mode = 'a'

    # Save simulation data (resize dataset)
    with h5py.File('runs/simulation_runs1.h5', mode) as hf:
        if 'data' not in list(hf.keys()):
            print('create new dataset')
            hf.create_dataset('data', shape=(1, a0, a1), maxshape=(None, a0, a1), data=val_hist,
                              compression='gzip', chunks=True)
        else:
            print('resize existing dataset')
            d0 = hf['data'].shape[0]
            hf['data'].resize((d0+1, a0, a1))
            hf['data'][d0:d0+1, :, :] = val_hist

    # Save simulation data (unique datasets)
    with h5py.File('runs/simulation_runs2.h5', mode) as hf:
        hf.create_dataset(f'data_{i:03}', data=val_hist,
                          compression='gzip', chunks=True)

# Run the simulation (10 times for this example)
for i in range(10):
    print(f'--- Iteration {i} ---')
    test_simulation(i)
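Once all runs are stored, the averaging the question asks about falls out of the resizable-dataset layout: read 'data' back (axis 0 indexes the simulation run) and take the mean over that axis. A minimal sketch, assuming the file and dataset names used above; if the full array doesn't fit in memory, read and accumulate run by run instead:
import h5py
import numpy as np

# Average each value across all simulation runs.
# Assumes the resizable-dataset layout of simulation_runs1.h5 above,
# where axis 0 indexes the run.
with h5py.File('runs/simulation_runs1.h5', 'r') as hf:
    data = hf['data'][:]                 # shape: (n_runs, a0, a1)

mean_per_episode = data.mean(axis=0)     # average over the runs
print(mean_per_episode.shape)            # (a0, a1)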

Related

CSV file with multiple events with different headers and numbers of values in a row

So I have a very long CSV file in the format below; these are the first two events of about 3000. The first row of each event is the date, whether it was detected with instrument number 16 (yes=16/no=0), whether it was detected with instrument number 17 (yes=17/no=0), and the confidence of the detection. The next line is a single value: the number of data points collected for that event. Then comes the actual data for that event: time in UTC, lat, lon, energy. My goal for these 3000 events is to integrate all the energies (the light curve) to get a total energy per event. I need a way for my code to read the first value as a date and start a new "loop": read the next line to know how many rows belong to the event, integrate those rows, print the integrated value, then recognize the next line as a date and do it again, so I end up with 3000 integrated energies. I haven't really worked with a CSV like this, especially without headers, so any starting point is much appreciated!!
2021-10-15T16:49:30.059Z,16,0,medium
118
1634316570059,-77.00921630859375,36.263389587402344,3.0848139228168667e-15
1634316570061,-77.00920104980469,36.26337432861328,2.184921944046166e-15
1634316570063,-77.00919342041016,36.26335906982422,2.484885936969733e-15
1634316570065,-77.0091781616211,36.263343811035156,2.9848259251756777e-15
1634316570067,-77.00916290283203,36.26332473754883,2.084933946404977e-15
1634316570069,-77.00917053222656,36.2632942199707,3.584753911022812e-15
1634316570071,-77.0091781616211,36.263267517089844,2.8848379275344887e-15
1634316570073,-77.00918579101562,36.263240814208984,2.184921944046166e-15
1634316570075,-77.00919342041016,36.26321029663086,3.3847779157404337e-15
1634316570077,-77.00920104980469,36.26319122314453,2.9848259251756777e-15
1634316570079,-77.00920867919922,36.26319885253906,2.8848379275344887e-15
1634316570081,-77.00921630859375,36.263206481933594,3.1848019204580557e-15
1634316570083,-77.00921630859375,36.26321792602539,4.2846698945111345e-15
1634316570085,-77.00922393798828,36.26322555541992,2.9848259251756777e-15
1634316570087,-77.00922393798828,36.26323699951172,3.8847179039463786e-15
1634316570089,-77.00921630859375,36.26326370239258,4.6846218850758904e-15
1634316570091,-77.00920867919922,36.2632942199707,4.984585877999457e-15
1634316570093,-77.00920867919922,36.26332092285156,4.2846698945111345e-15
1634316570095,-77.00920104980469,36.26335144042969,4.984585877999457e-15
1634316570097,-77.00919342041016,36.26337814331055,4.4846458897935125e-15
1634316570099,-77.00918579101562,36.263389587402344,5.184561873281835e-15
1634316570101,-77.00917053222656,36.263404846191406,4.0846938992287565e-15
1634316570103,-77.00916290283203,36.26342010498047,4.7846098827170794e-15
1634316570105,-77.0091552734375,36.26343536376953,4.6846218850758904e-15
1634316570107,-77.00914764404297,36.26344680786133,3.8847179039463786e-15
1634316570109,-77.0091552734375,36.263431549072266,2.8848379275344887e-15
1634316570111,-77.0091552734375,36.26342010498047,2.284909941687355e-15
1634316570113,-77.0091552734375,36.263404846191406,2.9848259251756777e-15
1634316570115,-77.0091552734375,36.26339340209961,2.5848739346109218e-15
1634316570147,-77.0092544555664,36.26349639892578,3.2847899180992447e-15
1634316570149,-77.0092544555664,36.263450622558594,2.6848619322521108e-15
1634316570151,-77.00924682617188,36.263404846191406,2.384897939328544e-15
1634316570153,-77.00923919677734,36.26335906982422,2.9848259251756777e-15
1634316570155,-77.00923156738281,36.26331329345703,2.384897939328544e-15
1634316570157,-77.00922393798828,36.263267517089844,2.9848259251756777e-15
1634316570159,-77.00922393798828,36.263275146484375,2.384897939328544e-15
1634316570161,-77.00921630859375,36.263282775878906,3.8847179039463786e-15
1634316570163,-77.00921630859375,36.26329040527344,3.7847299063051896e-15
1634316570165,-77.00920867919922,36.263301849365234,4.3846578921523235e-15
1634316570167,-77.00920867919922,36.263309478759766,4.984585877999457e-15
1634316570169,-77.00920867919922,36.263328552246094,5.584513863846591e-15
1634316570171,-77.00921630859375,36.26334762573242,5.68450186148778e-15
1634316570173,-77.00922393798828,36.263370513916016,6.384417844976103e-15
1634316570175,-77.00922393798828,36.263389587402344,5.484525866205402e-15
1634316570177,-77.00923156738281,36.26340866088867,5.884477856770158e-15
1634316570179,-77.00923919677734,36.26340866088867,5.784489859128969e-15
1634316570181,-77.0092544555664,36.26340103149414,5.084573875640646e-15
1634316570183,-77.00926208496094,36.263397216796875,5.084573875640646e-15
1634316570185,-77.00926971435547,36.26339340209961,4.4846458897935125e-15
1634316570186,-77.00928497314453,36.26338577270508,5.284549870923024e-15
1634316570189,-77.00926971435547,36.263389587402344,5.084573875640646e-15
1634316570191,-77.0092544555664,36.26339340209961,4.4846458897935125e-15
1634316570193,-77.00923919677734,36.263397216796875,2.9848259251756777e-15
1634316570194,-77.00922393798828,36.26340103149414,4.0846938992287565e-15
1634316570196,-77.00920104980469,36.263404846191406,4.5846338874347014e-15
1634316570199,-77.00921630859375,36.26340866088867,4.4846458897935125e-15
1634316570201,-77.00923156738281,36.26340866088867,3.3847779157404337e-15
1634316570202,-77.00923919677734,36.26340866088867,2.8848379275344887e-15
1634316570207,-76.9164047241211,36.26325988769531,2.7848499298932998e-15
1634316570209,-76.91639709472656,36.26325225830078,3.4847659133816226e-15
1634316570210,-76.91638946533203,36.26324462890625,4.7846098827170794e-15
1634316570212,-76.9163818359375,36.26323318481445,4.8845978803582684e-15
1634316570214,-76.91637420654297,36.26322555541992,5.884477856770158e-15
1634316570217,-76.9163589477539,36.26321792602539,6.68438183789967e-15
1634316570218,-76.91635131835938,36.26320266723633,6.984345830823237e-15
1634316570220,-76.91633605957031,36.263187408447266,7.784249811952749e-15
1634316570222,-76.91632080078125,36.2631721496582,7.984225807235127e-15
1634316570224,-76.91631317138672,36.26315689086914,1.0083973757700096e-14
1634316570226,-76.91629791259766,36.26314163208008,1.258367369872982e-14
1634316570228,-76.91632080078125,36.263145446777344,1.358355367514171e-14
1634316570230,-76.9163589477539,36.26315689086914,1.568330162560668e-14
1634316570232,-76.91639709472656,36.26316833496094,1.7183121590224513e-14
1634316570234,-76.91643524169922,36.263179779052734,1.8682941554842348e-14
1634316570236,-76.91647338867188,36.263187408447266,2.2082533474642773e-14
1634316570238,-76.91646575927734,36.2631950378418,2.528214939916082e-14
1634316570240,-76.91645050048828,36.26319885253906,2.5782089387366766e-14
1634316570242,-76.91642761230469,36.26319885253906,2.828178932839649e-14
1634316570244,-76.91641235351562,36.26320266723633,2.928166930480838e-14
1634316570246,-76.91638946533203,36.263206481933594,3.028154928122027e-14
1634316570248,-76.91637420654297,36.263214111328125,3.048152527650265e-14
1634316570250,-76.91636657714844,36.263221740722656,3.008157328593789e-14
1634316570252,-76.91635131835938,36.26322937011719,2.928166930480838e-14
1634316570254,-76.91633605957031,36.26323699951172,2.9881597290655514e-14
1634316570256,-76.91632843017578,36.26324462890625,3.148140525291454e-14
1634316570258,-76.91632843017578,36.26325225830078,3.298122521753237e-14
1634316570260,-76.91634368896484,36.263267517089844,3.248128522932643e-14
1634316570262,-76.9163589477539,36.26327896118164,3.198134524112048e-14
1634316570264,-76.91636657714844,36.26329040527344,3.238129723168524e-14
1634316570266,-76.9163818359375,36.263301849365234,3.1781369245838105e-14
1634316570268,-76.91638946533203,36.26329040527344,3.0781489269426215e-14
1634316570270,-76.91639709472656,36.26326370239258,3.148140525291454e-14
1634316570272,-76.9164047241211,36.263240814208984,3.228130923404405e-14
1634316570274,-76.91641235351562,36.263214111328125,3.348116520573832e-14
1634316570276,-76.91641998291016,36.263187408447266,3.498098517035615e-14
1634316570278,-76.91641998291016,36.26317596435547,3.348116520573832e-14
1634316570280,-76.91642761230469,36.263179779052734,3.108145326234978e-14
1634316570282,-76.91642761230469,36.263179779052734,2.918168130716719e-14
1634316570284,-76.91642761230469,36.26318359375,3.048152527650265e-14
1634316570286,-76.91642761230469,36.26318359375,3.148140525291454e-14
1634316570288,-76.91643524169922,36.26318359375,3.1781369245838105e-14
1634316570290,-76.91643524169922,36.26317596435547,3.048152527650265e-14
1634316570292,-76.91644287109375,36.26316833496094,3.238129723168524e-14
1634316570294,-76.91644287109375,36.26316452026367,3.128142925763216e-14
1634316570296,-76.91644287109375,36.26315689086914,2.9681621295373136e-14
1634316570298,-76.91643524169922,36.26316452026367,2.858175332132006e-14
1634316570300,-76.9164047241211,36.263179779052734,2.748188534726698e-14
1634316570302,-76.9163818359375,36.2631950378418,2.2582473462848718e-14
1634316570304,-76.91635131835938,36.263214111328125,1.9982785524177805e-14
1634316570306,-76.91632843017578,36.26322937011719,1.4983385642118356e-14
1634316570308,-76.91632843017578,36.26323699951172,1.2283709705806253e-14
1634316570310,-76.91636657714844,36.26323318481445,1.068390174354723e-14
1634316570312,-76.9164047241211,36.26322937011719,8.884117786005828e-15
1634316570314,-76.91644287109375,36.26322555541992,7.284309823746804e-15
1634316570316,-76.9164810180664,36.263221740722656,6.084453852052536e-15
1634316570318,-76.9164810180664,36.263221740722656,4.5846338874347014e-15
1634316570320,-76.91645050048828,36.26322937011719,2.384897939328544e-15
1634316570342,-76.91644287109375,36.26319122314453,2.7848499298932998e-15
1634316570344,-76.91645812988281,36.263179779052734,2.484885936969733e-15
2021-10-15T16:28:56.018Z,17,0,low
53
1634315336018,-163.02442932128906,16.354591369628906,1.485005960557843e-15
1634315336020,-163.0244140625,16.35460090637207,1.584993958199032e-15
1634315336022,-163.0244140625,16.354612350463867,1.984945948763788e-15
1634315336024,-163.0244140625,16.354623794555664,2.284909941687355e-15
1634315336026,-163.0244140625,16.35463523864746,1.884957951122599e-15
1634315336028,-163.02439880371094,16.354646682739258,2.084933946404977e-15
1634315336030,-163.0244140625,16.354642868041992,2.484885936969733e-15
1634315336032,-163.0244140625,16.35463523864746,2.6848619322521108e-15
1634315336034,-163.02442932128906,16.35462760925293,2.6848619322521108e-15
1634315336036,-163.02442932128906,16.354618072509766,2.6848619322521108e-15
1634315336038,-163.02444458007812,16.354610443115234,2.5848739346109218e-15
1634315336040,-163.02444458007812,16.354612350463867,2.484885936969733e-15
1634315336042,-163.02444458007812,16.354616165161133,2.384897939328544e-15
1634315336044,-163.02444458007812,16.35462188720703,2.5848739346109218e-15
1634315336046,-163.02444458007812,16.35462760925293,2.7848499298932998e-15
1634315336048,-163.02444458007812,16.354633331298828,3.2847899180992447e-15
1634315336050,-163.02444458007812,16.354637145996094,2.9848259251756777e-15
1634315336052,-163.0244598388672,16.35464096069336,3.1848019204580557e-15
1634315336054,-163.0244598388672,16.354642868041992,2.9848259251756777e-15
1634315336055,-163.02447509765625,16.354646682739258,3.3847779157404337e-15
1634315336058,-163.02447509765625,16.354650497436523,3.8847179039463786e-15
1634315336060,-163.02447509765625,16.354637145996094,4.5846338874347014e-15
1634315336062,-163.0244598388672,16.354618072509766,5.084573875640646e-15
1634315336063,-163.02444458007812,16.354597091674805,5.184561873281835e-15
1634315336066,-163.0244140625,16.354576110839844,5.68450186148778e-15
1634315336068,-163.02439880371094,16.354557037353516,5.68450186148778e-15
1634315336070,-163.0244140625,16.35456085205078,6.784369835540859e-15
1634315336071,-163.02442932128906,16.354576110839844,7.084333828464426e-15
1634315336073,-163.02444458007812,16.35459327697754,7.284309823746804e-15
1634315336076,-163.0244598388672,16.3546085357666,7.384297821387993e-15
1634315336078,-163.02447509765625,16.354623794555664,7.284309823746804e-15
1634315336079,-163.0244903564453,16.354625701904297,7.084333828464426e-15
1634315336081,-163.0244903564453,16.354618072509766,7.384297821387993e-15
1634315336083,-163.0244903564453,16.354612350463867,7.68426181431156e-15
1634315336086,-163.0244903564453,16.354604721069336,8.484165795441072e-15
1634315336087,-163.02447509765625,16.354597091674805,8.484165795441072e-15
1634315336089,-163.02447509765625,16.354597091674805,8.784129788364639e-15
1634315336091,-163.02447509765625,16.35460090637207,8.884117786005828e-15
1634315336093,-163.02447509765625,16.35460662841797,8.884117786005828e-15
1634315336095,-163.02447509765625,16.354610443115234,9.184081778929395e-15
1634315336097,-163.02447509765625,16.3546142578125,8.784129788364639e-15
1634315336099,-163.02447509765625,16.354612350463867,8.184201802517505e-15
1634315336101,-163.02447509765625,16.35460662841797,7.884237809593938e-15
1634315336103,-163.02447509765625,16.35460090637207,7.184321826105615e-15
1634315336105,-163.02447509765625,16.35459327697754,6.68438183789967e-15
1634315336107,-163.0244903564453,16.35458755493164,6.384417844976103e-15
1634315336109,-163.02447509765625,16.35458755493164,5.584513863846591e-15
1634315336111,-163.0244598388672,16.354595184326172,4.7846098827170794e-15
1634315336113,-163.02444458007812,16.35460090637207,3.8847179039463786e-15
1634315336115,-163.02444458007812,16.3546085357666,3.0848139228168667e-15
1634315336117,-163.02442932128906,16.354616165161133,2.5848739346109218e-15
1634315336119,-163.0244140625,16.354616165161133,2.084933946404977e-15
1634315336121,-163.0244140625,16.3546085357666,1.285029965275465e-15
2021-10-14T17:04:46.571Z,17,0,low
26
...
Update: formatting is different for events detected with both instruments 16 and 17 (see comment below for more info):
2021-10-14T10:16:07.969Z,16,17,medium
74
1634206567969,-87.70339965820312,-12.525328636169434,1.78496995348141e-15
1634206567971,-87.70339965820312,-12.525323867797852,1.584993958199032e-15
1634206567973,-87.70340728759766,-12.525325775146484,1.684981955840221e-15
1634206567975,-87.70341491699219,-12.525327682495117,1.884957951122599e-15
1634206567977,-87.70341491699219,-12.52532958984375,1.78496995348141e-15
1634206567979,-87.70342254638672,-12.525331497192383,1.584993958199032e-15
1634206567981,-87.70342254638672,-12.525333404541016,1.884957951122599e-15
1634206567983,-87.70341491699219,-12.525336265563965,1.984945948763788e-15
1634206567985,-87.70339965820312,-12.52534008026123,2.284909941687355e-15
1634206567987,-87.7033920288086,-12.52534294128418,1.984945948763788e-15
1634206567989,-87.70338439941406,-12.525346755981445,1.78496995348141e-15
1634206567991,-87.70337677001953,-12.525354385375977,1.684981955840221e-15
1634206567993,-87.70337677001953,-12.525372505187988,1.185041967634276e-15
1634206567997,-87.703369140625,-12.525408744812012,1.485005960557843e-15
1634206567999,-87.703369140625,-12.525426864624023,1.285029965275465e-15
1634206568007,-87.78045654296875,-12.52658462524414,1.78496995348141e-15
1634206568009,-87.78046417236328,-12.52658462524414,2.484885936969733e-15
1634206568011,-87.78047943115234,-12.526582717895508,2.5848739346109218e-15
1634206568013,-87.78047943115234,-12.526573181152344,3.0848139228168667e-15
1634206568015,-87.78048706054688,-12.526564598083496,3.584753911022812e-15
1634206568017,-87.78048706054688,-12.526555061340332,4.1846818968699455e-15
1634206568019,-87.7804946899414,-12.526545524597168,4.6846218850758904e-15
1634206568021,-87.7804946899414,-12.526537895202637,5.084573875640646e-15
1634206568023,-87.78048706054688,-12.526537895202637,5.284549870923024e-15
1634206568025,-87.78047943115234,-12.52653694152832,5.884477856770158e-15
1634206568027,-87.78047180175781,-12.526535987854004,6.984345830823237e-15
1634206568029,-87.78045654296875,-12.526535987854004,7.484285819029182e-15
1634206568031,-87.78044891357422,-12.526533126831055,7.984225807235127e-15
1634206568033,-87.78043365478516,-12.526520729064941,7.984225807235127e-15
1634206568035,-87.78042602539062,-12.526508331298828,7.784249811952749e-15
1634206568037,-87.78041076660156,-12.526495933532715,7.68426181431156e-15
1634206568039,-87.7803955078125,-12.526482582092285,7.884237809593938e-15
1634206568041,-87.78038787841797,-12.526473999023438,7.584273816670371e-15
1634206568043,-87.78040313720703,-12.526484489440918,7.084333828464426e-15
1634206568045,-87.7804183959961,-12.526494026184082,6.484405842617292e-15
1634206568047,-87.78043365478516,-12.526503562927246,5.284549870923024e-15
1634206568049,-87.78044891357422,-12.52651309967041,4.8845978803582684e-15
1634206568051,-87.78045654296875,-12.526520729064941,5.084573875640646e-15
1634206568053,-87.78045654296875,-12.52651596069336,4.7846098827170794e-15
1634206568055,-87.78045654296875,-12.526511192321777,4.3846578921523235e-15
1634206568057,-87.78045654296875,-12.526506423950195,4.4846458897935125e-15
1634206568059,-87.78045654296875,-12.526500701904297,3.8847179039463786e-15
1634206568060,-87.78045654296875,-12.526496887207031,3.0848139228168667e-15
1634206568063,-87.78046417236328,-12.52650260925293,3.0848139228168667e-15
1634206568065,-87.78047943115234,-12.526508331298828,2.5848739346109218e-15
1634206568067,-87.78048706054688,-12.52651309967041,1.984945948763788e-15
1634206568068,-87.78050231933594,-12.526518821716309,1.984945948763788e-15
1634206568071,-87.78050994873047,-12.526522636413574,1.684981955840221e-15
1634206568073,-87.780517578125,-12.526510238647461,1.584993958199032e-15
1634206568075,-87.780517578125,-12.526497840881348,2.184921944046166e-15
1634206568076,-87.780517578125,-12.526485443115234,2.5848739346109218e-15
1634206568078,-87.78052520751953,-12.526473045349121,2.7848499298932998e-15
1634206568081,-87.78052520751953,-12.52646255493164,3.0848139228168667e-15
1634206568083,-87.780517578125,-12.526474952697754,3.9847059015875675e-15
1634206568084,-87.78050994873047,-12.526488304138184,3.9847059015875675e-15
1634206568086,-87.78050231933594,-12.526501655578613,4.1846818968699455e-15
1634206568088,-87.78050231933594,-12.526514053344727,4.7846098827170794e-15
1634206568091,-87.7804946899414,-12.526527404785156,5.084573875640646e-15
1634206568093,-87.78048706054688,-12.526535987854004,5.184561873281835e-15
1634206568094,-87.78048706054688,-12.526544570922852,5.084573875640646e-15
1634206568096,-87.78048706054688,-12.5265531539917,5.384537868564213e-15
1634206568099,-87.78048706054688,-12.526561737060547,5.68450186148778e-15
1634206568101,-87.78047943115234,-12.526570320129395,5.284549870923024e-15
1634206568102,-87.78048706054688,-12.52657413482666,5.084573875640646e-15
1634206568104,-87.78048706054688,-12.526578903198242,4.7846098827170794e-15
1634206568106,-87.78048706054688,-12.526583671569824,4.984585877999457e-15
1634206568109,-87.7804946899414,-12.52658748626709,5.484525866205402e-15
1634206568110,-87.7804946899414,-12.526592254638672,6.784369835540859e-15
1634206568112,-87.78048706054688,-12.526575088500977,6.68438183789967e-15
1634206568114,-87.78048706054688,-12.526556968688965,5.884477856770158e-15
1634206568116,-87.78047943115234,-12.526538848876953,5.184561873281835e-15
1634206568118,-87.78047180175781,-12.526520729064941,3.584753911022812e-15
1634206568120,-87.78047180175781,-12.52650260925293,2.6848619322521108e-15
1634206568122,-87.78047943115234,-12.526507377624512,1.485005960557843e-15
51
1634206567977,-86.63197326660156,-12.489493370056152,4.984585877999457e-15
1634206567979,-86.63197326660156,-12.489490509033203,4.6846218850758904e-15
1634206567981,-86.6319808959961,-12.489486694335938,5.184561873281835e-15
1634206567983,-86.6319808959961,-12.489477157592773,5.184561873281835e-15
1634206567985,-86.63198852539062,-12.489458084106445,5.484525866205402e-15
1634206567987,-86.63199615478516,-12.489439964294434,6.184441849693725e-15
1634206567990,-86.63200378417969,-12.489420890808105,5.68450186148778e-15
1634206567991,-86.63201141357422,-12.489401817321777,4.984585877999457e-15
1634206567999,-86.63202667236328,-12.489419937133789,5.284549870923024e-15
1634206568001,-86.63203430175781,-12.48942756652832,4.4846458897935125e-15
1634206568005,-86.63209533691406,-12.489420890808105,5.784489859128969e-15
1634206568007,-86.63213348388672,-12.489412307739258,6.484405842617292e-15
1634206568009,-86.63217163085938,-12.48940372467041,7.68426181431156e-15
1634206568011,-86.63221740722656,-12.489395141601562,7.68426181431156e-15
1634206568013,-86.63224029541016,-12.48939323425293,8.784129788364639e-15
1634206568015,-86.63221740722656,-12.48940372467041,8.484165795441072e-15
1634206568017,-86.6322021484375,-12.48941421508789,9.484045771852962e-15
1634206568019,-86.6321792602539,-12.489424705505371,9.484045771852962e-15
1634206568021,-86.63216400146484,-12.489436149597168,8.484165795441072e-15
1634206568023,-86.63214874267578,-12.489445686340332,9.084093781288206e-15
1634206568025,-86.63213348388672,-12.489455223083496,1.0083973757700096e-14
1634206568027,-86.63212585449219,-12.48946475982666,1.2083733710523875e-14
1634206568029,-86.63211059570312,-12.489473342895508,1.1783769717600308e-14
1634206568031,-86.63209533691406,-12.489482879638672,1.1183841731753174e-14
1634206568033,-86.63208770751953,-12.489483833312988,1.1483805724676741e-14
1634206568035,-86.63208770751953,-12.48946475982666,1.0883877738829607e-14
1634206568037,-86.632080078125,-12.489446640014648,1.0483925748264851e-14
1634206568039,-86.632080078125,-12.48942756652832,9.084093781288206e-15
1634206568041,-86.632080078125,-12.489409446716309,8.58415379308226e-15
1634206568043,-86.63207244873047,-12.489396095275879,7.284309823746804e-15
1634206568045,-86.63207244873047,-12.489398002624512,5.084573875640646e-15
1634206568079,-86.7547607421875,-12.485508918762207,5.68450186148778e-15
1634206568081,-86.75476837158203,-12.485518455505371,5.484525866205402e-15
1634206568083,-86.75476837158203,-12.485527038574219,7.384297821387993e-15
1634206568085,-86.7547378540039,-12.485523223876953,5.984465854411347e-15
1634206568087,-86.75469970703125,-12.485518455505371,8.68414179072345e-15
1634206568089,-86.7432632446289,-12.516714096069336,1.0983865736470796e-14
1634206568091,-86.74429321289062,-12.513812065124512,1.1083853734111985e-14
1634206568093,-86.74303436279297,-12.517171859741211,1.2183721708165064e-14
1634206568095,-86.74327087402344,-12.516566276550293,1.2183721708165064e-14
1634206568097,-86.74252319335938,-12.518638610839844,1.2183721708165064e-14
1634206568099,-86.74343872070312,-12.516180038452148,1.168378171995912e-14
1634206568101,-86.74300384521484,-12.517393112182617,1.158379372231793e-14
1634206568103,-86.7431411743164,-12.517057418823242,1.1383817727035552e-14
1634206568105,-86.74283599853516,-12.517965316772461,9.983985760058907e-15
1634206568107,-86.74272155761719,-12.518328666687012,1.2083733710523875e-14
1634206568109,-86.7419204711914,-12.520586967468262,1.2783649694012198e-14
1634206568111,-86.7414321899414,-12.52199935913086,1.3783529670424088e-14
1634206568113,-86.74019622802734,-12.525436401367188,1.3683541672782899e-14
1634206568115,-86.74127960205078,-12.52233600616455,1.168378171995912e-14
1634206568117,-86.74078369140625,-12.523541450500488,9.68402176713534e-15
I don't know calculus, so I cannot help you with the integration, but here's how to parse that file and get a start on making sense of the data.
I'm using Python's standard csv module. It has a reader class which lets you iterate through all the rows of a CSV. There are two ways to step through the rows, and my code uses both:
- for row in reader: this automatically advances the reader; the reader signals when it has run out of rows and the loop exits
- next(reader): this "manually" advances the reader one row
import csv

# Will look like:
# [
#     [
#         ['2021-10-15T16:49:30.059Z', '16', '0', 'medium'],
#         [
#             118 data-point rows (all values are strings)
#         ]
#     ],
#     next event
# ]
data = []

with open('sample.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        header_row = row                      # first line of event
        data_points = int(next(reader)[0])    # second line, single integer count

        data_rows = []
        while len(data_rows) < data_points:   # loop through the right amount of data points
            try:
                data_row = next(reader)
            except StopIteration:
                print(f'error: for {header_row}, expected {data_points} data-point rows, but event only has {len(data_rows)} data-point rows')
                break
            data_rows.append(data_row)

        data.append([header_row, data_rows])  # glue it together

# Print out a sanity check to make sure that all worked
for datum in data:
    header_row = datum[0]
    data_rows = datum[1]
    event_date = header_row[0]
    print(f'Event on {event_date} has {len(data_rows)} data points')
I'm using the for-loop to control reading the file as a sequence of events:
- In each section I start off with what I'm calling the "header row".
- Next I advance one row and parse the number of data points for this event.
- With the number of data points, I manually advance through that many data-point rows.
- At the bottom of the loop I group the header row and all the accumulated data points into one list and append it to data.
All of that concludes a single event, and the for-loop does it all again for the next event, until there are no more events.
When I run that with your sample data, I get:
Event on 2021-10-15T16:49:30.059Z has 118 data points
Event on 2021-10-15T16:28:56.018Z has 53 data points
With that in place, you can now loop over events, and for each event loop over the data points, integrating along the way.
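For the integration itself, here's a minimal sketch (my addition, not from the answer above). If "integrating the light curve" means a trapezoidal integral of energy over time, or simply a sum of the energy column, NumPy can do it directly on the parsed rows. The column indices (0 = time in ms, 3 = energy) and the ms-to-seconds conversion are assumptions based on the sample data:
import numpy as np

# For each parsed event, integrate the energy column.
# Assumes `data` was built by the parsing loop above:
#   data[i] == [header_row, data_rows], data_rows == [[time_ms, lat, lon, energy], ...]
for header_row, data_rows in data:
    t = np.array([float(r[0]) for r in data_rows]) / 1000.0   # ms -> s (assumption)
    e = np.array([float(r[3]) for r in data_rows])
    total_energy = np.trapz(e, t)    # trapezoidal integral over time
    # total_energy = e.sum()         # or a plain sum, if that is what "total energy" means
    print(f'{header_row[0]}: integrated energy = {total_energy:.3e}')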
Update 1
An event has multiple sets of data-point rows. I think the structure of my code still stands.
The main loop processes an "event"; inside each event you know there is a 1-column row that signals how many rows of data points follow.
I recommend taking a step back: make a much smaller example set of data that you can take in at a glance (maybe 20 rows tops) and that has all of these characteristics. Write down on paper what your algorithm is.
Something like: "I start at row 1, it must be an event header. I move to row 2, it must be the count of data-point rows that follow. I move on and process that many rows. I move on to the next row: is it single-column (a count of data-point rows to follow)? If so, run that inner data-point loop again. If not, this event is done, I go back to the top of the main loop for the next event."
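As a rough illustration of that algorithm (an untested sketch, not part of the original answer): read all rows up front, then treat every single-column row as another data-point count for the current event, and anything else as the next event's header:
import csv

# Sketch: handle events whose data points come in multiple count/data blocks
# (e.g. detections by both instruments 16 and 17).
events = []
with open('sample.csv', newline='') as f:
    rows = list(csv.reader(f))

i = 0
while i < len(rows):
    if not rows[i]:                 # skip any blank lines
        i += 1
        continue
    header_row = rows[i]            # event header (date, instrument flags, confidence)
    i += 1
    data_rows = []
    # one or more blocks: a single-column count followed by that many data rows
    while i < len(rows) and len(rows[i]) == 1:
        count = int(rows[i][0])
        i += 1
        data_rows.extend(rows[i:i + count])
        i += count
    events.append([header_row, data_rows])

for header_row, data_rows in events:
    print(f'Event on {header_row[0]} has {len(data_rows)} data points')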

Hadoop stuck on reduce 67% (only with large data)

I'm a beginner at Hadoop and Linux.
The Problem
Hadoop reduce gets stuck (or moves really, really slowly) when the input data is large (e.g. 600k rows or 6M rows), even though the Map and Reduce functions are quite simple: 2021-08-08 22:53:12,350 INFO mapreduce.Job: map 100% reduce 67%.
In the Linux System Monitor I can see that when reduce hits 67%, only one CPU keeps running at 100% and the rest of them are sleeping :) see this picture
What ran successfully
I ran the MapReduce job with small input data (600 rows) quickly and successfully, without any issue reaching map 100% reduce 100%: 2021-08-08 19:44:13,350 INFO mapreduce.Job: map 100% reduce 100%.
Mapper (Python)
#!/usr/bin/env python3
import sys
from itertools import islice
from operator import itemgetter

def read_input(file):
    # read file except first line
    for line in islice(file, 1, None):
        # split the line into words
        yield line.split(',')

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # for each row we take only the needed columns
        data_row = list(itemgetter(*[1, 2, 4, 5, 6, 9, 10, 18])(words))
        data_row[7] = data_row[7].replace('\n', '')
        # taking year and month No. from the first column to create the
        # key that will be sent to the reducer
        date = data_row[0].split(' ')[0].split('-')
        key = str(date[0]) + '_' + str(date[1])
        # value that will be sent to the reducer
        value = ','.join(data_row)
        # print here will send the output pair (key, value)
        print('%s%s%s' % (key, separator, value))

if __name__ == "__main__":
    main()
Reducer (Python)
#!/usr/bin/env python3
from itertools import groupby
from operator import itemgetter
import sys
import pandas as pd
import numpy as np
import time

def read_mapper_output(file):
    for line in file:
        yield line

def main(separator='\t'):
    all_rows_2015 = []
    all_rows_2016 = []
    start_time = time.time()
    names = ['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_distance',
             'pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
             'dropoff_latitude', 'total_amount']
    df = pd.DataFrame(columns=names)

    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin)
    for words in data:
        # get key & value from Mapper
        key, value = words.split(separator)
        row = value.split(',')
        # split data belonging to 2015 from data belonging to 2016
        if key in '2015_01 2015_02 2015_03':
            all_rows_2015.append(row)
            if len(all_rows_2015) >= 10:
                df = df.append(pd.DataFrame(all_rows_2015, columns=names))
                all_rows_2015 = []
        elif key in '2016_01 2016_02 2016_03':
            all_rows_2016.append(row)
            if len(all_rows_2016) >= 10:
                df = df.append(pd.DataFrame(all_rows_2016, columns=names))
                all_rows_2016 = []

    print(df.to_string())
    print("--- %s seconds ---" % (time.time() - start_time))

if __name__ == "__main__":
    main()
More Info
I'm using Hadoop v3.2.1 on Linux installed on VMware to run MapReduce job in Python.
Reduce Job in Numbers:

Input Data Size | Number of rows  | Reduce job time      | Result
~98 KB          | 600             | ~0.1 sec             | good
~953 KB         | 6,000           | ~1 sec               | good
~9.5 MB         | 60,000          | ~52 sec              | good
~94 MB          | 600,000         | ~5647 sec (~94 min)  | very slow
~11 GB          | 76,000,000      | ??                   | impossible
The goal is to run this on ~76M rows of input data, which is impossible while this issue remains.
"when reduce hit the 67% only one CPU keep running at the time at 100% and the rest of them are sleeping" - you have skew. One key has far more values than any other key.
I see some problems here.
In the reduce phase you don't do any summarization, you just filter Q1 of 2015 and Q1 of 2016 - reduce is meant for summarization, like grouping by key or doing calculations based on the keys.
If you just need to filter data, do it in the map phase to save cycles (assume you're billed for all data).
You store a lot of stuff in RAM inside a DataFrame. Since you don't know how large each key is, you end up thrashing. Combined with heavy keys, this will make your process take a page fault on every DataFrame.append after some time.
There are some fixes:
Do you really need a reduce phase? Since you are just filtering the first three months of 2015 and 2016, you can do this in the Map phase. It will also make the process a bit faster if you do need a reduce later, since less data will reach the reduce phase.
def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # for each row we take only the needed columns
        data_row = list(itemgetter(*[1, 2, 4, 5, 6, 9, 10, 18])(words))
        # Find out first if you are keeping this data.
        # Taking year and month No. from the first column to create the
        # key that will be sent to the reducer
        date = data_row[0].split(' ')[0].split('-')
        # Filter out everything except Jan-Mar of 2015 and 2016
        # (note: split() produces strings, so compare against strings)
        if date[1] in ('01', '02', '03') and date[0] in ('2015', '2016'):
            # We keep this data. Calculate the key and clean up data_row[7]
            key = str(date[0]) + '_' + str(date[1])
            data_row[7] = data_row[7].replace('\n', '')
            # value that will be sent to the reducer
            value = ','.join(data_row)
            # print here will send the output pair (key, value)
            print('%s%s%s' % (key, separator, value))
Try not to store data in memory during the reduce. Since you are filtering, print() each result as soon as you have it. If your source data is not sorted, the reduce will still serve as a way to get all data from the same month together.
You also have a bug in your reduce phase: you lose number_of_records_per_key modulo 10, because rows left over in the buffers never get appended to the DataFrame. Don't append to the DataFrame at all; print each result as soon as possible.
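A minimal sketch of that idea (an illustration, not the poster's code): stream rows from stdin and emit each kept row as soon as it arrives, instead of accumulating a DataFrame:
#!/usr/bin/env python3
# Sketch of a reducer that streams instead of buffering in a DataFrame.
# Assumes the mapper already filtered to the wanted months and emits
# key<TAB>comma-separated-row lines, as in the mapper sketch above.
import sys

def main(separator='\t'):
    for line in sys.stdin:
        key, value = line.rstrip('\n').split(separator, 1)
        # any per-key aggregation would go here; for a pure filter,
        # just forward the row immediately
        print(f'{key}{separator}{value}')

if __name__ == '__main__':
    main()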

Can I speed up my reading and processing of many .csv files in python?

I am currently occupied with a dataset consisting of 90 .csv files. There are three types of .csv files (30 of each type).
Each CSV has from 20k to 30k rows on average and 3 columns (timestamp in Linux format, integer, integer).
Here's an example of the header and a row:
Timestamp id1 id2
151341342 324 112
I am currently using 'os' to list all files in the directory.
The process for each CSV file is as follows:
1. Read it through pandas into a dataframe.
2. Iterate the rows of the file and, for each row, convert the timestamp to a readable format.
3. Use the converted timestamp and the integers to create a relationship-type object and add it to a list of relationships.
4. The list will later be looped over to create the relationships in my neo4j database.
The problem I am having is that the process takes too much time. I have asked and searched for ways to do it faster (I got answers like PySpark and threads) but did not find something that really fits my needs. I am really stuck: with my resources it takes around 1 hour and 20 minutes to do all of that processing for one of the big .csv files (one with around 30k rows).
Converting to readable format:
ts = int(row['Timestamp'])
formatted_ts = datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
And I pass the parameters to the Relationship function of py2neo to create my relationships. Later that list will be looped over.
node1 = graph.evaluate('MATCH (n:User) WHERE n.id={id} RETURN n', id=int(row["id1"]))
node2 = graph.evaluate('MATCH (n:User) WHERE n.id={id} RETURN n', id=int(row['id2']))
rels.append(Relationship(node1, rel_type, node2, date=date, time=time))
time to compute row: 0:00:00.001000
time to create relationship: 0:00:00.169622
time to compute row: 0:00:00.001002
time to create relationship: 0:00:00.166384
time to compute row: 0:00:00
time to create relationship: 0:00:00.173672
time to compute row: 0:00:00
time to create relationship: 0:00:00.171142
I calculated the time for the two parts of the process as shown above. It is fast, and there really does not seem to be a problem except the size of the files. That is why the only thing that comes to mind is that parallelism would help to process those files faster (by computing, let's say, 4 files at the same time instead of one).
sorry for not posting everything
I am really looking forward to replies.
Thank you in advance
That sounds fishy to me. Processing csv files of that size should not be that slow.
I just generated a 30k-line CSV file of the type you described (3 columns filled with random numbers of the sizes you specified).
import random

with open("file.csv", "w") as fid:
    fid.write("Timestamp;id1;id2\n")
    for i in range(30000):
        ts = int(random.random()*1000000000)
        id1 = int(random.random()*1000)
        id2 = int(random.random()*1000)
        fid.write("{};{};{}\n".format(ts, id1, id2))
Just reading the csv file into a list using plain Python takes well under a second. Printing all the data takes about 3 seconds.
from datetime import datetime

def convert_date(string):
    ts = int(string)
    formatted_ts = datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
    split_ts = formatted_ts.split()
    date = split_ts[0]
    time = split_ts[1]
    return date

with open("file.csv", "r") as fid:
    header = fid.readline()
    lines = []
    for line in fid.readlines():
        line_split = line.strip().split(";")
        line_split[0] = convert_date(line_split[0])
        lines.append(line_split)

for line in lines:
    print(line)
Could you elaborate what you do after reading the data? Especially "create a relationship-type of object and add it on a list of relationships"
That could help pinpoint your timing issue. Maybe there is a bug somewhere?
You could try timing different parts of your code to see which one takes the longest.
Generally, what you describe should be possible within seconds, not hours.
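One concrete way to do that timing (a sketch; the wrapped calls in the comments are only hypothetical examples using names from the question): accumulate the time spent in each stage over a whole file, so the slow stage stands out instead of looking negligible per row:
import time
from collections import defaultdict

totals = defaultdict(float)

def timed(stage, func, *args, **kwargs):
    # run func and add its elapsed time to the per-stage total
    t0 = time.perf_counter()
    result = func(*args, **kwargs)
    totals[stage] += time.perf_counter() - t0
    return result

# Inside the per-row loop you could wrap each step, e.g. (hypothetical usage):
#   formatted_ts = timed('convert', convert_date, row['Timestamp'])
#   node1 = timed('lookup', graph.evaluate, 'MATCH (n:User) WHERE n.id={id} RETURN n', id=int(row['id1']))
#   rels.append(timed('build', Relationship, node1, rel_type, node2))
# Tiny self-contained demo:
print(timed('demo', sum, range(1000)))
print(dict(totals))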

accessing many grib messages at once with pygrib

I need to access some grib files. I have figured out how to do it, using pygrib.
However, the only way I figured out how to do it is painstakingly slow.
I have 34 years of 3-hourly data, organized in ~36 files per year (one every 10 days, more or less), for a total of about 1000 files.
Each file has ~80 “messages” (8 values per day for 10 days). They are spatial data, so they have (x, y) dimensions.
To read all my data I write:
grbfile = pygrib.index(filename, 'shortName', 'typeOfLevel', 'level')
var1 = grbfile.select(typeOfLevel='pressureFromGroundLayer', level=180, shortName='unknown')
for it in np.arange(len(var1)):
    var_values, lat1, lon1 = var1[it].data()
    if (it == 0):
        tot_var = np.expand_dims(var_values, axis=0)
    else:
        tot_var = np.append(tot_var, np.expand_dims(var_values, axis=0), axis=0)
and repeat this for each of the 1000 files.
Is there a quicker way, like loading all the ~80 layers of a grib file at once? Something like:
var_values, lat1, lon1 = var1[:].data()
If I understand you correctly, you want the data from all 80 messages in each file stacked up in one array.
I have to warn you, that that array will get very large, and may cause NumPy to throw a MemoryError (happened to me before) depending on your grid size etc.
That being said, you can do something like this:
# substitute with a list of your file names;
# glob is a builtin library that can help accomplish this
files = list_of_files

grib = pygrib.open(files[0])  # start with the first one
# grib message numbering starts at 1
data, lats, lons = grib.message(1).data()
# while np.expand_dims works, the following is shorter
# syntax-wise and will accomplish the same thing
data = data[None, ...]  # add an empty dimension as axis 0
for m in range(2, grib.messages + 1):
    data = np.vstack((data, grib.message(m).values[None, ...]))
grib.close()  # good practice

# now data has all the values from each message in the first file stacked up
# time to stack the rest on there
for file_ in files[1:]:  # all except the first file, which we've done
    grib = pygrib.open(file_)
    for msg in grib:
        data = np.vstack((data, msg.values[None, ...]))
    grib.close()

print(data.shape)  # should be (80 * len(files), nlats, nlons)
This may gain you some speed. pygrib.open objects act like generators, so they pass you each pygrib.gribmessage object as it's called for instead of building a list of them like the select() method of a pygrib.index does. If you need all the messages in a particular file then this is the way I would access them.
Hope it helps!
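One small variation worth trying (my suggestion, not part of the answer above): np.vstack re-copies the growing array on every call, so for tens of thousands of messages it is usually faster to collect the per-message arrays in a list and stack once at the end:
import numpy as np
import pygrib

# Variation: gather message arrays in a list, stack once at the end.
files = list_of_files   # same list of GRIB file names as above
arrays = []
for file_ in files:
    grib = pygrib.open(file_)
    for msg in grib:
        arrays.append(msg.values)      # 2-D array per message
    grib.close()

data = np.stack(arrays, axis=0)        # shape: (total_messages, nlats, nlons)
print(data.shape)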

Pandas Frame Iteration : Efficiency

I have 2 separate datasets
Dataset 1: a database of items that were loaded and when they were loaded. It looks something like this:
Loaded DataSet
Dataset 2: a database of items that were unloaded and when they were unloaded. It looks exactly like the dataset above.
hse_time is of the format "2016-01-07 19:38:56", i.e. "YYYY-mm-dd HH:MM:SS".
Now my task is to tag each loaded item with the corresponding unloaded time, the number of times it was loaded, the number of times it was unloaded, and its current status [Loaded or Unloaded].
The dataset follows these rules:
An item can be loaded and unloaded multiple times.
Since this is for a particular time frame, an item can be unloaded before it is loaded within the dataset (for example, I am analysing JFM'16 data; it is possible that an item was loaded in Dec'15 but unloaded in Jan'16).
Either the number of times loaded equals the number of times unloaded, or the number of times loaded equals the number of times unloaded + 1.
Now I have written an algorithm that satisfies all the conditions and tags everything I need, but the problem is that I have a dataset of 400K rows and it takes forever to run, since I have to iterate over each row in the loaded frame.
Is there any other way to do this so that I can reduce my run time?
Here is my Code:
Loaded_Frame = Loaded_Frame.sort_values(by=["BranchID", "hse_time"], ascending=True)
Unloaded_Frame = Unloaded_Frame.sort_values(by=["BranchID", "hse_time"], ascending=True)

Grouped = Loaded_Frame.groupby(["BranchID", "Item Name"]).agg({"weight": "count"}).reset_index()
Grouped.rename(columns={"weight": "LoadedCount"}, inplace=True)

temp_frame = Unloaded_Frame.groupby(["BranchID", "Item Name"]).agg({"weight": "count"}).reset_index()
temp_frame.rename(columns={"weight": "UnLoadedCount"}, inplace=True)

Grouped = Grouped.merge(temp_frame, on=["BranchID", "Item Name"], how="outer")
Grouped["UnLoadedCount"] = Grouped["UnLoadedCount"].fillna(0)
Grouped["LoadedCount"] = Grouped["LoadedCount"].fillna(0)
The Main Logic
import numpy as np

Final_Frame = Loaded_Frame.copy()
Final_Frame["Multiple Loads"] = np.nan
Final_Frame["Number of times Loaded"] = np.nan
Final_Frame["Number of times UnLoaded"] = np.nan
Final_Frame["UnLoaded Date"] = np.nan
Final_Frame["Load Status"] = np.nan

for i in Grouped.index:
    x = Unloaded_Frame[(Unloaded_Frame["BranchID"] == Grouped.loc[i, "BranchID"])
                       & (Unloaded_Frame["Item Name"] == Grouped.loc[i, "Item Name"])].reset_index()
    y = Loaded_Frame[(Loaded_Frame["BranchID"] == Grouped.loc[i, "BranchID"])
                     & (Loaded_Frame["Item Name"] == Grouped.loc[i, "Item Name"])].reset_index()
    Loaded_Count = y["BranchID"].count()
    UnLoaded_Count = x["BranchID"].count()
    if Loaded_Count == UnLoaded_Count:  # condition where both are equal
        Multiple_Load = False
        if Loaded_Count > 1:
            Multiple_Load = True
        else:
            Multiple_Load = False
        for j in y.index:
            Final_Frame.loc[((Final_Frame["BranchID"] == y.loc[j, "BranchID"])
                             & (Final_Frame["Item Name"] == y.loc[j, "Item Name"])
                             & (Final_Frame["hse_time"] == y.loc[j, "hse_time"])),
                            ["Multiple Loads", "Number of times Loaded", "Number of times UnLoaded",
                             "UnLoaded Date", "Load Status"]] \
                = [Multiple_Load, Loaded_Count, UnLoaded_Count, x.loc[j, "hse_time"], "Unloaded"]
The problem is that when I run this code, it takes a lot of time iterating over 400K records.
