I need to access some grib files. I have figured out how to do it, using pygrib.
However, the only way I figured out how to do it is painstakingly slow.
I have 34 years of 3-hourly data, organized into ~36 files per year (roughly one file every 10 days), for a total of about 1000 files.
Each file has ~80 "messages" (8 values per day for 10 days). They are spatial data, so each message has (x, y) dimensions.
To read all my data I write:
import numpy as np
import pygrib

grbfile = pygrib.index(filename, 'shortName', 'typeOfLevel', 'level')
var1 = grbfile.select(typeOfLevel='pressureFromGroundLayer', level=180, shortName='unknown')
for it in np.arange(len(var1)):
    var_values, lat1, lon1 = var1[it].data()
    if it == 0:
        tot_var = np.expand_dims(var_values, axis=0)
    else:
        tot_var = np.append(tot_var, np.expand_dims(var_values, axis=0), axis=0)
and repeat this for each of the 1000 files.
Is there a quicker way, like loading all ~80 layers of a grib file at once? Something like:
var_values, lat1, lon1 = var1[:].data()
If I understand you correctly, you want the data from all 80 messages in each file stacked up in one array.
I have to warn you that that array will get very large and may cause NumPy to throw a MemoryError (it has happened to me before), depending on your grid size etc.
That being said, you can do something like this:
# substitute with a list of your file names;
# glob is a built-in library that can help accomplish this
files = list_of_files

grib = pygrib.open(files[0])  # start with the first one
# grib message numbering starts at 1
data, lats, lons = grib.message(1).data()
# while np.expand_dims works, the following is shorter
# syntax-wise and will accomplish the same thing
data = data[None, ...]  # add an empty dimension as axis 0
for m in range(2, grib.messages + 1):
    data = np.vstack((data, grib.message(m).values[None, ...]))
grib.close()  # good practice

# now data has all the values from each message in the first file stacked up;
# time to stack the rest on there
for file_ in files[1:]:  # all except the first file, which we've done
    grib = pygrib.open(file_)
    for msg in grib:
        data = np.vstack((data, msg.values[None, ...]))
    grib.close()

print(data.shape)  # should be (80 * len(files), nlats, nlons)
This may gain you some speed. pygrib.open objects act like generators: they hand you each pygrib.gribmessage object as it is requested, instead of building a list of them the way the select() method of a pygrib.index does. If you need all the messages in a particular file, this is the way I would access them.
Hope it helps!
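One more note on speed: repeated np.vstack copies the whole array on every iteration, so the stacking itself gets slower as the array grows. Below is a minimal sketch of a preallocation alternative (not part of the original answer), assuming every file has the same number of messages and the same grid, which is not guaranteed in general:
import numpy as np
import pygrib

# Peek at the first file for the grid shape and message count
# (assumes all files share the same message count and grid)
grb = pygrib.open(files[0])
nmsg = grb.messages
nlats, nlons = grb.message(1).values.shape
grb.close()

# Preallocate once, then fill in place instead of re-copying with np.vstack
data = np.empty((nmsg * len(files), nlats, nlons))

i = 0
for file_ in files:
    grb = pygrib.open(file_)
    for msg in grb:
        data[i] = msg.values
        i += 1
    grb.close()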
I currently run a simulation several times and want to save the results of these simulations so that they can be used for visualizations.
The simulation is run 100 times, and each run generates about 1 million data points (i.e. one value for each of 1 million episodes), which I now want to store efficiently. The goal is then, for each episode, to average the value across all 100 simulations.
My main file looks like this:
# Defining the test simulation environment
def test_simulation():
    environment = environment(
        periods=1000000,
        parameter_x=...,
        parameter_y=...,
    )

    # Running the simulation
    environment.simulation()

    # Save simulation data
    hf = h5py.File('runs/simulation_runs.h5', 'a')
    hf.create_dataset('data', data=environment.value_history, compression='gzip', chunks=True)
    hf.close()

# Run the simulation 100 times
for i in range(100):
    print(f'--- Iteration {i} ---')
    test_simulation()
The value_history is generated within simulation(), i.e. the values are continuously appended to an empty list:
def simulation(self):
    for episode in range(periods):
        value = doSomething()
        self.value_history.append(value)
Now I get the following error message when going to the next simulation:
ValueError: Unable to create dataset (name already exists)
I am aware that the current code keeps trying to create a dataset with the same name and generates an error because that dataset already exists. Now I am looking to reopen the file created in the first simulation, append the data from the next simulation, and save it again.
The example below shows how to pull all these ideas together. It creates 2 files:
1. Create one resizable dataset with the maxshape parameter on the first loop, then use dataset.resize() on subsequent loops -- output is simulation_runs1.h5
2. Create a unique dataset for each simulation -- output is simulation_runs2.h5
I created a simple 100x100 NumPy array of randoms for the "simulation data", and ran the simulation 10 times. They are variables, so you can increase to larger values to determine which method is better (faster) for your data. You may also discover memory limitations saving 1M data points for 1M time periods.
Note 1: If you can't save all the data in system memory, you can incrementally save simulation results to the H5 file. It's just a little more complicated.
Note 2: I added a mode variable to control whether a new file is created for the first simulation (i==0) or the existing file is opened in append mode for subsequent simulations.
import h5py
import numpy as np

# Create some pseudo-test data
def test_simulation(i):
    periods = 100
    times = 100

    # Define the simulation with some random data
    val_hist = np.random.random(periods*times).reshape(periods, times)
    a0, a1 = val_hist.shape[0], val_hist.shape[1]

    if i == 0:
        mode = 'w'
    else:
        mode = 'a'

    # Save simulation data (resize dataset)
    with h5py.File('runs/simulation_runs1.h5', mode) as hf:
        if 'data' not in list(hf.keys()):
            print('create new dataset')
            hf.create_dataset('data', shape=(1, a0, a1), maxshape=(None, a0, a1), data=val_hist,
                              compression='gzip', chunks=True)
        else:
            print('resize existing dataset')
            d0 = hf['data'].shape[0]
            hf['data'].resize((d0+1, a0, a1))
            hf['data'][d0:d0+1, :, :] = val_hist

    # Save simulation data (unique datasets)
    with h5py.File('runs/simulation_runs2.h5', mode) as hf:
        hf.create_dataset(f'data_{i:03}', data=val_hist,
                          compression='gzip', chunks=True)

# Run the simulation 10 times
for i in range(10):
    print(f'--- Iteration {i} ---')
    test_simulation(i)
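To get the per-episode average across all simulations (the stated goal in the question), the stacked layout of simulation_runs1.h5 makes it a single reduction; a small sketch, reading back the file written above:
import h5py
import numpy as np

with h5py.File('runs/simulation_runs1.h5', 'r') as hf:
    all_runs = hf['data'][:]              # shape: (n_simulations, periods, times)

episode_mean = all_runs.mean(axis=0)      # average each value across the simulations
print(episode_mean.shape)                 # (periods, times)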
Question moved to CodeReview: https://codereview.stackexchange.com/questions/257465/can-i-optimize-two-for-loops-that-look-for-the-closest-zip-code-based-on-lat-lon
I am new to Python and I had the task of finding the US zip code for a given latitude and longitude. After messing with arcgis I realized it was giving me empty values for certain locations. I ended up writing something that accomplishes the task by taking a dataset containing all US zip codes and using Euclidean distance to determine the closest zip code based on lat/lon. However, this takes approximately 1.3 seconds per record on average, which for my nearly one million records will take a while, since I need a zip code for each entry. I read that vectorization is a way to speed things up in Python, but I cannot find a way to apply it to my code. Here is my code; any feedback would be appreciated:
for j in range(len(myFile)):
    p1 = 0
    p1 = 0
    point1 = np.array((myFile["Latitude"][j], myFile["Longitude"][j]))  # This is the reference point
    i = 0
    resultZip = str(usZips["Zip"][0])
    dist = np.linalg.norm(point1 - np.array((float(usZips["Latitude"][0]), float(usZips["Longitude"][0]))))
    for i in range(0, len(usZips)):
        lat = float(usZips["Latitude"][i])
        lon = float(usZips["Longitude"][i])
        point2 = np.array((lat, lon))  # This will serve as the comparison from the dataset
        temp = np.linalg.norm(point1 - point2)
        if temp <= dist:  # if the temp euclidean distance is lower than the already set one, it will:
            dist = temp   # set the new distance to temp and...
            resultZip = str(usZips["Zip"][i])  # save the zip that has the same index as the new temp
        # p1 = float(myFile["Latitude"][58435])
        # p2 = float(myFile["Longitude"][58435])
        i += 1
I am aware Google also has a reverse geocoder API but it has a request limit per day.
The file called myFile is a csv file with the attributes userId, latitude, longitude, and timestamp, with about a million entries. The file usZips is a public dataset with information about the city, lat, lon, zip, and timezone, with about 43k zip records across the US.
I don't know what your myFile and usZips look like (I cannot verify the code). So, try something like this in the framework of vectorization:
your_needed_dist = 10  # for example
lat = float(usZips["Latitude"][0])
lon = float(usZips["Longitude"][0])
lat0 = np.array(myFile["Latitude"])
lon0 = np.array(myFile["Longitude"])
dist = np.sqrt((lat - lat0)**2 + (lon - lon0)**2)
condition = dist <= your_needed_dist
# get index (or indices) that satisfy dist <= your_needed_dist
np.argwhere(condition)
# or
resultsZip = str(usZips["Zip"][condition])
Also check the definition of distance in my code (whether this is what you need or not).
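For the actual lookup (one nearest zip per record), a fully vectorized alternative, not part of the answer above, is to build a k-d tree over the ~43k zip coordinates and query it once for all records. A rough sketch using scipy.spatial.cKDTree, assuming the column names from the question; like the original loop it measures plain Euclidean distance on lat/lon degrees, which is only approximate:
import numpy as np
from scipy.spatial import cKDTree

# One (lat, lon) row per zip code (~43k) and per record (~1M)
zip_coords = np.column_stack((usZips["Latitude"].astype(float),
                              usZips["Longitude"].astype(float)))
rec_coords = np.column_stack((myFile["Latitude"], myFile["Longitude"]))

tree = cKDTree(zip_coords)
dists, nearest_idx = tree.query(rec_coords, k=1)   # index of the closest zip for every record

myFile["Zip"] = usZips["Zip"].to_numpy()[nearest_idx]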
I've been going in circles for days now, and I've run out of steam. Doesn't help that I'm new to python / numpy / pandas etc.
I started with numpy which led me to pandas, because of a GIS function that delivers a numpy array of data. That is my starting point. I'm trying to get to an endpoint being a small enriched dataset, in an excel spreadsheet.
But it seems like going down a rabbit hole trying to extract that data and then manipulate it with the numpy toolsets. The delivered data is one-dimensional, but each row contains 8 fields. A simple conversion to pandas and then to ndarray magically makes it all good, except that I lose the headers in the process, and it just snowballs from there.
I've had to re-evaluate my understanding, based on some feedback on another post, and that's fine. But I'm just going in circles. Example after example seems to use predominantly numerical data, and I'm starting to get the feeling that's where its strength lies. Trying to use it for what I call a more non-mathematical/numerical purpose looks like I'm barking up the wrong tree.
Any advice?
Addendum
The data I extract from the GIS system is names, dates, other textual data. I then have another csv file that I need to use as a lookup, so that I can enrich the source with more textual information which finally gets published to excel.
SAMPLE DATA - SOURCE
WorkCode Status WorkName StartDate EndDate siteType Supplier
0 AT-W34319 None Second building 2020-05-04 2020-05-31 Type A Acem 1
1 AT-W67713 None Left of the red office tower 2019-02-11 2020-08-28 Type B Quester Q
2 AT-W68713 None 12 main street 2019-05-23 2020-11-03 Class 1 Type B Dettlim Group
3 AT-W70105 None city central 2019-03-07 2021-08-06 Other Hans Int
4 AT-W73855 None top floor 2019-05-06 2020-10-28 Type a None
SAMPLE DATA - CSV
["Id", "Version","Utility/Principal","Principal Contractor Contact"]
XM-N33463,7.1,"A Contracting company", "555-12345"
XM-N33211,2.1,"Contractor #b", "555-12345"
XM-N33225,1.3,"That other contractor", "555-12345"
XM-N58755,1.0,"v Contracting", "555-12345"
XM-N58755,2.3,"dsContracting", "555-12345"
XM-222222,2.3,"dsContracting", "555-12345"
BM-O33343,2.1,"dsContracting", "555-12345"
import arcpy
import numpy
import pandas

def SMAN():
    ####################################################################################################################
    # Exporting the results of the analysis...
    ####################################################################################################################
    """
    Approach is as follows:
    1) Get the source data
    2) Get the CSV lookup data loaded into memory - it'll be faster
    3) Iterate through the source data, looking for matches in the CSV data
    4) Add an extra couple of columns onto the source data, and populate them with the (matching) lookup data.
    5) Export the now enhanced data to excel.
    """
    arcpy.env.workspace = workspace + filenameGDB
    input = "ApprovedActivityByLocalBoard"
    exportFile = arcpy.da.FeatureClassToNumPyArray(input, ['WorkCode', 'Status', 'WorkName', 'PSN2', 'StartDate', 'EndDate', 'siteType', 'Supplier'])
    # we have our data, but it's (9893,) instead of [9893 rows x 8 columns]
    pdExportFile = pandas.DataFrame(exportFile)
    LBW = pdExportFile.to_numpy()
    del exportFile
    del pdExportFile
    # Now we have [9893 rows x 8 columns] - but we've lost the headers
    col_list = ["WorkCode", "Version", "Principal", "Contact"]
    allPermits = pandas.read_csv("lookup.csv", usecols=col_list)
    # Now we have the CSV file loaded ... and only the important parts - should be fast.
    # Shape: (94523, 4)
    # will have to find a way to improve this...
    # CSV file has got more than WorkCode, because there are different versions (as different records)
    # Only want the last one.

    # each record must now be "enhanced" with the matching record from the CSV file.
    finalReport = []  # we are expecting this to be [9893 rows x 12 columns] at the end
    counter = -1
    for eachWorksite in LBW[:5]:  # let's just work with 5 records right now...
        counter += 1
        # eachWorksite = list(eachWorksite)  # eachWorksite is a tuple - so need to convert it
        # # but if we change it to a list, we lose the headers!
        certID = LBW[counter][0]  # get the ID to use for lookup matching
        # Search the CSV data
        permitsFound = allPermits[allPermits['Id'] == certID]
        permitsFound = permitsFound.to_numpy()
        if numpy.shape(permitsFound)[0] > 1:
            print("Too many hits!")  # got to deal with that CSV Version field.
            exit()
        else:
            # now "enrich" the record/row by adding on the fields from the lookup
            # so a row goes from 8 fields to 12 fields
            newline = numpy.append(eachWorksite, permitsFound)
            # and this enhanced record/row must become the new normal
            # but I cannot change the original, so it must go into a new container
            finalReport = numpy.append(finalReport, newline, axis=0)
    # now I should have a new container of "enriched" data
    # which has gone from [9893 rows x 8 columns] to [9893 rows x 12 columns]
    # Some of the columns, of course, could be empty.

    # Now let's dump the results to an Excel file and make it accessible for everyone else.
    df = pandas.DataFrame(finalReport)
    filepath = 'finalreport.csv'
    df.to_csv(filepath, index=False)
    # Somewhere I was getting Error("Cannot convert {0!r} to Excel".format(value))
    # Now I get
    filepath = 'finalReport.xlsx'
    df.to_excel(filepath, index=False)
I have eventually answered my own question, and this is how:
Yes, for my situation, pandas worked just fine, even beautifully, for manipulating non-numerical data. I just had to learn some basics.
The biggest learning was to understand the pandas DataFrame as an object that has to be manipulated remotely by various functions/tools. Just because I "print" the dataframe doesn't mean it's just text. (Thanks juanpa.arrivillaga for pointing out my erroneous assumptions in Why can I not reproduce a nd array manually?)
I also had to wrap my mind around the concept of indexes and columns, and how they could be altered/manipulated/ etc. And then, how to use them to maximum effect.
Once those fundamentals had been sorted, the rest followed naturally, and my code reduced to a couple of nice elegant functions.
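For illustration, here is a rough sketch of the kind of merge-based enrichment this boils down to; the column names are taken from the samples above and are assumptions, not the final code:
import pandas

# GIS output converted to a DataFrame (headers preserved), as in the original code
source = pandas.DataFrame(exportFile)

# Lookup CSV: keep only the highest Version per Id so each WorkCode matches at most one row
lookup = pandas.read_csv("lookup.csv")
lookup = lookup.sort_values("Version").drop_duplicates("Id", keep="last")

# Left join adds the lookup columns wherever Id matches WorkCode; unmatched rows keep NaN
enriched = source.merge(lookup, how="left", left_on="WorkCode", right_on="Id")

enriched.to_excel("finalReport.xlsx", index=False)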
Cheers
I am currently occupied with a dataset consisting of 90 .csv files. There are three types of .csv files (30 of each type).
Each csv has from 20k to 30k rows on average and 3 columns (timestamp in Linux format, integer, integer).
Here's an example of the header and a row:
Timestamp id1 id2
151341342 324 112
I am currently using 'os' to list all files in the directory.
The process for each CSV file is as follows:
Read it through pandas into a dataframe.
Iterate the rows of the file, and for each row convert the timestamp to a readable format.
Use the converted timestamp and the integers to create a relationship-type object and add it to a list of relationships.
The list will later be looped over to create the relationships in my neo4j database.
The problem I am having is that the process takes too much time. I have asked and searched for ways to do it faster (I got answers like PySpark and threads) but I did not find something that really fits my needs. I am really stuck, as with my resources it takes around 1 hour and 20 minutes to do all of that for one of the big .csv files (meaning one with around 30k rows).
Converting to readable format:
ts = int(row['Timestamp'])
formatted_ts = datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
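As a side note on this step, the whole column can be converted in one vectorized call instead of row by row; a small sketch, assuming the dataframe produced by the pandas read step is called df (an assumed name):
import pandas as pd

# Convert every Unix timestamp in the column in one call instead of row by row
df["Timestamp"] = pd.to_datetime(df["Timestamp"], unit="s").dt.strftime("%Y-%m-%d %H:%M:%S")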
And I pass the parameters to the Relationship function of py2neo to create my relationships. Later that list will be looped over.
node1 = graph.evaluate('MATCH (n:User) WHERE n.id={id} RETURN n', id=int(row["id1"]))
node2 = graph.evaluate('MATCH (n:User) WHERE n.id={id} RETURN n', id=int(row['id2']))
rels.append(Relationship(node1, rel_type, node2, date=date, time=time))
time to compute row: 0:00:00.001000
time to create relationship: 0:00:00.169622
time to compute row: 0:00:00.001002
time to create relationship: 0:00:00.166384
time to compute row: 0:00:00
time to create relationship: 0:00:00.173672
time to compute row: 0:00:00
time to create relationship: 0:00:00.171142
I calculated the time for the two parts of the process as shown above. It is fast, and there really seems to be no problem except the size of the files. This is why the only thing that comes to mind is that parallelism would help to process those files faster (by computing, let's say, 4 files at the same time instead of one).
Sorry for not posting everything.
I am really looking forward to replies.
Thank you in advance
That sounds fishy to me. Processing csv files of that size should not be that slow.
I just generated a 30k-line csv file of the type you described (3 columns filled with random numbers of the size you specified).
import random

with open("file.csv", "w") as fid:
    fid.write("Timestamp;id1;id2\n")
    for i in range(30000):
        ts = int(random.random()*1000000000)
        id1 = int(random.random()*1000)
        id2 = int(random.random()*1000)
        fid.write("{};{};{}\n".format(ts, id1, id2))
Just reading the csv file into a list using plain Python takes well under a second. Printing all the data takes about 3 seconds.
from datetime import datetime

def convert_date(string):
    ts = int(string)
    formatted_ts = datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
    split_ts = formatted_ts.split()
    date = split_ts[0]
    time = split_ts[1]
    return date

with open("file.csv", "r") as fid:
    header = fid.readline()
    lines = []
    for line in fid.readlines():
        line_split = line.strip().split(";")
        line_split[0] = convert_date(line_split[0])
        lines.append(line_split)

for line in lines:
    print(line)
Could you elaborate on what you do after reading the data? Especially "create a relationship-type of object and add it on a list of relationships".
That could help pinpoint your timing issue. Maybe there is a bug somewhere?
You could try timing different parts of your code to see which one takes the longest.
Generally, what you describe should be possible within seconds, not hours.
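Given the per-relationship timings in the question (~0.17 s each, dominated by the two MATCH lookups and one round trip per row), another common approach, not covered above, is to send rows to Neo4j in batches with a single parameterized Cypher statement. A rough sketch, assuming py2neo's graph.run, a recent Neo4j ($-style parameters), and an index on :User(id); the relationship type, property names, and the dataframe name df are placeholders:
# Placeholder sketch; adapt the relationship type, property names and batch size to your schema.
query = """
UNWIND $rows AS row
MATCH (a:User {id: row.id1}), (b:User {id: row.id2})
CREATE (a)-[:CONNECTED {ts: row.ts}]->(b)
"""

batch = []
for index, row in df.iterrows():      # df: the dataframe read from one CSV (assumed name)
    batch.append({"id1": int(row["id1"]),
                  "id2": int(row["id2"]),
                  "ts": int(row["Timestamp"])})
    if len(batch) == 1000:            # one round trip per 1000 rows instead of one per row
        graph.run(query, rows=batch)
        batch = []
if batch:
    graph.run(query, rows=batch)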
I am trying to import data from a .txt file that contains four columns separated by tabs and is several thousand lines long. This is what the start of the document looks like:
Data info
File name: D:\(path to file)
Start time: 6/26/2019 15:39:54.222
Number of channels: 3
Sample rate: 1E6
Store type: fast on trigger
Post time: 20
Global header information: from DEWESoft
Comments:
Events
Event Type Event Time Comment
1 storing started at 7.237599
2 storing stopped at 7.257599
Data1
Time Incidente Transmitida DI 6
s um/m um/m -
0 2.1690152 140.98599 1
1E-6 2.1690152 140.98599 1
2E-6 4.3380303 145.32402 1
3E-6 4.3380303 145.32402 1
4E-6 -2.1690152 145.32402 1
I have several of these files that I want to loop through and store in a cell/list, so that each cell/list item contains the four columns. After that I just use that cell/list to plot the data in a loop.
I saw that the pandas library is suitable, but I don't understand how to use it.
fileNames = (["Test1_0001.txt", "Test2_0000.txt", "Test3_0000.txt",
"Test4_0000.txt", "Test5_0000.txt", "Test6_0001.txt", "Test7_0000.txt",
"Test8_0000.txt", "Test9_0000.txt", "Test10_0000.txt", "RawblueMat_0000.txt"])
folderName = 'AuxeticsSHPB\\' #Source folder for all files above
# Loop trough each source document
for i in range(0,len(fileNames)):
print('File location: '+folderName+fileNames[i])
# Get data from source as arrays, cut out the first 20 lines
temp=pd.read_csv(folderName+fileNames[i], sep='\t', lineterminator='\r',
skiprows=[19], error_bad_lines=False)
# Store data in list/cell
# data[i] = temp # sort it
This is something I tried that didn't work; I don't really know how to proceed. I know there is some documentation on this problem, but I am new to this and need some help.
An error I get when trying the above:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 12, saw 4
So it was an easy fix: I just had to remove the brackets, changing skiprows=[19] (skip only row 19) to skiprows=19 (skip the first 19 rows).
The code now looks like this and works.
fileNames = ["Test1_0001.txt", "Test2_0000.txt", "Test3_0000.txt",
"Test4_0000.txt", "Test5_0000.txt", "Test6_0001.txt", "Test7_0000.txt",
"Test8_0000.txt", "Test9_0000.txt", "Test10_0000.txt", "RawblueMat_0000.txt"]
folderName = 'AuxeticsSHPB\\' #Source folder for all files above
# Preallocation
data = []
for i in range(0,len(fileNames)):
temp=pd.read_csv(folderName+fileNames[i], sep='\t', lineterminator='\r',
skiprows=19)
data.append(temp)
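As a follow-up to the plotting loop mentioned in the question, here is a minimal sketch using matplotlib; the column positions are assumptions based on the sample file shown above, not verified against the real headers:
import matplotlib.pyplot as plt

for i, df in enumerate(data):
    # first column is time, the next two are the measured signals
    plt.plot(df.iloc[:, 0], df.iloc[:, 1], label=fileNames[i] + ' Incidente')
    plt.plot(df.iloc[:, 0], df.iloc[:, 2], label=fileNames[i] + ' Transmitida')

plt.xlabel('Time (s)')
plt.ylabel('um/m')
plt.legend()
plt.show()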