Extracting ID and Relevant data from a csv dataset in python

I am making a program for my final-year project.
The program takes longitude and latitude coordinates from a .csv dataset and plots them on a map.
The issue I am having is that there are multiple IDs, which together total 445,000+ points.
How would I refine it so the program can differentiate between the IDs?
def create_image(self, color, width=2):
    # Creates an image that contains the Map and the GPS record
    # color = color the GPS line is
    # width = width of the GPS line
    data = pd.read_csv(self.data_path, header=0)
    # sep will separate the latitude from the longitude
    data.info()
    self.result_image = Image.open(self.map_path, 'r')
    img_points = []
    gps_data = tuple(zip(data['latitude'].values, data['longitude'].values))
    for d in gps_data:
        x1, y1 = self.scale_to_img(d, (self.result_image.size[0], self.result_image.size[1]))
        img_points.append((x1, y1))
    draw = ImageDraw.Draw(self.result_image)
    draw.line(img_points, fill=color, width=width)
I have also attached the GitHub project here. The program works, but I am trying to reduce how many users it plots at once.
Thanks in advance.

To select a specific ID you can create a boolean filter (mask). For this dataframe
   long  lat     ID
0    10    5  test1
1    15   20  test2
you could do the following:
id_filt = df_data['ID'] == 'test1'
This mask can then be used to keep only the entries of the dataframe that have the ID 'test1':
df_data[id_filt]
   long  lat     ID
0    10    5  test1
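If every ID should become its own line on the map, the same idea can be extended with a groupby. The following is only a minimal sketch: the inline CSV text and the column names 'ID', 'latitude' and 'longitude' are assumptions standing in for the real dataset, and the names tracks, wanted and selected are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV file; the column names
# 'ID', 'latitude' and 'longitude' are assumptions.
csv_text = """ID,latitude,longitude
test1,5.0,10.0
test2,20.0,15.0
test1,6.0,11.0
test2,21.0,16.0
"""
data = pd.read_csv(io.StringIO(csv_text))

# Build one list of (latitude, longitude) points per ID.
tracks = {}
for track_id, group in data.groupby("ID"):
    tracks[track_id] = list(zip(group["latitude"], group["longitude"]))

# Restrict plotting to a chosen subset of IDs.
wanted = ["test1"]
selected = {track_id: tracks[track_id] for track_id in wanted}
print(selected)
```

Each list in selected could then be scaled and passed to its own draw.line call, possibly with a different color per ID, instead of drawing all points as one line.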

Related

Create a scatterplot from the data of two dataframes?

I have two dataframes in python. The content of them is the following:
Table=conn
A B relevance
1 3 0.7
2 7 0.1
5 20 2
6 2 7
table=point
Point Lat Lon
1 45.3 -65.2
2 34.4 -60.2
3 40.2 -60.1
20 40.4 -63.1
In the first table, column A represents an origin, column B a destination and the relevance of the link.
On the other hand, in the second table we have for each point (origin or destination) its coordinates.
The problem is that I want to create a visualization in Python that looks up the coordinates of each origin and destination (columns A and B of the first table) in the second table and makes a scatterplot of them. Then I want to link each origin to its destination, drawing thicker lines for links with higher relevance.
Here, "link" refers to the line that joins the points in the graphic representation.
Any idea? I've started with a very basic code approach but I'm really having trouble following along.
for row in conn.iterrows():
    row[1][0]  # origin (A)
    row[1][1]  # destination (B)
    row[1][2]  # relevance

You have two DataFrames, point and conn, right?
import matplotlib.pyplot as plt
# To set indexes of "point" equal to "Points"
point.set_index(point.Point, inplace=True)
# config width of lines
min_width = 0.5
max_width = 4.0
min_relevance = conn.relevance.min()
max_relevance = conn.relevance.max()
slope = (max_width - min_width)/(max_relevance - min_relevance)
widths = min_width + slope*(conn.relevance - min_relevance)
# plot lines
for i in range(len(conn)):
    origin = conn.loc[i, 'A']
    destin = conn.loc[i, 'B']
    lat = point.loc[[origin, destin], 'Lat']
    lon = point.loc[[origin, destin], 'Lon']
    plt.plot(lat, lon, c='red', lw=widths[i])
# plot points
plt.plot(point.Lat, point.Lon, ls='', marker='o', c='blue')
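The width calculation above is a linear rescaling of relevance onto the interval [min_width, max_width]. As a rough illustration, here it is in plain Python using the relevance values from the question; the function name scale_widths and the fallback for identical relevances are assumptions of this sketch, not part of the answer.

```python
def scale_widths(relevances, min_width=0.5, max_width=4.0):
    """Map each relevance linearly onto the range [min_width, max_width]."""
    lo, hi = min(relevances), max(relevances)
    if hi == lo:
        # All links are equally relevant: avoid dividing by zero
        # (falling back to the midpoint is an assumption of this sketch).
        return [(min_width + max_width) / 2.0 for _ in relevances]
    slope = (max_width - min_width) / (hi - lo)
    return [min_width + slope * (r - lo) for r in relevances]


# Relevance column of the "conn" table from the question.
widths = scale_widths([0.7, 0.1, 2, 7])
print([round(w, 2) for w in widths])
```

The least relevant link gets min_width and the most relevant one gets max_width; everything else falls proportionally in between.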

How to plot data on a basemap using matplotlib basemap

Two sections of my code are giving me trouble. I am trying to create the basemap in this first section:
#Basemap
epsg = 6060; width = 2000.e3; height = 2000.e3 #epsg 3413. 6062
m=Basemap(epsg=epsg,resolution='l',width=width,height=height) #lat_ts=(90.+35.)/2.
m.drawcoastlines(color='white')
m.drawmapboundary(fill_color='#99ffff')
m.fillcontinents(color='#cc9966',lake_color='#99ffff')
m.drawparallels(np.arange(10,70,20),labels=[1,1,0,0])
m.drawmeridians(np.arange(-100,0,20),labels=[0,0,0,1])
plt.title('ICESAT2 Tracks in Greenland')
plt.figure(figsize=(20,10))
Then my next section is meant to read the data from a file and plot these tracks on top of the Basemap. Instead, it creates a new plot entirely. I have tried rewriting the second plt.scatter call to go through the Basemap object, such as m.scatter, m.plt, etc., but that only returns "RuntimeError: Can not put single artist in more than one figure".
Any ideas on how to get this next section of code onto the basemap? Here is the next section; focus on the end to see where it is plotting.
icesat2_data[track] = dict() # creates a sub-dictionary, track
icesat2_data[track][year+month+day] = dict() # and one layer more for the date under the whole icesat2_data dictionary
icesat2_data[track][year+month+day] = dict.fromkeys(lasers)
for laser in lasers: # for loop, access all the gt1l, 2l, 3l
    if laser in f:
        lat = f[laser]["land_ice_segments"]["latitude"][:] # data for a particular laser's latitude.
        lon = f[laser]["land_ice_segments"]["longitude"][:] #data for a lasers longitude
        height = f[laser]["land_ice_segments"]["h_li"][:] # data for a lasers height
        quality = f[laser]["land_ice_segments"]["atl06_quality_summary"][:].astype('int')
        # Quality filter
        idx1 = quality == 0 # data dictionary to see what quality summary is
        #print('idx1', idx1)
        # Spatial filter
        idx2 = np.logical_and( np.logical_and(lat>=lat_min, lat<=lat_max), np.logical_and(lon>=lon_min, lon<=lon_max) )
        idx = np.where( np.logical_and(idx1, idx2) ) # combines index 1 and 2 from data quality filter. make sure not empty. if empty all data failed test (low quality or outside box)
        icesat2_data[track][year+month+day][laser] = dict.fromkeys(['lat','lon','height']) #store data, creates empty dictionary of lists lat, lon, hi, those strings are the keys to the dict.
        icesat2_data[track][year+month+day][laser]['lat'] = lat[idx] # grabbing only latitudes using that index of points with good data quality and within bounding box
        icesat2_data[track][year+month+day][laser]['lon'] = lon[idx]
        icesat2_data[track][year+month+day][laser]['height'] = height[idx]
        if lat[idx].any() == True and lon[idx].any() == True:
            x, y = transformer.transform(icesat2_data[track][year+month+day][laser]['lon'], \
                                         icesat2_data[track][year+month+day][laser]['lat'])
            plt.scatter(x, y, marker='o', color='#000000')
Currently, they output separately, like this:

Not sure if you're still working on this, but here's a quick example I put together that you might be able to work with (obviously I don't have the data you're working with). A couple things that might not be self-explanatory - I used m() to transform the coordinates to map coordinates. This is Basemap's built-in transformation method so you don't have to use PyProj. Also, setting a zorder in the scatter function ensures that your points are plotted above the countries layer and don't get hidden underneath.
#Basemap
epsg = 6060; width = 2000.e3; height = 2000.e3 #epsg 3413. 6062
plt.figure(figsize=(20,10))
m=Basemap(epsg=epsg,resolution='l',width=width,height=height) #lat_ts=(90.+35.)/2.
m.drawcoastlines(color='white')
m.drawmapboundary(fill_color='#99ffff')
m.fillcontinents(color='#cc9966',lake_color='#99ffff')
m.drawparallels(np.arange(10,70,20),labels=[1,1,0,0])
m.drawmeridians(np.arange(-100,0,20),labels=[0,0,0,1])
plt.title('ICESAT2 Tracks in Greenland')
for coord in [[68,-39],[70,-39]]:
    lat = coord[0]
    lon = coord[1]
    x, y = m(lon,lat)
    m.scatter(x,y,color='red',s=100,zorder=10)
plt.show()

I think you might need:
plt.figure(figsize=(20,10))
before creating the basemap, not after. As it stands, the code creates a map and then creates a new figure after that, which is why you're getting two figures.
Then your plotting line should be m.scatter(), as you mentioned you tried before.
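The effect of the call order can be reproduced with plain matplotlib, without Basemap or the ICESat-2 data. This is only a sketch: the line and the single point are placeholders for the map and the tracks, and the Agg backend is chosen here just so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Order from the question: draw first, then call plt.figure().
plt.close("all")
plt.plot([0, 1], [0, 1])        # implicitly creates figure 1 (stands in for the map)
plt.figure(figsize=(20, 10))    # opens a second, empty figure
plt.scatter([0.5], [0.5])       # lands on the second figure
figures_question_order = len(plt.get_fignums())

# Suggested order: create the figure first, then draw everything on it.
plt.close("all")
plt.figure(figsize=(20, 10))
plt.plot([0, 1], [0, 1])
plt.scatter([0.5], [0.5], zorder=10)
figures_suggested_order = len(plt.get_fignums())

print(figures_question_order, figures_suggested_order)
```

With the question's order two figures are open at the end, while the suggested order leaves everything on a single figure.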

How do I label a specific point in a scatter plot with a unique ID?

I am creating an interactive graph for a layout that looks a lot like this:
Each point has a unique ID and is usually part of a group. Each group has its own color, so I use multiple scatter plots to create the entire layout. I need the following to occur when I click on a single point:
1) On mouse click, retrieve the ID of the selected point.
2) Plug the ID into a black box function that returns a list of nearby* IDs.
3) Highlight the points of the IDs in the returned list.
*It is possible for some of the IDs to be from different groups/plots.
How do I:
1) Associate each point with an ID and return the ID when the point is clicked?
2) Highlight other points in the layout when all I know is their IDs?
3) Re-position individual points while maintaining their respective groups, i.e. swapping positions with points that belong to different groups/plots?
I used pyqtgraph before switching over to matplotlib so I first thought of creating a dictionary of IDs and their point objects. After experimenting with pick_event, it seems to me that the concept of point objects does not exist in matplotlib. From what I've learned so far, each point is represented by an index and only its PathCollection can return information about itself e.g. coordinates. I also learned that color modification of a specific point is done through its PathCollection whereas in pyqtgraph I can do it through a point object e.g. point.setBrush('#000000').

I am still convinced that using a single scatter plot would be the much better option. There is nothing in the question that would contradict that.
You can merge all your data in a single DataFrame, with columns group, id, x, y, color. The part in the code below which says "create some dataset" creates such a DataFrame:
   group    id  x  y       color
0      1  AEBB  0  0   palegreen
1      3  DCEB  1  0        plum
2      0  EBCC  2  0  sandybrown
3      0  BEBE  3  0  sandybrown
4      3  BEBB  4  0        plum
Note that each group has its own color. One can then create a scatter from it, using the colors from the color column.
A pick event is registered as in this previous question. Once a point that is not already black is clicked, the id corresponding to the selected point is obtained from the DataFrame. From that id, other ids are generated via the "blackbox function", and for each id obtained that way the respective index of the point in the DataFrame is determined. Because we have a single scatter, this index is directly the index of the point in the scatter (PathCollection), and we can paint it black.
import numpy as np; np.random.seed(1)
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors
### create some dataset
x,y = np.meshgrid(np.arange(20), np.arange(20))
group = np.random.randint(0,4,size=20*20)
l = np.array(np.meshgrid(list("ABCDE"),list("ABCDE"),
                         list("ABCDE"),list("ABCDE"))).T.reshape(-1,4)
ide = np.random.choice(list(map("".join, l)), size=20*20, replace=False)
df = pd.DataFrame({"id" : ide, "group" : group ,
                   "x" : x.flatten(), "y" : y.flatten() })
colors = ["sandybrown", "palegreen", "paleturquoise", "plum"]
# map each group number to its color name
df["color"] = df["group"].map(dict(zip(range(4), colors)))
print(df.head())
### plot a single scatter plot from the table above
fig, ax = plt.subplots()
scatter = ax.scatter(df.x,df.y, facecolors=df.color, s=64, picker=4)
def getOtherIDsfromID(ID):
    """ blackbox function: create a list of other IDs from one ID """
    l = [np.random.permutation(list(ID)) for i in range(5)]
    return list(set(map("".join, l)))
def select_point(event):
    if event.mouseevent.button == 1:
        facecolor = scatter._facecolors[event.ind,:]
        if (facecolor == np.array([[0, 0, 0, 1]])).all():
            # point is already black: restore its group color
            c = df.color.values[event.ind][0]
            c = matplotlib.colors.to_rgba(c)
            scatter._facecolors[event.ind,:] = c
        else:
            ID = df.id.values[event.ind][0]
            oIDs = getOtherIDsfromID(ID)
            # for each ID obtained, make the respective point black.
            rows = df.loc[df.id.isin([ID] + oIDs)]
            for i, row in rows.iterrows():
                scatter._facecolors[i,:] = (0, 0, 0, 1)
            tx = "You selected id {}.\n".format(ID)
            tx += "Points with other ids {} will be affected as well"
            tx = tx.format(oIDs)
            print(tx)
        fig.canvas.draw_idle()
fig.canvas.mpl_connect('pick_event', select_point)
plt.show()
In the image below, the point with id DAEE has been clicked on, and other points with ids ['EDEA', 'DEEA', 'EDAE', 'DEAE'] have been chosen by the blackbox function. Not all of those IDs exist, such that two other points with an existing id are colorized as well.
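The lookup from ids to point indices, including the case where some returned ids do not exist, can also be sketched without pandas. The list ids and the names index_of and indices_for below are hypothetical; the only assumption carried over from the answer is that position i in the list corresponds to the i-th point of the single scatter.

```python
# Hypothetical ids in plotting order: position i is the i-th point of the scatter.
ids = ["AEBB", "DCEB", "EBCC", "BEBE", "BEBB"]

# One-time lookup table from id to point index.
index_of = {point_id: i for i, point_id in enumerate(ids)}


def indices_for(wanted_ids):
    """Return the point indices of the ids that exist, skipping unknown ones."""
    return [index_of[w] for w in wanted_ids if w in index_of]


# 'ZZZZ' stands for an id returned by the blackbox function that is not plotted.
print(indices_for(["BEBE", "ZZZZ", "AEBB"]))
```

The returned indices are the rows whose face color would be changed in the scatter; unknown ids are silently ignored, which matches the behavior described above.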

How to Specify Dimension Values when Creating NetCDF File in Python?

I am creating a NetCDF4 file which currently has four variables:
1) Land Surface Temperature (3D array - time, latitude, longitude)
2) Longitude (1D - coordinate of each pixel centre)
3) Latitude (1D - coordinate of each pixel centre)
4) Time (time of image acquisition in hours since 1900-01-01 00:00:00)
I am currently using the following code to do this:
#==========================WRITE THE NETCDF FILE==========================#
newfile = nc.Dataset(export_filename, 'w', format = 'NETCDF4_CLASSIC')
#==========================SET FILE DIMENSIONS============================#
newfile.createDimension('lat', ny)
newfile.createDimension('lon', nx)
newfile.createDimension('time', len(filenames))
#==========================SET GLOBAL ATTRIBUTES==========================#
newfile.title = ('Title')
newfile.history = "File created on " + datetime.strftime(datetime.today(), "%c")
newfile.Conventions = 'CF-1.6'
#==========================CREATE DATA VARIABLES==========================#
#--------------------------LST VARIABLE-----------------------------------#
LSTs = newfile.createVariable('LST', np.int16, ('time', 'lat', 'lon'), fill_value = -8000)
LSTs.units = 'Degrees C'
LSTs.add_offset = 273.15
LSTs.scale_factor = 0.01
LSTs.standard_name = 'LST'
LSTs.long_name = 'Land Surface Temperature'
LSTs.grid_mapping = 'latitude_longitude'
LSTs.coordinates = 'lon lat'
LSTs[:] = LSTd[:]
#--------------------------LON AND LAT AND TIME--------------------------#
LONGITUDEs = newfile.createVariable('LONGITUDE', np.float64, ('lon',))
LONGITUDEs.units = 'Decimal Degrees East'
LONGITUDEs.standard_name = 'Longitude'
LONGITUDEs.long_name = 'Longitude'
LONGITUDEs[:] = LONd[:]
LATITUDEs = newfile.createVariable('LATITUDE', np.float64, ('lat',))
LATITUDEs.units = 'Decimal Degrees North'
LATITUDEs.standard_name = 'Latitude'
LATITUDEs.long_name = 'Latitude'
LATITUDEs[:] = LATd[:]
TIMEs = newfile.createVariable('TIME', np.int32, ('time',))
TIMEs.units = 'hours since 1900-01-01 00:00:00'
TIMEs.standard_name = 'Time'
TIMEs.long_name = 'Time of Image Acquisition'
TIMEs.axis = 'T'
TIMEs.calendar = 'gregorian'
TIMEs[:] = time[:]
#--------------------------SAVE THE FILE---------------------------------#
newfile.close()
This code produces a netCDF file in which the land surface temperature variable has 24 bands (one for each hour of the day). The code works as I wanted, apart from one small problem that I wish to address. When I run gdalinfo for the LST variable, I get (this is a reduced version):
Band 1.....
...
NETCDF_DIM_TIME = 1
...
I want this value of 1 to be set to the same as the 'time' variable (which is something like 1081451 hours since 1900-01-01 00:00:00) which I have included in my above code. I therefore want to understand how this can be changed for each band in the file?
UPDATE TO QUESTION: When I do gdalinfo on the file (again, a subset):
NETCDF_DIM_EXTRA={time}
NETCDF_DIM_time_DEF={24,3}
but there is an option missing 'NETCDF_DIM_time_VALUES' and I need to set this to the time variable and it should work. HOW DO I DO THIS?
At present it is just being set to the band number but I want it to contain information regarding its hour of acquisition.
UPDATE 1:
I have tried to specify
LSTs.NETCDF_DIM_Time = time
during the netCDF file formation and this has assigned all values from time to the NETCDF_DIM_TIME in gdal so that each band has 24 time values rather than just one.
UPDATE 2:
With some further digging I think it is the NETCDF_DIM_time_VALUES metadata which needs to be set to the 'time' variable. I have updated my question to ask how to do this.

The variables associated with the dimensions should have the same name as the dimensions. So in your code above replace the create variable line with:
TIMEs = newfile.createVariable('time', np.int32, ('time',))
Now gdalinfo knows where to find the data. I ran your code using dummy times [1000000, 1000024] and gdalinfo returns:
Band1...
...
NETCDF_DIM_time=1000000
...
Band2...
...
NETCDF_DIM_time=1000024
...
To answer your title question: you can't assign values to a dimension, but you can have a variable with the same name as the dimension that holds the data/values associated with the dimension. Readers of netCDF files, like gdal, look for conventions like this to interpret the data. See for example the 'Coordinate Systems' section of Unidata's 'Writing NetCDF Files: Best Practices'.
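The values stored in that 'time' variable are plain numbers expressed in the declared units. As a rough sketch, they could be computed with the standard library as shown here; the helper hours_since_epoch and the two sample acquisition times are hypothetical, and only the reference instant comes from the units string in the question.

```python
from datetime import datetime

# Reference instant taken from the units string
# 'hours since 1900-01-01 00:00:00' used in the question.
EPOCH = datetime(1900, 1, 1)


def hours_since_epoch(moment):
    """Return whole hours elapsed between EPOCH and `moment`."""
    return int((moment - EPOCH).total_seconds() // 3600)


# Hypothetical acquisition times, one per band (illustrative values only).
acquisitions = [datetime(1900, 1, 2, 0), datetime(1900, 1, 2, 1)]
time_values = [hours_since_epoch(t) for t in acquisitions]
print(time_values)
```

A list like time_values, with one entry per band, is the kind of data that would be assigned to the coordinate variable.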

Paraview for Data Display Over Time

I am trying to use Paraview to display points as colored boxes that change scalar value over time. I have tried to write a programmable source and filter that adds box glyphs to data points and will change the scalar value of the data point based on the time point.
Specifically this script attempts to return an integer value for the current time point and then use this value to parse through a text file and choose the point data corresponding to that time point.
I can’t quite get it to work though and I have two big questions:
1) Within the Filter, how do I create glyph boxes for the data points?
2) How do I get the below script to work without getting the "indices must be integers not float" error?
Many thanks,
Matt
UPDATE
solved problem by adding:
time = int(time)
after time = 0 in programmable filter
Programmable Source Script
import vtk
import numpy as np
coordinate = np.array([[0,0,0]])
node_file = open('…/pointWeights.txt', 'r')
data = node_file.readlines()
timePoint = 0
pointWeight = float(data[timePoint])
node_file.close()
output.Points = coordinate
output.Allocate(1)
output.PointData.append(pointWeight, "Point Weight")
Programmable Source RequestInformation
timeSteps = range(4)
outInfo = self.GetOutputInformation(0)
timeRange = [timeSteps[0], timeSteps[-1]]
outInfo.Set(vtk.vtkStreamingDemandDrivenPipeline.TIME_RANGE(), timeRange, 2)
outInfo.Set(vtk.vtkStreamingDemandDrivenPipeline.TIME_STEPS(), timeSteps, len(timeSteps))
Programmable Filter Script
import vtk
import numpy as np
outInfo = self.GetOutputInformation(0)
if outInfo.Has(vtk.vtkStreamingDemandDrivenPipeline.UPDATE_TIME_STEP()):
    time = outInfo.Get(vtk.vtkStreamingDemandDrivenPipeline.UPDATE_TIME_STEP())
else:
    time = 0
coordinate = np.array([[0,0,0]])
node_file = open('…/pointWeights.txt', 'r')
data = node_file.readlines()
pointWeight = float(data[time])
node_file.close()
output.Points = coordinate
output.Allocate(1)
output.PointData.append(pointWeight, "Point Weight")
pointWeights.txt
0
.33
.66
1
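The error mentioned in the question and the int() fix from the update can be reproduced outside ParaView. In this minimal sketch the list data mimics what readlines() would return for the pointWeights.txt above, and weight_at is a hypothetical helper name.

```python
# Lines as readlines() would return them for the pointWeights.txt shown above.
data = ["0\n", ".33\n", ".66\n", "1\n"]


def weight_at(time_value):
    """Look up the weight for a time step that may arrive as a float."""
    index = int(time_value)  # 2.0 -> 2; list indices must be integers
    return float(data[index])


try:
    data[2.0]  # indexing with a float is rejected
except TypeError as error:
    print("float index fails:", error)

print(weight_at(2.0))
```

The time step reported by the pipeline arrives as a float, so converting it once before indexing is enough to avoid the error.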
