How to debug pandas_udfs without having to use Spark?

I'm using Python Transforms in Palantir Foundry and trying to run an algorithm which uses in-memory/non-Spark libraries, and I want it to scale automatically in Spark (not pandas). I'm having a hard time writing the code and want to test and develop it locally, yet use the same code in pyspark later. How do I do this?
For a concrete example, I want to calculate the area of a geojson column which contains a polygon. This requires libraries which aren't native to Spark (shapely and pyproj). I know that the best way (performance-wise) is to use a pandas_udf (otherwise known as a streaming or vectorized udf). But after reading a couple of guides, specifically Introducing Pandas UDF for PySpark, pandas user-defined functions, and Modeling at Scale with Pandas UDFs w/code examples, it's still challenging to debug and get working: it seems like I can't use break statements, and there isn't a first-class way to log/print.
The actual dataframe will have millions of rows (relating to millions of polygons), but for simplicity I want to test locally with a simple dataframe and scale to the larger dataset later:
df = spark.createDataFrame(
[
("AFG", "{\"type\":\"Polygon\",\"coordinates\":[[[61.210817,35.650072],[62.230651,35.270664],[62.984662,35.404041],[63.193538,35.857166],[63.982896,36.007957],[64.546479,36.312073],[64.746105,37.111818],[65.588948,37.305217],[65.745631,37.661164],[66.217385,37.39379],[66.518607,37.362784],[67.075782,37.356144],[67.83,37.144994],[68.135562,37.023115],[68.859446,37.344336],[69.196273,37.151144],[69.518785,37.608997],[70.116578,37.588223],[70.270574,37.735165],[70.376304,38.138396],[70.806821,38.486282],[71.348131,38.258905],[71.239404,37.953265],[71.541918,37.905774],[71.448693,37.065645],[71.844638,36.738171],[72.193041,36.948288],[72.63689,37.047558],[73.260056,37.495257],[73.948696,37.421566],[74.980002,37.41999],[75.158028,37.133031],[74.575893,37.020841],[74.067552,36.836176],[72.920025,36.720007],[71.846292,36.509942],[71.262348,36.074388],[71.498768,35.650563],[71.613076,35.153203],[71.115019,34.733126],[71.156773,34.348911],[70.881803,33.988856],[69.930543,34.02012],[70.323594,33.358533],[69.687147,33.105499],[69.262522,32.501944],[69.317764,31.901412],[68.926677,31.620189],[68.556932,31.71331],[67.792689,31.58293],[67.683394,31.303154],[66.938891,31.304911],[66.381458,30.738899],[66.346473,29.887943],[65.046862,29.472181],[64.350419,29.560031],[64.148002,29.340819],[63.550261,29.468331],[62.549857,29.318572],[60.874248,29.829239],[61.781222,30.73585],[61.699314,31.379506],[60.941945,31.548075],[60.863655,32.18292],[60.536078,32.981269],[60.9637,33.528832],[60.52843,33.676446],[60.803193,34.404102],[61.210817,35.650072]]]}"),
("ALB", "{\"type\":\"Polygon\",\"coordinates\":[[[20.590247,41.855404],[20.463175,41.515089],[20.605182,41.086226],[21.02004,40.842727],[20.99999,40.580004],[20.674997,40.435],[20.615,40.110007],[20.150016,39.624998],[19.98,39.694993],[19.960002,39.915006],[19.406082,40.250773],[19.319059,40.72723],[19.40355,41.409566],[19.540027,41.719986],[19.371769,41.877548],[19.304486,42.195745],[19.738051,42.688247],[19.801613,42.500093],[20.0707,42.58863],[20.283755,42.32026],[20.52295,42.21787],[20.590247,41.855404]]]}"),
],# can continue with more countries from https://raw.githubusercontent.com/johan/world.geo.json/34c96bba9c07d2ceb30696c599bb51a5b939b20f/countries.geo.json
["country", "geometry"]
)
Given the geometry column, which is actually geojson, how can I calculate the area in square meters using a good GIS approach? For example, using the methods outlined in these questions:
Calculate Polygon area in planar units (e.g. square-meters) in Shapely
How do I get the area of a GeoJSON polygon with Python
How to calculate the area of a polygon on the earth's surface using python?

The way to think about pandas_udfs is that you are writing logic to be applied to a pandas series: you write an operation once and it is automatically applied to every row.
If you want to develop this locally, you can take a much smaller sample of your data (like you did), store it in a pandas series, and get it working there:
from shapely.geometry import Polygon
import json
from pyproj import Geod

# just select the column you want to use in the pandas udf
pdf = df.select("geometry").toPandas()
# convert to pandas series (.ix was removed from pandas; use .iloc)
pdf_geom_raw = pdf.iloc[:, 0]
# convert each geojson string to a dict
pdf_geom = pdf_geom_raw.apply(json.loads)

# function using non-spark libraries
def get_area(shape):
    geod = Geod(ellps="WGS84")
    poly = Polygon(shape["coordinates"][0])
    area = abs(geod.geometry_area_perimeter(poly)[0])
    return area

pdf_geom = pdf_geom.apply(get_area)
Here you can test entirely locally (without Spark) by replacing pdf = df.select("geometry").toPandas() with pdf = pd.read_csv("geo.csv"), for example:
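As a minimal sketch of that Spark-free workflow (geo.csv is a hypothetical local file whose geometry column holds the same geojson strings):

import json
import pandas as pd

# No Spark session needed; reuses get_area from the snippet above
pdf = pd.read_csv("geo.csv")
pdf_geom = pdf["geometry"].apply(json.loads).apply(get_area)
print(pdf_geom.head())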
Now that you have it working locally, you can copy-paste the code into your pandas_udf:
from shapely.geometry import Polygon
import json
from pyproj import Geod
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def geodesic_polygon_area(pdf_geom):
    pdf_geom = pdf_geom.apply(json.loads)

    def get_area(shape):
        geod = Geod(ellps="WGS84")
        poly = Polygon(shape["coordinates"][0])
        area = abs(geod.geometry_area_perimeter(poly)[0])
        return area

    return pdf_geom.apply(get_area)

df = df.withColumn('area_square_meters', geodesic_polygon_area(df.geometry))
When running the code:
>>> df.show()
+-------+--------------------+--------------------+
|country| geometry| area_square_meters|
+-------+--------------------+--------------------+
| AFG|{"type":"Polygon"...|6.522700837770404E11|
| ALB|{"type":"Polygon"...|2.969479517410540...|
+-------+--------------------+--------------------+
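Since the question asks specifically about debugging without Spark: one option (a sketch, not anything Foundry-specific) is to keep the UDF body as a plain Series -> Series function. Breakpoints and print statements work on it locally, and you can wrap the same function with pandas_udf afterwards:

import json
import pandas as pd
from shapely.geometry import Polygon
from pyproj import Geod

def geodesic_polygon_area_pd(pdf_geom):
    # same logic as the pandas_udf above, but a plain pandas function
    def get_area(geojson_str):
        shape = json.loads(geojson_str)
        geod = Geod(ellps="WGS84")
        poly = Polygon(shape["coordinates"][0])
        return abs(geod.geometry_area_perimeter(poly)[0])
    return pdf_geom.apply(get_area)

# Local test on a tiny hand-made series: breakpoints and prints work here
sample = pd.Series(['{"type":"Polygon","coordinates":[[[0,0],[1,0],[1,1],[0,1],[0,0]]]}'])
print(geodesic_polygon_area_pd(sample))

Once this behaves locally, pandas_udf(geodesic_polygon_area_pd, 'double') gives you the Spark version without touching the tested logic.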

Related

In AWS, what is the easiest way to run a python script which reads an excel file and computes & returns various arrays?

I am working on a primitive version of a script that performs factor analysis and computes some parameters for item response theory. I need to make this code run in AWS because I have been asked to. However, I have absolutely zero experience with cloud computing, AWS, or anything related to that (I am just somewhat OK with writing Python and MATLAB scripts).
Can anyone please suggest the easiest way to make the following Python code work in AWS, in a way that is doable for a total noob (including any changes I need to make inside the Python code):
P.S.: I am expecting this script to give me the "estimates" and "ev" parameters. Converting this script to a function also did not work for me, but that is probably a separate issue, so I can convert it to a function with the desired returns as well.
import os
import pandas as pd
import numpy as np
from factor_analyzer import FactorAnalyzer
import matplotlib.pyplot as plt
os.chdir('C:/Users/ege/Desktop/Neurolize') #change to where your excel file is.
#access the relevant sheet with answers of each participant
df = pd.read_excel(open('irtTestCase.xlsx', 'rb'),
                   sheet_name='Survey Module 1', skiprows=[0])
#drop irrelevant columns (User ID & Response Time)
df.drop(['UserId', 'Unnamed: 15'],axis=1,inplace=True)
#drop the participants that did not answer all of the questions
df.dropna(inplace=True)
#replace the answers with numeric values
df = df.replace(regex={'None of the time': 1.0, 'Rarely': 2.0, 'Some of the time': 3.0,
                       'Often': 4.0, 'All of the time': 5.0})
#see if factor analysis can be performed on the data (p <0.05 = eligibility)
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
chi_square_value,p_value=calculate_bartlett_sphericity(df)
chi_square_value, p_value
#perform factor analysis - get eigenvectors
fa = FactorAnalyzer(rotation = None,n_factors=df.shape[1])
fa.fit(df)
ev,_ = fa.get_eigenvalues()
#get the ratio of variance explained by the addition of each factor
variances=fa.get_factor_variance()
cum_variance = variances[2] # %variance explained via the addition of each factor
#plot the relative component amplitudes (eigenvalues)
plt.scatter(range(1,df.shape[1]+1),ev)
plt.plot(range(1,df.shape[1]+1),ev)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigen Value')
plt.grid()
#get how much each question contributes to each of the factors
factorLoadings = fa.loadings_
'''
Traditional criteria is to consider each factor that has >1 eigenvalue as a
significant factor. So now we have 3 factors that have eigenvalues >1. So
it may be a good idea to exclude the questions that mainly load into the
second and third factors
'''
# trimmed_df = df.drop(["I've been feeling optimistic about the future","I've been feeling interested in other people"], axis=1)
# #perform factor analysis - get eigenvectors
# fa = FactorAnalyzer(rotation = None,n_factors=trimmed_df.shape[1])
# fa.fit(trimmed_df)
# ev,_ = fa.get_eigenvalues()
# factorLoadings = fa.loadings_
#item response theory
from girth import pcm_mml
df_transpose = df.T
df_array = df_transpose.values
df_int = df_array.astype(int)
estimates = pcm_mml(df_int)
I have tried forming .zip files, making an instance in EC2, and building a docker image... but I failed in all attempts, and I am honestly just copying YouTube videos on the topic, which is really frustrating when you fail at copy-pasting solutions. I think I have some fundamental issues with my Python code in terms of compatibility with AWS.
With the method of adding a .zip file to AWS Lambda, I previously got an error even though my libraries are actually compatible with the Python version I am using in AWS (3.9).
I thought maybe it is because my local Python environment is 3.8, but I am not sure about that either.

How do you save textnets (python) to gml / gexf or access dataframe of graph?

I have been using textnets (python) to analyse a corpus. I need to export the resulting graph for further analysis / layout editing in Gephi. Having read the docs, I am still confused about how to either save the resulting igraph Graph in an appropriate format or access the pandas dataframe which could then be exported. For example, using the tutorial from the docs:
from textnets import Corpus, Textnet
from textnets import examples
corpus = Corpus(examples.moon_landing)
tn = Textnet(corpus.tokenized(), min_docs=1)
print(tn)
I had thought I could return a pandas dataframe by calling 'tn', though this returns a 'Textnet' object.
I had also thought I could get an igraph.Graph object and then use Graph.write_gml(), with something like tn.project(node_type='doc').write_gml('test.gml'), to save the file in an appropriate format, but this returns a ProjectedTextnet.
Any advice would be most welcome.
For the second part of your question, you can convert the textnet object to an igraph:
g = tn.graph
Then save as gml:
g.write_gml("test.gml")
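For the first part of the question: igraph has no native GEXF writer, but Gephi imports GraphML directly, and newer igraph versions can also hand the graph back as pandas dataframes. A sketch, assuming python-igraph >= 0.9 (which added the dataframe accessors):

g = tn.graph

# GraphML is another format Gephi imports directly
g.write_graphml("test.graphml")

# igraph >= 0.9 can expose the graph as pandas DataFrames
edges_df = g.get_edge_dataframe()
vertices_df = g.get_vertex_dataframe()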

Association Rule Mining in python on Census Data

Hello everyone, I am working on a project. I need to perform association rule mining on census data which looks like the image given below.
I am using the Apriori Algorithm from the mlxtend library. Here is the Code.
# Library Imports
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
# Reading the Data File
data = pd.read_csv("Census.csv")
# Reading Certain Columns of the Data File.
df = data[["Region","Residence Type","Sex","Student"]]
# Initializing the Transaction Encoder
te = TransactionEncoder()
# Fitting the Data.
te_ary = te.fit(df).transform(df)
# Creating a Dataframe of Support and Element name
df2 = pd.DataFrame(te_ary, columns=te.columns_)
# Calling in the Apriori Algorithm.
fre = apriori(df2,min_support=0.6,use_colnames=True)
# Calling the Association Rule Function.
association_rules(fre, metric="confidence",min_threshold=0.7)
But the fre variable is always empty, so no rules are returned. Can someone help me, please?
I tried many ways, and a friend of mine suggested an approach that solved my problem. It worked gracefully, though it requires some logic of your own.
This is the link which helped me implement my solution in a better way.
In my solution I used correlating parameters and created a basket.
This is it. Thank you.
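For reference, the likely culprit in the original code is that TransactionEncoder expects a list of transactions (one item list per row), not a DataFrame. A minimal sketch of the "basket" idea; the column-name prefixing and the lower support threshold are my own choices, not from the source:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

data = pd.read_csv("Census.csv")
df = data[["Region", "Residence Type", "Sex", "Student"]]

# Build one "basket" of items per respondent; prefixing each value with its
# column name keeps identical values from different columns distinct
transactions = df.astype(str).apply(
    lambda row: [f"{col}={val}" for col, val in row.items()], axis=1
).tolist()

te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df2 = pd.DataFrame(te_ary, columns=te.columns_)

# Categorical census data rarely reaches 0.6 support; a lower threshold
# gives the miner something to work with
fre = apriori(df2, min_support=0.1, use_colnames=True)
rules = association_rules(fre, metric="confidence", min_threshold=0.7)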

How to convert Pandas DataFrame to RDF (Resource Description Framework)?

I'm looking for a recipe for converting Pandas DataFrames to RDF data in Python. I'm aware of the following Python modules (I know how to Google!), but they do not work for me:
rdfpandas
pandasrdf
Neither seems mature. I have problems with both. In the case of rdfpandas, I'm unable to install it, and there are no examples and insufficient documentation. In the case of pandasrdf, the example doesn't work and crashes. I can fix it, but the RDF file has zero triples, so the result is useless. I'd rather not have to write out the data to some intermediate data file that I have to ingest later. Pandas->numpy->RDF would be OK, I guess. Does anybody have a working example of converting a Pandas DataFrame to RDF in one of the common serialisation formats that does not involve an artisanal black magic package installation?
A newer version of RdfPandas is out, so you can try it and see if it covers your use case: https://rdfpandas.readthedocs.io/en/latest (thanks to Carmoreno for the prompt to fix the link).
An example based on https://github.com/cadmiumkitty/capability-models/blob/master/notebooks/investment_management_capabilities.csv is below:
import pandas as pd
import rdfpandas

df = pd.read_csv('investment_management_capabilities.csv', index_col='#id', keep_default_na=True)
g = rdfpandas.to_graph(df)

# Note: rdflib 6+ returns a str from serialize(); open the file with 'w'
# instead of 'wb' there
ttl = g.serialize(format='turtle')

with open('investment_management_capabilities.ttl', 'wb') as file:
    file.write(ttl)
The code that does the conversion is pretty minimal (just look at the to_graph method): https://github.com/cadmiumkitty/rdfpandas/blob/master/rdfpandas/graph.py. You can use it directly as inspiration for your own conversion logic.
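If rdfpandas still doesn't fit, rolling your own with rdflib alone is only a few lines. A sketch; the namespace, URIs, and column names here are made up purely for illustration:

import pandas as pd
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/")  # hypothetical namespace

# Toy frame: the index holds subject URIs, columns become predicates
df = pd.DataFrame(
    {"name": ["Alice", "Bob"], "age": [30, 25]},
    index=["http://example.org/alice", "http://example.org/bob"],
)

g = Graph()
for subject, row in df.iterrows():
    for column, value in row.items():
        g.add((URIRef(subject), EX[column], Literal(value)))

print(g.serialize(format="turtle"))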

fastest way to get NetCDF variable min/max using Python?

My usual method for extracting the min/max of a variable's data values from a NetCDF file is an order of magnitude slower when switching to the netCDF4 Python module compared to scipy.io.netcdf.
I am working with relatively large ocean model output files (from ROMS) with multiple depth levels over a given map region (Hawaii). When these were in NetCDF-3, I used scipy.io.netcdf.
Now that these files are in NetCDF-4 ("Classic") I can no longer use scipy.io.netcdf and have instead switched over to using the netCDF4 Python module. However, the slowness is a concern and I wondered if there is a more efficient method of extracting a variable's data range (minimum and maximum data values)?
Here was my NetCDF-3 method using scipy:
import scipy.io.netcdf
netcdf = scipy.io.netcdf.netcdf_file(file)
var = netcdf.variables['sea_water_potential_temperature']
min = var.data.min()
max = var.data.max()
Here is my NetCDF-4 method using netCDF4:
import netCDF4

netcdf = netCDF4.Dataset(file)
var = netcdf.variables['sea_water_potential_temperature']
# netCDF4 variables are read by slicing; the result is then flattened
var_array = var[:].flatten()
min = var_array.min()
max = var_array.max()
The notable difference is that I must first flatten the data array in netCDF4, and this operation apparently slows things down.
Is there a better/faster way?
Per suggestion of hpaulj here is a function that calls the nco command ncwa using subprocess. It hangs terribly when using an OPeNDAP address, and I don't have any files on hand to test it locally.
You can see if it works for you and what the speed difference is.
This assumes you have the nco library installed.
def ncwa(path, fnames, var, op_type, times=None, lons=None, lats=None):
    '''Perform arithmetic operations on netCDF file or OPeNDAP data

    Args
    ----
    path: str
        prefix
    fnames: str or iterable
        Names of file(s) to perform operation on
    var: str
        Name of the variable to read from the result file
    op_type: str
        ncwa arithmetic operation to perform. Available operations are:
        avg,mabs,mebs,mibs,min,max,ttl,sqravg,avgsqr,sqrt,rms,rmssdn
    times: tuple
        Minimum and maximum timestamps within which to perform the operation
    lons: tuple
        Minimum and maximum longitudes within which to perform the operation
    lats: tuple
        Minimum and maximum latitudes within which to perform the operation

    Returns
    -------
    result: float
        Result of the operation on the selected data

    Note
    ----
    Adapted from the OPeNDAP examples in the NCO documentation:
    http://nco.sourceforge.net/nco.html#OPeNDAP
    '''
    import os
    import netCDF4
    import numpy
    import subprocess

    output = 'tmp_output.nc'

    # Concatenate subprocess command
    cmd = ['ncwa']
    cmd.extend(['-y', '{}'.format(op_type)])
    if times:
        cmd.extend(['-d', 'time,{},{}'.format(times[0], times[1])])
    if lons:
        cmd.extend(['-d', 'lon,{},{}'.format(lons[0], lons[1])])
    if lats:
        cmd.extend(['-d', 'lat,{},{}'.format(lats[0], lats[1])])
    cmd.extend(['-p', path])
    cmd.extend(numpy.atleast_1d(fnames).tolist())
    cmd.append(output)

    # Run cmd and check for errors
    subprocess.run(cmd, stdout=subprocess.PIPE, check=True)

    # Load, read, close data and delete temp .nc file
    data = netCDF4.Dataset(output)
    result = float(data[var][:])
    data.close()
    os.remove(output)

    return result
path = 'https://ecowatch.ncddc.noaa.gov/thredds/dodsC/hycom/hycom_reg6_agg/'
fname = 'HYCOM_Region_6_Aggregation_best.ncd'
times = (0.0, 48.0)
lons = (201.5, 205.5)
lats = (18.5, 22.5)
smax = ncwa(path, fname, 'salinity', 'max', times, lons, lats)
If you're just getting the min/max values across an array of a variable, you can use xarray.
import xarray as xr

da = xr.open_dataset('infile/file.nc')
max = da.sea_water_potential_temperature.max()
min = da.sea_water_potential_temperature.min()
This should give you a single value for the min/max, respectively. You can also take the min/max of a variable across a selected dimension like time, longitude, or latitude, as in the sketch below.
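For instance (the dimension and coordinate names here are assumptions; ROMS output often uses names like ocean_time, lon_rho, and lat_rho instead):

# Maximum over the time dimension only, leaving a spatial field
max_over_time = da.sea_water_potential_temperature.max(dim="time")

# Or subset a region first, then reduce to a single value
box_max = da.sea_water_potential_temperature.sel(
    lat=slice(18.5, 22.5), lon=slice(201.5, 205.5)
).max()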
Xarray is great for handling multidimensional arrays, which is why it's pretty easy to handle NetCDF in Python when you're not using other tools like CDO and NCO. Lastly, xarray is also used in other related libraries that deal with weather and climate data in Python (http://xarray.pydata.org/en/stable/related-projects.html).
A Python solution (using CDO as a backend) is my package nctoolkit (https://pypi.org/project/nctoolkit/, installation guide: https://nctoolkit.readthedocs.io/en/latest/installing.html).
It has a number of built-in methods for calculating different types of min/max values.
We would first need to read the file in as a dataset:
import nctoolkit as nc
data = nc.open_data(file)
If you wanted the maximum value across space, for each timestep, you would do the following:
data.spatial_max()
Maximum across all depths for each grid cell and time step would be calculated as follows:
data.vertical_max()
If you wanted the maximum across time, you would do:
data.max()
These methods are chainable, and the CDO backend is very efficient, so this should be ideal for working with ROMS data.
