Print decision trees in Python

I have a university project where I need to build a decision tree. I already have the code that creates the tree, but I want to print it. Can anyone help me?
# IMPORT ALL NECESSARY LIBRARIES
from chefboost import Chefboost as chef
import pandas as pd

archivo = input("INSERT FILE NAME FOLLOWED BY .CSV:\n")

# READ THE DATA SET FROM THE CSV FILE
df = pd.read_csv(str(archivo))
df.columns = ['ph', 'soil_temperature', 'soil_moisture', 'illuminance', 'env_temperature', 'env_humidity', 'Decision']
# print(df.head(10))  # UNCOMMENT IF YOU WANT THE FIRST 10 ROWS PRINTED OUT

config = {'algorithm': 'ID3'}  # CONFIGURE THE ALGORITHM. CHOOSE BETWEEN ID3, C4.5, CART, Regression
model = chef.fit(df.copy(), config)  # CREATE THE DECISION TREE BASED ON THE CONFIGURATION ABOVE

resultados = pd.DataFrame(columns=["Real", "Predicción"])  # CREATE AN EMPTY PANDAS DATAFRAME

# SAVE ALL REAL VS ESTIMATED VALUES IN THE ABOVE DATAFRAME
for i in range(1, 372):
    l = []
    l.append(df.iloc[i]['Decision'])
    feature = df.iloc[i]
    prediction = chef.predict(model, feature)
    l.append(prediction)
    resultados.loc[i] = l
    print(l)

Not knowing the Chefboost library, I can't answer your question directly, but when I am working with a new library I often use a few built-in tools to understand what it is giving me. Use dir(object) to get a listing of the object's attributes and methods.
You might also be a little more specific about what you want to see when you "print the decision tree." Are you trying to print the model, or the predictions? What trouble are you having, or what errors are you seeing?
Hope this helps.
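For example, a quick way to poke at whatever chef.fit() returns is sketched below. It relies only on Python built-ins; the commented-out rules path is an assumption about where some Chefboost versions write the learned tree, so check it against your installation.
print(type(model))   # what kind of object did fit() return?
print(dir(model))    # list its attributes and methods

if isinstance(model, dict):   # dict-like models are easiest to explore key by key
    for key, value in model.items():
        print(key, type(value))

# ASSUMPTION: some Chefboost versions write the learned rules to
# outputs/rules/rules.py as nested if/else statements; if yours does,
# printing that file is the most direct way to "see" the tree.
# with open("outputs/rules/rules.py") as f:
#     print(f.read())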

Related

In AWS, what is the easiest way to run a Python script which reads an Excel file and computes & returns various arrays?

I am working on a primitive version of a script that performs factor analysis and computes some parameters for item response theory. I need to make this code run in AWS because I have been asked to. However, I have absolutely zero experience with cloud computing, AWS, or anything related to them (I am just somewhat OK with writing Python and MATLAB scripts).
Can anyone please suggest the easiest way to make the following Python code work in AWS, in an easy-to-implement way that is doable for a total noob (including any changes I need to make inside the Python code)?
P.S.: I am expecting this script to give me the "estimates" and "ev" parameters. Converting this script to a function also did not work for me, but that issue is probably a different one, so I can convert this to a function with the desired returns as well.
import os
import pandas as pd
import numpy as np
from factor_analyzer import FactorAnalyzer
import matplotlib.pyplot as plt
os.chdir('C:/Users/ege/Desktop/Neurolize') #change to where your excel file is.
#access the relevant sheet with answers of each participant
df = pd.read_excel(open('irtTestCase.xlsx', 'rb'),
                   sheet_name='Survey Module 1', skiprows=[0])
#drop irrelevant columns (User ID & Response Time)
df.drop(['UserId', 'Unnamed: 15'],axis=1,inplace=True)
#drop the participants that did not answer all of the questions
df.dropna(inplace=True)
#replace the answers with numeric values
df = df.replace(regex={'None of the time': 1.0, 'Rarely': 2.0, 'Some of the time': 3.0,
                       'Often': 4.0, 'All of the time': 5.0})
#see if factor analysis can be performed on the data (p <0.05 = eligibility)
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
chi_square_value,p_value=calculate_bartlett_sphericity(df)
chi_square_value, p_value
#perform factor analysis - get eigenvectors
fa = FactorAnalyzer(rotation = None,n_factors=df.shape[1])
fa.fit(df)
ev,_ = fa.get_eigenvalues()
#get the ratio of variance explained by the addition of each factor
variances=fa.get_factor_variance()
cum_variance = variances[2] # %variance explained via the addition of each factor
#plot the relative component amplitudes (eigenvalues)
plt.scatter(range(1,df.shape[1]+1),ev)
plt.plot(range(1,df.shape[1]+1),ev)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigen Value')
plt.grid()
#get how much each question contributes to each of the factors
factorLoadings = fa.loadings_
'''
Traditional criteria is to consider each factor that has >1 eigenvalue as a
significant factor. So now we have 3 factors that have eigenvalues >1. So
it may be a good idea to exclude the questions that mainly load into the
second and third factors
'''
# trimmed_df = df.drop(["I've been feeling optimistic about the future","I've been feeling interested in other people"], axis=1)
# #perform factor analysis - get eigenvectors
# fa = FactorAnalyzer(rotation = None,n_factors=trimmed_df.shape[1])
# fa.fit(trimmed_df)
# ev,_ = fa.get_eigenvalues()
# factorLoadings = fa.loadings_
#item response theory
from girth import pcm_mml
df_transpose = df.T
df_array = df_transpose.values
df_int = df_array.astype(int)
estimates = pcm_mml(df_int)
I have tried forming .zip files, making an instance in EC2, and building a Docker image... but I failed in all attempts, and I am honestly just copying YouTube videos on the topic, which is really frustrating when you fail at copy-pasting solutions. I think I have some fundamental issues with my Python code in terms of compatibility with AWS.
With the method of adding a .zip file to AWS Lambda, I previously got this error even though my libraries are actually compatible with the Python version I am using in AWS (3.9).
I thought maybe the problem is that my local Python environment is 3.8, but I am not sure about that either.
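Since the goal is really to expose "estimates" and "ev" as return values, one common route is to wrap the script in a Lambda handler and read the input file from S3 instead of a local Windows path. The sketch below is only illustrative: the bucket name, object key, and the run_analysis helper are placeholders I introduced, not part of the original script, and matplotlib plotting would have to be dropped or redirected to a file since Lambda has no display.
import io
import json
import boto3
import pandas as pd

def run_analysis(df):
    # Placeholder: move the factor-analysis / girth steps from the script into
    # this function and return "estimates" and "ev" as plain Python lists
    # (e.g. via numpy's .tolist()) so they are JSON-serializable.
    ...

def lambda_handler(event, context):
    # ASSUMPTION: the Excel file has been uploaded to S3; bucket and key are placeholders.
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-analysis-bucket", Key="irtTestCase.xlsx")
    df = pd.read_excel(io.BytesIO(obj["Body"].read()),
                       sheet_name="Survey Module 1", skiprows=[0])
    estimates, ev = run_analysis(df)
    # The Lambda response must be JSON-serializable.
    return {"statusCode": 200, "body": json.dumps({"estimates": estimates, "ev": ev})}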

How to retrieve all variable names within a netcdf using GDAL

I am struggling to find a way to retrieve metadata information from a FILE using GDAL.
Specifically, I would like to retrieve the band names and the order in which they are stored in a given file (be it a GEOTIFF or a NETCDF).
For instance, if we follow the description in the GDAL documentation, there is the "GetMetadata" method on gdal.Dataset (see here and here). Although this method returns a whole set of information about the dataset, it does not provide the band names or the order in which they are stored within the given FILE. As a matter of fact, it seems to be an old problem (from 2015) that has not been solved yet (more info here). The "R" language has apparently already solved this problem (see here), but Python hasn't.
Just to be thorough: I know there are other Python packages that can help with this (e.g., xarray, rasterio, etc.); nevertheless, it is important to keep the set of packages used in a single script concise. Therefore, I would like to know a definitive way to find the band (a.k.a. variable) names and the order in which they are stored within a single FILE using gdal.
Please let me know your thoughts in this regard.
Below, I present a starting point for solving this Issue, in which a file is opened by GDAL (creating a Dataset object).
from gdal import Dataset
from osgeo import gdal

OpeneddatasetFile = gdal.Open(f'NETCDF:{input}/{file_name}.nc:' + var)

if isinstance(OpeneddatasetFile, Dataset):
    print("File opened successfully")
    # here is where one should be capable of fetching the variable (a.k.a., band) names
    # of the OpeneddatasetFile.
    # Ideally, it would be most welcome some kind of method that could return a dictionary
    # with this information
    # something like:
    # VariablesWithinFile = OpeneddatasetFile.getVariablesWithinFileAsDictionary()
I have finally found a way to retrieve variable names from the NetCDF file using GDAL, and that is thanks to the comments given by Robert Davy above.
I have organized the code into a set of functions to help readability. Notice that there is also a function for reading metadata from the NetCDF file, which returns this info in a dictionary format (see the "readInfo" function).
from gdal import Dataset, InfoOptions
from osgeo import gdal
import numpy as np

def read_data(filename):
    dataset = gdal.Open(filename)
    if not isinstance(dataset, Dataset):
        raise FileNotFoundError("Impossible to open the netcdf file")
    return dataset

def readInfo(ds, infoFormat="json"):
    "how to: https://gdal.org/python/"
    info = gdal.Info(ds, options=InfoOptions(format=infoFormat))
    return info

def listAllSubDataSets(infoDict: dict):
    subDatasetVariableKeys = [x for x in infoDict["metadata"]["SUBDATASETS"].keys()
                              if "_NAME" in x]
    subDatasetVariableNames = [infoDict["metadata"]["SUBDATASETS"][x]
                               for x in subDatasetVariableKeys]
    formatedsubDatasetVariableNames = []
    for x in subDatasetVariableNames:
        s = x.replace('"', '').split(":")[-1]
        s = ''.join(s)
        formatedsubDatasetVariableNames.append(s)
    return formatedsubDatasetVariableNames

if "__main__" == __name__:
    filename = "netcdfFile.nc"
    ds = read_data(filename)
    infoDict = readInfo(ds)
    infoDict["VariableNames"] = listAllSubDataSets(infoDict)

Association Rule Mining in Python on Census Data

Hello everyone, I am working on a project where I need to perform association rule mining on census data that looks like the image given below.
I am using the Apriori Algorithm from the mlxtend library. Here is the Code.
# Library Imports
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
# Reading the Data File
data = pd.read_csv("Census.csv")
# Reading Certain Columns of the Data File.
df = data[["Region","Residence Type","Sex","Student"]]
# Initializing the Transaction Encoder
te = TransactionEncoder()
# Fitting the Data.
te_ary = te.fit(df).transform(df)
# Creating a Dataframe of Support and Element name
df2 = pd.DataFrame(te_ary, columns=te.columns_)
# Calling in the Apriori Algorithm.
fre = apriori(df2,min_support=0.6,use_colnames=True)
# Calling the Association Rule Function.
association_rules(fre, metric="confidence",min_threshold=0.7)
But the fre variable does not return any rules; it is always empty. Can someone please help me?
I tried many approaches, and a friend of mine suggested a way of solving my problem. It worked gracefully, but it does require some logic of your own.
This is the link that helped me implement my question in a better way.
In my solution I used the correlating parameters and created a basket.
That is it. Thank you.
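The answer above does not include its code, but building a "basket" from categorical census columns for mlxtend usually means turning each row into a list of column=value tokens. The sketch below shows one way that could look; the column names come from the question, while the support and confidence thresholds are only illustrative guesses.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

data = pd.read_csv("Census.csv")
df = data[["Region", "Residence Type", "Sex", "Student"]]

# One transaction per row, each item tagged with its column name so that
# values from different columns cannot collide.
transactions = df.astype(str).apply(
    lambda row: [f"{col}={val}" for col, val in row.items()], axis=1
).tolist()

te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
baskets = pd.DataFrame(te_ary, columns=te.columns_)

# A 0.6 minimum support is very high for multi-valued columns, which may be
# one reason the original call returned nothing; start lower and tighten later.
frequent = apriori(baskets, min_support=0.1, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules.head())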

Parsing osm.pbf data using the GDAL/OGR Python module

I'm trying to extract data from an OSM.PBF file using the python GDAL/OGR module.
Currently my code looks like this:
import gdal, ogr

osm = ogr.Open('file.osm.pbf')

## Select multipolygon from the layer
layer = osm.GetLayer(3)

# Create list to store pubs
pubs = []
for feat in layer:
    if feat.GetField('amenity') == 'pub':
        pubs.append(feat)
This little bit of code works fine with small .pbf files (15 MB). However, when parsing files larger than 50 MB I get the following error:
ERROR 1: Too many features have accumulated in points layer. Use OGR_INTERLEAVED_READING=YES MODE
When I turn this mode on with:
gdal.SetConfigOption('OGR_INTERLEAVED_READING', 'YES')
ogr does not return any features at all anymore, even when parsing small files.
Does anyone know what is going on here?
Thanks to scai's answer I was able to figure it out.
The special reading pattern required for interleaved reading, mentioned in gdal.org/1.11/ogr/drv_osm.html, is translated into a working Python example below.
This is an example of how to extract all features in an .osm.pbf file that have the 'amenity=pub' tag:
import gdal, ogr

gdal.SetConfigOption('OGR_INTERLEAVED_READING', 'YES')
osm = ogr.Open('file.osm.pbf')

# Grab available layers in file
nLayerCount = osm.GetLayerCount()

thereIsDataInLayer = True
pubs = []
while thereIsDataInLayer:
    thereIsDataInLayer = False
    # Cycle through available layers
    for iLayer in xrange(nLayerCount):
        lyr = osm.GetLayer(iLayer)
        # Get first feature from layer
        feat = lyr.GetNextFeature()
        while feat is not None:
            thereIsDataInLayer = True
            # Do something with the feature, in this case store it in a list
            if feat.GetField('amenity') == 'pub':
                pubs.append(feat)
            # The destroy method is necessary for interleaved reading
            feat.Destroy()
            feat = lyr.GetNextFeature()
As far as I understand it, a while-loop is needed instead of a for-loop because, with interleaved reading, it is impossible to obtain the feature count of a layer up front; instead, every layer has to be revisited until a full pass over all of them returns no new features, which is what the thereIsDataInLayer flag tracks.
More clarification on why this piece of code works the way it does would be greatly appreciated.

For loop with Python for ArcGIS using Add Field and Field Calculator

I'll try to give a brief background here. I recently received a large amount of data that was all digitized from paper maps. Each map was saved as an individual file that contains a number of records (polygons mostly). My goal is to merge all of these files into one shapefile or geodatabase, which is an easy enough task. However, other than spatial information, the records in the file do not have any distinguishing information so I would like to add a field and populate it with the original file name to track its provenance. For example, in the file "505_dmg.shp" I would like each record to have a "505_dmg" id in a column in the attribute table labeled "map_name". I am trying to automate this using Python and feel like I am very close. Here is the code I'm using:
# Import system module
import arcpy
from arcpy import env
from arcpy.sa import *

# Set overwrite on/off
arcpy.env.overwriteOutput = "TRUE"

# Define workspace
mywspace = "K:/Research/DATA/ADS_data/Historic/R2_ADS_Historical_Maps/Digitized Data/Arapahoe/test"
print mywspace

# Set the workspace for the ListFeatureClasses function
arcpy.env.workspace = mywspace

try:
    for shp in arcpy.ListFeatureClasses("", "POLYGON", ""):
        print shp
        map_name = shp[0:-4]
        print map_name
        arcpy.AddField_management(shp, "map_name", "TEXT", "", "", "20")
        arcpy.CalculateField_management(shp, "map_name", "map_name", "PYTHON")
except:
    print "Fubar, It's not working"
    print arcpy.GetMessages()
else:
    print "You're a genius Aaron"
The output I receive from running this script:
>>>
K:/Research/DATA/ADS_data/Historic/R2_ADS_Historical_Maps/Digitized Data/Arapahoe/test
505_dmg.shp
505_dmg
506_dmg.shp
506_dmg
You're a genius Aaron
Appears successful, right? Well, it has been... almost: a field was added and populated for both files, and it is perfect for the 505_dmg.shp file. The problem is that 506_dmg.shp has also been labeled "505_dmg" in the "map_name" column. Though the loop appears to be partially working, the map_name variable does not seem to be updating. Any thoughts or suggestions are much appreciated.
Thanks,
Aaron
I received a solution from the ESRI discussion board:
https://geonet.esri.com/thread/114520
Basically, a small edit in the CalculateField call did the trick: the expression argument is evaluated by the field calculator, so the value of the Python variable has to be embedded as a quoted string literal ("505_dmg", "506_dmg", ...) rather than passing the bare text map_name. Here is the new line that worked:
arcpy.CalculateField_management(shp, "map_name","\"" + map_name + "\"", "PYTHON")
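An alternative that sidesteps the expression-quoting issue entirely is to write the value with an update cursor. This is only a sketch of that approach, assuming arcpy.da is available in your ArcGIS version (10.1 or later):
for shp in arcpy.ListFeatureClasses("", "POLYGON", ""):
    map_name = shp[0:-4]
    arcpy.AddField_management(shp, "map_name", "TEXT", "", "", "20")
    # Write the file name directly into every row; no expression string needed.
    with arcpy.da.UpdateCursor(shp, ["map_name"]) as cursor:
        for row in cursor:
            row[0] = map_name
            cursor.updateRow(row)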
