Association Rule Mining in Python on Census Data

Hello everyone, I am working on a project in which I need to perform Association Rule Mining on census data that looks like the image given below.
I am using the Apriori algorithm from the mlxtend library. Here is the code:
# Library Imports
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
# Reading the Data File
data = pd.read_csv("Census.csv")
# Reading Certain Columns of the Data File.
df = data[["Region","Residence Type","Sex","Student"]]
# Initializing the Transaction Encoder
te = TransactionEncoder()
# Fitting the Data.
te_ary = te.fit(df).transform(df)
# Creating a Dataframe of Support and Element name
df2 = pd.DataFrame(te_ary, columns=te.columns_)
# Calling in the Apriori Algorithm.
fre = apriori(df2,min_support=0.6,use_colnames=True)
# Calling the Association Rule Function.
association_rules(fre, metric="confidence",min_threshold=0.7)
But fre is always empty, so association_rules does not return any rules. Can someone please help me?

I tried many approaches, and a friend of mine suggested a way of solving my problem. It worked gracefully, but it does require some of your own logic too.
This is the link that helped me implement my solution in a better way.
In my solution I used the correlating parameters and created a basket, along the lines of the sketch below.
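A minimal sketch of what that basket approach can look like (not necessarily the exact code used; it assumes the same Census.csv file and columns as in the question, and a lower min_support, since one-hot encoded items rarely reach a support of 0.6). TransactionEncoder expects a list of transactions, i.e. one list of items per row, so each row is first turned into a basket of column=value items before running apriori:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

data = pd.read_csv("Census.csv")
df = data[["Region", "Residence Type", "Sex", "Student"]]

# One basket per row; prefixing each value with its column name keeps
# identical codes from different columns distinguishable.
transactions = [
    [f"{col}={row[col]}" for col in df.columns]
    for _, row in df.iterrows()
]

te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df2 = pd.DataFrame(te_ary, columns=te.columns_)

# With many distinct items, a support threshold of 0.6 is usually too strict.
fre = apriori(df2, min_support=0.1, use_colnames=True)
rules = association_rules(fre, metric="confidence", min_threshold=0.7)
print(rules.head())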
This is it. Thank you.

Related

In AWS, what is the easiest way to run a Python script which reads an Excel file and computes & returns various arrays?

I am working on a primitive version of a script that performs factor analysis and computes some parameters for item response theory. I need to make this code run in AWS because I have been asked to, but I have absolutely zero experience with cloud computing, AWS, or anything related to them (I am just somewhat OK at writing Python and MATLAB scripts).
Can anyone please suggest the easiest way to make the following Python code work in AWS, in an easy-to-implement way that is doable for a total beginner (including any changes I need to make inside the Python code)?
P.S.: I am expecting this script to give me the "estimates" and "ev" parameters. Converting this script to a function also did not work for me, but that is probably a separate issue, so I can convert it to a function with the desired return values as well.
import os
import pandas as pd
import numpy as np
from factor_analyzer import FactorAnalyzer
import matplotlib.pyplot as plt
os.chdir('C:/Users/ege/Desktop/Neurolize') #change to where your excel file is.
#access the relevant sheet with answers of each participant
df = pd.read_excel(open('irtTestCase.xlsx', 'rb'),
                   sheet_name='Survey Module 1', skiprows=[0])
#drop irrelevant columns (User ID & Response Time)
df.drop(['UserId', 'Unnamed: 15'],axis=1,inplace=True)
#drop the participants that did not answer all of the questions
df.dropna(inplace=True)
#replace the answers with numeric values
df = df.replace(regex={'None of the time': 1.0, 'Rarely': 2.0, 'Some of the time': 3.0,
                       'Often': 4.0, 'All of the time': 5.0})
#see if factor analysis can be performed on the data (p <0.05 = eligibility)
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
chi_square_value,p_value=calculate_bartlett_sphericity(df)
chi_square_value, p_value
#perform factor analysis - get eigenvectors
fa = FactorAnalyzer(rotation = None,n_factors=df.shape[1])
fa.fit(df)
ev,_ = fa.get_eigenvalues()
#get the ratio of variance explained by the addition of each factor
variances=fa.get_factor_variance()
cum_variance = variances[2] # %variance explained via the addition of each factor
#plot the relative component amplitudes (eigenvalues)
plt.scatter(range(1,df.shape[1]+1),ev)
plt.plot(range(1,df.shape[1]+1),ev)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigen Value')
plt.grid()
#get how much each question contributes to each of the factors
factorLoadings = fa.loadings_
'''
Traditional criteria is to consider each factor that has >1 eigenvalue as a
significant factor. So now we have 3 factors that have eigenvalues >1. So
it may be a good idea to exclude the questions that mainly load into the
second and third factors
'''
# trimmed_df = df.drop(["I've been feeling optimistic about the future","I've been feeling interested in other people"], axis=1)
# #perform factor analysis - get eigenvectors
# fa = FactorAnalyzer(rotation = None,n_factors=trimmed_df.shape[1])
# fa.fit(trimmed_df)
# ev,_ = fa.get_eigenvalues()
# factorLoadings = fa.loadings_
#item response theory
from girth import pcm_mml
df_transpose = df.T
df_array = df_transpose.values
df_int = df_array.astype(int)
estimates = pcm_mml(df_int)
I have tried forming .zip files, making an EC2 instance, and building a Docker image... but I failed in all of these attempts, and I am honestly just copying YouTube videos on the topic, which is really frustrating when you fail at copy-pasting solutions. I think I have some fundamental issues with my Python code in terms of compatibility with AWS.
With the method of adding a .zip file to AWS Lambda, I previously got this error even though my libraries are actually compatible with the Python version I am using in AWS (3.9).
I thought maybe the issue is that my local Python environment is 3.8, but I am not sure about that either.
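Not a full deployment recipe, but a sketch of the shape AWS Lambda expects: the script wrapped into a function that returns the "estimates" and "ev" values, plus a hypothetical lambda_handler entry point. It assumes the dependencies (pandas, factor_analyzer, girth, openpyxl) are packaged in the deployment bundle or a layer, that the Excel file has been placed at a readable path such as /tmp (e.g. downloaded from S3), and it drops the matplotlib plotting, which has nowhere to render in Lambda:
import pandas as pd
from factor_analyzer import FactorAnalyzer
from girth import pcm_mml

def run_analysis(excel_path):
    # Same preprocessing and model-fitting steps as the script above,
    # minus os.chdir and the scree plot.
    df = pd.read_excel(excel_path, sheet_name='Survey Module 1', skiprows=[0])
    df.drop(['UserId', 'Unnamed: 15'], axis=1, inplace=True)
    df.dropna(inplace=True)
    df = df.replace(regex={'None of the time': 1.0, 'Rarely': 2.0,
                           'Some of the time': 3.0, 'Often': 4.0,
                           'All of the time': 5.0})
    fa = FactorAnalyzer(rotation=None, n_factors=df.shape[1])
    fa.fit(df)
    ev, _ = fa.get_eigenvalues()
    estimates = pcm_mml(df.T.values.astype(int))
    return estimates, ev

def lambda_handler(event, context):  # hypothetical Lambda entry point
    estimates, ev = run_analysis('/tmp/irtTestCase.xlsx')
    # Lambda responses must be JSON-serializable; how to serialize `estimates`
    # depends on what pcm_mml returns in your girth version.
    return {'ev': [float(v) for v in ev]}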

How to debug pandas_udfs without having to use Spark?

I'm using Python Transforms in Palantir Foundry and trying to run an algorithm that uses in-memory/non-Spark libraries, and I want it to scale automatically and work in Spark (not pandas). If I'm having a hard time writing the code and want to test and develop it locally, yet use the same code in PySpark later, how do I do this?
For a concrete example, I want to calculate the area of a geojson column which contains a polygon, and I would need to use some libraries which aren't native to Spark (shapely and pyproj). I know that the best way (performance-wise) is to use a pandas_udf (otherwise known as a streaming udf or vectorized udf). But after reading a couple of guides, specifically Introducing Pandas UDF for PySpark, pandas user-defined functions, and Modeling at Scale with Pandas UDFs w/code examples, it's still challenging to debug and get working; it seems like I can't use breakpoints and there isn't a first-class way to log/print.
The actual dataframe would have millions of rows (relating to millions of polygons), but for simplicity I wanted to test locally with a simple dataframe and have it scale to the larger dataset later:
df = spark.createDataFrame(
[
("AFG", "{\"type\":\"Polygon\",\"coordinates\":[[[61.210817,35.650072],[62.230651,35.270664],[62.984662,35.404041],[63.193538,35.857166],[63.982896,36.007957],[64.546479,36.312073],[64.746105,37.111818],[65.588948,37.305217],[65.745631,37.661164],[66.217385,37.39379],[66.518607,37.362784],[67.075782,37.356144],[67.83,37.144994],[68.135562,37.023115],[68.859446,37.344336],[69.196273,37.151144],[69.518785,37.608997],[70.116578,37.588223],[70.270574,37.735165],[70.376304,38.138396],[70.806821,38.486282],[71.348131,38.258905],[71.239404,37.953265],[71.541918,37.905774],[71.448693,37.065645],[71.844638,36.738171],[72.193041,36.948288],[72.63689,37.047558],[73.260056,37.495257],[73.948696,37.421566],[74.980002,37.41999],[75.158028,37.133031],[74.575893,37.020841],[74.067552,36.836176],[72.920025,36.720007],[71.846292,36.509942],[71.262348,36.074388],[71.498768,35.650563],[71.613076,35.153203],[71.115019,34.733126],[71.156773,34.348911],[70.881803,33.988856],[69.930543,34.02012],[70.323594,33.358533],[69.687147,33.105499],[69.262522,32.501944],[69.317764,31.901412],[68.926677,31.620189],[68.556932,31.71331],[67.792689,31.58293],[67.683394,31.303154],[66.938891,31.304911],[66.381458,30.738899],[66.346473,29.887943],[65.046862,29.472181],[64.350419,29.560031],[64.148002,29.340819],[63.550261,29.468331],[62.549857,29.318572],[60.874248,29.829239],[61.781222,30.73585],[61.699314,31.379506],[60.941945,31.548075],[60.863655,32.18292],[60.536078,32.981269],[60.9637,33.528832],[60.52843,33.676446],[60.803193,34.404102],[61.210817,35.650072]]]}"),
("ALB", "{\"type\":\"Polygon\",\"coordinates\":[[[20.590247,41.855404],[20.463175,41.515089],[20.605182,41.086226],[21.02004,40.842727],[20.99999,40.580004],[20.674997,40.435],[20.615,40.110007],[20.150016,39.624998],[19.98,39.694993],[19.960002,39.915006],[19.406082,40.250773],[19.319059,40.72723],[19.40355,41.409566],[19.540027,41.719986],[19.371769,41.877548],[19.304486,42.195745],[19.738051,42.688247],[19.801613,42.500093],[20.0707,42.58863],[20.283755,42.32026],[20.52295,42.21787],[20.590247,41.855404]]]}"),
],# can continue with more countries from https://raw.githubusercontent.com/johan/world.geo.json/34c96bba9c07d2ceb30696c599bb51a5b939b20f/countries.geo.json
["country", "geometry"]
)
Given the geometry column which is actually geojson, how can I calculate the area in square-meters using a good GIS approach? For example using the methods outlined in these questions:
Calculate Polygon area in planar units (e.g. square-meters) in Shapely
How do I get the area of a GeoJSON polygon with Python
How to calculate the area of a polygon on the earth's surface using python?
The way you can think about pandas_udfs is that you are writing your logic to be applied to a pandas Series, so the operation you write is automatically applied to every row.
If you want to develop this locally, you can take a much smaller sample of your data (like you did), store it in a pandas Series, and get the logic working there:
from shapely.geometry import Polygon
import json
from pyproj import Geod

# just select the column you want to use in the pandas udf
pdf = df.select("geometry").toPandas()
# convert to a pandas series
pdf_geom_raw = pdf.iloc[:, 0]
# convert each string to json/dict
pdf_geom = pdf_geom_raw.apply(json.loads)

# function using non-spark libraries
def get_area(shape):
    geod = Geod(ellps="WGS84")
    poly = Polygon(shape["coordinates"][0])
    area = abs(geod.geometry_area_perimeter(poly)[0])
    return area

pdf_geom = pdf_geom.apply(get_area)
Here you can try it locally (without Spark) by replacing pdf = df.select("geometry").toPandas() with pdf = pd.read_csv("geo.csv").
Now that you have it working locally, you can copy-paste the code into your pandas_udf:
from shapely.geometry import Polygon
import json
from pyproj import Geod
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def geodesic_polygon_area(pdf_geom):
    pdf_geom = pdf_geom.apply(json.loads)

    def get_area(shape):
        geod = Geod(ellps="WGS84")
        poly = Polygon(shape["coordinates"][0])
        area = abs(geod.geometry_area_perimeter(poly)[0])
        return area

    pdf_geom = pdf_geom.apply(get_area)
    return pdf_geom

df = df.withColumn('area_square_meters', geodesic_polygon_area(df.geometry))
When running the code:
>>> df.show()
+-------+--------------------+--------------------+
|country| geometry| area_square_meters|
+-------+--------------------+--------------------+
| AFG|{"type":"Polygon"...|6.522700837770404E11|
| ALB|{"type":"Polygon"...|2.969479517410540...|
+-------+--------------------+--------------------+
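One more debugging note, since the question mentions breakpoints and printing: the body of the pandas_udf is ordinary pandas code, so it can also be exercised on a plain pandas Series with no SparkSession at all. A tiny self-contained check (the square polygon is just a made-up sample value) could look like this:
import json
import pandas as pd
from shapely.geometry import Polygon
from pyproj import Geod

# A hand-made value standing in for one row of the geometry column.
sample = pd.Series(['{"type":"Polygon","coordinates":[[[0,0],[0,1],[1,1],[1,0],[0,0]]]}'])

def get_area(shape):
    geod = Geod(ellps="WGS84")
    poly = Polygon(shape["coordinates"][0])
    return abs(geod.geometry_area_perimeter(poly)[0])

# Breakpoints and print() work here exactly as in any local script.
print(sample.apply(json.loads).apply(get_area))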

Print decision trees in Python

I have a project at university about making a decision tree. I already have the code that creates the tree, but I want to print it. Can anyone help me?
#IMPORT ALL NECESSARY LIBRARIES
import Chefboost as chef
import pandas as pd
archivo = input("INSERT FILE NAME FOLLOWED BY .CSV:\n")
# READ THE DATA SET FROM THE CSV FILE
df = pd.read_csv(str(archivo))
df.columns = ['ph', 'soil_temperature', 'soil_moisture', 'illuminance', 'env_temperature','env_humidity','Decision']
# print(df.head(10)) #UNCOMMENT IF WANT FIRST 10 ROWS PRINTED OUT
config = {'algorithm':'ID3'} # CONFIGURE THE ALGORITHM. CHOOSE BETWEEN ID3, C4.5, CART, Regression
model = chef.fit(df.copy(), config) #CREATE THE DECISION TREE BASED OF THE CONFIGURATION ABOVE
resultados = pd.DataFrame(columns = ["Real", "Predicción"]) #CREATE AN EMPTY PANDAS DATAFRAME
# SAVE ALL REAL VS ESTIMATED VALUES IN THE ABOVE DATAFRAME
for i in range(1, 372):
    l = []
    l.append(df.iloc[i]['Decision'])
    feature = df.iloc[i]
    prediction = chef.predict(model, feature)
    l.append(prediction)
    resultados.loc[i] = l
    print(l)
Not knowing the Chefboost library, I can't directly answer your question, but when I am working with a new library, I will often use a few tools to help me understand what the library is giving me. Use dir(object) to get a listing of the attributes and methods of the object.
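For example, applied to the code above (nothing Chefboost-specific is assumed here; this is just generic Python introspection on whatever chef.fit returned):
# Inspect the object returned by chef.fit to see what there is to print.
print(type(model))
print(dir(model))     # attributes and methods of the returned object
help(chef.fit)        # docstring for fit, if the library provides one
help(chef.predict)    # same for predict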
You might also want to be more specific about what you want to see when you say "print the decision tree." Are you trying to print the model, or the predictions? What trouble are you having, or what errors are you seeing?
Hope this helps.

How to convert Pandas DataFrame to RDF (Resource Description Framework)?

I'm looking for a recipe for converting Pandas DataFrames to RDF data in Python. I'm aware of the following Python modules (I know how to Google!), but they do not work for me:
rdfpandas
pandasrdf
Neither seems mature, and I have problems with both. In the case of rdfpandas, I'm unable to install it, and there are no examples and insufficient documentation. In the case of pandasrdf, the example doesn't work and crashes; I can fix it, but the resulting RDF file has zero triples, so the result is useless. I'd rather not have to write the data out to some intermediate file that I have to ingest later. Pandas -> numpy -> RDF would be OK, I guess. Does anybody have a working example of converting a Pandas DataFrame to RDF in one of the common serialisation formats that does not involve an artisanal black-magic package installation?
A newer version of RdfPandas is out, so you can try it and see if it covers your use case: https://rdfpandas.readthedocs.io/en/latest (thanks to Carmoreno for the prompt to fix the link).
Example based on https://github.com/cadmiumkitty/capability-models/blob/master/notebooks/investment_management_capabilities.csv is below
import pandas as pd
import rdfpandas

df = pd.read_csv('investment_management_capabilities.csv', index_col='#id', keep_default_na=True)
g = rdfpandas.to_graph(df)
ttl = g.serialize(format='turtle')

with open('investment_management_capabilities.ttl', 'wb') as file:
    file.write(ttl)
The code that does the conversion is pretty minimal and is here (just look at the to_graph method) https://github.com/cadmiumkitty/rdfpandas/blob/master/rdfpandas/graph.py, so you can use it directly as an inspiration to create your own conversion logic.
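If you do end up rolling your own, a minimal sketch using rdflib directly might look like the following. It assumes one possible convention (index values are subject URIs, column names are predicate URIs, cell values become literals), and the example frame is made up, so adapt it to your data:
import pandas as pd
from rdflib import Graph, URIRef, Literal

# Made-up example: the index holds subject URIs, columns hold predicate URIs.
df = pd.DataFrame(
    {"http://example.org/name": ["Alice", "Bob"]},
    index=["http://example.org/person/1", "http://example.org/person/2"],
)

g = Graph()
for subject, row in df.iterrows():
    for predicate, value in row.items():
        if pd.notna(value):
            g.add((URIRef(subject), URIRef(predicate), Literal(value)))

print(g.serialize(format="turtle"))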

Pyspark reading pickled files [duplicate]

My data are available as sets of Python 3 pickle files. Most of them are serializations of pandas DataFrames.
I'd like to start using Spark because I need more memory and CPU than one computer can have. Also, I'll use HDFS for distributed storage.
As a beginner, I haven't found relevant information explaining how to use pickle files as an input file.
Does it exist? If not, is there any workaround?
Thanks a lot
A lot depends on the data itself. Generally speaking, Spark doesn't perform particularly well when it has to read large, non-splittable files. Nevertheless, you can try to use the binaryFiles method and combine it with the standard Python tools. Let's start with some dummy data:
import tempfile
import pandas as pd
import numpy as np
outdir = tempfile.mkdtemp()
for i in range(5):
    pd.DataFrame(
        np.random.randn(10, 2), columns=['foo', 'bar']
    ).to_pickle(tempfile.mkstemp(dir=outdir)[1])
Next we can read it using the binaryFiles method:
rdd = sc.binaryFiles(outdir)
and deserialize individual objects:
import pickle
from io import BytesIO
dfs = rdd.values().map(lambda p: pickle.load(BytesIO(p)))
dfs.first()[:3]
## foo bar
## 0 -0.162584 -2.179106
## 1 0.269399 -0.433037
## 2 -0.295244 0.119195
One important note is that this typically requires significantly more memory than simpler methods like textFile.
Another approach is to parallelize only the paths and use libraries which can read directly from a distributed file system like hdfs3. This typically means lower memory requirements at the price of a significantly worse data locality.
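A rough sketch of that second approach, reusing the dummy files written to outdir above (in a real cluster the paths would have to be reachable from every executor, for example via hdfs3 or a shared mount, rather than a local temp directory):
import os
import glob
import pickle

# Distribute only the file paths; each task opens and unpickles its own file.
paths = glob.glob(os.path.join(outdir, "*"))

def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)

dfs = sc.parallelize(paths).map(load_pickle)
dfs.first()[:3]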
Considering these two facts it is typically better to serialize your data in a format which can be loaded with a higher granularity.
Note:
SparkContext provides the pickleFile method, but the name can be misleading. It can be used to read SequenceFiles containing pickled objects, not plain Python pickle files.
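To illustrate the note, pickleFile is the reading counterpart of RDD.saveAsPickleFile, which writes SequenceFiles of pickled objects, so it only reads back data that Spark itself wrote in that format (a quick round-trip, reusing outdir from the dummy data above):
import os

# Round-trip through Spark's own pickle-based format; standalone .pkl files
# written by pandas or pickle.dump cannot be read this way.
path = os.path.join(outdir, "spark_pickles")
sc.parallelize([{"foo": 1}, {"bar": 2}]).saveAsPickleFile(path)

restored = sc.pickleFile(path)
restored.collect()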
