Word Clouds using TabPy - python

I want to create some code in TabPy that will count the frequency of words in a column and remove stop words for a word cloud in Tableau.
I'm able to do this easily enough in Python:
import pandas as pd
import numpy as np

# other1 is a DataFrame with an 'answer' text column; stop is a collection of stopwords
other1_count = other1.answer.str.split(expand=True).stack().value_counts()
other1_count = other1_count.to_frame().reset_index()
other1_count.columns = ['Word', 'Count']
# Remove stopwords
other1_count['Word'] = other1_count['Word'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
other1_count['Word'].replace('', np.nan, inplace=True)
other1_count.dropna(subset=['Word'], inplace=True)
other1_count = other1_count[~other1_count.Word.str.contains("nan")]
But I'm less sure how to run this through TabPy. Is anyone familiar with TabPy who can tell me how to make this run?
Thanks in advance.

I worked on a project that accomplished something very similar a while back in R. Here's a video example showing the proof-of-concept (no audio). https://www.screencast.com/t/xa0yemiDPl
It essentially shows the end state of using Tableau to interactively examine the description of wines in a word-cloud for the selected countries. The key components were:
have Tableau connect to the data to analyze, as well as a placeholder dataset with the number of records you expect to get back from your Python/R code (the call out to Python/R from Tableau expects to receive the same number of records it sends off to process; that can be problematic if you're sending text data but processing it to return many more records, as in the word cloud example)
have the Python/R code connect to your data and return the Word and Frequency counts in a single vector, separated by a delimiter (what Tableau will require for a word cloud)
split the single vector using Tableau Calculated Fields
leverage parameter actions to select parameter values to pass to the Python/R code
High-Level Overview
Tableau Calculated Field - [R Words+Freq]:
Script_Str('
print("STARTING NEW SCRIPT RUN")
print(Sys.time())
print(.arg2) # grouping
print(.arg1) # selected country
# TEST VARIABLE (non-prod)
.MaxSourceDataRecords = 1000 # -1 to disable
# TABLEAU PARAMETER VARIABLES
.country = "' + [Country Parameter] + '"
.wordsToReturn = ' + str([Return Top N Words]) + '
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#
# VARIABLES DERIVED FROM TABLEAU PARAMETER VALUES
.countryUseAll = (.country == "All")
print(.countryUseAll)
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#
#setwd("C:/Users/jbelliveau/....FILL IN HERE...")
.fileIn = "' + [Source Data Path] + '"
#.fileOut = "winemag-with-DTM.csv"
#install.packages("wordcloud")
#install.packages("RColorBrewer") # not needed if installed wordcloud package
library(tm)
library(wordcloud)
library(RColorBrewer) # color package (maps or wordclouds)
wineAll = read.csv(.fileIn, stringsAsFactors=FALSE)
# TODO separately... polarity
# use all the data or just the parameter selected
print(.countryUseAll)
if ( .countryUseAll ) {
  wine = wineAll # use all the data
} else {
  wine = wineAll[c(wineAll$country == .country),] # filter down to the parameter passed from Tableau
}
# limited data for speed (NOT FOR PRODUCTION)
if( .MaxSourceDataRecords > 0 ){
  print("limiting the number of records to use from input data")
  wine = head(wine, .MaxSourceDataRecords)
}
corpus = Corpus(VectorSource(wine$description))
corpus = tm_map(corpus, content_transformer(tolower))
#corpus = tm_map(corpus, PlainTextDocument) # https://stackoverflow.com/questions/32523544/how-to-remove-error-in-term-document-matrix-in-r/36161902
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords("english"))
#length(corpus)
dtm = DocumentTermMatrix(corpus)
#?sample
mysample = dtm # no sampling (used Head on data read... for speed/simplicity on this example)
#mysample <- dtm[sample(1:nrow(dtm), 5000, replace=FALSE),]
#nrow(mysample)
wineSample = as.data.frame(as.matrix(mysample))
# column names (the words)
# use colnames to get a vector of the words
#colnames(wineSample)
# freq of words
# colSums to get the frequency of the words
#wineWordFreq = colSums(wineSample)
# structure in a way Tableau will like it
wordCloudData = data.frame(words=colnames(wineSample), freq=colSums(wineSample))
str(wordCloudData)
# sort by word freq
wordCloudDataSorted = wordCloudData[order(-wordCloudData$freq),]
# join together by ~ for processing once Tableau gets it
wordAndFreq = paste(wordCloudDataSorted[, 1], wordCloudDataSorted[, 2], sep = "~")
#write.table(wordCloudData, .fileOut, sep=",",row.names=FALSE) # if needed for performance refactors
topWords = head(wordAndFreq, .wordsToReturn)
#print(topWords)
return( topWords )
',
Max([Country Parameter])
, MAX([RowNum]) // for testing the grouping being sent to R
)
Tableau Calculated Field for the Word Value:
// grab the first token to the left of ~
Left([R Words+Freq], Find([R Words+Freq],"~") - 1)
Tableau Calculated Field for the Frequency Value:
INT(REPLACE([R Words+Freq],[Word]+"~",""))
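If you want to stay in Python rather than R, the same pattern can be wired up through TabPy with SCRIPT_STR. Below is a minimal, untested sketch of what the calculated field might look like, assuming TabPy is configured as the analytics extension, the [answer] field from the original question is passed in as _arg1, and the placeholder stopword set is replaced with a real one on the server; the word~count delimiter convention matches the R example above:
SCRIPT_STR('
import pandas as pd
# _arg1 is the list of answer strings Tableau sends to TabPy
stop = {"the", "a", "an", "and", "of", "to"}  # placeholder stopword set (assumption)
words = pd.Series(_arg1).str.split(expand=True).stack()
words = words[~words.str.lower().isin(stop)]
counts = words.value_counts()
# join word and frequency with ~ so Tableau can split them back apart
result = [w + "~" + str(int(c)) for w, c in counts.items()]
# Tableau expects one value back per record sent, so pad/trim to the input length
result = (result + [""] * len(_arg1))[:len(_arg1)]
return result
',
ATTR([answer])
)
As with the R version, the Word and Frequency calculated fields then split each returned "word~count" string.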
If you're not familiar with Tableau, you'll likely want to work alongside a Tableau analyst at your company who is. They'll be able to help you create the calculated fields and configure Tableau to connect to TabPy.

I think the best way to get familiar with using Python from Tableau is this (old) thread on the Tableau community:
https://community.tableau.com/s/news/a0A4T000002NznhUAC/tableau-integration-with-python-step-by-step?t=1614700410778
It explains the initial setup step by step and how to "call" Python via Tableau calculated fields.
In addition, at the top of that post you'll find a reference to the more up-to-date TabPy GitHub repository:
https://github.com/tableau/TabPy

Related

Speeding up fuzzy match on large list

I am working on a project that uses fuzzy logic on a list of names that could reach about 100,000 unique records. In the recent screening we conducted, the functions we use take about 2.20 seconds per name on average. This means that a list of 10,000 names could take around 6 hours to process, which is really too long.
Is there a way that we can speed up our process? Here's the snippet of the script that we use.
# Importing packages
import pandas as pd
import Levenshtein as lev
# Reading cleaned datasets
df_name_reference = pd.read_csv('path_to_file')
df_name_to_screen = pd.read_csv('path_to_file')
# Function used in name screening
def get_similarity_score(s1, s2):
    ''' Return match percentage between 2 strings disregarding name swapping
    Parameters
    -----------
    s1 : str : name from df_name_reference (to be used within pandas apply)
    s2 : str : name from df_name_to_screen (ref_name variable)
    Return
    -----------
    float
    '''
    # Get sorted names
    s1_sort = ' '.join(sorted(s1.split(' '))).strip() if type(s1) == str else ''
    s2_sort = ' '.join(sorted(s2.split(' '))).strip() if type(s2) == str else ''
    # Get ratios and return the max value
    # THIS COULD BE THE BOTTLENECK OF OUR SCRIPT: MORE DETAILS BELOW
    return max([
        lev.ratio(s1, s2),
        lev.ratio(s1_sort, s2),
        lev.ratio(s1, s2_sort),
        lev.ratio(s1_sort, s2_sort)
    ])
# Returning file
screening_results = []
for row in range(df_name_to_screen.shape[0]):
    # Get name to screen
    ref_name = df_name_to_screen.loc[row, 'fullname']
    # Get scores
    scores = df_name_reference.fullname.apply(lev.ratio, args=(ref_name,))
    # Append results
    screening_results.append(pd.DataFrame({'screened_name': ref_name, 'scores': scores}))
I took four scores from lev.ratio. This is to address variations in the arrangement of names, i.e. firstname-lastname and lastname-firstname formats. I know the fuzzywuzzy package has token_sort_ratio, but I've noticed that it just splits the name parts and sorts them alphabetically, which leads to lower scores. Plus, fuzzywuzzy is slower than Levenshtein. So I had to manually capture the similarity score of sorted and unsorted names.
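To illustrate (a hypothetical pair of swapped names, not from my data): the raw ratio of two swapped names is low, while the ratio of their sorted forms is 1.0.
import Levenshtein as lev

s1, s2 = "john smith", "smith john"   # hypothetical: same person, name order swapped
s1_sort = ' '.join(sorted(s1.split(' ')))
s2_sort = ' '.join(sorted(s2.split(' ')))
print(lev.ratio(s1, s2))              # noticeably below 1.0: raw strings differ
print(lev.ratio(s1_sort, s2_sort))    # 1.0: sorted forms are identical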
Can anyone give an approach that I could try? Thanks!
EDIT: Here's a sample dataset that you may try. This is in Google Drive.
In case you don't need scores for all entries in the reference data but just the top N, you can use difflib.get_close_matches to remove the others before calculating any scores:
import difflib

N_RESULTS = 10  # top N candidates to keep per name; adjust as needed

screening_results = []
for row in range(df_name_to_screen.shape[0]):
    ref_name = df_name_to_screen.loc[row, 'fullname']
    # pre-filter the reference names with difflib before computing Levenshtein ratios
    skimmed = pd.DataFrame({
        'fullname': difflib.get_close_matches(
            ref_name,
            df_name_reference.fullname,
            N_RESULTS,
            0
        )
    })
    scores = skimmed.fullname.apply(lev.ratio, args=(ref_name,))
    screening_results.append(pd.DataFrame({'screened_name': ref_name, 'scores': scores}))
This takes about 50ms per row using the file you provided.

create new dataframe using function

I ran into this nice blog post: https://towardsdatascience.com/the-search-for-categorical-correlation-a1c.
The author creates a function that allows you to calculate associations between categorical features and then create a heatmap out of it.
The function is given as:
import numpy as np
import pandas as pd
import scipy.stats as ss

def cramers_v(x, y):
    confusion_matrix = pd.crosstab(x, y)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
I am able to create a list of associations between one feature and the rest by running the function in a for loop.
for item in raw[categorical].columns.tolist():
    value = cramers_v(raw['status_group'], raw[item])
    print(item, value)
It works in the sense that I get a list of association values, but I don't know how I would run this function for all features against each other and turn that into a new dataframe.
The author of this article has written a nice new library that has this feature built in, but it doesn't turn out nicely for my long list of features (my laptop can't handle it).
Running it on the first 100 lines of my df results in this... (note: this is what I get by running the associations function of the dython library written by the author).
How could I run the cramers_v function for all combinations of features and then turn this into a df which I could display in a heatmap?
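One straightforward approach is to loop over every pair of categorical columns, collect the results into a square DataFrame, and plot that. A minimal sketch, assuming the raw DataFrame and categorical column list from the question, and using seaborn for the heatmap:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

cols = raw[categorical].columns.tolist()
# build a square DataFrame of pairwise Cramér's V values
assoc = pd.DataFrame(index=cols, columns=cols, dtype=float)
for col_a in cols:
    for col_b in cols:
        assoc.loc[col_a, col_b] = cramers_v(raw[col_a], raw[col_b])

# display as a heatmap
sns.heatmap(assoc, annot=True, fmt=".2f")
plt.show()
Note that cramers_v is O(n) per pair over the raw data, so for a long feature list this still grows quadratically with the number of features; sampling rows (as the question's first-100-lines test does) keeps it tractable.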

h2o aggregate method "none" mapping unknown words to NAN and not vector

I am currently using h2o.ai to perform some NLP. I have a Word2Vec model trained on my corpus and have successfully aggregated a number of records with the method "AVERAGE". The problem comes in when I want to create features for my DRF model by using this w2v model to create a bag of words for each entry. When I use the aggregate method "NONE", the vectors are returned in a single column containing NaNs where the records begin and end; however, the unknown words in the model are also mapped to NaN rather than to the unknown-word vector. This stops me from reorganizing the vectors into a bag of words for each record, because the record separation is lost among the extra, unpredictably placed NaNs. Is there a fix for this?
For now I plan to use the original tokenized list to build an index of the double-NaN structure that delimits records, and then recombine my vectors based on that. Just wanted to throw this out there to see if anyone else is dealing with this or if there is some kind of fix in place that I cannot find on the interwebs.
import re
import pandas as pd
import h2o
from tqdm import tqdm

DATA = pd.read_sql(sql, conn1)

# regex clean-up steps: strip punctuation/digits, then collapse whitespace
steps = [
    (r'[\n\t\’\–\”\“\!~`\"##\$%\^\&\*()_+\{\}|:<>\?\-=\[\]\\;\',./\d]', ' '),
    (r'\s+', ' ')
]
steps = [(re.compile(a), b) for (a, b) in steps]

def do_steps(anarr):
    for pattern, replacement in steps:
        anarr = pattern.sub(replacement, anarr)
    return anarr

DATA.NARR = DATA.NARR.apply(do_steps)
train_hdata = h2o.H2OFrame(DATA).ascharacter()
train_narr = train_hdata["NARR"]
train_key = train_hdata["KEY"]
train_tokens_narr = train_narr.tokenize(split=' ')
# w2v is the previously trained Word2Vec model
train_vecs = w2v.transform(train_tokens_narr, aggregate_method='NONE')
VECS = train_vecs.as_data_frame()
df = train_tokens_narr.as_data_frame()

# rows that are NaN in both the token frame and the vector frame mark record boundaries
B = (VECS.isnull() & df.isnull())
idx = B[B['C1'] == True].index.tolist()
X = []
X.append('')
j = 0
for i in tqdm(range(len(VECS.C1) - 1)):
    if i in idx:
        X[j] = X[j][:-2]   # strip the trailing ", "
        j += 1
        X.append('')
    else:
        X[j] = X[j] + str(VECS.C1[i])[:6] + ', '
s = pd.DataFrame({"C1": X})
print(s)
The above is my current code, which takes some records and encodes them with the word2vec model for a bag of words. The bottom portion is a draft loop that I am using to pair the correct vectors with the correct records. Let me know if I need to clarify.
Unfortunately, the functionality to distinguish between words that are missing from your dictionary and the NAs used to demarcate the start and end of a record is not currently available. I've made a JIRA ticket here to track the issue. Please feel free to comment on or update the ticket.

Extract embedded vector per word from h2o.word2vec object

I'm trying to create a pre-trained embedding layer using h2o.word2vec. I'm looking to extract each word in the model and its corresponding embedded vector.
Code:
library(data.table)
library(h2o)
h2o.init(nthreads = -1)
comment <- data.table(comments='ExplanationWhy the edits made under my username Hardcore Metallica
Fan were reverted They werent vandalisms just closure on some GAs after I voted
at New York Dolls FAC And please dont remove the template from the talk page since Im retired now')
comments.hex <- as.h2o(comment, destination_frame = "comments.hex", col.types=c("String"))
words <- h2o.tokenize(comments.hex$comments, "\\\\W+")
vectors <- 3 # only 3 vectors to save time & memory
w2v.model <- h2o.word2vec(words
, model_id = "w2v_model"
, vec_size = vectors
, min_word_freq = 1
, window_size = 2
, init_learning_rate = 0.025
, sent_sample_rate = 0
, epochs = 1) # only one epoch to save time
print(h2o.findSynonyms(w2v.model, "the",2))
The h2o API enables me to get the cosine similarity of two words, but I'm just looking to get the vector of each word in my vocabulary. How can I get it? I couldn't find any simple method in the API that provides it.
Thanks in advance
You can use the method w2v_model.transform(words=words)
(the complete signature is w2v_model.transform(words = ..., aggregate_method = ...))
where words is an H2O Frame made of a single column containing the source words (note that you can pass a subset of this frame) and aggregate_method specifies how to aggregate sequences of words.
If you don't specify an aggregation method, no aggregation is performed and each input word is mapped to a single word vector. If the method is AVERAGE, the input is treated as sequences of words delimited by NA.
For example:
av_vecs = w2v_model.transform(words, aggregate_method = "AVERAGE")
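To get the full word-to-vector table rather than an aggregate, one option (a minimal sketch using the Python h2o API; the R workflow from the question would be analogous) is to transform the tokenized words with no aggregation and bind the vectors back to the words column-wise:
# with no aggregation, each token row maps to one vector row
word_vecs = w2v_model.transform(words, aggregate_method="NONE")

# pair each word with its embedding by binding the frames side by side
embedding_table = words.cbind(word_vecs)
print(embedding_table.head())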

Arcpad, arcpy, checkout/in data and limiting to selected features

I am working on developing a Python toolbox to automate the steps required for checking data out of and back into a file geodatabase. My question is: what is the best way to limit the features checked out to only those selected, while using the Python command line and the ArcPad Data Management tools instead of the ArcPad Data Manager? The "Only get selected features" checkbox in the ArcPad Data Manager makes this easy. This is important because I want to limit the areas of use and reduce the file size, as one of the feature classes is a large parcel map.
I do know one way to get the selected features using arcpy only.
import arcpy

# get the map document object
mxd = arcpy.mapping.MapDocument("CURRENT")
# get a data frame object; here the first one is taken
df = arcpy.mapping.ListDataFrames(mxd)[0]
# get the layer object (ListLayers returns a list, so take the first match)
lyr = arcpy.mapping.ListLayers(mxd, "NameOfRequiredLayer", df)[0]
# now get the IDs of the selected features of your layer
# (FIDSet is a semicolon-delimited string, so split it into a list of IDs)
selection = arcpy.Describe(lyr).FIDSet.split("; ")
"selection" then contains the FIDs of selected elements. With that you can carry one with whatever you have to do. For instance you can then set a layers definition query:
# shapefiles use FID, feature classes use OBJECTID, so you better check
IDname = "\"OBJECTID\""
if lyr.dataSource.endswith("shp"):
    IDname = "\"FID\""
querystring = IDname + " = " + str(selection[0])
for count in range(1, len(selection)):
    querystring = querystring + " OR " + IDname + " = " + str(selection[count])
if lyr.supports("DEFINITIONQUERY"):
    lyr.definitionQuery = querystring
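For example, with the definition query in place, any geoprocessing run against the layer only sees the matching features, so the filtered parcels could be exported for the checkout; the output path below is just a placeholder:
# copy only the features matching the definition query into the checkout workspace
arcpy.CopyFeatures_management(lyr, r"C:\checkout.gdb\parcels_checkout")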
