How to store the results of tokenization for further indexing? - python

I am a total beginner and am now trying to implement a simple search engine in Python.
I got the tokenizer working using functions from NLTK, but I am now unsure how to store the results of the tokenizer. I need to keep them for further indexing.
What's the common way to do this? What kind of database should I use?

Introduction to Information Retrieval by Manning, Raghavan and Schütze devotes several chapters to index construction and storage; so does Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto.
For a simple hobby/study project, though, SQLite will suffice for index storage. You need a table that holds (term, document-id, frequency) triples to compute tf and one that stores (term, df) pairs, both with an index on the terms; that's enough to compute tf-idf.
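A minimal sketch of such a SQLite schema (the table and column names here are just illustrative, not a standard):

import sqlite3

conn = sqlite3.connect("index.db")
conn.executescript("""
    -- (term, document id, term frequency) triples, for tf
    CREATE TABLE IF NOT EXISTS postings (
        term   TEXT    NOT NULL,
        doc_id INTEGER NOT NULL,
        tf     INTEGER NOT NULL
    );
    CREATE INDEX IF NOT EXISTS idx_postings_term ON postings(term);

    -- (term, document frequency) pairs, for df
    CREATE TABLE IF NOT EXISTS document_frequency (
        term TEXT PRIMARY KEY,   -- the primary key gives you the index on terms
        df   INTEGER NOT NULL
    );
""")
conn.commit()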

Related

How to extract unique words and their POS tags into separate columns while working with a dataset

I am working with Indonesian data for NER, and as far as I can tell there is no pretrained NLTK model for this language. So, to do this manually, I tried to extract all the unique words used in the entire data frame. I still don't know how to apply tags to the words, but this is what I did so far:
[The four code steps from the original question are not reproduced here.]
Please let me know if there is a more convenient way to do what I did in those steps. Also, let me know how to add tags to each row (if possible) and how to do NER for this.
(I am new to coding that's why I don't know how to ask, but I am trying my best to provide as much information as possible.)
Depending on what you want to do: if results are all that matter, you could use a pretrained transformer model from Hugging Face instead of NLTK. This will be more computationally heavy but will also give you better performance.
There is one fitting model I could find (I don't speak Indonesian, obviously, so excuse any errors in the sample sentence):
https://huggingface.co/cahya/xlm-roberta-large-indonesian-NER?text=Nama+saya+Peter+dan+saya+tinggal+di+Berlin.
The easiest way to use this would probably be either the API or an inference-only pipeline; check out this guide. All you would have to do to get it running for the Indonesian model is replace the previous model path (dslim/bert-base-NER) with cahya/xlm-roberta-large-indonesian-NER.
Note that this Indonesian model is quite large, so you need some decent hardware. If you don't have it, you could alternatively use a (free) cloud computing service such as Google Colab.
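A minimal sketch of the pipeline route, assuming the transformers package is installed (the sample sentence is the one from the model card linked above):

from transformers import pipeline

# Inference-only NER pipeline using the Indonesian model linked above;
# "simple" aggregation merges word pieces back into whole words.
ner = pipeline(
    "ner",
    model="cahya/xlm-roberta-large-indonesian-NER",
    aggregation_strategy="simple",
)

print(ner("Nama saya Peter dan saya tinggal di Berlin."))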

Getting around for loops in PySpark?

I have a clustering algorithm in Python that I am trying to convert to PySpark (for parallel processing).
I have a dataset that contains regions, and stores within those regions. I want to perform my clustering algorithm for all stores in a single region.
I have a few for loops before getting to the ML. How can I modify the code to remove the for loops in PySpark? I have read that for loops in PySpark are generally not good practice, but I need to be able to run the model on many sub-datasets. Any advice?
For reference, I"m currently looping (through Pandas DataFrames) like this pseudocode below:
for region in df_region:
    for distinct stores in region:
        [apply ML clustering algorithm]
Search Built-in Algorithms
You could consider looking at the RDD-based built-in clustering algorithms first, since they cover the common cases and went through a rigorous validation process before release.
Clustering - RDD-based API
If you're more familiar with the DataFrame-based API, you can take a look here instead. Keep in mind that as of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode (no new features, only bug fixes); the primary ML API is now the DataFrame-based API in the spark.ml package.
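For instance, a minimal sketch with the DataFrame-based KMeans (the feature column names are assumptions, and this clusters the whole DataFrame at once rather than per region):

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Assemble the numeric feature columns into the single vector column Spark ML expects.
assembler = VectorAssembler(inputCols=["feature_1", "feature_2"], outputCol="features")
assembled_df = assembler.transform(spark_df)

# Fit a built-in KMeans model and attach a cluster id to every row.
kmeans = KMeans(k=3, featuresCol="features", predictionCol="cluster")
model = kmeans.fit(assembled_df)
clustered_df = model.transform(assembled_df)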
Implement Yourself
Pandas UDFs
If you do have a model object already, consider Pandas UDFs, since they have iterator support now (since 3.0.0). Simply put, this means the model won't be loaded for each row.
from typing import Iterator

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("long")  # declare the element type the UDF returns
def categorize(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = ...  # load the model once per partition, not once per row
    for features in iterator:
        yield pd.Series(model.predict(features))

# GROUP BY in Spark SQL or window functions can also be considered.
# It depends on your scenario; just remember that DataFrames are still
# based on RDDs: they are immutable, high-level abstractions.
spark_df.withColumn("clustered_result", categorize("some_column")).show()
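If the goal is specifically to run one clustering per region without an explicit for loop, a grouped-map sketch with applyInPandas (Spark 3.0+) is another option. Here my_clustering_algorithm, the column names, and the schema string are assumptions:

import pandas as pd

def cluster_stores(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds all rows of one region; run the clustering here and
    # return the rows with an extra "cluster" column attached.
    pdf["cluster"] = my_clustering_algorithm(pdf)  # hypothetical helper
    return pdf

result_df = (
    spark_df
    .groupBy("region")
    .applyInPandas(cluster_stores, schema="region string, store string, cluster int")
)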
RDD Exploring
If, unfortunately, your clustering algorithm is not among the Spark built-in clustering algorithms and has no training phase that produces a model object, you could consider converting the Pandas DataFrame into RDDs and implementing your clustering algorithm on top of them. A rough outline looks like the following:
pandas_df = ...
spark_df = spark.createDataFrame(pandas_df)
# ...
clustering_result = spark_df.rdd.map(lambda p: cluster_algorithm(p))
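Building on that, a rough per-region grouping of the same idea (cluster_algorithm is still your own function):

# Key every row by its region, collect each region's rows, then run the
# user-supplied cluster_algorithm on each group.
by_region = spark_df.rdd.keyBy(lambda row: row["region"])
clustering_result = by_region.groupByKey().mapValues(
    lambda rows: cluster_algorithm(list(rows))
)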
Note 1: this is only a rough outline. You might want to partition the whole dataset into a few RDDs based on region and then run the clustering algorithm within each partition, as in the grouped sketch above. Because the details of your clustering algorithm are not very clear, I can only give advice based on some assumptions.
Note 2: an RDD implementation should be your last option.
RDD Programming Guide
2017, Chen Jin, A Scalable Hierarchical Clustering Algorithm Using Spark

Extract relevant sentences based on input words NLP

I'm trying to build an app that finds relevant sentences in a document based on keywords or statements that a user enters. Doing this manually with a needle-in-the-haystack approach seems highly inefficient.
Is there an ideal approach or library that can tackle this problem?
The field that deals with problems like the one you described is called information retrieval.
The simplest way to do such queries is based on the bag-of-words model: you treat documents as vectors, so that their cosine similarity reflects whether they contain similar words.
In Python you can do this, for example, with utilities from scikit-learn (this would be low-level) or with more production-ready tools like whoosh; see Python for Humanities for a sample tutorial.
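A low-level sketch of that bag-of-words route with scikit-learn (the sentences and the query are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The cat sat on the mat.",
    "Stock prices fell sharply on Monday.",
    "A kitten rested on the rug.",
]
query = "cat on a mat"

vectorizer = TfidfVectorizer()
sentence_vectors = vectorizer.fit_transform(sentences)  # one row per sentence
query_vector = vectorizer.transform([query])

# Rank the sentences by cosine similarity to the query.
scores = cosine_similarity(query_vector, sentence_vectors).ravel()
for score, sentence in sorted(zip(scores, sentences), reverse=True):
    print(f"{score:.3f}  {sentence}")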
If you want to dig deeper, I encourage you to read an information retrieval textbook, at least the first couple of chapters.

Python interval based sparse container

I am trying to create an interface between structured data and NLTK. NLP libraries generally work with bags of words, hence I need to turn my structured data into bags of words.
I need to associate the offset of a word with its metadata. Therefore my best bet is to have some sort of container that holds ranges as keys (allowing nested ranges) and can retrieve all the metadata (multiple entries if the word offset is part of a nested range).
What code can I pick up that would do this efficiently (i.e., a sparse representation of the data)? Efficient, because my global corpus will be at least a few hundred megabytes.
Note :
I am serialising structured forum posts, which will include posts containing sections of quotes. I want to know which topic a word belonged to, and whether it is a quote or user text. There will probably be additional metadata as my work progresses. Note that a word belonging to a quote is what I meant by nested metadata: the word is part of a quote, which belongs to a post made by a user.
I know that one can tag words in NLTK; I haven't looked into it. If it is possible to do what I want that way, please comment. But I am still looking for the original approach.
There is probably something in numpy that can solve my problem; I am looking at that now.
edit
The input data is far too complex to rip out and post. I have found what I was looking for, though: http://packages.python.org/PyICL/. I needed to talk about intervals and not ranges :D I have used boost extensively, but making it a dependency makes me a bit uneasy (sadly, I am having compiler errors with PyICL :( ).
The question now is: does anyone know an interval container library or data structure that can index nested intervals in a sparse fashion, or, put differently, provides semantics similar to boost.icl?
If you don't want to use PyICL or boost.icl, then instead of relying on a specialized library you could just use sqlite3 to do the job. If you use an in-memory database it will still be a few orders of magnitude slower than boost.icl (from experience coding other data structures vs sqlite3), but it should be more effective than a C++ std::vector-style approach on top of Python containers.
You can use two integers per row and a date_type_low < offset < date_type_high predicate in your WHERE clause. Depending on your table structure, this will return nested/overlapping ranges.
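A minimal sketch of that sqlite3 approach (the table and column names are just illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE intervals (
        low  INTEGER NOT NULL,  -- start offset of the range (inclusive)
        high INTEGER NOT NULL,  -- end offset of the range (exclusive)
        meta TEXT    NOT NULL   -- metadata attached to this range
    )
""")
conn.execute("CREATE INDEX idx_intervals_low_high ON intervals(low, high)")

# Nested ranges: a post spanning offsets 0..100 containing a quote at 10..40.
conn.executemany("INSERT INTO intervals VALUES (?, ?, ?)",
                 [(0, 100, "post by user A"), (10, 40, "quote of user B")])

# All metadata covering word offset 25 -- returns both the post and the quote.
rows = conn.execute(
    "SELECT meta FROM intervals WHERE low <= ? AND ? < high", (25, 25)
).fetchall()
print(rows)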

Google App Engine Database Index

I need to store an undirected graph in a Google App Engine database.
For optimization purposes, I am thinking of using database indexes.
Using Google App Engine, is there any way to define the columns of a database table to create its index?
I will need some optimization, since my app uses this stored undirected graph for content-based filtering in item recommendation. Also, the recommender algorithm updates the weights of some of the graph's edges.
If it is not possible to use database indexes, please suggest another method to reduce the query time for the graph table. I believe my algorithm performs more read operations on the graph table than write operations.
PS: I am using Python.
Perhaps this will help: http://code.google.com/intl/sv-SE/appengine/docs/python/datastore/queriesandindexes.html#Defining_Indexes_With_Configuration
Are you actually seeing prohibitively slow queries? I'm guessing not. I suspect this is somewhat premature optimization. The App Engine datastore doesn't do any sorting, filtering, joins, or other meaningful operations in memory, so query times are generally fairly constant. In particular, query latency does not depend on the number of entities in your datastore, or even the number of entities that match your query; it only depends on the number of results you ask for.
On a related note, adding indexes to your datastore will not speed up existing queries. If a query needs a custom index, it won't degrade and run slower without it; the query simply won't run at all until you add the index.
For the specific query you mention, select * from edges where vertex1 == x and vertex2 == y, the datastore can run it without a custom index at all. See this section of the docs for more details.
In short, just run the queries you need, and don't think too much about indices or try to optimize as if you were a DBA. It's not a relational database. :P
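For completeness, a hedged sketch of that query with the legacy Python db API; the Edge model and its properties are assumptions, not something from the question:

from google.appengine.ext import db

class Edge(db.Model):
    # Hypothetical kind for the undirected graph's edges.
    vertex1 = db.StringProperty(required=True)
    vertex2 = db.StringProperty(required=True)
    weight = db.FloatProperty(default=1.0)

def get_edge(x, y):
    # Two equality filters on one kind: runs without any custom index.
    return Edge.all().filter("vertex1 =", x).filter("vertex2 =", y).get()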
