Tokenizing a Gensim dataset - python

I'm trying to tokenize a Gensim dataset, which I've never worked with before, and I'm not sure if it's a small bug or if I'm not doing it properly.
I loaded the dataset using
model = api.load('word2vec-google-news-300')
and from my understanding, to tokenize using nltk all I need to do is call
tokens = word_tokenize(model)
However, the error I'm getting is "TypeError: expected string or bytes-like object". What am I doing wrong?

word2vec-google-news-300 isn't a dataset that's appropriate to 'tokenize'; it's the pretrained GoogleNews word2vec model released by Google circa 2013 with 3 million word-vectors. It's got lots of word-tokens, each with a 300-dimensional vector, but no multiword texts needing tokenization.
You can run type(model) on the object that api.load() returns to see its Python type, which will offer more clues as to what's appropriate to do with it.
Also, nltk's word_tokenize() takes a single string; you typically wouldn't pass it an entire large dataset in one call anyway. (You'd be more likely to iterate over many individual texts as strings, tokenizing each in turn.)
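For example, a minimal sketch of both checks, assuming nltk's 'punkt' tokenizer data is installed and using a couple of made-up example strings:

import gensim.downloader as api
from nltk.tokenize import word_tokenize   # may require: nltk.download('punkt')

model = api.load('word2vec-google-news-300')   # large download on first use
print(type(model))   # a KeyedVectors object holding word-vectors, not raw text

# Tokenization only makes sense on raw text, one string at a time:
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Each of these strings can be tokenized individually.",
]
tokenized = [word_tokenize(text) for text in texts]
print(tokenized[0])   # ['The', 'quick', 'brown', 'fox', ...]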
Rewind a bit & think more about what kind of dataset you're looking for.
Try to get it in a simple format you can inspect yourself, as files, before doing extra steps. (Gensim's api.load() is really bad/underdocumented for that, returning who-knows-what depending on what you've requested.)
Try building on well-explained examples that already work, making minimal individual changes that you understand individually, checking continued proper operation after each step.
(Also, for future SO questions that may be any more complicated than this one: it's usually best to include the full error message, with all lines of 'traceback' context showing the files and lines-of-code involved, so readers can see which parts of your code, or of the libraries you're using, are most directly implicated.)

Related

How to extract unique words and their POS tags into separate columns while working with a Dataset

I am working through Indonesian data to use for NER and, as far as I can tell, there is no pretrained NLTK model to help with this language. So, to do this manually, I tried to extract all the unique words used in the entire data frame. I still don't know how to apply tags to the words, but this is what I did so far:
the first step,
the second step,
the third step,
the fourth step
Please let me know if there is any other convenient way to do what I did in the following code. Also, let me know how to add tags to each row (if possible) and how to do NER for this.
(I am new to coding, which is why I am not sure how to ask, but I am trying my best to provide as much information as possible.)
Depending on what you want to do, if results are all that matters, you could use a pretrained transformer model from Hugging Face instead of NLTK. This will be more computationally heavy but will also give you better performance.
There is one fitting model I could find (I don't speak Indonesian, obviously, so excuse any errors in the sample sentence):
https://huggingface.co/cahya/xlm-roberta-large-indonesian-NER?text=Nama+saya+Peter+dan+saya+tinggal+di+Berlin.
The easiest way to use this would probably be either the API or an inference-only pipeline; check out this guide. All you would have to do to get this running for the Indonesian model is to replace the previous model path (dslim/bert-base-NER) with cahya/xlm-roberta-large-indonesian-NER.
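For instance, a minimal sketch of that swap with the transformers pipeline API, assuming transformers plus a backend (e.g. PyTorch) are installed; the sample sentence is the one from the model-card link above:

from transformers import pipeline

# Same recipe as the linked guide, with the Indonesian model swapped in
ner = pipeline("ner", model="cahya/xlm-roberta-large-indonesian-NER")

sentence = "Nama saya Peter dan saya tinggal di Berlin."
for entity in ner(sentence):
    print(entity)   # dicts with 'word', 'entity', 'score', ...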
Note that this Indonesian model is quite large, so you need to have some decent hardware. If you don't you could alternatively use some (free) cloud computing service such as Google Colab.

Assistance with Keras for a noise detection script

I'm currently trying to learn more about deep learning/CNNs/Keras through what I thought would be a quite simple project: just training a CNN to detect a single specific sound. It's been a lot more of a headache than I expected.
I'm currently reading through this guide, ignoring the second section about GPU usage; the first part definitely seems like exactly what I need. But when I go to run the script (my script is pretty much lifted wholesale from the section in the link above that says "Putting the pieces together, you may end up with something like this:"), it gives me this error:
AttributeError: 'DataFrame' object has no attribute 'file_path'
I can't find anything in the pandas documentation about a DataFrame.file_path attribute, so I'm confused as to what that part of the code is attempting to do.
My CSV file contains two columns, one with the paths and then a second column denoting the file paths as either positive or negative.
Sidenote: I'm also aware that this entire guide just may not be the thing I'm looking for. I'm having a very hard time finding any material that is useful for the specific project I'm trying to do and if anyone has any links that would be better I'd be very appreciative.
The statement df.file_path means you want to access the file_path column in your dataframe. It seems that your dataframe object does not contain this column. With df.head() you can check whether your dataframe contains the needed fields.
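A quick way to see and fix the mismatch, sketched with hypothetical column and file names (adjust them to whatever your CSV actually contains):

import pandas as pd

df = pd.read_csv("labels.csv")   # hypothetical file name
print(df.head())
print(df.columns)   # e.g. Index(['path', 'class'], dtype='object')

# Rename the columns to the names the guide's code expects,
# so that df.file_path (and whatever label column it uses) resolve
df = df.rename(columns={"path": "file_path", "class": "label"})
print(df.file_path.head())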

How to convert images to TFRecords with tf.data.Dataset in most efficient way possible

I am absolutely baffled by how many unhelpful error messages I've received while trying to use this supposedly simple API to write TFRecords in a manner that doesn't take 30 minutes every time I have a new dataset.
Task:
I'd like to feed a list of image paths and a list of labels to a tf.data.Dataset, parse them in parallel to read the images and encode as tf.train.Examples, use tf.data.Dataset.shard to distribute them into different TFRecord shards (e.g. train-001-of-010.tfrecord, train-002-of-010.tfrecord, etc.), and for each shard finally write them to the corresponding file.
Since I've been debugging this for hours I haven't gotten any single particular error to fix, otherwise I would provide it. I've struggled to find any up-to-date tutorial that doesn't either (a) come from 2017 and use queue runners, (b) use a tf.Session (I'm using TensorFlow 1.15, but the official docs keep telling me to phase out sessions), (c) conveniently do the record creation in pure Python, which makes for a simple tutorial but is too slow for any actual application, or (d) use already-created TFRecords and just skip the whole process.
If necessary, I can put together an example of what I'm talking about. But since I'm getting stuck at every level of the process, at the moment it seems unhelpful.
Tldr:
If anyone has utilized tf.data.Dataset to create TFRecord shards in parallel please point me in a better direction than google has.
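For concreteness, a rough sketch of the workflow described above, with hypothetical path/label lists; it assumes eager execution, which on TF 1.15 is opt-in via tf.compat.v1.enable_eager_execution():

import tensorflow as tf

# Hypothetical inputs
image_paths = ["images/img_0.jpg", "images/img_1.jpg"]
labels = [0, 1]
num_shards = 10

def serialize_example(image_bytes, label):
    feature = {
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

ds = tf.data.Dataset.from_tensor_slices((image_paths, labels))
# Read the image files in parallel; encoding to tf.train.Example happens in Python below
ds = ds.map(lambda path, label: (tf.io.read_file(path), label),
            num_parallel_calls=tf.data.experimental.AUTOTUNE)

for shard_id in range(num_shards):
    filename = "train-%03d-of-%03d.tfrecord" % (shard_id + 1, num_shards)
    with tf.io.TFRecordWriter(filename) as writer:
        # Note: each pass over ds re-reads every file; fine for a sketch, wasteful at scale
        for image_bytes, label in ds.shard(num_shards, shard_id):
            writer.write(serialize_example(image_bytes.numpy(), int(label.numpy())))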

What is the input file format for the function word2vec from package word2vec?

I am trying to do my own word embedding using the package word2vec (https://pypi.org/project/word2vec/).
However, I can't find the file format of the input file for the function "word2vec".
I tried the .txt format and a pickle file, but neither works.
For example, corpus.txt was made with Windows Notepad and contains "I am a foo bar corpus test":
import word2vec
word2vec.word2vec("corpus.txt", "corpus.bin", size=100, verbose=True)
I would have expected:
Vocab size: 7
Words in train file: 7
as in the example here : https://nbviewer.jupyter.org/github/danielfrg/word2vec/blob/master/examples/word2vec.ipynb
but got only
Vocab size: 1
Words in train file: 0
Does anyone know which type/format of file this function accepts?
Thank you in advance!
There's a good chance your specific results are because most word2vec implementations discard all words that appear fewer than some minimum-count value, usually 5. (Word2Vec doesn't create good vectors for such rare words, and their presence usually interferes with better vectors for other more-common words, so discarding them is usually a good idea on real-sized corpuses.)
So a toy-sized input file, of just 7 words each appearing once, leaves nothing but (maybe) one synthetic word.
Because that PyPI package appears to be a thin wrapper around the word2vec.c code originally released by Google, you could probably refer to that code to learn more details about formats/usage.
But, you could also use the Word2Vec implementation in the Gensim library - a far more common choice when using Python, with much more documentation & flexibility.
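For example, a minimal Gensim sketch on the same toy corpus; min_count=1 is forced here only because the corpus is tiny, and the parameter is vector_size in Gensim 4.x (size in older releases):

from gensim.models import Word2Vec

# The toy corpus: one sentence, already tokenized into a list of words
sentences = [["I", "am", "a", "foo", "bar", "corpus", "test"]]

# min_count=1 keeps even single-occurrence words -- only sensible for toy data
model = Word2Vec(sentences, vector_size=100, min_count=1)

print(len(model.wv.index_to_key))   # 7 distinct words survive with min_count=1
print(model.wv["foo"][:5])          # first 5 dimensions of one word's vector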

Parser generation

I am doing a project on SOFTWARE PLAGIARISM DETECTION. I intend to do it with the C language; for that I am supposed to create a token generator and a parser, but I don't know where to start. Can anyone help me out with this?
I created a database of tokens and separated the tokens from my program. The next thing I want to do is compare two programs to find out whether one is plagiarized or not. For that I need to create a syntax analyzer. I don't know where to start.
i.e. I want to create a parser for C programs in Python.
If you want to create a parser in Python you can look at these libraries:
PLY
pyparsing
and Lepl - new but very powerful
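As a taste of the second option, a tiny pyparsing sketch; this is deliberately not a C grammar, just a toy tokenizer to show the library's style:

from pyparsing import OneOrMore, Word, alphas, alphanums, nums, oneOf

identifier = Word(alphas + "_", alphanums + "_")
number = Word(nums)
symbol = oneOf("+ - * / = ( ) { } ;")
token = identifier | number | symbol

tokens = OneOrMore(token).parseString("int x = 42 ;")
print(tokens.asList())   # ['int', 'x', '=', '42', ';']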
Building a real C parser by yourself is a really big task.
I suggest you either find one that is already done, e.g. pycparser, or define a really simple subset of C that is easily parsed.
You'll have plenty of work to do for your plagiarism detector after you are done parsing C.
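And a minimal pycparser sketch, assuming the C source has already been preprocessed (pycparser does not handle #include/#define itself); the sample function is made up:

from pycparser import c_parser

# pycparser expects preprocessed C: no #include or #define directives left
src = """
int add(int a, int b) {
    return a + b;
}
"""

ast = c_parser.CParser().parse(src, filename="<sample>")
ast.show()   # dumps the parse tree; comparing trees is one possible plagiarism signal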
I'm not sure you need to parse the token stream to detect the features you're looking for. In fact, it's probably going to complicate things more than anything.
What you're really looking for is sequences of original source code that have a very strong similarity with a suspect code sample being tested. This sounds very similar to the purpose of a Bayes classifier, like those used in spam filtering and language detection.
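A simpler first-pass baseline than a full classifier, if you already have token streams, is plain sequence similarity from Python's standard library; a minimal sketch with made-up token lists:

from difflib import SequenceMatcher

# Hypothetical token streams extracted from the two programs being compared
tokens_a = ["int", "main", "(", ")", "{", "return", "0", ";", "}"]
tokens_b = ["int", "main", "(", "void", ")", "{", "return", "0", ";", "}"]

ratio = SequenceMatcher(None, tokens_a, tokens_b).ratio()
print("similarity: %.2f" % ratio)   # near 1.0 means the token order is almost identical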
