Preparing CSV data for ML - Python

I would like to implement an ML model for a classification problem. My CSV data looks like this:
Method1; Method2; Method3; Method4; Category; Class
result1; result2; result3; result4; Sport; 12
...
...
All of the methods give a text answer. Sometimes it is one word, sometimes more, and sometimes the cell is empty (no answer from that method). The "Category" column always contains text, and the "Class" column is a number encoding which methods gave the correct answer (i.e. 12 means that only the results from methods 1 and 2 are correct). I may add more columns if necessary.
Now, given new answers from all the methods, I would like to classify them into one of the classes.
How should I prepare this data? I know I need numerical data, but how do I get there, and how do I handle all the empty cells and the inconsistent number of words in each answer?

There are many different ways of doing this, but the simplest would be to use a Bag of Words representation, which means concatenating all your Methodx columns and counting how many times each word appears in them.
With that, you have a vector representation (each word is a column/feature, each count is a numerical value).
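A minimal sketch of that preparation with pandas and scikit-learn (the file name, the semicolon separator, and filling empty cells with an empty string are assumptions based on the sample above):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Read the semicolon-separated file from the question (file name assumed).
df = pd.read_csv("data.csv", sep=";", skipinitialspace=True)
df.columns = df.columns.str.strip()

# Concatenate the method columns; empty cells contribute nothing.
method_cols = ["Method1", "Method2", "Method3", "Method4"]
text = df[method_cols].fillna("").apply(lambda row: " ".join(row), axis=1)

# Bag of Words: one feature per word, word counts as values.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text)  # sparse matrix: rows x distinct words
y = df["Class"]                     # the target for the classifier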
Now, from here there are several problems (the main one being that the number of columns/features in your dataset will be quite large), so you may have to either preprocess your data further or find an ML technique that can deal with it for you. In any case, I would recommend having a look at a few NLP tutorials to get a better idea of this and a better sense of what the best solution for your dataset would be.

Related

Building a machine learning model with a dataset which has only string values

I am working with a dataset consisting of 190 columns and more than 3 million rows.
Unfortunately, all of the data is stored as string values.
Is there any way of building a model with this kind of data, other than tokenising it?
Thank you and regards!
This may not answer your question fully, but I hope it will shed some light on what you can and cannot do.
Firstly, I don't think there is any straightforward answer, as most ML models depend on the data that you have. If your string data is simply Yes or No (binary), it can easily be dealt with by replacing Yes = 1 and No = 0, but that doesn't work for something like country.
One-Hot Encoding - For features like country, it would be fairly simple to just one-hot encode them and train the model on the resulting numerical data. But with the number of columns you have, and depending on the number of unique values in such a large amount of data, the dimensionality will increase by a lot.
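As a rough sketch of that dimensionality concern with pandas (the country column is just an example):

import pandas as pd

df = pd.DataFrame({"country": ["US", "DE", "IN", "DE"]})

# One new column per unique value; with high-cardinality features
# this is exactly where the dimensionality blows up.
encoded = pd.get_dummies(df, columns=["country"])
print(encoded.shape)  # (4, 3) here, but one column per distinct country in general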
Assigning numeric values - We also cannot simply assign numeric values to the strings and expect the model to work, as there is a very high chance that the model will pick up on a numeric order that does not exist in the first place. more info
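To make that pitfall concrete, a tiny sketch (LabelEncoder is used here only to illustrate the accidental ordering; it is really meant for target labels):

from sklearn.preprocessing import LabelEncoder

countries = ["Brazil", "Austria", "Canada"]
codes = LabelEncoder().fit_transform(countries)
print(codes)  # [1 0 2] - the model would see Austria < Brazil < Canada, an order that means nothing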
Bag of words, Word2Vec - Since you excluded tokenization, I don't know if you want to do this, but there are also these options.
Breiman's randomForest in R - "This implementation would allow you to use actual strings as inputs." I am not familiar with R, so I cannot confirm how far this is true. Nevertheless, you can find more about it here.
One-Hot + Vector Assembler - I only came up with this in theory. If you manage to somehow convert your string data into numeric values (using one-hot or other encoders) while preserving the underlying representation of the data, the numerical features can be combined into a single vector using VectorAssembler in PySpark (Spark). More info about VectorAssembler is here.
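If you do go the Spark route, a minimal VectorAssembler sketch might look like this (it assumes the string columns have already been encoded to numbers; the column names are placeholders):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()

# Toy frame standing in for already-encoded (numeric) features.
df = spark.createDataFrame(
    [(1.0, 0.0, 3.2), (0.0, 1.0, 1.7)],
    ["country_us", "country_de", "amount"],
)

assembler = VectorAssembler(
    inputCols=["country_us", "country_de", "amount"],
    outputCol="features",
)
df_with_features = assembler.transform(df)  # adds a single vector column "features"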

What is the best way to integrate different Excel files with different sheets with different formats in Python?

I have multiple Excel files with different sheets in each file. These files have been made by people, so each one has a different format, a different number of columns, and a different structure for representing the data.
For example, in one sheet the dataframe/table starts at the 8th row, second column; in another it starts at the 122nd row, etc.
I want to retrieve something that these Excel files have in common: variable names and their information.
However, I don't know how I could retrieve all this information without parsing each individual file by hand. That is not an option, because there are a lot of these files, with lots of sheets in each one.
I have been thinking about using regex as well as edit distance between words, but I don't know if that is the best option.
Any help is appreciated.
I will divide my answer into what I think you can do now, and suggestions for the future (if feasible).
An attempt to "solve" the problem you have with existing files.
Without regularity in your input files (such as at least a common column name), I think what you're describing is among the best solutions. Having said that, perhaps a "fancier" similarity metric between column names would be more useful than regular expressions.
If you believe there will be some regularity in the column names, you could look at string distances such as the Hamming distance or the Levenshtein distance, and use a threshold on the distance that works for you. As an example, assuming you have a function d(a: str, b: str) -> float that calculates a distance between column names, you could do something like this:
# this variable is a small sample of "expected" column names
plausible_columns = [
    'interesting column',
    'interesting',
    'interesting-column',
    'interesting_column',
]

for f in excel_files:
    # process the file until you find columns;
    # I'm assuming you can put the column names into
    # a variable `columns` here.
    for c in columns:
        for p in plausible_columns:
            if d(c, p) < threshold:
                # do something to process the column,
                # add to a pandas DataFrame, calculate the mean,
                # etc.
                pass
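In case it is useful, here is one possible stand-in for d() that needs no extra packages; it turns difflib's similarity ratio into a distance (a dedicated Levenshtein package would work just as well, and the threshold is something you would tune on a few files you inspect by hand):

from difflib import SequenceMatcher

def d(a: str, b: str) -> float:
    # SequenceMatcher's ratio is a similarity in [0, 1],
    # so 1 - ratio behaves like a normalized string distance.
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

threshold = 0.25  # placeholder value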
If the data itself can tell you something about whether you should process it (such as following a particular distribution, or falling within a particular range), you can use such features to decide whether to use that column or not. Even better, you can combine several of these characteristics to make a finer decision.
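For example, a simple range check could act as such a filter (the bounds and the 90% cut-off are placeholders):

import pandas as pd

def looks_plausible(series: pd.Series, low: float = 0.0, high: float = 100.0) -> bool:
    # Treat a column as usable if most of its non-missing values
    # fall inside the range we expect for this variable.
    values = pd.to_numeric(series, errors="coerce").dropna()
    if values.empty:
        return False
    return values.between(low, high).mean() > 0.9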
Having said this, I don't think a fully automated solution exists without inspecting some of the data manually and studying the distribution of the data, the variability in the column names, etc.
For the future
Even with fancy methods for calculating features and some data analysis on the data you have right now, I think it is impossible to guarantee that you will always get the data you need (by the very nature of the problem). A reasonable way to address this, in my opinion (if it is feasible in whatever context you're working in), is to impose a stricter format at the data-generation end (I assume people are entering data into Excel directly). I would argue that the best solution is to get rid of the problem at the root: create a unified form or Excel sheet template and distribute it to the people who will fill in the data, so that the data can then be ingested automatically with minimal risk of errors.

Python3 - Doc2Vec: Get document by vector/ID

I've already built my Doc2Vec model, using around 20,000 files. I'm looking for a way to find the string representation of a given vector/ID, which might be similar to Word2Vec's index2entity. I'm able to get the vector itself, using model['n'], but now I'm wondering whether there's a way to get some sort of string representation of it as well.
If you want to look up your actual training text, for a given text+tag that was part of training, you should retain that mapping outside the Doc2Vec model. (The model doesn't store the training texts – it only looks at them, repeatedly, during training.)
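For instance, a minimal sketch of keeping such a mapping alongside gensim's usual TaggedDocument workflow (the tags, texts and parameters here are placeholders):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = {"doc_0": "first document text", "doc_1": "second document text"}  # your own tag-to-text mapping

corpus = [TaggedDocument(words=text.split(), tags=[tag]) for tag, text in texts.items()]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=10)

vector = model.dv["doc_0"]  # the learned doc-vector for that tag (model.docvecs in gensim < 4.0)
original = texts["doc_0"]   # the text itself, kept outside the model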
If you want to generate a text from a Doc2Vec doc-vector, that's not an existing feature, nor do I know any published work describing a reliable technique for doing so.
There's a speculative/experimental bit of work-in-progress for gensim Doc2Vec that will forward-propagate a doc-vector through the model's neural-network, and report back the most-highly-predicted target words. (This is somewhat the opposite of the way infer_vector() works.)
That might, plausibly, give a sort-of summary text. For more details see this open issue & the attached PR-in-progress:
https://github.com/RaRe-Technologies/gensim/issues/2459
Whether this is truly useful or likely to become part of gensim is still unclear.
However, note that such a set-of-words wouldn't be grammatical. (It'll just be the ranked-list of most-predicted words. Perhaps some other subsystem could try to string those words together in a natural, grammatical way.)
Also, the subtleties of whether a concept has many potential associated words, or just one, could greatly affect the "top N" results of such a process. To contrive a possible example: there are many words for describing a 'cold' environment. As a result, a doc-vector for a text about something cold might have lots of near-synonyms for 'cold' in the 11th-20th ranked positions – such that the "total likelihood" of at least one cold-ish word is very high, maybe higher than that of any one other word. But just looking at the top-10 most-predicted words might instead list other "purer" words whose likelihood isn't so divided, and miss the (more-important-overall) sense of "coldness". So, this experimental pseudo-summarization method might benefit from a second pass that somehow "coalesces" groups of related words into their most representative words, until some overall proportion (rather than a fixed top-N) of the doc-vector's predicted words is communicated. (This process might be vaguely like finding a set of M words whose "Word Mover's Distance" to the full set of predicted words is minimized – though that could be a very expensive search.)

Should I drop a variable that has the same value in the whole column for building machine learning models?

For instance, column x has 50 values and all of these values are the same.
Is it a good idea to delete variables like these for building machine learning models? If so, how can I spot these variables in a large data set?
I guess a formula/function might be required to do so. I am thinking of using nunique, applied to the whole dataset.
You should delete such columns because they provide no extra information about how one data point differs from another. It's fine to leave such a column in for some machine learning models (due to the nature of how the algorithms work), like random forests, because the column will simply never be selected to split the data.
To spot them, especially for categorical or nominal variables (with a fixed number of possible values), you can count the occurrences of each unique value, and if the mode accounts for more than a certain threshold (say 95%), delete that column from your model.
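A pandas sketch of both checks, using nunique for strictly constant columns and a mode-frequency threshold (95% here, as above) for near-constant ones:

import pandas as pd

def columns_to_drop(df: pd.DataFrame, mode_threshold: float = 0.95) -> list:
    drop = []
    for col in df.columns:
        if df[col].nunique(dropna=False) <= 1:
            drop.append(col)  # strictly constant column
        elif df[col].value_counts(normalize=True, dropna=False).iloc[0] >= mode_threshold:
            drop.append(col)  # dominated by a single value
    return drop

# df = df.drop(columns=columns_to_drop(df))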
I personally would go through the variables one by one if there aren't too many, so that I can fully understand each variable in the model, but the systematic approach above works when the feature set is too large for that.

Data Structure for Text Classification Task

I'm doing a text classification / tagging task and I would like to ask what kind of data structure would serve me best. The training data set I have is about 4 gigs (after some cleaning, but should be even smaller if I discard the rare words) with 6 million documents. Each document has 4 fields:
Document ID
Title
Body
Tags (as a string, e.g. "apple sql-server linux". This represents three tags, separated by a space. Documents can have 1-5 tags)
I've just finished the cleaning phase (stemming, stop words, etc.) and I'm about to convert the documents into TF-IDF word vectors with scikit-learn, so the output is a scipy sparse matrix. I would like to keep the Title and Body as two separate vectors and combine them at a later stage, once I decide what weighting to give the Title. The Title and Body vectors are sparse, but they are built from the same dictionary, so they have the same number of columns.
What is the best way to represent this information? I come from R, so I'm used to storing things in data.tables / data frames, but that doesn't seem very applicable to text classification and sparse matrices. One thing I thought about doing is creating my own "Document" class and just having a list of these objects to represent the corpus. I don't think this is very efficient, since I will probably want to do something like returning all docs with the tag apple.
The ML algorithms I plan to run are k-means clustering, kNN, Naive Bayes and possibly SVM. There will probably be others that I haven't thought of yet.
I'm new to Python and text classification - any help is greatly appreciated, and I am especially interested in hearing from people who have done this before.
Thank you!
Your best bet is a list of dictionary objects: a list to hold all the documents, and one dictionary per document to hold all the information regarding that document.
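A minimal sketch of that structure (the field names are placeholders, and doc_ids, tags_per_doc, title_tfidf and body_tfidf are assumed to already exist, with one row per document in each sparse matrix):

corpus = [
    {
        "id": doc_ids[i],
        "title_vec": title_tfidf[i],   # row i of the sparse Title matrix
        "body_vec": body_tfidf[i],     # row i of the sparse Body matrix
        "tags": set(tags_per_doc[i]),  # e.g. {"apple", "sql-server", "linux"}
    }
    for i in range(len(doc_ids))
]

# e.g. all documents tagged "apple":
apple_docs = [doc for doc in corpus if "apple" in doc["tags"]]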
