Storing unstructured data for sentiment analysis - python

I am doing an NLP term project and am analyzing over 100,000 news articles from this corpus. https://github.com/philipperemy/financial-news-dataset
I am looking to perform sentiment analysis on this dataset using NLTK. However, I am a bit confused about what the pipeline for storing and accessing all of these articles should look like.
The articles are text files that I read and preprocess in order to extract some metadata and the main article text. Currently, I am storing the data from each article in a Python object like this:
{
'title' : title,
'author' : author,
'date' : date,
'text' : text,
}
I would like to store these objects in a database so I don't have to read all of these files every time I want to do analysis. My problem is that I'm not really sure which database to use. I want to be able to use regexes on certain fields, such as date and title, so I can isolate documents by date and company name. I was thinking of going the NoSQL route and using a DB like MongoDB or CouchDB, or maybe even a search engine such as Elasticsearch.
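If the MongoDB route is chosen, a minimal sketch of storing and querying these article objects with pymongo could look like the following; the database/collection names and the sample article values are placeholders, not part of the original question:

from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumes a local MongoDB instance
articles = client["news"]["articles"]               # hypothetical database/collection names

# one parsed article in the object format shown above (placeholder values)
article = {
    "title": "Goldman Sachs posts quarterly results",
    "author": "Jane Doe",
    "date": datetime(2012, 4, 17),
    "text": "...",
}
articles.insert_one(article)

# case-insensitive regex on the title, e.g. to isolate articles mentioning a company
for doc in articles.find({"title": {"$regex": "goldman sachs", "$options": "i"}}):
    print(doc["title"])

# range query on the date field (works if dates are stored as datetime objects)
recent = articles.find({"date": {"$gte": datetime(2012, 1, 1), "$lt": datetime(2013, 1, 1)}})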
After I query for the documents I want to use for analysis, I will tokenize the text, POS tag it, and perform NER using NLTK. I have already implemented this part of the pipeline. Is it smart to do this after the documents have already been indexed in the database? Or should I look at storing the processed data in the database as well?
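For reference, the tokenize / POS-tag / NER step mentioned above can be sketched with NLTK roughly like this (the sample sentence is a placeholder, and the download names can vary slightly between NLTK versions):

import nltk

# one-time downloads for the tokenizer, tagger and chunker models
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("maxent_ne_chunker")
nltk.download("words")

text = "Goldman Sachs reported earnings in New York on Tuesday."
tokens = nltk.word_tokenize(text)   # tokenization
tagged = nltk.pos_tag(tokens)       # POS tagging
tree = nltk.ne_chunk(tagged)        # NER as a chunk tree

# pull out (entity text, entity label) pairs such as ORGANIZATION or GPE
entities = [(" ".join(word for word, _ in subtree.leaves()), subtree.label())
            for subtree in tree.subtrees() if subtree.label() != "S"]
print(entities)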
Finally, I will use this processed data to classify each article, using a trained model I've already developed. I already have a gold standard, so I will compare the classifications against it.
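For the comparison against the gold standard, scikit-learn's metrics are one convenient option (a sketch only; the label values below are placeholders and assume the predictions and gold labels are parallel lists):

from sklearn.metrics import accuracy_score, classification_report

gold = ["positive", "negative", "neutral", "positive"]    # placeholder gold-standard labels
pred = ["positive", "negative", "positive", "positive"]   # placeholder model predictions

print(accuracy_score(gold, pred))
print(classification_report(gold, pred))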
Does this pipeline generally look correct? I don't have much experience with using large datasets like this.

Related

How to classify and digitize a huge amount of paper using Python

I have an archive of papers in a company representing different business operations from different sections.
I want to scan all these documents, and after that I want a way to classify all the scanned documents into different categories and sub-categories based on custom preferences such as (name, age, section, etc.).
I want the end result to be digital files categorized according to the preferences that I set.
How can I do this using Python NLP or any other machine learning approach?
I think that this can be a basic pipeline (a minimal sketch of it follows the list):
Scanning part: preprocessing of the paper images with OpenCV + text extraction using an OCR library (pytesseract, EasyOCR);
Topic extraction: get the desired information to classify the documents using e.g. spaCy;
Categorization: using plain Python, maybe pandas.
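A minimal sketch of such a pipeline, assuming pytesseract for OCR, spaCy's small English model for entity extraction, and pandas for the final categorization; the file names and the chosen entity labels are placeholders:

import cv2
import pandas as pd
import pytesseract
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes: python -m spacy download en_core_web_sm

def ocr_page(image_path):
    # preprocess a scanned page with OpenCV, then extract its text with Tesseract
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary)

def extract_fields(text):
    # pull out candidate names, organizations and dates with spaCy NER
    doc = nlp(text)
    return {
        "names": [e.text for e in doc.ents if e.label_ == "PERSON"],
        "orgs": [e.text for e in doc.ents if e.label_ == "ORG"],
        "dates": [e.text for e in doc.ents if e.label_ == "DATE"],
    }

rows = []
for path in ["scan_001.png", "scan_002.png"]:   # placeholder file names
    fields = extract_fields(ocr_page(path))
    fields["file"] = path
    rows.append(fields)

df = pd.DataFrame(rows)   # categorize/filter into folders with pandas from here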

spaCy NER model for date and text extraction from syllabi to create calendar events: tips on how to make it perform optimally?

I'm creating a spaCy NER model to extract events from documents. To do this, I created an entity label (that essentially looks for terms like 'exam', 'test', 'assignment') and trained the model on some data that I wrote myself (because I couldn't find data online for my purpose). So far, I've only written 30 lines of data, which I know is abysmally small, and I hope to write far more. Right now, the model performs well on nicely formatted documents but also ends up parsing random text as events. Is this because of how small my training data is, or should I approach this some other way? I found mixed responses to similar questions, so I figured it would be best to ask here.
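For context, a minimal sketch of what one annotated training example looks like in spaCy v3; the EVENT label and the example sentence are placeholders for whatever is actually being annotated:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()

text = "The midterm exam is on October 12."
doc = nlp.make_doc(text)
span = doc.char_span(4, 16, label="EVENT")   # characters 4-16 cover "midterm exam"
doc.ents = [span]
doc_bin.add(doc)

doc_bin.to_disk("train.spacy")   # then train with: python -m spacy train config.cfg ...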
Additionally, any Python date parsing library I use fails a lot (it fails to parse many dates when the fuzzy option is false and parses random numbers as dates when fuzzy is true). Is there any guidance on that front?
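A small illustration of the fuzzy behaviour being described, assuming the dateutil library (which is where the fuzzy flag comes from); the sentences are placeholders:

from dateutil import parser

# without fuzzy, only strings that are essentially just a date succeed
parser.parse("October 12, 2023")                           # works
# parser.parse("The exam is on October 12, 2023")          # raises an exception without fuzzy=True

# fuzzy=True skips unknown tokens, but can also latch onto stray numbers in the text
dt, skipped = parser.parse("The exam is on October 12, 2023", fuzzy_with_tokens=True)
print(dt, skipped)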
Thanks in advance.

Extract PDF table data using Azure Form Recognizer

I am working on an invoice processing project using Azure Form Recognizer. All the invoices are in PDF format. I am using a custom form recognizer model with labeling. I can extract some data from the PDFs, like invoice number, invoice date, amount, etc., but I also want to extract table data from the PDFs using Azure Form Recognizer, and it is not reading the table correctly.
I have labeled the cells that I need, and when the number of rows in the table increases it reads the columns correctly, but it is unable to separate the values of each row from each other and returns the whole column as a single value.
I tried to provide more examples, but it is still failing to detect the correct table. Is there any way to extract table data properly from PDF using Azure Form Recognizer?
Scanning the table is an essential requirement for our application, and it will decide whether or not we base our application on Azure Form Recognizer.
Please see the PDF table image below; we want to extract all row data from all columns.
If you can point us in the right direction with some documentation on this, it would be very helpful.
Thanks
Please try the following -
Train without labels and see if it detects and extracts the table you need. See quickstart here - https://learn.microsoft.com/en-us/azure/cognitive-services/form-recognizer/quickstarts/python-train-extract?tabs=v2-0
If the table is not detected by training without labels, and if you are using training with labels and the table is not detected automatically, then we do not yet support labeling of tables natively. As a workaround, you could try labeling the table as key-value pairs to extract the values. When labeling tables as key-value pairs, label each cell as a value, so for the above table you should have 5 values per column - Desc1, Desc2, Desc3...Desc5, Hours1, Hours2, Hours3, ...Hours5. In this case you will need to train with tables that have the maximum number of rows.
Neta - MSFT
Form Recognizer has released an invoice-specific model which works across different invoice layouts. Please take a look at the documentation below:
https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/concept-invoice
It allows you to extract header fields as well as line items and their details.
You can try this model using Form Recognizer Studio (you need an Azure subscription and a Form Recognizer resource):
https://formrecognizer.appliedai.azure.com/studio/prebuilt?formType=invoice
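A minimal sketch of calling the prebuilt invoice model from Python, assuming the azure-ai-formrecognizer SDK (v3.2+); the endpoint, key, and file name are placeholders:

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",   # placeholder
    credential=AzureKeyCredential("<your-key>"),                        # placeholder
)

with open("invoice.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-invoice", document=f)
result = poller.result()

# line items come back under the "Items" field of each analyzed invoice
invoice = result.documents[0]
items = invoice.fields.get("Items")
if items:
    for line in items.value:
        row = {name: field.value for name, field in line.value.items()}
        print(row)   # e.g. Description, Quantity, UnitPrice, Amount per row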

Does metadata of a data frame help build features for ML algorithms

Recently I was given a task by a potential employer to do the following:
- transfer a data set to S3
- create metadata for the data set
- create a feature for the data set in Spark
Now, this is a trainee position, and I am new to data engineering in terms of concepts, so I am having trouble understanding how, or even if, metadata is used to create a feature.
I have gone through numerous sites on feature engineering and metadata, but none of them really give me an indication of whether metadata is directly used to build a feature.
What I have gathered so far is that when you build a feature, you extract certain columns from a given data set and then put this information into a feature vector for the ML algorithm to learn from. So to me, you could just build a feature directly from the data set and not be concerned with the metadata.
However, I am wondering if it is common to use metadata to search for given information within multiple datasets to build the feature, i.e. you look in the metadata file, see certain criteria that fit the feature you're building, and then load the data referenced by the metadata and build the feature from there to train the model.
So as an example, say I have multiple files for certain car models from a manufacturer (VW Golf, VW Fox, etc.), and each contains the year and the price of the car for that year, and I would like the ML algorithm to predict the depreciation of the car in the future, or the depreciation of the newest model of that car for years to come. Instead of going directly through all the datasets, you would check the metadata (tags, if that's the correct wording) for certain attributes to train the model, and then by using those tags it loads the data in from the specific data sets.
I could very well be off base here, or the example I've given above may be completely wrong, but if anyone could explain how metadata can be used to build features, if it can, that would be appreciated, or even share links to data engineering websites that explain it. Over the last day or two of researching, I've found that there is more material on data science than on data engineering itself, and most data engineering info comes from blogs, so I feel like there is pre-existing knowledge I am supposed to have when reading them.
P.S. Though this is not a coding question, I have used the python tag as it seems most data engineers use Python.
I'll give a synopsis of this.
Here we need to understand two conditions:
1) Do we have features that are directly related to building ML models?
2) Are we facing data scarcity?
Always ask: what does the problem statement suggest to us about generating features?
There are many ways we can generate features from a given dataset. PCA, truncated SVD, and t-SNE are dimensionality reduction techniques where new features are created from the given features; there are also feature engineering techniques like Fourier features, trigonometric features, etc. Then we move on to the metadata, such as the type of a feature, the size of a feature, the time when it was extracted, and so on. Metadata like this also helps us in creating features for building ML models, but it depends on how we have performed feature engineering on the data corpus of the respective problem.
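To make the metadata-driven idea from the question concrete, here is a small, entirely hypothetical sketch: a metadata catalog is filtered by its tags to decide which files to load, and a simple feature is then built from the loaded data with pandas. The file paths, tags, and the "age" feature are all made up for illustration:

import pandas as pd

# hypothetical metadata catalog describing the datasets sitting in S3
catalog = [
    {"path": "s3://bucket/vw_golf.csv", "manufacturer": "vw",   "model": "golf", "columns": ["year", "price"]},
    {"path": "s3://bucket/vw_fox.csv",  "manufacturer": "vw",   "model": "fox",  "columns": ["year", "price"]},
    {"path": "s3://bucket/ford_ka.csv", "manufacturer": "ford", "model": "ka",   "columns": ["year", "price"]},
]

# 1) use the metadata (tags) to pick which datasets are relevant, without opening them
selected = [entry for entry in catalog if entry["manufacturer"] == "vw"]

# 2) only now load the selected data and build features from the actual columns
#    (reading s3:// paths with pandas assumes s3fs is installed)
frames = [pd.read_csv(entry["path"]).assign(model=entry["model"]) for entry in selected]
df = pd.concat(frames, ignore_index=True)

# simple engineered feature: car age relative to a reference year
df["age"] = 2024 - df["year"]
features = df[["age", "model"]]
target = df["price"]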

Methods to extract keywords from large documents that are relevant to a set of predefined guidelines using NLP/ Semantic Similarity

I'm in need of suggestions on how to extract keywords from a large document. The keywords should be in line with what we have defined as the intended search results.
For example,
When a document about a company is given, I need the owner's name, where the office is situated, and what the operating industry is, and the defined set of words would be,
{owner, director, office, industry, ...} - (1)
the intended output has to be something like,
{Mr. Smith James, , Main Street, Financial Banking} - (2)
I was looking for a method related to semantic similarity, where sentences containing words similar to the given set (1) would be extracted, and then POS tagging would be used to extract nouns from those sentences.
It would be useful if further resources could be provided that support this approach.
What you want to do is referred to as Named Entity Recognition (NER).
In Python there is a popular library called spaCy that can be used for that. The standard models are able to detect 18 different entity types, which is a fairly good amount.
Persons and company names should be extracted easily, while whole addresses and the industry might be more difficult. Maybe you would have to train your own model on these entity types; spaCy also provides an API for training your own models.
Please note that you need quite a lot of training data to get decent results. Start with 1000 examples per entity type and see if that's sufficient for your needs. POS can be used as a feature.
If your data is unstructured, this is probably one of the most suitable approaches. If you have more structured data, you could maybe take advantage of that.
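A minimal sketch of the spaCy approach described above, assuming the small English model; the example sentence and the mapping from entity labels to the wanted fields are placeholders:

import spacy

nlp = spacy.load("en_core_web_sm")   # assumes: python -m spacy download en_core_web_sm

text = "Smith James is the owner of the company, whose office is on Main Street."
doc = nlp(text)

# entity label -> field of interest (the industry would likely need a custom-trained label)
wanted = {"PERSON": "owner", "FAC": "office", "GPE": "office", "ORG": "organisation"}

extracted = {}
for ent in doc.ents:
    if ent.label_ in wanted:
        extracted.setdefault(wanted[ent.label_], []).append(ent.text)

print(extracted)   # e.g. {'owner': ['Smith James'], 'office': ['Main Street']}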
