I'm new to programming and don't have much experience yet. I can follow some Python code, but not in detail.
I have an Excel file which contains log files of problems people encountered. The description of the problem is pasted as an email (so it's a bunch of text). I want to analyze all of these texts (almost 1.000 rows in Excel) at once, and I think Python can do this.
The type of analysis I want to do is sentiment analysis (positive, neutral, negative) or I want to see the main problem out of the text. I don't know if the second one is possible.
I copied the emails listed in the Excel file to a .txt file, so now every line is one message. How can I use Python to analyze every single line as one message and have it show me the sentiment or the main problem?
I'd appreciate the help
Sentiment analysis is a fairly large problem in computer science/language. How specific did you want to get?
I'd recommend looking into Text-Processing for simple SA.
Their API docs are here, http://text-processing.com/docs/sentiment.html, which will return a simple pos and neg score for your text.
If you want anything more specific, I'd recommend looking into IBM Watson, specifically Natural Language Understanding: https://www.ibm.com/watson/developercloud/natural-language-understanding.html
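To illustrate the "one line is one message" part of the question, here is a minimal sketch. It does not call the Text-Processing or Watson APIs; `toy_score` and its two word lists are invented stand-ins so the example runs offline, and any real scorer (for instance a function that wraps one of the services above) could be passed in as `score_fn` instead.

```python
import re

# Toy word lists, invented for illustration only; not a real sentiment lexicon.
POSITIVE = {"thanks", "great", "works", "resolved", "happy"}
NEGATIVE = {"error", "broken", "fails", "crash", "unable", "problem"}


def toy_score(message):
    """Label a message by counting toy positive and negative words."""
    words = re.findall(r"[a-z']+", message.lower())
    balance = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if balance > 0:
        return "positive"
    if balance < 0:
        return "negative"
    return "neutral"


def label_messages(lines, score_fn=toy_score):
    """Treat every non-empty line as one message and label it."""
    return [(line.strip(), score_fn(line)) for line in lines if line.strip()]


# Made-up sample lines standing in for the contents of the .txt file.
sample_lines = [
    "The export fails with an error every morning.\n",
    "\n",
    "Thanks, the fix works great.\n",
    "Please send me the manual.\n",
]
results = label_messages(sample_lines)
for message, label in results:
    print(label, "|", message)
```

Because an open file yields its lines one by one, the same function should work on the real data with something like `label_messages(open("emails.txt", encoding="utf-8"))`, where `emails.txt` is a hypothetical file name.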
For those who want to skip the reasoning behind the question, jump to the TL;DR.
Hi, I'm currently reading a lot of companies' annual financial reports. While the first one is the most interesting, the ones that follow are often the same in many respects, so I'm mostly interested in the differences between them. The documents come as PDFs, which are hard to compare, so I thought it would be nice to extract them as plain text and compare them with a diff tool. That's what I did: I piped the following two PDFs through pdftotext with the parameters below:
annual report for 2018
annual report for 2019
pdftotext -enc UTF-8 -nopgbrk -eol mac
I then realized that diff tools seem to have problems with line breaks. If I have the exact same sentences in both documents but with different line breaks, they are shown as a difference. Bullet points in the PDFs are also converted to different symbols in the text files, which leads to further differences. So I looked into NLP, hoping to find some help there.
TL;DR
I just want to reformat the two snippets below in a defined way so that I no longer get diffs in a diff tool. For example, lines should be at most 80 characters long, and I want a normalized/canonical way of printing bullet points and the like.
I'm currently using spaCy, and here is an example of two text snippets that are essentially the same but produce a lot of diffs in diff tools. How can I reprint both snippets to a text document so that the line breaks are the same? Is there even a method to detect that two sentences are exactly the same except that one of them contains one additional word? I would like to reformat that case as well, without shifting the line breaks by one word.
import spacy
nlp = spacy.load("en_core_web_sm")
SE_2018_10k_string = '''x
“paying users” refers to the number of unique accounts through which a payment is made in our online games in a particular period. A unique
account through which payments are made in more than one online game or in more than one market is counted as more than one paying user.
“QPUs” refers to the aggregate number of paying users during the quarterly period;
x'''
doc1 = nlp(SE_2018_10k_string)
print('SE_2018_10k_string')
for token in doc1:
    print(token.text)
SE_2019_10k_string = '''●
“paying users” refers to the number of unique accounts through which a payment is made in our online games in a particular period. A unique account
through which payments are made in more than one online game or in more than one market is counted as more than one paying user. “QPUs” refers to
the aggregate number of paying users during the quarterly period;
●'''
doc2 = nlp(SE_2019_10k_string)
print('SE_2019_10k_string')
for token in doc2:
    print(token.text)
print(doc1.similarity(doc2))
There is no universal way to get rid of the problems you are seeing.
If you find that you have line breaks in different places but your texts are otherwise the same, you can normalize things by removing line breaks. If you find only spaces are different, you can remove spaces, or convert any run of spaces to a single space. If bullets are an issue you can remove them or convert them to a single type of character (but how do you tell if something is a bullet in code? there is no standard way).
Appropriate normalization depends on your data, and for OCR it's typically going to just be hard.
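As a rough sketch of the normalization described above, the snippet below joins lines, collapses runs of whitespace, maps bullet markers to a single `*`, and rewraps everything to 80 columns with the standard `textwrap` module. `BULLET_MARKERS` is an assumption based on the two samples in the question (a lone `●` or a lone `x` on its own line); other PDFs will need a different set.

```python
import re
import textwrap

# Assumed bullet markers, taken from the two samples in the question.
BULLET_MARKERS = {"●", "•", "x"}


def normalize(text, width=80):
    """Join lines, unify bullets to '*', and rewrap to a fixed width."""
    items, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in BULLET_MARKERS:
            # A line holding only a bullet marker starts a new item.
            if current:
                items.append(" ".join(current))
            current = []
        elif stripped:
            current.append(stripped)
    if current:
        items.append(" ".join(current))

    wrapped = []
    for item in items:
        item = re.sub(r"\s+", " ", item)  # collapse any run of whitespace
        wrapped.append(
            textwrap.fill(item, width=width, initial_indent="* ", subsequent_indent="  ")
        )
    return "\n".join(wrapped)


# Made-up miniature versions of the two snippets: same words, different breaks.
old = "x\nfirst sentence is split\nhere. Second one.\nx"
new = "●\nfirst sentence is\nsplit here. Second one.\n●"
print(normalize(old))
print(normalize(old) == normalize(new))
```

Writing both normalized texts to files and diffing those should leave only genuine wording changes, although an inserted word will still shift the wrapping of the rest of that item.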
Is there even a method to find things like two sentences are exactly the same but in one sentence there is one additional word.
You can use edit distance metrics like Levenshtein distance to find this. It won't help you with existing diff tools though, since they show any difference.
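For the "one additional word" case, a word-level variant of Levenshtein distance is a natural fit. The function below is a plain-Python sketch of the standard dynamic-programming algorithm applied to word lists instead of characters; the name `word_edit_distance` is just illustrative.

```python
def word_edit_distance(a, b):
    """Levenshtein distance counted in whole words rather than characters."""
    x, y = a.split(), b.split()
    # previous[j] holds the distance between the words seen so far in x
    # and the first j words of y.
    previous = list(range(len(y) + 1))
    for i, word_x in enumerate(x, start=1):
        current = [i]
        for j, word_y in enumerate(y, start=1):
            cost = 0 if word_x == word_y else 1
            current.append(
                min(
                    previous[j] + 1,         # delete a word from x
                    current[j - 1] + 1,      # insert a word from y
                    previous[j - 1] + cost,  # keep or substitute
                )
            )
        previous = current
    return previous[-1]


s1 = "is counted as more than one paying user"
s2 = "is counted as more than one unique paying user"
print(word_edit_distance(s1, s2))
```

Because `split()` ignores where the line breaks fall, two sentences that differ only in wrapping get a distance of 0, and a small distance relative to the sentence length flags near-duplicates.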
I have been trying to pull out financial statements embedded in annual reports in PDF and export them in Excel/CSV format using Python, but I am encountering some problems:
1. A specific financial statement can be on any page in the report. If I were to process hundreds of PDFs, I would have to specify page numbers, which takes a lot of time. Is there any way for the scraper to know where exactly the statement is?
2. Some statements span multiple pages, and the end result after scraping a PDF isn't what I want.
3. Different annual reports have different financial statement formats. Is there any way to process them and convert them to a specific standard format?
I would also appreciate it if anyone who has done something like this could share examples.
P.S. I am working with Python and have used tabula and Camelot.
I had a similar case where the problem was to extract specific form information from PDFs (name, date of birth and so on). I used the open-source Tesseract software with pytesseract to perform OCR on the files. Since I did not need the whole PDFs, only specific information from them, I designed an algorithm to find that information. In my case I used simple heuristics (specific fields, specific line numbers and some other domain-specific rules), but you could also use a machine-learning approach and train a classifier to find the text parts you need. You could use domain-specific heuristics as well, because I am sure that a financial statement has special vocabulary or text markers which indicate where it begins and ends.
I hope I could at least give you some ideas on how to approach the problem.
P.S.: With Tesseract you can also process multi-page PDFs. Regarding 3): a machine-learning approach would need some samples to learn a good generalization of what a financial statement may look like.
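The "text markers" heuristic could look roughly like the sketch below. It assumes the text of each page has already been extracted (by OCR or otherwise) into a list of strings; the phrases in `MARKERS`, the `min_hits` threshold, and the sample pages are all made up for illustration and would need tuning against real reports.

```python
# Hypothetical marker phrases; tune them to the reports you actually have.
MARKERS = {
    "balance sheet": ["total assets", "total liabilities", "shareholders' equity"],
    "income statement": ["revenue", "cost of revenue", "net income"],
}


def find_statement_pages(page_texts, min_hits=2):
    """Return {statement name: [1-based page numbers]} for pages with enough marker hits."""
    found = {name: [] for name in MARKERS}
    for number, text in enumerate(page_texts, start=1):
        lowered = text.lower()
        for name, phrases in MARKERS.items():
            hits = sum(phrase in lowered for phrase in phrases)
            if hits >= min_hits:
                found[name].append(number)
    return found


# Invented page texts standing in for one extracted report.
pages = [
    "Letter to shareholders. Revenue grew this year.",
    "Consolidated Balance Sheets\nTotal assets 100\nTotal liabilities 60",
    "Revenue 50\nCost of revenue 20\nNet income 10",
]
print(find_statement_pages(pages))
```

Requiring several markers on the same page keeps a passing mention (such as "revenue" in the shareholder letter) from being mistaken for the statement itself. The page numbers found this way could then be handed to the page-selection option of the table extractor instead of being specified by hand.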
I would like to write a script that can locate a specific word or number in a financial statement. Financial statements contain roughly the same information, but they are neither identical nor organized in the same way. My thought is that with TensorFlow I could train a neural network to locate the specific words or numbers for me. If I label the relevant text and numbers in 1000 financial statements and use them to train the network, it should then be able to identify those numbers or words in any financial statement. For example, I would tell it, in all 1000 training statements, which number is the company's profit.
Is this doable? I have been coding in Python for a couple of months, and so far I've built some web scrapers and integrated them with Twitter, Slack and Google Sheets. I would be very grateful for your thoughts on this project, and for relevant tutorials that could steer me in the right direction.
Thanks a lot!
It's great that you're getting started. Before thinking about the actual implementation in TensorFlow or any other library, I believe you should first try to understand the problem in terms of its underlying domain.
I'm not sure exactly what you are trying to achieve, but my rough guess is that you want to find out whether a statement turns out to be beneficial to the company or not, which is a semantic-analysis type of problem.
So I strongly believe that you should first learn about the various methodologies related to semantic analysis and find the most appropriate technique.
In short: theory and understanding before the actual code.
Finally, I would suggest asking such theoretical questions on the AI Stack Exchange; here on SO we generally deal with code or things closely related to code.
I hope that makes sense? ;)
Drop a comment if you have any doubts.
I have a lot of PDF, DOC[X], TIFF and other files (scans from a shared folder). Each file is converted into a pack of text files: one text file per page.
Each pack of files can contain multiple documents (for example, three contracts). The documents are not necessarily contracts.
While processing a pack, I don't know what kinds of documents it contains, and one pack may contain several kinds (contracts, invoices, etc.).
I'm looking for possible approaches to split a pack into its documents programmatically.
I have tried searching for something like this, but without any success.
UPD: I tried creating a binary classifier with scikit-learn and am now looking for another solution.
At its core, since these are "scans", this sounds more like something that could be approached with computer vision; however, that is currently far above my level of programming.
E.g. projects like SimpleCV may be a good starting point,
http://www.simplecv.org/
Or possibly you could get away with OCR reading the "scans" and working based on the contents. pytesseract seems popular for this type of task,
https://pypi.org/project/pytesseract/
However, that still doesn't define how you would tell your program that a given part of the image means there are 3 separate contracts. Is there anything about these files in particular that makes this clear, e.g. "1 of 3" on the pages, a logo, or something else? That will be the main factor that determines how complex a problem you are trying to solve.
The best solution was to create a binary classifier (SGDClassifier) and train it on two classes, first-page and not-first-page. Each item in the dataset was trimmed to 100 tokens (words).
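How such a first-page classifier might be used to split a pack can be sketched as follows. The classifier itself is not reproduced here: `is_first_page` is an assumed callable (for example, a thin wrapper around a trained SGDClassifier pipeline's `predict`), and `toy_is_first_page` is an invented stand-in so the example runs on its own.

```python
def trim_tokens(page_text, limit=100):
    """Keep only the first `limit` whitespace-separated tokens of a page."""
    return " ".join(page_text.split()[:limit])


def split_pack(pages, is_first_page):
    """Group consecutive pages into documents, starting a new one at each first page."""
    documents = []
    for page in pages:
        # The very first page of a pack always opens a document.
        if not documents or is_first_page(trim_tokens(page)):
            documents.append([page])
        else:
            documents[-1].append(page)
    return documents


def toy_is_first_page(text):
    """Invented stand-in for a trained classifier's prediction."""
    return text.lower().startswith(("contract", "invoice"))


# Made-up pack of five page texts.
pack = [
    "CONTRACT No. 1 between A and B",
    "terms continued",
    "INVOICE 42",
    "CONTRACT No. 2",
    "signatures",
]
docs = split_pack(pack, toy_is_first_page)
print([len(doc) for doc in docs])
```

Framing the task as "is this page the first page of a document?" means the splitter never needs to know in advance how many documents, or which kinds, a pack contains.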
I am trying to solve what I have realized is quite a hard problem, given my lack of expertise in the subject. Suppose I have an image of a table with 3 rows and 5 columns. Each cell contains text (let's assume only English for now) or numbers (normal Indo-Arabic numerals). There is nothing but whitespace between the columns and between the rows. Assuming all rows and columns are aligned, my task is to get an algorithm to recognize and extract each row from the document (I don't know if I'm articulating this well enough).
Could someone suggest a good starting point (a library, a similar example, a textbook chapter that deals with something like this, etc.) for me to get started?
My background is data science but I have just never been exposed to computer vision.
Any help would be appreciated.
You should start off with OpenCV, like Racialz suggested. It includes a Hough transform (Hough lines) method, which should be the primary and easiest way for you to find and crop text from table sections. People use this algorithm for many different line-finding tasks (like THIS or THIS), but your task should be much easier, because your lines should be much clearer and simpler than in those examples. After the extraction you will need to read the text; for this I would suggest the Tesseract OCR engine. It is free, really easy to use, gives pretty decent results, and can be trained to recognize specific types of letters.
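Since the question describes a table with only whitespace between rows, there may be no ruling lines for a Hough transform to detect. A different, simpler technique that fits that case is a horizontal projection profile: mark each pixel row as containing ink or not, and treat each run of inked rows as one table row. The sketch below is not the Hough approach; it works on a plain list-of-lists binary image (1 = ink, 0 = background), which is an assumption made to keep it dependency-free, and `find_text_rows` is an illustrative name.

```python
def find_text_rows(image, ink_threshold=1):
    """Return (start, end) pixel-row ranges that contain ink; end is exclusive."""
    rows, start = [], None
    for y, pixel_row in enumerate(image):
        has_ink = sum(pixel_row) >= ink_threshold
        if has_ink and start is None:
            start = y                 # a band of text begins
        elif not has_ink and start is not None:
            rows.append((start, y))   # the band ended on the previous row
            start = None
    if start is not None:
        rows.append((start, len(image)))
    return rows


# Tiny made-up binary image: two bands of ink separated by a blank row.
image = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
print(find_text_rows(image))
```

Each returned range can be used to crop the corresponding strip from the original image before passing it to the OCR engine, and the same idea applied to pixel columns would separate the table's columns.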