I am currently cleaning data from text files. The files contain transcriptions of speech from daily conversations. Some of the files are multilingual; a few examples of multilingual portions look like this:
around that area,<tamil>அம்மா:ammaa</tamil> would have cooked too
so at least need to <mandarin>跑两趟:pao liang tang</mandarin>,then I told them that it is fine
There can be multiple such languages in one file.
Going back to the first example, what I am trying to do with the data is remove "<tamil>", "அம்மா:", and "</tamil>", keeping just the English pronunciation of the word. I have tried replacing "<tamil>" with "", but I am quite unsure how to approach removing the Tamil words themselves.
The expected output would be:
around that area, ammaa would have cooked too
so at least need to pao liang tang,then I told them that it is fine
How would I go about doing so?
Please try this:
content="around that area,<tamil>அம்மா:ammaa</tamil> would have cooked too"
ft=' '.join([word for line in [item.strip() for item in content.replace('<',' <').replace('>','> ').split('>') if not (item.strip().startswith('<') or (item.strip().startswith('&') and item.strip().endswith(';')))] for word in line.split() if not (word.strip().startswith('<') or (word.strip().startswith('&') and word.strip().endswith(';')))])
outputs=ft.encode('ascii','ignore')
print(outputs.decode('utf-8'))
Output:
around that area, :ammaa would have cooked too
This is not the complete output: if you look at the final string, there are still some leftovers like the ":" and some punctuation. Please clean those up yourself using a regex; I've posted 99% of the answer.
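For the record, a single re.sub can do the whole cleanup in one pass. A sketch (it assumes the closing tag always repeats the opening tag's name and that the native-script part never contains a colon; you may still want to normalize the spacing around the replacement):

import re

content = "around that area,<tamil>அம்மா:ammaa</tamil> would have cooked too"

# <lang>native:romanized</lang>  ->  romanized
# \1 forces the closing tag to match the opening one.
cleaned = re.sub(r'<(\w+)>[^:<]*:([^<]*)</\1>', r'\2', content)
print(cleaned)  # around that area,ammaa would have cooked too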
Related
Summary
I am building a text summarizer in Python. The kind of documents I am mainly targeting are scholarly papers, which are usually in PDF format.
What I Want to Achieve
I want to effectively extract the body of the paper (abstract to conclusion), excluding the title of the paper, publisher names, images, equations, and references.
Issues
I have tried looking for effective ways to do this, but I was not able to find anything tangible and useful. The current code I have tries to split the PDF document into sentences and then filters out the entries that have fewer than the average number of characters per sentence. Below is the code:
from pdfminer import high_level

# input: string (path to the file)
# output: list of sentences
def pdf2sentences(pdf):
    article_text = high_level.extract_text(pdf)
    sents = article_text.split('.')  # splitting on '.', roughly splits on every sentence
    run_ave = 0
    for s in sents:
        run_ave += len(s)
    run_ave /= len(sents)
    sents_strip = []
    for sent in sents:
        if len(sent.strip()) >= run_ave:
            sents_strip.append(sent)
    return sents_strip
Note: I am using this article as input.
The above code seems to work fine, but I am still not able to effectively filter out things like the title and publisher names that come before the abstract, or things like the references section that comes after the conclusion. Moreover, images are causing gibberish characters to show up in the text, which is hurting the overall quality of the output. Due to the weird Unicode characters, I am not able to write the output to a txt file.
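To illustrate the failed write (the file names here are placeholders): forcing an encoding avoids the exception on my machine, but the gibberish itself remains in the output.

sents = pdf2sentences('article.pdf')

# Writing with the platform's default encoding raises UnicodeEncodeError:
# with open('out.txt', 'w') as f:
#     f.write('. '.join(sents))

# Forcing UTF-8 writes without crashing, but the junk characters are still there:
with open('out.txt', 'w', encoding='utf-8') as f:
    f.write('. '.join(sents))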
Appeal
Are there ways I can improve the performance of this parser and make it more consistent?
Thank you for your answers!
I am using the docx library to read a Word doc, and I am trying to extract only the questions using regex search and match. I have found countless ways of doing it, but I keep getting a "TypeError".
The data I am trying to extract is this:
Will my financial aid pay for housing?
Off Campus Housing - After financial aid applies toward your tuition and fees, any remaining funds will be sent to you as a refund that will either be directly deposited (which can be set up through your account) or mailed to you as a paper check. You can then use the refund to pay your rent. It is important to note that financial aid may not be available when rent is due, so make sure to have a plan in place to pay your rent. Will my financial aid pay for housing?
"financial" "help" "house"
funds "univ oak"
"money" "chisho"
"pay" "chap"
"grant" "laurel"
What are the requirements to receive a room and grant?
How do I pay for my housing?
How do I pay for housing?
If there's also an easier method of exporting the Word doc into a different type of file, that would be great to know as well. Thank you.
I am using regex101, and I've tried the following regex expressions to match only the sentences that end in a question mark:
".*[?=?]$"
"^(W|w).*[?=?]$"
"^[A-Za-z].*[?=?]$"
import re
import sys
from docx import Document

wordDoc = Document('botDoc.docx')
result = re.search('.*[?=?]$', wordDoc)
print(result)
if result:
    print(result.group(0))
for table in wordDoc.tables:
    for row in table.rows:
        for cell in row.cells:
            print("test")
I expect to save the matching patterns into dictionaries so I can export the data to a CSV file.
Your error:
result = re.search('.*[?=?]$', wordDoc)
I believe that this line is the cause of the problem. search() is expecting a string as a second parameter, but is receiving a Document object.
What you should do is use the findall() function. search() only finds the first match for a pattern; findall() finds all the matches and returns them as a list of strings, with each string representing one match.
Since you are working with docx, you would have to extract the contents of the docx and use them as the second parameter of findall(). If I remember correctly, this is done by first extracting all the paragraphs, and then extracting the text of the individual paragraphs. Refer to this question.
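Something along these lines should get you the text (a sketch from memory, so double-check against the python-docx docs):

from docx import Document

doc = Document('botDoc.docx')

# Each Paragraph object exposes its string content via .text
full_text = '\n'.join(p.text for p in doc.paragraphs)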
FYI, the way you would do this for a simple text file is the following:
# Open file
f = open('test.txt', 'r')
# Feed the file text into findall(); it returns a list of all the found strings
strings = re.findall(r'your pattern', f.read())
Your Regex:
Unfortunately, your regex is not quite correct. Although it logically makes sense to match only sentences that end in a ?, one of your matches would be "place to pay your rent. Will my financial aid pay for housing?", for example. Only the second part of that match is an actual question, so don't let the match start at a lowercase letter. Your regex should be something like:
[A-Z].*\?$
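Applied to the text extracted in the sketch above, that might look like this (re.MULTILINE makes $ match at the end of every paragraph line, since the paragraphs were joined with newlines):

import re

questions = re.findall(r'[A-Z].*\?$', full_text, re.MULTILINE)
print(questions)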
My text file looks like this:
[0, "we break dance not hearts by Short Stack is my ringtone.... i LOVE that !!!.....\n"]
[1, "I want to write a . I think I will.\n"]
[2, "#va_stress broke my twitter..\n"]
[3, "\" "Y must people insist on talking about stupid politics on the comments of a bubblegum pop . Sorry\n"]
[4, "aww great "Picture to burn"\n"]
[5, "#jessdelight I just played ur joint two s ago. Everyone in studio was feeling it!\n"]
[6, "http://img207.imageshack.us/my.php?image=wpcl10670s.jpg her s are so perfect.\n"]
[7, "cannot hear the new due to geographic location. i am geographically undesirable. and tune-less\n"]
[8, "\" couples in public\n"]
[9, "damn wendy's commerical got that damn in my head.\n"]
[10, "i swear to cheese & crackers #zyuuup is in Detroit like every 2 months & i NEVER get to see him! i swear this blows monkeyballs!\n"]
[11, "\" getting ready for school. after i print out this\n"]
I want to read the second element of every list, i.e. all the tweet texts, into an array.
I wrote
tweets = []
for line in open('tweets.txt').readlines():
    print line[1]
    tweets.append(line)
but when I look at the output, it just takes the 2nd character of every line.
When you read a text file in Python, the lines are just strings. They aren't automatically converted to some other data structure.
In your case, it looks like each line in your file contains a JSON list. In that case, you can parse the line first using json.loads(). This converts the string to a Python list which you can then take the second element of:
import json

with open('tweets.txt') as fp:
    tweets = [json.loads(line)[1] for line in fp]
Maybe you should consider using the json.loads method:
import json

tweets = []
for line in open('tweets.txt').readlines():
    tweet = json.loads(line)[1]
    print tweet
    tweets.append(tweet)
There is a more Pythonic way in @Erik Cederstrand's comment.
Rather than guessing what format the data is in, you should find out.
If you're generating it yourself, and don't know how to parse back in what you're creating, change your code to generate something that can be easily parsed with the same library used to generate it, like JsonLines or CSV.
If you're ingesting it from some API, read the documentation for that API and parse it the way it's documented.
If someone handed you the file and told you to parse it, ask that someone what format it's in.
Occasionally, you do have to deal with some crufty old file in some format that was never documented and nobody remembers what it was. In that case, you do have to reverse engineer it. But what you want to do then is guess at likely possibilities, and try to parse it with as much validation and error handling as possible, to verify that you guessed right.
In this case, the format looks a lot like either JSON lines or ndjson. Both are slightly different ways of encoding multiple objects with one JSON text per line, with specific restrictions on those texts and the way they're encoded and the whitespace between them.
So, while a quick&dirty parser like this will probably work:
import json

with open('tweets.txt') as f:
    for line in f:
        tweet = json.loads(line)
        dosomething(tweet)
You probably want to use a library like jsonlines:
import jsonlines

with jsonlines.open('tweets.txt') as f:
    for tweet in f:
        dosomething(tweet)
The fact that the quick&dirty parser works on JSON lines is, of course, part of the point of that format—but if you don't actually know whether you have JSON lines or not, you're better off making sure.
Since your input looks like Python expressions, I'd use ast.literal_eval to parse them.
Here is an example:
import ast

with open('tweets.txt') as fp:
    tweets = [ast.literal_eval(line)[1] for line in fp]

print(tweets)
Output:
['we break dance not hearts by Short Stack is my ringtone.... i LOVE that !!!.....\n', 'I want to write a . I think I will.\n', '#va_stress broke my twitter..\n', '" "Y must people insist on talking about stupid politics on the comments of a bubblegum pop . Sorry\n', 'aww great "Picture to burn"\n', '#jessdelight I just played ur joint two s ago. Everyone in studio was feeling it!\n', 'http://img207.imageshack.us/my.php?image=wpcl10670s.jpg her s are so perfect.\n', 'cannot hear the new due to geographic location. i am geographically undesirable. and tune-less\n', '" couples in public\n', "damn wendy's commerical got that damn in my head.\n", 'i swear to cheese & crackers #zyuuup is in Detroit like every 2 months & i NEVER get to see him! i swear this blows monkeyballs!\n', '" getting ready for school. after i print out this\n']
Let's say I open a text file with Python3:
fname1 = "filename.txt"
with open(fname1, "rt", encoding='latin1') as in_file:
    readable_file = in_file.read()
The output is a standard text file of paragraphs:
\n\n"Well done, Mrs. Martin!" thought Emma. "You know what you are about."\n\n"And when she had come away, Mrs. Martin was so very kind as to send\nMrs. Goddard a beautiful goose--the finest goose Mrs. Goddard had\never seen. Mrs. Goddard had dressed it on a Sunday, and asked all\nthe three teachers, Miss Nash, and Miss Prince, and Miss Richardson,\nto sup with her."\n\n"Mr. Martin, I suppose, is not a man of information beyond the line\nof his own business? He does not read?"\n\n"Oh yes!--that is, no--I do not know--but I believe he has\nread a good deal--but not what you would think any thing of.\nHe reads the Agricultural Reports, and some other books that lay\nin one of the window seats--but he reads all _them_ to himself.\nBut sometimes of an evening, before we went to cards, he would read\nsomething aloud out of the Elegant Extracts, very entertaining.\nAnd I know he has read the Vicar of Wakefield. He never read the\nRomance of the Forest, nor The Children of the Abbey. He had never\nheard of such books before I mentioned them, but he is determined\nto get them now as soon as ever he can."\n\nThe next question was--\n\n"What sort of looking man is Mr. Martin?"
How can one save only a certain string within this file? For example, how does one save the sentence
And when she had come away, Mrs. Martin was so very kind as to send\nMrs. Goddard a beautiful goose--the finest goose Mrs. Goddard had\never seen.
into a separate text file? How do you know the indices where to access this sentence?
CLARIFICATION: There should be no decision statements to make. The end goal is to create a program with which the user could "save" sentences or paragraphs separately. I am asking a more general question at the moment.
Let's say there's a certain paragraph I like in this text, and I would like a way to quickly append it to a JSON file or text file. In principle, how does one do this? By tagging all sentences? Is there a way to isolate paragraphs? And, to repeat the question above, how do you know the indices at which to access this sentence (especially if there is no "decision logic")?
If I know the indices, couldn't I simply slice the string?
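For instance, is something like this the intended approach? (A sketch: str.find supplies the index when the sentence is known verbatim, and saved_sentence.txt is just a name I made up.)

target = ('And when she had come away, Mrs. Martin was so very kind as to send\n'
          'Mrs. Goddard a beautiful goose--the finest goose Mrs. Goddard had\n'
          'ever seen.')

start = readable_file.find(target)  # -1 if the sentence is not present
if start != -1:
    with open('saved_sentence.txt', 'w', encoding='latin1') as out_file:
        out_file.write(readable_file[start:start + len(target)])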
I'm writing an application that analyzes emails and it would save me a bunch of time if I could use a python library that would parse email text down into named components like <salutation><body><signature><reply text> etc.
For example, the following text "Hi Dave,\nLets meet up this Tuesday\nCheers, Tom\n\nOn Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: Hey Tom,\nHow about we get together ..." would be parsed as
Salutation: "Hi Dave,\n"
Body: "Lets meet up this Tuesday\n"
Signature: "Cheers, Tom\n\n"
Reply Text: "On Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: ..."
I know there's no perfect solution for this kind of problem, but even a library that does good approximation would help. Where can I find one?
https://github.com/Trindaz/EFZP
This provides the functionality posed in the original question, plus fair recognition of email zones as they commonly appear in emails written by native English speakers using common email clients like Outlook and Gmail.
If you score each line based on the types of words it contains, you may get a fairly good indication.
E.g. a line with greeting words near the start is the salutation (salutations may also contain phrases that refer to the past, e.g. "it was good to see you last time").
A body will typically contain words such as "movie", "concert", etc. It will also contain verbs (go to, run, walk, etc.), question marks, and offerings (e.g. want to, can we, should we, prefer...).
Check out http://nodebox.net/code/index.php/Linguistics#verb_conjugation
http://ogden.basic-english.org/
http://osteele.com/projects/pywordnet/
The signature will contain closing words.
If you find a data source that has messages of the structure you want, you could do some frequency analysis to see how often each word occurs in each section.
Each word would get a score of the form [salutation score, body score, signature score, ...].
E.g. "hello" could occur 900 times in salutations, 10 times in bodies, and 3 times in signatures.
This means "hello" would get assigned [900, 10, 3, ...], and "cheers" might get assigned [10, 3, 100, ...].
Now you will have a large list of about 500,000 words.
Words that don't have a large range aren't useful. E.g. "catch" might have [100, 101, 80, ...] = a range of 21 ("it was good to catch up", "wanna go catch a fish", "catch you later"); "catch" can occur anywhere.
Now you can reduce the number of words down to about 10,000.
Then, for each line, compute a score of the same form [salutation score, body score, signature score, ...] by adding up the vector scores of its words.
E.g. the sentence "hello cheers for giving me your number" could score:
[900, 10, 3, ...] + [10, 3, 100, ...] + ... = [1023, 900, 500, ...], say.
Because the biggest number is in the salutation position, this sentence is a salutation; a toy sketch of this follows below.
Good luck; there is always a trade-off between computational complexity and accuracy. If you can find a good set of words and build a good model to base your calculations on, it will help.
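As a toy sketch of that scoring scheme (every number, word, and section name below is invented):

# Vector positions: [salutation, body, signature]
word_scores = {
    'hello':  [900, 10, 3],
    'cheers': [10, 3, 100],
    'meet':   [5, 400, 2],
}

SECTIONS = ['salutation', 'body', 'signature']

def classify(line):
    # Sum the score vectors of every known word in the line.
    total = [0, 0, 0]
    for word in line.lower().split():
        scores = word_scores.get(word, [0, 0, 0])
        total = [t + s for t, s in zip(total, scores)]
    # The position of the biggest total names the section.
    return SECTIONS[total.index(max(total))]

print(classify('hello cheers for giving me your number'))  # salutation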
The first approach that comes to mind (not necessarily the best...) would be to start off by using split. Here's a little bit of code:
linearray = emailtext.split('\n')
Now you have an array of strings, one per line, so linearray[0] would contain the salutation.
Deciding where the reply text starts is a little more tricky: I noticed that there is a double newline just before it, so maybe search for that from the back and hope that the last one indicates the start of the reply text.
Or store some signature words you might expect, like "cheers" and "regards", and search for those from the front.
Once you figure out where the signature is, the rest is easy.
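Continuing from linearray above, a rough sketch (the closing-word tuple is only a guess):

salutation = linearray[0]

# Reply text: the last double newline ought to sit just before it.
replystart = emailtext.rfind('\n\n')
replytext = emailtext[replystart:] if replystart != -1 else ''

# Signature: the first later line that starts with one of the closing words.
closers = ('cheers', 'regards', 'thanks')
signature = next((line for line in linearray[1:] if line.lower().startswith(closers)), '')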
Hope this helped!
I actually built a pretty cheap API for this, to parse the contact data from signatures of emails and email chains. It's called SigParser. You can see the Swagger docs for it here.
Basically you send it an 'x-api-key' header with a JSON body like the one below, and it parses all the contacts in the reply chain of an email.
{
    "subject": "Thanks for meeting...",
    "from_address": "bgates@example.com",
    "from_name": "Bill Gates",
    "htmlbody": "<div>Hi, good seeing you the other day.</div><div>--</div><div>Bill Gates</div><div>Cell 777-444-8888</div>LinkedInTwitter",
    "plainbody": "Hi, good seeing you the other day. \r\n--\r\nBill Gates\r\nCell 777-444-8888",
    "date": "Mon, 28 May 2018 23:33:40 +0000 (UTC)"
}
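For example, with the requests library the call might look like this (the endpoint URL below is a placeholder; the real path is in the Swagger docs):

import requests

resp = requests.post(
    'https://api.sigparser.example/parse',  # placeholder URL, not the real endpoint
    headers={'x-api-key': 'YOUR_API_KEY'},
    json={
        "subject": "Thanks for meeting...",
        "from_address": "bgates@example.com",
        "plainbody": "Hi, good seeing you the other day. \r\n--\r\nBill Gates\r\nCell 777-444-8888",
    },
)
print(resp.json())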