I'm using Python and imaplib to retrieve emails from an IMAP server (it needs to support all kinds of IMAP servers - Gmail, etc.).
My problem is: when I use the IMAP BODY[INDEX] command to fetch a specific body part, the HTML comes back with extra tabs, as in:
(...)</a>\t\t\t\t\t\t\t\t<a>(...)
When the HTML is rendered, the tabs are obviously extra. (The screenshot showing this is in Portuguese, but I believe that is not relevant.)
I have searched the IMAP documentation but found nothing that helps. These \t characters always seem to follow closing tags (e.g. \t\t\t\t\t), so I could just find every tab that comes after a closing tag and delete it, but I don't know whether that would be a reliable method at all.
Thank you
I found a solution (for the time being at least).
The data returned by an IMAP fetch has \r\n sequences delimiting the lines, which I remove.
However, I discovered that in some places there are also \t characters attached to them, for example:
\r\n\t\t\t\t
If I remove the \t characters together with the \r\n, the HTML is rendered perfectly.
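For what it's worth, a minimal sketch of that cleanup with imaplib (the server name, credentials and the message/part numbers are placeholders):

import imaplib
import re

# Illustrative only: connect and fetch one body part.
conn = imaplib.IMAP4_SSL('imap.example.com')
conn.login('user@example.com', 'password')
conn.select('INBOX')
status, data = conn.fetch('1', '(BODY[1])')
raw_html = data[0][1].decode('utf-8', errors='replace')

# Drop the \r\n line delimiters and any \t characters glued to them,
# so a run like '\r\n\t\t\t\t' disappears entirely.
clean_html = re.sub(r'\r\n\t*', '', raw_html)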
Related
There are two pre-existing questions on the site, one for Java and one for Python:
Java: How to remove the quoted text from an email and only show the new text
Python: Reliable way to only get the email text, excluding previous emails
I want to be able to do pretty much exactly the same thing (in PHP). I've created a mail proxy where two people can correspond with each other by emailing a unique email address.
The problem I am finding, however, is that when a person receives the email and hits reply, I am struggling to accurately capture the text they have written and discard the quoted text from the previous correspondence.
I'm trying to find a solution that will work for both HTML and plain-text emails, because I am sending both.
I also have the ability, if it helps, to insert some <*****RESPOND ABOVE HERE*******> tag into the emails, meaning that I can discard everything below it.
What would you recommend I do? Always add that tag to both the HTML copy and the plain-text copy, then grab everything above it?
Even then, I would still be left with the problem of knowing how each mail client formats its reply. For example, Gmail would do this:
On Wed, Nov 2, 2011 at 10:34 AM, Message Platform <35227817-7cfa-46af-a190-390fa8d64a23@dev.example.com> wrote:
## In replies all text above this line is added to your message conversation ##
Any suggestions or recommendations of best practices?
Or should I just grab the 50 most popular mail clients and start creating a custom regex for each? And then, for each of those clients, also handle a bazillion different locale settings, since I'm guessing the user's locale will also influence what gets added.
Or should I just always remove the preceding line if it contains a date? Etc.
Unfortunately, you're in for a world of hurt if you want to try to clean up emails meticulously (removing everything that's not part of the actual reply email itself). The ideal way would be to, as you suggest, write up regex for each popular email client/service, but that's a pretty ridiculous amount of work, and I recommend being lazy and dumb about it.
Interestingly enough, even Facebook engineers have trouble with this problem, and Google has a patent on a method for "Detecting quoted text".
There are three solutions you might find acceptable:
Leave It Alone
The first solution is to just leave everything in the message. Most email clients do this, and nobody seems to complain. Of course, online message systems (like Facebook's 'Messages') look pretty odd if they have inception-style replies. One sneaky way to make this work okay is to render the message with any quoted lines collapsed, and include a little link to 'expand quoted text'.
Separate the Reply from the Older Message
The second solution, as you mention, is to put a delineating message at the top of your messages, like --------- please reply above this line ----------, and then strip that line and anything below when processing the replies. Many systems do this, and it's not the worst thing in the world... but it does make your email look more 'automated' and less personal (in my opinion).
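A rough sketch of that approach (in Python here, though it ports directly to PHP; the marker text is just an example):

REPLY_MARKER = '--------- please reply above this line ----------'

def extract_reply(message_body):
    # Keep only what the user typed above the marker; drop the marker and everything below it.
    return message_body.split(REPLY_MARKER, 1)[0].strip()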
Strip Out Quoted Text
The last solution is to simply strip out any line beginning with >, which is, presumably, a quoted line from the earlier email. Most email clients use this convention to indicate quoted text. Here's some regex (in PHP) that would do just that:
$clean_text = preg_replace('/(^\w.+:\n)?(^>.*(\n|$))+/mi', '', $message_body);
There are some problems using this simpler method:
Many email clients also let people deliberately quote earlier emails and preface those lines with > as well, so you'll strip out intentional quotes too.
Usually, there's a line above the quoted email with something like On [date], [person] said. This line is hard to remove, because it's not formatted the same among different email clients, and it may be one or two lines above the quoted text you removed. I've implemented this detection method, with moderate success, in my PHP Imap library.
Of course, testing is key, and the tradeoffs might be worth it for your particular system. YMMV.
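For reference, a rough Python translation of the same idea (a sketch of the approach, not the exact regex used in that library):

import re

def strip_quoted_text(message_body):
    # (^\w.+:\n)?   optionally eat an 'On <date>, <person> wrote:' style line
    # (^>.*(\n|$))+ one or more consecutive >-quoted lines
    return re.sub(r'(^\w.+:\n)?(^>.*(\n|$))+', '',
                  message_body, flags=re.MULTILINE | re.IGNORECASE)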
There are many libraries out there that can help you extract the reply/signature from a message:
Ruby: https://github.com/github/email_reply_parser
Python: https://github.com/zapier/email-reply-parser or https://github.com/mailgun/talon
JavaScript: https://github.com/turt2live/node-email-reply-parser
Java: https://github.com/Driftt/EmailReplyParser
PHP: https://github.com/willdurand/EmailReplyParser
I've also read that Mailgun has a service to parse inbound email and POST its content to a URL of your choice. It will automatically strip quoted text from your emails: https://www.mailgun.com/blog/handle-incoming-emails-like-a-pro-mailgun-api-2-0/
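For the Python option above, a minimal usage sketch assuming the zapier email_reply_parser package (the sample text is made up, and other forks may expose a slightly different API):

from email_reply_parser import EmailReplyParser

raw_reply = """Thanks, that works for me!

On Wed, Nov 2, 2011 at 10:34 AM, Message Platform wrote:
> Here is the original message,
> quoted by the mail client.
"""

# parse_reply() returns just the visible reply, with the quoted text stripped.
print(EmailReplyParser.parse_reply(raw_reply))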
Hope this helps!
Possibly helpful: quotequail is a Python library that helps identify quoted text in emails
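A small usage sketch, assuming quotequail's quote() helper, which (as far as I recall) splits a message into (is_reply, text) chunks - check the project README for the exact API:

import quotequail

text = """Sounds good, see you then.

On Wed, Nov 2, 2011 at 10:34 AM, someone wrote:
> Shall we meet on Friday?
"""

# True chunks are the new reply text, False chunks are the quoted portion.
for is_reply, chunk in quotequail.quote(text):
    if is_reply:
        print(chunk)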
AFAIK, (standard) emails quote the original text by adding a ">" in front of every line, which you could strip out with strstr(). Otherwise, did you try to port that Java example to PHP? It's nothing more than a regex.
Even sites like GitHub and Facebook have this problem.
Just an idea: you have the text that was originally sent, so you can search for it in the reply and remove it, along with the surrounding noise. It is not trivial, because mail clients add extra line breaks, HTML elements and ">" characters.
The regex approach is definitely better if it works, because it is simple and it cleanly cuts out the original text, but if you find that it frequently fails, this can serve as a fallback method.
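A very rough sketch of that fallback idea: normalize both the reply and the text you originally sent, then cut the reply off at the first line where the original starts to reappear (all names here are illustrative):

import re

def _normalize(text):
    # Collapse whitespace and leading '>' markers so client rewrapping doesn't matter.
    text = re.sub(r'^\s*>+\s?', '', text, flags=re.MULTILINE)
    return re.sub(r'\s+', ' ', text).strip().lower()

def strip_original(reply, original_sent):
    # Drop everything from the line where the quoted original begins.
    probe = _normalize(original_sent)[:60]  # the first few words are usually enough
    lines = reply.splitlines()
    for i, _ in enumerate(lines):
        if _normalize('\n'.join(lines[i:])).startswith(probe):
            return '\n'.join(lines[:i]).rstrip()
    return reply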
I agree that quoted text in a reply is just text, so there's no fully accurate way to detect it. Anyway, you can use a regexp replace like this:
$filteringMessage = preg_replace('/.*\n\n((^>+\s{1}.*$)+\n?)+/mi', '', $message);
Test
https://regex101.com/r/xO8nI1/2
I'm using Getmail (http://pyropus.ca/software/getmail/) to check a POP3 account regularly and download arriving mail to a folder.
In a second step I parse those mails with Python and use the data they contain.
The problem is that the received mails are in Spanish, so accented Latin characters arrive and, when saved to a file, they get replaced. I can't figure out how to avoid that replacement, or even which encoding the file is in. Examples follow:
Ñ is replaced by =F1
á is replaced by =E1
Line breaks are replaced by =20
It looks like URL encoding but it uses the = sign instead of the %.
I also tried downloading the mails using Python's poplib and I get the same replaced characters, which leads me to think the problem might be a configuration issue on my machine.
Hope someone can help me with this.
Thanks!!
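For what it's worth, =F1, =E1 and =20 are what MIME quoted-printable encoding looks like, rather than a problem with your machine; Python's standard email module can decode it. A minimal sketch, assuming a message that getmail has already saved to disk (the path is just an example):

import email
from email import policy

with open('Maildir/new/message.eml', 'rb') as fh:
    msg = email.message_from_binary_file(fh, policy=policy.default)

# get_content() undoes the quoted-printable transfer encoding and the declared
# charset, so '=F1' comes back as 'ñ'.
for part in msg.walk():
    if part.get_content_type() == 'text/plain':
        print(part.get_content())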
I am writing a Python script for mass-replacement of links (actually image and script sources) in HTML files; I am using lxml. There is one problem: the HTML files are quizzes, and they have data packaged like this (there is also some Cyrillic here):
<input class="question_data" value="{"text":"<p>[1] је наука која се бави чувањем, обрадом и преносом информација помоћу рачунара.</p>","fields":[{"id":"1","type":"fill","element":{"sirina":"103","maxDuzina":"12","odgovor":["Информатика"]}}]}" name="question:1:data" id="id3a1"/>
When I try to print out this data in Python using:
print "OLD_DATA:", data
It just prints the error "UnicodeEncodeError: character maps to <undefined>". There are more of these elements. My goal is to change the image links inside the value attribute of the input, but I can't change the links if I don't know how to print this data (or how it should be written back to the file). How does Python handle (interpret) this? Please help. Thanks!!! :)
You're running into the same problem I've hit many times in the past. That error almost always means that the console environment you're using can't display the characters it's trying to print. It might be worth trying to log to a file instead, then opening the log in an editor that can display the characters.
If you really want to be able to see it on your console, it might be worth writing a function that screens the strings you're printing for unprintable characters first.
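Something along these lines might do as a starting point (a sketch; it simply replaces anything the console encoding can't represent):

import sys

def printable(text, encoding=None):
    # Replace characters the console can't display with '?'.
    encoding = encoding or sys.stdout.encoding or 'ascii'
    return text.encode(encoding, errors='replace').decode(encoding)

print("OLD_DATA:", printable(u'Информатика'))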
I also found a couple other StackOverflow posts that might be helpful in your efforts:
How do I get Cyrillic in the output, Python?
What is right way to use cyrillic in python lxml library
I would also recommend this Python manual entry and this article:
https://docs.python.org/2/howto/unicode.html
http://www.joelonsoftware.com/articles/Unicode.html
I have done a lot of searching and experimentation and haven't been able to find a solution, so if there is something trivial I missed, I apologize ahead of time.
Problem:
I have a Python TurboGears app that downloads URL resources. Clients give it the URLs to download.
One client in particular sends unescaped URLs, for example 'http://www.foo.com/file with space.txt'.
When I try to download it, the download fails because the server does not recognize this URL; the spaces need to be escaped for it to be valid.
I know there are methods (urllib.urlencode/urllib.quote, etc.) that will encode strings. However, they assume the strings they work on are not URLs. If you give a full URL to these methods, they escape the scheme of the URL and make it even more invalid.
So, in summary: how do I escape a whole, fully qualified URL in Python?
NOTE: I have tried using urlparse to parse out the URL components and get at the path. However, the URL will sometimes have query parameters, fragments, etc., so I do not want to write code that splits the URL into its parts, escapes only what is required in the path + query + fragment, and then reconstructs the URL.
Is there any helper function that takes the URL directly and escapes it?
Also, note that I sometimes get valid, already-escaped URLs from clients, so I want to handle those as well without double-escaping them.
OK, I found the following on PyPI, which seems to solve the problem (for the time being at least).
https://github.com/seomoz/url-py/
This is the url egg from SEOmoz. It seems to do the job very well.
You can use regular expressions to separate the domain name from the file path, and then urlencode only the path. Here's the regex documentation, and here's a tutorial.
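An alternative to the regex route is to let urllib do the splitting and quote only the pieces that need it; a sketch using Python 3's urllib.parse (keeping '%' in the safe set so already-escaped URLs aren't double-escaped):

from urllib.parse import urlsplit, urlunsplit, quote

def escape_url(url):
    # Escape the path/query/fragment of a full URL without touching the scheme or host.
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        quote(parts.path, safe="/%"),
        quote(parts.query, safe="=&?/%"),
        quote(parts.fragment, safe="%"),
    ))

print(escape_url('http://www.foo.com/file with space.txt'))
# -> http://www.foo.com/file%20with%20space.txt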
I am trying to write a script to scrape a website, and am using this one (http://www.theericwang.com/scripts/eBayRead.py).
However, I want to use it to crawl sites other than eBay and to customize it to my needs.
I am fairly new to Python and have limited experience with re.
I am unsure what this line achieves:
for url, title in re.findall(r'href="([^"]+).*class="vip" title=\'([^\']+)', lines):
Could someone please give me some pointers?
Is there anything else I need to consider if I port this for other sites?
In general, parsing HTML is best done with a library such as BeautifulSoup, which takes care of virtually all of the heavy lifting for you, leaving you with more intuitive code. Also, read @Tadeck's link below - regex and HTML shouldn't be mixed if it can be avoided (to put it lightly).
As for your question, that line uses a regular expression to find matching patterns in text (in this case, HTML). re.findall() is a method that returns a list of all matches, so if we focus on just that:
re.findall(r'href="([^"]+).*class="vip" title=\'([^\']+)', lines):
The r prefix means the string that follows is a 'raw' string, so backslashes and the like are interpreted literally rather than as escape sequences.
href="([^"]+)
The parentheses indicate a group (the part of the match we care about), and [^"]+ means 'match one or more of anything that isn't a double quote'. As you can probably guess, this group will capture the URL of the link.
.*class="vip"
The .* matches anything (well, almost anything) zero or more times (which here could include other attributes or tags, the closing quote of the link, whitespace, etc.). There's nothing special about class="vip" - it just needs to appear.
title=\'([^\']+)
Here you see an escaped apostrophe and then another group, as above. This time we are capturing whatever sits between the two apostrophes following the title attribute.
The end result is that you iterate through a list of all matches, where each match looks something like (my_matched_link, my_matched_title); these are unpacked into for url, title, after which further processing is done.
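To illustrate the BeautifulSoup route mentioned at the top, here is roughly what that line becomes without regex (this assumes the href and title sit on the same <a class="vip"> element, which is what the original pattern implies; the sample HTML is made up):

from bs4 import BeautifulSoup

# A tiny stand-in for the page HTML the original script reads into 'lines'.
lines = '<a href="http://example.com/item/1" class="vip" title=\'Sample listing\'>Sample</a>'

soup = BeautifulSoup(lines, 'html.parser')

# Every <a class="vip" href="..." title="..."> becomes a (url, title) pair,
# mirroring what the re.findall() call extracts.
for link in soup.find_all('a', class_='vip'):
    print(link.get('href'), link.get('title'))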
I am not sure if this answers your question, but you could consider Scrapy (http://scrapy.org) for crawling various websites. It is a nice framework that provides a lot of flexibility and is easy to customize to specific needs.
Regular expressions are bad for parsing HTML
That is the main idea I would like to communicate to you. For the reasons why, see this question: RegEx match open tags except XHTML self-contained tags.
In short, HTML can change as text (e.g. a new attribute can be added, the order of attributes can change, or other changes may be introduced) while still resulting in exactly the same page as interpreted by web browsers - and yet completely breaking your script.
HTML should be parsed using specialized HTML parsers or web scrapers; they know which of these differences matter and which do not.
What to use for scraping?
There are multiple solutions, but one of the most notable is Scrapy. Try it - you may start to love it.