I Have an IRC bot I'm working on, and one of the features I would like it to have is to take any link a person posts and use BeautifulSoup to parse that page. Now, I have the bot working, getting the messages people post, etc. But, how would I pull a link from the IRC message? Say someone says this:
Person: Check out http://www.site.com, it's cool!
How would I take the link out and assign it to a variable for later use, without pulling the other parts of the message?
I think it's something to do with regexs, but I'm not sure.
You will indeed need to use regular expressions.
There's a decent article with a regular expression for matching URLs and somewhat of a description of what it's doing at daring fireball.
You can look at how Django does it here.
Finally, Python's regular expression documentation may also be useful.
You are on the exact track to finish this. You gave yourself the answer with the last sentence of your question. You will use a regular expression with a capture group to get the url and from there you can parse/grab the page that the user has said in the irc.
This site may be of some use for you: http://www.regular-expressions.info/
Related
There are two pre-existing questions on the site.
One for Python, one for Java.
Java How to remove the quoted text from an email and only show the new text
Python Reliable way to only get the email text, excluding previous emails
I want to be able to do pretty much exactly the same (in PHP). I've created a mail proxy, where two people can have a correspondance together by emailing a unique email address.
The problem I am finding however, is that when a person receives the email and hits reply, I am struggling to accurately capture the text that he has written and discard the quoted text from previous correspondance.
I'm trying to find a solution that will work for both HTML emails and Plaintext email, because I am sending both.
I also have the ability if it helps to insert some <*****RESPOND ABOVE HERE*******> tag if neccessary in the emails meaning that I can discard everything below.
What would you recommend I do? Always add that tag to the HTML copy and the plaintext copy then grab everything above it?
I would still then be left with the scenario of knowing how each mail client creates the response. Because for example Gmail would do this:
On Wed, Nov 2, 2011 at 10:34 AM, Message Platform <35227817-7cfa-46af-a190-390fa8d64a23#dev.example.com> wrote:
## In replies all text above this line is added to your message conversation ##
Any suggestions or recommendations of best practices?
Or should I just grab the 50 most popular mail clients, and start creating custom Regex for each. Then for each of these clients, also a bizallion different locale settings since I'm guessing the locale of the user will also influence what is added.
Or should I just remove the preceding line always if it contains a date?.. etc
Unfortunately, you're in for a world of hurt if you want to try to clean up emails meticulously (removing everything that's not part of the actual reply email itself). The ideal way would be to, as you suggest, write up regex for each popular email client/service, but that's a pretty ridiculous amount of work, and I recommend being lazy and dumb about it.
Interestingly enough, even Facebook engineers have trouble with this problem, and Google has a patent on a method for "Detecting quoted text".
There are three solutions you might find acceptable:
Leave It Alone
The first solution is to just leave everything in the message. Most email clients do this, and nobody seems to complain. Of course, online message systems (like Facebook's 'Messages') look pretty odd if they have inception-style replies. One sneaky way to make this work okay is to render the message with any quoted lines collapsed, and include a little link to 'expand quoted text'.
Separate the Reply from the Older Message
The second solution, as you mention, is to put a delineating message at the top of your messages, like --------- please reply above this line ----------, and then strip that line and anything below when processing the replies. Many systems do this, and it's not the worst thing in the world... but it does make your email look more 'automated' and less personal (in my opinion).
Strip Out Quoted Text
The last solution is to simply strip out any new line beginning with a >, which is, presumably, a quoted line from the reply email. Most email clients use this method of indicating quoted text. Here's some regex (in PHP) that would do just that:
$clean_text = preg_replace('/(^\w.+:\n)?(^>.*(\n|$))+/mi', '', $message_body);
There are some problems using this simpler method:
Many email clients also allow people to quote earlier emails, and preface those quote lines with > as well, so you'll be stripping out quotes.
Usually, there's a line above the quoted email with something like On [date], [person] said. This line is hard to remove, because it's not formatted the same among different email clients, and it may be one or two lines above the quoted text you removed. I've implemented this detection method, with moderate success, in my PHP Imap library.
Of course, testing is key, and the tradeoffs might be worth it for your particular system. YMMV.
There are many libraries out there that can help you extract the reply/signature from a message:
Ruby: https://github.com/github/email_reply_parser
Python: https://github.com/zapier/email-reply-parser or https://github.com/mailgun/talon
JavaScript: https://github.com/turt2live/node-email-reply-parser
Java: https://github.com/Driftt/EmailReplyParser
PHP: https://github.com/willdurand/EmailReplyParser
I've also read that Mailgun has a service to parse inbound email and POST its content to a URL of your choice. It will automatically strip quoted text from your emails: https://www.mailgun.com/blog/handle-incoming-emails-like-a-pro-mailgun-api-2-0/
Hope this helps!
Possibly helpful: quotequail is a Python library that helps identify quoted text in emails
Afaik, (standard) emails should quote the whole text by adding a ">" in front of every line. Which you could strip by using strstr(). Otherwise, did you trie to port that Java example to php? It's nothing else than Regex.
Even pages like Github and Facebook do have this problem.
Just an idea: You have the text which was originally sent, so you can look for it and remove it and additional surrounding noise from the reply. It is not trivial, because additional line breaks, HTML elements, ">" characters are added by the mail client application.
The regex is definitely better if it works, because it is simple and it perfectly cuts the original text, but if you find that it frequently does not work then this can be a fallback method.
I agree that quoted text or reply is just a TEXT. So there's no accurate way to fetch it. Anyway you can use regexp replace like this.
$filteringMessage = preg_replace('/.*\n\n((^>+\s{1}.*$)+\n?)+/mi', '', $message);
Test
https://regex101.com/r/xO8nI1/2
I'm learning a practice called 'web scraping' using python. From what I can tell so far the idea is to send out a request to load the site data from a server, store the DOM html in a variable, and then basically data mine the s*** out of the resulting string until you are able to quickly access exactly and only the information you need.
Well I'm ready to start fiddling with statements that might help me do the actual data mining, but first I need to see and understand all of the html in my string. After I've got the hang of it I won't care what the html looks like, but right now I need to be able to reference it to properly analyze my output. so far I've tried google, python.net, youtube, various blogs and etc. But they all look like alianeese.
I'm just looking for the typical stuff you know?
<html><head><meta><script src=""><style src=""><title></title></head><body><div class=""><img src=""></div><div><h1>my page</h1><li></li><li></li><li></li><li></li><li></li><li></li><p>click here</p></div></body></html>
You get what I'm saying? Just a website... that uses like... html... to render some simple structured data.
P.S. This is kind of neat. I went to give this post some tags and I discovered 'simple-html-dom'. So I googled it. Apparently it's some kind of language that lets you parse html from online sources in exactly the way I am trying to. I may check that out later, but I still want to figure out how to do this with python.
EDIT Actually something like this would work fine but it's just so big. I would prefer something smaller to work with.
While it would probably be nice to build your own web pages to use, you can also try looking for pages "optimized for lynx". Lynx is a text-only browser with which "simple" pages naturally work best.
Most of the links you'll find will be dead already, but I found this list for instance, which still has many alive and equally simple pages: http://www.put.com/dead.html (please ignore the content itself... there is no particular reason I chose this example other than that it probably works nicely for your purposes!)
I am trying to write a program to retrieve all of the links for questions that have active bounties in a specific tag. I have not yet implemented the specific tag feature, because I am stuck just try to get all of the links.
from re import findall
from urllib.request import urlopen
def fetch_source(url):
return str(urlopen(url).read())
site = 'http://stackoverflow.com/?tab=featured'
def fetch_links(source):
source = fetch_source(source)
return findall("\/questions\/[0-9]*\/(?:[A-z]|\-)+", source)
print(fetch_links(site))
This will fetch many of the links, but it misses a lot of them because my regex only allows [A-z]|\- in the title. I'm not sure how to fix this though because some questions have quotation marks in the titles, and if I allow those, I will not know when the question link ends?
I'm sorry for being new to python, but I am just trying to figure stuff out.
Using regex would become completely infeasible for getting questions by specific tag.
You are correct that your regex is missing a lot of titles, but using findall really isn't appropriate in this situation. Beautiful soup, is a much better tool for retrieving links, and I recommend you look into it.
In this instance, however, the Stack Exchange API has you covered.
For similar questions, just search(or Google) through the API documentation until you see the feature you're looking for, in your case featured question.
Enter the parameters you want, and the API will show generate a link:
https://api.stackexchange.com/2.2/questions/featured?order=desc&sort=votes&tagged=python&site=stackoverflow
Example for retrieving all feature Python questions
I'm trying to build a small wiki, but I'm having problems writing the regex rules for them.
What I'm trying to do is that every page should have an edit page of its own, and when I press submit on the edit page, it should redirect me to the wiki page.
I want to have the following urls in my application:
http://example.com/<page_name>
http://example.com/_edit/<page_name>
My URLConf has the following rules:
url(r'(_edit/?P<page_name>(?:[a-zA-Z0-9_-]+/?)*)', views.edit),
url(r'(?P<page_name>(^(?:_edit?)?:[a-zA-Z0-9_-]+/?)*)', views.page),
But they're not working for some reason.
How can I make this work?
It seems that one - or both - match the same things.
Following a more concise approach I'd really define the edit URL as:
http://example.com/<pagename>/edit
This is more clear and guessable in my humble opinion.
Then, remember that Django loops your url patterns, in the same order you defined them, and stops on the first one matching the incoming request. So the order they are defined with is really important.
Coming with the answer to your question:
^(?P<page_name>[\w]+)$ matches a request to any /PageName
Please always remember the starting caret and the final dollar signs, that are saying we expect the URL to start and stop respectively right before and after our regexp, otherwise any leading or trailing symbol/character would make the regexp match as well (while you likely want to show up a 404 in that case).
^_edit/(?P<page_name>[\w]+)$ matches the edit URL (or ^(?P<page_name>[\w]+)/edit$ if you like the user-friendly URL commonly referred to as REST urls, while RESTfullnes is a concept that has nothing to do with URL style).
Summarizing put the following in your urls:
url(r'^(?P<page_name>[\w]+)$', views.page)
url(r'^_edit/(?P<page_name>[\w]+)$', views.edit)
You can easily force URLs not to have some particular character by changing \w with a set defined by yourself.
To learn more about Django URL Dispatching read here.
Note: Regexp's are as powerful as dangerous, especially when coming on network. Keep it simple, and be sure to really understand what are you defining, otherwise your web application may be exposed to several security issues.
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski
Please try the following URLs, that are simpler:
url(r'_edit/(?P<page_name>[\w-]+), views.edit)'
url(r'(?P<page_name>[\w-]+), views.page),
I'm trying to do something similar to placekitten.com, wherein a user can input two strings after the base URL and have those strings alter the output. I'm doing this in Python, and I cannot for the life of me figure out how to grab the URL. In PHP I can do it with query string and $_REQUEST. I can't find a similar method in Python that doesn't rely on CGI.
(I know I could do this with Django, but that's serious overkill for this project.)
This is just by looking at the docs but have you tried it?
cherrypy.request.path_info
The docs say:
The ‘relative path’ portion of the Request-URI. This is relative to the script_name (‘mount point’) of the application which is handling this request.
http://docs.cherrypy.org/stable/refman/_cprequest.html#cherrypy._cprequest.Request.path_info