How to process multi-line text in Spark

Suppose I have a dictionary with words and explanations in the following format:
Firefox
<web> A complete {free}, {open-source} {web
browser} from the {Mozilla Foundation} and therefore a true
code descendent of {Netscape Navigator}. The first non-{beta
release} was in late 2004.
{Firefox Home (http://mozilla.org/products/firefox)}.
(2005-01-26)
firehose syndrome
<networking, jargon> An absence, failure or inadequacy of flow
control mechanisms causing the sender to overwhelm the
receiver. The implication is that, like trying to drink from
a firehose, the consequenses are worse than just loss of data,
e.g. the receiver may {crash}.
See {ping-flood}.
[{Jargon File}]
(2007-03-12)
firewall
1. {firewall code}.
2. {firewall machine}.
firewall code
1. The code you put in a system (say, a telephone switch) to
make sure that the users can't do any damage. Since users
always want to be able to do everything but never want to
suffer for any mistakes, the construction of a firewall is a
question not only of defensive coding but also of interface
presentation, so that users don't even get curious about those
corners of a system where they can burn themselves.
2. Any sanity check inserted to catch a {can't happen} error.
Wise programmers often change code to fix a bug twice: once to
fix the bug, and once to insert a firewall which would have
arrested the bug before it did quite as much damage.
[{Jargon File}]
How would I map this type of data so that I can work with it in Spark?

Related

Efficient approach to catching database errors

I have a desktop app that has 65 modules, about half of which read from or write to an SQLite database. I've found that there are 3 ways that the database can throw an SQliteDatabaseError:
SQL logic error or missing database (happens unpredictably every now and then)
Database is locked (if it's being edited by another program, like SQLite Database Browser)
Disk I/O error (also happens unpredictably)
Although these errors don't happen often, when they do they lock up my application entirely, and so I can't just let them stand.
And so I've started re-writing every single access of the database to be a pointer to a common "database-access function" in its own module. That function then can catch these three errors as exceptions and thereby not crash, and also alert the user accordingly. For example, if it is a "database is locked error", it will announce this and ask the user to close any program that is also using the database and then try again. (If it's the other errors, perhaps it will tell the user to try again later...not sure yet). Updating all the database accesses to do this is mostly a matter of copy/pasting the redirect to the common function--easy work.
The problem is: it is not sufficient to just provide this database-access function and its announcements, because at all of the points of database access in the 65 modules there is code that follows the access that assumes the database will successfully return data or complete a write--and when it doesn't, that code has to have a condition for that. But writing those conditionals requires carefully going into each access point and seeing how best to handle it. This is laborious and difficult for the couple of hundred database accesses I'll need to patch in this way.
I'm willing to do that, but I thought I'd inquire if there were a more efficient/clever way or at least heuristics that would help in finishing this fix efficiently and well.
(I should state that there is no particular "architecture" of this application...it's mostly what could be called "ravioli code", where the GUI and database calls and logic are all together in units that "go together". I am not willing to re-write the architecture of the whole project in MVC or something like this at this point, though I'd consider it for future projects.)

Your gut feeling is right. There is no way to add robustness to the application without reviewing each database access point separately.
You still have a lot of important choices to make about how the application should react to errors, and they depend on factors like:
Is it attended, or sometimes completely unattended?
Is delay OK, or is it important to report database errors promptly?
What are relative frequencies of the three types of failure that you describe?
Now that you have a single wrapper, you can use it to do some common configuration and error handling, especially:
set reasonable connect timeouts
set reasonable busy timeouts
enforce command timeouts on client side
retry automatically on errors, especially on SQLITE_BUSY (insert large delays between retries, fail after a few retries)
use exceptions to reduce the number of application-level handlers. You may be able to restart the whole application on database errors. However, do that only if you are confident about the state in which you are aborting the application; consistent use of transactions can ensure that a restart does not leave inconsistent data behind.
ask a human for help when you detect a locking error
...but there comes a moment where you need to bite the bullet and let the error out into the application, and see what all the particular callers are likely to do with it.
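As a rough illustration of the wrapper ideas listed above (busy timeout, automatic retry with growing delays, failing after a few attempts), here is a minimal sketch using Python's standard sqlite3 module. The question does not say which binding the application uses, so the function name run_query, the DatabaseUnavailable exception, and all default values are assumptions made for the example.

```python
import sqlite3
import time


class DatabaseUnavailable(Exception):
    """Hypothetical error raised once the wrapper gives up."""


def run_query(db_path, sql, params=(), retries=3, delay=0.5, busy_timeout=5.0):
    """Run one statement and return its rows, retrying while the file is locked.

    All names and default values are illustrative, not taken from the question.
    """
    last_error = None
    for attempt in range(retries):
        try:
            # timeout= is the busy timeout: how long SQLite waits on a lock.
            conn = sqlite3.connect(db_path, timeout=busy_timeout)
            try:
                with conn:  # commits on success, rolls back on an exception
                    return conn.execute(sql, params).fetchall()
            finally:
                conn.close()
        except sqlite3.OperationalError as exc:
            last_error = exc
            if "locked" not in str(exc).lower():
                break  # not a lock problem; retrying is unlikely to help
            time.sleep(delay * (attempt + 1))  # wait longer before each retry
    # Callers still have to decide what to do when this is raised.
    raise DatabaseUnavailable(str(last_error))
```

Every caller then has a single exception type to handle, which is the point where each access site still needs its own decision.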

What can change my floating point control word behind my back?

I have a 32 bit Windows application, written primarily in Delphi, which performs floating point numerical simulations using the 8087 FPU. I have recently added the ability to link in external Python code using the Python API through python2x.dll. This recent change has led to some very strange behaviour.
The application has a batch mode of operation where it performs multiple simulations in parallel to take advantage of multi-core architectures. As soon as Python code has been executed in the process, I start to see changes to the 8087 control word on different threads. I've checked this very carefully and I have observed the control word having changed even in a thread which has never called into the Python DLL.
I know this sounds fantastical, but, as I have discovered, there are mechanisms for this behaviour to manifest. I have learnt about signals. I first hypothesised that the Python DLL was setting process wide signal handlers (by calling signal()) and these signal handlers were responsible for changing the control word. For example, a thread, unrelated to the Python code, could perhaps cause SIGFPE and that may, in turn, modify the control word.
I have since come to the conclusion that signal() is not the mechanism. I arranged to execute the Python code at startup. Then I set all of the signal handlers back to SIG_DFL. Then I started the simulations. But the control word changes still occurred.
My question (finally) is whether anyone knows of another mechanism by which the control word could be changed in such a manner. I'm looking for interrupts, APCs etc., I think!
Update
The control word is being changed to 0x037F, which is the Intel default value. This differs from the MSVC/Windows default of 0x027F. I hypothesise that something is executing FNINIT.
I also discovered Py_InitializeEx which allows the caller to stop Python setting signal handlers. The control word changes occur even if I use this approach to initialisation so I'm even more convinced that is not the mechanism.

One candidate is a DllMain call with the DLL_THREAD_ATTACH flag; see MSDN.
Update
I have found a link to a similar problem: DLL Load "Poisons" FPU Control Word for New Threads. But yes, it is about threads created after the DLL is loaded.

If I remember correctly, that's Delphi's problem. There are some discussions of the issue here and here. I remember bumping into it when trying to write some VST plugins in Delphi.

I have seen a case like this where the printer driver of the default printer changed the control word behind my back. When I changed the default printer, the problem went away.
To circumvent this problem I set the control word to the default value at appropriate places with:
_control87(_CW_DEFAULT, _CW_DEFAULT);
I have also seen the same problem on all machines of a customer that had Norton Security 2011 installed, but the problem went away with the fix for the printer driver, so I'm not really sure if Norton was really the cause.

How to recognise a particular user in a long multi-user internet chat log?

Here is an online programming contest we are planning to have.
What are possible approaches to solving it?
From a random IRC (Internet Relay Chat) log, a small percentage of the user nicknames will be randomly deleted. The participant’s code must be able to fill in the missing user nicks. In other words, this event requires you to come up with an intelligent program that can figure out “who could have said what”.
It may be assumed that all communication will be in modern English, with or without punctuation.
For example -
Original Chat:
...
<user1>: Hey!
<user2>: Hello! Where are you from, user1?
<user3>: Can anybody help me out with Gnome installation?
<user1>: India. user3, do you have the X Windows System installed?
<user2>: Cool. What is Gnome, user3?
<user3>: I don’t know. How do I check?
<user3>: Its a desktop environment, user2.
<user2>: Oh yeah! Just googled.
<user1>: Type “startx” on the command line. Login as root and type “apt-get install gnome”.
<user3>: Thanks!
<user5>: I’m root, obey me!
<user2>: Huh?!
<user3>: user2, you better start using Linux!
...
Only the following will be given to the participant.
Chat log with some nicks deleted:
...
<user1>: Hey!
<user2>: Hello! Where are you from, user1?
<user3>: Can anybody help me out with Gnome installation?
<user1>: India. user3, do you have the X Windows System installed?
<user2>: Cool. What is Gnome, user3?
<%%%>: I don’t know. How do I check?
<%%%>: Its a desktop environment, user2.
<user2>: Oh yeah! Just googled.
<user1>: Type “startx” on the command line. Login as root and type “apt-get install gnome”.
<user3>: Thanks!
<%%%>: I’m root, obey me!
<%%%>: Huh?!
<user3>: user2, you better start using Linux!
...
The participant’s code will have the task of replacing each "<%%%>" with the appropriate user nick. In ambiguous cases, like the random comment by <user5> in the above example (which could have been said by any other user too!), the code should indicate that the case is ambiguous.

Two things spring to my mind: authorship attribution and chat disentanglement. Neither is exactly what you describe, but they both come pretty close.
Authorship attribution is the problem of trying to find which of a known set of authors wrote a particular document. Classic authorship attribution is typically used on large sections of text (e.g. plays, novels, speeches) but people have been trying to do the same on shorter samples of text from internet sources. A good reference is probably anything written by Moshe Koppel with 'authorship' in the title, for example the recent paper Authorship Attribution in the Wild. The usual approach to this task involves using typical document classification approaches, that is using bag of words features and a machine learning classifier, on a set of what are usually thought of as stop words (e.g. as, of, the, etc.). The problem here is that all of this work is on documents and does not take into account the conversational nature of IRC data.
Chat disentanglement is the problem of identifying a number of coherent 'conversations' in chat data. This is quite a hard problem, as you often need to use the context of the conversation to know who is replying to whom. I imagine this kind of approach would be important to this task as well. For example, if the anonymised message is part of a conversation, then that limits the set of authors to the people in the conversation. I really only know about this from the paper Disentangling Chat by Elsner and Charniak. Their 'related work' section is a good overview of the field.

One possible solution would be to take the Naive Bayes Classifier 'spam filter' idea and see which words different nicks tend to use. Classify messages according to which user uses words 'most like' the ones sent from an unknown user. The downfall of this would be that if they were using new words you hadn't seen before (which is very likely), then you'd need to understand higher-level context information.
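To make the Naive Bayes idea concrete, here is a minimal sketch that learns per-nick word counts from the lines whose author is known and scores an anonymised line against each nick. The function names, the crude tokenizer, and the add-one smoothing are choices made for this example, not part of the contest description, and it ignores the conversational context discussed above.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding ASCII punctuation."""
    words = (w.strip(".,!?\"'():;") for w in text.lower().split())
    return [w for w in words if w]


def train(messages):
    """messages: iterable of (nick, text) pairs whose author is known."""
    word_counts = defaultdict(Counter)
    message_counts = Counter()
    for nick, text in messages:
        message_counts[nick] += 1
        word_counts[nick].update(tokenize(text))
    return word_counts, message_counts


def guess_nick(text, word_counts, message_counts):
    """Return the nick with the highest smoothed log-probability for text."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_messages = sum(message_counts.values())
    best_nick, best_score = None, float("-inf")
    for nick in sorted(message_counts):
        total_words = sum(word_counts[nick].values())
        # Prior: how often this nick speaks at all.
        score = math.log(message_counts[nick] / total_messages)
        for word in tokenize(text):
            # Add-one smoothing so an unseen word does not zero the score.
            count = word_counts[nick][word] + 1
            score += math.log(count / (total_words + len(vocabulary) + 1))
        if score > best_score:
            best_nick, best_score = nick, score
    return best_nick
```

To flag ambiguous lines, one could compare the best and second-best scores and report "unknown" when they are close; the threshold would need tuning on real logs.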

Game login authentication and security

First off I will say I am completely new to security in coding. I am currently helping a friend develop a small game (in Python) which will have a login server. I don't have much knowledge regarding security, but I know many games do have issues with this. Everything from 3rd party applications (bots) to WPE packet manipulation. Considering how small this game will be and the limited user base, I doubt we will have serious issues, but would like to try our best to limit problems. I am not sure where to start or what methods I should use, or what's worth it. For example, sending data to the server such as login name and password.
I was told this information should be encrypted when sending, so that in case someone was viewing it (by whatever means), they couldn't get into the account. However, if someone is able to capture the encrypted string, wouldn't this string always work, since it's decrypted server side? In other words, someone could just capture the packet, reuse it, and still gain access to the account?
The main goal I am really looking for is to make sure the players are logging into the game with the client we provide, and to make sure it's 'secure' (broad, I know). I have looked around at different methods such as Public and Private Key encryption, which I am sure any hex editor could eventually find. There are many other methods that seem way over my head at the moment and leave the impression of overkill.
I realize nothing is 100% secure. I am just looking for any input or reading material (links) to accomplish the main goal stated above. Would appreciate any help, thanks.

This is a tough problem, because the code runs on the client. The replay problem can be solved with a challenge: the server sends a random token, which the client adds to the string to be encrypted. This way, the encrypted password string is different each time, and replaying a captured string doesn't work (the server checks that the encrypted password contains the last token it sent).
The problem is that the encryption key has to be stored on the client, and it's possible to retrieve that key.
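Here is a minimal sketch of the challenge idea described above. Note the substitution: instead of encrypting the password together with the token, it has the client return an HMAC of the token keyed with a shared secret, which is a common way to build such a scheme. The LoginServer class, client_response function, and in-memory storage are hypothetical names for the example, and keeping the secret in plain form on the server is a simplification.

```python
import hashlib
import hmac
import secrets


class LoginServer:
    """Toy in-memory server; names and storage are illustrative only."""

    def __init__(self, shared_secrets):
        self._secrets = shared_secrets  # user -> secret bytes
        self._pending = {}              # user -> outstanding token

    def issue_challenge(self, user):
        """Create a fresh random token for this login attempt."""
        token = secrets.token_bytes(16)
        self._pending[user] = token
        return token

    def verify(self, user, response):
        # pop() makes every token single-use, which is what defeats replay.
        token = self._pending.pop(user, None)
        secret = self._secrets.get(user)
        if token is None or secret is None:
            return False
        expected = hmac.new(secret, token, hashlib.sha256).digest()
        return hmac.compare_digest(expected, response)


def client_response(secret, token):
    """What the client sends back instead of the password itself."""
    return hmac.new(secret, token, hashlib.sha256).digest()
```

A captured response is useless for a later login because the server has already discarded the token it was computed from. As the answer says, this does nothing about the secret being recoverable from the client.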

The simple answer to how to protect the password going over the wire, replay attacks, and message tampering is: use SSL. Yes, there are other things you can do with challenge-response authentication schemes for the login part of it, but it sounds like you really want the whole channel protected anyway. Use SSL at the socket layer and then you don't have to do anything else complicated with how you send your credentials.
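For a Python client, protecting the socket can be as small as the sketch below, which uses the standard ssl module. The function names and the timeout value are made up for the example, and it assumes the game server presents a certificate that the client's trust store can verify.

```python
import socket
import ssl


def make_client_context():
    """TLS context that verifies the server certificate and hostname."""
    return ssl.create_default_context()


def open_secure_connection(host, port, timeout=10.0):
    """Connect to host:port and return a TLS-wrapped socket."""
    context = make_client_context()
    raw = socket.create_connection((host, port), timeout=timeout)
    try:
        # server_hostname enables hostname checking against the certificate.
        return context.wrap_socket(raw, server_hostname=host)
    except Exception:
        raw.close()
        raise
```

Credentials written to the returned socket are then encrypted and protected against tampering and replay at the transport level.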

As for how to protect the client... The most realistic thing is to say that you don't. When you're writing the server code, never trust any data that the client sends you. Never give the client any information you don't want the player to have.
In some games, like chess (or really anything turn-based), that actually works pretty well, because it's very easy for the server to verify that the move passed in by the client is a legal move.
In other games, those restrictions aren't so practical, and then I don't know what you'd do, from a code perspective. I'd try to shift the problem to a social one at that point: Are the other players people you'd trust to bring their own dice to your gaming table? If not, can you play with someone else?
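For the turn-based case, server-side validation can be very small. The sketch below uses tic-tac-toe rather than chess to keep it short; the board representation and the apply_move name are assumptions made for the example.

```python
def apply_move(board, player, cell, turn):
    """Validate a client-submitted move; return the new board or raise.

    board:  tuple of 9 entries, each "X", "O" or None
    player: mark of the authenticated player who sent the move
    turn:   mark of the player the server expects to move next
    """
    if player != turn:
        raise ValueError("not this player's turn")
    if not isinstance(cell, int) or not 0 <= cell < 9:
        raise ValueError("cell out of range")
    if board[cell] is not None:
        raise ValueError("cell already taken")
    # The server builds the new state itself; it never accepts a board
    # sent by the client.
    return board[:cell] + (player,) + board[cell + 1:]
```

The client only ever proposes a move; the authoritative game state lives on the server.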

Although it might conceivably be overkill for this specific application (small game, limited user base), you should seriously consider using OAuth, since this application gives you a great chance to learn this technology (which is really solid and widespread, and spreading more and more!-) to apply it in the future -- it's been designed and implemented by excellent programmers with a strong security background. You can study their tutorial to get a solid understanding and background, then use libraries such as python-oauth2 to implement OAuth simply and productively in your desktop application.

In Python in GAE, what is the best way to limit the risk of executing untrusted code?

I would like to enable students to submit Python code solutions to a few simple Python problems. My application will be running in GAE. How can I limit the risk from malicious code that is submitted? I realize that this is a hard problem and I have read related Stack Overflow and other posts on the subject. I am curious whether the restrictions already in place in the GAE environment make it simpler to limit the damage that untrusted code could inflict. Is it possible to simply scan the submitted code for a few restricted keywords (exec, import, etc.) and then ensure the code only runs for less than a fixed amount of time, or is it still difficult to sandbox untrusted code even in the restricted GAE environment? For example:
# Import and execute untrusted code in GAE
untrustedCode = """#Untrusted code from students."""
class TestSpace(object): pass
testspace = TestSpace()
try:
    # Check the untrusted code somehow and throw an exception.
    pass
except:
    print "Code attempted to import or access network"
try:
    # exec code in a new namespace (Thanks Alex Martelli)
    # limit runtime somehow
    exec untrustedCode in vars(testspace)
except:
    print "Code took more than x seconds to run"

@mjv's smiley comment is actually spot-on: make sure the submitter IS identified and associated with the code in question (which presumably is going to be sent to a task queue), and log any diagnostics caused by an individual's submissions.
Beyond that, you can indeed prepare a test-space that's more restrictive (thanks for the acknowledgment;-) including a special 'builtin' that has all you want the students to be able to use and redefines __import__ &c. That, plus a token pass to forbid exec, eval, import, __subclasses__, __bases__, __mro__, ..., gets you closer. A totally secure sandbox in a GAE environment however is a real challenge, unless you can whitelist a tiny subset of the language that the students are allowed.
So I would suggest a layered approach: the sandbox GAE app in which the students upload and execute their code has essentially no persistent layer to worry about; rather, it "persists" by sending urlfetch requests to ANOTHER app, which never runs any untrusted code and is able to vet each request very critically. Default-denial with whitelisting is still the holy grail, but with such an extra layer for security you may be able to afford a default-acceptance with blacklisting...
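The "token pass" mentioned above could be sketched with the standard ast module, as below. This is written for Python 3 (the question's snippet is Python 2), the function name find_violations is made up for the example, and the banned lists contain only the names mentioned in this answer. As the answer stresses, a blacklist like this narrows the attack surface but is not a secure sandbox on its own.

```python
import ast

# Only the names called out in the answer; a real list would be longer.
BANNED_NAMES = {"exec", "eval", "__import__"}
BANNED_ATTRIBUTES = {"__subclasses__", "__bases__", "__mro__"}


def find_violations(source):
    """Return a list of human-readable reasons to reject the source."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return ["syntax error: %s" % exc.msg]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            problems.append("import statement on line %d" % node.lineno)
        elif isinstance(node, ast.Name) and node.id in BANNED_NAMES:
            problems.append("use of %s on line %d" % (node.id, node.lineno))
        elif isinstance(node, ast.Attribute) and node.attr in BANNED_ATTRIBUTES:
            problems.append("access to %s on line %d" % (node.attr, node.lineno))
    return problems
```

Working on the parsed tree rather than on raw text avoids false positives from comments and string literals, but names built at run time (for example via getattr with a computed string) still slip through, which is why the restricted builtins and the separate app remain necessary.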

You really can't sandbox Python code inside App Engine with any degree of certainty. Alex's idea of logging who's running what is a good one, but if the user manages to break out of the sandbox, they can erase the event logs. The only place this information would be safe is in the per-request logging, since users can't erase that.
For a good example of what a rathole trying to sandbox Python turns into, see this post. For Guido's take on securing Python, see this post.
There are a couple of other options: If you're free to choose the language, you could run Rhino (a JavaScript interpreter) on the Java runtime; Rhino is nicely sandboxed. You may also be able to use Jython; I don't know if it's practical to sandbox it, but it seems likely.
Alex's suggestion of using a separate app is also a good one. This is pretty much the approach that shell.appspot.com takes: It can't prevent you from doing malicious things, but the app itself stores nothing of value, so there's no harm if you do.

Here's an idea. Instead of running the code server-side, run it client-side with Skulpt:
http://www.skulpt.org/
This is both safer and easier to implement.
