Processing strings in Python - Packages or Hard Code? - python

In Python I am doing a number of different string processing functions in a program. The user enters a term in a form and the term is processed through different functions. These include stemming, stop word removal, punctuation removal, spell checking, and getting synonyms.
Stemming is done using the stemming package,
stop word & punctuation removal using string.replace() and REGEX,
spell checking using pyEnchant,
and getting synonyms using the Big Huge Thesaurus API.
The term is sent to an API. The results are returned and put through a hard-coded sorting process. After all that, the results are output to the user. The whole process takes over 10 seconds, which is too long. I'm wondering if the fact that I am using many extensions, thereby importing them, is causing the long delays.
Hope this isn't against the stackoverflow rules but I'm new to python and this is the kind of thing that I need to know.

I'm wondering if the fact that I am using many extensions, thereby importing them, is causing the long delays.
Very unlikely. If you just import once, then call in a loop, the loop should take most of the time. (Or are you firing up a Python process per word/sentence?)
As a rule of thumb, computer programs tend to spend 90% of their time executing 10% of the code. That part is worth optimizing. Things like import statements are usually not. To find out where your program is spending its time, use a profiler.
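For example, a minimal profiling sketch; the process_term() function and the profile.out filename are placeholders, not your actual code:

import cProfile
import pstats

def process_term(term):
    # stand-in for the real pipeline: stemming, stop word removal,
    # spell checking, synonym lookup, the API call and the sorting
    return term.lower()

# profile one full run and dump the stats to a file
cProfile.run('process_term("example")', "profile.out")

# show the 10 functions with the largest cumulative time
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)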

Time how long each of the individual checks take. Then compare the results to see what is actually taking the most time.
import time

start = time.time()
# ... run the individual piece being timed here ...
end = time.time()  # after the individual piece has completed
print(end - start, "seconds")
It'd be interesting to actually know how long each component of the string processing is taking.
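For instance, a rough sketch of timing each stage separately; the stage functions below are trivial stand-ins, not the real stemming, stop word, or synonym code:

import time

def timed(label, func, *args):
    # run one stage and print how long it took
    start = time.time()
    result = func(*args)
    print(label, time.time() - start, "seconds")
    return result

# stand-in stages; swap in the real stemming / stop word / spell check / synonym calls
def stem(term): return term.rstrip("ing")
def remove_stop_words(term): return term
def get_synonyms(term): return [term]

term = timed("stemming", stem, "processing")
term = timed("stop words", remove_stop_words, term)
synonyms = timed("synonyms", get_synonyms, term)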

Related

discord.py: too big variable?

I'm very new to python and programming in general, and I'm looking to make a discord bot that has a lot of hand-written chat lines to randomly pick from and send back to the user. Making a really huge variable full of a list of sentences seems like a bad idea. Is there a way that I can store the chatlines on a different file and have the bot pick from the lines in that file? Or is there anything else that would be better, and how would I do it?
I'll interpret this question as "how large a variable is too large", to which the answer is pretty simple: a variable is too large when it becomes a problem. So, how can a variable become a problem? The big one is that the machine could possibly run out of memory, and an OOM killer (out-of-memory killer) or similar will stop your program. How would you know if your variable is causing these issues? Pretty simple: your program crashes.
If the variable is static (with a size fully known at compile-time or prior to interpretation), you can calculate how much RAM it will take. (This is a bit finicky with Python, so it might be easier to load it up at runtime and figure it out with a profiler.) If it's more than ~500 megabytes, you should be concerned. Over a gigabyte, and you'll probably want to reconsider your approach[^0]. So, what do you do then?
As suggested by #FishballNooodles, you can store your data line-by-line in a file and read the lines to an array. Unfortunately, the code they've provided still reads the entire thing into memory. If you use the code they're providing, you've got a few options, non-exhaustively listed below.
Consume a random number of newlines from the file when you need a line of text. You would look at one character at a time, compare it to \n, and read the line if you've encountered the requested number of newlines. This is O(n) worst case with respect to the number of lines in the file.
Rather than storing the text you need at a given index, store its location in the file. Then you can seek to that location (which is probably O(1)) and read the text. This requires an O(n) construction cost at the start of the program, but works much better at runtime; a sketch of this approach follows the list below.
Use an actual database. It's usually better not to reinvent the wheel. If you're just storing plain text, this is probably overkill, but don't discount it.
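Here is a minimal sketch of the offset-index approach; the file name responses.txt is made up for illustration, and the file is assumed to hold one chat line per line:

import random

def build_line_index(path):
    # record the byte offset at which each line starts (O(n), done once at startup)
    offsets = []
    with open(path, "rb") as f:
        offset = 0
        for line in f:
            offsets.append(offset)
            offset += len(line)
    return offsets

def read_line_at(path, offset):
    # seek straight to a stored offset and read just that one line
    with open(path, "rb") as f:
        f.seek(offset)
        return f.readline().decode("utf-8").rstrip("\n")

offsets = build_line_index("responses.txt")
print(read_line_at("responses.txt", random.choice(offsets)))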
[^0]: These numbers are really just rough guesses. If you control the server environment on which you run the code, then you can probably come up with some more precise signposts.
You can store your data in a file, say one named response.txt,
and retrieve it in the discord bot file as open("response.txt").readlines()
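For instance, a minimal sketch of that, assuming response.txt exists and holds one chat line per line:

import random

# readlines() keeps the trailing newlines, so strip them off
with open("response.txt") as f:
    lines = [line.rstrip("\n") for line in f]

# pick one line at random to send back to the user
print(random.choice(lines))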

Why does this simple loop produce a "jittery" print?

I'm just learning Python and I have tried this simple loop based on Learn Python The Hard Way. With my basic understanding, this should keep printing "Hello", one letter at a time, at the same position. This seems to be the case, but the print is not fluid: it doesn't spend the same amount of time on each character; some go very fast, and then it seems to get stuck for one or two seconds on one.
Can you explain why?
while True:
    for i in ["H", "e", "l", "l", "o"]:
        print "%s\r" % i,
You are running an infinite loop with very little work done in it, and most of that work is printing. The bottleneck of such an application is how fast your output can be rendered by your running environment (your console).
There are various buffers involved, and the system can also schedule other processes and therefore pause your app for a few cycles.
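If you want each character to appear at a steady pace, one option is to flush the output buffer yourself and sleep explicitly between characters; a small sketch:

import sys
import time

while True:
    for ch in "Hello":
        # overwrite the same position and flush so the character shows up immediately
        sys.stdout.write("%s\r" % ch)
        sys.stdout.flush()
        time.sleep(0.2)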

Speed up feedparser

I'm using feedparser to print the top 5 Google news titles. I get all the information from the URL the same way as always.
x = 'https://news.google.com/news/feeds?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss'
feed = fp.parse(x)
My problem is that I'm running this script when I start a shell, so that ~2 second lag gets quite annoying. Is this time delay primarily from communications through the network, or is it from parsing the file?
If it's from parsing the file, is there a way to only take what I need (since that is very minimal in this case)?
If it's from the former possibility, is there any way to speed this process up?
I suppose that a few delays are adding up:
The Python interpreter needs a while to start and import the module
Network communication takes a bit
Parsing probably consumes only a little time, but it does contribute
I think there is no straightforward way of speeding things up, especially not the first point. My suggestion is that you have your feeds downloaded on a regular basis (you could set up a cron job or write a Python daemon) and stored somewhere on your disk (e.g. a plain text file), so you just need to display them at your terminal's startup (echo would probably be the easiest and fastest).
I have personally had good experiences with feedparser. I use it to download ~100 feeds every half hour with a Python daemon.
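A minimal sketch of that caching idea; the cache path is just an example location, and the script is meant to be run from cron or a daemon rather than at shell startup:

import feedparser

FEED_URL = ('https://news.google.com/news/feeds'
            '?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss')
CACHE_PATH = '/tmp/google_news_titles.txt'   # example location

# fetch and parse the feed, then store only what is needed: the top 5 titles
feed = feedparser.parse(FEED_URL)
titles = [entry.title for entry in feed.entries[:5]]

with open(CACHE_PATH, 'w') as f:
    f.write('\n'.join(titles) + '\n')

At shell startup you then just cat /tmp/google_news_titles.txt (or read the file from Python) instead of hitting the network.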
Parsing in real time is not the best approach if you want faster results.
You could do the parsing asynchronously with Celery or a similar solution. I like Celery; it offers many features, such as scheduled (cron-like) tasks, asynchronous execution, and more.

Multithreading / Multiprocessing in Python

I have created a simple substring search program that recursively looks through a folder and scans a large number of files. The program uses the Boyer-Moore-Horspool algorithm and is very efficient at parsing large amounts of data.
Link to program: http://pastebin.com/KqEMMMCT
What I'm trying to do now is make it even more efficient. If you look at the code, you'll notice that there are three different directories being searched. I would like to be able to create a process/thread that searches each directory concurrently; it would greatly speed up my program.
What is the best way to implement this? I have done some preliminary research, but my implementations have been unsuccessful. They seem to die after 25 minutes or so of processing (right now the single process version takes nearly 24 hours to run; it's a lot of data, and there are 648 unique keywords.)
I have done various experiments using the multiprocessing API and condensing all the various files into 3 files (one for each directory) and then mapping the files to memory via mmap(), but a: I'm not sure if this is the appropriate route to go, and b: my program kept dying at random points, and debugging was an absolute nightmare.
Yes, I have done extensive googling, but I'm getting pretty confused between pools/threads/subprocesses/multithreading/multiprocessing.
I'm not asking for you to write my program, just help me understand the thought process needed to go about implementing a solution. Thank you!
FYI: I plan to open-source the code once I get the program running. I think it's a fairly useful script, and there are limited examples of real world implementations of multiprocessing available online.
What to do depends on what's slowing down the process.
If you're reading on a single disk, and disk I/O is slowing you down, multiple threads/processes will probably just slow you down, as the read head will now be jumping all over the place as different threads get control, and you'll be spending more time seeking than reading.
If you're reading on a single disk, and processing is slowing you down, then you might get a speedup from using multiprocessing to analyze the data, but you should still read from a single thread to avoid seek time delays (which are usually very long, multiple milliseconds).
If you're reading from multiple disks, and disk I/O is slowing you down, then either multiple threads or processes will probably give you a speed improvement. Threads are easier, and since most of your delay time is away from the processor, the GIL won't be in your way.
If you're reading from multiple disks, and processing is slowing you down, then you'll need to go with multiprocessing.
Multiprocessing is easier to understand/use than multithreading (IMO). For the reasoning behind that, I suggest reading this section of TAOUP. Basically, anything a thread does, a process can do too; the difference is that with threads the programmer has to handle everything the OS would otherwise handle for processes. Sharing resources (memory/files/CPU cycles)? Learn locking/mutexes/semaphores and so on for threads. The OS does this for you if you use processes.
I would suggest building 4+ processes: one to pull data from the hard drive, and the other three to query it for their next piece. Perhaps a fifth process to stick it all together.
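A rough sketch of that layout with the multiprocessing module; the directory names, keyword list, and plain substring test below are placeholders, not the Boyer-Moore-Horspool code from the original program:

import multiprocessing as mp
import os

KEYWORDS = ["error", "timeout"]               # placeholder keyword list
DIRECTORIES = ["dir_a", "dir_b", "dir_c"]     # placeholder directories

def reader(dirs, work_queue, n_workers):
    # single reader: walk the directories and feed file contents to the workers
    for d in dirs:
        for root, _, files in os.walk(d):
            for name in files:
                path = os.path.join(root, name)
                with open(path, "rb") as f:
                    work_queue.put((path, f.read()))
    for _ in range(n_workers):
        work_queue.put(None)                  # one "no more work" sentinel per worker

def worker(work_queue, result_queue):
    # search each file's contents for the keywords (plain substring search here)
    while True:
        item = work_queue.get()
        if item is None:
            break
        path, data = item
        hits = [kw for kw in KEYWORDS if kw.encode() in data]
        if hits:
            result_queue.put((path, hits))
    result_queue.put(None)                    # signal that this worker is finished

if __name__ == "__main__":
    work_q, result_q = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(work_q, result_q)) for _ in range(3)]
    for w in workers:
        w.start()
    reader(DIRECTORIES, work_q, len(workers))  # keep all the disk reads in one place
    finished = 0
    while finished < len(workers):
        result = result_q.get()
        if result is None:
            finished += 1
        else:
            print(result)
    for w in workers:
        w.join()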
This naturally fits into generators. See the genfind example, along with the gengrep example that uses it.
Also on the same site, check out the coroutines section.
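For reference, the generator pipeline idea looks roughly like this; a simplified take on the genfind/gengrep pattern with illustrative names, not the originals:

import fnmatch
import os
import re

def gen_find(pattern, top):
    # yield file paths under `top` whose names match a glob pattern
    for root, _, files in os.walk(top):
        for name in fnmatch.filter(files, pattern):
            yield os.path.join(root, name)

def gen_lines(paths):
    # yield (path, line) pairs, reading one file at a time
    for path in paths:
        with open(path, errors="replace") as f:
            for line in f:
                yield path, line

def gen_grep(pattern, lines):
    # keep only the lines that match the regex
    regex = re.compile(pattern)
    for path, line in lines:
        if regex.search(line):
            yield path, line

# example: search every .log file under the current directory for "ERROR"
for path, line in gen_grep(r"ERROR", gen_lines(gen_find("*.log", "."))):
    print(path, line.rstrip())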

how to lengthen the pause between the words with text-to-speech (pyTTS or SAPI5)

Is it possible to extend the gap between spoken words when using text to speech with SAPI5 ?
The problem is that esp. with some voices, the words are almost connected to each other, which makes the speech more difficult to understand.
I'm using python and pyTTS module (on windows, since it's using SAPI)
I tried to hook into the OnWord event and add a time.sleep() or tts.Pause(), but apparently, even though all the events are caught, they are only processed at the end of the spoken text, whether I'm using the sync or async flag.
In this NON WORKING example, the sleep() method is executed only after the sentence is spoken:
import pyTTS
from time import sleep

tts = pyTTS.Create()

def f(x):
    tts.Pause()
    sleep(0.5)
    tts.Resume()

tts.OnWord = f
tts.Speak(text)
Edit: accepted solutions
The actual answers for me were either
saying each word in its own "speak" command (suggested by #Lennart Regebro), or
replacing each space with a comma (as mentioned by #Dawson), e.g.
text = text.replace(" ", ",")
which sets a reasonable pause. I didn't investigate the Pause method more than I mentioned above, since I'm happy with the accepted solutions.
You're talking about voice Rate, right?
http://msdn.microsoft.com/en-us/library/ms990078.aspx
Pause(), I believe, works a lot like a comma in a normal speech pattern, except you determine the length (natural or not).
I don't have any great solutions here. But:
PyTTS's last release was in 2007, and there seems to be no documentation. The same people now maintain a cross-platform library called pyttsx, which also supports SAPI. It has a words-per-minute setting, but no setting to increase the pause between the words. This is most likely because there is no pause between the words at all.
You can insert a long pause by making each word its own "utterance".
engine.say('The')
engine.say('quick')
engine.say('brown')
engine.say('fox.')
instead of
engine.say('The quick brown fox.')
But that probably is too long. Other than that, you probably have to wrap or subclass the SAPI driver, but I'm not 100% sure that's going to work either. People don't have pauses between words, so I'm not sure that the speech engines themselves support it.
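For context, a minimal runnable version of the word-by-word approach might look like this; the engine setup is standard pyttsx usage (on Python 3, the maintained fork pyttsx3 has the same interface):

import pyttsx

engine = pyttsx.init()

# each word queued as its own utterance, which inserts a clear gap between them
for word in 'The quick brown fox.'.split():
    engine.say(word)

engine.runAndWait()  # block until the whole queue has been spoken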
I've done some TTS work using the .NET APIs before. There is an enum in the System.Speech.Synthesis namespace called PromptBreak, which has different values for the length of the pause/break you want: http://msdn.microsoft.com/en-us/library/system.speech.synthesis.promptbreak.aspx
No idea if/how it can be used with PyTTS, but maybe it's a starting point.
