Python: efficient comparison of strings in lists

I am monitoring an RSS feed using feedparser. The feed has 100 items, and I am extracting a time stamp from each item as a unique identifier, collected in a list of strings. This is what a single item of the list looks like:
2017-07-25T20:41:59-04:00
Next, I do some Python magic with the other data from the feed, which is parsed into lists as well (same index, different lists), to extract the information I want. That part works well; I love it.
Now my problem: after a time delay
import time
time.sleep(60)
I'd like to monitor the feed again and efficiently check whether a time stamp was observed before. If so, I'd skip executing the code and wait some more until a unique time stamp shows up.
So far I have failed to implement this. I thought about making a second list and comparing the two; each list has 100 items.
New items appear at the top of the feed and move down over time, so I should be fine if I only run to the first match. That should make the code more efficient than comparing everything.
I'd be happy if someone could point me towards a solution. I am somewhat stuck; whatever I have tried has failed.
Edit:
def compare(feed_id_l, feed_id_check_l):
    # Compare items in both lists and return the indices of the first match.
    for i in range(len(feed_id_l)):
        for j in range(len(feed_id_check_l)):
            if feed_id_l[i] == feed_id_check_l[j]:
                print('match for id ' + feed_id_l[i])
                return i, j
    return -1  # no match found
Works and returns 0, 0 if the first feed item is unchanged.
I will have to figure out what to do with other cases, let's say 0, 6.
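Another idea I'm toying with is to keep every time stamp that has already been processed in a set and only act on entries that are new; membership tests on a set are fast (O(1) on average), which avoids the nested loops entirely. A rough sketch (fetch_feed_ids is just a placeholder for the feedparser call):

import time

def fetch_feed_ids():
    # Placeholder: in the real script this would re-parse the feed with
    # feedparser and return the list of time-stamp strings, newest first.
    return ['2017-07-25T20:41:59-04:00']

seen_ids = set()  # time stamps that have already been processed

while True:
    for entry_id in fetch_feed_ids():
        if entry_id in seen_ids:
            break  # entries below this one are older, nothing new past here
        # ... process the new entry here ...
        seen_ids.add(entry_id)
    time.sleep(60)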
Cheers!

Related

How to store/append/or CSV a *very* long list?

I've been trying to create a Python program that continuously appends values to a list (each value stored in its own element, of course), with the user's input determining how large the values should grow.
For example, if the user wants to generate a list whose elements increment by +1, they can choose how long the list will be:
list1 = []

def addToList():
    length = int(input("Enter the total length you would like the list to reach: "))
    for i in range(length):
        print("i = ", i)
        list1.append(i)
    print(list1)

addToList()
Pretty easy. But what happens if the value stored to generate the length of the list is an extremely long one? Like 2143017345534564656949646943634964694649368496460342305694634954560464963045493543064369345324095834954395840584958439404095345093459031450345813498572130956134095861587345891234501356982356238945613249056134805613105314561345613048561340561348905613489561340756130456130845610289345723948723947239742983472984723047024734032453453513451345715345734505577 etc...
What would be a good way to write a program that can create lists of this size?
I'm aware that just appending to the list does not work, as the program freezes (and takes up memory as well). Additionally, I've tried using NumPy, declaring the size of the array before the values for each element are computed and appended. But nothing seems to be working so far.
Is there also a way to store the values to a CSV file as each element is being computed?
That way, as each value is computed it is automatically saved to a file rather than held within the program itself (the program would just calculate the values and transfer each one of them to a CSV file).
The provided code is simple, but the project I'm working on relies on this type of code, and I know it can freeze up Python at this scale.
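One way to address the CSV question above is to stream each value to the file as it is computed instead of keeping it in a list, so memory use stays flat no matter how long the run is. A minimal sketch (output.csv and the simple range loop are placeholders for the real computation; assumes Python 3):

import csv

def write_values(length, path="output.csv"):
    # Stream each computed value straight to disk instead of appending to a list.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for i in range(length):
            writer.writerow([i])  # one value per row

write_values(int(input("Enter the total length you would like to reach: ")))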

How to implement a binary search on a list created from a file

This is my first post, please be gentle. I'm attempting to sort some
files into ascending and descending order. Once I have sorted a file, I store it in a list which is assigned to a variable. The user then chooses a file and searches for an item. I get an error message:
TypeError: unorderable types: int() < list()
whenever I try to search for an item using the variable of my sorted list; the error occurs on line 27 of my code. From research, I know that an int and a list cannot be compared, but I can't for the life of me think how else to search a large (600-item) list for an item.
At the moment I'm just playing around with binary search to get used to it.
Any suggestions would be appreciated.
year = []

with open("Year_1.txt") as file:
    for line in file:
        line = line.strip()
        year.append(line)

def selectionSort(alist):
    for fillslot in range(len(alist)-1, 0, -1):
        positionOfMax = 0
        for location in range(1, fillslot+1):
            if alist[location] > alist[positionOfMax]:
                positionOfMax = location
        temp = alist[fillslot]
        alist[fillslot] = alist[positionOfMax]
        alist[positionOfMax] = temp

def binarySearch(alist, item):
    first = 0
    last = len(alist)-1
    found = False
    while first <= last and not found:
        midpoint = (first + last)//2
        if alist[midpoint] == item:
            found = True
        else:
            if item < alist[midpoint]:
                last = midpoint-1
            else:
                first = midpoint+1
    return found

selectionSort(year)

testlist = []
testlist.append(year)
print(binarySearch(testlist, 2014))
The Year_1.txt file consists of 600 items, all years in the format 2016.
They are listed in descending order, starting at 2017 and going down to 2013. Hope that makes sense.
Is there some reason you're not using Python's bisect module?
Something like:
import bisect

sorted_year = list()
for each in year:
    bisect.insort(sorted_year, each)
... is sufficient to create the sorted list. Then you can search it using functions such as those in the documentation.
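For example, a membership test with bisect_left could look like this (a small sketch continuing the snippet above; note the item must be a string, matching what was read from the file):

def contains(sorted_list, item):
    # bisect_left gives the insertion point; the item is present only if the
    # element already at that position equals it.
    i = bisect.bisect_left(sorted_list, item)
    return i < len(sorted_list) and sorted_list[i] == item

print(contains(sorted_year, "2014"))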
(Actually you could just use year.sort() to sort the list in place ... bisect.insort() might be marginally more efficient for building the list from the input stream in lieu of your call to year.append() ... but my point about using the bisect module remains.)
Also note that 600 items is trivial for modern computing platforms. Even 6,000 won't take more than a few milliseconds. On my laptop sorting 600,000 random integers takes about 180 ms, and similarly sized strings still take under 200 ms.
So you're probably not gaining anything by sorting this list in this application at that data scale.
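A quick way to sanity-check timings like these on your own machine (a rough sketch using the standard timeit module; exact numbers will of course vary):

import random
import timeit

data = [random.randint(0, 10**6) for _ in range(600000)]
# Time a single full sort of the 600,000 integers.
print(timeit.timeit(lambda: sorted(data), number=1))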
On the other hand, Python also includes a number of modules in its standard library for managing structured data and data files. For example, you could use Python's sqlite3 module.
Using this, you'd use standard SQL DDL (data definition language) to describe your data structure and schema, SQL DML (data manipulation language: INSERT, UPDATE, and DELETE statements) to manage the contents, and SQL queries to fetch data from it. Your data can be returned sorted on any column, with any mixture of ascending and descending order across any number of columns, using standard SQL ORDER BY clauses, and you can add indexes to your schema to ensure that the data is stored in a manner that enables efficient querying and traversal (table scans) in any order on any key(s) you choose.
Because Python includes SQLite in its standard library, and because SQLite provides SQL client/server semantics over simple local files, there's almost no downside to using it for structured data. It's not like you have to install and maintain additional software or servers, handle network connections to a remote database server, or any of that.
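A minimal sketch of that approach (the years.db file name, table, and column are only illustrative; the years are read from Year_1.txt as in the question):

import sqlite3

conn = sqlite3.connect("years.db")  # a plain local file, no server required
conn.execute("CREATE TABLE IF NOT EXISTS years (year INTEGER)")

with open("Year_1.txt") as f:
    conn.executemany("INSERT INTO years (year) VALUES (?)",
                     ((int(line),) for line in f))
conn.commit()

# Sorted retrieval and a simple membership test via plain SQL.
for (y,) in conn.execute("SELECT year FROM years ORDER BY year DESC"):
    print(y)
print(conn.execute("SELECT 1 FROM years WHERE year = ?", (2014,)).fetchone() is not None)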
I'm going to walk through some steps before getting to the answer.
You need to post a minimal, complete, verifiable example. Instead of telling us to read from "Year_1.txt", which we don't have, you need to put the list itself in the code. Do you NEED 600 entries to get the error in your code? No. This is sufficient:
year = ["2001", "2002", "2003"]
If you really need 600 entries, then provide them. Either post the actual data, or
year = [str(x) for x in range(2017-600, 2017)]
The code you post needs to be cut, paste, boom: it reproduces the error on my computer just like that.
selectionSort is completely irrelevant to the question, so delete it from the question entirely. In fact, since you say the input was already sorted, I'm not sure what selectionSort is actually supposed to do in your code, either. :)
Next you create testlist = [] and then call testlist.append(year). USE YOUR DEBUGGER before you ask here. Simply looking at the value in your variable would have made the problem obvious.
How to append list to second list (concatenate lists)
Fixing that means you now have a list of things to search. Before you were searching a list to see if 2014 matched the one thing in there, which was a complete list of all the years.
Now we get into binarySearch. If you look at the variables, you see you are comparing the integer 2014 with some string, maybe "1716", and the answer to that is useless, if it even lets you do that (I have python 2.7 so I am not sure exactly what you get there). But the point is you can't find the integer 2014 in a list of strings, so it will always return False.
If you don't have a debugger, then you can place strategic print statements like
print ("debug info: binarySearch comparing ", item, alist[midpoint])
Now here, what VBB said in comments worked for me, after I fixed the other problems. If you are searching for something that isn't even in the list and expecting True, that's wrong. Searching for "2014" returns True, if you provide the correct list to search. Alternatively, you could convert the search item to a string and then search for it, or you could force all the years to int during the input phase. But the int 2014 is not the same as the string "2014".
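For instance, a sketch of the input phase with the int conversion applied (assuming Year_1.txt as in the question), after which searching the flat, ascending list for the integer 2014 behaves as expected:

year = []
with open("Year_1.txt") as file:
    for line in file:
        year.append(int(line.strip()))  # store ints, not strings

year.sort()                             # binary search needs ascending order
print(binarySearch(year, 2014))         # search the flat list, not [year]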

How can I append data to a list of lists efficiently?

I have a list of lists just shy of two million elements, each with 7 entries.
I run a machine learning algorithm on the data and would like to append the result of the classification to the end of each element.
I use the .append() feature, something like
for j in range(len(data)):
    data[j].append(results[j])
However, this takes a lot of time (8+ hours and it had still not terminated).
I'm wondering if there is a more efficient way to do this. The data is read in from a CSV file, so maybe I could write the results directly into the CSV?
I was thinking about using numpy arrays, but I recall someone saying that lists are faster.
Anyone have any ideas?
EDIT: here is my code
import csv

with open("measles_data_b", 'r') as f:
    reader = csv.reader(f)
    t = list(reader)

### Perform the machine learning. That bit works fine.

# At this point, t is a list with size=1971203, and each element in t has 7 elements of its own.
# results is a list with the same number of elements. Its entries are
# one of three things: '1', '2', '0'.

for j in range(len(t)):
    t[j].append(results[j])
As an experiment, run the following code:
import random

def append_items(lists, items):
    for i in range(len(lists)):
        lists[i].append(items[i])

rand_lists = [[random.randint(0, 9) for i in range(7)] for j in range(2000000)]
rand_list = [random.randint(0, 9) for i in range(2000000)]
print("Lists generated")

append_items(rand_lists, rand_list)
print("Lists appended")
When I run it I need to wait 20-30 seconds to see "Lists generated" printed, but the next print is almost instantaneous. If you don't get this sort of behavior, then you have a buggy Python installation. If you do, it's hard to say what is happening in your case. It might be interesting to look at type(t[0]): perhaps you have a list of list-like objects rather than a list of lists, and your list-like objects implement an inefficient append method (I haven't used it, but it seems at least possible that csv.reader returns some sort of custom object).
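If writing the results straight back out is acceptable, as the question itself wonders, a sketch along these lines sidesteps the in-memory appends entirely (the output file name is hypothetical, results is the classification list from the question, and Python 3 is assumed):

import csv

with open("measles_data_b", "r") as f_in, \
     open("measles_data_b_labelled", "w", newline="") as f_out:
    writer = csv.writer(f_out)
    for row, label in zip(csv.reader(f_in), results):
        writer.writerow(row + [label])  # the original 7 fields plus the label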

Converting a list of strings to a single pattern

I have a list of strings that follow a specific pattern. Here's an example
['ratelimiter:foobar:201401011157',
'ratelimiter:foobar:201401011158',
'ratelimiter:foobar:201401011159',
'ratelimiter:foobar:201401011200']
I'm trying to end up with a glob pattern that will represent this list, like the following:
'ratelimiter:foobar:201401011*'
I know the first two fields ahead of time. The third field is a time stamp, and I want to find the column at which the values start to differ across the list.
In the example given, the timestamp ranges from 2014-01-01 11:57 to 2014-01-01 12:00, and the column that differs is the third-to-last column, where 1 changes to 2. If I can find that, then I can slice the string with [:-3] and append '*' (for this example).
Every time I try and tackle this problem I end up with loops everywhere. I just feel like there's a better way of doing this.
Or maybe someone knows a better way of doing this with Redis. I'm doing this because I'm trying to get keys from Redis, and I don't want to make a request for every key but rather make a batch request using the pattern parameter. Maybe there's a better way of doing this, but I haven't found anything yet.
Thanks
Staying with the pattern approach (converting to a timestamp is probably best, though), I would do this to find the longest common prefix:
items = ['ratelimiter:foobar:201401011157',
         'ratelimiter:foobar:201401011158',
         'ratelimiter:foobar:201401011159',
         'ratelimiter:foobar:201401011200']

print items[0][:[len(set(x)) == 1 for x in zip(*items)].index(False)] + '*'
# ratelimiter:foobar:201401011*
Which reads as: cut the first element of items at the position where the nth characters of the items are no longer all equal.
[len(set(x)) == 1 for x in zip(*items)] returns a list of booleans that is True at index i if the characters at position i are equal across all items.
This is what I would do:
convert the timestamps to numbers
find the max and min (if your list is not ordered)
take the difference between max and min and convert it back into a pattern.
For example, in your case the difference between max and min is 43. Since the min already ends in 57, you can quickly deduce that if the min ends with ***157, the max should end with ***200, and then you know the pattern.
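A rough sketch of that idea (one way to turn the min/max into a pattern is to keep their shared leading digits; the fixed 'ratelimiter:foobar:' prefix is taken from the question):

items = ['ratelimiter:foobar:201401011157',
         'ratelimiter:foobar:201401011158',
         'ratelimiter:foobar:201401011159',
         'ratelimiter:foobar:201401011200']

stamps = [int(s.rsplit(':', 1)[1]) for s in items]  # numeric timestamps
lo, hi = str(min(stamps)), str(max(stamps))

# Keep the digits that min and max share, then wildcard the rest.
common = 0
while common < len(lo) and lo[common] == hi[common]:
    common += 1
print('ratelimiter:foobar:' + lo[:common] + '*')
# ratelimiter:foobar:201401011*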
You almost never want to use the '*' pattern parameter in Redis in production because it is very slow: much slower than making a request for each key individually in the vast majority of cases. Unless you're requesting so many keys that your bottleneck becomes the sheer amount of data you're transferring over the network (in which case you should really convert things to Lua and run the logic server-side), a pipeline is really what you want.
The reason you want a pipeline is that you're probably getting hit by the cost of transferring data back and forth between you and your Redis server in separate round trips right now. A pipeline, in contrast, queues up a bunch of commands to run against Redis and then executes them all at once, when you're ready. Assuming you're using redis-py (if you're not, you really should be), and r is your connection to your Redis server, you can do this like so:
r = redis.Redis(...)
pipe = r.pipeline()

items = ['ratelimiter:foobar:201401011157',
         'ratelimiter:foobar:201401011158',
         'ratelimiter:foobar:201401011159',
         'ratelimiter:foobar:201401011200']

for item in items:
    pipe.get(item)

# All the values for each item you're getting from Redis will be here.
item_values = pipe.execute()
Note: this will only make one call to Redis and will be much faster than either getting each value individually or running a pattern selection.
All of the other answers so far are good Python answers, but you're dealing with a Redis problem. You need a Redis answer.

Skipping a pattern of elements using itertools and accompanying list

I have some code that is slow (30-60 minutes by last count) that I need to optimize; it is a data extraction script for Abaqus, for a structural engineering model. The worst part of the script is the loop where it iterates through the object model database, first by frame (i.e. the time in the time history of the simulation) and, nested under this, by each of the nodes. The silly thing is that there are ~100k 'nodes' but only about ~20k useful nodes. Luckily for me, the nodes are always in the same order, meaning I do not need to look up each node's uniqueLabel; I can do that in a separate loop once and then filter what I get at the end. That is why I have dumped everything into one list and then removed all the nodes that are repeats. But as you can see from the code:
timeValues = []
peeqValues = []
for frame in frames:  # 760 loops
    setValues = frame.fieldOutputs['###fieldOutputType'].getSubset(
        region=abaqusSet, position=ELEMENT_NODAL).values
    timeValues.append(frame.frameValue)
    for value in setValues:  # 100k loops
        peeqValues.append(value.data)
it still makes the value.data calls unnecessarily, about ~80k times. If anyone is familiar with Abaqus odb (object database) objects, they're super slow under Python. To add insult to injury, they only run in a single thread, under Abaqus, which has its own Python version (2.6.x) and packages (so e.g. numpy is available, pandas is not). Another thing that may be annoying is the fact that you can address the objects by position, e.g. frames[-1] gives you the last frame, but you cannot slice, so e.g. you can't do this: for frame in frames[0:10]: # iterate first 10 elements.
I don't have any experience with itertools, but I'd want to provide it a list of nodeIDs (or a list of True/False) to map onto the setValues. The length and pattern of setValues to skip are always the same for each of the 760 frames. Maybe something like:
for frame in frames:  # still 760 calls
    setValues = frame.fieldOutputs['###fieldOutputType'].getSubset(
        region=abaqusSet, position=ELEMENT_NODAL).values
    timeValues.append(frame.frameValue)
    # nodeSet_IDs_TF = [True, True, False, False, False, ...] same length as
    # setValues
    filteredSetValues = ifilter(nodeSet_IDs_TF, setValues)
    for value in filteredSetValues:  # only 20k calls
        peeqValues.append(value.data)
Any other tips are also appreciated; after this I wanted to "avoid the dots" by removing the .append() lookup from the loop, and to put the whole thing in a function to see if it helps. The whole script already runs in under 1.5 hours (down from 6, and at one point 21 hours), but once you start optimizing there is no way to stop.
Memory considerations are also appreciated; I run these on a cluster, and I believe I once got away with 80 GB of RAM. The scripts definitely work with 160 GB; the issue is getting the resources allocated to me.
I've searched around for a solution, but maybe I'm using the wrong keywords; I'm sure this is not an uncommon issue in looping.
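A side thought on the slicing limitation above: since frames can be iterated (the main loop does exactly that), itertools.islice should allow looking at just the first few frames without slicing. A small sketch I have not tested against the odb API:

from itertools import islice

for frame in islice(frames, 10):  # first 10 frames only, no slicing needed
    print(frame.frameValue)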
EDIT 1
Here is what I ended up using:
# there is no compress under 2.6.x ... so use the equivalent recipe:
from itertools import izip

def compress(data, selectors):
    # compress('ABCDEF', [1,0,1,0,1,1]) --> ACEF
    return (d for d, s in izip(data, selectors) if s)

def iterateOdb(frames, selectors):  # minor speed up
    peeqValues = []
    timeValues = []
    append = peeqValues.append  # minor speed up
    for frame in frames:
        setValues = frame.fieldOutputs['###fieldOutputType'].getSubset(
            region=abaqusSet, position=ELEMENT_NODAL).values
        timeValues.append(frame.frameValue)
        for value in compress(setValues, selectors):  # massive speed up
            append(value.data)
    return peeqValues, timeValues

peeqValues, timeValues = iterateOdb(frames, selectors)
The biggest improvement came from using the compress(values, selectors) approach (the whole script, including the odb portion, went from ~1:30 hours to 25 minutes). There was also a minor improvement from append = peeqValues.append, as well as from enclosing everything in def iterateOdb(frames, selectors).
I used tips from: https://wiki.python.org/moin/PythonSpeed/PerformanceTips
Thanks to everyone for answering & helping!
If you're not confident with itertools, try using an if statement in your for loop first, e.g.:
for index, item in enumerate(values):
    if not selectors[index]:
        continue
    ...
# where selectors is a truth array like nodeSet_IDs_TF
This way you can be more sure that you are getting the correct behaviour, and you will get most of the performance increase you would get from using itertools.
The itertools equivalent is compress:
for item in compress(values, selectors):
    ...
I'm not familiar with Abaqus, but the best optimisation you could achieve would be to see if there is any way to give Abaqus your selectors, so it doesn't have to waste time creating each value only for it to be thrown away. If Abaqus is used for doing large array-based manipulations of data, then it's likely this is the case.
Another variant in addition to those in Dunes's solution:
for value, selector in zip(setValues, selectors):
    if selector:
        peeqValues.append(value.data)
If you want to keep the output list length the same as the setValues length, then add an else clause:
for value, selector in zip(setValues, selectors):
    if selector:
        peeqValues.append(value.data)
    else:
        peeqValues.append(None)
selectors is here a vector of True/False values with the same length as setValues.
In this case it is really a matter of taste which one you like. If the full iteration of 76 million nodes (760 x 100,000) takes 30 minutes, the time is not spent in Python's loops.
I tried this:
def loopit(a):
    for i in range(760):
        for j in range(100000):
            a = a + 1
    return a
IPython's %timeit reports the loop time as 3.54 s, so the looping itself accounts for roughly 0.2 % of the total 30 minutes.
