Any way to split a Python string without generating new strings? - python

The input is a string containing a huge number of characters, and I want to split it into a list of strings on a special delimiter.
But I suspect that simply using split would generate new strings rather than splitting the original input string in place, and in that case it would consume a lot of memory (it's guaranteed that the original string will not be used any longer).
So is there a convenient way to do this destructive split?
Here is the case:
input_string = 'data1 data2 <...> dataN'
output_list = ['data1', 'data2', <...> 'dataN']
What I hope is that data1 (and all the others) in output_list shares the same memory area as the corresponding data1 in input_string.
BTW, each input string is 10MB-20MB in size; but since there are lots of such strings (about 100), I guess memory consumption should be taken into consideration here?

In Python, strings are immutable. This means that any operation that changes a string actually creates a new string. If you are worried about memory (although this shouldn't be much of an issue unless you are dealing with gigantic strings), you can always overwrite the old string with the new, modified one.
The situation you are describing is a little different though, because the input to split is a string and the output is a list of strings. They are different types. In this case, I would just create a new variable containing the output of split and then set the old string (that was the input to the split function) to None, since you guarantee it will not be used again.
Code:
split_str = input_string.split(delim)
input_string = None

The only alternative would be to access the substrings by slicing instead of using split, finding the position of each delimiter with str.find. However, this would be slow and fiddly (sketched below); if you can use split and let the original string drop out of scope, that is probably the better option.
You say that this string is input, so you might also consider reading a smaller number of characters at a time so that you are dealing with more manageable chunks. Do you really need all the data in memory at the same time?
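For reference, here is a minimal sketch of that slicing approach (the generator name is my own, and it assumes the question's input_string with a single space as delimiter). Each yielded piece is still a new string; the saving is that only one piece needs to exist at a time:

def iter_pieces(s, delim):
    # Lazily yield the pieces of s between occurrences of delim,
    # locating each delimiter with str.find.
    start = 0
    while True:
        end = s.find(delim, start)
        if end == -1:
            yield s[start:]
            return
        yield s[start:end]
        start = end + len(delim)

for piece in iter_pieces(input_string, ' '):
    ...  # process one piece, then let it be garbage-collected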

Perhaps the Pythonic way would be to use iterators? That way, only one new substring is in memory at a time. Based on
Splitting a string into an iterator:
import re

string_long = "my_string " * 100000000  # takes a lot of memory
# strings_split = string_long.split()   # would take far too much memory
strings_reiter = re.finditer(r"(\S*)\s*", string_long)  # lazy: no extra memory up front
for match in strings_reiter:
    print(match.group())
This works fine without leading to memory problems.

If you're talking about strings so huge that you can't afford to hold all the split copies in memory, then running through the string once (O(n), probably improvable with str.find) and storing a generator over slice objects may be more memory-efficient:
long_string = "abc,def,ghi,jkl,mno,pqr"  # ad nauseam
splitters = [',']  # add whatever you want to split by
marks = [i for i, ch in enumerate(long_string) if ch in splitters]  # delimiter positions
slices = []
start = 0
for end in marks:
    slices.append(slice(start, end))
    start = end + 1
slices.append(slice(start, None))  # the tail after the last delimiter
split_string = (long_string[slice_] for slice_ in slices)
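Consuming the generator is then plain iteration; note that long_string[slice_] still builds a new string for each piece as it is consumed, the saving being that only one piece is materialized at a time:

for piece in split_string:
    ...  # handle one piece, then let it be garbage-collected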

Related

Approaches for finding matches in a large dataset

I have a project where, given a list of ~10,000 unique strings, I want to find where those strings occur in a file with 10,000,000+ string entries. I also want to include partial matches if possible. My list of ~10,000 strings is dynamic data that updates every 30 minutes, and currently the searching can't keep up with the updates. My searches take about 3 hours now (compared to the 30 minutes I have to do the search within), so I feel my approach to this problem isn't quite right.
My current approach is to first create a list from the 10,000,000+ string entries. Then each item from the dynamic list is searched for in the larger list using an in check.
results_boolean = [keyword in n for n in string_data]
Is there a way I can greatly speed this up with a more appropriate approach?
Using a generator with a set is probably your best bet ... I think this solution will work, and it should presumably be faster:
def find_matches(target_words, filename_to_search):
    targets = set(target_words)
    with open(filename_to_search) as f:
        for line_no, line in enumerate(f):
            matching_intersection = targets.intersection(line.split())
            if matching_intersection:
                yield (line_no, line, matching_intersection)  # there was a match

for match in find_matches(["unique", "list", "of", "strings"], "search_me.txt"):
    print("Match: %s" % (match,))
    input("Hit Enter for next match:")  # py3 ... just to step through your matches
Of course it gets harder if your matches are not single words, especially if there is no reliable grouping delimiter.
In general, you would want to preprocess the large, unchanging data in some way to speed up repeated searches. But you have said too little to suggest something clearly practical. For example: how long are these strings? What's the alphabet (e.g., 7-bit ASCII or full-blown Unicode)? How many characters are there in total? Are characters in the alphabet equally likely to appear in each string position, or is the distribution highly skewed? If so, how? And so on.
Here's about the simplest kind of indexing: building a dict with one entry for each unique character across all of string_data. It maps each character to the set of string_data indices of the strings containing that character. A search for a keyword can then be restricted to the string_data entries already known to contain the keyword's first character.
Now, depending on details that can't be guessed from what you said, it's possible even this modest indexing will consume more RAM than you have - or it's possible that it's already more than good enough to get you the 6x speedup you seem to need:
# Preprocessing - do this just once, when string_data changes.
def build_map(string_data):
    from collections import defaultdict
    ch2ixs = defaultdict(set)
    for i, s in enumerate(string_data):
        for ch in s:
            ch2ixs[ch].add(i)
    return ch2ixs

def find_partial_matches(keywords, string_data, ch2ixs):
    for keyword in keywords:
        ch = keyword[0]
        if ch in ch2ixs:
            result = []
            for i in ch2ixs[ch]:
                if keyword in string_data[i]:
                    result.append(i)
            if result:
                print(repr(keyword), "found in strings", result)
Then, e.g.,
string_data = ['banana', 'bandana', 'bandito']
ch2ixs = build_map(string_data)
find_partial_matches(['ban', 'i', 'dana', 'xyz', 'na'],
                     string_data,
                     ch2ixs)
displays:
'ban' found in strings [0, 1, 2]
'i' found in strings [2]
'dana' found in strings [1]
'na' found in strings [0, 1]
If, e.g., you still have plenty of RAM, but need more speed, and are willing to give up on (probably silly - but can't guess from here) 1-character matches, you could index bigrams (adjacent letter pairs) instead.
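As a rough illustration of that idea (the helper names below are mine, not from the answer above), a bigram index could look like this:

from collections import defaultdict

def build_bigram_map(string_data):
    # Map each adjacent character pair to the set of string indices containing it.
    bg2ixs = defaultdict(set)
    for i, s in enumerate(string_data):
        for a, b in zip(s, s[1:]):
            bg2ixs[a + b].add(i)
    return bg2ixs

def find_partial_matches_bigram(keywords, string_data, bg2ixs):
    for keyword in keywords:
        if len(keyword) < 2:
            continue  # a bigram index can't answer 1-character queries
        candidates = bg2ixs.get(keyword[:2], ())
        result = [i for i in candidates if keyword in string_data[i]]
        if result:
            print(repr(keyword), "found in strings", sorted(result))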
In the limit, you could build a trie out of string_data, which would require lots of RAM, but could reduce the time to search for an embedded keyword to a number of operations proportional to the number of characters in the keyword, independent of how many strings are in string_data.
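One way to read "build a trie" for embedded (substring) search is a suffix trie; the sketch below is my own interpretation and is deliberately naive about memory:

def build_suffix_trie(string_data):
    # Insert every suffix of every string; each node records which strings reach it.
    root = {}
    for i, s in enumerate(string_data):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.setdefault(ch, {})
                node.setdefault(None, set()).add(i)
    return root

def trie_find(root, keyword):
    # Walk the keyword; the cost is proportional to len(keyword).
    node = root
    for ch in keyword:
        if ch not in node:
            return set()
        node = node[ch]
    return node.get(None, set())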
Note that you should really find a way to get rid of this:
results_boolean = [keyword in n for n in string_data]
Building a list with over 10 million entries for every keyword search makes every search expensive, no matter how cleverly you index the data.
Note: a probably practical refinement of the above is to restrict the search to strings that contain all of the keyword's characters:
def find_partial_matches(keywords, string_data, ch2ixs):
    for keyword in keywords:
        keyset = set(keyword)
        if all(ch in ch2ixs for ch in keyset):
            ixs = set.intersection(*(ch2ixs[ch] for ch in keyset))
            result = []
            for i in ixs:
                if keyword in string_data[i]:
                    result.append(i)
            if result:
                print(repr(keyword), "found in strings", result)

Using split after a set statement in Python

I have a list of words (equivalent to about two full sentences) and I want to split it into two parts: one part containing 90% of the words and another part containing 10% of them. After that, I want to print a list of the unique words within the 10% list, lexicographically sorted. This is what I have so far:
pos_90 = (90*len(words)) // 100 #list with 90% of the words
pos_90 = pos_90 + 1 #I incremented the number by 1 in order to use it as an index
pos_10 = (10*len(words)) // 100 #list with 10% of the words
list_90 = words[:pos_90] #Creation of the 90% list
list_10 = words[pos_10:] #Creation of the 10% list
uniq_10 = set(list_10) #List of unique words out of the 10% list
split_10 = uniq_10.split()
sorted_10 = split_10.sort()
print(sorted_10)
I get an error saying that split cannot be applied to set, so I assume my mistake must be in the last lines of code. Any idea about what I'm missing here?
split only makes sense when converting from one long str to a list of the components of said str. If the input was in the form 'word1 word2 word3', yes, split would convert that str to ['word1', 'word2', 'word3'], but your input is a set, and there is no sane way to "split" a set like you seem to want; it's already a bag of separated items.
All you really need to do is convert your set back to a sorted list. Replace:
split_10 = uniq_10.split()
sorted_10 = split_10.sort()
with either:
sorted_10 = list(uniq_10)
sorted_10.sort() # NEVER assign the result of .sort(); it's always going to be None
or the simpler one-liner that encompasses both listifying and sorting:
sorted_10 = sorted(uniq_10) # sorted, unlike list.sort, returns a new list
The final option is generally the most Pythonic approach to converting an arbitrary iterable to list and sorting that new list, returning the result. It doesn't mutate the input, doesn't rely on the input being a specific type (set, tuple, list, it doesn't matter), and it's simpler to boot. You only use list.sort() when you already have a known list, and don't mind mutating it.

How to speed up string match against a list of strings?

I have a list of strings. I am trying to find if any of these strings in the list appear in the english dictionary stored as another list.
I observed the time it takes to find a match grows linearly. However, it becomes way too long when the original list has a few thousand strings.
On my development EC2 instance, it takes ~2 seconds for 100 strings, ~15 seconds for 700 strings, ~100 seconds for 5000 strings, and ~800 seconds for 40000 strings!
Is there a way to speed this up? Thanks in advance.
matching_word = ""
for w in all_strings:
    if w in english_dict:
        if matching_word:  # More than one possible word
            matching_word = matching_word + ", " + w
        else:
            matching_word = w
Instead of building up a string piece by piece, you can use a list comprehension:
matching_words = [x for x in all_strings if x in english_dict]
Now you can make a single string from that list using ", ".join(matching_words).
Another option: convert both lists to sets and use the & operator:
set(all_strings) & set(english_dict)
The result is a set of the items that appear in both lists.
Provided you don't have memory constraints, convert your english_dict to a set (and if you do have memory constraints, load the dictionary as a set to begin with): english_dict = set(english_dict) (prior to the loop, of course).
That should significantly speed up the look-ups, as shown below. If that's not enough, you'll have to resort to search trees and similar search optimizations.
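Putting that together with the original loop (a minimal sketch, assuming all_strings and english_dict are already loaded):

english_set = set(english_dict)  # one-time conversion; membership tests become O(1) on average
matching_words = [w for w in all_strings if w in english_set]
print(", ".join(matching_words))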

Swapping pairs of characters in a string

Okay, I'm really new to Python and have no idea how to do this:
I need to take a string, say 'ABAB__AB', convert it to a list, and then take the leading index of the pair I want to move and swap that pair with the __. I think the output should look something like this:
move_chars('ABAB__AB', 0)
'__ABABAB'
and another example:
move_chars('__ABABAB', 3)
'BAA__BAB'
Honestly have no idea how to do it.
Python strings are immutable, so you can't really modify a string. Instead, you make a new string.
If you want to be able to modify individual characters in a string, you can convert it to a list of characters, work on it, then join the list back into a string.
def swap_first_two(s):
    chars = list(s)
    # work on the list of characters,
    # for example swap the first two
    chars[0], chars[1] = chars[1], chars[0]
    return ''.join(chars)
I think this should go in the comment section, but I can't comment for lack of reputation, so...
You'll probably want to stick with index-based swapping rather than using .pop() and .append(). .pop() can remove an element from an arbitrary index, but only one at a time, and .append() can only add to the end of the list, so they're quite limited and would complicate your code for this kind of problem.
So, better to stick with swapping by index.
The trick is to use list slicing to move parts of the string.
def move_chars(s, index):
    to_index = s.find('__')               # index of the destination underscores
    chars = list(s)                       # make a mutable list
    to_move = chars[index:index+2]        # grab the chars to move
    chars[index:index+2] = '__'           # replace them with underscores
    chars[to_index:to_index+2] = to_move  # replace the underscores with the chars
    return ''.join(chars)                 # stitch it all back together
print(move_chars('ABAB__AB', 0))
print(move_chars('__ABABAB', 3))

Breaking 1 String into 2 Strings based on special characters using python

I am working with python and I am new to it. I am looking for a way to take a string and split it into two smaller strings. An example of the string is below
wholeString = '102..109'
And what I am trying to get is:
a = '102'
b = '109'
The information will always be separated by two periods like shown above, but the number of characters before and after can range anywhere from 1 - 10 characters in length. I am writing a loop that counts characters before and after the periods and then makes a slice based on those counts, but I was wondering if there was a more elegant way that someone knew about.
Thanks!
Try this:
a, b = wholeString.split('..')
It'll put each value into the corresponding variable.
Look at the str.split method.
split_up = [s.strip() for s in wholeString.split("..")]
This code will also strip off leading and trailing whitespace so you are just left with the values you are looking for. split_up will be a list of these values.
