Let's say I have a pretty large text document.
I need to imitate the copy and paste operations of a text editor.
More concretely,
I want to write two functions, copy(i, j) and paste(i), where i and j are indices of characters in the text document.
Now, I understand that normal string slicing creates a new string object every time and doing something like
copy_text = str[i:j]
self.str = str[:i] + copy_text + str[j:]
will end up creating a lot of overhead of new string objects, given how many times we do a copy-paste function in a text editor.
How do I do it? Is it even possible?
I've looked into memoryview, which uses a buffer to create a zero-copy view of the original object and so may end up taking less time to execute. However, I want to do it algorithmically and not play around with how strings are stored.
I've been thinking along the lines of an array to store the string and a B+ tree to store pointers into that string, but I haven't been able to really materialize anything.
Look forward to your comments. Thanks.
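For what it's worth, the array-plus-pointers idea is close to what is usually called a piece table. Below is a minimal sketch, assuming the document is only ever rearranged by copy and paste (no typing of new text) and that a plain list of spans is acceptable. The class name PieceTable and its helper _split are made up for illustration, and the lookup here is linear in the number of pieces; a balanced tree over the pieces (the B+ tree idea) would be the next step.

```python
class PieceTable:
    """Toy piece table: the text is a list of (start, length) spans into one
    backing string, so copy and paste move spans around, not characters."""

    def __init__(self, text):
        self.buffer = text                       # backing store, never modified here
        self.pieces = [(0, len(text))] if text else []
        self.clipboard = []                      # spans captured by copy()

    def _split(self, pos):
        """Ensure a piece boundary exists at document offset pos and
        return the index of the piece that starts there."""
        if pos < 0:
            raise IndexError("position out of range")
        offset = 0
        for k, (start, length) in enumerate(self.pieces):
            if pos == offset:
                return k
            if pos < offset + length:
                left = pos - offset
                self.pieces[k:k + 1] = [(start, left), (start + left, length - left)]
                return k + 1
            offset += length
        if pos == offset:
            return len(self.pieces)
        raise IndexError("position out of range")

    def copy(self, i, j):
        """Remember the spans covering text[i:j]; no characters are copied."""
        if i > j:
            raise ValueError("i must not exceed j")
        lo = self._split(i)      # split at i first so lo stays valid
        hi = self._split(j)
        self.clipboard = self.pieces[lo:hi]

    def paste(self, i):
        """Insert the remembered spans at document offset i."""
        k = self._split(i)
        self.pieces[k:k] = self.clipboard

    def text(self):
        """Materialize the document (only needed for display or saving)."""
        return "".join(self.buffer[s:s + n] for s, n in self.pieces)


doc = PieceTable("hello world")
doc.copy(0, 5)       # remember the span for "hello"
doc.paste(11)        # append it at the end
print(doc.text())    # hello worldhello
```

The point of the design is that copy and paste only touch small tuples; the characters themselves are joined once, in text(), when the document actually has to be shown or saved.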
I have a CSV file with 100,000 rows.
Each row in column A is a sentence comprised of both chars and integers.
I want column B to contain only integers.
I want the new columns to be in the same CSV file.
How can I accomplish this?
If I'm understanding your question correctly, I would use .isdigit() to parse the data in column A. I'm frankly not sure what the format of column A is, so I don't know exactly what you would do with this (if you gave more information I could give a more specific answer). Your solution will likely come in a similar form to this:
def find(lines):
    B = []
    for line in lines:
        numbers = [c for c in line if c.isdigit()]
        # current is the concatenation of all
        # digits found in column A from left to right
        # (None when the row contains no digits)
        current = int(''.join(numbers)) if numbers else None
        B.append(current)
    return B
Let me know if this makes sense or is even on the right track for your solution. Once again, without knowing what you're trying to do, and what A looks like, I'm not sure what your actual goals are.
EDIT
I'm not going to explain the csv stuff for you, mainly because there is a fantastic library for it included in Python (the csv module), and its documentation is a good resource. If you have specific questions related to writing csv, definitely post them.
It sounds like you essentially want to pull int values out of column A then add them to a new column B. There are definitely many ways to solve this, but the general form of the problem is for each row you'll filter out the int, then you'll add the filtered int into the new column. I'll list a couple:
Regex: You could use a pattern such as [0-9]+ to pull the digit string out of A, then use int(whatever that output is) to cast to int, then store those values in B. I'm a sucker for a good regular expression and this one is fairly straightforward. Regexr is a great resource to learn about this and test your pattern.
Use an algorithm similar to above: The above algorithm worked before, but I've updated it slightly. Now that it's been updated it'll return an array of numbers corresponding to the numbers in A from left to right. This is relatively sound, but it doesn't guarantee you have the right integer: if the title has an int in it, its digits get concatenated with the rest. It is likely one of the clearer ways of doing this, though.
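A minimal sketch of the regex option, assuming column A is the first column, there is no header row, and you want the first run of digits in each row (rows without digits get an empty cell in B). The sample rows and the helper name add_int_column are made up for illustration.

```python
import csv
import io
import re

DIGITS = re.compile(r"[0-9]+")

def add_int_column(rows):
    """Yield each row as [A, B], where B is the first integer found in A."""
    for row in rows:
        if not row:              # skip blank lines
            continue
        match = DIGITS.search(row[0])
        yield [row[0], int(match.group()) if match else ""]

# In-memory stand-in for the real file. With a real CSV, pass
# open(path, newline="") to csv.reader and write to a new file first.
source = io.StringIO("order 66 shipped\nno numbers here\nroom 101\n")
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(add_int_column(csv.reader(source)))
print(out.getvalue())
```

Writing to a separate file and replacing the original afterwards is the safer route; rewriting a CSV in place while reading it is easy to get wrong.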
I have a Python class that does some currency conversion and string formatting of numbers. It takes polymorphic input, but only spits out a stringified number. I can push those stringified numbers up to a LibreOffice Calc in Python easy enough:
stringifiednumber = str("1.01")
cell_a1 = sheet1.getCellRange("A1")
cell_a1.String = stringifiednumber
This actually works nicely since the builtin currency formats in Calc work just fine with stringified numbers.
What doesn't work, or only sort of works, is formulas. Calling SUM(A1:A2) will not see the stringified A1. There is a workaround (forgive me, it is late and I forget it exactly, but it is similar to) =SUMPRODUCT(VALUE(A1:A2)).
As I understand it, each cell has a memory location for a number, a string, and a formula. The formula only acts on the VALUE memory location.
Through the spreadsheet UI, I can convert one cell type to another during a copy. To do that I just put the following formula in A2, and it converts STRING(A1) to VALUE(A2):
# formula placed in A2
=VALUE(A1)
but that only works by copying one cell to another. Obviously there is an internal recasting function within the spreadsheet that is doing the conversion during the copy.
What I want to do is write a stringified number to the spreadsheet (as above) and then call the spreadsheet's native recasting function in place from Python, so that VALUE(A1) is recast from STRING(A1).
If I knew what the recasting function was I could just call it after every string write. This would make macros in the UI work like the user expects them to work.
If your answer is: "do type conversion Python-side", I've already considered that, and it is not the solution I'm looking for.
Based on your Title, multiply by 1:
I have a question about string usage in lists in Python for Maya. I am writing a script meant to take a selected object, then instance it 100 times with random translate, scale, and orient attributes. The script itself works and does what it's meant to, but I can't work out how to name the instances after the original object plus a suffix of the form "_instance#", where # is 1, 2, 3, etc., assigned in order to the copies of the original mesh. This is where I'm at so far:
#Capture selected objects, sort into list
thing = MC.ls(sl=True)
print thing
#Create instances of objects
instanceObj = MC.instance(thing, name='thing' + '_instance#')
This returns a result that looks like "thing_instance1, thing_instance2".
Following this, I figured the single quote around the string for the object was causing it to just name it "thing", so I attempted to write it as follows
MC.instance(thing, name=thing + '_instance1')
I guess because instance uses a list, it's not accepting the second usage of the string as valid and returns a concatenate error. I've tried rewriting this a few times and the closest I get is with
instanceObj = MC.instance(thing)
which results in a list of (pCube1,2,3,4), but is lacking the suffix.
I'm not sure where to go from here to end up with a result where the instanced objects are named with the convention "pCube1_instance1, pCube1_instance2" etc.
Any assistance would be appreciated.
It is not clear if you want to use only one source object or more. In any case the
MC.ls(sl=True)
returns a list of strings. And concatenating a list and a string does not work. So use thing[0] or simply
MC.ls(sl=True)[0]
If you get error messages, please always include the message in your question; it helps a lot to see which error appears.
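A small sketch of the naming step on its own, runnable without Maya. The hard-coded list is a stand-in for what MC.ls(sl=True) returns, and the count of 3 is arbitrary. Inside Maya, each resulting name could presumably be passed as name= to its own MC.instance call in a loop.

```python
# Stand-in for MC.ls(sl=True), which returns a list of names, not one string.
selection = ["pCube1"]

try:
    selection + "_instance1"      # list + str: this is the concatenation error
except TypeError as err:
    print(err)

source = selection[0]             # pick the string out of the list first
names = ["{}_instance{}".format(source, n) for n in range(1, 4)]
print(names)
```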
I have some text data in a pandas column. Basically each document is one value in the column. Each document is several sentences long.
I want to split each document into sentences, and then for each sentence get a list of words. So if a document is 5 sentences long, I will have a list of 5 lists of words.
I used a mapper function to do some operations on that and got a list of words for each sentence of a text. Here is the mapper code:
def text_to_words(x):
    """ This function converts sentences in a text to a list of words
    """
    nlp=spacy.load('en')
    txt_to_words= [str(doc).replace(".","").split(" ") for doc in nlp(x).sents]
    return txt_to_words
Then I did this:
%%time
txt_to_words=map(text_to_words,pandas_df.log_text_cleaned)
It finished in 70 microseconds and I got a map iterator.
Now I want to add the list of lists of words for each document as a new column in the same pandas data frame.
I can simply do this:
txt_to_words=[*map(text_to_words,pandas_df.log_text_cleaned)]
Which will expand the map iterator and store it in txt_to_words as list of list of words.
But this process is very slow.
I even tried looping over the map object :
txt_to_words=map(text_to_words,pandas_df.log_text_cleaned)
txt_to_words_list=[]
for sent in txt_to_words:
    txt_to_words_list.append(sent)
But this is similarly slow.
Extracting the output from a map object is very slow, and I only have 67K documents in that pandas data frame column.
Is there a way this can be sped up?
Thanks
The direct answer to your question is that the fastest way to convert an iterator to a list is probably by calling list on it, although that may depend on the size of your lists.
However, this is not going to matter, except to an unnoticeable, barely-measurable degree.
The difference between list(m), [*m], or even an explicit for statement is a matter of microseconds at most, but your code is taking seconds. In fact, you could even eliminate almost all the work done by list by using collections.deque(m, maxlen=0) (which just throws away all of the values without allocating anything or storing them), and you still won't see a difference.
Your real problem is that the work done for each element is slow.
Calling map doesn't actually do that work. All it does is construct a lazy iterator that sets up the work to be done later. When is later? When you convert the iterator to a list (or consume it in some other way).
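A tiny demonstration of that laziness, assuming Python 3 (where map returns an iterator). The function slow_step is a made-up stand-in for the expensive per-document work; it only counts how often it is called.

```python
from collections import deque

calls = 0

def slow_step(x):
    """Stand-in for the expensive per-document work; it only counts calls."""
    global calls
    calls += 1
    return x * 2

m = map(slow_step, range(5))
calls_after_map = calls          # still 0: nothing has run yet

deque(m, maxlen=0)               # consume the iterator, discarding the results
calls_after_consume = calls      # now 5: the work happened here

print(calls_after_map, calls_after_consume)
```

So timing the map call alone measures almost nothing; the cost shows up wherever the iterator is consumed.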
So, it's that text_to_words function that you need to speed up.
And there's at least one obvious candidate for how to do that:
def text_to_words(x):
    """ This function converts sentences in a text to a list of words
    """
    nlp=spacy.load('en')
    txt_to_words= [str(doc).replace(".","").split(" ") for doc in nlp(x).sents]
    return txt_to_words
You're loading in an entire English tokenizer/dictionary/etc. on every call, that is, once for each document? Sure, you'll get some benefit from caching after the first time, but I'll bet it's still way too slow to do for every document.
If you were trying to speed things up by making it a local variable rather than a global (which probably won't matter, but it might), that's not the way to do it; this is:
nlp=spacy.load('en')
def text_to_words(x, *, _nlp=nlp):
    """ This function converts sentences in a text to a list of words
    """
    txt_to_words= [str(doc).replace(".","").split(" ") for doc in _nlp(x).sents]
    return txt_to_words
Let's say I need to save a matrix (each line corresponds to one row) that could be loaded from Fortran later. What method should I prefer? Is converting everything to strings the only approach?
You can save them in binary format as well. Please see the documentation on the struct standard module; it has a pack function for converting Python objects into binary data.
For example:
import struct
value = 3.141592654
data = struct.pack('d', value)
# the with block closes the file even if the write fails
with open('file.ext', 'wb') as f:
    f.write(data)
You can convert each element of your matrix and write to a file. Fortran should be able to load that binary data. You can speed up the process by converting a row as a whole, like this:
row_data = struct.pack('d' * len(matrix_row), *matrix_row)
Please note, that 'd' * len(matrix_row) is a constant for your matrix size, so you need to calculate that format string only once.
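Putting the row-wise idea together, here is a minimal sketch that writes a whole matrix and reads it back to check the layout. The '<' prefix (little-endian, standard 8-byte doubles) is an assumption about what the reading side expects, and the file name and sample matrix are made up.

```python
import os
import struct
import tempfile

matrix = [[1.0, 2.5, 3.0], [4.0, 5.5, 6.0]]

# '<' pins little-endian, standard-size doubles (an assumption about the reader).
row_format = "<" + "d" * len(matrix[0])     # built once, reused for every row
row_size = struct.calcsize(row_format)

path = os.path.join(tempfile.mkdtemp(), "matrix.bin")   # hypothetical file name
with open(path, "wb") as f:
    for row in matrix:
        f.write(struct.pack(row_format, *row))

# Read it back to confirm the layout round-trips.
with open(path, "rb") as f:
    restored = [list(struct.unpack(row_format, f.read(row_size))) for _ in matrix]
print(restored == matrix)
```

Note that the rows are written one after another (row-major), while Fortran stores arrays column-major, so the reading side would need to read row by row or transpose after loading.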
I don't know Fortran, so it's hard to tell what is easy for you to parse on that side.
It sounds like your options are either saving the doubles in plaintext (meaning, 'converting' them to strings), or in binary (using struct and the like). Which one is better depends on your situation.
I would go with the plaintext solution, as it means the files will be easily readable, and you won't have to mess with low-level details (endianness, default double sizes).
But, there are cases where binary is better (for example, if you have a really big list of doubles and space is of importance, or if it is easier for you to parse it and you need the optimization) - but this is likely not your case.
You can use JSON:
import json
matrix = [[2.3452452435, 3.34134], [4.5, 7.9]]
data = json.dumps(matrix)
# json.dumps returns text, so open the file in text mode
with open('file.ext', 'w') as f:
    f.write(data)
File content will look like:
[[2.3452452435, 3.34134], [4.5, 7.9]]
If legibility and ease of access are important (and file size is reasonable), Fortran can easily parse a simple array of numbers, at least if it knows the size of the matrix beforehand (with something like READ(FILE_ID, '2(F)'), I think):
1.234 5.6789e4
3.1415 9.265358978
42 ...
Two nested for loops in your Python code can easily write your matrix in this form.
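A minimal sketch of those two loops, with a read-back in Python to check that nothing is lost. The file name and sample values are made up; repr is used so each float is written with enough digits to be recovered exactly.

```python
import os
import tempfile

matrix = [[1.234, 5.6789e4], [3.1415, 9.265358978]]

path = os.path.join(tempfile.mkdtemp(), "matrix.txt")   # hypothetical file name
with open(path, "w") as f:
    for row in matrix:
        for value in row:
            f.write("{!r} ".format(value))   # repr keeps the float's full precision
        f.write("\n")                        # one matrix row per line

# Read it back to confirm the text form round-trips.
with open(path) as f:
    restored = [[float(tok) for tok in line.split()] for line in f]
print(restored == matrix)
```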