Python .join() multiplies total string characters across multiple lines


I am working on a personal project to improve my understanding of looping and concatenating strings from multiple sources in Python 3.4.2.
My goal is to take characters from 'string' and use join (calling __len__() inside) to build up a new string, but the resulting lengths are being multiplied. I would like the lengths to be 5, then 10, then 15. Right now they come out as 5, then 25, then 105. If I keep going I get 425, 1705, 6825, etc.
I hope I'm missing something simple, but any help would be amazing. I'm also trying to do my joins efficiently (I know the prints aren't efficient; they are there for debugging purposes).
I used an online Python visualization tool to step through the code and see if I could figure it out, but I am still missing something.
http://www.pythontutor.com/visualize.html#mode=edit
Thank you in advance!
import random
def main():
    #String values will be pulled from
    string = 'valuehereisheldbythebeholderwhomrolledadtwentyandcriticalmissed'
    #Initial string creation
    strTest = ''
    print('strTest Blank: ' + strTest)
    #first round string generation
    strTest = strTest.join([string[randomIndex(string.__len__())] for i in range(randomLength())])
    print('strTest 1: ' + strTest)
    print('strTest 1 length: ' + str(strTest.__len__()))
    #second round string generation
    strTest = strTest.join([string[randomIndex(string.__len__())] for i in range(randomLength())])
    print('strTest 2: ' + strTest)
    print('strTest 2 length: ' + str(strTest.__len__()))
    #final round string generation
    strTest = strTest.join([string[randomIndex(string.__len__())] for i in range(randomLength())])
    print('strTest 3: ' + strTest)
    print('strTest 3 length: ' + str(strTest.__len__()))
def randomIndex(index):
    #create random value between position 0 and total string length to generate string
    return random.randint(0,index)
def randomLength():
    #return random length for string creation, static for testing
    return 5
    #return random.randint(10,100)
main()
# output desired is
# strTest 1 length: 5
# strTest 2 length: 10
# strTest 3 length: 15

The code runs without any error. What is actually happening is that each time you call strTest.join(...), the previous value of strTest is used as the separator between the new random characters. Five new characters have four gaps between them, so the lengths grow as 5, then 5 + 4*5 = 25, then 5 + 4*25 = 105.
Quoting from Python Doc:
str.join(iterable) Return a string which is the concatenation of the
strings in the iterable iterable. A TypeError will be raised if there
are any non-string values in iterable, including bytes objects. The
separator between elements is the string providing this method.
Example:
>>> s = 'ALPHA'
>>> '*'.join(s)
'A*L*P*H*A'
>>> s = 'TEST'
>>> ss = '-long-string-'
>>> ss.join(s)
'T-long-string-E-long-string-S-long-string-T'
So probably you want something like:
strTest = strTest + ''.join([string[randomIndex(string.__len__())] for i in range(randomLength())])
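The growth pattern can be reproduced with a small sketch. It assumes a fixed five-character chunk in place of the random characters, and the names chunk, buggy, and fixed are illustrative only, not taken from the question.

```python
# Each chunk stands in for the five random characters generated per round.
chunk = list('abcde')

# Buggy pattern: the previous result becomes the separator between characters.
buggy = ''
buggy_lengths = []
for _ in range(3):
    buggy = buggy.join(chunk)
    buggy_lengths.append(len(buggy))

# Fixed pattern: join with an empty separator, then append to the result.
fixed = ''
fixed_lengths = []
for _ in range(3):
    fixed += ''.join(chunk)
    fixed_lengths.append(len(fixed))

print(buggy_lengths)  # [5, 25, 105]
print(fixed_lengths)  # [5, 10, 15]
```

If many rounds are needed, collecting the pieces in a list and calling ''.join once at the end is usually the more efficient choice.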

Related

Python closest match between two string columns

I am looking to get the closest match between two columns of string data type in two separate tables. I don't think the content matters too much. There are words that I can match by pre-processing the data (lower all letters, replace spaces and stop words, etc...) and doing a join. However I get around 80 matches out of over 350. It is important to know that the length of each table is different.
I did try to use some code I found online but it isn't working:
def Races_chien(df1,df2):
    myList = []
    total = len(df1)
    possibilities = list(df2['Rasse'])
    s = SequenceMatcher(isjunk=None, autojunk=False)
    for idx1, df1_str in enumerate(df1['Race']):
        my_str = ('Progress : ' + str(round((idx1 / total) * 100, 3)) + '%')
        sys.stdout.write('\r' + str(my_str))
        sys.stdout.flush()
        # get 1 best match that has a ratio of at least 0.7
        best_match = get_close_matches(df1_str, possibilities, 1, 0.7)
        s.set_seq2(df1_str, best_match)
        myList.append([df1_str, best_match, s.ratio()])
    return myList
It says: TypeError: set_seq2() takes 2 positional arguments but 3 were given
How can I make this work?

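I think you need s.set_seqs(df1_str, best_match) instead of s.set_seq2(df1_str, best_match): set_seq2 accepts only one sequence, while set_seqs accepts both. Note that get_close_matches returns a list, so pass its first element (best_match[0]) once you have checked that the list is not empty.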
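A minimal sketch of that approach using only difflib follows. It assumes plain lists of strings rather than DataFrame columns; the function name best_matches and the sample breed names are hypothetical.

```python
from difflib import SequenceMatcher, get_close_matches

def best_matches(names, possibilities, cutoff=0.7):
    """Pair each name with its closest candidate and the similarity ratio."""
    results = []
    for name in names:
        # get_close_matches returns a list (possibly empty), not a string.
        matches = get_close_matches(name, possibilities, n=1, cutoff=cutoff)
        if not matches:
            results.append((name, None, 0.0))
            continue
        best = matches[0]
        # set_seqs takes both sequences; set_seq1 and set_seq2 take one each.
        matcher = SequenceMatcher(isjunk=None, autojunk=False)
        matcher.set_seqs(name, best)
        results.append((name, best, matcher.ratio()))
    return results

pairs = best_matches(['beagle', 'xyz'], ['beagle', 'bagel', 'poodle'])
print(pairs)
```

Names with no candidate above the cutoff are kept with None as the match, so the result has one row per input name even when the two tables differ in length.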

You can use the jellyfish library, which has useful tools for comparing how similar two strings are, if that is what you are looking for.

Try changing:
s = SequenceMatcher(isjunk=None, autojunk=False)
To:
s = SequenceMatcher(None, isjunk=None, autojunk=False)

Here is an answer I finally got:
from fuzzywuzzy import process, fuzz
value = []
similarity = []
for i in df1.col:
    ratio = process.extract(i, df2.col, limit= 1)
    value.append(ratio[0][0])
    similarity.append(ratio[0][1])
df1['value'] = pd.Series(value)
df1['similarity'] = pd.Series(similarity)
This will add the value with the closest match from df2 in df1 together with the similarity %

Trouble slicing based on function +1

I am trying to search c_item_number_one = (r'12" Pipe SA-106 GR. B SCH 40 WALL smls'.upper()) for " to pull both it and all information in front of it. i.e. I want 12"
I thought I could just search for what position " is in...
def find_nps_via_comma_item_one():
    nps = '"'
    print(c_item_number_one.find(nps))
find_nps_via_comma_item_one()
The function above prints 2,
and then I slice off everything after it:
c_item_number_one = (r'12" Pipe SA-106 GR. B SCH 40 WALL smls'.upper())
def find_nps_via_comma_item_one():
    nps = '"'
    print(c_item_number_one.find(nps))
find_nps_via_comma_item_one()
item_one_nps = slice(3)
print(c_item_number_one[item_one_nps])
Issue: It is returning an error
print(c_item_number_one[item_one_nps])
TypeError: slice indices must be integers or None or have an __index__ method
How can I turn the results of my function into an integer? I've tried changing print(c_item_number_one.find(nps)) to return(c_item_number_one.find(nps)) but then it stopped giving a value entirely.
Lastly, the slice portion does not produce the full answer I am looking for 12". Even if I enter the value produced by the function 2
item_one_nps = slice(2)
print(c_item_number_one[item_one_nps])
It only gives me 12. I need to +1 the function results.

You could do
sep_char = "\""
c_item_number_one.split(sep_char)[0] + sep_char

print writes a value to the console, whereas return hands the value back to the place where the function was called.
In your code you only print the value and never store it. Even when you used return instead of print, you did not make use of the returned value.
The 1 is added because slicing excludes the stop index, so to include the character at that index you add 1.
c_item_number_one = (r'12" Pipe SA-106 GR. B SCH 40 WALL smls'.upper())
def find_nps_via_comma_item_one():
    nps = '"'
    return(c_item_number_one.find(nps))
item_one_nps = slice(find_nps_via_comma_item_one()+1)
print(c_item_number_one[item_one_nps])
The following code is more verbose.
c_item_number_one = (r'12" Pipe SA-106 GR. B SCH 40 WALL smls'.upper())
def find_nps_via_comma_item_one():
    nps = '"'
    return(c_item_number_one.find(nps))
index = find_nps_via_comma_item_one()
item_one_nps = slice(index+1)
print(c_item_number_one[item_one_nps])
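The same idea can be wrapped in a function that takes the text as a parameter instead of reading a global. This is only a sketch; the name through_first is hypothetical, and it also covers the case where the marker is missing, which the code above does not.

```python
def through_first(text, marker='"'):
    """Return text up to and including the first marker, or '' if absent."""
    index = text.find(marker)
    if index == -1:
        # find() returns -1 when the marker is absent; return '' explicitly
        # rather than relying on index arithmetic.
        return ''
    # The stop index of a slice is exclusive, hence the + len(marker).
    return text[:index + len(marker)]

item = '12" Pipe SA-106 GR. B SCH 40 WALL smls'.upper()
nps = through_first(item)
print(nps)  # 12"
```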

Shortest possible generated unique ID

So we can generate a unique id with str(uuid.uuid4()), which is 36 characters long.
Is there another method to generate a unique ID which is shorter in terms of characters?
EDIT:
If ID is usable as primary key then even better
Granularity should be better than 1ms
This code could be distributed, so we can't assume time independence.

If this is for use as a primary key field in db, consider just using auto-incrementing integer instead.
str(uuid.uuid4()) is 36 chars but it has four useless dashes (-) in it, and it's limited to 0-9 a-f.
Better uuid4 in 32 chars:
>>> uuid.uuid4().hex
'b327fc1b6a2343e48af311343fc3f5a8'
Or just b64 encode and slice some urandom bytes (up to you to guarantee uniqueness):
>>> base64.b64encode(os.urandom(32))[:8]
b'iR4hZqs9'
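Both ideas can be sketched with the standard library alone. The function names short_uuid and random_id are hypothetical. Note that the shorter random ID carries only 48 bits, so uniqueness is probabilistic and collisions become likely much sooner than with a full UUID.

```python
import base64
import secrets
import uuid

def short_uuid():
    """Encode all 128 bits of a uuid4 in 22 URL-safe characters."""
    raw = uuid.uuid4().bytes  # 16 bytes
    # 16 bytes encode to 24 base64 characters, two of which are '=' padding.
    return base64.urlsafe_b64encode(raw).rstrip(b'=').decode('ascii')

def random_id(num_bytes=6):
    """Return a random URL-safe ID; 6 bytes yield 8 characters."""
    return secrets.token_urlsafe(num_bytes)

a = short_uuid()
b = random_id()
print(len(a), len(b))  # 22 8
```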

TLDR
Most of the time it's better to work with numbers internally and encode them to short IDs externally. So here's a function for Python 3, PowerShell & VBA that will convert an int32 to an alphanumeric ID. Use it like this:
int32_to_id(225204568)
'F2AXP8'
For distributed code use ULIDs: https://github.com/mdipierro/ulid
They are much longer but unique across different machines.
How short are the IDs?
It will encode about half a billion IDs in 6 characters so it's as compact as possible while still using only non-ambiguous digits and letters.
How can I get even shorter IDs?
If you want even more compact IDs/codes/serial numbers, you can easily expand the character set by just changing the chars="..." definition. For example, if you allow all lower and upper case letters (62 characters together with the digits) you can have 56 billion IDs within the same 6 characters. Adding a few symbols (like ~!@#$%^&*()_+-=) gives you 208 billion IDs.
So why didn't you go for the shortest possible IDs?
The character set I'm using in my code has an advantage: It generates IDs that are easy to copy-paste (no symbols so double clicking selects the whole ID), easy to read without mistakes (no look-alike characters like 2 and Z) and rather easy to communicate verbally (only upper case letters). Sticking to numeric digits only is your best option for verbal communication but they are not compact.
I'm convinced: show me the code
Python 3
def int32_to_id(n):
    if n==0: return "0"
    chars="0123456789ACEFHJKLMNPRTUVWXY"
    length=len(chars)
    result=""
    remain=n
    while remain>0:
        pos = remain % length
        remain = remain // length
        result = chars[pos] + result
    return result
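Only the VBA version further down includes a decoder. A Python counterpart might look like the sketch below; id_to_int32 mirrors the VBA function of the same name, and the module-level CHARS constant is an assumption made here so that both directions share one alphabet.

```python
CHARS = "0123456789ACEFHJKLMNPRTUVWXY"  # 28 non-ambiguous characters

def int32_to_id(n):
    """Encode a non-negative integer as a base-28 ID."""
    if n == 0:
        return "0"
    result = ""
    while n > 0:
        n, pos = divmod(n, len(CHARS))
        result = CHARS[pos] + result
    return result

def id_to_int32(id_str):
    """Decode an ID produced by int32_to_id back into an integer."""
    result = 0
    for ch in id_str:
        # CHARS.index raises ValueError for characters outside the alphabet.
        result = result * len(CHARS) + CHARS.index(ch)
    return result

print(int32_to_id(225204568))  # F2AXP8
print(id_to_int32('F2AXP8'))   # 225204568
```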
PowerShell
function int32_to_id($n){
    $chars="0123456789ACEFHJKLMNPRTUVWXY"
    $length=$chars.length
    $result=""; $remain=[int]$n
    do {
        $pos = $remain % $length
        $remain = [int][Math]::Floor($remain / $length)
        $result = $chars[$pos] + $result
    } while ($remain -gt 0)
    $result
}
VBA
Function int32_to_id(n)
    Dim chars$, length, result$, remain, pos
    If n = 0 Then int32_to_id = "0": Exit Function
    chars$ = "0123456789ACEFHJKLMNPRTUVWXY"
    length = Len(chars$)
    result$ = ""
    remain = n
    Do While (remain > 0)
        pos = remain Mod length
        remain = Int(remain / length)
        result$ = Mid(chars$, pos + 1, 1) + result$
    Loop
    int32_to_id = result
End Function
Function id_to_int32(id$)
    Dim chars$, length, result, remain, pos, value, power
    chars$ = "0123456789ACEFHJKLMNPRTUVWXY"
    length = Len(chars$)
    result = 0
    power = 1
    For pos = Len(id$) To 1 Step -1
        result = result + (InStr(chars$, Mid(id$, pos, 1)) - 1) * power
        power = power * length
    Next
    id_to_int32 = result
End Function
Public Sub test_id_to_int32()
    Dim i
    For i = 0 To 28 ^ 3
        If id_to_int32(int32_to_id(i)) <> i Then Debug.Print "Error, i=", i, "int32_to_id(i)", int32_to_id(i), "id_to_int32('" & int32_to_id(i) & "')", id_to_int32(int32_to_id(i))
    Next
    Debug.Print "Done testing"
End Sub

Yes. Just use the current UTC millis. This number never repeats.
const uniqueID = new Date().getTime();
EDIT
If you have the rare requirement to produce more than one ID within the same millisecond, this method is of no use, as the number's granularity is 1 ms.

Error building bitstring in python 3.5 : the datatype is being set to U32 without my control

I'm using a function to build an array of strings (which happens to be 0s and 1s only), which are rather large. The function works when I am building smaller strings, but somehow the data type seems to be restricting the size of the string to 32 characters long (U32), without my having asked for it. Am I missing something simple?
As I build the strings, I am first casting them as lists so as to more easily manipulate individual characters before joining them into a string again. Am I somehow limiting my ability to use 'larger' data types by my method? The value of np.max(CM1) in this case is something like ~300 (one recent run yielded 253), but the string only come out 32 characters long...
''' Function to derive genome and count mutations in provided list of cells '''
def derive_genome_biopsy(biopsy_list, family_dict, CM1):
    derived_genomes_inBx = np.zeros(len(biopsy_list)).astype(str)
    for position, cell in np.ndenumerate(biopsy_list):
        if cell == 0: continue
        temp_parent = 2
        bitstring = list('1')
        bitstring += (np.max(CM1)-1)*'0'
        if cell == 1:
            derived_genomes_inBx[position] = ''.join(bitstring)
            continue
        else:
            while temp_parent > 1:
                temp_parent = family_dict[cell]
                bitstring[cell-1] = '1'
                if temp_parent == 1: break
                cell = family_dict[cell]
            derived_genomes_inBx[position] = ''.join(bitstring)
    return derived_genomes_inBx
The specific error message I get is:
Traceback (most recent call last):
File "biopsyCA.py", line 77, in <module>
if genome[site] == '1':
IndexError: string index out of range
family_dict is a dictionary of parents and children that the algorithm above works through to reconstruct the 'genome' of individuals from the branching family tree. It basically sets a position in the bitstring to '1' if your parent had it, then your grandparent, and so on, until it reaches the first bit, which is always '1'; then it should be done.

The 32-character limit comes from converting the float64 array to a string array in this line:
derived_genomes_inBx = np.zeros(len(biopsy_list)).astype(str)
The resulting array has a fixed-width string dtype (U32 in Python 3, S32 in Python 2), which limits each element to 32 characters; longer values are truncated on assignment.
To raise this limit, use 'U300' ('S300' in Python 2) or larger instead of str.
You can also use list(map(str, np.zeros(len(biopsy_list)))) to get a more flexible list of strings, and convert it back to a NumPy array with numpy.array() after you have populated it.
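The truncation is easy to reproduce in isolation. The sketch below assumes NumPy is installed and uses a made-up 50-character bitstring; the variable names are illustrative only.

```python
import numpy as np

long_bits = '1' + '0' * 49  # a 50-character bitstring

# Converting a float64 array with astype(str) produces a fixed-width
# string dtype (typically '<U32'), so longer values are silently truncated.
narrow = np.zeros(3).astype(str)
narrow[0] = long_bits
print(narrow.dtype, len(narrow[0]))

# Declaring the width up front keeps the whole string.
wide = np.array([''] * 3, dtype='U300')
wide[0] = long_bits
print(wide.dtype, len(wide[0]))

# A plain Python list has no width limit at all.
flexible = [''] * 3
flexible[0] = long_bits
```

Because the truncation raises no error, the problem only shows up later, for example as the IndexError in the traceback above when code indexes past the shortened string.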

Thanks to help from a number of folks here and local, I finally got this working and the working function is:
''' Function to derive genome and count mutations in provided list of cells '''
def derive_genome_biopsy(biopsy_list, family_dict, CM1):
    derived_genomes_inBx = list(map(str, np.zeros(len(biopsy_list))))
    for biopsy in range(0,len(biopsy_list)):
        if biopsy_list[biopsy] == 0:
            bitstring = (np.max(CM1))*'0'
            derived_genomes_inBx[biopsy] = ''.join(bitstring)
            continue
        bitstring = list('1')
        bitstring += (np.max(CM1)-1)*'0'
        if biopsy_list[biopsy] == 1:
            derived_genomes_inBx[biopsy] = ''.join(bitstring)
            continue
        else:
            temp_parent = family_dict[biopsy_list[biopsy]]
            bitstring[biopsy_list[biopsy]-1] = '1'
            while temp_parent > 1:
                temp_parent = family_dict[position]
                bitstring[temp_parent-1] = '1'
                if temp_parent == 1: break
            derived_genomes_inBx[biopsy] = ''.join(bitstring)
    return derived_genomes_inBx
The original problem was, as Teppo Tammisto pointed out, that the array created with astype(str) took on the fixed-width 'S32' format. Once I switched to list(map(str, ...)), a few more issues with the original code surfaced, which I have now fixed. When I finish this thesis chapter I'll publish the whole family of functions used to virtually 'biopsy' a cellular automaton model (well, just an array really) and reconstruct 'genomes' from family tree data and the current automaton state vector.
Thanks all!

Python: Function takes 1 argument for 2 given

I have looked on this website for something similar, and attempted to debug using previous answers, and failed.
I'm testing (I did not write this module) a module that changes the grade value of a course's grades from a B- to say a B, but never going across base grade levels (ie, B+ to an A-).
The original module is called transcript.py
I'm testing it in my own testtranscript.py
I'm testing that module by importing it: 'import transcript' and 'import cornelltest'
I have ensured that all files are in the same folder/directory.
There is the function raise_grade present in transcript.py (there are multiple definitions in this module, but raise_grade is the only one giving me any trouble).
ti is in the form ('class name', 'gradvalue')
There's already another definition converting floats to strings and back (ie 3.0--> B).
def raise_grade(ti):
    """Raise gradeval of transcript line ti by a non-noticeable amount.
    """
    # value of the base letter grade, e.g., 4 (or 4.0) for a 4.3
    bval = int(ti.gradeval)
    print 'bval is:"' + str(bval) + '"'
    # part after decimal point in raised grade, e.g., 3 (or 3.0) for a 4.3
    newdec = min(int((ti.gradeval + .3)*10) % 10, 3)
    print 'newdec is:"' + str(newdec) + '"'
    # get result by add the two values together, after shifting newdec one
    # decimal place
    newval = bval + round(newdec/10.0, 1)
    ti.gradeval = newval
    print 'newval is:"' + str(newval) + '"'
I will probably get rid of the print later.
When I run testtranscript, which imports transcript:
def test_raise():
    """test raise_grade"""
    testobj = transcript.Titem('CS1110','B-')
    transcript.raise_grade('CS1110','B-')
    cornelltest.assert_floats_equal(3.0,transcript.lettergrade_to_val("B-"))
I get this from the cmd shell:
TypeError: raise_grade takes exactly 1 argument (2 given)
Edit1: So now I see that I am giving it two parameters when raise_grade(ti) is just one, but perhaps it would shed more light if I just put out the rest of the code. I'm still stuck as to why I get a ['str' object has no gradeval error]
LETTER_LIST = ['B', 'A']
# List of valid modifiers to base letter grades.
MODIFIER_LIST = ['-','+']
def lettergrade_to_val(lg):
    """Returns: numerical value of letter grade lg.
    The usual numerical scheme is assumed: A+ -> 4.3, A -> 4.0, A- -> 3.7, etc.
    Precondition: lg is a 1 or 2-character string consisting of a "base" letter
    in LETTER_LIST optionally followed by a modifier in MODIFIER_LIST."""
    # if LETTER_LIST or MODIFIER_LIST change, the implementation of
    # this function must change.
    # get value of base letter. Trick: index in LETTER_LIST is shifted from value
    bv = LETTER_LIST.index(lg[0]) + 3
    # Trick with indexing in MODIFIER_LIST to get the modifier value
    return bv + ((MODIFIER_LIST.index(lg[1]) - .5)*.3/.5 if (len(lg) == 2) else 0)
class Titem(object):
    """A Titem is an 'item' on a transcript, like "CS1110 A+"
    Instance variables:
        course [string]: course name. Always at least 1 character long.
        gradeval [float]: the numerical equivalent of the letter grade.
                          Valid letter grades are 1 or 2 chars long, and consist
                          of a "base" letter in LETTER_LIST optionally followed
                          by a modifier in MODIFIER_LIST.
    We store values instead of letter grades to facilitate
    calculations of GPA later.
    (In "real" life, one would write a function that,
    when displaying a Titem, would display the letter
    grade even though the underlying representation is
    numerical, but we're keeping things simple for this
    lab.)
    """
    def __init__(self, n, lg):
        """Initializer: A new transcript line with course (name) n, gradeval
        the numerical equivalent of letter grade lg.
        Preconditions: n is a non-empty string.
        lg is a string consisting of a "base" letter in LETTER_LIST
        optionally followed by modifier in MODIFIER_LIST.
        """
        # assert statements that cause an error when preconditions are violated
        assert type(n) == str and type(lg) == str, 'argument type error'
        assert (len(n) >= 1 and 0 < len(lg) <= 2 and lg[0] in LETTER_LIST and
                (len(lg) == 1 or lg[1] in MODIFIER_LIST)), 'argument value error'
        self.course = n
        self.gradeval = lettergrade_to_val(lg)
Edit2: I understand the original problem... but it seems that the original writer screwed up the code, since raise_grade doesn't work properly for grade values at 3.7 ---> 4.0, since bval takes the original float and makes it an int, which doesn't work in this case.

You are calling the function incorrectly; you should be passing the testobj:
def test_raise():
    """test raise_grade"""
    testobj = transcript.Titem('CS1110','B-')
    transcript.raise_grade(testobj)
    ...
The raise_grade function is expecting a single argument ti which has a gradeval attribute, i.e. a Titem instance.
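The two error messages from the question can be reproduced with a stripped-down sketch. The Titem class and raise_grade body below are hypothetical stand-ins written in Python 3, not the code from transcript.py; only the calling convention is the point.

```python
class Titem:
    """Hypothetical, simplified stand-in for transcript.Titem."""
    def __init__(self, course, gradeval):
        self.course = course
        self.gradeval = gradeval

def raise_grade(ti):
    """Placeholder body; only the calling convention matters here."""
    ti.gradeval = round(ti.gradeval + 0.3, 1)

errors = []

# Two positional arguments for a one-parameter function: TypeError.
try:
    raise_grade('CS1110', 'B-')
except TypeError as exc:
    errors.append(type(exc).__name__)

# One argument, but a str has no gradeval attribute: AttributeError.
try:
    raise_grade('CS1110')
except AttributeError as exc:
    errors.append(type(exc).__name__)

# Passing the object itself works.
item = Titem('CS1110', 3.0)
raise_grade(item)
print(errors, item.gradeval)  # ['TypeError', 'AttributeError'] 3.3
```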
