How to concatenate a string from integers? - python

I'm very new to Python.
How do I get a string like this:
53 (46.49 %)
But I'm getting this:
1 53 (1 46.49 %)
I'm trying to combine the last value from the counts table with its proportion (I'm not sure what it's called in pandas):
table = pd.value_counts(data[var].values, sort=False)
prop_table = (table/table.sum() * 100).round(2)
num = table[[1]].to_string()
prop = prop_table[[1]].to_string()
test = num + " (" + prop + " %)"
but it puts a 1 before each number.
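What you're seeing is the Series index: table[[1]] selects a one-element Series, and Series.to_string() prints the index alongside the value by default. A minimal sketch of a fix, using toy data in place of data[var] (an assumption, just to make it runnable), is to grab the scalar with .iloc and format it directly; alternatively, table[[1]].to_string(index=False) suppresses the index:

import pandas as pd

# toy stand-in for data[var]
data = pd.DataFrame({"var": ["a", "b", "b", "a", "b"]})
table = data["var"].value_counts(sort=False)
prop_table = (table / table.sum() * 100).round(2)

# .iloc[-1] returns the scalar itself, so no index is printed
num = table.iloc[-1]
prop = prop_table.iloc[-1]
test = str(num) + " (" + str(prop) + " %)"
print(test)  # e.g. 3 (60.0 %)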

Related

Failing to append bigger data frames with pandas in nested loops. How to change to numpy vectorization?

I need to load a huge table (6 GB) from an older Postgres DB that contains some bad values I need to drop on load. So I wrote a loop that tries to load bigger chunks for performance, but steps down the chunk size to isolate and discard the bad values. Generally this works, but after roughly 500k records the performance degrades rapidly.
I have already read that processing larger datasets with pandas is not advisable, which is why I tried numpy. But that didn't change anything. Then I tried list comprehensions, but failed because of the exceptions I need in order to retry in smaller chunks.
From my point of view numpy vectorization looks like a good idea, but I have no idea how to make it work.
https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06
In general, this is the part I'd like to speed up massively:
df = pds.read_sql_query(sql,conn,params=[(i * chunksize), chunksize])
appended_df.append(df)
products_df = pds.concat(appended_df, ignore_index=True)
If the snippet above isn't enough context, you'll find more below.
# set autocommit = True
conn = pyodbc.connect(conn_str, autocommit=True)
cur = conn.cursor()
# count rows for chunking
sql_count = """\
select count("item_no") from "products"
"""
cur.execute(sql_count)
sql_row_counter = cur.fetchone()[0]
print("Total rows: " + str(sql_row_counter))
# define chunksize and calculate chunks
chunksize = 35000
chunk_divisor = 100
if chunksize / chunk_divisor < 1:
    chunk_divisor = chunksize
    print("Chunk divisor on error: " + str(chunk_divisor))
chksz_lvl2 = int(chunksize / chunk_divisor)
if chksz_lvl2 < 1:
    chksz_lvl2 = 1
chksz_lvl3 = int(chksz_lvl2 / chunk_divisor)
if chksz_lvl3 < 1:
    chksz_lvl3 = 1
# print settings for iteration
print("Chunksize: " + str(chunksize) + "\nChunksize Level 2: " +
      str(chksz_lvl2) + "\nChunksize Level 3: " + str(chksz_lvl3))
chunks = int(sql_row_counter / chunksize)
# Uncomment next row for test purposes
chunks = 25
print("Chunks: " + str(chunks) + "\n")
error_counter = 0
# iterate chunks
appended_df = []
print("Starting to iterate chunks.\nPlease wait...")
for i in range(0, chunks):
    # try to iterate at full speed
    print("\nNext chunk starts from " + str(i * chunksize) +
          " with a limit of " + str(chunksize) + ".")
    try:
        # start runtime measurement
        i_start = time.time()
        # sql statement
        sql = """\
select "item_no", "description_1", "description_2", "description_3" FROM "products" order by "item_no" offset ? limit ?"""
        # store into dataframe
        df = pds.read_sql_query(sql,
                                conn,
                                params=[(i * chunksize), chunksize])
        # get first and last value from dataframe
        head = df["item_no"].iloc[0]
        tail = df["item_no"].iloc[-1]
        # store query
        # Appending data frames via pandas.append() suddenly becomes slower by a factor of 10 from approx. 500,000 records per 4 columns.
        appended_df.append(df)
        # stop runtime measurement
        i_end = time.time()
        # print result
        print(
            str(i + 1) + " out of " + str(chunks) + " chunks in " +
            "{:5.3f}s".format(i_end - i_start) + " processed.")
    except:
        # collect error information
        print(
            "\nChunk " + str(i + 1) +
            " cannot be selected due to an error. Reduce chunk size from "
            + str(chunksize) + " to " + str(chksz_lvl2) +
            ". Entering level 2.\nFirst working item_no of last working chunk "
            + str(head) +
            "\nLast working item_no of last working chunk " +
            str(tail))
        ### 2 ### Successively reduce the chunks to narrow down and isolate errors.
        for j in range(0, chunk_divisor):
            # and so on...
            ...
# Merge chunks
print("\nNote: Chunksize = from row_no to row_no. Could be 1,2,3,4 = range of 4 or completely different, e.g. 2,45,99,1002 = range of 4.\n\nConcatenate chunks.")
products_df = pds.DataFrame()
products_df = pds.concat(appended_df, ignore_index=True)
print("Done. " + str(error_counter) +
      " rows had to be skipped. Details can be found in the full error log.")
conn.close()
I've since discovered that the Python script itself was already working as expected, so other frameworks like Dask had no chance to improve on it. In my case the source Postgres DB (v9.x) has an issue with using limit and order by together when querying huge tables.
I couldn't detect this directly, because my SQL query tool (DBeaver) only loads a subset of rows for display even when you query the full table, so the result is a false friend. To check properly, run a short select with a pretty large offset and limit, with ordering.
With an offset of approx. 500k records, selecting even a single record took about 10 seconds in my case.
The solution was to remove the order by from the embedded SQL in the "try" part.
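If the ordering has to stay, one common alternative (my addition here, not part of the original solution) is keyset pagination, which avoids the expensive OFFSET scan entirely by resuming after the last key seen. A sketch, assuming "item_no" is indexed and strictly increasing, with names taken from the post:

import pandas as pds

sql = """\
select "item_no", "description_1", "description_2", "description_3"
from "products"
where "item_no" > ?
order by "item_no"
limit ?"""

last_key = ""            # assumed sentinel below every item_no; adjust to the column's type
appended_df = []
while True:
    df = pds.read_sql_query(sql, conn, params=[last_key, chunksize])
    if df.empty:
        break            # no rows left
    appended_df.append(df)
    last_key = df["item_no"].iloc[-1]   # resume after the last key seen
products_df = pds.concat(appended_df, ignore_index=True)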

How to make a decent reduction function (R) for a rainbow table?

I'm trying to develop a rainbow table. The search function works well; the problem is when I try to generate the table.
There are 54 possible characters and the passwords are 3 characters long (for now). I can generate a table with 1024 rows and 153 columns. In theory, if there were no collisions, there would be a >95% chance that I crack the password (1024*153 ≈ 54^3).
But I'm getting 126662 collisions. Here's my reduction function (in Python):
def reduct(length, hashData, k):
    ### Initializing variables and mapping ###
    while len(pwd) != length:
        pwdChar = (int(hashData[0], 16) + int(hashData[1], 16) +
                   int(hashData[2], 16) + int(hashData[3], 16) - 7 + 3*k) % 54
        hashData = hashData[3:]
        pwd += mapping[pwdChar][1]
    return pwd
How can this function result in so many collisions? The sum of the first 4 nibbles is at most 60, so after the -7 the value lies between -7 and 53; the +3*k makes the function different for every column, and the % 54 makes sure it fits in the mapping.
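Worth noting (this aside is not from the original post): at this load factor, heavy collisions are expected even from a perfectly uniform reduction function, by the birthday bound. A quick back-of-the-envelope check:

import math

N = 54 ** 3     # password space: 157,464
n = 1024 * 153  # chain cells generated: 156,672

# expected distinct outputs of n uniform draws from a space of size N
# (standard birthday approximation)
expected_distinct = N * (1 - math.exp(-n / N))
expected_collisions = n - expected_distinct
print("expected collisions for a uniform reduction: %.0f" % expected_collisions)
# roughly 57,000 - so a large collision count alone doesn't prove the
# reduction is broken, though 126,662 suggests extra bias (the sum of four
# uniform nibbles is bell-shaped, not uniform, so the mod-54 result isn't
# equally likely for all characters).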

Inserting random values based on condition

I have the following DataFrame containing various information about a certain product. Input3 is a list of sentences created as shown below:
sentence_list = (['Køb online her','Sammenlign priser her','Tjek priser fra 4 butikker','Se produkter fra 4 butikker', 'Stort udvalg fra 4 butikker','Sammenlign og køb'])
df["Input3"] = np.random.choice(sentence_list, size=len(df))
Full_Input is a string created by joining various columns, its content being something like: "ProductName from Brand - Buy online here - Sitename". It is created like this:
df["Full_Input"] = df['TitleTag'].astype(str) + " " + df['Input2'].astype(str) + " " + df['Input3'].astype(str) + " " + df['Input4'].astype(str) + " " + df['Input5'].astype(str)
The problem here is that Full_Input_Length should be under 55. Therefore I am trying to figure out how to put a condition on the random generation of Input3, so that when it is combined with the other columns' strings, the full input length does not exceed 55.
This is what I tried:
for col in range(len(df)):
    condlist = [df["Full_Input"].apply(len) < 55]
    choicelist = [sentence_list]
    df['Input3_OK'][col] = np.random.choice.select(condlist, choicelist)
As expected, it doesn't work like that. np.random.choice.select is not a thing and I am getting an AttributeError.
How can I do that instead?
If you are guaranteed to have at least one item in Input3 that will satisfy this condition, you may want to try something like conditioning your random selection ONLY on values in your sentence_list that would be of an acceptable length:
# keep only the sentences that are short enough:
my_sentences = [s for s in sentence_list if len(s) < MAX_LENGTH]
# randomly select from this filtered list:
np.random.choice(my_sentences)
In other words, perform the filter on each list of strings BEFORE you call random.choice.
You can run this for each row in a dataframe like so:
def choose_string(full_input):
    return np.random.choice([
        s
        for s in sentence_list
        if len(s) + len(full_input) < 55
    ])

df["Input3_OK"] = df.Full_Input.map(choose_string)

regex replace using function python

I'm trying to print a table from my database using:
pd.read_sql_query("SELECT name,duration FROM activity where (strftime('%W', date) = strftime('%W', 'now'))", conn)
and it works; it prints:
name duration
0 programmation 150
1 lecture 40
2 ctf 90
3 ceh 90
4 deep learning 133
5 vm capture the flag 100
but I would like to use my function minuteToStr, which translates the duration to a string like "1h30", on the duration column.
I tried this code but it doesn't work:
tableau = str(pd.read_sql_query("SELECT name,duration FROM activity\
where (strftime('%W', date) = strftime('%W', 'now'))", conn))
tableau = re.sub("([0-9]{2,})", minuteToStr(int("\\1")), tableau)
print(tableau)
Thanks
Make this easy, just use a little mathemagic and string formatting.
h = df.duration // 60
m = df.duration % 60
df['duration'] = h.astype(str) + 'h' + m.astype(str) + 'm'
df
name duration
0 programmation 2h30m
1 lecture 0h40m
2 ctf 1h30m
3 ceh 1h30m
4 deep learning 2h13m
5 vm capture the flag 1h40m
re.sub doesn't work this way: the replacement must be a string or a function that receives the match object. Here minuteToStr(int("\\1")) is evaluated first, calling int on the literal string "\\1", which fails before re.sub ever runs.
Given that minuteToStr accepts an integer, you can simply use apply:
tableau['duration'] = tableau['duration'].apply(minuteToStr)
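The question never shows minuteToStr itself; a minimal sketch of what it might look like (purely an assumption, inferred from the "1h30" example), taking minutes as an int:

def minuteToStr(minutes):
    # hypothetical implementation: 90 -> "1h30", 150 -> "2h30"
    return str(minutes // 60) + 'h' + str(minutes % 60).zfill(2)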
Similar to using a function inside re.sub, in pandas we can use str.replace with a function. The same approach applies here, i.e.
If duration column is of integer type then
tableau['duration'].astype(str).str.replace("([0-9]{2,})", minuteToStr)
Else:
tableau['duration'].str.replace("([0-9]{2,})", minuteToStr)
To illustrate using a function inside replace (though I prefer you go with #colspeed's solution):
def minuteToStr(x):
    h = int(x.group(1)) // 60
    m = int(x.group(1)) % 60
    return str(h) + 'h' + str(m)

df['duration'].astype(str).str.replace("([0-9]{2,})", minuteToStr)
name duration
0 programmation 2h30
1 lecture 0h40
2 ctf 1h30
3 ceh 1h30
4 deep learning 2h13
5 vm capture the flag 1h40
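A version note (my addition, not from the original answers): since pandas 2.0, str.replace defaults to regex=False, and a callable replacement requires regex=True, so on newer pandas the call would be:

df['duration'].astype(str).str.replace(r"([0-9]{2,})", minuteToStr, regex=True)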

Python .join() multiplies total string characters across multiple lines

I am working on a personal project to improve my understanding of Python 3.4.2 looping and of concatenating strings from multiple sources.
My goal is to take 'string', use join with __len__() calls inside, and build up a string, but it is multiplying my results. I would like the lengths to be 5, then 10, then 15. Right now they come out as 5, then 25, then 105. If I keep going I get 425, 1705, 6825, etc.
I hope I'm missing something simple, but any help would be amazing. I'm also trying to do my joins efficiently (I know the prints aren't; those are for debugging purposes).
I stepped through it with an online Python visualizer to see if I could figure it out, but I'm still missing something.
http://www.pythontutor.com/visualize.html#mode=edit
Thank you in advance!
import random

def main():
    # String values will be pulled from
    string = 'valuehereisheldbythebeholderwhomrolledadtwentyandcriticalmissed'
    # Initial string creation
    strTest = ''
    print('strTest Blank: ' + strTest)
    # first round string generation
    strTest = strTest.join([string[randomIndex(string.__len__())] for i in range(randomLength())])
    print('strTest 1: ' + strTest)
    print('strTest 1 length: ' + str(strTest.__len__()))
    # second round string generation
    strTest = strTest.join([string[randomIndex(string.__len__())] for i in range(randomLength())])
    print('strTest 2: ' + strTest)
    print('strTest 2 length: ' + str(strTest.__len__()))
    # final round string generation
    strTest = strTest.join([string[randomIndex(string.__len__())] for i in range(randomLength())])
    print('strTest 3: ' + strTest)
    print('strTest 3 length: ' + str(strTest.__len__()))

def randomIndex(index):
    # create random value between position 0 and total string length to generate string
    return random.randint(0, index)

def randomLength():
    # return random length for string creation, static for testing
    return 5
    # return random.randint(10,100)

main()
# output desired is
# strTest 1 length: 5
# strTest 2 length: 10
# strTest 3 length: 15
The code runs without any issue. What's actually happening is that each time you call strTest.join(...), you are joining the characters of the new list using the previous value of strTest as the separator.
Quoting from the Python docs:
str.join(iterable) — Return a string which is the concatenation of the strings in the iterable. A TypeError will be raised if there are any non-string values in iterable, including bytes objects. The separator between elements is the string providing this method.
Example:
>>> s = 'ALPHA'
>>> '*'.join(s)
'A*L*P*H*A'
>>> s = 'TEST'
>>> ss = '-long-string-'
>>> ss.join(s)
'T-long-string-E-long-string-S-long-string-T'
So probably you want something like:
strTest = strTest + ''.join([string[randomIndex(string.__len__())] for i in range(randomLength())])
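To see the multiplication pattern concretely (a quick illustrative check, not from the original answer): joining 5 one-character strings with a separator of length L yields a string of length 5 + 4*L, which reproduces the observed 5, 25, 105 sequence:

sep = 'abcde'                 # stands in for strTest after round one (length 5)
joined = sep.join('vwxyz')    # five 1-char strings joined by the 5-char separator
print(len(joined))            # 5 + 4*5 = 25
print(5 + 4*25)               # next round: 105, matching the question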
