I'm trying to change the display of 10000.000000001317/10000 to only two decimals.
Finished: 100%|████████████████████████████| 10000.000000001317/10000 [00:01<00:00, 9330.68it/s]
This is my code right now:
import pandas as pd
from faker import Faker
from tqdm import tqdm

fake = Faker()

def create_dataframe(size):
    df = pd.DataFrame()
    pbar = tqdm(total=size)
    # Increment is equal to the total number of records to be generated divided by the fields to be created
    # divided by total (this being the necessary iterations for each field)
    increment = size / 5 / size
    df["ID"] = range(size)
    pbar.set_description("Generating Names")
    df["Name"], _ = zip(*[(fake.name(), pbar.update(increment)) for _ in range(size)])
    pbar.set_description("Generating Emails")
    df["Email"], _ = zip(*[(fake.free_email(), pbar.update(increment)) for _ in range(size)])
    pbar.set_description("Generating Addresses")
    df["Address"], _ = zip(*[(fake.address(), pbar.update(increment)) for _ in range(size)])
    pbar.set_description("Generating Phones")
    df["Phone"], _ = zip(*[(fake.phone_number(), pbar.update(increment)) for _ in range(size)])
    pbar.set_description("Generating Comments")
    df["Comment"], _ = zip(*[(fake.text(), pbar.update(increment)) for _ in range(size)])
    pbar.set_description("Finished")
    pbar.close()
    return df
According to the docs, or at least from what I've understood, this is the default format for the bar_format argument:
pbar = tqdm(total=size, bar_format='{desc}{percentage:3.0f}%|{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, ''{rate_fmt}{postfix}]')
I tried:
Setting .2f on n_fmt; this results in an error:
pbar = tqdm(total=size, bar_format='{desc}{percentage:3.0f}%|{bar}| {n_fmt:.2f}/{total_fmt} [{elapsed}<{remaining}, ''{rate_fmt}{postfix}]')
Or formatting the set_description,
pbar.set_description("Finished: {:.2f}/{:.2f}".format(pbar.n, size))
This prints another x/total before the actual bar:
Finished: 10000.00/10000.00: 100%|██████████| 10000.000000001317/10000 [00:01<00:00, 9702.65it/s]
Plus, ideally, the bar would show 10000/10000 with no decimals once it is finished.
You cannot specify the number of decimal places because n_fmt is a string. You can however pass unit_scale=True:
pbar = tqdm(total=size,
            bar_format='{desc}{percentage:3.0f}%|{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, ''{rate_fmt}{postfix}]',
            unit_scale=True)
According to the doc:
If 1 or True, the number of iterations will be reduced/scaled
automatically and a metric prefix following the
International System of Units standard will be added
(kilo, mega, etc.)
You'll get the following output for size=10000:
Finished: 100%|█████████████████████████████████████████████████████████████████████| 10.0k/10.0k [00:03<00:00, 3.31kit/s]
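If the fractional values themselves are the annoyance rather than how they are formatted, another option (a sketch of an alternative, not taken from the tqdm docs: it simply avoids fractional updates) is to make the bar's total the number of individual values to generate, so n only ever takes integer values and no decimals appear:

from faker import Faker
from tqdm import tqdm
import pandas as pd

fake = Faker()

def create_dataframe(size):
    df = pd.DataFrame()
    df["ID"] = range(size)
    generators = [("Name", fake.name), ("Email", fake.free_email),
                  ("Address", fake.address), ("Phone", fake.phone_number),
                  ("Comment", fake.text)]
    # One tick per generated value: the total is size * number of fields,
    # so pbar.update(1) keeps n an integer and no decimals are displayed.
    with tqdm(total=size * len(generators)) as pbar:
        for column, gen in generators:
            pbar.set_description("Generating " + column)
            values = []
            for _ in range(size):
                values.append(gen())
                pbar.update(1)
            df[column] = values
    return df

For size=10000 the finished bar then reads 50000/50000 (whole items generated) instead of a fractional count.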
I'm totally new to Excel and VBA.
I need to write a VBA macro that sorts dogs by total points gained; if the totals are equal, it should compare each attribute (from left to right) and sort by those.
I wrote a (I think) working sort in Python:
import random
from functools import cmp_to_key

class Structure:
    def __init__(self,total,agility,speed,endurance,follow,enthusiasm):
        self.total = total
        self.agility = agility
        self.speed = speed
        self.endurance = endurance
        self.follow = follow
        self.enthusiasm = enthusiasm

    def __str__(self):
        return 'Structure(Total=' + str(self.total) + ' ,agility=' + str(self.agility) +' ,speed=' + str(self.speed) + ' ,endurance=' + str(self.endurance)+\
            ' ,follow=' + str(self.follow)+' ,enthusiasm=' + str(self.enthusiasm)+')'

def compare(item1, item2):
    if item1.total < item2.total:
        return -1
    elif item1.total > item2.total:
        return 1
    else:
        #Agility compare
        if(item1.agility>item2.agility):
            return 1
        elif(item1.agility<item2.agility):
            return -1
        #Speed compare
        if(item1.speed>item2.speed):
            return 1
        elif(item1.speed<item2.speed):
            return -1
        #Endurance compare
        if(item1.endurance>item2.endurance):
            return 1
        elif(item1.endurance<item2.endurance):
            return -1
        #Follow compare
        if(item1.follow>item2.follow):
            return 1
        elif(item1.follow<item2.follow):
            return -1
        #Enthusiasm compare
        if(item1.enthusiasm>item2.enthusiasm):
            return 1
        elif(item1.enthusiasm<item2.enthusiasm):
            return -1
        return 0

def fill():
    #total = random.randint(163,170)
    total = 170
    agility = 0
    speed = 0
    endu = 0
    fol = 0
    enth = 0
    while(total!=agility+speed+endu+fol+enth):
        agility = random.randint(20,40)
        speed = random.randint(20,40)
        endu = random.randint(20,40)
        fol = random.randint(20,40)
        enth = random.randint(20,40)
    return [total,agility,speed,endu,fol,enth]

if __name__ == "__main__":
    list = []
    for i in range(10):
        k = fill()
        list.append(Structure(k[0],k[1],k[2],k[3],k[4],k[5]))
    for i in list:
        print(i)
    print("*********************** after sort *******************")
    zoznam = sorted(list, key=cmp_to_key(compare),reverse=True)
    for i in zoznam:
        print(i)
but I have no idea how to write it in Excel.
My idea is that I select the totals and it sorts the whole rows. The "data structure" in Excel looks like this:
For example, as you can see (on top), both of them have a total of 170 and the same agility, so that is skipped; the speed is higher, which is why that one is on top.
Thanks in advance
EDIT:
Thanks a lot gimix :) Because I need more than three keys and I only want to sort the selected rows, I changed the macro a little to:
Selection.Sort Key1:=Range("I1"), _
Order1:=xlDescending, _
Key2:=Range("J1"), _
Order2:=xlDescending, _
Key3:=Range("K1"), _
Order3:=xlDescending, _
Header:=xlNo
Selection.Sort Key1:=Range("L1"), _
Order1:=xlDescending, _
Key2:=Range("G1"), _
Order2:=xlDescending, _
Key3:=Range("H1"), _
Order3:=xlDescending, _
Header:=xlNo
The thing is, it SEEMS to be working, but I don't know if it SHOULD be "sorted twice" like this, and whether there could be some kind of "leaks" (unwanted behavior).
EDIT 2:
Shouldn't it rather be like this?
Selection.Sort Key1:=Range("K1"), _
Order1:=xlDescending, _
Header:=xlNo
Selection.Sort Key1:=Range("J1"), _
Order1:=xlDescending, _
Header:=xlNo
Selection.Sort Key1:=Range("I1"), _
Order1:=xlDescending, _
Header:=xlNo
Selection.Sort Key1:=Range("H1"), _
Order1:=xlDescending, _
Header:=xlNo
Selection.Sort Key1:=Range("G1"), _
Order1:=xlDescending, _
Header:=xlNo
Selection.Sort Key1:=Range("L1"), _
Order1:=xlDescending, _
Header:=xlNo
In VBA you have the Sort method of the Range object:
Range("A6:L11").Sort Key1:=Range("L1"), _
Order1:=xlDescending, _
Key2:=Range("G1"), _
Order2:=xlDescending, _
Key3:=Range("H1"), _
Order3:=xlDescending, _
Header:=xlNo
Key1 etc. identify which column to use; Order1 etc. tell whether to sort from lowest to highest or the other way round (the default is xlAscending, so you need to specify this); finally, Header tells whether your data has a header row (in your case we use xlNo, since you have non-data rows between the headers (row 1) and the data (row 6 and following)).
Btw, your Python code could be simpler: just build a tuple of total, agility, speed (and the remaining attributes) and use it as the key; there is no need to define a compare function or to call cmp_to_key().
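For instance, a minimal sketch of that suggestion, reusing the Structure objects and the list variable from the question and comparing the attributes in the same order as the compare function:

zoznam = sorted(
    list,  # the list of Structure objects built in the question
    key=lambda s: (s.total, s.agility, s.speed, s.endurance, s.follow, s.enthusiasm),
    reverse=True,
)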
I am trying to generate a unique set of values, for example pairs of (int, int). I tried using multiprocessing and Faker. Each process generates a list of length 125000, but when the batches are appended together there are only 125000 unique values, i.e. each process has generated the same set of values. Since I expect 1000000 unique values, is there a better way to generate them?
Code:
import time
import multiprocessing

import numpy as np
from faker import Faker

faker = Faker()

def get_row():
    return [faker.unique.pyint(min_value=0, max_value=100000000, step=1), faker.unique.pyint(min_value=0, max_value=100000000, step=1)]

def get_rows(n):
    return np.unique([get_row() for x in range(n)], axis=0).tolist()

t0 = time.time()
n_threads = 8
total = 1000000
n_per_task = total // n_threads
result = []
with multiprocessing.Pool() as p:
    for batch in p.imap_unordered(get_rows, [n_per_task for x in range(n_threads)]):
        result = result + batch
old = result
result = np.unique(result, axis=0).tolist()
t1 = time.time()
print(len(result))
print(t1-t0)
print(len(old))
Output:
125000
21.0101299338833
1000000
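One likely cause (an assumption, not stated in the question) is that every worker process starts from the same random state, so each Faker instance produces the same sequence. A minimal sketch that seeds a fresh Faker per task with a distinct seed; uniqueness is still only guaranteed within each worker, so a final deduplication pass remains necessary:

import multiprocessing

from faker import Faker

def get_rows_seeded(args):
    seed, n = args
    fake = Faker()
    fake.seed_instance(seed)  # distinct seed per task, so the sequences differ
    return [[fake.unique.pyint(min_value=0, max_value=100000000, step=1),
             fake.unique.pyint(min_value=0, max_value=100000000, step=1)]
            for _ in range(n)]

if __name__ == "__main__":
    total = 1000000
    n_tasks = 8
    n_per_task = total // n_tasks
    result = set()
    with multiprocessing.Pool() as p:
        # each task gets its own seed (here simply the task index)
        for batch in p.imap_unordered(get_rows_seeded,
                                      [(i, n_per_task) for i in range(n_tasks)]):
            result.update(map(tuple, batch))
    print(len(result))  # close to 1000000, minus rare cross-task collisions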
I need to read in a large CSV data file which is riddled with stray newline characters and generally quite chaotic. So instead of pandas I parse it manually; however, I'm running into a strange slowdown which seems to depend on the characters that appear in the file.
While trying to reproduce the problem by randomly creating a CSV file which looks similar, I figured that maybe the problem lies in the count function.
Consider this example, which creates a large file of chaotic random data, reads the file back in, and then uses count to reassemble it so it can be read as columnar data.
Note that in the first run of the file I only use string.ascii_letters for the random data; for the second run I'm using characters from string.printable.
import os
import random as rd
import string
import time

# Function to create random data in a specific pattern with separator ";":
def createRandomString(num,io,fullLength):
    lineFull = ''
    nl = True
    randstr = ''.join(rd.choice(string.ascii_letters) for _ in range(7))
    for i in range(num):
        if i == 0:
            line = 'Start;'
        else:
            line = ''
        bb = rd.choice([True,True,False])
        if bb:
            line = line+'\"\";'
        else:
            if rd.random() < 0.999:
                line = line+randstr
            else:
                line = line+rd.randint(10,100)*randstr
        if nl and i != num-1:
            line = line+';\n'
            nl = False
        elif rd.random() < 0.04 and i != num-1:
            line = line+';\n'
            if rd.random() < 0.01:
                add = rd.randint(1,10)*'\n'
                line = line+add
        else:
            line = line+';'
        lineFull = lineFull+line
    return lineFull+'\n'

# Create file with random data:
outputFolder = "C:\\DataDir\\Output\\"
numberOfCols = 38
fullLength = 10000
testLines = [createRandomString(numberOfCols,i,fullLength) for i in range(fullLength)]
with open(outputFolder+"TestFile.txt",'w') as tf:
    tf.writelines(testLines)

# Read in file:
with open(outputFolder+"TestFile.txt",'r') as ff:
    lines = []
    for line in ff.readlines():
        lines.append(unicode(line.rstrip('\n')))

# Restore columns by counting the separator:
linesT = ''
lines2 = []
time0 = time.time()
for i in range(len(lines)):
    linesT = linesT + lines[i]
    count = linesT.count(';')
    if count == numberOfCols:
        lines2.append(linesT)
        linesT = ''
    if i%1000 == 0:
        print time.time()-time0
        time0 = time.time()
print time.time()-time0
The print statements output this:
0.0
0.0019998550415
0.00100016593933
0.000999927520752
0.000999927520752
0.000999927520752
0.000999927520752
0.00100016593933
0.0019998550415
0.000999927520752
0.00100016593933
0.0019998550415
0.00100016593933
0.000999927520752
0.00200009346008
0.000999927520752
0.000999927520752
0.00200009346008
0.000999927520752
0.000999927520752
0.00200009346008
0.000999927520752
0.00100016593933
0.000999927520752
0.00200009346008
0.000999927520752
Consistently fast performance.
Now I change the third line in createRandomString to randstr = ''.join(rd.choice(string.printable) for _ in range(7)), my output now becomes this:
0.0
0.0759999752045
0.273000001907
0.519999980927
0.716000080109
0.919999837875
1.11500000954
1.25199985504
1.51200008392
1.72199988365
1.8820002079
2.07999992371
2.21499991417
2.37400007248
2.64800000191
2.81900000572
3.04500007629
3.20299983025
3.55500006676
3.6930000782
3.79499983788
4.13900017738
4.19899988174
4.58700013161
4.81799983978
4.92000007629
5.2009999752
5.40199995041
5.48399996758
5.70299983025
5.92300009727
6.01099991798
6.44200015068
6.58999991417
3.99399995804
Not only is the performance very slow but it is consistently becoming slower over time.
The only difference lies in the range of characters which are written into the random data.
The full set of characters which appear in my real data is this:
charSet = [' ','"','&',"'",'(',')','*','+',',','-','.','/','0','1','2','3','4','5','6',
'7','8','9',':',';','<','=','>','A','B','C','D','E','F','G','H','I','J','K',
'L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','\\','_','`','a',
'b','d','e','g','h','i','l','m','n','o','r','s','t','x']
Let's do some benchmarking on the count function:
import random as rd
import string

rd.seed()

def Test0():
    randstr = ''.join(rd.choice(string.digits) for _ in range(10000))
    randstr.count('7')

def Test1():
    randstr = ''.join(rd.choice(string.ascii_letters) for _ in range(10000))
    randstr.count('a')

def Test2():
    randstr = ''.join(rd.choice(string.printable) for _ in range(10000))
    randstr.count(';')

def Test3():
    randstr = ''.join(rd.choice(charSet) for _ in range(10000))
    randstr.count(';')
I'm testing only digits, only letters, printable, and the charset from my data.
Results of %timeit:
%timeit(Test0())
100 loops, best of 3: 9.27 ms per loop
%timeit(Test1())
100 loops, best of 3: 9.12 ms per loop
%timeit(Test2())
100 loops, best of 3: 9.94 ms per loop
%timeit(Test3())
100 loops, best of 3: 8.31 ms per loop
The performance is consistent and doesn't suggest any problems of count with certain character sets.
I also tested if concatenating strings with + would cause a slow down but this wasn't the case either.
Can anyone explain this or give me some hints?
EDIT: Using Python 2.7.12
EDIT 2: In my original data the following is happening:
The file has around 550000 lines, which are often broken up by random newline characters yet are always defined by exactly 38 ";" delimiters. Up to roughly 300000 lines the performance is fast; from that line on it suddenly starts getting slower and slower. I'm investigating this further now with the new clues.
The problem is in count(';').
string.printable contains ';' while string.ascii_letters doesn't.
Then as the length of linesT grows, the execution time grows as well:
0.000236988067627
0.0460968017578
0.145275115967
0.271568059921
0.435608148575
0.575787067413
0.750104904175
0.899538993835
1.08505797386
1.24447107315
1.34459710121
1.45430088043
1.63317894936
1.90502595901
1.92841100693
2.07722711563
2.16924905777
2.30753016472
In particular this code is problematic with string.printable:
numberOfCols = 38
if count == numberOfCols:
    lines2.append(linesT)
    linesT = ''
Since there is a chance that more than one ';' gets added while the count stands at 37, just before linesT would be flushed, the count skips 38 and linesT grows indefinitely.
You can observe this behaviour by keeping the initial character set at string.ascii_letters and changing your code to count('a').
To fix the problem with printable you can modify your code like this:
if count > numberOfCols:
Then we go back to the expected runtime behaviour:
0.000234842300415
0.00233697891235
0.00247097015381
0.00217199325562
0.00262403488159
0.00262403488159
0.0023078918457
0.0024049282074
0.00231409072876
0.00233006477356
0.00214791297913
0.0028760433197
0.00241804122925
0.00250506401062
0.00254893302917
0.00266218185425
0.00236296653748
0.00201988220215
0.00245118141174
0.00206398963928
0.00219988822937
0.00230193138123
0.00205302238464
0.00230097770691
0.00248003005981
0.00204801559448
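An alternative that avoids the quadratic rescanning altogether (a sketch, not part of the original answer) is to keep a running separator count and only count the newly appended line, so each character is scanned once:

# Assumes the same lines and numberOfCols as in the question.
linesT = ''
lines2 = []
count = 0
for part in lines:
    linesT = linesT + part
    count += part.count(';')      # scan only the new chunk
    if count >= numberOfCols:     # >= tolerates stray ';' inside the data
        lines2.append(linesT)
        linesT = ''
        count = 0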
I am just reporting what I found. The performance difference seemingly does not come from the str.count() function. I changed your code and refactored the str.count() call into its own function. I also put your global code into a main function. The following is my version of your code:
import os
import time
import random as rd
import string
import timeit

# Function to create random data in a specific pattern with separator ";":
def createRandomString(num,io,fullLength):
    lineFull = ''
    nl = True
    randstr = ''.join(rd.choice(string.ascii_letters) for _ in range(7))
    #randstr = ''.join(rd.choice(string.printable) for _ in range(7))
    for i in range(num):
        if i == 0:
            line = 'Start;'
        else:
            line = ''
        bb = rd.choice([True,True,False])
        if bb:
            line = line+'\"\";'
        else:
            if rd.random() < 0.999:
                line = line+randstr
            else:
                line = line+rd.randint(10,100)*randstr
        if nl and i != num-1:
            line = line+';\n'
            nl = False
        elif rd.random() < 0.04 and i != num-1:
            line = line+';\n'
            if rd.random() < 0.01:
                add = rd.randint(1,10)*'\n'
                line = line+add
        else:
            line = line+';'
        lineFull = lineFull+line
    return lineFull+'\n'

def counting_func(lines_iter):
    try:
        return lines_iter.next().count(';')
    except StopIteration:
        return -1

def wrapper(func, *args, **kwargs):
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

# Create file with random data:
def main():
    fullLength = 100000
    outputFolder = ""
    numberOfCols = 38
    testLines = [createRandomString(numberOfCols,i,fullLength) for i in range(fullLength)]
    with open(outputFolder+"TestFile.txt",'w') as tf:
        tf.writelines(testLines)
    # Read in file:
    with open(outputFolder+"TestFile.txt",'r') as ff:
        lines = []
        for line in ff.readlines():
            lines.append(unicode(line.rstrip('\n')))
    # Restore columns by counting the separator:
    lines_iter = iter(lines)
    print timeit.timeit(wrapper(counting_func, lines_iter), number=fullLength)

if __name__ == '__main__': main()
Tests are done 100000 times on each line generated. With string.ascii_letters, I get from timeit on average 0.0454177856445 seconds each loop. With string.printable, I get on average 0.0426299571991. In fact the latter is slightly faster than the former, though not really a significant difference.
I suspect the performance difference comes from what you are doing in the following loop besides counting:
for i in range(len(lines)):
    linesT = linesT + lines[i]
    count = linesT.count(';')
    if count == numberOfCols:
        lines2.append(linesT)
        linesT = ''
    if i%1000 == 0:
        print time.time()-time0
        time0 = time.time()
Another possibility is a slowdown from accessing global variables without a main function. But that would happen in both cases, so probably not.
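For completeness, a quick way to see that count itself scales with the length of the string being scanned (a small sketch, not from either answer), which is why recounting an ever-growing linesT becomes slow:

import timeit

for n in (10**4, 10**5, 10**6):
    s = 'a' * n
    t = timeit.timeit(lambda: s.count(';'), number=100)
    print("%d characters: %.4f s" % (n, t))  # grows roughly linearly with n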
I am using the audio analysis tools of the Spotify API (via the spotipy wrapper, as sp) to process tracks, using the following code:
def loudness_drops(track_ids):
    names = set()
    tids = set()
    tracks_with_drop_name = set()
    tracks_with_drop_id = set()
    for id_ in track_ids:
        track_id = sp.track(id_)['uri']
        tids.add(track_id)
        track_name = sp.track(id_)['name']
        names.add(track_name)
        #get audio features
        features = sp.audio_features(tids)
        #and then audio analysis id
        urls = {x['analysis_url'] for x in features if x}
        print len(urls)
        #fetch analysis data
        for url in urls:
            # print len(urls)
            analysis = sp._get(url)
            #extract loudness sections from analysis
            x = [_['start'] for _ in analysis['segments']]
            print len(x)
            l = [_['loudness_max'] for _ in analysis['segments']]
            print len(l)
            #get max and min values
            min_l = min(l)
            max_l = max(l)
            #normalize stream
            norm_l = [(_ - min_l)/(max_l - min_l) for _ in l]
            #define silence as a value below 0.1
            silence = [l[i] for i in range(len(l)) if norm_l[i] < .1]
            #more than one silence means one of them happens in the middle of the track
            if len(silence) > 1:
                tracks_with_drop_name.add(track_name)
                tracks_with_drop_id.add(track_id)
    return tracks_with_drop_id
The code works, but if the number of songs I search for is set to, say, limit=20, the time it takes to process all the audio segments x and l makes the process too expensive, e.g.:
time.time() prints 452.175742149
QUESTION:
How can I drastically reduce complexity here?
I've tried to use sets instead of lists, but working with set objects prohibits indexing.
EDIT: 10 urls:
[u'https://api.spotify.com/v1/audio-analysis/5H40slc7OnTLMbXV6E780Z', u'https://api.spotify.com/v1/audio-analysis/72G49GsqYeWV6QVAqp4vl0', u'https://api.spotify.com/v1/audio-analysis/6jvFK4v3oLMPfm6g030H0g', u'https://api.spotify.com/v1/audio-analysis/351LyEn9dxRxgkl28GwQtl', u'https://api.spotify.com/v1/audio-analysis/4cRnjBH13wSYMOfOF17Ddn', u'https://api.spotify.com/v1/audio-analysis/2To3PTOTGJUtRsK3nQemP4', u'https://api.spotify.com/v1/audio-analysis/4xPRxqV9qCVeKLQ31NxhYz', u'https://api.spotify.com/v1/audio-analysis/1G1MtHxrVngvGWSQ7Fj4Oj', u'https://api.spotify.com/v1/audio-analysis/3du9aoP5vPGW1h70mIoicK', u'https://api.spotify.com/v1/audio-analysis/6VIIBKYJAKMBNQreG33lBF']
This is what I see, not knowing much about spotify:
for id_ in track_ids:
    # this runs N times, where N = len(track_ids)
    ...
    tids.add(track_id) # tids contains all track_ids processed until now
    # in the end: len(tids) == N
    ...
    features = sp.audio_features(tids)
    # features contains features of all tracks processed until now
    # in the end, I guess: len(features) == N * num_features_per_track
    urls = {x['analysis_url'] for x in features if x}
    # very probably: len(urls) == len(features)
    for url in urls:
        # for the first track, this processes features of the first track only
        # for the second track, this processes features of 1st and 2nd
        # etc.
        # in the end, this loop repeats N * N * num_features_per_track times
You should not process any url twice. And you do, because you keep all tracks in tids and then for each track you process everything in tids, which turns the complexity of this into O(n²).
In general, always look for loops inside loops when trying to reduce complexity.
I believe in this case this should work, if audio_features expects a set of ids:
# replace this: features = sp.audio_features(tids)
# with:
features = sp.audio_features({track_id})
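Put together, a rough sketch of the restructured loop (assuming sp is the authenticated spotipy client from the question and keeping its private sp._get call): each track's features and analysis are fetched exactly once, which brings the work back to O(n):

def loudness_drops(track_ids):
    tracks_with_drop_id = set()
    for id_ in track_ids:
        track = sp.track(id_)                        # one metadata call per track
        features = sp.audio_features([track['uri']])
        if not features or not features[0]:
            continue
        analysis = sp._get(features[0]['analysis_url'])   # fetched once per track
        l = [seg['loudness_max'] for seg in analysis['segments']]
        min_l, max_l = min(l), max(l)
        norm_l = [(v - min_l) / (max_l - min_l) for v in l]
        # more than one "silent" segment means one happens mid-track
        if sum(1 for v in norm_l if v < .1) > 1:
            tracks_with_drop_id.add(track['uri'])
    return tracks_with_drop_id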
I have the following code
class Board:
    def __init__(self, size=7):
        self._size = size
        self._list, self._llist = [], []
        for i in range(self._size):
            self._list.append('_ ')
        for j in range(self._size):
            self._llist.append(self._list)

    def printboard(self):
        for i in range(self._size):
            for j in range(self._size):
                print(self._llist[i][j], end = ' ')
            print('\n')

    def updateboard(self,x,y,letter):
        self._llist[x][y]=letter
        self.printboard()

board = Board(3)
board.updateboard(0,0,'c')
and this prints
c _ _
c _ _
c _ _
instead of
c _ _
_ _ _
_ _ _
I can't see what is going wrong. Also, is there a simpler way to create the list of lists dynamically?
Thanks!
You are creating llist with the same list object repeated multiple times. If you want each list in llist to be a separate, independent object (so that modifying the contents changes only one row), you need to append a separate copy each time. The easiest way to do this is to change:
self._llist.append(self._list)
to
self._llist.append(list(self._list))
Simpler code would be:
self._list = ['_ '] * self._size
self._llist = [list(self._list) for i in range(self._size)]
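A quick demonstration of the difference between repeating the same list object and copying it (standalone, outside the Board class):

row = ['_ '] * 3
aliased = [row] * 3                        # three references to one list
copied = [list(row) for _ in range(3)]     # three independent lists

aliased[0][0] = 'c'
copied[0][0] = 'c'
print(aliased)  # [['c', '_ ', '_ '], ['c', '_ ', '_ '], ['c', '_ ', '_ ']]
print(copied)   # [['c', '_ ', '_ '], ['_ ', '_ ', '_ '], ['_ ', '_ ', '_ ']]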