ladder has around 15,000 elements, and this code snippet takes 5-8 seconds to run. Is there any way to make it faster? I tried it without checking for duplicates and without creating the accs list, and the time went down to 2-3 seconds, but I don't want duplicates in the CSV file.
I am using Python 2.7.9.
accs =[]
with codecs.open('test.csv','w', encoding="UTF-8") as out:
    row = ''
    for element in ladder:
        if element['account']['name'] not in accs:
            accs.append(element['account']['name'])
            row += element['account']['name']
            if 'twitch' in element['account']:
                row += "," + element['account']['twitch']['name'] + ","
            else:
                row += ",,"
            row += str(element['account']['challenges']['total']) + "\n"
    out.write(row)
seen = set()
results = []
for user in ladder:
    acc = user['account']
    name = acc['name']
    if name not in seen:
        seen.add(name)
        twitch_name = acc['twitch']['name'] if "twitch" in acc else ''
        challenges = acc['challenges']['total']
        results.append("%s,%s,%d" % (name, twitch_name, challenges))
with codecs.open('test.csv','w', encoding="UTF-8") as out:
    out.write("\n".join(results))
You can’t do much about the loop, since you need to go through every element in ladder after all. However, you can improve this membership test:
if element['account']['name'] not in accs:
Since accs is a list, this will essentially loop through all items of accs and check if the name is in there. And you loop for every element in ladder, so this can easily become very inefficient.
Instead, use a set rather than a list for accs, as this gives you constant-time membership lookups on average. That reduces your algorithm from quadratic to linear complexity. To do so, just use accs = set() and change your code to call accs.add() instead of append.
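The effect of switching to a set can be sketched as follows. This is only a minimal illustration, not the original code: the dedupe_names helper and the sample names are made up for the example.

```python
def dedupe_names(names):
    """Return names in first-seen order, skipping duplicates."""
    seen = set()      # hash-based, so "in" is O(1) on average
    unique = []
    for name in names:
        if name not in seen:   # with a list this test would scan every stored name
            seen.add(name)
            unique.append(name)
    return unique


unique_names = dedupe_names(["ann", "bob", "ann", "cat", "bob"])
print(unique_names)
```

The separate unique list is only there to keep the first-seen order, since a set itself has no defined order.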
Another issue is that you are doing string concatenation. Every time you do someString + "something" you throw away that string object and create a new one. This can also become inefficient for a high number of operations. Instead, use a list here to collect all the elements you want to write, and then join them:
row = []
row.append(element['account']['name'])
if 'twitch' in element['account']:
    row.append(element['account']['twitch']['name'])
else:
    row.append('')
row.append(str(element['account']['challenges']['total']))
out.write(','.join(row))
out.write('\n')
Alternatively, since you are writing to a file anyway, you could just call out.write multiple times with each string part.
Finally, you could also look into the csv module if you are interested in writing out CSV data.
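As a rough sketch of the csv module approach: the example below is written for Python 3, where csv.writer works on text streams (in Python 2.7 the csv module works on byte strings, so Unicode values would need encoding first). The write_accounts helper and the sample data are assumptions made for illustration; only the dictionary layout is taken from the question.

```python
import csv
import io


def write_accounts(ladder, out):
    """Write one CSV row per unique account name to the file-like object out."""
    writer = csv.writer(out, lineterminator="\n")
    seen = set()
    for element in ladder:
        acc = element["account"]
        name = acc["name"]
        if name in seen:
            continue  # skip duplicate accounts
        seen.add(name)
        twitch = acc["twitch"]["name"] if "twitch" in acc else ""
        writer.writerow([name, twitch, acc["challenges"]["total"]])


# Made-up sample data in the same shape as the question's ladder.
sample = [
    {"account": {"name": "alice", "twitch": {"name": "alice_tv"},
                 "challenges": {"total": 3}}},
    {"account": {"name": "bob", "challenges": {"total": 5}}},
    {"account": {"name": "alice", "challenges": {"total": 3}}},
]
buf = io.StringIO()
write_accounts(sample, buf)
print(buf.getvalue())
```

The writer also takes care of quoting, which matters if a name ever contains a comma.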
I am struggling with trying to see what is wrong with my code. I am new to python.
import os
uniqueWorms = set()
logLineList = []
with open("redhat.txt", 'r') as logFile:
    for eachLine in logFile:
        logLineList.append(eachLine.split())
    for eachColumn in logLineList:
        if 'worm' in eachColumn.lower():
            uniqueWorms.append()
            print (uniqueWorms)
eachLine.split() returns a list of words. When you append this to logLineList, it becomes a 2-dimensional list of lists.
Then when you iterate over it, eachColumn is a list, not a single column.
If you want logLineList to be a list of words, use
logLineList += eachLine.split()
instead of
logLineList.append(eachLine.split())
Finally, uniqueWorms.append() should be uniqueWorms.add(eachColumn), because uniqueWorms is a set and sets have add() rather than append(). And print(uniqueWorms) should be outside the loop, so you just see the final result, not every time a worm is added.
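Putting those points together, a minimal sketch could look like this. It reads from an in-memory list of lines instead of redhat.txt so that it is self-contained; the find_unique_worms name and the sample lines are made up for illustration.

```python
def find_unique_worms(lines):
    """Collect every distinct word that contains 'worm' (case-insensitive)."""
    unique_worms = set()
    for line in lines:
        for word in line.split():        # iterate over words, not whole rows
            if 'worm' in word.lower():
                unique_worms.add(word)   # sets use add(), not append()
    return unique_worms


sample_lines = [
    "Detected Worm.A on host1",
    "clean scan",
    "worm.b and Worm.A found",
]
worms = find_unique_worms(sample_lines)
print(worms)  # printed once, after the loop has finished
```

With a real log file, passing the open file object as lines should work the same way, since iterating over a file yields its lines.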
I have a dictionary d with around 500 main keys (name1, name2, etc.). Each value is itself a small dictionary with 5 keys (called ppty1, ppty2, etc.), and the corresponding values are floats converted to strings.
I want to extract data faster than I presently do, based on a list of lists of the form ['name1', 'ppty3','ppty4'] (name1 could be any other nameX, and ppty3 and ppty4 could be any other pptyX).
In my application, I have many dictionaries, but they differ only by the values of the fields ppty1, ..., ppty5. All the keys are "static". I do not mind if there are some preliminary operations; I would just like the processing time of one dictionary to be, ideally, much faster than now. My poor implementation, which loops over every field, takes about 3 ms.
Here is the code to generate d and fields; this is just to simulate dummy data, it does not need to be improved:
import random
random.seed(314)
# build dictionary
def make_small_dict():
    d = {}
    for i in range(5):
        key = "ppty" + str(i)
        d[key] = str(random.random())
    return d
d = {}
for i in range(100):
    d["name" + str(i)] = make_small_dict()
# build fields
def make_row():
    line = ['name' + str(random.randint(0,100))]
    [line.append('ppty' + str(random.randint(0,5))) for i in range(2)]
    return line
fields = [0]*300
for i in range(300):
    fields[i] = [make_row() for j in range(3)]
For example, fields[0] returns
[['name420', 'ppty1', 'ppty1'],
['name206', 'ppty1', 'ppty2'],
['name21', 'ppty2', 'ppty4']]
so the first row of the output should be something like
[[d['name420']['ppty1'], d['name420']['ppty1']],
 [d['name206']['ppty1'], d['name206']['ppty2']],
 [d['name21']['ppty2'], d['name21']['ppty4']]]
My solution:
import time
start = time.time()
data = [0] * len(fields)
i = 0
for field in fields:
    data2 = [0] * 3
    j = 0
    for row in field:
        lst = [d[row[0]][key] for key in [row[1], row[2]]]
        data2[j] = lst
        j += 1
    data[i] = data2
    i += 1
print time.time() - start
My main question is: how can I improve my code? A few additional questions:
Later, I need to do some operations such as column extraction, basic operation on some entries of data: would you recommend storing the extracted values directly in an np.array?
How to avoid extracting the same values multiple times (fields has some redundant rows such as ['name1', 'ppty3', 'ppty4'])?
I read that things such as i += 1 take a little bit of time, how can I avoid them?
This was tough to read, so I started by breaking bits out into functions. Then I could test whether a plain list comprehension would work. It's already faster: a comparison over 10000 runs with timeit showed this code runs in about 64% of the original code's time.
In this case I kept everything in lists to force execution so it is directly comparable, but you could use generators or map, and that'd push the computation back to when the data is actually consumed.
def row_lookup(name, key1, key2):
    return (d[name][key1], d[name][key2]) # Tuple is faster to construct than list
def field_lookup(field):
    return [row_lookup(*row) for row in field]
start = time.time()
result = [field_lookup(field) for field in fields]
print(time.time() - start)
print(data == result)
# without consecutive dupes in fields (groupby only merges adjacent equal items)
from itertools import groupby
result = [field_lookup(field) for field, _ in groupby(fields)]
Change just the result assignment line to:
result = map(field_lookup, fields)
And the runtime becomes negligible, because in Python 3 map returns a lazy iterator, so it's not actually going to compute the data until you ask it for the result. This is not a fair comparison, but if you're not going to consume all the data, you'd save time. Change the list comprehensions in the functions to generators and you'd get the same benefit there too. Multiprocessing and asyncio didn't improve performance time in this case.
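The deferred evaluation can be seen in a small, self-contained sketch. This assumes Python 3, where map returns a lazy iterator (in Python 2 map builds the whole list immediately); the tracked_square function and the calls list are made up purely to record when the work happens.

```python
calls = []


def tracked_square(x):
    """Square x and record that the function was actually called."""
    calls.append(x)
    return x * x


lazy = map(tracked_square, [1, 2, 3])   # nothing is computed yet
calls_before = list(calls)

first = next(lazy)                      # computes only the first item
calls_after_first = list(calls)

rest = list(lazy)                       # consumes the remaining items
print(calls_before, first, calls_after_first, rest)
```

This is why timing only the map call is misleading: the cost is paid later, when the results are consumed.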
If you can change the structure you can preprocess your fields into a list of just the rows [['namex', 'pptyx', 'pptyX']..]. In this case, you can change it to just a single list comprehension, which lets you get this down to about 29% of the original runtime, ignoring the preprocessing to slim the fields.
from itertools import groupby, chain
slim_fields = [row for row, _ in groupby(chain.from_iterable(fields))]
results = [(d[name][key1], d[name][key2]) for name, key1, key2 in slim_fields]
In this case, results is just a list of tuples containing the values: [(value1, value2)..]
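One caveat worth illustrating: itertools.groupby only groups consecutive equal items, so duplicate rows that are not adjacent survive it. A possible alternative, sketched below under the assumption that the order of first appearance should be kept, tracks already-seen rows in a set. The unique_rows helper and the tiny d and fields values are made up for the example.

```python
def unique_rows(fields):
    """Flatten fields and drop repeated rows, keeping first-seen order."""
    seen = set()
    rows = []
    for field in fields:
        for row in field:
            key = tuple(row)        # lists are unhashable, tuples are not
            if key not in seen:
                seen.add(key)
                rows.append(key)
    return rows


# Made-up miniature versions of d and fields.
d = {
    "name1": {"ppty3": "0.1", "ppty4": "0.2"},
    "name2": {"ppty3": "0.3", "ppty4": "0.4"},
}
fields = [
    [["name1", "ppty3", "ppty4"], ["name2", "ppty3", "ppty4"]],
    [["name1", "ppty3", "ppty4"], ["name2", "ppty4", "ppty3"]],
]

slim_fields = unique_rows(fields)
results = [(d[name][key1], d[name][key2]) for name, key1, key2 in slim_fields]
print(results)
```

Because the keys are static, slim_fields could be computed once and reused for every dictionary that shares the same layout.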
I'm a computing student who posted a question here the other day asking for help with my function to sort scores in order. I got some great help and it now works, but I would also like it to sort the names according to the scores (so if James gets 10, it prints "James 10"). Right now the scores are sorted and printed to the screen properly, but the names are just printed in the order in which they were entered. I've tried this:
def sortlist():
    global scorelist, namelist, hss
    namelist = []
    scorelist = []
    hs = open("hstname.txt", "r")
    namelist = hs.read().splitlines()
    hss = open("hstscore.txt","r")
    for line in hss:
        scorelist.append(int(line))
    switched = True
    while switched:
        switched = False
        for i in range(len(scorelist)-1):
            for j in range(len(namelist)-1):
                if scorelist[i] < scorelist[i+1]:
                    scorelist[i],scorelist[i+1] = scorelist[i+1],scorelist[i]
                    namelist[j],namelist[j+1] = namelist[j+1],namelist[j]
                    switched = True
The score part works fine, and it took me ages to get it. I'm not allowed to use a predefined function like .sort(). Can anyone offer any help or advice? Or, if you can see what I'm doing wrong, can you offer a solution? I can't work this out for the life of me.
You don't need to use nested loops in order to go through both lists.
Since you should be manipulating the two lists in exactly the same way, you should only be using one for loop, and using the i variable to index into both lists.
If you turn this:
for i in range(len(scorelist)-1):
    for j in range(len(namelist)-1):
        if scorelist[i] < scorelist[i+1]:
            scorelist[i],scorelist[i+1] = scorelist[i+1],scorelist[i]
            namelist[j],namelist[j+1] = namelist[j+1],namelist[j]
            switched = True
into this:
for i in range(len(scorelist)-1):
    if scorelist[i] < scorelist[i+1]:
        scorelist[i],scorelist[i+1] = scorelist[i+1],scorelist[i]
        namelist[i],namelist[i+1] = namelist[i+1],namelist[i]
        switched = True
then you should get both lists sorted as you want them.
The only time when this will cause an error is if your two lists are of different lengths. If namelist is somehow shorter than scorelist then this code will throw an exception. You could guard against that by checking before your sort routine that
len(scorelist) == len(namelist)
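As a self-contained sketch of the whole idea, the function below applies the same swaps to both lists and checks the lengths first. The sort_scores_with_names name, the choice to raise ValueError, and the sample data are assumptions for illustration; it also works on copies rather than on globals.

```python
def sort_scores_with_names(scores, names):
    """Bubble-sort scores in descending order, moving names in step."""
    if len(scores) != len(names):
        raise ValueError("scores and names must have the same length")
    scores = list(scores)   # work on copies so the inputs stay untouched
    names = list(names)
    switched = True
    while switched:
        switched = False
        for i in range(len(scores) - 1):
            if scores[i] < scores[i + 1]:
                # Swap the same positions in both lists.
                scores[i], scores[i + 1] = scores[i + 1], scores[i]
                names[i], names[i + 1] = names[i + 1], names[i]
                switched = True
    return scores, names


sorted_scores, sorted_names = sort_scores_with_names(
    [7, 10, 3], ["Ann", "James", "Bob"])
print(sorted_scores, sorted_names)
```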
Imagine you have a function that returns a single sorted list. For example, here's an implementation of the Quicksort algorithm in Python:
def qsorted(L):
    return L and (qsorted([x for x in L[1:] if x < L[0]]) + # lesser items
                  [L[0]] + # pivot
                  qsorted([x for x in L[1:] if x >= L[0]])) # greater or equal
Then you could use it to sort your scorelist:
qsorted_scorelist = qsorted(scorelist)
To sort namelist according to the order of scorelist, you could use a Schwartzian transform (decorate, sort, undecorate):
qsorted_namelist = [name for score, name in qsorted(zip(scorelist, namelist))]
Note that the same function qsorted() is used in both cases: to sort the scorelist alone and to sort both lists together. You should try to extract common functionality into separate functions instead of modifying your sorting algorithm in place for a slightly different task.
To test that the result is correct, you could use the builtin sorted() function:
sorted_namelist = [name for score, name in sorted(zip(scorelist, namelist))]
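A runnable sketch of the decorate, sort, undecorate steps is shown below. It assumes Python 3, where zip returns an iterator, so the pairs are materialised with list() before being passed to qsorted (which slices its argument); the sample scores and names are made up.

```python
def qsorted(L):
    """Quicksort returning a new list; an empty list is returned as-is."""
    return L and (qsorted([x for x in L[1:] if x < L[0]]) +   # lesser items
                  [L[0]] +                                    # pivot
                  qsorted([x for x in L[1:] if x >= L[0]]))   # greater or equal


scorelist = [7, 10, 3]
namelist = ["Ann", "James", "Bob"]

# Decorate: pair each name with its score; tuples compare by score first.
pairs = list(zip(scorelist, namelist))
# Sort the pairs, then undecorate by keeping only the names.
qsorted_namelist = [name for score, name in qsorted(pairs)]
print(qsorted_namelist)

# Cross-check against the builtin sorted().
assert qsorted_namelist == [name for score, name in sorted(pairs)]
```

Note that this sorts in ascending score order; for highest score first, the resulting list could simply be reversed.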
I'm a student in a Computing class and we have to write a program which contains file handling and a sort. I've got the file handling done and I wrote out my sort (it's a simple sort) but it doesn't sort the list. My code is this:
namelist = []
scorelist = []
hs = open("hst.txt", "r")
namelist = hs.read().splitlines()
hss = open("hstscore.txt","r")
for line in hss:
    scorelist.append(int(line))
scorelength = len(scorelist)
for i in range(scorelength):
    for j in range(scorelength + 1):
        if scorelist[i] > scorelist[j]:
            temp = scorelist[i]
            scorelist[i] = scorelist[j]
            scorelist[j] = temp
return scorelist
I've not been doing Python for very long so I know the code may not be efficient but I really don't want to use a completely different method for sorting it and we're not allowed to use .sort() or .sorted() since we have to write our own sort function. Is there something I'm doing wrong?
def super_simple_sort(my_list):
    switched = True
    while switched:
        switched = False
        for i in range(len(my_list)-1):
            if my_list[i] > my_list[i+1]:
                my_list[i],my_list[i+1] = my_list[i+1],my_list[i]
                switched = True
super_simple_sort(some_list)
print some_list
This is a very simple sorting implementation that is equivalent to yours but takes advantage of a few things to speed it up: we only need one for loop, we only need to repeat as long as the list is out of order, and Python doesn't require a temp variable for swapping values.
Since it changes the actual list values in place, you don't even need to return anything.
I'm relatively new to programming, which is why I've chosen Python to learn.
At the moment I'm attempting to read a list of usernames and passwords from an Excel spreadsheet with xlrd and use them to log in to something, then back out and go to the next line, log in again, and keep going.
Here is a snippet of the code:
import xlrd
wb = xlrd.open_workbook('test_spreadsheet.xls')
# Load XLRD Excel Reader
sheetname = wb.sheet_names() #Read for XCL Sheet names
sh1 = wb.sheet_by_index(0) #Login
def readRows():
    for rownum in range(sh1.nrows):
        rows = sh1.row_values(rownum)
        userNm = rows[4]
        Password = rows[5]
        supID = rows[6]
        print userNm, Password, supID
print readRows()
I've gotten the variables out and it reads all of them in one shot. Here is where my lack of programming skills comes into play: I know I need to iterate through these and do something with them, but I'm kind of lost on what the best practice is. Any insight would be great.
Thank you again
A couple of pointers:
I'd suggest you not print the result of a function that has no return value (that just prints None). Instead, just call it, or return something to print.
def readRows():
    for rownum in range(sh1.nrows):
        rows = sh1.row_values(rownum)
        userNm = rows[4]
        Password = rows[5]
        supID = rows[6]
        print userNm, Password, supID
readRows()
Or, looking at the docs, you can take a slice from row_values:
row_values(rowx, start_colx=0, end_colx=None)
Returns a slice of the values of the cells in the given row.
Because you just want the columns with index 4 to 6 (end_colx is exclusive, like a Python slice, so pass 7):
def readRows():
    # using list comprehension
    return [ sh1.row_values(idx, 4, 7) for idx in range(sh1.nrows) ]
print readRows()
Using the second method, your function returns a list, which you can assign to a variable holding all the data you read from the Excel file. It is actually a list of lists, each containing one row's values.
L1 = readRows()
for row in L1:
    print row[0], row[1], row[2]
After you have your data, you are able to manipulate it by iterating through the list, much like for the print example above.
def login(name, password, id):
    # do stuff with name password and id passed into method
    ...
for row in L1:
    login(*row) # unpack the three values of the row into the three arguments
You may also want to look into different data structures for storing your data. If you need to find a user by name, a dictionary is probably your best bet:
def readRows():
    # using list comprehension; each sliced row is [name, password, id]
    rows = [ sh1.row_values(idx, 4, 7) for idx in range(sh1.nrows) ]
    return dict([ [row[0], (row[1], row[2])] for row in rows ])
D1 = readRows()
print D1['Bob']
('sdafdfadf',23)
import pprint
pprint.pprint(D1)
{'Bob': ('sdafdfadf',23),
 'Cat': ('asdfa',24),
 'Dog': ('fadfasdf',24)}
One thing to note is that dictionaries in Python versions before 3.7 do not keep their entries in any particular order, so the items may come back in arbitrary order.
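The lookup-by-name idea can be tried without a spreadsheet. The sketch below uses a hand-written list of rows in place of the values read with xlrd; rows_to_dict and the sample rows are made up for illustration, and each row is assumed to hold exactly a name, a password and an ID.

```python
def rows_to_dict(rows):
    """Map each user name to a (password, id) tuple."""
    return {name: (password, sup_id) for name, password, sup_id in rows}


# Stand-in for the three columns sliced out of each spreadsheet row.
sample_rows = [
    ["Bob", "pw1", 23],
    ["Cat", "pw2", 24],
]
users = rows_to_dict(sample_rows)

password, sup_id = users["Bob"]   # direct lookup by name, no loop needed
print(password, sup_id)
print("Dog" in users)             # membership test before looking up
```

If two rows share the same name, the later one silently replaces the earlier one, which may or may not be what you want.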
I'm not sure if you are intent on using xlrd, but you may want to check out PyWorkbooks (note, I am the author of PyWorkbooks :D)
from PyWorkbooks.ExWorkbook import ExWorkbook
B = ExWorkbook()
B.change_sheet(0)
# Note: it might be B[:1000, 3:6]. I can't remember if xlrd uses pythonic addressing (0 is first row)
data = B[:1000,4:7] # gets a generator, the '1000' is arbitrarily large.
def readRows():
    while True:
        try:
            userNm, Password, supID = data.next() # you could also do data[0]
            print userNm, Password, supID
            if userNm is None: break # when there is no more data it stops
        except IndexError:
            print 'list too long'
readRows()
You will find that this is significantly faster (and easier I hope) than anything you would have done. Your method will get an entire row, which could be a thousand elements long. I have written this to retrieve data as fast as possible (and included support for such things as numpy).
In your case, speed probably isn't as important. But in the future, it might be :D
Check it out. Documentation is available with the program for newbie users.
http://sourceforge.net/projects/pyworkbooks/
This seems good, with one remark: you should rename "rows" to "cells", because you are actually reading the cell values of every single row.