Python dictionary too slow for cross-comparison, improvements? [closed]

I am currently struggling with a performance issue when using Python dictionaries. I have a few huge lists of dicts (up to 30k entries each), and I want to cross-compare these entries: given one entry (its identifier is stored under a key), how many other sources contain an entry with the same value for that key? It currently takes up to 5 hours on my machine, but it needs to finish within a few minutes to be useful for my tool. I already tried removing entries during the search to make it more efficient.
all_cachedata is a list containing one list of dicts per source. sources is a list with information about the lists in all_cachedata.
appearsin_list = []
# first, get all the cached data
sources = sp.get_sources()
all_cachedata = [0]*len(sources)
for source in sources:
    iscached = source[8]
    sourceid = int(source[0])
    if iscached == "True":
        cachedata, _ = get_local_storage_info(sourceid)
    else:
        cachedata = []
    all_cachedata[sourceid-1] = cachedata
# second, compare cache entries
# iterate over all cached sources
for source in sources:
    sourceid = int(source[0])
    datatype = source[3]
    iscached = source[8]
    if verbose:
        print("Started comparing entries from source " + str(sourceid) +
              " with " + str(len(all_cachedata[sourceid-1])) + " entries.")
    if iscached == "True":
        # iterate over all other cache entries
        for entry in all_cachedata[sourceid-1]:
            # print("Comparing source " + str(sourceid) + " with source " + str(cmpsourceid) + ".")
            appearsin = 0
            for cmpsource in sources:
                cmpsourceid = int(cmpsource[0])
                cmpiscached = cmpsource[8]
                # find entries for same potential threat
                if cmpiscached == "True" and len(all_cachedata[cmpsourceid-1]) > 0 and cmpsourceid != sourceid:
                    for cmpentry in all_cachedata[cmpsourceid-1]:
                        if datatype in cmpentry:
                            if entry[datatype] == cmpentry[datatype]:
                                appearsin += 1
                                all_cachedata[cmpsourceid-1].remove(cmpentry)
                                break
            appearsin_list.append(appearsin)
            if appearsin > 0:
                if verbose:
                    print(entry[datatype] + " appears also in " + str(appearsin) + " more source/s.")
            all_cachedata[sourceid-1].remove(entry)
avg = float(sum(appearsin_list)) / float(len(appearsin_list))
print ("Average appearance: " + str(avg))
print ("Median: " + str(numpy.median(numpy.array(appearsin_list))))
print ("Minimum: " + str(min(appearsin_list)))
print ("Maximum: " + str(max(appearsin_list)))
I would be very thankful for some tips on speeding this up.

I think your algorithm can be improved; nested loops scale badly here, because every entry is compared against every entry of every other source. I also think that Python is probably not the best tool for this particular purpose: SQL is designed for comparing and searching large amounts of data. You can use something like sqlite_object to convert your data set into a SQLite database and query it.
If you want to stay with pure Python, you can try compiling your script with Cython; that can give you a reasonable speed-up.
http://docs.cython.org/src/tutorial/pure.html
Then you can improve your code with some static type hinting:
http://docs.cython.org/src/tutorial/pure.html#static-typing
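As an illustration of what an improved algorithm could look like in plain Python, here is a minimal sketch that replaces the nested loops with one set of values per source and a counter. It assumes that every source stores the compared value under the same key, whereas the question looks the key up per source via source[3]. The function name count_cross_appearances and the sample data are made up for this example.

```python
from collections import Counter


def count_cross_appearances(sources_entries, key):
    """For each value stored under `key`, count the other sources that contain it.

    `sources_entries` is assumed to be a list with one list of dicts per source.
    """
    # One set per source, so a value repeated inside a source counts only once.
    value_sets = [
        {entry[key] for entry in entries if key in entry}
        for entries in sources_entries
    ]
    # Count in how many sources each value occurs.
    source_count = Counter()
    for values in value_sets:
        source_count.update(values)
    # A value found in n sources appears in n - 1 *other* sources.
    return {value: n - 1 for value, n in source_count.items()}


# Hypothetical sample data: three sources keyed by "ip".
example = [
    [{"ip": "1.1.1.1"}, {"ip": "2.2.2.2"}],
    [{"ip": "1.1.1.1"}],
    [{"ip": "1.1.1.1"}, {"ip": "3.3.3.3"}],
]
appears_in = count_cross_appearances(example, "ip")
print(appears_in)
```

Each entry is visited once while the sets are built, so the work grows with the total number of entries rather than with the product of the list sizes.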


How can I convert a result into a list of variables that I can use as an input?

I was able to come up with these two parts, but I'm having trouble linking them.
Part 1 - This accepts a filter which is listed as 'project = status = blocked'. This will list all issue codes that match the filter and separate them line by line. Is it necessary to convert the results into a list? I'm also wondering if it converts the entire result into one massive string or if each line is a string.
issues_in_project = jira.search_issues(
    'project = status = Blocked'
)
issueList = list(issues_in_project)
search_results = '\n'.join(map(str, issueList))
print(search_results)
Part 2 - Right now, the jira.issue will only accept an issue code one at a time. I would like to use the list generated from Part 1 to keep running the code below for each and every issue code in the result. I'm having trouble linking these two parts.
issue = jira.issue(##Issue Code goes here##)
print(issue.fields.project.name)
print(issue.fields.summary + " - " + issue.fields.status.statusCategory.name)
print("Description: " + issue.fields.description)
print("Reporter: " + issue.fields.reporter.displayName)
print("Created on: " + issue.fields.created)
Part 1
'project = status = Blocked' is not valid JQL, so first of all, you will not get a valid result from calling jira.search_issues('project = status = Blocked').
The result of jira.search_issues() is essentially a list of jira.resources.Issue objects, not a list of strings or one string with many lines. To be precise, the result of jira.search_issues() is of type jira.client.ResultList, which is a subclass of Python's list.
Part 2
You already have all the required data in issues_in_project if your JQL is correct. Therefore, you can loop through the list and use the relevant information of each JIRA issue. For your information, jira.issue() returns exactly one jira.resources.Issue object (if the issue key exists).
Example
... # initialize jira
issues_in_project = jira.search_issues('status = Blocked')
for issue in issues_in_project:
    print(issue.key)
    print(issue.fields.summary)
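To link the two parts of the question, the per-issue prints from Part 2 can run for every issue in the search result. The following is only a sketch: describe_issue and print_issues are hypothetical helper names, and the SimpleNamespace object merely stands in for a real jira.resources.Issue so that the snippet runs without a Jira server. It also assumes that the description field can be empty.

```python
from types import SimpleNamespace


def describe_issue(issue):
    """Build the lines the question prints for a single issue."""
    fields = issue.fields
    return [
        fields.project.name,
        fields.summary + " - " + fields.status.statusCategory.name,
        # Assumption: an empty description may come back as None.
        "Description: " + (fields.description or ""),
        "Reporter: " + fields.reporter.displayName,
        "Created on: " + fields.created,
    ]


def print_issues(issues):
    """Print the details of every issue in an iterable of issues."""
    for issue in issues:
        print(issue.key)
        for line in describe_issue(issue):
            print(line)


# Stand-in object with made-up values, shaped like the attributes used above.
fake_issue = SimpleNamespace(
    key="DEMO-1",
    fields=SimpleNamespace(
        project=SimpleNamespace(name="Demo Project"),
        summary="Fix login",
        status=SimpleNamespace(statusCategory=SimpleNamespace(name="To Do")),
        description=None,
        reporter=SimpleNamespace(displayName="Jane Doe"),
        created="2020-01-01T00:00:00.000+0000",
    ),
)
print_issues([fake_issue])
```

With a real client, the call would be print_issues(jira.search_issues('status = Blocked')); no separate jira.issue() lookup is needed.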

Code working a week ago, now I am getting an error without changing anything in my code

I wrote a small web scraper that was working fine a couple of weeks ago, but now gives me an error without me having changed any part of my code. My code is listed below for reference:
address = driver.find_elements_by_xpath('//h3[@class = "street"]')
price = driver.find_elements_by_xpath('//div[@class = "price"]')
details = driver.find_elements_by_xpath('//div[@class = "details"]')
num_page_items = len(details)
with open('results.csv', 'a') as f:
    for x in range(num_page_items):
        f.write(address[x].text + " , " + price[x].text.replace(",", "") + "," + details[x].text + "\n")
I am using selenium (I omitted the import and setup since that part of the code works fine) and when I run my code I get the following error:
line 25, in <module>
f.write(address[x].text + " , " + price[x].text.replace(",", "") + "," + details[x].text + "\n")
IndexError: list index out of range
I did some research, and when I print len(details) I get 24, which shows that the details variable does contain values. Since the range is defined and the list has a length, why would I get an out-of-range error?
Your code assumes that all three lists have the same length, but that's not guaranteed: the loop runs len(details) times, so it fails as soon as address or price is shorter. Like others have said, reconsider your implementation if the design of the site has changed.
Alternatively, if you want to stop the errors, you could look into the built-in zip function. https://docs.python.org/3.3/library/functions.html#zip
It pairs up your lists element by element and yields n tuples, where n is the length of the shortest list. Bear in mind, though, that if the site has changed its design, the tuples may combine values that no longer belong together.
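A minimal sketch of the zip approach follows. Plain strings stand in for the .text of the Selenium elements, build_rows is a hypothetical helper name, and the sample data is invented so that one price is missing.

```python
def build_rows(addresses, prices, details):
    """Return one CSV-style line per listing, stopping at the shortest input."""
    rows = []
    for address, price, detail in zip(addresses, prices, details):
        rows.append(address + " , " + price.replace(",", "") + "," + detail + "\n")
    return rows


# Made-up page data: three addresses and details, but only two prices.
addresses = ["1 Main St", "2 Oak Ave", "3 Pine Rd"]
prices = ["$100,000", "$250,000"]
details = ["2 bd", "3 bd", "4 bd"]

rows = build_rows(addresses, prices, details)

# zip() silently drops the unmatched items, so report a length mismatch.
if len({len(addresses), len(prices), len(details)}) != 1:
    print("Warning: list lengths differ; rows may be misaligned")

for row in rows:
    print(row, end="")
```

The explicit length check is there because zip hides the mismatch that previously raised the IndexError; without it, misaligned rows could end up in the CSV unnoticed.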

How to print results from for loop in one list [closed]

The code below is a for loop that prints many lists:
for file in dir:
    res = p.Probability(base + i + "/" + file)
    print(i + ": " + ": " + str(res))
    print(res)
    #docum = []
    #docum.append(res)
    print(docum)
for loop result will be:
[['hello',123]['hi',456]]
[['hello',123]['hi',456]]
[['hello',123]['hi',456]]
[['hello',123]['hi',456]]
[['hello',123]['hi',456]]
but I want to print them as one list:
[['hello',123]['hi',456],['hello',123]['hi',456],['hello',123]['hi',456]]
How can I do that? I have tried many things, but nothing works. I am new to Python. One more request: how can I separate hi and hello,
like:
hi hello
456 123
456 123
456 123
I am doing a school project for class 11. I am stuck on this, and I am new to coding.
docum = []
for file in dir:
    res = p.Probability(base + i + "/" + file)
    print(i + ": " + ": " + str(res))
    print(res)
    docum.append(res)
print(docum)
Initialize the list once, before the for loop, so that you are not creating a new one on every iteration and overwriting the old entries.
About your second question, how to separate hi and hello:
You could loop over the entries and check the first element of each one with a condition (if res[i][0] == 'hi'), assuming you want the output laid out the way you described.
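Both steps can be sketched together as follows. The sample data mimics the printed results from the question, and collect and split_by_label are hypothetical helper names; a dictionary keyed by the label replaces the explicit if check on 'hi'.

```python
from collections import defaultdict


def collect(results):
    """Gather every per-file result in one list created before the loop."""
    docum = []
    for res in results:
        docum.append(res)
    return docum


def split_by_label(docum):
    """Group the numbers by their label, for example 'hi' and 'hello'."""
    columns = defaultdict(list)
    for res in docum:
        for label, value in res:
            columns[label].append(value)
    return dict(columns)


# Made-up stand-in for three p.Probability(...) results.
results = [[['hello', 123], ['hi', 456]] for _ in range(3)]

docum = collect(results)
columns = split_by_label(docum)

print(docum)
print("hi", "hello")
for hi_value, hello_value in zip(columns['hi'], columns['hello']):
    print(hi_value, hello_value)
```

If one flat list of [label, value] pairs is wanted instead of a list of results, docum.extend(res) can replace docum.append(res).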

The word "the" causing syntax error in print function - Python [closed]

I'm using "The coder's apprentice: Learning Python with Python 3" (http://www.spronck.net/pythonbook/pythonbook.pdf).
I'm doing this exercise: "The cover price of a book is $24.95, but bookstores get a 40 percent discount.
Shipping costs $3 for the first copy and 75 cents for each additional copy. Calculate the total wholesale costs for 60 copies."
This is my code:
book_price = 24.95
book_discount = book_price / 10 * 4
bookstore_book_price = book_price - book_discount
shipping_first = 3
shipping_rest = 0.75
sixty_shipped = bookstore_book_price + shipping_first + (shipping_rest * 59)
print("A book is being sold regularly for " +str(book_price) + ".")
print("At bookstores, it's being sold with a 40% discount, amounting to " + str(book_discount) + ".")
print("This means it's being sold at bookstores for " + str(bookstore_book_price) + ".")
print("The first copy ships for " + "str(shipping_first) + ", but the rest ships for " + str(shipping_rest) ".")
print("Given 60 copies were shipped, it would cost " + str(sixty_shipped + ".")
For whatever reason, the word "the" in this line of code:
(print("The first copy ships for " + "str(shipping_first) + ", but the rest ships for " + str(shipping_rest) "."))
produces a syntax error. Even if I remove each word until I reach "for", I still get a syntax error. When only "for" and "but" are left, the error:
EOL while scanning string literal
is produced. I don't have a clue what to do.
Here's my code: Using IDLE editor (not prompt).
Because you have an extra " in front of str(shipping_first). Instead of
(print("The first copy ships for " + "str(shipping_first) + ", but the rest ships for " + str(shipping_rest) "."))
do
(print("The first copy ships for " + str(shipping_first) + ", but the rest ships for " + str(shipping_rest) + "."))
You can also omit calling str(), from print() docs:
All non-keyword arguments are converted to strings like str() does and written to the stream
Update:
You also left out the + before the final "." in that line.
And as @tobias_k mentioned, you forgot the closing ) of the str() call in print("Given 60 copies were shipped, it would cost " + str(sixty_shipped + ".")
So, for your code to work without the str() calls:
print("The first copy ships for ", shipping_first, ", but the rest ships for ", shipping_rest, ".")
Or even better with format()
print("The first copy ships for {}, but the rest ships for {}.".format(shipping_first, shipping_rest))
It's now more readable.
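For completeness, here is a sketch of the last two print calls written with format(), together with the total for 60 copies. One caveat, read from the exercise text rather than from the answer above: sixty_shipped in the question adds the price of a single book, while the exercise asks for 60 copies, so the sketch multiplies the discounted price by the number of copies. The constant names and wholesale_cost are hypothetical.

```python
BOOK_PRICE = 24.95      # cover price from the exercise
DISCOUNT_RATE = 0.40    # bookstore discount
SHIPPING_FIRST = 3.00   # shipping for the first copy
SHIPPING_REST = 0.75    # shipping for each additional copy


def wholesale_cost(copies):
    """Discounted price for all copies plus shipping."""
    if copies <= 0:
        return 0.0
    books = copies * BOOK_PRICE * (1 - DISCOUNT_RATE)
    shipping = SHIPPING_FIRST + SHIPPING_REST * (copies - 1)
    return books + shipping


total = wholesale_cost(60)
print("The first copy ships for {}, but the rest ships for {}.".format(
    SHIPPING_FIRST, SHIPPING_REST))
# {:.2f} rounds to two decimals, which hides floating-point noise.
print("Given 60 copies were shipped, it would cost {:.2f}.".format(total))
```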

{Python} Generate unique integers using iteration

I'm making a program to collect information from the user, and to add it to a text file.
It's a program that will be used to get said information from a number of applicants.
For linearity in the results I collect, I want to randomly ask the questions.
What I'm asking for is a way to pull a question from the list, ask for input, store the input in the text file, and then ask another question pulled from the list at random.
Here is my code so far:
def ques():
    global quesnum
    for i in questions:
        num = int(random.randint(0,len(questions)-1))
        j = int(numbers.count(str(num)))
        while j >= 1:
            num = int(random.randint(0,len(questions)-1))
            ##DEBUG ONLY##
            print('true')
            break
        else:
            num = str(num)
            numbers.append(num)
            ##DEBUG ONLY##
            print('false')
            num = int(num)
        answer = input(str(quesnum) + '. ' + questions[num] + ': ')
        answers.write(str(quesnum) + '. ' + questions[num] + ': ')
        answers.write(answer + '\n')
        quesnum = int(quesnum + 1)
Errors:
Once the number has been used it is added to the list.
If a number has already been used, ideal situation is to generate a new number and use that instead.
I can't see any errors in my code, and as far as I can see it should work fine.
Can anyone point out a fix or suggest a better way of doing this? I have already found answers suggesting to use random.sample() but I have tried this already and can't get that working either.
Thanks in advance.
You can solve this by using random.shuffle:
import random
questions = ['Q1: ...', 'Q2: ...', 'Q3: ...']
random.shuffle(questions)  # shuffles the list in place
for q in questions:
    answer = input(q + ': ')  # Python 3; use raw_input() on Python 2
    with open("answers.txt", "a") as myfile:
        myfile.write("{}: {}\n\n".format(q, answer))
This will shuffle your questions, ask them in random order and save them to a text file. If you want to save more detailed information for each question, this will also work with a list of dicts. E.g.
questions = [
    {'nr.': 1, 'text': 'Do you like horses?'},
    {'nr.': 2, 'text': 'Where were you born?'}
]
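Since the question mentions random.sample(), here is a sketch of that variant applied to the list of dicts. random.sample returns a new list in random order without repeats and leaves the original list untouched. The function name ask_in_random_order and the ask parameter are assumptions made for this example; ask defaults to input, and the demo passes canned answers so that it runs without a prompt.

```python
import random


def ask_in_random_order(questions, ask=input):
    """Ask every question exactly once, in random order, and return the lines to save."""
    # Sampling len(questions) items yields a shuffled copy with no repeats.
    order = random.sample(questions, k=len(questions))
    lines = []
    for number, question in enumerate(order, start=1):
        answer = ask("{}. {}: ".format(number, question['text']))
        lines.append("{}. {}: {}\n".format(number, question['text'], answer))
    return lines


questions = [
    {'nr.': 1, 'text': 'Do you like horses?'},
    {'nr.': 2, 'text': 'Where were you born?'}
]

# Canned answers replace the interactive prompt for this demonstration.
lines = ask_in_random_order(questions, ask=lambda prompt: "sample answer")
for line in lines:
    print(line, end="")
```

In the real program, calling ask_in_random_order(questions) prompts the user, and the returned lines can be written to the answers file with writelines().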
