I've a json file of type [{"score": 68},{"score": 78}]
I need to find the height of standard binary search tree that is made using the scores of all the objects. How can I do it?
This is what I'm doing. I'm first getting all the scores and storing inside the json file and then applying the formula.
import ijson
import math
f = open ('data_large')
content = ijson.items(f, 'item')
n = len(list(i['score'] for i in content))
height = math.ceil(math.log((n+1),2)-1)
print height
Well, this does gives me the correct answer, but wanted to know 2 things?
1) Whether this formula will also be valid in case when there are duplicates in the list, since I need to develop a BST which can have duplicates as well?
2) I think n = len(list(i['score'] for i in content)) is useless because since I dont need the node values to calculate the height of the BST, but only the length of the list. Is there any way I can calculate the number of entries so that I omit this line and calculate the total nuber of entries in the json file, which will serve the purpose of n?
The other thing is I also wanted to calculate is the unique scores as well from the file. So, this is how I'm doing print set(i['score'] for i in content) , but it takes 201secs to execute since the file is so large( 256MB, hence used ijson for fast processing ), hence there are multiple entries inside the content. Can I make this statement much more time-efficient. If yes, How?
1) Yes/no. If you've added a property to each node which counts the number of times that node has been inserted into the tree then you still have a BST and the answer is yes.
If you actually want duplicate nodes, then you'd need modify the BST property. The unmodified property says that items smaller than X go left and items greater than X go right. If you instead say items greater than or equal to X to right, then it's easy to see that you can make the tree arbitrarily high by adding many duplicate items and the answer is no.
2) Have you tried list(content)? Of course, you cannot build a BST without somehow removing duplicate nodes, this cannot be used to calculate the height of the BST. You need to remove duplicate items. This leads to your third question.
3) As regards the print set(i['score']... You shouldn't bundle separate questions together like this as it will lead you down the dark path to having different answers addressing different parts of your question. However, the code you've write certainly does have something Pythonic about it. So you've got to ask yourself if it's really worth your time (which is often the only time that really matters) to try to find a more convoluted, but quicker solution.
Related
I am relatively new to Python. However, my needs generally only involve simple string manipulation of rigidly formatted data files. I have a specific situation that I have scoured the web trying to solve and have come up blank.
This is the situation. I have a simple list of two-part entires, formatted like this:
name = ['PAUL;25', 'MARY;60', 'PAUL;40', 'NEIL;50', 'MARY;55', 'HELEN;25', ...]
And, I need to keep only one instance of any repeated name (ignoring the number to the right of the ' ; '), keeping only the entry with the highest number, along with that highest value still attached. So the answer would look like this:
ans = ['MARY;60', 'PAUL;40', 'HELEN;25', 'NEIL;50, ...]
The order of the elements in the list is irrelevant, but the format of the ans list entries must remain the same.
I can probably figure out a way to brute force it. I have looked at 2D lists, sets, tuples, etc. But, I can't seem to find the answer. The name list has about a million entries, so I need something that is efficient. I am sure it will be painfully easy for some of you.
Thanks for any input you can provide.
Cheers.
alkemyst
Probably the best data structure for this would be a dictionary, with the entries split up (and converted to integer) and later re-joined.
Something like this:
max_score = {}
for n in name:
person, score_str = n.split(';')
score = int(score_str)
if person not in max_score or max_score[person] < score:
max_score[person] = score
ans = [
'%s;%s' % (person, score)
for person, score in max_score.items()
]
This is a fairly common structure for many functions and programs: first convert the input to an internal representation (in this case, split and convert to integer), then do the logic or calculation (in this case, uniqueness and maximum), then convert to the required output representation (in this case, string separated with ;).
In terms of efficiency, this code looks at each input item once, then at each output item once; there's unlikely to be any approach that can do better than that (certainly not formally, and likely not in practice). All of the per-item operations are constant-time and fast. It accumulates the intermediate answer in memory (in max_score), but again that is unavoidable; if memory is an issue, the input and output could be changed to iterators/generators, but the whole intermediate answer has to be accumulated in max_score before any items can be output.
I have a CSV file with 100,000 rows.
Each row in column A is a sentence comprised of both chars and integers.
I want column B to contain only integers.
I want the new columns to be in the same CSV file.
How can I accomplish this?
If I'm understanding your question correctly, I would use .isdigit() to parse the data in column A. I'm frankly not sure what the format of column A is, so I don't know exactly what you would do with this (if you gave more information I could give a more specific answer). Your solution will likely come in a similar form to this:
def find(lines):
B = []
for line in lines:
numbers = [c for c in line if c.isdigit()]
current = int(''.join(numbers))
# current is the concatenation of all
# integers found in column A from left to right
B.append(current)
return B
Let me know if this makes sense or is even in the right track for your solution. Once again, without knowing what you're trying to do, and what A looks like, I'm not sure what your actual goals are.
EDIT
I'm not going to explain the csv stuff for you, mainly because there is a fantastic resource and library for it included in python here. If you have specific questions related to writing csv, definitely post them.
It sounds like you essentially want to pull int values out of column A then add them to a new column B. There are definitely many ways to solve this, but the general form of the problem is for each row you'll filter out the int, then you'll add the filtered int into the new column. I'll list a couple:
Regex: You could use a pattern such as [0-9]+ to pull the string out of A, then use int(whatever that output is) to cast to int, then store those values in B. I'm a sucker for a good regular expression and this one is fairly straight forward. Regexr is a great resource to learn about this and test your pattern.
Use an algorithm similar to above: The above algorithm worked before, but I've updated it slightly. Now that it's been updated it'll return an array of numbers correspondent to numbers in A from left to right. This is relatively sound, but it doesn't necessarily guarantee you have the right integer, given that if the title has an int in it, it'll mess some things up. It is likely one of the more clear ways of doing this, though.
This is my first post, please be gentle. I'm attempting to sort some
files into ascending and descending order. Once I have sorted a file, I am storing it in a list which is assigned to a variable. The user is then to choose a file and search for an item. I get an error message....
TypeError: unorderable types; int() < list()
.....when ever I try to search for an item using the variable of my sorted list, the error occurs on line 27 of my code. From research, I know that an int and list cannot be compared, but I cant for the life of me think how else to search a large (600) list for an item.
At the moment I'm just playing around with binary search to get used to it.
Any suggestions would be appreciated.
year = []
with open("Year_1.txt") as file:
for line in file:
line = line.strip()
year.append(line)
def selectionSort(alist):
for fillslot in range(len(alist)-1,0,-1):
positionOfMax=0
for location in range(1,fillslot+1):
if alist[location]>alist[positionOfMax]:
positionOfMax = location
temp = alist[fillslot]
alist[fillslot] = alist[positionOfMax]
alist[positionOfMax] = temp
def binarySearch(alist, item):
first = 0
last = len(alist)-1
found = False
while first<=last and not found:
midpoint = (first + last)//2
if alist[midpoint] == item:
found = True
else:
if item < alist[midpoint]:
last = midpoint-1
else:
first = midpoint+1
return found
selectionSort(year)
testlist = []
testlist.append(year)
print(binarySearch(testlist, 2014))
Year_1.txt file consists of 600 items, all years in the format of 2016.
They are listed in descending order and start at 2017, down to 2013. Hope that makes sense.
Is there some reason you're not using the Python: bisect module?
Something like:
import bisect
sorted_year = list()
for each in year:
bisect.insort(sorted_year, each)
... is sufficient to create the sorted list. Then you can search it using functions such as those in the documentation.
(Actually you could just use year.sort() to sort the list in-place ... bisect.insort() might be marginally more efficient for building the list from the input stream in lieu of your call to year.append() ... but my point about using the `bisect module remains).
Also note that 600 items is trivial for modern computing platforms. Even 6,000 won't take but a few milliseconds. On my laptop sorting 600,000 random integers takes about 180ms and similar sized strings still takes under 200ms.
So you're probably not gaining anything by sorting this list in this application at that data scale.
On the other hand Python also includes a number of modules in its standard libraries for managing structured data and data files. For example you could use Python: SQLite3.
Using this you'd use standard SQL DDL (data definition language) to describe your data structure and schema, SQL DML (data manipulation language: INSERT, UPDATE, and DELETE statements) to manage the contents of the data and SQL queries to fetch data from it. Your data can be returned sorted on any column and any mixture of ascending and descending on any number of columns with the standard SQL ORDER BY clauses and you can add indexes to your schema to ensure that the data is stored in a manner to enable efficient querying and traversal (table scans) in any order on any key(s) you choose.
Because Python includes SQLite in its standard libraries, and because SQLite provides SQL client/server semantics over simple local files ... there's almost no downside to using it for structured data. It's not like you have to install and maintain additional software, servers, handle network connections to a remote database server nor any of that.
I'm going to walk through some steps before getting to the answer.
You need to post a [mcve]. Instead of telling us to read from "Year1.txt", which we don't have, you need to put the list itself in the code. Do you NEED 600 entries to get the error in your code? No. This is sufficient:
year = ["2001", "2002", "2003"]
If you really need 600 entries, then provide them. Either post the actual data, or
year = [str(x) for x in range(2017-600, 2017)]
The code you post needs to be Cut, Paste, Boom - reproduces the error on my computer just like that.
selectionSort is completely irrelevant to the question, so delete it from the question entirely. In fact, since you say the input was already sorted, I'm not sure what selectionSort is actually supposed to do in your code, either. :)
Next you say testlist = [].append(year). USE YOUR DEBUGGER before you ask here. Simply looking at the value in your variable would have made a problem obvious.
How to append list to second list (concatenate lists)
Fixing that means you now have a list of things to search. Before you were searching a list to see if 2014 matched the one thing in there, which was a complete list of all the years.
Now we get into binarySearch. If you look at the variables, you see you are comparing the integer 2014 with some string, maybe "1716", and the answer to that is useless, if it even lets you do that (I have python 2.7 so I am not sure exactly what you get there). But the point is you can't find the integer 2014 in a list of strings, so it will always return False.
If you don't have a debugger, then you can place strategic print statements like
print ("debug info: binarySearch comparing ", item, alist[midpoint])
Now here, what VBB said in comments worked for me, after I fixed the other problems. If you are searching for something that isn't even in the list, and expecting True, that's wrong. Searching for "2014" returns True, if you provide the correct list to search. Alternatively, you could force it to string and then search for it. You could force all the years to int during the input phase. But the int 2014 is not the same as the string "2014".
I have a dict with 50,000,000 keys (strings) mapped to a count of that key (which is a subset of one with billions).
I also have a series of objects with a class set member containing a few thousand strings that may or may not be in the dict keys.
I need the fastest way to find the intersection of each of these sets.
Right now, I do it like this code snippet below:
for block in self.blocks:
#a block is a python object containing the set in the thousands range
#block.get_kmers() returns the set
count = sum([kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts)])
#kmerCounts is the dict mapping millions of strings to ints
From my tests so far, this takes about 15 seconds per iteration. Since I have around 20,000 of these blocks, I am looking at half a week just to do this. And that is for the 50,000,000 items, not the billions I need to handle...
(And yes I should probably do this in another language, but I also need it done fast and I am not very good at non-python languages).
There's no need to do a full intersection, you just want the matching elements from the big dictionary if they exist. If an element doesn't exist you can substitute 0 and there will be no effect on the sum. There's also no need to convert the input of sum to a list.
count = sum(kmerCounts.get(x, 0) for x in block.get_kmers())
Remove the square brackets around your list comprehension to turn it into a generator expression:
sum(kmerCounts[x] for x in block.get_kmers().intersection(kmerCounts))
That will save you some time and some memory, which may in turn reduce swapping, if you're experiencing that.
There is a lower bound to how much you can optimize here. Switching to another language may ultimately be your only option.
I'm currently writing a program in Python to track statistics on video games. An example of the dictionary I'm using to track the scores :
ten = 1
sec = 9
fir = 10
thi5 = 6
sec5 = 8
games = {
'adom': [ten+fir+sec+sec5, "Ancient Domain of Mysteries"],
'nethack': [fir+fir+fir+sec+thi5, "Nethack"]
}
Right now, I'm going about this the hard way, and making a big long list of nested ifs, but I don't think that's the proper way to go about it. I was trying to figure out a way to sort the dictionary, via the arrays, and then, finding a way to display the first ten that pop up... instead of having to work deep in the if statements.
So... basically, my question is : Do you have any ideas that I could use to about making this easier, instead of wayyyy, way harder?
===== EDIT ====
the ten+fir produces numbers. I want to find a way to go about sorting the lists (I lack the knowledge of proper terminology) to go by the number (basically, whichever ones have the highest number in the first part of the array go first.
Here's an example of my current way of going about it (though, it's incomplete, due to it being very tiresome : Example Nests (paste2) (let's try this one?)
==== SECOND EDIT ====
In case someone doesn't see my comment below :
ten, fir, et cetera - these are just variables for scores. Basically, it goes from a top ten list into a variable number.
ten = 1, nin = 2, fir = 10, fir5 = 10, sec5 = 8, sec = 9...
so : 'adom': [ten+fir+sec+sec5, "Ancient Domain of Mysteries"] actually registers as : 'adom': [1+10+9+8, "Ancient Domain of Mysteries"] , which ends up looking like :
'adom': [28, "Ancient Domain of Mysteries"]
So, basically, if I ended up doing the "top two" out of my example, it'd be :
((1)) Nethack (48)
((2)) ADOM (28)
I'd write an actual number, but I'm thinking of changing a few things up, so the numbers might be a touch different, and I wouldn't want to rewrite it.
== THIRD (AND HOPEFULLY THE FINAL) EDIT ==
Fixed my original code example.
How about something like this:
scores = games.items()
scores.sort(key = lambda key, value: value[0])
return scores[:10]
This will return the first 10 items, sorted by the first item in the array.
I'm not sure if this is what you want though, please update the question (and fix the example link) if you need something else...
import heapq
return heapq.nlargest(10, games.iteritems(), key=lambda k, v: v[0])
is the most direct way to get the top ten key / value pairs, sorted by the first item of each "value" list. If you can define more precisely what output you want (just the names, the name / value pairs, or what else?) and the sorting criterion, this is easy to adjust, of course.
Wim's solution is good, but I'd say that you should probably go the extra mile and push this work off onto a database, rather than relying on Python. Python interfaces well with most types of databases, where much of what you're exploring is already a solved problem.
For example, instead of worrying about shifting your dictionaries to various other data types in order to properly sort them, you can simply get all the data for each pertinent entry pre-sorted based on the criteria of your query. There goes the need for convoluted sorting and resorting right there.
While dictionaries are tempting to use, because they give the illusion of database-like abilities to access data based on its attributes, I still think they stumble quite a bit with respect to implementation. I don't really have any numbers to throw at you, but just from personal experience, anything you do on Python when it comes to manipulating large amounts of data, you can do much faster and more efficient both in code and computation with something like MySQL.
I'm not sure what you have planned as far as the structure of your data goes, but along with adding data, changing its structure is a lot easier using a database, too.