Display the top 2 highest difference car records - python

How can I display the 2 rows with the highest difference from a text file in Python?
For example, here is a text file:
Mazda 64333 53333
Merce 74321 54322
BMW 52211 31432
The expected output would be
Merce 74321 54322
BMW 52211 31432
I tried several approaches but only managed to display the difference itself, not the whole row.

would this work for you?
from operator import itemgetter
with open("x.txt", "r+") as data:
    data = [i.split() for i in data.readlines()]
top = sorted([[row[0], int(row[1])-int(row[2])] for row in data], key=itemgetter(1), reverse=True)
print(top)
print(top[:2])
[['BMW', 20779], ['Merce', 19999], ['Mazda', 11000]]
[['BMW', 20779], ['Merce', 19999]]
So, at a glance, this might seem slightly complicated, but it's really not!
Let's break down each step of the following program:
from operator import itemgetter
with open("x.txt", "r+") as data:
    data = [i.split() for i in data.readlines()]
top = sorted([[row[0], int(row[1])-int(row[2])] for row in data], key=itemgetter(1), reverse=True)
Now let's first note that operator is a module in Python's standard library, not an external package such as requests, and itemgetter is a pretty straightforward function.
with open("x.txt", "r+") as data should be pretty straightforward as well: it opens a text file and stores the file object in data. The "r+" mode allows both reading and writing; since we only read here, plain "r" would also work.
We then use our first list comprehension, which might look new to you:
data = [i.split() for i in data.readlines()]
All this does is go through each line, for example car 123 122, and split it on whitespace into a list like ["car", "123", "122"].
Now if you look closely at the result, there's something wrong. The last 2 elements (which need to be integers to find the difference) are strings! That's why we need the next list comprehension to change that.
top = sorted([[row[0], int(row[1])-int(row[2])]for row in data],key=itemgetter(1), reverse=True)
This is a bit more complicated... but all it's really doing is sorting a simple list comprehension.
It goes through each value in data and gets the differences! Let's see how it does that.
As you know, our data looks something like [["car", "123", "122"], ["car1", "1234", "1223"]] right now. So we can access the numeric strings of ["car", "123", "122"] with [1] and [2]. With this knowledge we can loop through the data and get the difference of those values once they are cast to integers. E.g. int(row[1])-int(row[2]) for ["car", "123", "122"] returns 1 (the difference).
With this knowledge, we can create a new list with a comprehension that contains the car's name row[0] and the difference int(row[1])-int(row[2]), represented by [row[0], int(row[1])-int(row[2])] in the list comp. Using row for each element of data, we can easily form this. Here's that list comprehension by itself:
[[row[0], int(row[1])-int(row[2])] for row in data]
Finally, we have arrived at the last piece of this little program: the sorted function! sorted() returns a new sorted list based on the key you give it (and you can use reverse=True to put the greatest values first). It's really not that hard to understand when it's abbreviated as follows:
sorted([the list comprehension],key=itemgetter(1), reverse=True)
So while you might know that it's sorting the list comprehension we made and listing the biggest values first, you might not know how it's sorting this! To find out, we need to look at the key.
itemgetter() is a function in the operator module. All you really need to know about it is that itemgetter(1) fetches the item at index 1 (the second element) of each list, and sorted() orders the lists by that value. If you recall, each element of our data looks like ["car", difference] (difference is just a placeholder for the actual integer difference). Since we want the greatest differences, it makes sense to sort by them, right?
Using itemgetter(1), it will sort by index 1: the difference! And that pretty much sums it up :)
We store all of that in the variable top and then print the first two elements with print(top[:2]).
I hope this helped!
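The program above prints only the name and the difference, while the question asks for the whole row. A minimal sketch of one way to keep the full line is to sort the original lines and compute the difference only inside the sort key. The names lines, top_rows and difference are made up for this illustration, and the list stands in for the contents of x.txt:

```python
# Sample lines standing in for the contents of x.txt (taken from the question).
lines = ["Mazda 64333 53333", "Merce 74321 54322", "BMW 52211 31432"]


def top_rows(lines, n=2):
    """Return the n lines whose two numbers differ the most, largest first."""

    def difference(line):
        # Each line is assumed to look like "<name> <number> <number>".
        _, first, second = line.split()
        return int(first) - int(second)

    # Sort the untouched lines by their difference and keep the first n.
    return sorted(lines, key=difference, reverse=True)[:n]


for line in top_rows(lines):
    print(line)
```

With the sample data this prints the BMW row first and the Merce row second, because the rows are ordered by difference rather than by their position in the file.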

Create a dict that contains the difference for each row, with the car brand as the key.
Then you can sort dict.items() by the values and return the top 2.
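A rough sketch of that idea, assuming every brand appears only once in the file (a repeated brand would overwrite the earlier entry). The variable names are illustrative, and the list stands in for the lines of the text file:

```python
# Sample lines standing in for the text file from the question.
lines = ["Mazda 64333 53333", "Merce 74321 54322", "BMW 52211 31432"]

# Map each brand to the difference between its two numbers.
differences = {}
for line in lines:
    brand, first, second = line.split()
    differences[brand] = int(first) - int(second)

# Sort the (brand, difference) pairs by difference, largest first, keep two.
top_two = sorted(differences.items(), key=lambda item: item[1], reverse=True)[:2]
print(top_two)
```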

Related

Dictionary created from zip() only contains two records instead of 1000+

I am currently doing the US Medical Insurance Cost Portfolio Project through Codecademy and am having trouble combining two lists into a single dictionary. I created two new lists (smoking_status and insurance_costs) in the hope of investigating how insurance costs differ between smokers and non-smokers. When I try to zip these two lists together, however, the new dictionary only has two entries. It should have well over a thousand. Where did I go wrong? Below is my code and output. It is worth noting that the output seen is the last two data points in my original csv file.
import csv
insurance_db = []
with open('insurance.csv', newline='') as insurance_csv:
    insurance_reader = csv.DictReader(insurance_csv)
    for row in insurance_reader:
        insurance_db.append(row)
smoking_status = []
for person in insurance_db:
    smoking_status.append(person.get('smoker'))
insurance_costs = []
for person in insurance_db:
    insurance_costs.append(person.get('charges'))
smoker_dictionary = {key:value for key,value in zip(smoking_status,insurance_costs)}
print(smoker_dictionary)
Output:
{'yes': '29141.3603', 'no': '2007.945'}
A key can't appear twice in a dictionary, so each later value overwrites the earlier one stored under the same key.
If I understand correctly, there are 2 statuses, "yes" and "no".
If so, then the "correct" dictionary structure would be:
{'yes':[...], 'no': [...]}
You can create a dictionary of empty lists as:
{status: [] for status in set(smoking_status)}
And then loop over your zipped lists and append each cost to the correct key.
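Put together, the steps above might look like the following sketch. The two short lists are invented sample data standing in for the lists built from insurance.csv, and converting the charges to float is an extra assumption about what you want to do with them:

```python
# Made-up sample data; the real lists come from insurance.csv.
smoking_status = ["yes", "no", "no", "yes"]
insurance_costs = ["100.5", "20.0", "30.25", "200.0"]

# One empty list per distinct status.
costs_by_status = {status: [] for status in set(smoking_status)}

# Append every cost to the list that belongs to its status.
for status, cost in zip(smoking_status, insurance_costs):
    costs_by_status[status].append(float(cost))

print(costs_by_status)
```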
Why are you looping three times over basically the input anyway?
entries = []
with open('insurance.csv', newline='') as insurance_csv:
    for row in csv.DictReader(insurance_csv):
        entries.append([row["smoker"], row["charges"]])
We can't guess what your expected output should be; zipping two lists should create one list with the rows in each list joined up, so that's what this code does. If you want a dictionary instead, or a list of dictionaries, you'll probably want to combine the data from each input row differently, but the code should otherwise be more or less this.
To spell this out, if your input is
id,smoker,charges
1,no,123
2,no,567
3,yes,987
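then the output will be (the csv module reads every field as a string, so the charges stay strings unless you convert them)
[["no", "123"],
["no", "567"],
["yes", "987"]]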

Best way to store and manipulate data with Python

I want to find a way to write my data into a file, read it back from the file and sort it, read the sorted version of it.
Basically what I have is:
name: string
average: float
sum: float
coordinates: list of lists, contains floats. can be variable length for each name
I will sort the entries with respect to average or sum field. Then I will read the name and coordinates in the sorted order.
I tried writing a dictionary of dictionaries to a json; however, I couldn't really sort it after reading it back and couldn't manipulate it as I wanted to. My dictionary was like
big_dictionary = {"name1":{"avg":0.1, "sum":0.2, "coordinates":[[0,1,2,3],[4,5,6,7]]}, "name2":{....}}
I also tried csv; but I couldn't read the data back with its original data types (for instance, I couldn't read the list of lists back as a list of lists).
big_list = [[name1, avg1, sum1, [coordinates1, coordinates2,...]], [name2, ...]]
I know that one option is to use pandas. I haven't tried it yet because I am not familiar with it, and I am afraid of losing even more time while struggling with its methods. If you recommend this way, I also need some more information.
What should I do in this case?
UPDATE: Also, what about ordereddict?
You could use a list of dictionaries so that sorting is easy:
data = [{"name": "name1", "avg":0.1, "sum":0.2, "coordinates":[[0,1,2,3],[4,5,6,7]]}, ..]
data.sort(key=lambda x: x["avg"]) # or sum
Sticking with JSON you could sort your data then write it as a list of dicts rather than a dict of dicts:
big_ordered_list_of_dicts = [
{"name":"name1", "avg":0.1, "sum":0.2, "coordinates":[[0,1,2,3],[4,5,6,7]]},
{"name":"name2", ... },
...,
{"name":"zzzzz", ... },
]
which will still be in the same order after writing to JSON and reading back in. It's also quite easy to re-order this list, for example
list_in_sum_order = sorted( big_ordered_list_of_dicts, key=lambda x: x['sum'] )
and it is relatively efficient, since it only builds another list; it does not copy or move the actual data in the dicts.
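As a small sketch of the round trip, the example below serialises a list of dicts to JSON text, reads it back, and sorts the result. The two records and their values are invented for illustration, and json.dumps/json.loads are used instead of a real file to keep the example self-contained (json.dump and json.load work the same way with an open file):

```python
import json

# Invented sample records with the fields described in the question.
records = [
    {"name": "name1", "avg": 0.1, "sum": 0.9, "coordinates": [[0, 1, 2, 3], [4, 5, 6, 7]]},
    {"name": "name2", "avg": 0.5, "sum": 0.2, "coordinates": [[8, 9]]},
]

# Write to JSON text and read it back; lists of lists survive as lists of lists.
text = json.dumps(records)
loaded = json.loads(text)

# Sort by the "sum" field, then read names and coordinates in that order.
by_sum = sorted(loaded, key=lambda record: record["sum"])
names = [record["name"] for record in by_sum]
for record in by_sum:
    print(record["name"], record["coordinates"])
```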

Fast way of slicing columns from tuples

I have a huge list of tuples from which I want to extract individual columns. I have tried two methods.
Assume the list is named List and I want to extract the jth column.
First one is
column=[item[j] for item in List]
Second one is
newList=zip(*List)
column=newList[j]
However both the methods are too slow since the length of the list is about 50000 and length of each tuple is about 100. Is there a faster way to extract the columns from the list?
This is something numpy does well:
import numpy as np
A = np.array(List) # this step may take a while now ... maybe you should have List as a np.array before you get to this point
sliced = A[:,[j]] # this should be really quite fast; it gives an (n, 1) column, use A[:, j] for a flat 1-D array
that said
newList=list(zip(*List)) # list() is needed on Python 3, where zip returns an iterator
column=newList[j]
takes less than a second for me with a 50kx100 tuple ... so maybe profile your code and make sure the bottleneck is actually where you think it is...
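One rough way to check where the time goes is to time both approaches from the question with the standard timeit module. This is only a sketch: the rows list is synthetic data of the size mentioned in the question, the function names are made up, and the timings will differ from machine to machine:

```python
import timeit

# Synthetic data: 50,000 tuples of 100 integers each, as in the question.
rows = [tuple(range(100)) for _ in range(50_000)]
j = 7  # index of the column to extract (arbitrary choice for the demo)


def by_comprehension():
    # Pick element j out of every row.
    return [row[j] for row in rows]


def by_zip():
    # Transpose all rows, then take the j-th transposed tuple.
    return list(zip(*rows))[j]


# Time a single call of each approach and print the seconds taken.
print("comprehension:", timeit.timeit(by_comprehension, number=1))
print("zip transpose:", timeit.timeit(by_zip, number=1))
```

If both numbers are small, the slow part of the program is probably somewhere else.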

Text search elements in a big python list

With a list that looks something like:
cell_lines = ["LN18_CENTRAL_NERVOUS_SYSTEM","769P_KIDNEY","786O_KIDNEY"]
With my dabbling in regular expressions, I can't figure out a compelling way to search individual strings in a list besides looping through each element and performing the search.
How can I retrieve the indices of elements containing "KIDNEY" in an efficient way (my list has thousands of elements)?
Make a list comprehension:
[line for line in cell_lines if "KIDNEY" in line]
This is O(n), since every item in the list is checked for KIDNEY. Note that it returns the matching strings rather than their indices.
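Since the question asks for indices, a small variation with enumerate collects the positions of the matches instead of the strings. The name kidney_indices is just illustrative:

```python
cell_lines = ["LN18_CENTRAL_NERVOUS_SYSTEM", "769P_KIDNEY", "786O_KIDNEY"]

# enumerate yields (index, item) pairs, so we can keep the index of each match.
kidney_indices = [i for i, line in enumerate(cell_lines) if "KIDNEY" in line]
print(kidney_indices)  # [1, 2]
```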
If you would need to make similar queries like this often, you should probably think about reorganizing your data and have a dictionary grouped by categories like KIDNEY:
{
"KIDNEY": ["769P_KIDNEY","786O_KIDNEY"],
"NERVOUS_SYSTEM": ["LN18_CENTRAL_NERVOUS_SYSTEM"]
}
In this case, every "by category" lookup would take "constant" time.
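A sketch of how such a dictionary could be built, under the assumption (not stated in the question) that everything after the first underscore is the category. With that rule the nervous-system key comes out as "CENTRAL_NERVOUS_SYSTEM" rather than the "NERVOUS_SYSTEM" shown above, so the splitting rule would need adjusting to match your real naming scheme:

```python
from collections import defaultdict

cell_lines = ["LN18_CENTRAL_NERVOUS_SYSTEM", "769P_KIDNEY", "786O_KIDNEY"]

# Assumption: the text after the first underscore names the category.
by_category = defaultdict(list)
for line in cell_lines:
    _, category = line.split("_", 1)
    by_category[category].append(line)

# Each later lookup is a single dictionary access.
print(by_category["KIDNEY"])
```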
For exact-match membership tests (rather than substring searches), you can use a set instead of a list, since it performs lookups in constant time on average. If the list is sorted, a binary search also works:
from bisect import bisect_left
def bi_contains(lst, item):
    """ efficient `item in lst` for sorted lists """
    # if item is larger than the last its not in the list, but the bisect would
    # find `len(lst)` as the index to insert, so check that first. Else, if the
    # item is in the list then it has to be at index bisect_left(lst, item)
    return (item <= lst[-1]) and (lst[bisect_left(lst, item)] == item)
Slightly modifying the above code will give you pretty good efficiency.
Here's a list of the data structures available in Python along with the time complexities.
https://wiki.python.org/moin/TimeComplexity

Python - List not converting to Tuple inorder to Sort

def mkEntry(file1):
    for line in file1:
        lst = (line.rstrip().split(","))
        print("Old", lst)
        print(type(lst))
        tuple(lst)
        print(type(lst)) #still showing type='list'
        sorted(lst, key=operator.itemgetter(1, 2))
def main():
    openFile = 'yob' + input("Enter the year <Do NOT include 'yob' or .'txt' : ") + '.txt'
    file1 = open(openFile)
    mkEntry(file1)
main()
TextFile:
Emma,F,20791
Tom,M,1658
Anthony,M,985
Lisa,F,88976
Ben,M,6989
Shelly,F,8975
and I get this output:
IndexError: string index out of range
I am trying to convert lst from a list to a tuple so that I can order the entries from F to M and from the smallest number to the largest. Around line 7, it's still printing type list instead of type tuple. I don't know why it's doing that.
print(type(lst))
tuple(lst)
print(type(lst)) #still showing type='list'
You're not changing what lst refers to. You create a new tuple with tuple(lst) and immediately throw it away because you don't assign it to anything. You can do:
lst = tuple(lst)
Note that this will not fix your program. Notice that your sort operation is happening once per line of your file, which is not what you want. Try collecting each line into one sequence of tuples and then doing the sort.
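A minimal sketch of that suggestion, using the sample lines from the question in place of the opened file. Converting the count to int is an added assumption so that the numbers sort numerically rather than as text; the name entries is illustrative:

```python
import operator

# Sample lines from the question, standing in for the opened file.
lines = ["Emma,F,20791", "Tom,M,1658", "Anthony,M,985",
         "Lisa,F,88976", "Ben,M,6989", "Shelly,F,8975"]

# Collect one tuple per line first ...
entries = []
for line in lines:
    name, gender, count = line.rstrip().split(",")
    entries.append((name, gender, int(count)))

# ... then sort the whole table once: by gender (index 1), then count (index 2).
entries.sort(key=operator.itemgetter(1, 2))

for entry in entries:
    print(entry)
```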
Firstly, you are not saving the tuple you created anywhere:
tup = tuple(lst)
Secondly, there is no point in making it a tuple before sorting it - in fact, a list could be sorted in place as it's mutable, while a tuple would need another copy (although that's fairly cheap, the items it contains aren't copied).
Thirdly, the IndexError has nothing to do with whether it's a list or tuple, nor whether it is sorted. It most likely comes from the itemgetter, because there's a list item that doesn't have three entries in turn - for instance, the strings "F" or "M".
Fourthly, the sort you're doing, but not saving anywhere, is done on each individual line, not the table of data. Considering this means you're comparing a name, a number, and a gender, I rather doubt it's what you intended.
It's completely unclear why you're trying to convert data types, and the code doesn't match the structure of the data. How about moving back to the overview plan and sorting out what you want done? It could well be that something like Python's csv module would help considerably.
