Disturbing odd behavior/bug in Python itertools groupby? - python

I am using itertools.groupby to parse a short tab-delimited text file. The file has several columns, and all I want to do is group all the entries that have a particular value x in a particular column. The code below does this for a column called name2, looking for the value in variable x. I tried to do this using csv.DictReader and itertools.groupby. In the table, there are 8 rows that match this criterion, so 8 entries should be returned. Instead, groupby returns two sets of entries, one with a single entry and another with 7, which seems like the wrong behavior. I do the matching manually below on the same data and get the right result:
import itertools, operator, csv
col_name = "name2"
x = "ENSMUSG00000002459"
print "looking for entries with value %s in column %s" %(x, col_name)
print "groupby gets it wrong: "
data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames)
for name, entries in itertools.groupby(data, key=operator.itemgetter(col_name)):
    if name == "ENSMUSG00000002459":
        wrong_result = [e for e in entries]
        print "wrong result has %d entries" %(len(wrong_result))
print "manually grouping entries is correct: "
data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames)
correct_result = []
for row in data:
    if row[col_name] == "ENSMUSG00000002459":
        correct_result.append(row)
print "correct result has %d entries" %(len(correct_result))
The output I get is:
looking for entries with value ENSMUSG00000002459 in column name2
groupby gets it wrong:
wrong result has 7 entries
wrong result has 1 entries
manually grouping entries is correct:
correct result has 8 entries
What is going on here? If groupby is really grouping, it seems like I should only get one set of entries per x, but instead it returns two. I cannot figure this out. EDIT: Ah, got it: the data should be sorted first.

You're going to want to change your code to force the data to be in key order:
data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames)
sorted_data = sorted(data, key=operator.itemgetter(col_name))
for name, entries in itertools.groupby(sorted_data, key=operator.itemgetter(col_name)):
    pass # whatever
The main use of groupby, though, is when the dataset is large and already in key order. When you would have to sort anyway, using a defaultdict is more efficient:
from collections import defaultdict
name_entries = defaultdict(list)
for row in data:
    name_entries[row[col_name]].append(row)

According to the documentation, groupby() groups only consecutive occurrences of the same key.
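A small illustration of that point, written for Python 3 and using made-up rows rather than the asker's file: the same key comes back as two separate groups until the input is sorted by that key.

```python
import itertools
import operator

# Hypothetical rows standing in for the tab-delimited file; "A" is not contiguous.
rows = [
    {"name2": "A", "value": 1},
    {"name2": "B", "value": 2},
    {"name2": "A", "value": 3},
]
get_name = operator.itemgetter("name2")

# Unsorted input: groupby starts a new group every time the key changes.
unsorted_groups = [
    (key, len(list(group)))
    for key, group in itertools.groupby(rows, key=get_name)
]

# Sorted input: equal keys are adjacent, so each key yields exactly one group.
sorted_groups = [
    (key, len(list(group)))
    for key, group in itertools.groupby(sorted(rows, key=get_name), key=get_name)
]

print(unsorted_groups)  # [('A', 1), ('B', 1), ('A', 1)]
print(sorted_groups)    # [('A', 2), ('B', 1)]
```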

I don't know what your data looks like, but my guess is that it's not sorted. groupby only gives one group per key when the data is sorted by that key.

Related

How to return the name of the someone with the largest value

I am new to Python and have a question about this code (and am not allowed to use pandas). I have a CSV file (called personal_info.csv) with different columns labeled full_name, weight_kg, height_m. I am trying to create a function that will find the largest number in the weight_kg column and return the corresponding name.
The numbers are floats and I am trying to return a string (the name).
This is all I have so far. I am not sure how to go to a specific column in the csv file and how to find the corresponding name.
for number in heights:
    if number > largest_number:
        largest_number = number
print(largest_number)
I also saw this code but am having trouble applying it here
def max_employment(countries, employment):
    max_value_index = np.argmax(employment)
    max_value = employment[max_value_index]
    max_country = countries[max_value_index]
    return (max_country, max_value)
I would recommend using enumerate (or range) in your loop for finding the largest number. enumerate (or range) gives you the index of each item, which you also save whenever you find a larger number. That is, you need to store two values: the largest value and the index of the largest value. The name at that index is then the name you are looking for.
This page nicely describes the reasoning behind enumerate, range and indices:
https://realpython.com/python-enumerate/
Once you master this, you can try other approaches, such as creating a dictionary with names as keys and the numbers as values and using the max function, or using the numpy library. However, all of these follow the same logic as described.
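As a rough sketch of the enumerate approach described above: the two lists here are invented stand-ins for the name and weight columns, and the variable names are only for illustration.

```python
# Hypothetical parallel lists standing in for the CSV columns.
names = ["Bob", "Carol", "Ted", "Alice"]
weights = [50.5, 49.2, 51.1, 48.2]

# Track both the largest value seen so far and the index where it was seen.
largest_index = 0
largest_weight = weights[0]
for index, weight in enumerate(weights):
    if weight > largest_weight:
        largest_weight = weight
        largest_index = index

# The name at the saved index belongs to the largest value.
heaviest_name = names[largest_index]
print(heaviest_name, largest_weight)  # Ted 51.1
```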
Suppose you have this csv file:
Name,Wt,Ht
Bob,50.5,60.1
Carol,49.2,55.2
Ted,51.1,59.9
Alice,48.2,54.33
You can read the file with csv module:
import csv
with open(fn) as f:
    reader=csv.reader(f)
    header=next(reader)
    data=[[row[0]]+list(map(float, row[1:])) for row in reader]
Result:
>>> data
[['Bob', 50.5, 60.1], ['Carol', 49.2, 55.2], ['Ted', 51.1, 59.9], ['Alice', 48.2, 54.33]]
Then you can get the row with the maximum of some value using the max function with a key function:
>>> max(data, key=lambda row: row[1])
['Ted', 51.1, 59.9]
You can use the header value in header to get the index of the column in the csv data:
>>> max(data, key=lambda row: row[header.index('Wt')])
# same row
Then subscript if you just need the name:
>>> max(data, key=lambda row: row[1])[0]
'Ted'
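The same idea can be wrapped in a function that uses the column names from the question. This is only a sketch: the function name name_of_heaviest is made up, and an in-memory file with invented rows stands in for personal_info.csv.

```python
import csv
import io

# Stand-in for personal_info.csv; the rows are invented for illustration.
sample = io.StringIO(
    "full_name,weight_kg,height_m\n"
    "Bob,50.5,1.80\n"
    "Ted,51.1,1.75\n"
    "Alice,48.2,1.62\n"
)

def name_of_heaviest(csv_file):
    """Return the full_name from the row with the largest weight_kg."""
    reader = csv.DictReader(csv_file)  # uses the first line as column names
    heaviest = max(reader, key=lambda row: float(row["weight_kg"]))
    return heaviest["full_name"]

result = name_of_heaviest(sample)
print(result)  # Ted
```

With a real file you would pass the object returned by open("personal_info.csv", newline="") instead of the StringIO.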

How do you modify a dictionary from a text file when you only need to get specific values?

So say we have some sort of file with maybe like 6 columns, and 6 rows. If I wanted to get one specific column that reads one line, and modify a current dictionary I have, how would I approach that?
The output should be all the data, with the key being the second column and the two values being the first and fourth columns.
Can somebody please help me start this off?
I've tried using:
for line in file:
    (key, val) = line.split()
    data[int(key)] = val
print (data)
But, obviously this'll fail, since this expects only 2 values. I need 1 key value, and 2 value values.
split returns a list. Use the list instead of expanding into named variables.
data = {}
for line in file:
    row = line.strip().split()
    data[int(row[1])] = row[0], row[3]
print (data)
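To see it run end to end, here is a self-contained version of the same loop. The sample lines are invented, and the length check is an addition of mine to skip blank or short lines rather than something the question asked for.

```python
import io

# Invented whitespace-separated sample; a real file object would work the same way.
sample_file = io.StringIO(
    "alpha 10 x red\n"
    "beta 20 y green\n"
    "gamma 30 z blue\n"
)

data = {}
for line in sample_file:
    row = line.split()
    if len(row) < 4:
        continue  # skip blank or short lines instead of raising IndexError
    # Key: second column as an int. Value: (first column, fourth column).
    data[int(row[1])] = (row[0], row[3])

print(data)  # {10: ('alpha', 'red'), 20: ('beta', 'green'), 30: ('gamma', 'blue')}
```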

Table/Data manipulation with Python Dictionary

I need help finishing up this python script. I'm an intern at a company, and this is my first week. I was asked to develop a python script that will take a .csv and put(append) any related columns into one column so that they have only the 15 or so necessary columns with the data in them. For example, if there are zip4, zip5, or postal code columns, they want those to all be underneath the zip code column.
I just started learning python this week as I was doing this project so please excuse my noobish question and vocabulary. I'm not looking for you guys to do this for me. I'm just looking for some guidance. In fact, I want to learn more about python, so anyone who could lead me in the right direction, please help.
I'm using dictionary keys and values. The keys are every column in the first row. The values of each key are the remaining rows (second through 3000ish). Right now, I'm only getting one key:value pair. I'm only getting the final row as my array of values, and I'm only getting one key. Also, I'm getting a KeyError message, so my keys aren't being identified correctly. My code so far is underneath. I'm gonna keep working on this, and any help is immensely appreciated! Hopefully, I can buy the person who helps me a beer and pick their brain a little :)
Thanks for your time
# To be able to read csv formatted files, we will first have to import the csv module
import csv
# cols = line.split(',')# each column is split by a comma
#read the file
CSVreader = csv.reader(open('N:/Individual Files/Jerry/2013 customer list qc, cr, db, gb 9-19-2013_JerrysMessingWithVersion.csv', 'rb'), delimiter=',', quotechar='"')
# define open dictionary
SLSDictionary={}# no empty dictionary. Need column names to compare to.
i=0
#top row are your keys. All other rows are your values
#adjust loop
for row in CSVreader:
    # multiple loops needed here
    if i == 0:
        key = row[i]
    else:
        [values] = [row[1:]]
        SLSDictionary = dict({key: [values]}) # Dictionary is keys and array of values
    i=i+1
#print Dictionary to check errors and make sure dictionary is filled with keys and values
print SLSDictionary
# SLSDictionary has key of zip/phone plus any characters
#SLSDictionary.has_key('zip.+')
SLSDictionary.has_key('phone.+')
#value of key are set equal to x. Values of that column set equal to x
#[x]=value
#IF SLSDictionary has the key of zip plus any characters, move values to zip key
#if true:
#    SLSDictionary['zip'].append([x])
#SLSDictionary['phone_home'].append([value]) # I need to append the values of the specific column, not all columns
#move key's values to correct, corresponding key
SLSDictionary['phone_home'].append(SLSDictionary[has_key('phone.+')])#Append the values of the key/column 'phone plus characters' to phone_home key/column in SLSDictionary
#if false:
#    print ''
#    go to next key
SLSDictionary.has_value('')
if true:
    print 'Error: No data in column'
# if there's no data in rows 1-?. Delete column
#if value <= 0:
#    del column
print SLSDictionary
Found a couple of errors just quickly looking at it. One thing you need to watch out for is that you're assigning a new value to the existing dictionary every time:
SLSDictionary = dict({key: [values]})
You're re-assigning a new value to your SLSDictionary every time it enters that loop. Thus at the end you only have the bottom-most entry. To add a key to the dictionary you do the following:
SLSDictionary[key] = values
Also you shouldn't need the brackets in this line:
[values] = [row[1:]]
Which should instead just be:
values = row[1:]
Most importantly, you will only ever have one key: key is assigned only on the first pass through the loop (when i == 0), and only from row[0], and i is incremented on every pass after that. So everything is constantly assigned to that single key. Without a sample of how the CSV looks, I can't tell you exactly how to restructure the loop so that it catches all the keys.
Assuming your CSV is like this as you've described:
Col1, Col2, Col3, Col4
Val1, Val2, Val3, Val4
Val11, Val22, Val33, Val44
Val111, Val222, Val333, Val444
Then you probably want something like this:
dummy = [["col1", "col2", "col3", "col4"],
         ["val1", "val2", "val3", "val4"],
         ["val11", "val22", "val33", "val44"],
         ["val111", "val222", "val333", "val444"]]
column_index = []
SLSDictionary = {}
for each in dummy[0]:
    column_index.append(each)
    SLSDictionary[each] = []
for each in dummy[1:]:
    for i, every in enumerate(each):
        try:
            if column_index[i] in SLSDictionary.keys():
                SLSDictionary[column_index[i]].append(every)
        except:
            pass
print SLSDictionary
Which Yields...
{'col4': ['val4', 'val44', 'val444'], 'col2': ['val2', 'val22', 'val222'], 'col3': ['val3', 'val33', 'val333'], 'col1': ['val1', 'val11', 'val111']}
If you want the columns to stay in order, change the dictionary type to collections.OrderedDict.
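For the merging step the question describes, something along these lines might be a starting point. It is a Python 3 sketch under assumptions: the sample rows and the column names zip5 and postal_code are invented, and "merge" is taken to mean "use the first non-empty value in each row".

```python
import csv
import io

# Invented sample; the column names zip5 and postal_code are assumptions.
sample = io.StringIO(
    "name,zip5,postal_code\n"
    "Ann,12345,\n"
    "Raj,,H2X 1Y4\n"
)

reader = csv.reader(sample)
header = next(reader)  # the first row supplies the keys

# One list of values per column name.
columns = {name: [] for name in header}
for row in reader:
    for name, value in zip(header, row):
        columns[name].append(value)

# Merge related columns row by row: take the first non-empty value.
related = ["zip5", "postal_code"]
columns["zip"] = [
    next((value for value in values if value), "")
    for values in zip(*(columns[name] for name in related))
]
for name in related:
    del columns[name]

print(columns)  # {'name': ['Ann', 'Raj'], 'zip': ['12345', 'H2X 1Y4']}
```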

Counting how many unique identifiers there are by merging two columns of data?

I'm trying to make a really simple counting script, I guess using defaultdict (I can't get my head around how to use defaultdict, so if someone could post a commented snippet of code I would greatly appreciate it).
My objective is to take element 0 and element 1, merge them into a single string and then to count how many unique strings there are...
For example, in the data below there are 15 lines consisting of 3 classes and 4 class IDs, which when merged together give only 4 unique strings. The merged data for the first line (ignoring the title row) is: Class01CD2
CSV Data:
uniq1,uniq2,three,four,five,six
Class01,CD2,data,data,data,data
Class01,CD2,data,data,data,data
Class01,CD2,data,data,data,data
Class01,CD2,data,data,data,data
Class02,CD3,data,data,data,data
Class02,CD3,data,data,data,data
Class02,CD3,data,data,data,data
Class02,CD3,data,data,data,data
Class02,CD3,data,data,data,data
Class02,CD3,data,data,data,data
Class02,CD3,data,data,data,data
DClass2,DE2,data,data,data,data
DClass2,DE2,data,data,data,data
Class02,CD1,data,data,data,data
Class02,CD1,data,data,data,data
The idea of it is to simply print out how many unique classes are available.
Anyone able to help me work this out?
Regards
- Hyflex
Since you are dealing with CSV data, you can use the CSV module along with dictionaries:
import csv
uniq = {} #Create an empty dictionary, which we will use as a hashmap as Python dictionaries support key-value pairs.
ifile = open('data.csv', 'r') #whatever your CSV file is named.
reader = csv.reader(ifile)
for row in reader:
    joined = row[0] + row[1] #The joined string is simply the first and second columns in each row.
    #Check to see that the key exists, if it does increment the occurrence by 1
    if joined in uniq:
        uniq[joined] += 1
    else:
        uniq[joined] = 1 #This means the key doesn't exist, so add the key to the dictionary with an occurrence of 1
print uniq #Now output the results
This outputs:
{'Class02CD3': 7, 'Class02CD1': 2, 'Class01CD2': 4, 'DClass2DE2': 2}
NOTE: This is assuming that the CSV doesn't have the header row (uniq1,uniq2,three,four,five,six).
REFERENCES:
http://docs.python.org/2/library/stdtypes.html#dict
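Since the question asked about defaultdict specifically, here is roughly how the same count could look with it in Python 3. The sample is a shortened, invented version of the CSV, and the header row is skipped explicitly.

```python
import csv
import io
from collections import defaultdict

# A shortened, invented version of the CSV from the question.
sample = io.StringIO(
    "uniq1,uniq2,three\n"
    "Class01,CD2,data\n"
    "Class01,CD2,data\n"
    "Class02,CD3,data\n"
    "Class02,CD1,data\n"
)

reader = csv.reader(sample)
next(reader)  # skip the header row

# defaultdict(int) creates missing keys with the value 0,
# so no "does the key exist?" check is needed before incrementing.
counts = defaultdict(int)
for row in reader:
    counts[row[0] + row[1]] += 1

print(len(counts))   # 3 unique merged strings in this sample
print(dict(counts))  # {'Class01CD2': 2, 'Class02CD3': 1, 'Class02CD1': 1}
```

If only the number of unique strings matters, a plain set of the merged strings would do; collections.Counter is another option when the per-string counts are wanted.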

CSV find max in column and append new data

I asked a question about two hours ago regarding the reading and writing of data from a website. I've spent the last two hours since then trying to find a way to read the maximum date value from column 'A' of the output, comparing that value to the refreshed website data, and appending any new data to the csv file without overriding the old ones or creating duplicates.
The code that is currently 100% working is this:
import requests
symbol = "mtgoxUSD"
url = 'http://api.bitcoincharts.com/v1/trades.csv?symbol={}'.format(symbol)
data = requests.get(url)
with open("trades_{}.csv".format(symbol), "r+") as f:
    f.write(data.text)
I've tried various ways of finding the maximum value of column 'A'. I've tried a bunch of different ways of using "Dict" and other methods of sorting/finding max, and even using pandas and numpy libs. None of which seem to work. Could someone point me in the direction of a decent way to find the maximum of a column from the .csv file? Thanks!
If you have it in a pandas DataFrame, you can get the max of any column like this:
>>> max(data['time'])
'2012-01-18 15:52:26'
where data is the variable name for the DataFrame and time is the name of the column.
I'll give you two answers, one that just returns the max value, and one that returns the row from the CSV that includes the max value.
import csv
import operator as op
import requests
symbol = "mtgoxUSD"
url = 'http://api.bitcoincharts.com/v1/trades.csv?symbol={}'.format(symbol)
csv_file = "trades_{}.csv".format(symbol)
data = requests.get(url)
with open(csv_file, "w") as f:
    f.write(data.text)
with open(csv_file) as f:
    next(f) # discard first row from file -- see notes
    max_value = max(row[0] for row in csv.reader(f))
with open(csv_file) as f:
    next(f) # discard first row from file -- see notes
    max_row = max(csv.reader(f), key=op.itemgetter(0))
Notes:
max() can directly consume an iterator, and csv.reader() gives us an iterator, so we can just pass that in. I'm assuming you might need to throw away a header line so I showed how to do that. If you had multiple header lines to discard, you might want to use islice() from the itertools module.
In the first one, we use a "generator expression" to select a single value from each row, and find the max. This is very similar to a "list comprehension" but it doesn't build a whole list, it just lets us iterate over the resulting values. Then max() consumes the iterable and we get the max value.
max() can use a key= argument where you specify a "key function". It will use the key function to get a value and use that value to figure the max... but the value returned by max() will be the unmodified original value (in this case, a row value from the CSV). In this case, the key function is manufactured for you by operator.itemgetter()... you pass in which column you want, and operator.itemgetter() builds a function for you that gets that column.
The resulting function is the equivalent of:
def get_col_0(row):
    return row[0]
max_row = max(csv.reader(f), key=get_col_0)
Or, people will use lambda for this:
max_row = max(csv.reader(f), key=lambda row: row[0])
But I think operator.itemgetter() is convenient and nice to read. And it's fast.
I showed saving the data in a file, then pulling from the file again. If you want to go through the data without saving it anywhere, you just need to iterate over it by lines.
Perhaps something like:
text = data.text
rows = [line.split(',') for line in text.split("\n") if line]
rows.pop(0) # get rid of first row from data
max_value = max(row[0] for row in rows)
max_row = max(rows, key=op.itemgetter(0))
I don't know which column you want... column "A" might be column 0 so I used 0 in the above. Replace the column number as you like.
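For the "append only the new data" part of the question, one possible sketch in Python 3 is below. It rests on assumptions that the question does not confirm: column 0 holds an integer timestamp, there is no header row, and the helper name newer_rows is made up. It also converts the column to int before comparing, because max() on the raw strings compares them character by character rather than numerically.

```python
import csv
import io

def newer_rows(existing_rows, fetched_rows, column=0):
    """Return fetched rows whose timestamp exceeds the largest one already stored."""
    latest = max((int(row[column]) for row in existing_rows), default=None)
    if latest is None:
        return list(fetched_rows)  # nothing stored yet, so everything is new
    return [row for row in fetched_rows if int(row[column]) > latest]

# Invented data: what is already on disk, and what the site returned this time.
existing = list(csv.reader(io.StringIO("100,5.0,1\n200,5.1,2\n")))
fetched = list(csv.reader(io.StringIO("200,5.1,2\n300,5.2,1\n")))

to_append = newer_rows(existing, fetched)
print(to_append)  # [['300', '5.2', '1']]

# Appending would then use mode "a" so earlier rows are not overwritten, e.g.:
# with open(csv_file, "a", newline="") as f:
#     csv.writer(f).writerows(to_append)
```

One limitation of comparing on the timestamp alone: a new trade that shares its timestamp with the latest stored row would be skipped.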
It seems like something like this should work:
import requests
import csv
symbol = "mtgoxUSD"
url = 'http://api.bitcoincharts.com/v1/trades.csv?symbol={}'.format(symbol)
data = requests.get(url)
with open("trades_{}.csv".format(symbol), "r+") as f:
    all_values = list(csv.reader(f))
    max_value = max([int(row[2]) for row in all_values[1:]])
    # (write out the value?)
EDITS: I used "row[2]" because that was the sample column I was taking max of in my csv. Also, I had to strip off the column headers, which were all text, which was why I looked at "all_values[1:]" from the second row to the end of the file.
