I'm trying to develop a very simple initial model to predict the amount of fines a nursing home might expect to pay based on its location.
This is my class definition:
#initial model to predict the amount of fines a nursing home might expect to pay based on its location
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin
class GroupMeanEstimator(BaseEstimator, RegressorMixin):
    # defines what a group is by using grouper
    # initialises an empty dictionary for group averages
    def __init__(self, grouper):
        self.grouper = grouper
        self.group_averages = {}

    # Any calculation I require for my predict method goes here.
    # Specifically, I want to group by the column grouper is set to,
    # then find the mean penalty for each group.
    # X is the data containing the groups; y is fine_totals.
    # Map each state to its mean fine_tot.
    def fit(self, X, y):
        # Use self.group_averages to store the average penalty by group
        Xy = X.join(y)  # joining X & y together
        state_mean_series = Xy.groupby(self.grouper)[y.name].mean()  # series of state: mean penalty
        # populating a dictionary with state: mean key/value pairs
        for row in state_mean_series.iteritems():
            self.group_averages[row[0]] = row[1]
        return self

    # The fine an observation is likely to receive is its group mean.
    # For each observation in X, find its group and set the likely fine
    # to that group's mean, then return the list.
    def predict(self, X):
        dictionary = self.group_averages
        group = self.grouper
        list_of_predictions = []  # initialising a list to store our return values
        for row in X.itertuples():  # iterating through each row in X
            prediction = dictionary[row.STATE]  # look up the value in group_averages using row.STATE as the key
            list_of_predictions.append(prediction)
        return list_of_predictions
It works for this
state_model.predict(data.sample(5))
But it breaks down when I try to do this:
state_model.predict(pd.DataFrame([{'STATE': 'AS'}]))
My model can't handle a state it hasn't seen during fitting, and I would like help rectifying that.
The problem I am seeing is in your fit method: iteritems basically iterates over columns rather than rows. You should use itertuples, which will give you row-wise data. Just change the loop in your fit method to
for row in pd.DataFrame(state_mean_series).itertuples():  # row format is [STATE, mean_value]
    self.group_averages[row[0]] = row[1]
and then in your predict method, just add a fail-safe check:
prediction = dictionary.get(row.STATE, None)  # None is the default value in case 'AS' doesn't exist; replace it with whatever you want
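For reference, here is a minimal sketch of how fit/predict could look with such a fallback in place (the y_mean_ attribute used as a default for unseen states is my own addition, not part of the original class):
def fit(self, X, y):
    Xy = X.join(y)
    self.group_averages = Xy.groupby(self.grouper)[y.name].mean().to_dict()
    self.y_mean_ = y.mean()  # global mean, used as a fallback for unseen groups
    return self

def predict(self, X):
    # .get() falls back to the global mean when a state (e.g. 'AS') was never seen in fit
    return [self.group_averages.get(row.STATE, self.y_mean_) for row in X.itertuples()]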
I am writing a program for data analysis. To group my results I wrote these functions:
def prepare_data(sample_list, group_option):
    if group_option == None:
        group_parameter = None
    elif group_option == 'sample':
        group_parameter = sample_parameters.sample
    elif group_option == 'parameter':
        group_parameter = (sample_parameters.speed, sample_parameters.gap, sample_parameters.temperature, sample_parameters.volume)
This function determines how to group my different samples for later calculation. This should be the only place to make entries regarding grouping.
def group_data(sample_list, group_parameter):
    query = session.query(sample_measurements_raw.data, group_parameter).filter(sample_parameters.idsample_parameters == sample_measurements_raw.sample_id).filter(sample_parameters.useless == 0).filter(sample_parameters.sample.in_(sample_list)).group_by(group_parameter)
    data_table = pandas.DataFrame()
    for row in query:
        data_table = pandas.concat([data_table, (calculate_data(sample_list, row.keys))[-1:-2]], axis = 'columns', join = 'outer')
    return data_table
This function gets a list of samples and the group parameter (consisting of ORM attributes). It then searches that list for samples with the same parameters and groups them. It then iterates through all unique parameter sets, which are passed to another function:
def calculate_data(sample_list, parameter):
    query = session.query(sample_measurements_raw.data).filter(sample_parameters.idsample_parameters == sample_measurements_raw.sample_id).filter(sample_parameters.useless == 0).filter(sample_parameters.sample.in_(sample_list)).filter(parameter)
    data_table = pandas.DataFrame()
    for row in query:
        data_table = pandas.concat([data_table, pandas.read_json(row.data, orient = 'split').set_index('nm').rename(columns = {" %T": parameter})], axis = 'columns', join = 'outer')
    data_table[parameter + '_mean'] = data_table.mean(axis = 1)
    data_table[parameter + '_std'] = data_table[data_table.columns[0:-1]].std(axis = 1)
    return data_table
This is where the problem starts. That function should receive the parameters of one unique group and filter for all of the data matching exactly those parameters. Please keep in mind that the number of supplied parameters can change.
The question:
How do I filter for those parameters I got from group_data?
I hope my question is understandable and thanks for the help!
EDIT:
A concrete example would be:
The list to start with looks something like this:
(table image omitted)
The table is still expanding, so I only want to edit group_parameter in the first function. Everything else should be derived from that.
Group the data by group_parameter and create a list. That list consists of row objects with member attributes (e.g. row.speed, row.gap). The number and names of the attributes depend on group_parameter.
Iterate through the list and fetch all data whose values match exactly the values of the list entry. Therefore the expression should look something like this: .filter(sample_parameters.speed == row.speed, sample_parameters.gap == row.gap)
Question:
How do I get an arbitrary set of parameters (e.g. sample_parameters.speed, sample_parameters.gap, ...) from that list and filter by them in the next query?
If I understand your problem correctly, your provided functions are not the problem; rather, the input for the second function needs to be pre-processed. So, to generalize your question: how do you dynamically query for changing unique values from a returned data table?
I can quickly think of using pandas.Series.unique() to solve your problem.
# iterate through the unique values of a column (a pandas Series)
for value in group_data.Parameters.unique():
    pass  # do something
However, there is probably room for performance optimization. For more information, e.g. about the arguments of pandas.Series.unique(), see the documentation.
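To address the dynamic-filter part of the question directly, here is a minimal sketch of how the filter expressions could be built from an arbitrary tuple of ORM attributes and a group row; the helper name build_filters and the exact query shape are my own assumptions based on the models named in the question:
from sqlalchemy import and_

def build_filters(group_parameter, row):
    # group_parameter is a tuple of ORM attributes, e.g.
    # (sample_parameters.speed, sample_parameters.gap, ...);
    # row carries the values of one unique parameter group.
    return [attr == getattr(row, attr.key) for attr in group_parameter]

# sketch of using it inside the grouping loop:
# for row in query:
#     filters = build_filters(group_parameter, row)
#     sub_query = session.query(sample_measurements_raw.data).filter(and_(*filters))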
I'm trying to interpolate (forward-fill) values of a table.
input: a BigQuery table with n+1 columns, where n are reading columns and the +1 is the Time column (the time when each reading was made). Most of these values are empty.
output: a BigQuery table with the same n+1 columns, where the empty values are replaced with the last known reading (empty values at the beginning of time are ignored).
This is equivalent to pandas df.fillna(method='pad').
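For illustration only, a tiny pandas example of the behaviour I mean (made-up data):
import pandas as pd

df = pd.DataFrame({'Time': [1, 2, 3, 4],
                   'reading_a': [None, 0.5, None, None],
                   'reading_b': [1.0, None, None, 2.0]})
print(df.fillna(method='pad'))
# reading_a stays NaN at Time 1 (nothing to fill from), then carries 0.5 forward;
# reading_b carries 1.0 forward until the fresh reading 2.0 at Time 4.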
I would like to run this on huge tables using Google's Dataflow service through Apache Beam.
It seems Beam is great at handling rows, but I can't find a way to handle columns. Obviously, once I have a column I can easily iterate over it and interpolate the values as I go.
I'm also not sure how memory works in Dataflow; we need to make sure it can handle the amount of data involved.
beam.io.Read(beam.io.BigQuerySource(table_path))
When reading a table from BigQuery, one gets a PCollection of rows.
How do I get a column?
Even a query returns the same...
If the forward fill you're attempting is only at the end of each column, I would suggest using a combiner to find the last populated value in each column, based upon the timestamp of the row.
ALL_MY_COLUMNS = ['foo', 'bar', ...]

class FindLastValue(core.CombineFn):
    def create_accumulator(self, *args, **kwargs):
        # first dict stores the timestamp per column, second dict stores the last value seen
        return ({}, {})

    def add_input(self, mutable_accumulator, element, *args, **kwargs):
        for column in ALL_MY_COLUMNS:
            # if the column is populated and we either haven't captured a value yet or the
            # element's timestamp is newer than the one seen so far, record it as the last known value
            if element[column] is not None and (mutable_accumulator[0].get(column) is None or mutable_accumulator[0][column] < element['timestamp']):
                mutable_accumulator[0][column] = element['timestamp']
                mutable_accumulator[1][column] = element[column]
        return mutable_accumulator

    def merge_accumulators(self, accumulators, *args, **kwargs):
        # merge accumulators by keeping, per column, the value with the newest timestamp
        merged = ({}, {})
        for accum in accumulators:
            for column in ALL_MY_COLUMNS:
                if accum[0].get(column) is not None:
                    if merged[0].get(column) is None or merged[0][column] < accum[0][column]:
                        merged[0][column] = accum[0][column]
                        merged[1][column] = accum[1][column]
        return merged

    def extract_output(self, accumulator, *args, **kwargs):
        # return a dict of column to last known value
        return accumulator[1]
def update_to_last_value(value, side_input):
    for column in ALL_MY_COLUMNS:
        if value[column] is None:
            if side_input.get(column) is None:
                # What do you want to do if the column is empty for all values?
                pass
            else:
                value[column] = side_input[column]
    return value
p = ...  # create pipeline
data = p | 'Read' >> beam.io.Read(beam.io.BigQuerySource(table_path))
side_input = data | 'Last Value' >> beam.CombineGlobally(FindLastValue()).as_singleton_view()
# take the data that you computed as the 'last' value for each column and provide it to a
# function which updates any columns that are unset.
output = data | 'Output' >> beam.Map(update_to_last_value, side_input)
# ... any additional transforms that you want.
The above pipeline will parallelize well because you will compute the last value in parallel (this is the power of the combiner). Afterwards you'll be able to update all records in parallel since the last value has been computed.
Note that this won't solve arbitrarily sparse sections within columns. Do these readings occur at a regular frequency, such that you can guarantee a value every Y rows?
I am afraid that if you are using Beam, you will have to write your own DoFn to handle this. Something like (pseudocode):
DoFn(input_element):
    for all the field_to_fill repeat:
        input_element.field_to_fill = NEW_VALUE;
    emit input_element
And apply this to the whole data set (i.e. the one from beam.io.Read()).
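As a rough sketch of that idea (my own illustration, not tested: it assumes the per-column last known values are available as a side input, for example from the combiner in the other answer, since a plain per-element DoFn cannot see other rows):
import apache_beam as beam

class FillMissingFields(beam.DoFn):
    def process(self, element, last_values):
        # element is a dict-like BigQuery row; last_values maps column name -> fallback value
        filled = dict(element)
        for column, fallback in last_values.items():
            if filled.get(column) is None:
                filled[column] = fallback
        yield filled

# usage sketch: filled_rows = rows | beam.ParDo(FillMissingFields(), last_values_side_input)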
My answer is limited to Beam. There might be a feature in BigQuery that can do column access more easily.
Say I have the following variables and their corresponding values, which together represent a record.
name = 'abc'
age = 23
weight = 60
height = 174
Please note that the value could be of different types (string, integer, float, reference-to-any-other-object, etc).
There will be many records (at least 100,000). Each record is unique when all four variables (actually their values) are taken together. In other words, no two records have all four values the same.
I am trying to find an efficient data structure in Python which will allow me to (store and) retrieve records based on any one of these variables in log(n) time complexity.
For example:
def retrieve(name=None, age=None, weight=None, height=None):
    if name is not None and age is None and weight is None and height is None:
        ...  # get all records with the given name
    if name is None and age is not None and weight is None and height is None:
        ...  # get all records with the given age
    ...
    return records
The retrieve function should be called as follows:
retrieve(name='abc')
The above should return [{'name': 'abc', 'age': 23, 'weight': 50, 'height': 175}, {'name': 'abc', 'age': 28, 'weight': 55, 'height': 170}, ...]
retrieve(age=23)
The above should return [{'name': 'abc', 'age': 23, 'weight': 50, 'height': 175}, {'name': 'def', 'age': 23, 'weight': 65, 'height': 180}, ...]
I may also need to add one or two more variables to this record in the future, for example sex = 'm', so the retrieve function must scale to additional fields.
So in short: is there a data structure in Python that allows storing a record with n columns (name, age, sex, weight, height, etc.) and retrieving records based on any one of the columns in logarithmic (or ideally constant, O(1)) look-up time?
There isn't a single data structure built into Python that does everything you want, but it's fairly easy to use a combination of the ones it does have to achieve your goals and do so fairly efficiently.
For example, say your input was the following data in a comma-separated-value file called employees.csv with field names defined as shown by the first line:
name,age,weight,height
Bob Barker,25,175,6ft 2in
Ted Kingston,28,163,5ft 10in
Mary Manson,27,140,5ft 6in
Sue Sommers,27,132,5ft 8in
Alice Toklas,24,124,5ft 6in
The following is working code which illustrates how to read and store this data into a list of records, and automatically create separate look-up tables for finding records associated with the values contained in the fields of each record.
The records are instances of a class created by namedtuple, which is very memory efficient because each one lacks the __dict__ attribute that class instances normally contain. Using them makes it possible to access the fields of each record by name using dot syntax, like record.fieldname.
The look-up tables are defaultdict(list) instances, which provide dictionary-like O(1) look-up times on average and allow multiple values to be associated with each key. The look-up key is the field value being sought, and the data associated with it is a list of integer indices of the Person records in the records list that contain that value, so the tables themselves stay relatively small.
Note that the code for the class is completely data-driven in that it doesn't contain any hardcoded field names; they are all taken from the first row of the csv input file when it is read in. Of course, when using an instance, all retrieve() method calls must provide valid field names.
Update
Modified to not create a lookup table for every unique value of every field when the data file is first read. Now the retrieve() method "lazily" creates them only when one is needed (and saves/caches the result for future use). Also modified to work in Python 2.7+ including 3.x.
from collections import defaultdict, namedtuple
import csv

class DataBase(object):
    def __init__(self, csv_filename, recordname):
        # Read data from csv format file into a list of namedtuples.
        with open(csv_filename, 'r') as inputfile:
            csv_reader = csv.reader(inputfile, delimiter=',')
            self.fields = next(csv_reader)  # Read header row.
            self.Record = namedtuple(recordname, self.fields)
            self.records = [self.Record(*row) for row in csv_reader]
            self.valid_fieldnames = set(self.fields)
        # Create an empty table of lookup tables for each field name that maps
        # each unique field value to a list of record-list indices of the ones
        # that contain it.
        self.lookup_tables = {}

    def retrieve(self, **kwargs):
        """ Fetch a list of records with a field name with the value supplied
            as a keyword arg (or return None if there aren't any).
        """
        if len(kwargs) != 1: raise ValueError(
            'Exactly one fieldname keyword argument required for retrieve function '
            '(%s specified)' % ', '.join([repr(k) for k in kwargs.keys()]))
        field, value = kwargs.popitem()  # Keyword arg's name and value.
        if field not in self.valid_fieldnames:
            raise ValueError('keyword arg "%s" isn\'t a valid field name' % field)
        if field not in self.lookup_tables:  # Need to create a lookup table?
            lookup_table = self.lookup_tables[field] = defaultdict(list)
            for index, record in enumerate(self.records):
                field_value = getattr(record, field)
                lookup_table[field_value].append(index)
        # Return (possibly empty) sequence of matching records.
        return tuple(self.records[index]
                     for index in self.lookup_tables[field].get(value, []))
if __name__ == '__main__':
    empdb = DataBase('employees.csv', 'Person')

    print("retrieve(name='Ted Kingston'): {}".format(empdb.retrieve(name='Ted Kingston')))
    print("retrieve(age='27'): {}".format(empdb.retrieve(age='27')))
    print("retrieve(weight='150'): {}".format(empdb.retrieve(weight='150')))

    try:
        print("retrieve(hight='5ft 6in'): {}".format(empdb.retrieve(hight='5ft 6in')))
    except ValueError as e:
        print("ValueError('{}') raised as expected".format(e))
    else:
        raise type('NoExceptionError', (Exception,), {})(
            'No exception raised from "retrieve(hight=\'5ft\')" call.')
Output:
retrieve(name='Ted Kingston'): [Person(name='Ted Kingston', age='28', weight='163', height='5ft 10in')]
retrieve(age='27'): [Person(name='Mary Manson', age='27', weight='140', height='5ft 6in'),
Person(name='Sue Sommers', age='27', weight='132', height='5ft 8in')]
retrieve(weight='150'): None
retrieve(hight='5ft 6in'): ValueError('keyword arg "hight" is an invalid fieldname')
raised as expected
Is there a data structure in Python which will allow storing a record with n number of columns (name, age, sex, weigh, height, etc) and retrieving records based on any (one) of the column in logarithmic (or ideally constant - O(1) look-up time) complexity?
No, there is none. But you could try to implement one on the basis of one dictionary per value dimension. As long as your values are hashable of course. If you implement a custom class for your records, each dictionary will contain references to the same objects. This will save you some memory.
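A minimal sketch of that idea (my own illustration; the Record class and field names are just placeholders):
from collections import defaultdict

class Record(object):
    def __init__(self, name, age, weight, height):
        self.name, self.age, self.weight, self.height = name, age, weight, height

# one dictionary per field, each mapping a field value to the records that share it
indexes = {field: defaultdict(list) for field in ('name', 'age', 'weight', 'height')}

def add(record):
    for field, index in indexes.items():
        index[getattr(record, field)].append(record)  # every index references the same object

def retrieve(**kwargs):
    (field, value), = kwargs.items()  # exactly one keyword argument expected
    return indexes[field].get(value, [])

add(Record('abc', 23, 60, 174))
print([r.height for r in retrieve(age=23)])  # -> [174]
Look-ups are O(1) on average; the cost is the extra memory for one index per field.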
You could achieve logarithmic time complexity in a relational database using indexes (O(log(n)**k) with single column indexes). Then to retrieve data just construct appropriate SQL:
names = {'name', 'age', 'weight', 'height'}

def retrieve(c, **params):
    if not (params and names.issuperset(params)):
        raise ValueError(params)
    where = ' and '.join(map('{0}=:{0}'.format, params))
    return c.execute('select * from records where ' + where, params)
Example:
import sqlite3

c = sqlite3.connect(':memory:')
c.row_factory = sqlite3.Row  # to provide key access

# create table
c.execute("""create table records
             (name text, age integer, weight real, height real)""")

# insert data
records = (('abc', 23, 60, 174+i) for i in range(2))
c.executemany('insert into records VALUES (?,?,?,?)', records)

# create indexes
for name in names:
    c.execute("create index idx_{0} on records ({0})".format(name))

try:
    retrieve(c, naame='abc')
except ValueError:
    pass
else:
    assert 0

for record in retrieve(c, name='abc', weight=60):
    print(record['height'])
Output:
174.0
175.0
Given http://wiki.python.org/moin/TimeComplexity how about this:
Have a dictionary for every column you're interested in - AGE, NAME, etc.
Have the keys of those dictionaries (AGE, NAME) be the possible values for the given column (35 or "m").
Have a list of lists representing values for one "collection", e.g. VALUES = [ [35, "m"], ...]
Have the values of the column dictionaries (AGE, NAME) be lists of indices into the VALUES list.
Have a dictionary which maps column name to index within the lists in VALUES, so that you know the first column is age and the second is sex (you could avoid that and use dictionaries instead, but they introduce a large memory footprint and with over 100K objects this may or may not be a problem).
Then the retrieve function could look like this:
def retrieve(column_name, column_value):
    if column_name == "age":
        return [VALUES[index] for index in AGE[column_value]]
    elif ...:  # repeat for other "columns"
Then, this is what you get
VALUES = [[35, "m"], [20, "f"]]
AGE = {35:[0], 20:[1]}
SEX = {"m":[0], "f":[1]}
KEYS = ["age", "sex"]
retrieve("age", 35)
# [[35, 'm']]
If you want a dictionary, you can do the following:
[dict(zip(KEYS, values)) for values in retrieve("age", 35)]
# [{'age': 35, 'sex': 'm'}]
but again, dictionaries are a little heavy on the memory side, so if you can go with lists of values it might be better.
Both dictionary and list retrieval are O(1) on average (worst case for a dictionary is O(n)), so this should be pretty fast. Maintaining it will be a bit of a pain, but not too much: to "write", you'd just append to the VALUES list and then append that index to the appropriate list in each of the dictionaries.
Of course, it would be best to benchmark your actual implementation and look for potential improvements, but hopefully this makes sense and will get you going :)
EDIT:
Please note that as #moooeeeep said, this will only work if your values are hashable and therefore can be used as dictionary keys.
The long (winded) version:
I'm gathering research data using Python. My initial parsing is ugly (but functional) code which gives me some basic information and turns my raw data into a format suitable for heavy duty statistical analysis using SPSS. However, every time I modify the experiment, I have to dive into the analysis code.
For a typical experiment, I'll have 30 files, each for a unique user. The field count is fixed for each experiment but can vary from one experiment to another (10-20 fields). Files are typically 700-1000 records long with a header row. Records are tab separated (see the sample, which is 4 integers, 3 strings, and 10 floats).
I need to sort my list into categories. In a 1000 line file, I could have 4-256 categories. Rather than trying to pre-determine how many categories each file has, I'm using the code below to count them. The integers at the beginning of each line dictate what category the float values in the row correspond to. Integer combinations can be modified by the string values to produce wildly different results, and multiple combinations can sometimes be lumped together.
Once they're in categories, the number crunching begins. I get statistical info (mean, sd, etc.) for each category for each file.
The essentials:
I need to parse data like the sample below into categories. Categories are combos of the non-float fields in each record. I'm also trying to come up with a dynamic (graphical) way to associate column combinations with categories; I will make a new post for this.
I'm looking for suggestions on how to do both.
# data is a list of tab separated records
# fields is a list of my field names
# get a list of fieldtypes via gettype on our first row
# gettype is a function to get type from string without changing data
fieldtype = [gettype(n) for n in data[1].split('\t')]
# get the indexes for fields that aren't floats
mask = [i for i, field in enumerate(fieldtype) if field!="float"]
# for each row of data[skipping first and last empty lists] we split(on tabs)
# and take the ith element of that split where i is taken from the list mask
# which tells us which fields are not floats
records = [[row.split('\t')[i] for i in mask] for row in data[1:-1]]
# we now get a unique set of combos
# since set doesn't happily take a list of lists, we join each row of values
# together in a comma separated string. So we end up with a list of strings.
uniquerecs = set([",".join(row) for row in records])
print len(uniquerecs)
quit()
def gettype(s):
    try:
        int(s)
        return "int"
    except ValueError:
        pass
    try:
        float(s)
        return "float"
    except ValueError:
        return "string"
Sample Data:
field0 field1 field2 field3 field4 field5 field6 field7 field8 field9 field10 field11 field12 field13 field14 field15
10 0 2 1 Right Right Right 5.76765674196 0.0310912272139 0.0573603238282 0.0582901376612 0.0648936500524 0.0655294305058 0.0720571099855 0.0748289246137 0.446033755751
3 1 3 0 Left Left Right 8.00982745764 0.0313840132052 0.0576521406854 0.0585844966069 0.0644905497442 0.0653386429438 0.0712603578765 0.0740345755708 0.2641076191
5 19 1 0 Right Left Left 4.69440026591 0.0313852052224 0.0583165354345 0.0592403274967 0.0659404609478 0.0666070804916 0.0715314027001 0.0743022054775 0.465994962101
3 1 4 2 Left Right Left 9.58648184552 0.0303649003017 0.0571579895338 0.0580911765412 0.0634304670863 0.0640132919609 0.0702920967445 0.0730697946335 0.556525293
9 0 0 7 Left Left Left 7.65374257547 0.030318719717 0.0568551744109 0.0577785415066 0.0640577002605 0.0647226582655 0.0711459854908 0.0739256050784 1.23421547397
Not sure if I understand your question, but here are a few thoughts:
For parsing the data files, you usually use the Python csv module.
For categorizing the data you could use a defaultdict with the non-float fields joined as a key for the dict. Example:
from collections import defaultdict
import csv
reader = csv.reader(open('data.file', 'rb'), delimiter='\t')
data_of_category = defaultdict(list)
lines = [line for line in reader]
mask = [i for i, n in enumerate(lines[1]) if gettype(n)!="float"]
for line in lines[1:]:
    category = ','.join([line[i] for i in mask])
    data_of_category[category].append(line)
This way you don't have to calculate the categories beforehand and can process the data in one pass.
And I didn't understand the part about "a dynamic (graphical) way to associate column combinations with categories".
For at least part of your question, have a look at Named Tuples
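For example, a quick illustration (hypothetical field names):
from collections import namedtuple

Record = namedtuple('Record', ['field0', 'field1', 'field2', 'field3'])
r = Record(10, 0, 2, 1)
print(r.field2)  # fields are accessible by name -> 2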
Step 1: Use something like csv.DictReader to turn the text file into an iterable of rows.
Step 2: Turn that into a dict of first entry: rest of entries.
with open("...", "rb") as data_file:
    lines = csv.reader(data_file, some_custom_dialect)
    categories = {line[0]: line[1:] for line in lines}
Step 3: Iterate over the items() of the data and do something with each line.
for category, line in categories.items():
    do_stats_to_line(line)
Some useful answers already but I'll throw mine in as well. Key points:
Use the csv module
Use collections.namedtuple for each row
Group the rows using a tuple of int field values as the key
If your source rows are sorted by the keys (the integer column values), you could use itertools.groupby. This would likely reduce memory consumption. Given your example data, and the fact that your files contain >= 1000 rows, this is probably not an issue to worry about.
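For completeness, a rough sketch of the groupby variant (my own illustration; parsed_rows stands for the iterable of namedtuples built below and is assumed to already be ordered by its integer columns):
from itertools import groupby

def integer_key(row):
    # the integer columns identify the category
    return tuple(field for field in row if isinstance(field, int))

# each category must appear as one contiguous run for groupby to yield one group per key
for key, group in groupby(parsed_rows, key=integer_key):
    rows_in_category = list(group)
    # compute mean, sd, etc. for this category here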
import csv
from collections import defaultdict, namedtuple

def coerce_to_type(value):
    _types = (int, float)
    for _type in _types:
        try:
            return _type(value)
        except ValueError:
            continue
    return value

def parse_row(row):
    return [coerce_to_type(field) for field in row]

with open(datafile) as srcfile:
    data = csv.reader(srcfile, delimiter='\t')

    ## Read headers, create namedtuple
    headers = srcfile.next().strip().split('\t')
    datarow = namedtuple('datarow', headers)

    ## Wrap with parser and namedtuple
    data = (parse_row(row) for row in data)
    data = (datarow(*row) for row in data)

    ## Group by the leading integer columns
    grouped_rows = defaultdict(list)
    for row in data:
        integer_fields = [field for field in row if isinstance(field, int)]
        grouped_rows[tuple(integer_fields)].append(row)

    ## DO SOMETHING INTERESTING WITH THE GROUPS
    import pprint
    pprint.pprint(dict(grouped_rows))
EDIT You may find the code at https://gist.github.com/985882 useful.