Processing columns in Apache Beam (mainly forward fill) - Python

I'm trying to interpolate (forward-fill) values of a table.
Input: a BigQuery table with n+1 columns, where n of them hold readings and the +1 is a Time column (the time when the reading was made). Most of the reading values are empty.
Output: a BigQuery table with the same n+1 columns, such that the empty values are replaced with the last known readings (empty values at the beginning of time are ignored).
This is equivalent to pandas df.fillna(method='pad').
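For a toy example of that pandas behaviour:
import pandas as pd
import numpy as np

# toy frame: a Time column plus two sparse reading columns
df = pd.DataFrame({'Time': [1, 2, 3, 4],
                   'r1': [0.5, np.nan, np.nan, 0.9],
                   'r2': [np.nan, 1.2, np.nan, np.nan]})
filled = df.fillna(method='pad')  # forward-fills r1 and r2 down the rows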
I would like to run this on huge tables using Google's Dataflow service through Apache Beam.
It seems Beam is great at handling rows, but I can't seem to find a way to handle columns. Obviously, once I've got a column I can easily iterate over it and interpolate the values as I go.
I'm not sure how memory works in Dataflow, though. We need to make sure it can handle the amount of data necessary.
beam.io.Read(beam.io.BigQuerySource(table_path))
When reading a table from BigQuery one gets a PCollection of rows.
How do I get a column? Even a query returns the same thing...

If the forward fill you're attempting is only at the end of each column, I would suggest using a combiner to find the last populated value in each column, based upon the timestamp of the row.
ALL_MY_COLUMNS = ['foo', 'bar', ...]
class FindLastValue(core.CombineFn):
    def create_accumulator(self, *args, **kwargs):
        # first dict stores the latest timestamp seen per column,
        # second dict stores the last value seen per column
        return ({}, {})

    def add_input(self, mutable_accumulator, element, *args, **kwargs):
        for column in ALL_MY_COLUMNS:
            # if the column is populated, and we either haven't captured a value
            # before or the element's timestamp is greater than the one we have
            # seen in the past, record this as the last known value
            if element[column] is not None and (
                    mutable_accumulator[0].get(column) is None
                    or mutable_accumulator[0][column] < element['timestamp']):
                mutable_accumulator[0][column] = element['timestamp']
                mutable_accumulator[1][column] = element[column]
        return mutable_accumulator

    def merge_accumulators(self, accumulators, *args, **kwargs):
        # merge the accumulators, keeping per column the value with the latest timestamp
        merged = ({}, {})
        for accum in accumulators:
            for column in ALL_MY_COLUMNS:
                if accum[1].get(column) is not None:
                    if (merged[0].get(column) is None
                            or merged[0][column] < accum[0][column]):
                        merged[0][column] = accum[0][column]
                        merged[1][column] = accum[1][column]
        return merged

    def extract_output(self, accumulator, *args, **kwargs):
        # return a dict of column -> last known value
        return accumulator[1]
def update_to_last_value(value, side_input):
    for column in ALL_MY_COLUMNS:
        if value[column] is None:
            if side_input[column] is None:
                pass  # What do you want to do if the column is empty for all rows?
            else:
                value[column] = side_input[column]
    return value
p = ... create pipeline ...
data = p | 'Read' >> beam.io.Read(beam.io.BigQuerySource(table_path))
side_input = data | 'Last Value' >> beam.CombineGlobally(FindLastValue()).as_singleton_view()

# Take the 'last' value that you computed for each column and provide it to a
# function which updates any columns that are unset.
output = data | 'Output' >> beam.Map(update_to_last_value, side_input)
... any additional transforms that you want.
The above pipeline will parallelize well because you will compute the last value in parallel (this is the power of the combiner). Afterwards you'll be able to update all records in parallel since the last value has been computed.
Note that this won't solve arbitrarily sparse sections within columns. Are these readings occurring at a regular frequency, such that you can guarantee that every Y rows there is a value?

I'm afraid that if you are using Beam, you will have to write your own DoFn to handle it. Something like (pseudocode):
DoFn(input_element):
    for all the field_to_fill repeat:
        input_element.field_to_fill = NEW_VALUE
    emit input_element
And apply this to the whole data set (i.e. the one from beam.io.Read()).
My answer is limited to Beam. There might be a feature in BigQuery that makes column access easier.
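For illustration, a minimal sketch of such a DoFn (the column names and the fill rule are assumptions; this one just copies per-column fill values from a side input, much like the combiner answer above):
import apache_beam as beam

FIELDS_TO_FILL = ['foo', 'bar']  # hypothetical column names

class FillMissing(beam.DoFn):
    def process(self, element, fill_values):
        # fill_values is a side input: a dict of column -> value to fill with
        row = dict(element)
        for field in FIELDS_TO_FILL:
            if row.get(field) is None:
                row[field] = fill_values.get(field)
        yield row

# usage sketch: rows | beam.ParDo(FillMissing(), beam.pvalue.AsSingleton(fill_values_pcoll))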

Related

Defining a function with args to be used in df.transform

For a current project, I am planning to winsorize a pandas DataFrame that consists of two columns/objects, df['Policies'] and df['ProCon']. This means that the outliers at the high and the low end of the set shall be cut out.
The winsorizing shall be conducted at 0.05 and 0.95 based on the values in the df['ProCon'] column, and both columns shall be cut in case an outlier is identified.
The code below is, however, not accepting the direct reference to the 'ProCon' column in the line def winsorize_series(df['ProCon']):, yielding an invalid-syntax error.
Is there any smart way to indicate that ProCon shall be the determining value for the winsorizing?
import pandas as pd
from scipy.stats import mstats

# Loading the file
df = pd.read_csv("3d201602.csv")

# Winsorizing
def winsorize_series(df['ProCon']):
    return mstats.winsorize(df['ProCon'], limits=[0.05,0.95])

# Defining the winsorized DataFrame
df = df.transform(winsorize_series)
Have you tried separating the column name from the table?
def winsorize_series(df, column):
    return mstats.winsorize(df[column], limits=[0.05,0.95])
Can't test it though if there's no sample data.
As per the comments, .transform is not the right choice for modifying only one or a few selected columns of df. Whatever the function definition and arguments passed, transform will iterate over EVERY column, pass it to func, and try to broadcast the joined result back to the original shape of df.
What you need is much simpler
limits = [0.05,0.95] # keep limits static for any calls you make
colname = 'ProCon' # you could even have a list of columns and loop... for colname in cols
df[colname] = mstats.winsorize(df[colname], limits=limits)
df.transform(func) can be passed *args and **kwargs which will be passed to func, as in
df = df.transform(mstats.winsorize, axis=0, a=df['ProCon'], limits=[0.05,0.95])
So there is no need for
def winsorize_series...
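If "both columns shall be cut" means dropping whole rows whose ProCon value is an outlier (an assumption about the intent), a quantile mask is one way to sketch it:
lo, hi = df['ProCon'].quantile([0.05, 0.95])
df_trimmed = df[df['ProCon'].between(lo, hi)]  # keeps Policies and ProCon aligned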

Python - How to optimize code to run faster? (lots of for loops in DataFrame)

I have code that works quite extensively with an Excel file (SAP download): data transformation and calculation steps.
I need to loop through all the lines (a couple thousand rows) a few times. I had previously written code that added the DataFrame columns separately, so I could do everything in one for loop, which was of course quite quick; however, I had to change the data source, which meant a change in the raw data structure.
In the raw data the first 3 rows are empty, then comes a title row with the column names, then 2 more empty rows, and the first column is also empty. I decided to wipe these and assign the column names as headers (steps below); however, since then, separately adding the column names and later calculating everything in one for statement does not fill data into any of those specific columns.
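(As an aside, a layout like that can often be stripped at read time; a sketch, with the offsets taken from the description above:)
# skip the 3 blank rows so the title row becomes the header, then drop
# the 2 blank rows under it and the empty first column
raw = pd.read_excel(filePath, sheet_name=sheetName, skiprows=3)
dfConverter = raw.iloc[2:, 1:].reset_index(drop=True)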
How could I optimize this code?
I have deleted some calculation steps, since they are quite long and would make the code even less readable.
# This function adds new columns to the dataframe
def NewColdfConverter(*args):
    for i in args:
        dfConverter[i] = ''  # previously used dfConverter[i] = NaN

# This function creates a dataframe from an excel file
def DataFrameCreator(path, sheetname):
    excelFile = pd.ExcelFile(path)
    global readExcel
    readExcel = pd.read_excel(excelFile, sheet_name=sheetname)
# calling my function to create the dataframe
DataFrameCreator(filePath, sheetName)
dfConverter = pd.DataFrame(readExcel)

# dropping NA values from the Orders column (right now called Unnamed)
dfConverter.dropna(subset=['Unnamed: 1'], inplace=True)

# dropping rows and deleting other unnecessary columns
dfConverter.drop(dfConverter.head(1).index, inplace=True)
dfConverter.drop(dfConverter.columns[[0,11,12,13,17,22,23,48]], axis=1, inplace=True)

# renaming columns from 'Unnamed: 1' etc. to proper names
dfConverter = dfConverter.rename(columns={'Unnamed: 1': 'propername1', 'Unnamed: 2': 'propername2'})  # etc.
# calling the new column function -> this Day column appears in the 1st for loop
NewColdfConverter("Day")

# example for loop that worked before, but is not working since the new dataset and new header/column steps were added:
for i in range(len(dfConverter)):
    # Day column -> floor Entry Date to -1 day if the time is less than 5:00:00
    if dfConverter['Time'][i] <= time(hour=5, minute=0, second=0):
        dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i]) - timedelta(days=1)
    else:
        dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])
The problem is that many columns build on one another, so I cannot compute them in one for loop. For instance, in the example below I need to calculate reqsWoSetUpValue so I can calculate requirementsValue, so I can calculate otherReqsValue, but I'm not able to do this within one for loop by assigning the values to the dataframe column's [i] row, because the value will just be missing, as if nothing happened.
(dfSorted is the same as dfConverter, but a sorted version of it)
# example code of getting reqsWoSetUpValue
for i in range(len(dfSorted)):
    reqsWoSetUpValue[i] = # calculation steps...

# inserting column with value
dfSorted.insert(49, 'Reqs wo SetUp', reqsWoSetUpValue)

# getting requirements value with the previously calculated Reqs wo SetUp column
for i in range(len(dfSorted)):
    requirementsValue[i] = # calc
dfSorted.insert(50, 'Requirements', requirementsValue)

# calculating Other Reqs value with the previously calculated Requirements column
for i in range(len(dfSorted)):
    otherReqsValue[i] = # calc
dfSorted.insert(51, 'Other Reqs', otherReqsValue)
Does anyone have a clue why I can no longer do this in one for loop by first adding all the columns with the function, like:
NewColdfConverter('Reqs wo setup', 'Requirements', 'Other reqs')

# then in 1 for loop:
for i in range(len(dfSorted)):
    dfSorted['Reqs wo setup'][i] = # calculation steps
    dfSorted['Requirements'][i] = # calculation steps
    dfSorted['Other reqs'][i] = # calculation steps
Thank you
General comment: How to identify bottlenecks
To get started, you should try to identify which parts of the code are slow.
Method 1: time code sections using the time package
Wrap blocks of code in statements like this:
import time
t = time.time()
# do something
print("time elapsed: {:.1f} seconds".format(time.time() - t))
Method 2: use a profiler
E.g. Spyder has a built-in profiler. This allows you to check which operations are most time consuming.
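If you're not using Spyder, the standard library's cProfile works anywhere; a sketch, where main() stands for whatever entry point your script has:
import cProfile
import pstats

cProfile.run('main()', 'profile.out')  # profile a full run
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)  # 10 most expensive calls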
Vectorize your operations
Your code will be orders of magnitude faster if you vectorize your operations. It looks like your loops are all avoidable.
For example, rather than calling pd.to_datetime on every row separately, you should call it on the entire column at once:
# slow (don't do this):
for i in range(len(dfConverter)):
    dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])

# fast (do this instead):
dfConverter['Day'] = pd.to_datetime(dfConverter['Entry Date'])
If you want to perform an operation on a subset of rows, you can also do this in a vectorized operation by using loc:
mask = dfConverter['Time'] <= time(hour=5,minute=0,second=0)
dfConverter.loc[mask,'Day'] = pd.to_datetime(dfConverter.loc[mask,'Entry Date']) - timedelta(days=1)
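Putting the two together for your Day column (a sketch assuming the same column names as in your loop, and that time and timedelta are imported from datetime):
day = pd.to_datetime(dfConverter['Entry Date'])
early = dfConverter['Time'] <= time(hour=5, minute=0, second=0)
dfConverter['Day'] = day.mask(early, day - timedelta(days=1))  # subtract a day where early is True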
Not sure this would improve performance, but you could calculate the dependent columns at the same time, row by row, with DataFrame.iterrows():
for index, data in dfSorted.iterrows():
    dfSorted['Reqs wo setup'][index] = # calculation steps
    dfSorted['Requirements'][index] = # calculation steps
    dfSorted['Other reqs'][index] = # calculation steps
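That said, if each step can be phrased on whole columns, the dependency chain stops being a problem: compute one full column, then the next. A sketch with placeholder formulas (colA/colB/colC are hypothetical input columns):
dfSorted['Reqs wo SetUp'] = dfSorted['colA'] - dfSorted['colB']          # placeholder formula
dfSorted['Requirements'] = dfSorted['Reqs wo SetUp'] + dfSorted['colC']  # freely uses the column above
dfSorted['Other Reqs'] = dfSorted['Requirements'] * 2                    # placeholder formula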

Selecting Values from One Multi Index Dataframe to Add to a New Column in Another Existing Multi Index Dataframe

Alright, I could use a little help on this one. I've created a function that I can feed two multi-index dataframes into, as well as a list of kwargs, and the function will then take values from one dataframe and add them to the other as a new column.
Just to make sure that I'm explaining it well enough: the two dataframes both hold stock info, where one dataframe is my "universe", i.e. the stocks I'm analyzing, and the other is a dataframe of market and sector ETFs.
So what my function does is takes kwargs in the form of:
new_stock_column_name = "existing_sector_column_name"
Here is my actual function:
def map_columns(hist_multi_to, hist_multi_from, **kwargs):
    ''' Map columns from the sector multi index dataframe to a new column
    in the existing universe multi index dataframe.
    **kwargs should be in the format newcolumn="existing_sector_column"
    '''
    df_to = hist_multi_to.copy()
    df_from = hist_multi_from.copy()
    for key, value in kwargs.items():
        df_to[key] = np.nan
        for index, val in df_to.iterrows():
            try:
                df_to.loc[index, key] = df_from.loc[(index[0], val.xl_sect), value]
            except KeyError:
                pass
    return df_to
I believe my function works exactly as I intend, except that it takes a ridiculously long time to loop through the data. There has got to be a better way to do this, so any help you could provide would be greatly appreciated.
I apologize in advance: I'm having trouble coming up with two simple example dataframes, but the only real requirement is that the stock dataframe has a column that lists each stock's sector ETF, and that column's value directly corresponds to the level-1 index of the ETF dataframe.
The exception handler is in place simply because the ETFs sometimes do not exist for all the dates of the analysis, in which case I don't mind the values staying NaN.
Update:
Here is a revised code snippet that will allow you to run the function to see what I'm talking about. Forgive me, my coding skills are limited.
import pandas as pd
import numpy as np

stock_arrays = [np.array(['1/1/2020', '1/1/2020', '1/2/2020', '1/2/2020']),
                np.array(['aapl', 'amzn', 'aapl', 'amzn'])]
stock_tuples = list(zip(*stock_arrays))
stock_index = pd.MultiIndex.from_tuples(stock_tuples, names=['date', 'stock'])

etf_arrays = [np.array(['1/1/2020', '1/1/2020', '1/2/2020', '1/2/2020']),
              np.array(['xly', 'xlk', 'xly', 'xlk'])]
etf_tuples = list(zip(*etf_arrays))
etf_index = pd.MultiIndex.from_tuples(etf_tuples, names=['date', 'stock'])

stock_df = pd.DataFrame(np.random.randn(4), index=stock_index, columns=['close_price'])
etf_df = pd.DataFrame(np.random.randn(4), index=etf_index, columns=['close_price'])
stock_df['xl_sect'] = np.array(['xlk', 'xly', 'xlk', 'xly'])
def map_columns(hist_multi_to, hist_multi_from, **kwargs):
    ''' Map columns from the sector multi index dataframe to a new column
    in the existing universe multi index dataframe.
    **kwargs should be in the format newcolumn="existing_sector_column"
    '''
    df_to = hist_multi_to.copy()
    df_from = hist_multi_from.copy()
    for key, value in kwargs.items():
        df_to[key] = np.nan
        for index, val in df_to.iterrows():
            try:
                df_to.loc[index, key] = df_from.loc[(index[0], val.xl_sect), value]
            except KeyError:
                pass
    return df_to
Now after running the above in a cell, you can access the function by calling it like this:
new_stock_df = map_columns(stock_df, etf_df, sect_etf_close='close_price')
new_stock_df
I hope this is more helpful. And as you can see, the function works, but with really large datasets it's extremely slow and inefficient.
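(For what it's worth, a vectorized sketch of the same mapping using a join on (date, sector) instead of row-by-row lookups; it assumes the toy frames above and a reasonably recent pandas, which lets join mix index levels and columns in on=:)
# rename the ETF index levels so they line up with stock_df's date level
# and its xl_sect column
lookup = etf_df['close_price'].rename('sect_etf_close')
lookup.index = lookup.index.set_names(['date', 'xl_sect'])

# one join replaces the per-row .loc writes; missing (date, sector) pairs
# come through as NaN, matching the KeyError pass above
new_stock_df = stock_df.join(lookup, on=['date', 'xl_sect'])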

Pandas, perf issues when filtering a dataframe, calculating on the filtered dataframe, and updating the main dataframe

I have some performance issues with pandas. I have a fairly big dataframe which contains about 3000 products, identified by a unique_id (with several entries per product). I need to filter the dataframe for each product, perform some calculations, and update the base dataframe. For now I'm doing something like this:
for unique_id in self.df.unique_id.unique():
    # prod_df = self.df[(self.df["unique_id"] == unique_id)]
    prod_df = self.df.query(f"unique_id == {unique_id}")
    some_function(prod_df)
And
def some_function(prod_df):
    ... some code ...
    values = some_values
    for idx, val in zip(prod_df.index, some_values):
        self.df.loc[idx, "foo_column"] = val
This code, however, is awfully slow (I'm talking several hours here...). I did some quick profiling, and it seems my script spends most of its runtime in pandas' indexing.py. No big surprise there.
I was wondering if there is a pandas solution to this kind of problem. I think the filtering, and maybe also the per-index writes, are killing the performance of my script. Any ideas? At this point I'm considering putting my dataframe in a dict or in a SQLite database.
EDIT:
Here is a typical function that I could use in lieu of some_function:
def comp_gradient_for_column(self, prod_df: pd.DataFrame) -> None:
    """
    Compute the gradient for a given column and insert it in the dataframe.

    Arguments:
        prod_df (pd.DataFrame): sub-dataframe to work on

    Returns:
        None
    """
    values = prod_df[column_name].values
    gradients = np.gradient(values)
    for idx, val in zip(prod_df.index, gradients):
        self.df.loc[idx, "foo_column"] = val
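(For the gradient example specifically, one vectorized angle, assuming every product has at least two rows so np.gradient is defined per group: groupby.transform computes each group's gradient and writes the result back aligned, with no per-row .loc writes.)
import numpy as np

self.df['foo_column'] = (self.df
                         .groupby('unique_id')[column_name]
                         .transform(np.gradient))  # one aligned write per column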

What is the best data structure for storing a set of four (or more) values?

Say I have the following variables and their corresponding values, which together represent a record.
name = 'abc'
age = 23
weight = 60
height = 174
Please note that the value could be of different types (string, integer, float, reference-to-any-other-object, etc).
There will be many records (>100,000). Every record is unique when all four variables (actually, their values) are taken together; in other words, no record exists in which all 4 values are the same.
I am trying to find an efficient data structure in Python which will allow me to (store and) retrieve records based on any one of these variables in log(n) time complexity.
For example:
def retrieve(name=None, age=None, weight=None, height=None):
    if name is not None and age is None and weight is None and height is None:
        ...  # get all records with the given name
    if name is None and age is not None and weight is None and height is None:
        ...  # get all records with the given age
    ...
    return records
The way the retrieve should be called is as follows:
retrieve(name='abc')
The above should return [{name:'abc', age:23, weight:50, height:175}, {name:'abc', age:28, weight:55, height:170}, etc.]
retrieve(age=23)
The above should return [{name:'abc', age:23, weight:50, height:175}, {name:'def', age:23, weight:65, height:180}, etc.]
And I may need to add one or two more variables to this record in the future, for example sex = 'm'. So the retrieve function must be scalable.
So, in short: is there a data structure in Python which will allow storing a record with n columns (name, age, sex, weight, height, etc.) and retrieving records based on any (one) of those columns in logarithmic (or ideally constant, O(1)) look-up time?
There isn't a single data structure built into Python that does everything you want, but it's fairly easy to use a combination of the ones it does have to achieve your goals and do so fairly efficiently.
For example, say your input was the following data in a comma-separated-value file called employees.csv with field names defined as shown by the first line:
name,age,weight,height
Bob Barker,25,175,6ft 2in
Ted Kingston,28,163,5ft 10in
Mary Manson,27,140,5ft 6in
Sue Sommers,27,132,5ft 8in
Alice Toklas,24,124,5ft 6in
The following is working code which illustrates how to read and store this data into a list of records, and automatically create separate look-up tables for finding records associated with the values contained in the fields of each record.
The records are instances of a class created by namedtuple, which is very memory-efficient because each one lacks the __dict__ attribute that class instances normally contain. Using them makes it possible to access the fields of each record by name using dot syntax, like record.fieldname.
The look-up tables are defaultdict(list) instances, which provide dictionary-like O(1) look-up times on average and allow multiple values to be associated with each key. So the look-up key is the field value being sought, and the data associated with it is a list of the integer indices of the Person records stored in the employees list with that value, so the tables stay relatively small.
Note that the code for the class is completely data-driven in that it doesn't contain any hardcoded field names; instead they are all taken from the first row of the csv input file when it's read in. Of course, when using an instance, all retrieve() method calls must provide valid field names.
Update
Modified so that it no longer creates a lookup table for every unique value of every field when the data file is first read. Now the retrieve() method "lazily" creates them only when one is needed (and saves/caches the result for future use). Also modified to work in Python 2.7+, including 3.x.
from collections import defaultdict, namedtuple
import csv

class DataBase(object):
    def __init__(self, csv_filename, recordname):
        # Read data from csv format file into a list of namedtuples.
        with open(csv_filename, 'r') as inputfile:
            csv_reader = csv.reader(inputfile, delimiter=',')
            self.fields = next(csv_reader)  # Read header row.
            self.Record = namedtuple(recordname, self.fields)
            self.records = [self.Record(*row) for row in csv_reader]
            self.valid_fieldnames = set(self.fields)
        # Create an empty table of lookup tables for each field name that maps
        # each unique field value to a list of record-list indices of the ones
        # that contain it.
        self.lookup_tables = {}

    def retrieve(self, **kwargs):
        """ Fetch a list of records with a field name with the value supplied
        as a keyword arg (or return None if there aren't any).
        """
        if len(kwargs) != 1:
            raise ValueError(
                'Exactly one fieldname keyword argument required for retrieve function '
                '(%s specified)' % ', '.join([repr(k) for k in kwargs.keys()]))
        field, value = kwargs.popitem()  # Keyword arg's name and value.
        if field not in self.valid_fieldnames:
            raise ValueError('keyword arg "%s" isn\'t a valid field name' % field)
        if field not in self.lookup_tables:  # Need to create a lookup table?
            lookup_table = self.lookup_tables[field] = defaultdict(list)
            for index, record in enumerate(self.records):
                field_value = getattr(record, field)
                lookup_table[field_value].append(index)
        # Return (possibly empty) sequence of matching records.
        return tuple(self.records[index]
                     for index in self.lookup_tables[field].get(value, []))

if __name__ == '__main__':
    empdb = DataBase('employees.csv', 'Person')
    print("retrieve(name='Ted Kingston'): {}".format(empdb.retrieve(name='Ted Kingston')))
    print("retrieve(age='27'): {}".format(empdb.retrieve(age='27')))
    print("retrieve(weight='150'): {}".format(empdb.retrieve(weight='150')))
    try:
        print("retrieve(hight='5ft 6in'): {}".format(empdb.retrieve(hight='5ft 6in')))
    except ValueError as e:
        print("ValueError('{}') raised as expected".format(e))
    else:
        raise type('NoExceptionError', (Exception,), {})(
            'No exception raised from "retrieve(hight=\'5ft\')" call.')
Output:
retrieve(name='Ted Kingston'): [Person(name='Ted Kingston', age='28', weight='163', height='5ft 10in')]
retrieve(age='27'): [Person(name='Mary Manson', age='27', weight='140', height='5ft 6in'),
Person(name='Sue Sommers', age='27', weight='132', height='5ft 8in')]
retrieve(weight='150'): None
retrieve(hight='5ft 6in'): ValueError('keyword arg "hight" is an invalid fieldname')
raised as expected
Is there a data structure in Python which will allow storing a record with n columns (name, age, sex, weight, height, etc.) and retrieving records based on any (one) of those columns in logarithmic (or ideally constant, O(1)) look-up time?
No, there is none. But you could try to implement one on the basis of one dictionary per value dimension, as long as your values are hashable, of course. If you implement a custom class for your records, each dictionary will contain references to the same objects, which will save you some memory.
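A minimal sketch of that idea (field names assumed from the question):
from collections import defaultdict

records = []                  # the record objects themselves
by_name = defaultdict(list)   # one dict per dimension: value -> record indices
by_age = defaultdict(list)

def add(record):
    idx = len(records)
    records.append(record)
    by_name[record['name']].append(idx)
    by_age[record['age']].append(idx)

add({'name': 'abc', 'age': 23, 'weight': 60, 'height': 174})
matches = [records[i] for i in by_age[23]]  # O(1) average look-up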
You could achieve logarithmic time complexity in a relational database using indexes (O(log(n)**k) with single-column indexes). Then, to retrieve data, just construct the appropriate SQL:
names = {'name', 'age', 'weight', 'height'}

def retrieve(c, **params):
    if not (params and names.issuperset(params)):
        raise ValueError(params)
    where = ' and '.join(map('{0}=:{0}'.format, params))
    return c.execute('select * from records where ' + where, params)
Example:
import sqlite3

c = sqlite3.connect(':memory:')
c.row_factory = sqlite3.Row  # to provide key access

# create table
c.execute("""create table records
             (name text, age integer, weight real, height real)""")

# insert data
records = (('abc', 23, 60, 174+i) for i in range(2))
c.executemany('insert into records VALUES (?,?,?,?)', records)

# create indexes
for name in names:
    c.execute("create index idx_{0} on records ({0})".format(name))

try:
    retrieve(c, naame='abc')  # deliberate misspelling -> ValueError
except ValueError:
    pass
else:
    assert 0

for record in retrieve(c, name='abc', weight=60):
    print(record['height'])
Output:
174.0
175.0
Given http://wiki.python.org/moin/TimeComplexity, how about this:
Have a dictionary for every column you're interested in: AGE, NAME, etc.
Have the keys of those dictionaries (AGE, NAME) be the possible values for the given column (35 or "m").
Have a list of lists representing the values for one "collection", e.g. VALUES = [[35, "m"], ...].
Have the values of the column dictionaries (AGE, NAME) be lists of indices into the VALUES list.
Have a dictionary which maps column names to indices within the lists in VALUES, so that you know the first column is age and the second is sex (you could avoid that and use dictionaries instead, but they introduce a large memory footprint, and with over 100K objects this may or may not be a problem).
Then the retrieve function could look like this:
def retrieve(column_name, column_value):
    if column_name == "age":
        return [VALUES[index] for index in AGE[column_value]]
    elif ...:  # repeat for other "columns"
Then, this is what you get:
VALUES = [[35, "m"], [20, "f"]]
AGE = {35:[0], 20:[1]}
SEX = {"m":[0], "f":[1]}
KEYS = ["age", "sex"]
retrieve("age", 35)
# [[35, 'm']]
If you want a dictionary, you can do the following:
[dict(zip(KEYS, values)) for values in retrieve("age", 35)]
# [{'age': 35, 'sex': 'm'}]
but again, dictionaries are a little heavy on the memory side, so if you can go with lists of values it might be better.
Both dictionary and list retrieval are O(1) on average (the worst case for a dictionary is O(n)), so this should be pretty fast. Maintaining it will be a little bit of a pain, but not much: to "write", you'd just append to the VALUES list and then append the new index in VALUES to each of the dictionaries.
Of course, it would be best to benchmark your actual implementation and look for potential improvements, but hopefully this makes sense and will get you going :)
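A write helper for that layout might look like this (a sketch matching the names above):
def write(values):
    # values is a row like [35, "m"], ordered according to KEYS
    index = len(VALUES)
    VALUES.append(values)
    AGE.setdefault(values[KEYS.index("age")], []).append(index)
    SEX.setdefault(values[KEYS.index("sex")], []).append(index)

write([27, "f"])
retrieve("age", 27)
# [[27, 'f']]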
EDIT:
Please note that, as @moooeeeep said, this will only work if your values are hashable and can therefore be used as dictionary keys.
