Aligning and adding columns in multiple Pandas dataframes based on Date column - python

I have a number of dataframes, all of which contain columns labeled 'Date' and 'Cost' along with additional columns. I'd like to add the numerical data in the 'Cost' columns across the different frames, lining up the dates in the 'Date' columns, to produce a time series of total costs for each date.
There are different numbers of rows in each of the dataframes.
This seems like something that Pandas should be well suited to doing, but I can't find a clean solution.
Any help appreciated!
Here are two of the dataframes:
df1:
         Date  Total Cost  Funded Costs
0  2015-09-30      724824        940451
1  2015-10-31      757605        940451
2  2015-11-15      788051        940451
3  2015-11-30      809368        940451
df2:
         Date  Total Cost  Funded Costs
0  2015-11-30        3022         60000
1  2016-01-15        3051         60000
I want to have the resulting dataframe have five rows (there are five different dates) and a single column with the total of the 'Total Cost' column from each of the dataframes. Initially I used the following:
totalFunding = df1['Total Cost'].values + df2['Total Cost'].values
This worked fine until there were different dates in each of the dataframes.
Thanks!
The solution posted below works great, except that I need to apply it repeatedly as I have a number of data frames. I created the following function:
def addDataFrames(f_arg, *argv):
    dfTotal = f_arg
    for arg in argv:
        dfTotal = dfTotal.set_index('Date').add(arg.set_index('Date'), fill_value = 0)
    return dfTotal
Which works fine when adding the first two dataframes. However, the addition method appears to convert my Date column into an index in the resulting sum and therefore subsequent passes through the function fail. Here is what dfTotal looks like after the first two data frames are added together:
            Total Cost  Funded Costs  Remaining Cost  Total Employee Hours
Date
2015-09-30     1449648       1880902          431254                7410.6
2015-10-31     1515210       1880902          365692                7874.4
2015-11-15     1576102       1880902          304800                8367.2
2015-11-30     1618736       1880902          262166                8578.0
2015-12-15     1671462       1880902          209440                8945.2
2015-12-31     1721840       1880902          159062                9161.2
2016-01-15     1764894       1880902          116008                9495.0
Note that what was originally a column in the dataframe called 'Date' is now listed as the index causing df.set_index('Date') to generate an error on subsequent passes through my function.

DataFrame.add does exactly what you're looking for; it matches the DataFrames based on index, so:
df1.set_index('Date').add(df2.set_index('Date'), fill_value=0)
should do the trick. If you just want the Total Cost column and you want it as a DataFrame:
df1.set_index('Date').add(df2.set_index('Date'), fill_value=0)[['Total Cost']]
See also the documentation for DataFrame.add at:
http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.add.html

Solution found. As mentioned, the add method converted the 'Date' column into the dataframe index. This was resolved using:
dfTotal['Date'] = dfTotal.index
The complete function is then:
def addDataFrames(f_arg, *argv):
    dfTotal = f_arg
    for arg in argv:
        dfTotal = dfTotal.set_index('Date').add(arg.set_index('Date'), fill_value = 0)
        dfTotal['Date'] = dfTotal.index
    return dfTotal
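For more than two frames, the same idea can be written so that the 'Date' column does not have to be re-created on every pass. The following is a minimal sketch rather than the poster's function: add_data_frames is a hypothetical name, and it assumes each frame has a 'Date' column with unique dates and that all other columns are numeric.

```python
from functools import reduce

import pandas as pd


def add_data_frames(*frames):
    """Sum any number of frames, aligning rows on their 'Date' column."""
    # Index each frame by date exactly once.
    indexed = [frame.set_index("Date") for frame in frames]
    # DataFrame.add aligns on the index; fill_value=0 stands in for a date
    # that is missing from one of the two frames being added.
    total = reduce(lambda left, right: left.add(right, fill_value=0), indexed)
    # Turn the date index back into an ordinary column.
    return total.reset_index()


# The two frames from the question.
df1 = pd.DataFrame({
    "Date": ["2015-09-30", "2015-10-31", "2015-11-15", "2015-11-30"],
    "Total Cost": [724824, 757605, 788051, 809368],
    "Funded Costs": [940451] * 4,
})
df2 = pd.DataFrame({
    "Date": ["2015-11-30", "2016-01-15"],
    "Total Cost": [3022, 3051],
    "Funded Costs": [60000, 60000],
})

total = add_data_frames(df1, df2)
print(total)
```

Because reset_index() restores 'Date' as a column, the result can be passed straight back into the same function with further frames.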

Related

Automatically Map columns from one dataframe to another using pandas

I am trying to merge multiple dataframes to a master dataframe based on the columns in the master dataframes. For Example:
MASTER DF:
PO ID  Sales year  Name  Acc year
10     1934        xyz   1834
11     1942        abc   1842
SLAVE DF:
PO ID  Yr    Amount  Year
12     1935  365.2   1839
13     1966  253.9   1855
RESULTANT DF:
PO ID  Sales Year  Acc Year
10     1934        1834
11     1942        1842
12     1935        1839
13     1966        1855
Notice how I have manually mapped columns (Sales Year-->Yr and Acc Year-->Year) since I know they are the same quantity, only the column names are different.
I am trying to write some logic that can map them automatically based on some criterion (be it the column names or the data type of each column) so that the user does not need to map them manually.
If I map them by column name, the paired columns have different names (Sales Year vs. Yr, and Acc Year vs. Year). So to which column in the MASTER DF should the fourth column (Year) of the SLAVE DF be mapped?
Another way would be to map them based on their column values, but both columns hold the same kind of values (years), so that does not distinguish them either.
The logic should be able to map Yr to Sales Year and map Year to Acc Year automatically.
Any idea/logic would be helpful.
Thanks in advance!
I think the safest approach is to rename the columns manually.
df = df.rename(columns={'Yr':'Sales year','Sales year':'Sales Year',
                        'Year':'Acc Year','Acc Year':'Acc year'})
Another idea is to select the integer columns, keep those whose values all lie between two thresholds (here 1800 and 2000), and finally set the column names:
df = df.set_index('PO ID')
df1 = df.select_dtypes('integer')
mask = (df1.gt(1800) & df1.lt(2000)).all().reindex(df.columns, fill_value=False)
df = df.loc[:, mask].set_axis(['Sales Year','Acc Year'], axis=1)
In general this is impossible, as there is no solid or consistent factor by which the columns can be mapped.
That being said, one thing you can do is use cosine similarity to measure how similar one string (in this case a column name) is to the column names in the other dataframe.
In your case, that gives 4 vectors for the first dataframe and 4 for the other one. Calculate the cosine similarity between the first vector (PO ID) from the first dataframe and the first vector (PO ID) from the second dataframe. This will return 100%, as both strings are the same.
For each column you'll get 4 confidence scores. Just pick the highest and map accordingly.
That gives you a makeshift rule for mapping the columns, although it has loopholes too. It is still better than nothing, because the user will have fewer columns to map by hand than if they mapped them all manually.
Cheers!
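The cosine-similarity idea above can be sketched as follows. This is only an illustration under stated assumptions: the answer does not say how a column name becomes a vector, so character-bigram counts are used here as one possible choice, and bigram_counts, cosine_similarity and suggest_mapping are hypothetical helper names.

```python
import math
from collections import Counter


def bigram_counts(name):
    """Count overlapping two-character chunks of a lower-cased column name."""
    text = name.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two bigram-count vectors (0.0 if either is empty)."""
    va, vb = bigram_counts(a), bigram_counts(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def suggest_mapping(source_cols, target_cols):
    """For each source column, return the best-scoring target column and its score."""
    mapping = {}
    for src in source_cols:
        scores = {tgt: cosine_similarity(src, tgt) for tgt in target_cols}
        best = max(scores, key=scores.get)
        mapping[src] = (best, scores[best])
    return mapping


master_cols = ["PO ID", "Sales year", "Name", "Acc year"]
slave_cols = ["PO ID", "Yr", "Amount", "Year"]
suggestions = suggest_mapping(slave_cols, master_cols)
for src, (tgt, score) in suggestions.items():
    print(f"{src!r} -> {tgt!r} (score {score:.2f})")
```

With this particular vectorisation, 'Yr' shares no bigram with any master column name and therefore scores 0 against all of them. That is one of the loopholes mentioned above: a minimum score threshold would be needed so that such columns are left for the user to map.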

Using get_loc to get index of multiple values by iterating over a dataframe in Pandas

Events is the DataFrame with date as index. It looks like this:
            co_code
co_stkdate
2009-03-17       11
2010-02-03       11
2011-02-14      363
2015-01-09      363
2010-10-15      365
residual is the other dataframe, with date as index, and its column names are the elements of co_code in the events dataframe. residual looks like this (it has more than 700 columns, but I've posted 3 for reference):
             11         363         365
co_stkdate
1997-07-02  NaN  -12.134525         NaN
1997-07-04  NaN   -3.663248  -15.703843
1997-07-07  NaN  -30.649876    3.400623
1997-07-08  NaN   17.924305   -6.188777
1997-07-10  NaN  -25.828099   -0.615380
I want to compare the two dataframes to find the common dates for each column of residual dataframe individually and extract the specific row and its adjacent rows for each column which has a matching date in events dataframe. Since the dataset is very large, I want to iterate through each column of residual to compare the date in accordance with the column name (that matches with the events dataframe). Hence, I tried the following code:
carvalues = {}
for code in residual.columns:
    for c in events['co_code']:
        if (code == c):
            for elem in events['co_stkdate']:
                for dates in residual.index:
                    if (elem == dates):
                        if pd.notnull(residual.loc[dates, code]):
                            idx=residual.index.get_loc(dates, code, method=None)
                            carvalues = residual.iloc[idx - 10 : idx +10]
But I keep getting the following error:
TypeError: get_loc() got multiple values for argument 'method'
The expected output: For example, extract 10 rows (from the residual dataframe) above and below the date 2009-03-17 corresponding to 'co_code'=11 (given in events dataframe). And expect the output for date 2009-03-17, corresponding to 'co_code'=11 to be:
co_stkdate 11
2009-02-25 4.467442
2009-02-26 4.921655
2009-02-27 -4.875917
2009-03-02 1.895546
2009-03-03 -3.162370
2009-03-06 85.396542
2009-03-09 43.233098
2009-03-12 11.389193
2009-03-13 -68.633160
2009-03-16 0.329175
2009-03-17 -0.049623
2009-03-18 3.584602
2009-03-19 -3.602577
2009-03-20 -1.532591
2009-03-23 2.766331
2009-03-24 0.487590
2009-03-25 -3.541044
2009-03-26 -5.055355
2009-03-27 0.887624
2009-03-30 2.530087
Similarly, next I want the output for co_stkdate=2010-02-03 & co_code=11 and then for co_stkdate=2011-02-14 & co_code=363 and so on (as given in events dataframe). How can I remove the error? Any guidance on the best way to do this would be much appreciated.
Reshape the second data frame from wide format to long format, then join the two data frames; that will give you the desired result.
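As a different route from the reshape-and-join suggestion, the event windows can also be pulled out by position. The TypeError in the question appears to arise because get_loc takes the row label as its key, so the extra positional argument (code) collides with method. The sketch below is an assumption-laden illustration, not the asker's final code: event_windows and half_width are hypothetical names, residual is assumed to have a unique, sorted date index, and events is assumed to be indexed by date with a 'co_code' column.

```python
import pandas as pd


def event_windows(events, residual, half_width=10):
    """Collect residuals around each event date, per company code."""
    windows = {}
    for date, code in events["co_code"].items():
        # Skip events with no matching column, date, or value in residual.
        if code not in residual.columns or date not in residual.index:
            continue
        if pd.isna(residual.at[date, code]):
            continue
        # get_loc takes only the row label; the column is selected separately.
        pos = residual.index.get_loc(date)
        start = max(pos - half_width, 0)
        windows[(code, date)] = residual[code].iloc[start:pos + half_width + 1]
    return windows


# Hypothetical data standing in for the real frames.
dates = pd.date_range("2009-03-01", periods=30, freq="D")
residual = pd.DataFrame(
    {11: [float(i) for i in range(30)], 363: [float(i) for i in range(100, 130)]},
    index=dates,
)
events = pd.DataFrame(
    {"co_code": [11, 363]},
    index=pd.to_datetime(["2009-03-17", "2009-03-03"]),
)

windows = event_windows(events, residual)
for (code, date), window in windows.items():
    print(code, date.date(), len(window))
```

Only one pass over events is needed, because each event row already pairs a date with its column. Windows near the start or end of residual are simply shorter.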

Pivot across multiple columns with repeating values in each column

I am trying to pivot a pandas dataframe, but the data is following a strange format that I cannot seem to pivot. The data is structured as below:
Date     Location  Action1  Quantity1  Action2    Quantity2  ...  ActionN    QuantityN
<date>   1         Lights   10         CFloor     1          ...  Null       Null
<date2>  2         CFloor   2          CWalls     4          ...  CBasement  15
<date3>  2         CWalls   7          CBasement  4          ...  Null       Null
Essentially, each action will always have a quantity attached to it (which may be 0), but null actions will never have a quantity (the quantity will just be null). The format I am trying to achieve is the following:
   Lights  CFloor  CBasement  CWalls
1      10       1          0       0
2       0       2         19      11
The index of the rows becomes the location while the columns become any unique action found across the multiple activity columns. When pulling the data together, the value of each row/column is the sum of each quantity associated with the action (i.e. Action1 corresponds to Quantity1). Is there a way to do this with the native pandas pivot function?
My current code performs a ravel across all the activity columns to get a list of all unique activities. It will also grab all the unique locations from the Location column. Once I have the unique columns, I create an empty dataframe and fill it with zeros:
   Lights  CFloor  CBasement  CWalls
1       0       0          0       0
2       0       0          0       0
I then iterate back over the old data frame with the itertuples() method (I was told it was significantly faster than iterrows()) and populate the new dataframe. This empty dataframe acts as a template that is stored in memory and filled later.
#Creates a template from the dataframe
def create_template(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    activities = df[act_cols]
    flat_acts = activities.values.ravel('K')
    unique_locations = pd.unique(df['Location'])
    unique_acts = pd.unique(flat_acts)
    pivot_template = pd.DataFrame(index=unique_locations, columns=unique_acts).fillna(0)
    return pivot_template
#Fills the template from the dataframe
def create_pivot(df, pivot_frmt):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    for row in df.itertuples():
        for act, quantity in zip(act_cols, quant_cols):
            act_val = getattr(row, act)
            if pd.notna(act_val):
                quantity_val = getattr(row, quantity)
                location = getattr(row, 'Location')
                pivot_frmt.loc[location, act_val] += quantity_val
    return pivot_frmt
While my solution works, it is incredibly slow when dealing with a large dataset and has taken 10 seconds or more to complete this type of operation. Any help would be greatly appreciated!
After experimenting with various pandas functions, such as melt and pivot on multiple columns simultaneously, I found a solution that worked for me:
For every quantity-activity pair, I build a partial frame of the final dataset and store it in a list. Once every pair has been addressed I will end up with multiple dataframes that all have the same row counts, but potentially different column counts. I solved this issue by simply concatenating the columns and if any columns are repeated, I then sum them to get the final result.
def test_pivot(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    dfs = []
    for act, quant in zip(act_cols, quant_cols):
        # One Location x action table per activity/quantity pair
        partial = pd.crosstab(index=df['Location'], columns=df[act], values=df[quant], aggfunc='sum').fillna(0)
        dfs.append(partial)
    finalDf = pd.concat(dfs, axis=1)
    # Sum the columns that share an action name
    finalDf = finalDf.T.groupby(level=0).sum().T
    return finalDf
There are two assumptions that I make during this approach:
The indexes maintain their order across all partial dataframes
There are an equivalent number of indexes across all partial dataframes
While this is probably not the most elegant solution, it achieves the desired result and reduced the time it took to process the data by a very significant margin (from 10s ~4k rows to 0.2s ~4k rows). If anybody has a better way to deal with this type of scenario and do the process outlined above in one shot, then I would love to see your response!
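One possible way to do this in a single pivot is to stack the activity/quantity pairs into long format first and then call pivot_table once. This is a sketch under assumptions, not a benchmarked replacement: pivot_actions is a hypothetical name, the sample frame below mirrors the example rows from the question with only three pairs, and null actions are assumed to be stored as missing values.

```python
import pandas as pd


def pivot_actions(df, act_cols, quant_cols):
    """Sum quantities per Location and action across all activity/quantity pairs."""
    # Stack each (activity, quantity) pair under common column names.
    pieces = [
        df[["Location", act, quant]].rename(columns={act: "Action", quant: "Quantity"})
        for act, quant in zip(act_cols, quant_cols)
    ]
    # Null actions carry no quantity, so drop them before pivoting.
    long_df = pd.concat(pieces, ignore_index=True).dropna(subset=["Action"])
    return long_df.pivot_table(
        index="Location", columns="Action", values="Quantity",
        aggfunc="sum", fill_value=0,
    )


# Sample data modelled on the rows shown in the question.
df = pd.DataFrame({
    "Location": [1, 2, 2],
    "Activity01": ["Lights", "CFloor", "CWalls"],
    "Quantity01": [10, 2, 7],
    "Activity02": ["CFloor", "CWalls", "CBasement"],
    "Quantity02": [1, 4, 4],
    "Activity03": [None, "CBasement", None],
    "Quantity03": [float("nan"), 15, float("nan")],
})

result = pivot_actions(
    df,
    ["Activity01", "Activity02", "Activity03"],
    ["Quantity01", "Quantity02", "Quantity03"],
)
print(result)
```

Because everything is aggregated in one table, this version does not rely on the partial frames sharing the same row order or row count.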

Python Matching multiple criteria

I need to match multiple criteria between two dataframes and then assign an ID.
This is complicated by the fact that one criterion needs to be 'like or similar' rather than exact, as it involves a time reference that is slightly different.
I need the timestamps to match to within +/- half a second. I would then like to add a new column in DF2 that holds the ID:
DF1
TimeStamp                    ID        Size
2018-07-12T03:34:54.228000Z  46236499  0.0013
2018-07-12T03:34:54.301000Z  46236500  0.01119422
DF2
TimeStamp                 Size        ID        #new column
2018-07-12T03:34:54.292Z  0.00        blank     #no match/no data
2018-07-12T03:34:54.300Z  0.01119422  46236500  #size and timestamp match within tolerances
In the example above the script would look at the time stamp column and look for any timestamp in DF2 that had the following information "2018-07-12T03:34:54" +/- a 1/2 second + had the exact same 'Size' element.
This needs to be done like this as there could be multiple 'Size' elements that are the same throughout the dataset.
It would then stamp the corresponding ID in the newly created 'ID' column within DF2 or if DF2 was copied to a new DF, I would just add the new 'ID' column within DF3.
Depending on which rows you need in the final dataframe, you may choose different join operators. One solution merges the two dataframes on the Size column and then filters the resulting rows based on the absolute time difference between the merged datetime columns.
import numpy, pandas
from datetime import timedelta
df3 = df1.merge(df2, left_on='Size', right_on='Size', how='right')
df3['deltaTime'] = numpy.abs(df3['TimeStamp_x'] - df3['TimeStamp_y'])
df3 = df3[(df3['deltaTime'] < timedelta(milliseconds=500)) | pandas.isnull(df3['deltaTime'])]
Output:
TimeStamp_x ID_x Size TimeStamp_y ID_y deltaTime
0 2018-07-12 03:34:54.301 46236500.0 0.011194 2018-07-12 03:34:54.300 46236500 00:00:00.001000
1 2018-07-12 03:34:54.301 46236500.0 0.011194 2018-07-12 03:34:54.800 46236501 00:00:00.499000
3 NaT NaN 0.000000 2018-07-12 03:34:54.292 blank NaT
If you don't want any unmerged rows, just remove | pandas.isnull(df3['deltaTime']) and use an inner join.
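To end up with the shape asked for in the question (DF2 plus a new ID column), the merge-and-filter idea can be wrapped up roughly as below. This is a sketch under assumptions: assign_ids is a hypothetical helper, the TimeStamp columns are assumed to be parsed datetimes already, DF2 is assumed to have a default unnamed index and no ID column yet, and when several DF1 rows qualify the closest one in time is kept.

```python
import pandas as pd


def assign_ids(df1, df2, tolerance=pd.Timedelta(milliseconds=500)):
    """Copy df1's ID onto df2 rows with equal Size and a timestamp within tolerance."""
    # Pair every df2 row with every df1 row that has exactly the same Size.
    candidates = df2.reset_index().merge(df1, on="Size", suffixes=("", "_df1"))
    candidates["delta"] = (candidates["TimeStamp"] - candidates["TimeStamp_df1"]).abs()
    # Keep pairs inside the tolerance, then the closest pair per df2 row.
    candidates = candidates[candidates["delta"] <= tolerance]
    best = candidates.sort_values("delta").drop_duplicates("index")
    result = df2.copy()
    result["ID"] = best.set_index("index")["ID"]  # unmatched rows stay NaN
    return result


# The example rows from the question.
df1 = pd.DataFrame({
    "TimeStamp": pd.to_datetime(["2018-07-12T03:34:54.228000Z", "2018-07-12T03:34:54.301000Z"]),
    "ID": [46236499, 46236500],
    "Size": [0.0013, 0.01119422],
})
df2 = pd.DataFrame({
    "TimeStamp": pd.to_datetime(["2018-07-12T03:34:54.292Z", "2018-07-12T03:34:54.300Z"]),
    "Size": [0.00, 0.01119422],
})

result = assign_ids(df1, df2)
print(result)
```

Rows of DF2 without a match keep a missing value in ID, which plays the role of "blank" in the question's example.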

A Multi-Index Construction for Intraday TimeSeries (10 min price data)

I have a file with intraday prices every ten minutes. There are 42 times in a day (numbered 0 to 41), so each date is repeated 42 times. The multi-index below should "collapse" the repeated dates into one for all times.
There are 62,035 rows x 3 columns: [date, time, price].
I would like to write a function to get the difference of the ten minute prices, restricting differences to each unique date.
In other words, 09:30 is the first time of each day and 16:20 is the last: I cannot overlap differences between days of price from 16:20 - 09:30. The differences should start as 09:40 - 09:30 and end as 16:20 - 16:10 for each unique date in the dataframe.
Here is my attempt. Any suggestions would be greatly appreciated.
def diffSeries(rounded,data):
    '''This function accepts a column called rounded from 'data'
    The 2nd input 'data' is a dataframe
    '''
    df=rounded.shift(1)
    idf=data.set_index(['date', 'time'])
    data['diff']=['000']
    for i in range(0,length(rounded)):
        for day in idf.index.levels[0]:
            for time in idf.index.levels[1]:
                if idf.index.levels[1]!=1620:
                    data['diff']=rounded[i]-df[i]
                else:
                    day+=1
                    time+=2
    data[['date','time','price','II','diff']].to_csv('final.csv')
    return data['diff']
Then I call:
data=read_csv('file.csv')
rounded=roundSeries(data['price'],5)
diffSeries(rounded,data)
The traceback shows an AssertionError.
You can use groupby and then apply to achieve what you want:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
For a full example, suppose you create a test data set for 14 Nov to 16 Nov:
import pandas as pd
from numpy.random import randn
from datetime import datetime, time
# Create date range with 10 minute intervals, and filter out irrelevant times
# (datetime.datetime is used directly because pd.datetime is not available in recent pandas)
times = pd.bdate_range(start=datetime(2012,11,14,0,0,0),end=datetime(2012,11,17,0,0,0), freq='10min')
filtered_times = [x for x in times if x.time() >= time(9,30) and x.time() <= time(16,20)]
prices = randn(len(filtered_times))
# Create MultiIndex and data frame matching the format of your CSV
arrays = [[x.date() for x in filtered_times]
          ,[x.time() for x in filtered_times]]
tuples = list(zip(*arrays))
m_index = pd.MultiIndex.from_tuples(tuples, names=['date', 'time'])
data = pd.DataFrame({'prices': prices}, index=m_index)
You should get a DataFrame a bit like this:
                        prices
date       time
2012-11-14 09:30:00   0.696054
           09:40:00  -1.263852
           09:50:00   0.196662
           10:00:00  -0.942375
           10:10:00   1.915207
As mentioned above, you can then get the differences by grouping by the first index and then subtracting the previous row for each row:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
Which gives you something like:
                        prices
date       time
2012-11-14 09:30:00        NaN
           09:40:00  -1.959906
           09:50:00   1.460514
           10:00:00  -1.139036
           10:10:00   2.857582
Since you are grouping by the date, the function is not applied for 16:20 - 09:30.
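The same per-day differencing can also be written with the built-in diff method on a groupby over the 'date' index level. The snippet below is a small sketch on made-up prices (the values are purely illustrative), assuming the frame has a MultiIndex with levels named 'date' and 'time' as above.

```python
from datetime import date, time

import pandas as pd

# Two days with three ten-minute prices each (illustrative values only).
index = pd.MultiIndex.from_product(
    [[date(2012, 11, 14), date(2012, 11, 15)],
     [time(9, 30), time(9, 40), time(9, 50)]],
    names=["date", "time"],
)
data = pd.DataFrame({"prices": [1.0, 3.0, 2.0, 5.0, 5.5, 4.0]}, index=index)

# diff() is computed separately inside each date group, so the first
# time of every day has no predecessor and comes out as NaN.
diffs = data.groupby(level="date").diff()
print(diffs)
```

The result keeps the original (date, time) index, so it can be joined back onto data directly.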
You might want to consider using a time series (a Series with a DatetimeIndex) instead of a DataFrame, because it will give you far greater flexibility with this kind of data. Supposing you have already loaded your DataFrame from the CSV file, you can easily convert it into a time series and perform a similar operation to get the differences:
from datetime import datetime
dt_index = pd.DatetimeIndex([datetime.combine(i[0],i[1]) for i in data.index])
# or dt_index = pd.DatetimeIndex([datetime.combine(i.date,i.time) for i in data.index])
# if you don't have a multi-level index on data yet
ts = pd.Series(data.prices.values, dt_index)
diffs = ts.groupby(lambda idx: idx.date()).apply(lambda row: row - row.shift(1))
However, you would now have access to the built-in time series functions such as resampling. See the pandas time series documentation for more.
#MattiJohn's construction gives a filtered list of length 86,772--when run over 1/3/2007-8/30/2012 for 42 times (10 minute intervals). Observe the data cleaning issues.
Here the data of prices coming from the csv is length: 62,034.
Hence, simply importing from the .csv, as follows, is problematic:
filtered_times = [x for x in times if x.time() >= time(9,30) and x.time() <= time(16,20)]
DF=pd.read_csv('MR10min.csv')
prices = DF.price
# I.E. rather than the generic: prices = randn(len(filtered_times)) above.
The fact that the real data falls short of the length it "should be" means there are data cleaning issues. Often we do not have the full set of times that bdate_range will generate (half days in the market, holidays, etc.).
Your solution is elegant. But I am not sure how to overcome the mismatch between the actual data and the a priori, prescribed dataframe.
Your second time series suggestion still seems to require constructing a datetime index similar to the first one. For example, if I were to use the following lines to get the actual data of interest:
DF=pd.read_csv('MR10min.csv')
data=DF.set_index(['date','time'])
dt_index = pd.DatetimeIndex([datetime.combine(i[0],i[1]) for i in data.index])
It will generate a:
TypeError: combine() argument 1 must be datetime.date, not str
How does one make a bdate_range array completely informed by the actual data available?
Thank you to (#MattiJohn) and to anyone with interest in continuing this discussion.
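One way to let the index follow whatever rows the file actually contains is to parse the date and time strings from the CSV, instead of generating a regular range and hoping it matches. The sketch below rests on assumptions the thread does not confirm: the inline frame is a hypothetical stand-in for pd.read_csv('MR10min.csv'), and the 'date' and 'time' columns are assumed to be strings such as '2007-01-03' and '09:30:00'.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv('MR10min.csv'); string formats are assumed.
DF = pd.DataFrame({
    "date": ["2007-01-03", "2007-01-03", "2007-01-03", "2007-01-04", "2007-01-04"],
    "time": ["09:30:00", "09:40:00", "09:50:00", "09:30:00", "09:40:00"],
    "price": [100.0, 101.5, 101.0, 102.0, 104.0],
})

# Parse the strings so the index contains exactly the timestamps present in
# the data, including short days and skipping holidays.
dt_index = pd.DatetimeIndex(pd.to_datetime(DF["date"] + " " + DF["time"]))
ts = pd.Series(DF["price"].to_numpy(), index=dt_index)

# Difference within each calendar day only; normalize() strips the time part.
diffs = ts.groupby(ts.index.normalize()).diff()
print(diffs)
```

Parsing first also avoids the TypeError above, since datetime.combine is never handed raw strings. If the time column is stored differently (for example as integers like 930), it would need its own conversion step before this works.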
