Create a sliding window of data index positions - python

I am trying to write a function that returns the index positions of a sliding window over a Pandas DataFrame as a list of (train, test) tuples.
Example:
df.head(10)
   col_a  col_b
0   20.1    6.0
1   19.1    7.1
2   19.1    8.9
3   16.5   11.0
4   16.0   11.1
5   17.4    8.7
6   19.3    9.7
7   22.8   12.6
8   21.4   11.9
9   23.0   12.8
def split_function(df, train_length, test_length):
    # some_logic_to_split_dataframe
    split_indices = [(train_idx, test_idx) for index_tuples in split_dataframe_logic]
    return split_indices
Desired outcome:
train_length = 2
test_length = 1
split_indices = split_function(df, train_length, test_length)
split_indices
output:
[((0,1), (2)), ((1,2),(3)),...,((7,8), (9)) etc]
The loop or generator expression inside the function would also need to terminate once the test index reaches the last observation.
All help very much appreciated

I would suggest using the rolling method offered by pandas.
import numpy as np

split_indices = []

def split(x):
    # store the first train_length and last test_length index labels of the current window
    split_indices.append((x.index[:train_length], x.index[-test_length:]))
    return np.nan  # rolling.apply must return a scalar; the rolling result itself is discarded

df['col_a'].rolling(train_length + test_length).apply(split)
This code will create the following split_indices
>>> split_indices
[(Int64Index([0, 1], dtype='int64'), Int64Index([2], dtype='int64')),
(Int64Index([1, 2], dtype='int64'), Int64Index([3], dtype='int64')),
(Int64Index([2, 3], dtype='int64'), Int64Index([4], dtype='int64')),
(Int64Index([3, 4], dtype='int64'), Int64Index([5], dtype='int64')),
(Int64Index([4, 5], dtype='int64'), Int64Index([6], dtype='int64')),
(Int64Index([5, 6], dtype='int64'), Int64Index([7], dtype='int64')),
(Int64Index([6, 7], dtype='int64'), Int64Index([8], dtype='int64')),
(Int64Index([7, 8], dtype='int64'), Int64Index([9], dtype='int64'))]
Afterwards you can easily retrieve the rows of your dataframe for a given index:
>>> df.loc[split_indices[3][0]]
   col_a  col_b
3   16.5   11.0
4   16.0   11.1
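If you prefer a plain generator that also makes the termination condition explicit (it stops as soon as the last test index reaches the final observation), here is a minimal sketch; the generator form and the use of positional indices are my own choices, not part of the question:
def split_function(df, train_length, test_length):
    # yield (train_positions, test_positions) tuples for a sliding window
    window = train_length + test_length
    for start in range(len(df) - window + 1):
        train_idx = tuple(range(start, start + train_length))
        test_idx = tuple(range(start + train_length, start + window))
        yield train_idx, test_idx

split_indices = list(split_function(df, train_length=2, test_length=1))
# [((0, 1), (2,)), ((1, 2), (3,)), ..., ((7, 8), (9,))]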

How to calculate the ratio per columns in python?

I'm trying to calculate the ratio by columns in python.
import pandas as pd
import numpy as np
data={
'category': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
'value 1': [1, 1, 2, 5, 3, 4, 4, 8, 7],
'value 2': [4, 2, 8, 5, 7, 9, 3, 4, 2]
}
data=pd.DataFrame(data)
data.set_index('category')
# value 1 value 2
#category
# A 1 4
# B 1 2
# C 2 8
# D 5 5
# E 3 7
# F 4 9
# G 4 3
# H 8 4
# I 7 2
The expected results is as below:
#The sum of value 1: 35, value 2: 44
#The values in the first column were divided by 35, and those in the second column by 44
# value 1 value 2
#category
# A 0.028 0.090
# B 0.028 0.045
# C 0.057 0.181
# D 0.142 0.113
# E 0.085 0.159
# F 0.114 0.204
# G 0.114 0.068
# H 0.228 0.090
# I 0.2 0.045
I tried to run the below code, but it returned NaN values:
data=data.apply(lambda x:x/data.sum())
data
I think there are simpler methods for this job, but I cannot find the proper keywords.
How can I calculate the ratio in each column?
The issue is that you did not assign the result of set_index back, so it never took effect on data. What I usually do to avoid this kind of mistake is to chain the operations as a pipeline:
data = pd.DataFrame(data)
dataf = (
    data
    .set_index('category')
    .transform(lambda d: d / d.sum())
)
print(dataf)
By piping the commands, you get what you want. Note: I used transform instead of apply for speed. Pipelines are easy to read and less prone to mistakes. Using inplace=True is discouraged in pandas, as its effects can be unpredictable.
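For completeness, once the index is set you can also rely on DataFrame/Series broadcasting directly; a minimal sketch, assuming data is the DataFrame from the question:
ratios = data.set_index('category')
ratios = ratios / ratios.sum()   # each column is divided by its own column sum
print(ratios.round(3))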

Searching in numpy array

I have a 2D numpy array, say A, sorted with respect to column 0, e.g.
Col.0  Col.1  Col.2
10     2.45   3.25
11     2.95   4
12     3.45   4.25
15     3.95   5
18     4.45   5.25
21     4.95   6
23     5.45   6.25
27     5.95   7
29     6.45   7.25
32     6.95   8
35     7.45   8.25
The entries in each row are unique: column 0 is the identification number of a coordinate in the xy plane, and columns 1 and 2 are the x and y coordinates of that point. I have another array B (whose rows can contain duplicate data); its column 0 and column 1 store x and y coordinates.
Col.0  Col.1
2.45   3.25
4.45   5.25
6.45   7.25
2.45   3.25
My aim is to find the row index in array A corresponding to each row of array B, without using a for loop. So, in this case, my output should be [0, 4, 8, 0].
Now, I know that numpy's searchsorted can look up multiple values in one shot, but it compares against a single column of A, not multiple columns. Is there a way to do this?
Pure numpy solution:
My intuition is to take the difference between a[:,1:] and b by broadcasting, giving an array of shape (11, 4, 2) whose matching rows are all zeros. Comparing it == False turns that into a boolean mask c, and c.all(2) reduces it to a boolean array of shape (11, 4) whose True elements mark matches between a and b. Then I simply use nonzero to obtain the indices of those elements.
import numpy as np

a = np.array([
    [10, 2.45, 3.25],
    [11, 2.95, 4],
    [12, 3.45, 4.25],
    [15, 3.95, 5],
    [18, 4.45, 5.25],
    [21, 4.95, 6],
    [23, 5.45, 6.25],
    [27, 5.95, 7],
    [29, 6.45, 7.25],
    [32, 6.95, 8],
    [35, 7.45, 8.25],
])
b = np.array([
    [2.45, 3.25],
    [4.45, 5.25],
    [6.45, 7.25],
    [2.45, 3.25],
])
c = (a[:, np.newaxis, 1:] - b) == False
rows, cols = c.all(2).nonzero()
print(rows[cols.argsort()])
# [0 4 8 0]
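Exact equality on floats can be fragile after arithmetic; a tolerance-based variant of the same idea, using np.isclose (my own adaptation, not part of the answer above), could look like this:
c = np.isclose(a[:, np.newaxis, 1:], b)   # shape (11, 4, 2), True where coordinates match within tolerance
rows, cols = c.all(2).nonzero()
print(rows[cols.argsort()])
# [0 4 8 0]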
You can use merge in pandas:
df2.merge(df1.reset_index(),how='left',left_on=['Col.0','Col.1'],right_on=['Col.1','Col.2'])['index']
output:
0 0
1 4
2 8
3 0
Name: index, dtype: int64
and if you like it as array:
df2.merge(df1.reset_index(),how='left',left_on=['Col.0','Col.1'],right_on=['Col.1','Col.2'])['index'].to_numpy()
#array([0, 4, 8, 0])
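The merge approach assumes the two arrays are available as DataFrames df1 and df2; a minimal sketch of that conversion, reusing the arrays a and b defined above, with column names assumed to match the tables:
import pandas as pd

df1 = pd.DataFrame(a, columns=['Col.0', 'Col.1', 'Col.2'])   # array A
df2 = pd.DataFrame(b, columns=['Col.0', 'Col.1'])            # array B

idx = df2.merge(df1.reset_index(), how='left',
                left_on=['Col.0', 'Col.1'],
                right_on=['Col.1', 'Col.2'])['index'].to_numpy()
# array([0, 4, 8, 0])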

Replace specific values in multiindex dataframe

I have a multiindex dataframe with 3 index levels and 2 numerical columns.
A 1 2017-04-01  14.0   87.346878
    2017-06-01   4.0   87.347504
  2 2014-08-01   1.0  123.110001
    2015-01-01   4.0  209.612503
B 3 2014-07-01   1.0   68.540001
    2014-12-01   1.0   64.370003
  4 2015-01-01   3.0   75.000000
I want to replace the values in the first row of the third index level wherever a new second-level index begins.
For example, every first row:
(A,1,2017-04-01)->0.0 0.0
(A,2,2014-08-01)->0.0 0.0
(B,3,2014-07-01)->0.0 0.0
(B,4,2015-01-01)->0.0 0.0
The dataframe is too big, and doing it slice by slice like df.xs(('A', 1)) ... df.xs(('A', 2)) gets time consuming. Is there some way I can get a mask and replace the values at these positions?
Use DataFrame.reset_index on level=2, then group with DataFrame.groupby on level=[0, 1] and aggregate level_2 using first; create a multilevel index from the result with pd.MultiIndex.from_arrays, and finally use this multilevel index to change the values in the dataframe:
idx = df.reset_index(level=2).groupby(level=[0, 1])['level_2'].first()
idx = pd.MultiIndex.from_arrays(idx.reset_index().to_numpy().T)
df.loc[idx, :] = 0
Result:
# print(df)
col1 col2
A 1 2017-04-01 0.0 0.000000
2017-06-01 4.0 87.347504
2 2014-08-01 0.0 0.000000
2015-01-01 4.0 209.612503
B 3 2014-07-01 0.0 0.000000
2014-12-01 1.0 64.370003
4 2015-01-01 0.0 0.000000
We can extract a series of the second-level index with:
df.index.get_level_values(1)
# output: Int64Index([1, 1, 2, 2, 3, 3, 4], dtype='int64')
And check where it changes with:
idx = df.index.get_level_values(1)
np.where(idx != np.roll(idx, 1))[0]
# output: array([0, 2, 4, 6])
So we can simply use the returned value of the second statement with iloc to get the first row of every second-level index and modify their values like this:
idx = df.index.get_level_values(1)
df.iloc[np.where(idx != np.roll(idx, 1))[0]] = 0
output:
value1 value2
A 1 2017-04-01 0.0 0.000000
2017-06-01 4.0 87.347504
2 2014-08-01 0.0 0.000000
2015-01-01 4.0 209.612503
B 3 2014-07-01 0.0 0.000000
2014-12-01 1.0 64.370003
4 2015-01-01 0.0 0.000000
You can use the grouper indices in a simple iloc:
df.iloc[[a[0] for a in df.groupby(level=[0, 1]).indices.values()]] = 0
Example:
df = pd.DataFrame({'col1': [14., 4., 1., 4., 1., 1., 3.],
                   'col2': [87.346878, 87.347504, 123.110001, 209.612503, 68.540001, 64.370003, 75.]},
                  index=pd.MultiIndex.from_tuples([('A', 1, '2017-04-01'), ('A', 1, '2017-06-01'),
                                                   ('A', 2, '2014-08-01'), ('A', 2, '2015-01-01'),
                                                   ('B', 3, '2014-07-01'), ('B', 3, '2014-12-01'),
                                                   ('B', 4, '2015-01-01')]))
Result:
col1 col2
A 1 2017-04-01 0.0 0.000000
2017-06-01 4.0 87.347504
2 2014-08-01 0.0 0.000000
2015-01-01 4.0 209.612503
B 3 2014-07-01 0.0 0.000000
2014-12-01 1.0 64.370003
4 2015-01-01 0.0 0.000000
Timings:
%%timeit
idx = df.reset_index(level=2).groupby(level=[0, 1])['level_2'].first()
idx = pd.MultiIndex.from_arrays(idx.reset_index().to_numpy().T)
df.loc[idx, :] = 0
#6.7 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.iloc[[a[0] for a in df.groupby(level=[0, 1]).indices.values()]] = 0
#897 µs ± 6.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So this is about 7 times faster than the accepted answer.
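Since the question explicitly asks for a mask, an equivalent boolean-mask formulation (my own variant built on groupby.cumcount, not part of the answers above) would be:
# True for the first row of every (level 0, level 1) group
mask = df.groupby(level=[0, 1]).cumcount() == 0
df.loc[mask, :] = 0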
I think you can use something like this:
import pandas as pd
import numpy as np

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 4), index=index)
df
You can create a list of the unique first-level values of the index, get the positional start of each group with get_loc, and then replace the values in those rows:
lst = ['bar', 'baz', 'foo', 'qux']
positions = []
for i in lst:
    loc = df.index.get_loc(i)          # slice covering this first-level group
    start = loc.indices(len(df))[0]    # positional start of the group
    positions.append(start)

df.iloc[positions] = 0
df
Hopefully this helps. Cheers!

check start and end available in data using python pandas

df1
id start end data
1 2001 2004 [[2004,1],[2003,2],[2002,6],[2001,0.9]]
2 2001 2004 [[2005,1],[2003,2],[2002,6],[2001,0.9]]
3 2001 2004 [[2004,1],[2003,2],[2002,6]]
output
id missed_one
2 2004
3 2001
That is the expected output. I have to check every year from start to end against the years available in data; if any year is missing, it should appear in the output.
You can use set differencing:
df[['start', 'end']].agg(set,1) - df.data.transform(lambda k: set([item for z in k for item in z]))
1 {}
2 {2004}
3 {2001}
dtype: object
Using a list comprehension and zip:
out = df.assign(missing=[
    [i for i in range(start, end+1) if i not in {d for d, _ in datum}] or np.nan
    for datum, start, end in zip(df.data, df.start, df.end)
])
id start end data missing
0 1 2001 2004 [[2004, 1], [2003, 2], [2002, 6], [2001, 0.9]] NaN
1 2 2001 2004 [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]] [2004]
2 3 2001 2004 [[2004, 1], [2003, 2], [2002, 6]] [2001]
So if you want only rows with a year missing:
out.loc[out.missing.notnull()]
id start end data missing
1 2 2001 2004 [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]] [2004]
2 3 2001 2004 [[2004, 1], [2003, 2], [2002, 6]] [2001]
If you only want to show a single missing value, and not a list of missing values, you can use next:
df.assign(missing=[
    next((i for i in range(start, end+1) if i not in {d for d, _ in datum}), np.nan)
    for datum, start, end in zip(df.data, df.start, df.end)
])
id start end data missing
0 1 2001 2004 [[2004, 1], [2003, 2], [2002, 6], [2001, 0.9]] NaN
1 2 2001 2004 [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]] 2004.0
2 3 2001 2004 [[2004, 1], [2003, 2], [2002, 6]] 2001.0
Some timings:
df = pd.concat([df]*10000)
In [145]: %%timeit
     ...: out = df.assign(missing=[
     ...:     [i for i in range(start, end+1) if i not in {d for d, _ in datum}] or np.nan
     ...:     for datum, start, end in zip(df.data, df.start, df.end)
     ...: ])
     ...:
72.3 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [146]: %%timeit
     ...: df[['start', 'end']].agg(set,1) - df.data.transform(lambda k: set([item for z in k for item in z]))
     ...:
503 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can subtract sets:
#if necessary convert to nested lists
import ast
df['data'] = df['data'].apply(ast.literal_eval)
df = df.set_index('id')
ranges = df[['start', 'end']].apply(lambda x: set(range(x['start'], x['end'] + 1)), axis=1)
data = df['data'].apply(lambda k: set([z[0] for z in k]))
out = (ranges - data).to_dict()
print (out)
{1: set(), 2: {2004}, 3: {2001}}
df1 = pd.DataFrame([(k, v1) for k, v in out.items() for v1 in v], columns=['id','missed_one'])
print (df1)
id missed_one
0 2 2004
1 3 2001
Details:
print (ranges)
id
1 {2001, 2002, 2003, 2004}
2 {2001, 2002, 2003, 2004}
3 {2001, 2002, 2003, 2004}
print (data)
id
1 {2001, 2002, 2003, 2004}
2 {2001, 2002, 2003, 2005}
3 {2002, 2003, 2004}
Name: data, dtype: object
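The snippets above assume the example DataFrame df already exists; a minimal construction sketch for it, assuming the data column holds nested lists rather than strings:
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'start': [2001, 2001, 2001],
    'end': [2004, 2004, 2004],
    'data': [
        [[2004, 1], [2003, 2], [2002, 6], [2001, 0.9]],
        [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]],
        [[2004, 1], [2003, 2], [2002, 6]],
    ],
})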

Manipulate pandas.DataFrame with multiple criterias

For example I have a dataframe:
df = pd.DataFrame({'Value_Bucket': [5, 5, 5, 10, 10, 10],
                   'DayofWeek': [1, 1, 3, 2, 4, 2],
                   'Hour_Bucket': [1, 5, 7, 4, 3, 12],
                   'Values': [1, 1.5, 2, 3, 5, 3]})
The actual data set is rather large (5000+ rows). I'm looking to apply functions to 'Values' where "Value_Bucket" equals 5, for each possible combination of "DayofWeek" and "Hour_Bucket".
Essentially the data will be grouped into a table of 24 rows (Hour_Bucket) and 7 columns (DayofWeek), with each cell filled by the result of a function (say, the average). I can use groupby for one criterion; can someone explain how I can group by two criteria and tabulate the result in a table?
Query to subset, then groupby, then unstack:
df.query('Value_Bucket == 5').groupby(
    ['Hour_Bucket', 'DayofWeek']).Values.mean().unstack()
DayofWeek 1 3
Hour_Bucket
1 1.0 NaN
5 1.5 NaN
7 NaN 2.0
If you want to have zeros instead of NaN:
df.query('Value_Bucket == 5').groupby(
    ['Hour_Bucket', 'DayofWeek']).Values.mean().unstack(fill_value=0)
DayofWeek 1 3
Hour_Bucket
1 1.0 0.0
5 1.5 0.0
7 0.0 2.0
Pivot tables seem more natural to me than groupby paired with unstack, though they do the exact same thing.
pd.pivot_table(data=df.query('Value_Bucket == 5'),
               index='Hour_Bucket',
               columns='DayofWeek',
               values='Values',
               aggfunc='mean',
               fill_value=0)
Output
DayofWeek 1 3
Hour_Bucket
1 1.0 0
5 1.5 0
7 0.0 2
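Since the question ultimately wants a full 24 x 7 table even for hour/day combinations that never occur, you may want to reindex the result onto the complete grid; a minimal sketch, assuming hours run 0-23 and days are numbered 1-7:
table = (df.query('Value_Bucket == 5')
           .groupby(['Hour_Bucket', 'DayofWeek']).Values.mean()
           .unstack(fill_value=0)
           .reindex(index=range(24), columns=range(1, 8), fill_value=0))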
