Create a sliding window of data index positions - python

I am trying to write a function that returns the index positions of a sliding window over a Pandas DataFrame as a list of (train, test) tuples.
Example:
df.head(10)
   col_a  col_b
0   20.1    6.0
1   19.1    7.1
2   19.1    8.9
3   16.5   11.0
4   16.0   11.1
5   17.4    8.7
6   19.3    9.7
7   22.8   12.6
8   21.4   11.9
9   23.0   12.8
def split_function(df, train_length, test_length):
    # some_logic_to_split_dataframe
    split_indices = [(train_idx, test_idx) for index_tuples in split_dataframe_logic]
    return split_indices
Desired outcome:
train_length = 2
test_length = 1
split_indices = split_function(df, train_length, test_length)
split_indices
output:
[((0,1), (2)), ((1,2),(3)),...,((7,8), (9)) etc]
The loop or generator expression inside the function would also need to terminate once the test index reaches the last observation.
All help very much appreciated

I would suggest using the rolling method offered by pandas.
import numpy as np

split_indices = []

def split(x):
    # store the first train_length and last test_length index labels of the current window
    split_indices.append((x.index[:train_length], x.index[-test_length:]))
    return np.nan  # rolling.apply must return a scalar; the rolling result itself is discarded

df['col_a'].rolling(train_length + test_length).apply(split)
This code will create the following split_indices
>>> split_indices
[(Int64Index([0, 1], dtype='int64'), Int64Index([2], dtype='int64')),
(Int64Index([1, 2], dtype='int64'), Int64Index([3], dtype='int64')),
(Int64Index([2, 3], dtype='int64'), Int64Index([4], dtype='int64')),
(Int64Index([3, 4], dtype='int64'), Int64Index([5], dtype='int64')),
(Int64Index([4, 5], dtype='int64'), Int64Index([6], dtype='int64')),
(Int64Index([5, 6], dtype='int64'), Int64Index([7], dtype='int64')),
(Int64Index([6, 7], dtype='int64'), Int64Index([8], dtype='int64')),
(Int64Index([7, 8], dtype='int64'), Int64Index([9], dtype='int64'))]
Afterwards you can easily retrieve the rows of your dataframe for a given index:
>>> df.loc[split_indices[3][0]]
   col_a  col_b
3   16.5   11.0
4   16.0   11.1
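If you prefer a plain generator that also makes the termination condition explicit (it stops as soon as the last test index reaches the final observation), here is a minimal sketch; the generator form and the use of positional indices are my own choices, not part of the question:
def split_function(df, train_length, test_length):
    # yield (train_positions, test_positions) tuples for a sliding window
    window = train_length + test_length
    for start in range(len(df) - window + 1):
        train_idx = tuple(range(start, start + train_length))
        test_idx = tuple(range(start + train_length, start + window))
        yield train_idx, test_idx

split_indices = list(split_function(df, train_length=2, test_length=1))
# [((0, 1), (2,)), ((1, 2), (3,)), ..., ((7, 8), (9,))]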

How to calculate the ratio per columns in python?

I'm trying to calculate the ratio by columns in python.
import pandas as pd
import numpy as np
data={
'category': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
'value 1': [1, 1, 2, 5, 3, 4, 4, 8, 7],
'value 2': [4, 2, 8, 5, 7, 9, 3, 4, 2]
}
data=pd.DataFrame(data)
data.set_index('category')
# value 1 value 2
#category
# A 1 4
# B 1 2
# C 2 8
# D 5 5
# E 3 7
# F 4 9
# G 4 3
# H 8 4
# I 7 2
The expected results is as below:
#The sum of value 1: 35, value 2: 44
#The values in the first column were divided by 35, and those in the second column by 44
# value 1 value 2
#category
# A 0.028 0.090
# B 0.028 0.045
# C 0.057 0.181
# D 0.142 0.113
# E 0.085 0.159
# F 0.114 0.204
# G 0.114 0.068
# H 0.228 0.090
# I 0.2 0.045
I tried to run the below code, but it returned NaN values:
data=data.apply(lambda x:x/data.sum())
data
I think there are simpler methods for this job, but I cannot find the proper keywords.
How can I calculate the ratio in each column?
The issue is that you did not assign the result of set_index back, so it never took effect on data. What I usually do to avoid this kind of mistake is to chain the operations as a pipeline:
data = pd.DataFrame(data)
dataf = (
    data
    .set_index('category')
    .transform(lambda d: d / d.sum())
)
print(dataf)
By piping the commands, you get what you want. Note: I used transform instead of apply for speed. Pipelines are easy to read and less prone to mistakes. Using inplace=True is discouraged in pandas, as its effects can be unpredictable.
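For completeness, once the index is set you can also rely on DataFrame/Series broadcasting directly; a minimal sketch, assuming data is the DataFrame from the question:
ratios = data.set_index('category')
ratios = ratios / ratios.sum()   # each column is divided by its own column sum
print(ratios.round(3))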

Searching in numpy array

I have a 2D numpy array, say A, sorted with respect to column 0, e.g.
Col.0  Col.1  Col.2
10     2.45   3.25
11     2.95   4
12     3.45   4.25
15     3.95   5
18     4.45   5.25
21     4.95   6
23     5.45   6.25
27     5.95   7
29     6.45   7.25
32     6.95   8
35     7.45   8.25
The entries in each row are unique: column 0 is the identification number of a coordinate in the xy plane, and columns 1 and 2 are the x and y coordinates of that point. I have another array B (whose rows can contain duplicate data); its column 0 and column 1 store x and y coordinates.
Col.0  Col.1
2.45   3.25
4.45   5.25
6.45   7.25
2.45   3.25
My aim is to find the row index in array A corresponding to each row of array B, without using a for loop. So, in this case, my output should be [0, 4, 8, 0].
Now, I know that numpy's searchsorted can look up multiple values in one shot, but it compares against a single column of A, not multiple columns. Is there a way to do this?
Pure numpy solution:
My intuition is to take the difference between a[:,1:] and b by broadcasting, giving an array of shape (11, 4, 2) whose matching rows are all zeros. Comparing it == False turns that into a boolean mask c, and c.all(2) reduces it to a boolean array of shape (11, 4) whose True elements mark matches between a and b. Then I simply use nonzero to obtain the indices of those elements.
import numpy as np

a = np.array([
    [10, 2.45, 3.25],
    [11, 2.95, 4],
    [12, 3.45, 4.25],
    [15, 3.95, 5],
    [18, 4.45, 5.25],
    [21, 4.95, 6],
    [23, 5.45, 6.25],
    [27, 5.95, 7],
    [29, 6.45, 7.25],
    [32, 6.95, 8],
    [35, 7.45, 8.25],
])
b = np.array([
    [2.45, 3.25],
    [4.45, 5.25],
    [6.45, 7.25],
    [2.45, 3.25],
])
c = (a[:, np.newaxis, 1:] - b) == False
rows, cols = c.all(2).nonzero()
print(rows[cols.argsort()])
# [0 4 8 0]
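Exact equality on floats can be fragile after arithmetic; a tolerance-based variant of the same idea, using np.isclose (my own adaptation, not part of the answer above), could look like this:
c = np.isclose(a[:, np.newaxis, 1:], b)   # shape (11, 4, 2), True where coordinates match within tolerance
rows, cols = c.all(2).nonzero()
print(rows[cols.argsort()])
# [0 4 8 0]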
You can use merge in pandas:
df2.merge(df1.reset_index(),how='left',left_on=['Col.0','Col.1'],right_on=['Col.1','Col.2'])['index']
output:
0 0
1 4
2 8
3 0
Name: index, dtype: int64
and if you like it as array:
df2.merge(df1.reset_index(),how='left',left_on=['Col.0','Col.1'],right_on=['Col.1','Col.2'])['index'].to_numpy()
#array([0, 4, 8, 0])
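The merge approach assumes the two arrays are available as DataFrames df1 and df2; a minimal sketch of that conversion, reusing the arrays a and b defined above, with column names assumed to match the tables:
import pandas as pd

df1 = pd.DataFrame(a, columns=['Col.0', 'Col.1', 'Col.2'])   # array A
df2 = pd.DataFrame(b, columns=['Col.0', 'Col.1'])            # array B

idx = df2.merge(df1.reset_index(), how='left',
                left_on=['Col.0', 'Col.1'],
                right_on=['Col.1', 'Col.2'])['index'].to_numpy()
# array([0, 4, 8, 0])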

Replace specific values in multiindex dataframe

I have a multiindex dataframe with 3 index levels and 2 numerical columns.
A 1 2017-04-01  14.0   87.346878
    2017-06-01   4.0   87.347504
  2 2014-08-01   1.0  123.110001
    2015-01-01   4.0  209.612503
B 3 2014-07-01   1.0   68.540001
    2014-12-01   1.0   64.370003
  4 2015-01-01   3.0   75.000000
I want to replace the values in the first row of the third index level wherever a new second-level index begins.
For example, every first row:
(A,1,2017-04-01)->0.0 0.0
(A,2,2014-08-01)->0.0 0.0
(B,3,2014-07-01)->0.0 0.0
(B,4,2015-01-01)->0.0 0.0
The dataframe is too big, and doing it slice by slice like df.xs(('A', 1)) ... df.xs(('A', 2)) gets time consuming. Is there some way I can get a mask and replace the values at these positions?
Use DataFrame.reset_index on level=2, then group with DataFrame.groupby on level=[0, 1] and aggregate level_2 using first; create a multilevel index from the result with pd.MultiIndex.from_arrays, and finally use this multilevel index to change the values in the dataframe:
idx = df.reset_index(level=2).groupby(level=[0, 1])['level_2'].first()
idx = pd.MultiIndex.from_arrays(idx.reset_index().to_numpy().T)
df.loc[idx, :] = 0
Result:
# print(df)
col1 col2
A 1 2017-04-01 0.0 0.000000
2017-06-01 4.0 87.347504
2 2014-08-01 0.0 0.000000
2015-01-01 4.0 209.612503
B 3 2014-07-01 0.0 0.000000
2014-12-01 1.0 64.370003
4 2015-01-01 0.0 0.000000
We can extract a series of the second-level index with:
df.index.get_level_values(1)
# output: Int64Index([1, 1, 2, 2, 3, 3, 4], dtype='int64')
And check where it changes with:
idx = df.index.get_level_values(1)
np.where(idx != np.roll(idx, 1))[0]
# output: array([0, 2, 4, 6])
So we can simply use the returned value of the second statement with iloc to get the first row of every second-level index and modify their values like this:
idx = df.index.get_level_values(1)
df.iloc[np.where(idx != np.roll(idx, 1))[0]] = 0
output:
value1 value2
A 1 2017-04-01 0.0 0.000000
2017-06-01 4.0 87.347504
2 2014-08-01 0.0 0.000000
2015-01-01 4.0 209.612503
B 3 2014-07-01 0.0 0.000000
2014-12-01 1.0 64.370003
4 2015-01-01 0.0 0.000000
You can use the grouper indices in a simple iloc:
df.iloc[[a[0] for a in df.groupby(level=[0, 1]).indices.values()]] = 0
Example:
df = pd.DataFrame({'col1': [14., 4., 1., 4., 1., 1., 3.],
                   'col2': [87.346878, 87.347504, 123.110001, 209.612503, 68.540001, 64.370003, 75.]},
                  index=pd.MultiIndex.from_tuples([('A', 1, '2017-04-01'), ('A', 1, '2017-06-01'),
                                                   ('A', 2, '2014-08-01'), ('A', 2, '2015-01-01'),
                                                   ('B', 3, '2014-07-01'), ('B', 3, '2014-12-01'),
                                                   ('B', 4, '2015-01-01')]))
Result:
col1 col2
A 1 2017-04-01 0.0 0.000000
2017-06-01 4.0 87.347504
2 2014-08-01 0.0 0.000000
2015-01-01 4.0 209.612503
B 3 2014-07-01 0.0 0.000000
2014-12-01 1.0 64.370003
4 2015-01-01 0.0 0.000000
Timings:
%%timeit
idx = df.reset_index(level=2).groupby(level=[0, 1])['level_2'].first()
idx = pd.MultiIndex.from_arrays(idx.reset_index().to_numpy().T)
df.loc[idx, :] = 0
#6.7 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.iloc[[a[0] for a in df.groupby(level=[0, 1]).indices.values()]] = 0
#897 µs ± 6.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So this is about 7 times faster than the accepted answer.
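Since the question explicitly asks for a mask, an equivalent boolean-mask formulation (my own variant built on groupby.cumcount, not part of the answers above) would be:
# True for the first row of every (level 0, level 1) group
mask = df.groupby(level=[0, 1]).cumcount() == 0
df.loc[mask, :] = 0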
I think you can use something like this:
import pandas as pd
import numpy as np

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 4), index=index)
df
You can create a list of the unique first-level values of the index, get the positional start of each group with get_loc, and then replace the values in those rows:
lst = ['bar', 'baz', 'foo', 'qux']
positions = []
for i in lst:
    loc = df.index.get_loc(i)          # slice covering this first-level group
    start = loc.indices(len(df))[0]    # positional start of the group
    positions.append(start)

df.iloc[positions] = 0
df
Hopefully this helps. Cheers!

check start and end available in data using python pandas

df1
id start end data
1 2001 2004 [[2004,1],[2003,2],[2002,6],[2001,0.9]]
2 2001 2004 [[2005,1],[2003,2],[2002,6],[2001,0.9]]
3 2001 2004 [[2004,1],[2003,2],[2002,6]]
output
id missed_one
2 2004
3 2001
That is the expected output. I have to check every year from start to end against the years available in data; if any year is missing, it should appear in the output.
You can use set differencing:
df[['start', 'end']].agg(set,1) - df.data.transform(lambda k: set([item for z in k for item in z]))
1 {}
2 {2004}
3 {2001}
dtype: object
Using a list comprehension and zip:
out = df.assign(missing=[
    [i for i in range(start, end+1) if i not in {d for d, _ in datum}] or np.nan
    for datum, start, end in zip(df.data, df.start, df.end)
])
id start end data missing
0 1 2001 2004 [[2004, 1], [2003, 2], [2002, 6], [2001, 0.9]] NaN
1 2 2001 2004 [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]] [2004]
2 3 2001 2004 [[2004, 1], [2003, 2], [2002, 6]] [2001]
So if you want only rows with a year missing:
out.loc[out.missing.notnull()]
id start end data missing
1 2 2001 2004 [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]] [2004]
2 3 2001 2004 [[2004, 1], [2003, 2], [2002, 6]] [2001]
If you only want to show a single missing value, and not a list of missing values, you can use next:
df.assign(missing=[
    next((i for i in range(start, end+1) if i not in {d for d, _ in datum}), np.nan)
    for datum, start, end in zip(df.data, df.start, df.end)
])
id start end data missing
0 1 2001 2004 [[2004, 1], [2003, 2], [2002, 6], [2001, 0.9]] NaN
1 2 2001 2004 [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]] 2004.0
2 3 2001 2004 [[2004, 1], [2003, 2], [2002, 6]] 2001.0
Some timings:
df = pd.concat([df]*10000)
In [145]: %%timeit
     ...: out = df.assign(missing=[
     ...:     [i for i in range(start, end+1) if i not in {d for d, _ in datum}] or np.nan
     ...:     for datum, start, end in zip(df.data, df.start, df.end)
     ...: ])
     ...:
72.3 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [146]: %%timeit
     ...: df[['start', 'end']].agg(set,1) - df.data.transform(lambda k: set([item for z in k for item in z]))
     ...:
503 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can subtract sets:
#if necessary convert to nested lists
import ast
df['data'] = df['data'].apply(ast.literal_eval)
df = df.set_index('id')
ranges = df[['start', 'end']].apply(lambda x: set(range(x['start'], x['end'] + 1)), axis=1)
data = df['data'].apply(lambda k: set([z[0] for z in k]))
out = (ranges - data).to_dict()
print (out)
{1: set(), 2: {2004}, 3: {2001}}
df1 = pd.DataFrame([(k, v1) for k, v in out.items() for v1 in v], columns=['id','missed_one'])
print (df1)
id missed_one
0 2 2004
1 3 2001
Details:
print (ranges)
id
1 {2001, 2002, 2003, 2004}
2 {2001, 2002, 2003, 2004}
3 {2001, 2002, 2003, 2004}
print (data)
id
1 {2001, 2002, 2003, 2004}
2 {2001, 2002, 2003, 2005}
3 {2002, 2003, 2004}
Name: data, dtype: object
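The snippets above assume the example DataFrame df already exists; a minimal construction sketch for it, assuming the data column holds nested lists rather than strings:
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'start': [2001, 2001, 2001],
    'end': [2004, 2004, 2004],
    'data': [
        [[2004, 1], [2003, 2], [2002, 6], [2001, 0.9]],
        [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]],
        [[2004, 1], [2003, 2], [2002, 6]],
    ],
})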

Manipulate pandas.DataFrame with multiple criterias

For example I have a dataframe:
df = pd.DataFrame({'Value_Bucket': [5, 5, 5, 10, 10, 10],
                   'DayofWeek': [1, 1, 3, 2, 4, 2],
                   'Hour_Bucket': [1, 5, 7, 4, 3, 12],
                   'Values': [1, 1.5, 2, 3, 5, 3]})
The actual data set is rather large (5000+ rows). I'm looking to apply functions to 'Values' where "Value_Bucket" equals 5, for each possible combination of "DayofWeek" and "Hour_Bucket".
Essentially the data will be grouped into a table of 24 rows (Hour_Bucket) and 7 columns (DayofWeek), with each cell filled by the result of a function (say, the average). I can use groupby for one criterion; can someone explain how I can group by two criteria and tabulate the result in a table?
Query to subset, then groupby, then unstack:
df.query('Value_Bucket == 5').groupby(
    ['Hour_Bucket', 'DayofWeek']).Values.mean().unstack()
DayofWeek 1 3
Hour_Bucket
1 1.0 NaN
5 1.5 NaN
7 NaN 2.0
If you want to have zeros instead of NaN:
df.query('Value_Bucket == 5').groupby(
    ['Hour_Bucket', 'DayofWeek']).Values.mean().unstack(fill_value=0)
DayofWeek 1 3
Hour_Bucket
1 1.0 0.0
5 1.5 0.0
7 0.0 2.0
Pivot tables seem more natural to me than groupby paired with unstack, though they do the exact same thing.
pd.pivot_table(data=df.query('Value_Bucket == 5'),
               index='Hour_Bucket',
               columns='DayofWeek',
               values='Values',
               aggfunc='mean',
               fill_value=0)
Output
DayofWeek 1 3
Hour_Bucket
1 1.0 0
5 1.5 0
7 0.0 2
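Since the question ultimately wants a full 24 x 7 table even for hour/day combinations that never occur, you may want to reindex the result onto the complete grid; a minimal sketch, assuming hours run 0-23 and days are numbered 1-7:
table = (df.query('Value_Bucket == 5')
           .groupby(['Hour_Bucket', 'DayofWeek']).Values.mean()
           .unstack(fill_value=0)
           .reindex(index=range(24), columns=range(1, 8), fill_value=0))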
