My dataframe looks like:
c1
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
I want to find the minimum for every 3 rows, which should look like this:
c1 min
0 10 10
1 11 10
2 12 10
3 13 13
4 14 13
5 15 13
6 16 16
7 17 16
The number of rows might not be divisible by 3. I can't achieve this with the rolling function.
If the DataFrame has the default RangeIndex, use integer division of the index by 3 and pass the result to GroupBy.transform with 'min':
df['min'] = df['c1'].groupby(df.index // 3).transform('min')
For an arbitrary index, generate a helper array with np.arange instead:
df['min'] = df['c1'].groupby(np.arange(len(df)) // 3).transform('min')
print(df)
c1 min
0 10 10
1 11 10
2 12 10
3 13 13
4 14 13
5 15 13
6 16 16
7 17 16
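Below is a minimal, self-contained sketch of the same approach with the sample data rebuilt from scratch; note the row count (8) is not divisible by 3, and integer division simply puts the trailing two rows in a smaller final group:
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': range(10, 18)})
# rows 0-2 form group 0, rows 3-5 group 1, rows 6-7 group 2
df['min'] = df['c1'].groupby(np.arange(len(df)) // 3).transform('min')
print(df)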
You can also assign every third value and forward-fill the gaps:
>>> df['min'] = df['c1'][::3]
>>> df.ffill().astype(int)
c1 min
0 10 10
1 11 10
2 12 10
3 13 13
4 14 13
5 15 13
6 16 16
7 17 16
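For clarity, here is the intermediate state after the first assignment: only rows 0, 3 and 6 receive a value, index alignment leaves NaN everywhere else, and ffill then fills each gap downward:
>>> df['min'] = df['c1'][::3]
>>> df
   c1   min
0  10  10.0
1  11   NaN
2  12   NaN
3  13  13.0
4  14   NaN
5  15   NaN
6  16  16.0
7  17   NaN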
I have a DataFrame with a column B as follows:
A B
0 0 20.00
1 1 35.00
2 2 75.00
3 3 29.00
4 4 125.00
5 5 16.00
6 6 52.50
7 7 NaN
8 8 NaN
9 9 NaN
10 10 NaN
11 11 NaN
12 12 NaN
13 13 239.91
14 14 22.87
15 15 52.74
16 16 37.20
17 17 27.44
18 18 57.01
19 19 29.88
I want to change the values of the column as follows:
if 0 < B < 10.0, replace the value of B with "0 to 10"
if 10.1 < B < 20.0, replace the value of B with "10 to 20"
and continue like this until the maximum range is reached.
I have tried
ds['B'] = np.where(ds['B'].between(10.0,20.0), "10 to 20", ds['B'])
But once I perform this operation, the column contains strings like "10 to 20", so I cannot perform the operation again for the remaining values of the DataFrame. After this step, the DataFrame looks like this:
A B
0 0 10 to 20
1 1 35.0
2 2 75.0
3 3 29.0
4 4 125.0
5 5 10 to 20
6 6 52.5
7 7 nan
8 8 nan
9 9 nan
10 10 nan
11 11 nan
12 12 nan
13 13 239.91
14 14 22.87
15 15 52.74
16 16 37.2
17 17 27.44
18 18 57.01
19 19 29.88
And the following line: ds['B'] = np.where(ds['B'].between(20.0,30.0), "20 to 30", ds['B']) will throw TypeError: '>=' not supported between instances of 'str' and 'float'
How can I code this to change all of the values in the DataFrame to these range strings all at once?
Build your bins and labels and use pd.cut:
bins = np.arange(0, df["B"].max() // 10 * 10 + 20, 10).astype(int)  # go one step past the max so the top value gets binned
labels = [' to '.join(t) for t in zip(bins[:-1].astype(str), bins[1:].astype(str))]
df["B"] = pd.cut(df["B"], bins=bins, labels=labels)
>>> df
A B
0 0 10 to 20
1 1 30 to 40
2 2 70 to 80
3 3 20 to 30
4 4 120 to 130
5 5 10 to 20
6 6 50 to 60
7 7 NaN
8 8 NaN
9 9 NaN
10 10 NaN
11 11 NaN
12 12 NaN
13 13 230 to 240
14 14 20 to 30
15 15 50 to 60
16 16 30 to 40
17 17 20 to 30
18 18 50 to 60
19 19 20 to 30
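One thing to be aware of: pd.cut builds right-closed intervals by default (right=True), so a value sitting exactly on a boundary, such as 20.0, falls into the lower bucket. A quick check on a few hypothetical values:
>>> pd.cut([10.5, 20.0, 20.1], bins=[0, 10, 20, 30], labels=['0 to 10', '10 to 20', '20 to 30'])
['10 to 20', '10 to 20', '20 to 30']
Categories (3, object): ['0 to 10' < '10 to 20' < '20 to 30']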
This can be done with much less code as this is actually just a matter of string formatting.
ds['B'] = ds['B'].apply(lambda x: f'{int(x/10) if x>=10 else ""}0 to {int(x/10)+1}0' if pd.notnull(x) else x)
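To see what the f-string produces, here are a few sample evaluations; note that an exact boundary value such as 20.0 lands in "20 to 30" here, one bucket higher than pd.cut's right-closed intervals would place it:
>>> fmt = lambda x: f'{int(x/10) if x>=10 else ""}0 to {int(x/10)+1}0'
>>> [fmt(x) for x in (5.0, 20.0, 35.0, 239.91)]
['0 to 10', '20 to 30', '30 to 40', '230 to 240']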
You can create a custom function that maps each value to a range string (for example, 19.0 maps to "10 to 20") and then apply this function to each row.
I've written the code so that the minimum and maximum of the range are derived from the DataFrame itself, rounded to multiples of 10.
import numpy as np
import pandas as pd

# copy and paste your DataFrame
ds = pd.read_clipboard()

# floor to the nearest multiple of 10
ds_min = ds['B'].min() // 10 * 10
# ceiling to the nearest multiple of 10
ds_max = np.ceil(ds['B'].max() / 10) * 10
# np.linspace needs an integer sample count
ranges = np.linspace(ds_min, ds_max, int((ds_max - ds_min) / 10) + 1)

def map_value_to_string(value):
    for idx in range(1, len(ranges)):
        low_value, high_value = ranges[idx - 1], ranges[idx]
        if low_value < value <= high_value:
            return f"{int(low_value)} to {int(high_value)}"
    return None  # NaN and out-of-range values map to None

ds['B'] = ds['B'].apply(map_value_to_string)
Output:
>>> ds
A B
0 0 10 to 20
1 1 30 to 40
2 2 70 to 80
3 3 20 to 30
4 4 120 to 130
5 5 10 to 20
6 6 50 to 60
7 7 None
8 8 None
9 9 None
10 10 None
11 11 None
12 12 None
13 13 230 to 240
14 14 20 to 30
15 15 50 to 60
16 16 30 to 40
17 17 20 to 30
18 18 50 to 60
19 19 20 to 30
Disclaimer: This might be a possible duplicate, but I cannot find the exact solution. Please feel free to mark this question as a duplicate and provide a link to the duplicate question in the comments.
I am still learning python dataframe operations and this possibly has a very simple solution which I am not able to figure out.
I have a pandas DataFrame with a single column. I want to change the value of each row to the value of the previous row if certain conditions are satisfied. I have created a loop to implement this, but I was hoping for a more efficient solution.
Creation of initial data:
import numpy as np
import pandas as pd
data = np.random.randint(5,30,size=20)
df = pd.DataFrame(data, columns=['random_numbers'])
print(df)
random_numbers
0 6
1 24
2 29
3 18
4 22
5 17
6 12
7 7
8 6
9 27
10 29
11 13
12 23
13 6
14 25
15 24
16 16
17 15
18 25
19 19
Now let's assume the two conditions are: 1) value less than 10, and 2) value more than 20. In either case, set the row value to the previous row's value. This has been implemented with a loop as follows:
for index, row in df.iterrows():
    if index == 0:
        continue
    if row.random_numbers < 10:
        df.loc[index, 'random_numbers'] = df.loc[index - 1, 'random_numbers']
    if row.random_numbers > 20:
        df.loc[index, 'random_numbers'] = df.loc[index - 1, 'random_numbers']
random_numbers
0 6
1 6
2 6
3 18
4 18
5 17
6 12
7 12
8 12
9 12
10 12
11 13
12 13
13 13
14 13
15 13
16 16
17 15
18 15
19 19
Please suggest a more efficient way to implement this logic as I am using large number of rows.
You can replace the values less than 10 and greater than 20 with NaN, then use pandas.DataFrame.ffill() to fill each NaN with the previous row's value.
mask = (df['random_numbers'] < 10) | (df['random_numbers'] > 20)
# The loop skips row 0 (`if index == 0:`), so exclude it from the mask;
# otherwise a masked first row would have nothing to fill and stay NaN.
mask[df.index[0]] = False
df.loc[mask, 'random_numbers'] = np.nan
df['random_numbers'] = df['random_numbers'].ffill()
# Original
random_numbers
0 7
1 28
2 8
3 14
4 12
5 20
6 21
7 11
8 16
9 27
10 19
11 23
12 18
13 5
14 6
15 11
16 6
17 8
18 17
19 8
# After replaced
random_numbers
0 7.0
1 7.0
2 7.0
3 14.0
4 12.0
5 20.0
6 20.0
7 11.0
8 16.0
9 16.0
10 19.0
11 19.0
12 18.0
13 18.0
14 18.0
15 11.0
16 11.0
17 11.0
18 17.0
19 17.0
We can also do it in a simpler way by using .mask() together with .ffill() and slicing on [1:] as follows:
df.loc[1:, 'random_numbers'] = df['random_numbers'][1:].mask((df['random_numbers'] < 10) | (df['random_numbers'] > 20))
df['random_numbers'] = df['random_numbers'].ffill(downcast='infer')
.mask() tests the condition and replaces the value with NaN where the condition is true (NaN is the default when the other= parameter is not supplied); rows where the condition is not met retain their original values.
Note that supplying downcast='infer' in the call to .ffill() keeps the resulting numbers as integers instead of letting them be unexpectedly upcast to float.
We use [1:] on the first line to ensure the data in row 0 is left untouched.
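Note that recent pandas versions deprecate the downcast argument; assuming no NaN remains after the fill (guaranteed here, since row 0 is never masked), a sketch of an alternative that keeps the integer dtype:
df['random_numbers'] = df['random_numbers'].ffill().astype(int)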
# Original data: (reusing your sample data)
random_numbers
0 6
1 24
2 29
3 18
4 22
5 17
6 12
7 7
8 6
9 27
10 29
11 13
12 23
13 6
14 25
15 24
16 16
17 15
18 25
19 19
# After transformation:
random_numbers
0 6
1 6
2 6
3 18
4 18
5 17
6 12
7 12
8 12
9 12
10 12
11 13
12 13
13 13
14 13
15 13
16 16
17 15
18 15
19 19
Say I have a dataframe that looks like this:
A
0 17
1 21
2 18
3 11
4 4
5 27
6 21
7 11
8 7
9 4
10 7
11 4
12 3
13 27
14 27
15 11
16 11
17 25
I'd like to zero out every value that is smaller than the largest value seen so far, keeping a value only when it matches the running maximum. The desired output is this:
A B
0 17 17
1 21 21
2 18 0
3 11 0
4 4 0
5 27 27
6 21 0
7 11 0
8 7 0
9 4 0
10 7 0
11 4 0
12 3 0
13 27 27
14 27 27
15 11 0
16 11 0
17 25 0
Currently I run an iterrows loop that looks like this:
df['B'] = df['A']
lastrow = -1
for i, row in df.iterrows():
    if lastrow > row['B']:
        row['B'] = 0
    else:
        lastrow = row['B']
But it's quite slow. Is there a way to improve the speed of this loop?
I timed it, and for 100,000 rows this is the output:
CPU times: user 10.3 s, sys: 4.5 ms, total: 10.3 s
Wall time: 10.4 s
Check with cummax: keep a value where it equals the running maximum, otherwise replace it with 0.
df['B'] = df.A.where(df.A.eq(df.A.cummax()), 0)
df
Out[75]:
A B
0 17 17
1 21 21
2 18 0
3 11 0
4 4 0
5 27 27
6 21 0
7 11 0
8 7 0
9 4 0
10 7 0
11 4 0
12 3 0
13 27 27
14 27 27
15 11 0
16 11 0
17 25 0
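To see why this works, compare A against its running maximum; where() keeps A only on the rows where the two agree (rows 0, 1, 5, 13 and 14) and substitutes 0 everywhere else:
>>> df.A.cummax().tolist()
[17, 21, 21, 21, 21, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27]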
I am trying to implement a task of finding all simple cycles in an undirected graph. Originally the task was to find all cycles of fixed length (= 3), and I managed to do it using the properties of adjacency matrices. But before using that approach I also tried DFS, which worked correctly for really small input sizes; for bigger inputs, though, it went crazy, ending in (nearly) infinite loops. I tried to fix the code, but then it just could not find all the cycles.
My code is attached below.
1. Please do not pay attention to the several global variables used. The working code using the other approach was already submitted. This one is just for me to see how to make DFS work properly.
2. Yes, I searched for this problem before posting this question, but the options I managed to find either used a different approach or were only about detecting whether there are cycles at all. Besides, I want to know if it is possible to fix my code.
Big thanks to anyone who could help.
num_res = 0
adj_list = []
cycles_list = []

def dfs(v, path):
    global num_res
    for node in adj_list[v]:
        if node not in path:
            dfs(node, path + [node])
        elif len(path) >= 3 and (node == path[-3]):
            if sorted(path[-3:]) not in cycles_list:
                cycles_list.append(sorted(path[-3:]))
                num_res += 1

if __name__ == "__main__":
    num_towns, num_pairs = [int(x) for x in input().split()]
    adj_list = [[] for x in range(num_towns)]
    adj_matrix = [[0 for x in range(num_towns)] for x in range(num_towns)]
    # EDGE LIST TO ADJACENCY LIST
    for i in range(num_pairs):
        cur_start, cur_end = [int(x) for x in input().split()]
        adj_list[cur_start].append(cur_end)
        adj_list[cur_end].append(cur_start)
    dfs(0, [0])
    print(num_res)
UPD: It works OK for the following inputs:
5 8
4 0
0 2
0 1
3 2
4 3
4 2
1 3
3 0
(output: 5)
6 15
5 4
2 0
3 1
5 1
4 1
5 3
1 0
4 0
4 3
5 2
2 1
3 0
3 2
5 0
4 2
(output: 20)
9 12
0 1
0 2
1 3
1 4
2 4
2 5
3 6
4 6
4 7
5 7
6 8
7 8
(output: 0)
But for the following input it does NOT give any output and just keeps looping:
22 141
5 0
12 9
18 16
7 6
7 0
4 1
16 1
8 1
6 1
14 0
16 0
11 9
20 14
12 3
18 3
1 0
17 0
17 15
14 5
17 13
6 5
18 12
21 1
13 4
18 11
18 13
8 0
15 9
21 18
13 6
12 8
16 13
20 18
21 3
11 6
15 14
13 5
17 5
10 8
9 5
16 14
19 9
7 5
14 10
16 4
18 7
12 1
16 3
19 18
19 17
20 2
12 11
15 3
15 11
13 2
10 7
15 13
10 9
7 3
14 3
10 1
21 19
9 2
21 4
19 0
18 1
10 6
15 0
20 7
14 11
19 6
18 10
7 4
16 10
9 4
13 3
12 2
4 3
17 7
15 8
13 7
21 14
4 2
21 0
20 16
18 8
20 12
14 2
13 1
16 15
17 11
17 16
20 10
15 7
14 1
13 0
17 12
18 5
12 4
15 1
16 9
9 1
17 14
16 2
12 5
20 8
19 2
18 4
19 4
19 11
15 12
14 12
11 8
17 10
18 14
12 7
16 8
20 11
8 7
18 9
6 4
11 5
17 6
5 3
15 10
20 19
15 6
19 10
20 13
9 3
13 9
13 10
21 7
19 13
19 12
19 14
6 3
21 15
21 6
17 3
10 5
(output should be 343)
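For reference, here is a minimal sketch of the adjacency-matrix approach mentioned at the top: trace(A^3) counts every closed walk of length 3, and each triangle is counted 6 times (3 starting vertices × 2 directions), so dividing by 6 gives the number of length-3 cycles. This assumes adj_matrix is filled in inside the edge-reading loop:
import numpy as np

# in the edge-reading loop, also set:
#     adj_matrix[cur_start][cur_end] = 1
#     adj_matrix[cur_end][cur_start] = 1
A = np.array(adj_matrix)
print(int(np.trace(np.linalg.matrix_power(A, 3))) // 6)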
I am new to programming and have taken up learning Python in an attempt to make some tasks I run in my research more efficient. I am running a PCA with the pandas module (I found a tutorial online) and have the script for this, but I need to sub-select part of a DataFrame prior to the PCA.
So far I have (just as an example; in reality I am reading a .csv file with a larger matrix):
x = np.random.randint(30, size=(8,8))
df = pd.DataFrame(x)
0 1 2 3 4 5 6 7
0 9 0 23 13 2 5 14 6
1 20 17 11 10 25 23 20 23
2 15 14 22 25 11 15 5 15
3 9 27 15 27 7 15 17 23
4 12 6 11 13 27 11 26 20
5 27 13 5 16 5 5 2 18
6 3 18 22 0 7 10 11 11
7 25 18 10 11 29 29 1 25
What I want to do is sub-select the columns that satisfy a certain criterion in any of their rows; specifically, I want every column that has at least one number >= 27 (just for example), producing a new dataframe:
0 1 3 4 5
0 9 0 13 2 5
1 20 17 10 25 23
2 15 14 25 11 15
3 9 27 27 7 15
4 12 6 13 27 11
5 27 13 16 5 5
6 3 18 0 7 10
7 25 18 11 29 29
I have looked into the various slicing methods in pandas but none seem to do what I want (.loc, .iloc, etc.).
The actual script I am using to read in the data thus far is:
filename = 'Data.csv'
data = pd.read_csv(filename, sep=',')
x = data.iloc[:, 1:]  # variables - species
y = data.iloc[:, 0]   # cases - age
so a sub-DataFrame of x is what I am after (as above).
Any advice is greatly appreciated.
Indexers like loc and iloc accept boolean arrays. For example, if you have three columns, df.loc[:, [True, False, True]] returns all the rows and columns 0 and 2 (the columns whose corresponding value is True). You can check whether any element in a column is greater than or equal to 27 with (df >= 27).any(); this returns True for the columns that have at least one value >= 27. So you can slice the DataFrame with:
df.loc[:, (df >= 27).any()]
Out[34]:
0 1 3 4 5 7
0 8 2 28 9 14 21
1 24 26 23 17 0 0
2 3 24 7 15 4 28
3 29 17 12 7 7 6
4 5 3 10 24 29 14
5 23 21 0 16 23 13
6 22 10 27 1 7 24
7 9 27 2 27 17 12
And this is the initial dataframe:
df
Out[35]:
0 1 2 3 4 5 6 7
0 8 2 7 28 9 14 26 21
1 24 26 15 23 17 0 21 0
2 3 24 26 7 15 4 7 28
3 29 17 9 12 7 7 0 6
4 5 3 13 10 24 29 22 14
5 23 21 26 0 16 23 17 13
6 22 10 19 27 1 7 9 24
7 9 27 26 2 27 17 8 12
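Applied to the variables in the question, a sketch would look like this (x being the species columns read from the .csv):
x = data.iloc[:, 1:]               # variables - species
x_sub = x.loc[:, (x >= 27).any()]  # keep columns with at least one value >= 27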