Duplicating pandas dataframe vertically - python

I have the following dataframe:
Month Day season
0 4 15 current
1 4 16 current
2 4 17 current
3 4 18 current
4 4 19 current
5 4 20 current
I would like to duplicate it like so:
Month Day season
0 4 15 current
1 4 16 current
2 4 17 current
3 4 18 current
4 4 19 current
5 4 20 current
6 4 15 past
7 4 16 past
8 4 17 past
9 4 18 past
10 4 19 past
11 4 20 past
I can duplicate it using:
df.append([df]*2, ignore_index=True)
However, how do I duplicate it so that the season column has past as the value in the duplicated rows instead of current?

I think this would be a good case for assign, since it allows you to keep your functional programming style (I approve!)
In [144]: df.append([df.assign(season='past')]*2,ignore_index=True)
Out[144]:
Month Day season
0 4 15 current
1 4 16 current
2 4 17 current
3 4 18 current
4 4 19 current
5 4 20 current
6 4 15 past
7 4 16 past
8 4 17 past
9 4 18 past
10 4 19 past
11 4 20 past
12 4 15 past
13 4 16 past
14 4 17 past
15 4 18 past
16 4 19 past
17 4 20 past
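
Side note, hedged for newer pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, and appending [df]*2 yields two extra copies (18 rows) while the desired output above shows only one (12 rows). A minimal pd.concat sketch for exactly one 'past' copy, with the toy frame rebuilt inline:

import pandas as pd

df = pd.DataFrame({"Month": [4] * 6,
                   "Day": range(15, 21),
                   "season": ["current"] * 6})

# one 'past' copy stacked under the original, with a fresh 0..11 index
out = pd.concat([df, df.assign(season="past")], ignore_index=True)
print(out)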

Related

How to do rolling sum with conditional window criteria on different index levels in Python

I want to do a rolling sum based on different levels of the index but am struggling to make it a reality. Instead of explaining the problem, I am giving the demo input and desired output below, along with the kind of insights I am looking for.
I have multiple brands, and for each of them the sales of various item categories, grouped by year, month, and day as shown below. What I want is a dynamic rolling sum at each day level, rolled over a window on Year.
For example, someone might ask:
Demo question 1) Till a certain day (not including that day), what were their last 2 years' sales of that particular category for that particular brand?
I need to be able to answer this for every single day, i.e. every single row should have a number, as shown in Table 2.0.
I want to code it in such a way that if the question changes from 2 years to 3 years, I just need to change a number. I also need to do the same thing at the month level.
Demo question 2) Till a certain day (not including that day), what was their last 3 months' sale of that particular category for that particular year for that particular brand?
Below is the demo input.
The tables are grouped by brand, category, year, month, and day, with the sum of sales taken from a master table that has all the info and sales at the hour level for each day.
Table 1.0
Brand  Category        Year  Month  Day  Sales
ABC    Big Appliances  2021  9      3    0
ABC    Clothing        2021  9      2    0
ABC    Electronics     2020  10     18   2
ABC    Utensils        2020  10     18   0
ABC    Utensils        2021  9      2    4
ABC    Utensils        2021  9      3    0
XYZ    Big Appliances  2012  4      29   7
XYZ    Big Appliances  2013  4      7    6
XYZ    Clothing        2012  4      29   3
XYZ    Electronics     2013  4      9    1
XYZ    Electronics     2013  4      27   2
XYZ    Electronics     2013  5      4    5
XYZ    Electronics     2015  4      27   7
XYZ    Electronics     2015  5      2    2
XYZ    Fans            2013  4      14   4
XYZ    Fans            2013  5      4    0
XYZ    Fans            2015  4      18   1
XYZ    Fans            2015  5      17   11
XYZ    Fans            2016  4      12   18
XYZ    Furniture       2012  5      4    1
XYZ    Furniture       2012  5      8    6
XYZ    Furniture       2012  5      20   4
XYZ    Furniture       2013  4      5    1
XYZ    Furniture       2013  4      7    8
XYZ    Furniture       2013  4      9    2
XYZ    Furniture       2015  4      18   12
XYZ    Furniture       2015  4      27   15
XYZ    Furniture       2015  5      2    4
XYZ    Furniture       2015  5      17   3
XYZ    Musical-inst    2012  5      18   10
XYZ    Musical-inst    2013  4      5    6
XYZ    Musical-inst    2015  4      16   10
XYZ    Musical-inst    2015  4      18   0
XYZ    Musical-inst    2016  4      12   1
XYZ    Musical-inst    2016  4      16   13
XYZ    Utencils        2012  5      8    2
XYZ    Utencils        2016  4      16   3
XYZ    Utencils        2016  4      18   2
XYZ    Utencils        2017  4      12   13
Below is the desired output for demo question 1 based on the demo table (last 2 years' cumsum, not including that day).
Table 2.0
Brand  Category        Year  Month  Day  Sales  Conditional Cumsum (till last 2 years)
ABC    Big Appliances  2021  9      3    0      0
ABC    Clothing        2021  9      2    0      0
ABC    Electronics     2020  10     18   2      0
ABC    Utensils        2020  10     18   0      0
ABC    Utensils        2021  9      2    4      0
ABC    Utensils        2021  9      3    0      4
XYZ    Big Appliances  2012  4      29   7      0
XYZ    Big Appliances  2013  4      7    6      7
XYZ    Clothing        2012  4      29   3      0
XYZ    Electronics     2013  4      9    1      0
XYZ    Electronics     2013  4      27   2      1
XYZ    Electronics     2013  5      4    5      3
XYZ    Electronics     2015  4      27   7      8
XYZ    Electronics     2015  5      2    2      15
XYZ    Fans            2013  4      14   4      0
XYZ    Fans            2013  5      4    0      4
XYZ    Fans            2015  4      18   1      4
XYZ    Fans            2015  5      17   11     5
XYZ    Fans            2016  4      12   18     12
XYZ    Furniture       2012  5      4    1      0
XYZ    Furniture       2012  5      8    6      1
XYZ    Furniture       2012  5      20   4      7
XYZ    Furniture       2013  4      5    1      11
XYZ    Furniture       2013  4      7    8      12
XYZ    Furniture       2013  4      9    2      20
XYZ    Furniture       2015  4      18   12     11
XYZ    Furniture       2015  4      27   15     23
XYZ    Furniture       2015  5      2    4      38
XYZ    Furniture       2015  5      17   3      42
XYZ    Musical-inst    2012  5      18   10     0
XYZ    Musical-inst    2013  4      5    6      10
XYZ    Musical-inst    2015  4      16   10     6
XYZ    Musical-inst    2015  4      18   0      16
XYZ    Musical-inst    2016  4      12   1      10
XYZ    Musical-inst    2016  4      16   13     11
XYZ    Utencils        2012  5      8    2      0
XYZ    Utencils        2016  4      16   3      0
XYZ    Utencils        2016  4      18   2      3
XYZ    Utencils        2017  4      12   13     5
End thoughts:
The idea is basically to do a rolling window over the year column, maintaining the 2-year span criterion, and keep summing the sales figures.
P.S. I really need a fast solution due to the huge data size; I wrote a row-wise .apply function, which turned out not to be feasible. A better solution using some kind of grouped rolling sum or supporting columns would be really helpful.
Here I'm giving a sample solution for the above problem.
I have considered just one product so that the solution stays simple.
Code:
from datetime import date, timedelta

Input = {"Utencils": [[2012, 5, 8, 2], [2016, 4, 16, 3], [2017, 4, 12, 13]]}
Input1 = Input["Utencils"]
Limit = timedelta(365 * 2)  # time span: 2 years, expressed in days

cumsum = 0
lis = []  # indices of the rows currently inside the window
Tot = []  # the answer column
for i in range(len(Input1)):
    # evict rows that have fallen out of the window before row i
    while lis:
        idx = lis[0]
        Y, M, D = Input1[i][:3]
        reqDate = date(Y, M, D) - Limit
        Y, M, D = Input1[idx][:3]
        if date(Y, M, D) <= reqDate:
            lis.pop(0)
            cumsum -= Input1[idx][3]
        else:
            break
    Tot.append(cumsum)  # recorded before adding today, so the current day is excluded
    lis.append(i)
    cumsum += Input1[i][3]
print(Tot)
Here Tot holds the required cumsum column for the given data.
Output:
[0, 0, 3]
You can specify the time span via the number of days in the Limit variable.
Hope this solves the problem you are looking at.
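
Since the P.S. asks for something faster than a row-wise .apply, below is a hedged sketch of a vectorized alternative: build a real date column, then take a left-closed, time-based rolling sum per brand/category. The '730D' string is the only thing to change for a different span (e.g. '1095D' for 3 years), though the exact endpoint convention at the window edge should be checked against your own definition. Shown on the Utencils rows of Table 1.0:

import pandas as pd

df = pd.DataFrame({
    "Brand":    ["XYZ"] * 4,
    "Category": ["Utencils"] * 4,
    "Year":     [2012, 2016, 2016, 2017],
    "Month":    [5, 4, 4, 4],
    "Day":      [8, 16, 18, 12],
    "Sales":    [2, 3, 2, 13],
})

df["Date"] = pd.to_datetime(df[["Year", "Month", "Day"]].rename(columns=str.lower))
df = df.sort_values(["Brand", "Category", "Date"])  # must match the groupby keys

# closed='left' drops the current day from the window; '730D' ~ 2 years
rolled = (df.set_index("Date")
            .groupby(["Brand", "Category"])["Sales"]
            .rolling("730D", closed="left")
            .sum())
df["Cumsum_2y"] = rolled.fillna(0).to_numpy()  # aligns because df is pre-sorted
print(df)  # expected Cumsum_2y: 0, 0, 3, 5 as in Table 2.0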

Calculate moving average of residuals iteratively in pandas

I have a dataframe of this form
Residual = Actual - Pred
Actual Pred Residual
0 11 10 1
1 12 10 2
2 13 10 3
3 14 10 4
4 15 10 5
5 16 10 6
6 17 10 7
7 18 10 8
8 19 10 9
I want to calculate the 3-day moving average of the residuals and add it back to the Pred column, then recalculate the residuals and repeat the process for the next day, iteratively, as shown in the df below.
For example:
For index=3, the MA of the previous 3 days' residuals is (1+2+3)/3 = 2. We add this value to today's prediction, giving 12, and the new residual is 14-12 = 2.
Now, for index=4, we take the last 3 days' MA of Residual_New, i.e. (2+3+2)/3 ~ 2.33. So Pred_New = 12.33 and Residual_New = 15-12.33 = 2.67 .. and so on
Actual Pred Residual Pred_New Residual_New
0 11 10 1 10 1
1 12 10 2 10 2
2 13 10 3 10 3
3 14 10 4 10+2 2
4 15 10 5 10+2.33 2.67
5 16 10 6 ........
6 17 10 7 .......
7 18 10 8
8 19 10 9
How can I achieve this efficiently in pandas?
Thanks
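
Because each new residual feeds the next window, the recurrence is inherently sequential, so a plain loop is the straightforward approach. A minimal sketch, assuming a window of 3 and the column names from the question:

import pandas as pd

df = pd.DataFrame({"Actual": range(11, 20), "Pred": [10] * 9})
df["Residual"] = df["Actual"] - df["Pred"]

pred_new = df["Pred"].astype(float).copy()
resid_new = df["Residual"].astype(float).copy()
for i in range(3, len(df)):
    ma = resid_new.iloc[i - 3:i].mean()         # MA of the previous 3 new residuals
    pred_new.iloc[i] = df["Pred"].iloc[i] + ma  # adjusted prediction
    resid_new.iloc[i] = df["Actual"].iloc[i] - pred_new.iloc[i]

df["Pred_New"], df["Residual_New"] = pred_new, resid_new
print(df)  # index 3 -> Pred_New 12.0, Residual_New 2.0; index 4 -> 12.33, 2.67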

Pandas Business Day Offset: Request for Simple Example

I have a dataframe, "df", with a datetime index. Here is a rough snapshot of it:
V1 V2 V3 V4 V5
1/12/2008 4 15 11 7 1
1/13/2008 5 2 8 7 1
1/14/2008 13 13 9 6 4
1/15/2008 14 15 12 9 3
1/16/2008 1 10 2 12 15
1/17/2008 10 5 9 9 1
1/18/2008 13 11 5 7 2
1/19/2008 2 6 7 9 6
1/20/2008 5 4 14 3 7
1/21/2008 11 11 4 7 15
1/22/2008 9 4 15 10 3
1/23/2008 2 13 13 10 3
1/24/2008 12 15 14 12 8
1/25/2008 1 4 2 6 15
Some of the days in the index are weekends and holidays.
I would like to move all dates in the datetime index of "df" to their respective closest (US) business day (i.e. Monday-Friday, excluding holidays).
How would you recommend I do this? I am aware that pandas has a "timeseries offset" facility for this, but I haven't been able to find an example that walks a novice reader through it.
Can you help?
I am not familiar with this class, but after looking at the source code it seems fairly straightforward to achieve this. Keep in mind that it picks the next closest business day, meaning a Saturday turns into the following Monday rather than Friday. Also, making your index non-unique will decrease performance on your DataFrame, so I suggest assigning these values to a new column.
The one prerequisite is that your index must be one of these three types: datetime, timedelta, or pd.tseries.offsets.Tick.
offset = pd.tseries.offsets.CustomBusinessDay(n=0)
df.assign(closest_business_day=df.index.to_series().apply(offset))
V1 V2 V3 V4 V5 closest_business_day
2008-01-12 4 15 11 7 1 2008-01-14
2008-01-13 5 2 8 7 1 2008-01-14
2008-01-14 13 13 9 6 4 2008-01-14
2008-01-15 14 15 12 9 3 2008-01-15
2008-01-16 1 10 2 12 15 2008-01-16
2008-01-17 10 5 9 9 1 2008-01-17
2008-01-18 13 11 5 7 2 2008-01-18
2008-01-19 2 6 7 9 6 2008-01-21
2008-01-20 5 4 14 3 7 2008-01-21
2008-01-21 11 11 4 7 15 2008-01-21
2008-01-22 9 4 15 10 3 2008-01-22
2008-01-23 2 13 13 10 3 2008-01-23
2008-01-24 12 15 14 12 8 2008-01-24
2008-01-25 1 4 2 6 15 2008-01-25
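
If US holidays should also be skipped (the plain CustomBusinessDay above only knows weekends), a holiday calendar can be passed in. A hedged sketch on a small frame; note that with USFederalHolidayCalendar, 2008-01-21 (Martin Luther King Jr. Day) should roll to 2008-01-22 rather than staying put as in the output above:

import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

df = pd.DataFrame({"V1": range(4)},
                  index=pd.to_datetime(["2008-01-18", "2008-01-19",
                                        "2008-01-20", "2008-01-21"]))

# n=0: dates already on a business day stay put; weekends and the
# calendar's holidays roll forward to the next business day
offset = CustomBusinessDay(n=0, calendar=USFederalHolidayCalendar())
df["closest_business_day"] = df.index + offset
print(df)  # 01-18 stays; 01-19, 01-20 and 01-21 all map to 01-22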

Finding all simple cycles in undirected graphs

I am trying to implement a task of finding all simple cycles in an undirected graph. Originally, the task was to find all cycles of fixed length (= 3), and I managed to do it using the properties of adjacency matrices. But before using that approach I also tried DFS, and it worked correctly for really small input sizes; for bigger inputs, though, it went crazy, ending in (nearly) infinite loops. I tried to fix the code, but then it just could not find all the cycles.
My code is attached below.
1. Please do not pay attention to the several global variables used. The working code using the other approach was already submitted. This one is just for me to see how to make DFS work properly.
2. Yes, I searched for this problem before posting this question, but the solutions I managed to find either used a different approach or were just about detecting whether any cycles exist at all. Besides, I want to know whether it is possible to fix my code.
Big thanks to anyone who could help.
num_res = 0
adj_list = []
cycles_list = []

def dfs(v, path):
    global num_res
    for node in adj_list[v]:
        if node not in path:
            dfs(node, path + [node])
        elif len(path) >= 3 and node == path[-3]:
            # the path closes back to the vertex three steps back: a 3-cycle
            if sorted(path[-3:]) not in cycles_list:
                cycles_list.append(sorted(path[-3:]))
                num_res += 1

if __name__ == "__main__":
    num_towns, num_pairs = [int(x) for x in input().split()]
    adj_list = [[] for x in range(num_towns)]
    adj_matrix = [[0 for x in range(num_towns)] for x in range(num_towns)]
    # edge list to adjacency list
    for i in range(num_pairs):
        cur_start, cur_end = [int(x) for x in input().split()]
        adj_list[cur_start].append(cur_end)
        adj_list[cur_end].append(cur_start)
    dfs(0, [0])
    print(num_res)
UPD: it works OK for the following inputs:
5 8
4 0
0 2
0 1
3 2
4 3
4 2
1 3
3 0
(output: 5)
6 15
5 4
2 0
3 1
5 1
4 1
5 3
1 0
4 0
4 3
5 2
2 1
3 0
3 2
5 0
4 2
(output: 20)
9 12
0 1
0 2
1 3
1 4
2 4
2 5
3 6
4 6
4 7
5 7
6 8
7 8
(output: 0)
For the following input it does NOT give any output and just keeps running through the loop.
22 141
5 0
12 9
18 16
7 6
7 0
4 1
16 1
8 1
6 1
14 0
16 0
11 9
20 14
12 3
18 3
1 0
17 0
17 15
14 5
17 13
6 5
18 12
21 1
13 4
18 11
18 13
8 0
15 9
21 18
13 6
12 8
16 13
20 18
21 3
11 6
15 14
13 5
17 5
10 8
9 5
16 14
19 9
7 5
14 10
16 4
18 7
12 1
16 3
19 18
19 17
20 2
12 11
15 3
15 11
13 2
10 7
15 13
10 9
7 3
14 3
10 1
21 19
9 2
21 4
19 0
18 1
10 6
15 0
20 7
14 11
19 6
18 10
7 4
16 10
9 4
13 3
12 2
4 3
17 7
15 8
13 7
21 14
4 2
21 0
20 16
18 8
20 12
14 2
13 1
16 15
17 11
17 16
20 10
15 7
14 1
13 0
17 12
18 5
12 4
15 1
16 9
9 1
17 14
16 2
12 5
20 8
19 2
18 4
19 4
19 11
15 12
14 12
11 8
17 10
18 14
12 7
16 8
20 11
8 7
18 9
6 4
11 5
17 6
5 3
15 10
20 19
15 6
19 10
20 13
9 3
13 9
13 10
21 7
19 13
19 12
19 14
6 3
21 15
21 6
17 3
10 5
(output should be 343)
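
For the original fixed-length case (cycles of length 3), exhaustive DFS is not actually needed: each triangle can be counted exactly once by ordering its vertices and intersecting neighbour sets. A minimal sketch of that alternative (not a fix of the DFS above); it returns 5, 20, and 0 on the small inputs, and should return 343 on the last one:

import sys

def count_triangles(n, edges):
    # count each triangle u < v < w exactly once
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    total = 0
    for u in range(n):
        for v in adj[u]:
            if v > u:
                # any common neighbour w > v closes the triangle u-v-w
                total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total

if __name__ == "__main__":
    data = sys.stdin.read().split()
    n, m = int(data[0]), int(data[1])
    pairs = list(zip(map(int, data[2::2]), map(int, data[3::2])))
    print(count_triangles(n, pairs))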

Pandas dataframe sub-selection

I am new to programming and have taken up learning Python in an attempt to make some tasks I run in my research more efficient. I am running a PCA with pandas (I found a tutorial online) and have the script for this, but I need to sub-select part of a dataframe prior to the PCA.
So far I have (just as an example; in reality I am reading a .csv file with a larger matrix):
x = np.random.randint(30, size=(8,8))
df = pd.DataFrame(x)
0 1 2 3 4 5 6 7
0 9 0 23 13 2 5 14 6
1 20 17 11 10 25 23 20 23
2 15 14 22 25 11 15 5 15
3 9 27 15 27 7 15 17 23
4 12 6 11 13 27 11 26 20
5 27 13 5 16 5 5 2 18
6 3 18 22 0 7 10 11 11
7 25 18 10 11 29 29 1 25
What I want to do is sub-select columns that satisfy a certain criterion in any of the rows; specifically, I want every column that has at least one number >= 27 (just for example), to produce a new dataframe:
0 1 3 4 5
0 9 0 13 2 5
1 20 17 10 25 23
2 15 14 25 11 15
3 9 27 27 7 15
4 12 6 13 27 11
5 27 13 16 5 5
6 3 18 0 7 10
7 25 18 11 29 29
I have looked into the various slicing methods in pandas, but none seem to do what I want (.loc, .iloc, etc.).
The actual script I am using to read in the data thus far is:
filename = 'Data.csv'
data = pd.read_csv(filename,sep = ',')
x = data.ix[:,1:] # variables - species
y = data.ix[:,0] # cases - age
so a sub-dataframe of x is what I am after (as above).
Any advice is greatly appreciated.
Indexers like loc, iloc, and ix accept boolean arrays. For example, if you have three columns, df.loc[:, [True, False, True]] will return all the rows and columns 0 and 2 (those where the corresponding value is True). You can check whether any of the elements in a column is greater than or equal to 27 with (df >= 27).any(). This returns True for the columns that have at least one value >= 27. So you can slice the dataframe with:
df.loc[:, (df>=27).any()]
Out[34]:
0 1 3 4 5 7
0 8 2 28 9 14 21
1 24 26 23 17 0 0
2 3 24 7 15 4 28
3 29 17 12 7 7 6
4 5 3 10 24 29 14
5 23 21 0 16 23 13
6 22 10 27 1 7 24
7 9 27 2 27 17 12
And this is the initial dataframe:
df
Out[35]:
0 1 2 3 4 5 6 7
0 8 2 7 28 9 14 26 21
1 24 26 15 23 17 0 21 0
2 3 24 26 7 15 4 7 28
3 29 17 9 12 7 7 0 6
4 5 3 13 10 24 29 22 14
5 23 21 26 0 16 23 17 13
6 22 10 19 27 1 7 9 24
7 9 27 26 2 27 17 8 12
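
One caveat for newer pandas: .ix was deprecated in 0.20 and removed in 1.0, so the reading snippet from the question and the slice above would now be written with iloc/loc, e.g.:

import pandas as pd

data = pd.read_csv("Data.csv", sep=",")  # same file as in the question
x = data.iloc[:, 1:]   # variables - species
y = data.iloc[:, 0]    # cases - age

# keep the columns of x that contain at least one value >= 27
x_sub = x.loc[:, (x >= 27).any()]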
