Adding a simple moving average as an additional column to a pandas DataFrame

I have sales data in sales_training.csv that looks like this -
time_period sales
1 127
2 253
3 123
4 253
5 157
6 105
7 244
8 157
9 130
10 221
11 132
12 265
I want to add a third column that contains the moving average. My code:
import pandas as pd

df = pd.read_csv("./Sales_training.csv", index_col="time_period")
periods = df.index.tolist()
period = int(input("Enter a period for the moving average :"))
sum1 = 0
for i in periods:
    if i < period:
        df['forecast'][i] = i
    else:
        for j in range(period):
            sum1 += df['sales'][i-j]
        df['forecast'][i] = sum1/period
        sum1 = 0
print(df)
df.to_csv("./forecast_mannual.csv")
This is giving KeyError: 'forecast' at the line df['forecast'][i] = i. What is the issue?

One simple solution is to create the column before the loop: just add df['forecast'] = df['sales']
import pandas as pd

df = pd.read_csv("./Sales_training.csv", index_col="time_period")
periods = df.index.tolist()
period = int(input("Enter a period for the moving average :"))
sum1 = 0
df['forecast'] = df['sales']  # add this one line
for i in periods:
    if i < period:
        df['forecast'][i] = i
    else:
        for j in range(period):
            sum1 += df['sales'][i-j]
        df['forecast'][i] = sum1/period
        sum1 = 0
print(df)
df.to_csv("./forecast_mannual.csv")
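(Note: chained assignment like df['forecast'][i] = ... can raise SettingWithCopyWarning, and under pandas' copy-on-write mode it no longer updates the frame at all; df.at[i, 'forecast'], as in the next answer, is the safer form.)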

Your code raises the KeyError because of how it references the 'forecast' column: the column has not been created yet, so the first time the loop tries to access df['forecast'] the lookup fails.
The task here is to update values in a dynamically created new column called 'forecast'. Therefore, instead of df['forecast'][i] you can write df.at[i, 'forecast']; .at creates the column on first assignment and sets the cell directly.
There is another issue in the code: when the value of i is less than period you assign i itself to the forecast, which to my understanding is not correct. It should not display anything in that case.
Here is my corrected version of the code:
import pandas as pd

df = pd.read_csv("./sales.csv", index_col="time_period")
periods = df.index.tolist()
period = int(input("Enter a period for the moving average :"))
sum1 = 0
for i in periods:
    if i < period:
        df.at[i, 'forecast'] = ''
    else:
        for j in range(period):
            sum1 += df['sales'][i-j]
        df.at[i, 'forecast'] = sum1/period
        sum1 = 0
print(df)
df.to_csv("./forecast_mannual.csv")
When I entered period=2, the output shows the forecast column filled with the two-period moving average from the second row onward.
Hope this helps.
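As an aside, pandas can compute this moving average directly with rolling, with no explicit loop. A minimal sketch using the question's file and column names:

import pandas as pd

df = pd.read_csv("./Sales_training.csv", index_col="time_period")
period = int(input("Enter a period for the moving average :"))
# mean of the current row and the previous period-1 rows;
# the first period-1 rows come out as NaN
df['forecast'] = df['sales'].rolling(period).mean()
df.to_csv("./forecast_mannual.csv")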

Related

Climatology frequencies and duration

I have a 10-year climatological dataset as follows.
dt T P
01-01-2010 3 0
02-01-2010 5 11
03-01-2010 10 50
....
31-12-2020 -1 0
I want to estimate the total number of days in each month where T and P continuously stayed greater than 0 for three days or more.
I would want these columns as an output:
month  number of days (duration of T & P > 0)  T  P
I have never used loops in Python beyond the very simplest kind, and here the data first has to be grouped by month and year before the condition is applied. I would really appreciate any hints on how to construct the loop.
A = dataset
A['dt'] = pd.to_datetime(A['dt'], format='%Y-%m-%d')
for column in A[['P', 'T']]:
    for i in range(len('P')):
        if i > 0:
            P.value_counts()
            print(i)
    for j in range(len('T')):
        if i > 0:
            T.value_counts()
            print(j)
Here is a really naive way you could set it up by simply iterating over the rows:
df['valid'] = (df['T'] > 0) & (df['P'] > 0)

def count_total_days(df):
    streak = 0
    total = 0
    for idx, row in df.iterrows():
        if row.valid:
            streak += 1
        else:
            if streak >= 3:
                total += streak
            streak = 0
    # don't lose a qualifying streak that runs to the end of the frame
    if streak >= 3:
        total += streak
    return total
Since you want it per month, you would first have to create new month and year columns to group by:

df['month'] = df['dt'].dt.month
df['year'] = df['dt'].dt.year
for (month, year), df_subset in df.groupby(['month', 'year']):
    print(year, month, count_total_days(df_subset))
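A loop-free variant of the same idea, reusing the valid, month and year columns created above (a sketch): label consecutive runs with a cumulative sum over streak breaks, then keep only the runs of three days or more.

# label each consecutive run of identical 'valid' values
run_id = (df['valid'] != df['valid'].shift()).cumsum()
# length of the run each row belongs to
run_len = df.groupby(run_id)['valid'].transform('size')
# days per (year, month) where the condition held for >= 3 consecutive days
result = df[df['valid'] & (run_len >= 3)].groupby(['year', 'month'])['valid'].sum()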
You can use resample and sum to count, for each month, the days where the condition is true.
import pandas as pd

dt = ["01-01-2010", "01-02-2010", "01-03-2010", "01-04-2010", "03-01-2010", "12-31-2020"]
t = [3, 66, 100, 5, 10, -1]
P = [0, 77, 200, 11, 50, 0]
A = pd.DataFrame(list(zip(dt, t, P)), columns=['dtx', 'T', 'P'])
A['dtx'] = pd.to_datetime(A['dtx'], format='%m-%d-%Y')

# label runs of consecutive calendar days
A['Mask'] = A.dtx.diff().dt.days.ne(1).cumsum()
# keep only runs lasting 3 days or more
dict_freq = A['Mask'].value_counts().to_dict()
newdict = dict((k, v) for k, v in dict_freq.items() if v >= 3)
A = A[A['Mask'].isin(list(newdict.keys()))]
# within those runs, flag days where both T and P are positive
A['Mask'] = (A['T'] >= 1) & (A['P'] >= 1)
df_summary = A.query('Mask').resample(rule='M', on='dtx')['Mask'].sum()
Which produces:
2010-01-31 3
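(One caveat worth flagging: in pandas 2.2 and later the resample rule 'M' is deprecated in favour of 'ME' for month-end.)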

Code Not Working Properly - Trying To Create A Simple Graph

I'm trying to create a simple program where my inputs of daily COVID-19 case data are tabulated and rendered as a small graph. For example, my first (primary) input will be 7 20200401 20200403, which gives the number of inputs that follow the primary input and the date range the cases are from. Then I input, one per line, the hospital, the day of the report, and the number of cases from that hospital. The number of cases per day is represented by asterisks (*). When I run my program, it just shows the last number of cases entered for every day. Is there any way to fix it so the program displays the correct number of cases per day?
Just to help you understand, here is what a sample input and output should be for this program:
Input:
7 20200401 20200403
HP1 20200401 1
HP2 20200401 1
HP3 20200401 1
HP4 20200402 1
HP5 20200402 1
HP6 20200403 1
HP7 20200403 1
Output:
20200401:***
20200402:**
20200403:**
But instead, I get this:
20200401:*
20200402:*
20200403:*
Here is my code:
CoronaCaseNumber = input("")
CoronaList = CoronaCaseNumber.split(" ")
LuckyNumber = CoronaList[0]
Date = CoronaList[1]
Date2 = CoronaList[2]
LuckyNumero = int(LuckyNumber)
DateList = []
CaseNumberList = []
for case in range(LuckyNumero):
    CoronaCaseData = input()
    CoronaList2 = CoronaCaseData.split(" ")
    InfoDate = CoronaList2[1]
    DateList.append(InfoDate)
    CaseNumber = CoronaList2[2]
    CaseNumberList.append(CaseNumber)
EmptySet = []
for i in DateList:
    if i >= Date and i <= Date2:
        if i not in EmptySet:
            EmptySet.append(i)
for i in range(0, len(CaseNumberList)):
    CaseNumberList[i] = int(CaseNumberList[i])
EmptySet.sort()
for i in range(len(EmptySet)):
    print("{}{}{}".format(EmptySet[i], ":", "*" * CaseNumberList[i]))
I'm way too lazy to type in all that data every time I run your script, so I automated that part to make development and testing easier. As for the fix, I think the easiest thing would be to use the collections module's defaultdict class to keep track of what dates have been seen and the total number of cases seen on each of them. Here's what I mean:
from collections import defaultdict

#CoronaCaseNumber = input("")
#CoronaList = CoronaCaseNumber.split(" ")
#LuckyNumber = CoronaList[0]
#Date = CoronaList[1]
#Date2 = CoronaList[2]
LuckyNumber, Date, Date2 = "8 20200401 20200404".split(" ")
data = """\
HP4 20200402 1
HP5 20200402 1
HP1 20200401 1
HP2 20200401 1
HP3 20200401 1
HP6 20200403 0
HP6 20200404 1
HP7 20200404 1
""".splitlines()
LuckyNumero = int(LuckyNumber)
DateList = []
CaseNumberList = []
for case in range(LuckyNumero):
    CoronaCaseData = data[case]
    CoronaList2 = CoronaCaseData.split(" ")
    InfoDate = CoronaList2[1]
    DateList.append(InfoDate)
    CaseNumber = CoronaList2[2]
    CaseNumberList.append(CaseNumber)
DailyCases = defaultdict(int)
for i, d in enumerate(DateList):
    if Date <= d <= Date2:  # Valid date?
        DailyCases[d] += int(CaseNumberList[i])
# Print daily cases sorted by date (i.e. the dictionary's keys).
for date in sorted(DailyCases, key=lambda d: int(d)):
    print("{}:{}".format(date, '*' * DailyCases[date]))
Output:
20200401:***
20200402:**
20200403:
20200404:**
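As an aside, because the dates are fixed-width YYYYMMDD strings, plain string sorting already orders them chronologically, so the key=lambda d: int(d) is optional.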

I want to add elements in a particular column of Excel one by one using Python

I am able to write the function but I get a memory error.
I want to write the value of i into a named column (see the marked assignments in the code below).
Suppose there is a datetime column with the format '%Y-%m-%d %H:%M:%S' and another column whose price changes every second. I need to make two new columns: the first carries the average price over the last 1 minute, the second the average price over the last ten minutes. Both columns are filled according to time: if a minute starts at 10:10:01 and ends at 10:11:00, then the same one-minute average is written into every row of that minute, and likewise for the ten-minute column. Here "x" is a list of 12,000 elements.
m = list(set(x))
f1 = [0, 0]
f10 = [0, 0]
index = []
index10 = []
for i in m:
    index.append(x.index(i))
for i in index[0:len(index):9]:
    index10.append(i)

# for a difference equal to 1 in x
total = 0
for i in index[2:len(index)-1]:
    j = i + 1
    prices = df['Price'][i:j]
    for p in prices:
        total = (total + p) / j
    f1.append(total)
    total = 0
for i in f1[0:len(m)]:
    j = i + 1
    for l in range(0, j):
        df['Average last 1 min'][l] = i   # <-- marked assignment

# for a difference equal to 10 in x
total1 = 0
for i in index10[2:len(index10)-1]:
    j = i + 1
    prices = df['Price'][i:j]
    for p in prices:
        total1 = (total1 + p) / j
    f1.append(total1)
    total1 = 0
for i in f10[0:len(m)]:
    j = i + 1
    for l in range(0, j):
        df['Average last 10 min'][l] = i   # <-- marked assignment
df.to_excel('A:\\Test\\time.xlsx')
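For what it's worth, the index bookkeeping can be avoided entirely: pandas can broadcast a per-interval average back onto every row of that interval with groupby plus transform. A minimal sketch, assuming the datetime column is named 'Time' (the question does not show its actual name) and df is already loaded:

import pandas as pd

# 'Time' is a hypothetical column name; parse it to real datetimes first
df['Time'] = pd.to_datetime(df['Time'], format='%Y-%m-%d %H:%M:%S')
# every row in a given 1-minute bucket gets that bucket's mean price
df['Average last 1 min'] = df.groupby(pd.Grouper(key='Time', freq='1min'))['Price'].transform('mean')
# likewise for 10-minute buckets
df['Average last 10 min'] = df.groupby(pd.Grouper(key='Time', freq='10min'))['Price'].transform('mean')
df.to_excel('A:\\Test\\time.xlsx')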

Removing value from a DataFrame column which repeats over 15 times

I'm working on forex data like this:
0 1 2 3
1 AUD/JPY 20040101 00:01:00.000 80.598 80.598
2 AUD/JPY 20040101 00:02:00.000 80.595 80.595
3 AUD/JPY 20040101 00:03:00.000 80.562 80.562
4 AUD/JPY 20040101 00:04:00.000 80.585 80.585
5 AUD/JPY 20040101 00:05:00.000 80.585 80.585
I want to go through columns 2 and 3 and remove the rows in which the value is repeated more than 15 times in a row. So far I have managed to produce this piece of code:
price = 0
drop_start = 0
counter = 0
df_new = df
for i, r in df.iterrows():
    if r.iloc[2] != price:
        if counter >= 15:
            df_new = df_new.drop(df_new.index[drop_start:i])
        price = r.iloc[2]
        counter = 1
        drop_start = i
    if r.iloc[2] == price:
        counter = counter + 1

price = 0
drop_start = 0
counter = 0
df = df_new
for i, r in df.iterrows():
    if r.iloc[3] != price:
        if counter >= 15:
            df_new = df_new.drop(df_new.index[drop_start:i])
        price = r.iloc[3]
        counter = 1
        drop_start = i
    if r.iloc[3] == price:
        counter = counter + 1

print(df_new.info())
df_new.to_csv('df_new.csv', index=False, header=None)
Unfortunately, when I check the output file there are some mistakes: some weekends have not been removed by the program. How should I build my algorithm so that it removes the duplicated values correctly?
First 250k rows of my initial dataset is available here: https://ufile.io/omg5h
The output of this program for that sample data is available here:
https://ufile.io/2gc3d
You can see that in the output file the rows from 6931 on were not successfully removed.
The problem with your algorithm is that you are not keeping a separate counter per value, but incrementing a single counter through the loop, which I believe makes the result wrong. Also, the comparison r.iloc[2] != price does not quite do what you want, because you change the value of price on every iteration; if other elements appear between the duplicates, the check no longer serves its purpose. I wrote a small piece of code to copy the behavior you asked for.
import pandas as pd

df = pd.DataFrame([[0, 0.5, 2.5], [0, 1, 2], [0, 1.5, 2.5], [0, 2, 3],
                   [0, 2, 3], [0, 3, 4], [0, 4, 5]],
                  columns=['A', 'B', 'C'])
df_new = df
counts = {}
print('Initial DF')
print(df)
print()
for i, r in df.iterrows():
    counter = counts.get(r.iloc[1])
    if counter is None:
        counter = 0
    counts[r.iloc[1]] = counter + 1
    if counts[r.iloc[1]] >= 2:
        df_new = df_new[df_new.B != r.iloc[1]]
print('2nd col. deleted DF')
print(df_new)
print()
df_fin = df_new
counts2 = {}
for i, r in df_new.iterrows():
    counter = counts2.get(r.iloc[2])
    if counter is None:
        counter = 0
    counts2[r.iloc[2]] = counter + 1
    if counts2[r.iloc[2]] >= 2:
        df_fin = df_fin[df_fin.C != r.iloc[2]]
print('3rd col. deleted DF')
print(df_fin)
Here, I hold a counter for each unique value in the rows of columns 2 and 3. Then, according to the threshold (2 in this example), I remove the rows that exceed it. I first eliminate values according to the 2nd column, then forward the modified frame to the next loop and eliminate values according to the 3rd column to finish the process.
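For comparison, the consecutive-runs reading of the question can also be written without an explicit loop. A sketch, assuming the columns are labeled 2 and 3 as in the question's frame:

def drop_long_runs(df, col, max_len=15):
    # label each run of consecutive identical values in `col`
    run_id = (df[col] != df[col].shift()).cumsum()
    # length of the run each row belongs to
    run_len = df.groupby(run_id)[col].transform('size')
    # keep rows whose run is not longer than max_len
    return df[run_len <= max_len]

df_new = drop_long_runs(df, 2)       # filter on column 2
df_new = drop_long_runs(df_new, 3)   # then on column 3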

Can dataframe row loop be avoided when indexing previous or next row?

I have a data set in which I want to assign a new unique ID every time the value reaches zero.
The code I've come up with seems slow and I suspect there must be a faster way of doing it.
import time
import pandas as pd
import numpy as np

#--------------------------------
# DEBUG TEST DATASET
#--------------------------------
# Create random test data
series_random = np.random.randint(low=1, high=10, size=(10000, 1))
# Insert zeros at known points (this should result in six motion IDs)
series_random[[5, 6, 7, 15, 100, 2000, 5000]] = 0
# Create data frame from test series
df = pd.DataFrame(series_random, columns=['Speed'])
#--------------------------------

# Elapsed time counter
Elapsed_ms = time.time()
# Set Motion ID variable
Motion_ID = 0
# Create series with Motion IDs
df.loc[:, 'Motion ID'] = 0
# Iterate through each row of df
for i in range(df.index.min() + 1, df.index.max() + 1):
    # Set Motion ID to latest value
    df.loc[i, 'Motion ID'] = Motion_ID
    # If previous speed was zero and current speed is > 0, a new motion was detected
    if df.loc[i-1, 'Speed'] == 0 and df.loc[i, 'Speed'] > 0:
        Motion_ID += 1
        df.loc[i, 'Motion ID'] = Motion_ID
        # Include first zero value in new Motion ID (for plotting purposes)
        df.loc[i-1, 'Motion ID'] = Motion_ID

Elapsed_ms = int((time.time() - Elapsed_ms) * 1000)
print('Result: {} records checked, {} unique trips identified in {} ms'.format(len(df.index), df['Motion ID'].nunique(), Elapsed_ms))
The output from the above code was:
Result: 10000 records checked, 6 unique trips identified in 6879 ms
My actual data set will be much larger, so even in this small example I'm surprised it took so long for what seems like a simple operation.
You can express the logic using boolean arrays and expressions in numpy without any loops:
import numpy as np
import pandas as pd

def get_motion_id(speed):
    mask = np.zeros(speed.size, dtype=bool)
    # mask[i] == True if Speed[i - 1] == 0 and Speed[i] > 0
    mask[1:] = speed[:-1] == 0
    mask &= speed > 0
    # Taking the cumsum increases the motion_id by one where mask is True
    motion_id = mask.astype(int).cumsum()
    # Carry over the beginning of a motion to the preceding step with Speed == 0
    motion_id[:-1] = motion_id[1:]
    return motion_id

# small demo example
df = pd.DataFrame({'Speed': [3, 0, 1, 2, 0, 1]})
df['Motion_ID'] = get_motion_id(df['Speed'])
print(df)
   Speed  Motion_ID
0      3          0
1      0          1
2      1          1
3      2          1
4      0          2
5      1          2
For your 10,000-row example I see a speed-up of around 800x:
%time df['Motion_ID'] = get_motion_id(df['Speed'])
CPU times: user 5.26 ms, sys: 3.18 ms, total: 8.43 ms
Wall time: 8.01 ms
Another way of doing it is to extract the indices of the zero values from df and then iterate over those, checking and assigning the Motion ID as you go. See the code below:
Motion_ID = 0
# Create series with Motion IDs
df.loc[:, 'Motion ID'] = 0
i = 0
for index_val in sorted(df[df['Speed'] == 0].index):
    df.loc[i:index_val, 'Motion ID'] = Motion_ID
    i = index_val
    if df.loc[index_val + 1, 'Speed'] > 0:
        Motion_ID += 1
df.loc[i:df.index.max(), 'Motion ID'] = Motion_ID + 1
Output:
Result: 10000 records checked, 6 unique trips identified in 49 ms
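This version still loops, but only over the rows where Speed == 0 (just seven in the test data), which is why it is so much faster than the original row-by-row scan.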
