How to unpivot multiple columns and name months dynamically in Python?

I need some help converting multiple columns into individual observations. Last time, with your help, I converted the Demand columns. Now I have to add more columns, Jobs and PO (12 columns of each), convert all three into individual observations, and then calculate a FutureFree column (FutureFree = max(Job, PO) - Demand).
from sqlalchemy import create_engine
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
import calendar
from pandas.tseries.offsets import MonthEnd
engine = create_engine('mssql+pyodbc://server/driver=SQL+Server')
con=engine.connect()
rs=con.execute("""Select StockCode, Demand00, Demand01, Demand02,
Demand03, Demand04, Demand05, Demand06, Demand07, Demand08, Demand09,
Demand10, Demand11 from ForecastData""")
df= pd.DataFrame(rs.fetchall())
df.columns = ["StockCode", "Demand01","Demand02", "Demand03", "Demand04",
"Demand05", "Demand06","Demand07", "Demand08", "Demand09", "Demand10",
"Demand11", "Demand12"]
demand_columns = [c for c in df.columns if c.startswith('Demand')]
today = pd.Timestamp.now()
month_list = [today + pd.DateOffset(months=i) for i in range(len(demand_columns))]
dic_month = dict(zip(demand_columns, month_list))
df2 = (df.rename(columns=dic_month)
         .set_index('StockCode')
         .stack()
         .reset_index())
df2.columns = ['StockCode', 'Month', 'Value']
df2['Month'] = pd.to_datetime(df2['Month']).dt.date
Previous Output

StockCode  Month       Value
ABC        2019-01-01  100
ABC        2019-02-01  80
BXY        2019-01-01  50

Desired Output

StockCode  Month     Demand  Job  PO  FutureFree
ABC        January   100     120  0   20
ABC        February  120     80   0   0
BXY        January   50      0    60  10
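One possible approach, sketched below, is pd.wide_to_long, which unpivots several column families at once. This assumes the Jobs and PO columns follow the same numbering as the Demand columns (Jobs00..Jobs11, PO00..PO11); the mini-frame is hypothetical and only shows two months per family to keep the example short.

import pandas as pd

# Hypothetical stand-in for the ForecastData query result
df = pd.DataFrame({
    'StockCode': ['ABC', 'BXY'],
    'Demand00': [100, 50], 'Demand01': [120, 30],
    'Jobs00':   [120, 0],  'Jobs01':   [80, 0],
    'PO00':     [0, 60],   'PO01':     [0, 20],
})

# wide_to_long splits each column into a stub ('Demand', 'Jobs', 'PO')
# plus a numeric suffix, which becomes the MonthOffset column
long_df = pd.wide_to_long(df, stubnames=['Demand', 'Jobs', 'PO'],
                          i='StockCode', j='MonthOffset').reset_index()

# turn the 0-based offset into a month name, counting from the current month
today = pd.Timestamp.now().normalize()
long_df['Month'] = [(today + pd.DateOffset(months=int(m))).strftime('%B')
                    for m in long_df['MonthOffset']]

# FutureFree = max(Job, PO) - Demand, floored at 0 as in the desired output
long_df['FutureFree'] = (long_df[['Jobs', 'PO']].max(axis=1)
                         - long_df['Demand']).clip(lower=0)
print(long_df[['StockCode', 'Month', 'Demand', 'Jobs', 'PO', 'FutureFree']])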

Related

How can I group my CSV's list of dates into their Months?

I have a CSV file which contains two columns: the first is a date column in the format 01/01/2020, and the second is a number representing that month's sales volume. The dates range from 2004 to 2019, and my task is to create a chart with 12 bars, each representing the average sales volume for that month across every year's data. I attempted to use a groupby but got an error about not having numeric types to aggregate. I am very new to Python, so apologies for the beginner question. I have posted my code so far below. Thanks in advance for any help with this :)
# -*- coding: utf-8 -*-
import csv
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
file = "GlasgowSalesVolume.csv"
data = pd.read_csv(file)
typemean = (data.groupby(['Date', 'SalesVolume'], as_index=False).mean()
                .groupby('Date')['SalesVolume'].mean())
Output:
DataError: No numeric types to aggregate
I prepared a DataFrame limited to just 2 years and 3 months:
Date Sales
0 01/01/2019 3
1 01/02/2019 4
2 01/03/2019 8
3 01/01/2020 10
4 01/02/2020 20
5 01/03/2020 30
For now the Date column is of string type, so the first step is to convert it to datetime64:
df.Date = pd.to_datetime(df.Date, dayfirst=True)
Now to compute your result, run:
result = df.groupby(df.Date.dt.month).Sales.mean()
The result is a Series containing:
Date
1 6.5
2 12.0
3 19.0
Name: Sales, dtype: float64
The index is the month number (1 through 12) and the value is the mean for that month across all years.
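To get the 12-bar chart the question asks for, this Series can be plotted directly; a minimal sketch, assuming the result computed above:

import calendar
import matplotlib.pyplot as plt

# label the bars with month names instead of 1..12
result.index = [calendar.month_abbr[m] for m in result.index]
ax = result.plot.bar()
ax.set_ylabel('Average sales volume')
plt.show()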

Elegant pandas pre-fill using date_range with various possible freq settings

I am trying to prefill a dataframe with a bucket-index column, dti. In the sample below I randomly remove some rows to highlight the challenge. I am trying to calculate the dti value elegantly. The dti value in the first row would be 0 (even if the first row is deleted by the script), but as gaps appear, the dti sequence needs to skip the missing rows. A logical approach would be to divide dt by delta to create a unique integer representing the bucket, but nothing I tried felt or seemed elegant.
A bit of code to help simulate the problem:
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
start = datetime.now()
nin = 24
delta='4H'
df = pd.date_range(start, periods=nin, freq=delta, name='dt')
# remove some random data points
frac_points = 8/24 # Fraction of points to retain
r = np.random.rand(nin)
df = df[r <= frac_points] # reduce the number of points
df = df.to_frame(index=False) # reindex
df['dti'] = ...
Thank you in advance,
One solution is to divide the time differences between each row by the timedelta:
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
start = datetime.now()
nin = 24
delta='4H'
df = pd.date_range(start, periods=nin, freq=delta, name='dt')
# Round to nearest ten minutes for better readability
df = df.round('10min')
# Ensure reproducibility
np.random.seed(1)
# remove some random data points
frac_points = 8/24 # Fraction of points to retain
r = np.random.rand(nin)
df = df[r <= frac_points] # reduce the number of points
df = df.to_frame(index=False) # reindex
df['dti'] = df['dt'].diff() / pd.to_timedelta(delta)
df['dti'] = df['dti'].fillna(0).cumsum().astype(int)
df
dt dti
0 2019-03-17 18:10:00 0
1 2019-03-17 22:10:00 1
2 2019-03-18 02:10:00 2
3 2019-03-18 06:10:00 3
4 2019-03-18 10:10:00 4
5 2019-03-19 10:10:00 10
6 2019-03-19 18:10:00 12
7 2019-03-20 10:10:00 16
8 2019-03-20 14:10:00 17
9 2019-03-21 02:10:00 20
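As a side note, the same bucket index can be computed in one step, which matches the divide-dt-by-delta idea from the question (a sketch, assuming the df built above):

# elapsed time since the first retained row, divided by the bucket width
df['dti'] = ((df['dt'] - df['dt'].iloc[0]) / pd.to_timedelta(delta)).astype(int)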

Pandas timeseries replace weekend with one value generated from the weekend mean

I have a multicolumn pandas dataframe, with rows for each day.
Now I would like to replace each weekend with its mean values in one row,
i.e. (Fr,Sa,Su).resample().mean() --> (Weekend).
I'm not sure where to even start.
Thank you in advance.
import pandas as pd
from datetime import timedelta
# make some data
df = pd.DataFrame({'dt': pd.date_range("2018-11-27", "2018-12-12"), "val": range(0,16)})
# adjust the weekend dates to fall on the friday
df['shifted'] = [d - timedelta(days = max(d.weekday() - 4, 0)) for d in df['dt']]
# calc the mean
df2 = df.groupby(df['shifted']).val.mean()
df2
#Out[105]:
#shifted
#2018-11-27 0
#2018-11-28 1
#2018-11-29 2
#2018-11-30 4
#2018-12-03 6
#2018-12-04 7
#2018-12-05 8
#2018-12-06 9
#2018-12-07 11
#2018-12-10 13
#2018-12-11 14
#2018-12-12 15
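If you want the result as a DataFrame with the collapsed weekend rows flagged, a small follow-up sketch (assuming the df from above):

# keep the grouped means as a frame; Fridays now carry the Fri-Sun mean
df3 = df.groupby('shifted').val.mean().reset_index()
df3['is_weekend'] = df3['shifted'].dt.weekday == 4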

Slice, combine, and map fiscal year dates to calendar year dates to new column

I have the following pandas data frame:
Shortcut_Dimension_4_Code Stage_Code
10225003 2
8225003 1
8225004 3
8225005 4
It is part of a much larger dataset that I need to be able to filter by month and year. I need to pull the fiscal year from the first two digits for values larger than 9999999 in the Shortcut_Dimension_4_Code column, and from the first digit for values less than or equal to 9999999. That value needs to be prefixed to make a year, i.e. "200" + "8" = 2008 | "20" + "10" = 2010.
That year "2008, 2010" needs to be combined with the stage code value (1-12) to produce a month/year, i.e. 02/2010.
The date 02/2010 then needs to be converted from a fiscal year date to a calendar year date, i.e. fiscal year date 02/2010 = calendar year date 08/2009. The resulting date needs to be presented in a new column. The resulting df would end up looking like this:
Shortcut_Dimension_4_Code Stage_Code Date
10225003 2 08/2009
8225003 1 07/2007
8225004 3 09/2007
8225005 4 10/2007
I am new to pandas and python and could use some help. I am beginning with this:
Shortcut_Dimension_4_Code Stage_Code CY_Month Fiscal_Year
0 10225003 2 8.0 10
1 8225003 1 7.0 82
2 8225003 1 7.0 82
3 8225003 1 7.0 82
4 8225003 1 7.0 82
I used .map and .str methods to produce this df, but have not been able to figure out how to get the fiscal years right for FY 2008-2009.
In the code below, I'll assume Shortcut_Dimension_4_Code is an integer. If it's a string, you can convert it or slice it like this: df['Shortcut_Dimension_4_Code'].str[:-6]. More explanations are in the comments alongside the code.
That should work as long as you don't have to deal with empty values.
import pandas as pd
import numpy as np
from datetime import date
from dateutil.relativedelta import relativedelta
fiscal_month_offset = 6
input_df = pd.DataFrame(
[[10225003, 2],
[8225003, 1],
[8225004, 3],
[8225005, 4]],
columns=['Shortcut_Dimension_4_Code', 'Stage_Code'])
# make a copy of input dataframe to avoid modifying it
df = input_df.copy()
# numpy will help us with numeric operations on large collections
df['fiscal_year'] = 2000 + np.floor_divide(df['Shortcut_Dimension_4_Code'], 1000000)
# loop with `apply` to create `date` objects from available columns
# day is a required field in date, so we'll just use 1
df['fiscal_date'] = df.apply(lambda row: date(row['fiscal_year'], row['Stage_Code'], 1), axis=1)
df['calendar_date'] = df['fiscal_date'] - relativedelta(months=fiscal_month_offset)
# by default python dates will be saved as Object type in pandas. You can verify with `df.info()`
# to use the clever things pandas can do with dates we need to convert it
df['calendar_date'] = pd.to_datetime(df['calendar_date'])
# I would just keep date as datetime type so I could access year and month
# but to create same representation as in question, let's format it as string
df['Date'] = df['calendar_date'].dt.strftime('%m/%Y')
# copy important columns into output dataframe
output_df = df[['Shortcut_Dimension_4_Code', 'Stage_Code', 'Date']].copy()
print(output_df)
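Running this should print something like the following, matching the desired output from the question:

   Shortcut_Dimension_4_Code  Stage_Code     Date
0                   10225003           2  08/2009
1                    8225003           1  07/2007
2                    8225004           3  09/2007
3                    8225005           4  10/2007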

Select a time range in DataFrame without date

I'm using/learning Pandas to load a csv style dataset where I have a time column that can be used as index. The data is sampled roughly at 100Hz. Here is a simplified snippet of the data:
Time (sec) Col_A Col_B Col_C
0.0100 14.175 -29.97 -22.68
0.0200 13.905 -29.835 -22.68
0.0300 12.257 -29.32 -22.67
... ...
1259.98 -0.405 2.205 3.825
1259.99 -0.495 2.115 3.735
There are 20 min of data, resulting in about 120,000 rows at 100 Hz. My goal is to select those rows within a certain time range, say 100-200 sec.
Here is what I've figured out
import pandas as pd
df = pd.DataFrame(my_data) # my_data is a numpy array
df.set_index(0, inplace=True)
df.columns = ['Col_A', 'Col_B', 'Col_C']
df.index = pd.to_datetime(df.index, unit='s', origin='1900-1-1') # the date in origin is just a placeholder
My dataset doesn't include the date. How to avoid setting a fake date like I did above? It feels wrong, and also is quite annoying when I plot the data against time.
I know there are ways to remove the date from the datetime object, like here.
But my goal is to select some rows that are in a certain time range, which means I need to use pd.date_range(). This function does not seem to work without date.
It's not the end of the world if I just use a fake date throughout my project. But I'd like to know if there are more elegant ways around it.
I don't see why you need datetime64 objects for this. Your time column is a number, so you can very easily select time intervals with inequalities. You can also plot the columns without issue.
Sample Data
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'Time': np.arange(0,1200,0.01),
'Col_A': np.random.randint(1,100,120000),
'Col_B': np.random.randint(1,10,120000)})
Select Data between 100 and 200 seconds.
df[df.Time.between(100,200)]
Outputs:
Time Col_A Col_B
10000 100.00 75 9
10001 100.01 23 7
...
19999 199.99 39 7
20000 200.00 25 2
Plotting against time
#First 100 rows just for illustration
df[0:100].plot(x='Time')
Convert to timedelta64
If you really wanted to, you could convert the column to a timedelta64[ns]
df['Time'] = pd.to_datetime(df.Time, unit='s') - pd.to_datetime('1970-01-01')
print(df.head())
# Time Col_A Col_B
#0 00:00:00 67 6
#1 00:00:00.010000 93 1
#2 00:00:00.020000 99 3
#3 00:00:00.030000 18 2
#4 00:00:00.040000 84 3
df.dtypes
#Time timedelta64[ns]
#Col_A int32
#Col_B int32
#dtype: object
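If you do convert, note that pd.to_timedelta(df.Time, unit='s') is an equivalent one-step conversion, and the range selection still works with Timedelta bounds; a short sketch, assuming the df above:

# select rows between 100 and 200 seconds using Timedelta bounds
subset = df[df.Time.between(pd.Timedelta(seconds=100),
                            pd.Timedelta(seconds=200))]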
