Data Cleaning in Python/Pandas to iterate through month combinations

I am doing some data cleaning in preparation for some machine learning on a data set.
Basically, I would like to predict the next 12 months of values based on the last 12 months.
I have a data set with one value per month (example below).
I would like to train my model on every possible combination of 12 consecutive months.
For example, I want to train it on 2014-01 to 2014-12 to populate 2015-01 to 2015-12, but also on 2014-02 to 2015-01 to populate 2015-02 to 2016-01, and so on.
But I am struggling to generate all these combinations.
Below is where my code currently stands, along with an example of what I would like to get (with just 6 months instead of 12).
import pandas as pd
import numpy as np
data = [[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]]
Months=['201401','201402','201403','201404','201405','201406','201407','201408','201409','201410','201411','201412','201501','201502','201503','201504','201505','201506','201507','201508','201509','201510','201511','201512']
df = pd.DataFrame(data,columns=Months)
Here is the part that I can't get to work:
X = np.array([])
Y = np.array([])
for month in Months:
    loc = df.columns.get_loc(month)
    print(month, loc)
    if loc + 11 <= df.shape[1]:
        X = np.append(X, df.iloc[:, loc:loc+5].values, axis=0)
        Y = np.append(Y, df.iloc[:, loc+6:loc+1].values, axis=0)
This is what I am expecting (for the first 3 iterations):
### RESULTS EXPECTED ####
X = [[1,2,3,4,5,6],[2,3,4,5,6,7],[3,4,5,6,7,8]]
Y = [[7,8,9,10,11,12],[8,9,10,11,12,13],[9,10,11,12,13,14]]

To generate date ranges like the ones you describe in your explanation (rather than the ones shown in your sample output), you could use Pandas functionality like so:
import pandas as pd

months = pd.Series([
    '201401','201402','201403','201404','201405','201406',
    '201407','201408','201409','201410','201411','201412',
    '201501','201502','201503','201504','201505','201506',
    '201507','201508','201509','201510','201511','201512'
])

# This function converts strings like "201401" to datetime objects,
# then uses DateOffset and date_range to generate a sequence of months.
def date_range(month):
    date = pd.to_datetime(month, format="%Y%m")
    return pd.date_range(date, date + pd.DateOffset(months=11), freq='MS')

# Apply the function to the original Series, then apply pd.Series
# to expand the resulting arrays into DataFrame columns.
month_ranges = months.apply(date_range).apply(pd.Series)
# sample of output:
# 0 1 2 3 4 5 \
# 0 2014-01-01 2014-02-01 2014-03-01 2014-04-01 2014-05-01 2014-06-01
# 1 2014-02-01 2014-03-01 2014-04-01 2014-05-01 2014-06-01 2014-07-01
# 2 2014-03-01 2014-04-01 2014-05-01 2014-06-01 2014-07-01 2014-08-01
# 3 2014-04-01 2014-05-01 2014-06-01 2014-07-01 2014-08-01 2014-09-01
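The snippet above only builds the date ranges. To also produce the X and Y training arrays from the question (using 6-month windows, to match the expected output), here is a minimal sketch that reuses the df and np from the question's own snippet:
# Slide a 6-month window over the single row of df; X holds each
# window and Y holds the 6 months that follow it.
window = 6
values = df.iloc[0].to_numpy()
X, Y = [], []
for start in range(len(values) - 2 * window + 1):
    X.append(values[start:start + window])
    Y.append(values[start + window:start + 2 * window])
X, Y = np.array(X), np.array(Y)
# X[0] -> [1 2 3 4 5 6], Y[0] -> [7 8 9 10 11 12]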

Related

How to summarize missing values in time series data in a Pandas Dataframe?

I have a time series dataset with three columns of channel values paired against the same set of timestamps.
Each channel has runs of NaN values.
My objective is to create a summary of these NaN runs: for each run, the channel, the starting timestamp, the ending timestamp, and the duration.
My approach (inefficient): a for loop across each channel column, with a nested for loop across each row of the channel. When it stumbles across a set of NaN values, it registers the start timestamp, end timestamp and duration as individual rows (or lists), which I can eventually stack together into the final output.
But this logic seems pretty inefficient and slow, especially considering that my original dataset has 200 channel columns and 10k rows. I'm sure there is a better approach than this in Python.
Can anyone please help me out with an appropriate way to deal with this - using Pandas in Python?
Use DataFrame.melt to reshape the DataFrame, then filter the consecutive groups of missing values (together with the first valid value after each run), and create a new DataFrame by aggregating with min and max:
df['date_time'] = pd.to_datetime(df['date_time'])
df1 = df.melt('date_time', var_name='Channel No.')

# True where the previous value was not missing
m = df1['value'].shift(fill_value=False).notna()
# keep missing rows plus the first valid row after each run of NaNs
mask = df1['value'].isna() | ~m

df1 = (df1.groupby([m.cumsum()[mask], 'Channel No.'])
          .agg(Starting_Timestamp=('date_time', 'min'),
               Ending_Timestamp=('date_time', 'max'))
          .assign(Duration=lambda x: x['Ending_Timestamp'].sub(x['Starting_Timestamp']))
          .droplevel(0)
          .reset_index())
print(df1)
Channel No. Starting_Timestamp Ending_Timestamp Duration
0 Channel_1 2019-09-19 10:59:00 2019-09-19 14:44:00 0 days 03:45:00
1 Channel_1 2019-09-19 22:14:00 2019-09-19 23:29:00 0 days 01:15:00
2 Channel_2 2019-09-19 13:59:00 2019-09-19 19:44:00 0 days 05:45:00
3 Channel_3 2019-09-19 10:59:00 2019-09-19 12:44:00 0 days 01:45:00
4 Channel_3 2019-09-19 15:14:00 2019-09-19 16:44:00 0 days 01:30:00
Use:
inds = df[df['g'].isna()].index.to_list()
gs = []
s = 0
for i, x in enumerate(inds):
    if i < len(inds) - 1:
        if x + 1 != inds[i + 1]:
            gs.append(inds[s:i + 1])
            s = i + 1
    else:
        gs.append(inds[s:i + 1])
ses = []
for g in gs:
    ses.append([df.iloc[g[0]]['date'], df.iloc[g[-1] + 1]['date']])
res = pd.DataFrame(ses, columns=['st', 'et'])
res['d'] = res['et'] - res['st']
And a more efficient solution:
import pandas as pd
import numpy as np

df = pd.DataFrame({'date': pd.date_range('2021-01-01', '2021-12-01', 12), 'g': range(12)})
df.loc[0:3, 'g'] = np.nan
df.loc[5:7, 'g'] = np.nan

# split one row after each NaN run ends, so each chunk holds a run
# plus the first valid row after it
inds = df[df['g'].isna().astype(int).diff() == -1].index + 1
pd.DataFrame([(x.iloc[0]['date'], x.iloc[-1]['date'])
              for x in np.array_split(df, inds)
              if np.isnan(x['g'].iloc[0])])
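For a single column, the same run detection can also be written with the common "cumsum over change points" idiom. A sketch on the df just built; note that, unlike the solutions above, this marks the end of a run at its last NaN row rather than at the first valid row after it:
isna = df['g'].isna()
runs = (isna != isna.shift()).cumsum()  # label each consecutive run
out = (df[isna].groupby(runs[isna])['date']
         .agg(st='min', et='max')
         .reset_index(drop=True))
out['d'] = out['et'] - out['st']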

Adding repeating date column to pandas DataFrame

I am new to pandas and I am struggling to add dates to my pandas DataFrame df, which comes from a .csv file. I have a DataFrame with several unique ids, and each id has 120 months, so I need to add a date column. Each id should have exactly the same dates for its 120 periods. My difficulty is that after the first id ends, another id begins, and the dates need to start over again. My data in the csv file looks like this:
month id
1 1593
2 1593
...
120 1593
1 8964
2 8964
...
120 8964
1 58944
...
Here is my code; I am not really sure how I should use the groupby method to add dates to my dataframe based on id:
group=df.groupby('id')
group['date']=pd.date_range(start='2020/6/1', periods=120, freq='MS').shift(14,freq='D')
Please help me!!!
If you know how many sets of 120 you have, you can use this; just change the 2 at the end. This example repeats the 120 dates twice. You may have to adapt it for your specific use.
new_dates = list(pd.date_range(start='2020/6/1', periods=120, freq='MS').shift(14,freq='D'))*2
df = pd.DataFrame({'date': new_dates})
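If you would rather not hard-code the multiplier, a small sketch that derives it from the original frame (assuming df holds the csv data and every id really has exactly 120 rows):
# Count the distinct ids and repeat the 120 dates once per id.
n_ids = df['id'].nunique()
new_dates = list(pd.date_range(start='2020/6/1', periods=120, freq='MS').shift(14, freq='D')) * n_ids
df['date'] = new_dates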
These two are the same, except one uses a lambda:
def repeatingDates(numIds):
    return [d.strftime('%Y/%m/%d')
            for d in pd.date_range(start='2020/6/1', periods=120, freq='MS')] * numIds

repeatingDates = lambda numIds: [d.strftime('%Y/%m/%d')
                                 for d in pd.date_range(start='2020/6/1', periods=120, freq='MS')] * numIds
You can use Pandas transform. This is how I solved it:
dataf['dates'] = \
    (dataf
     .groupby("id")
     .transform(lambda d: pd.date_range(start='2020/6/1',
                                        periods=d.max(),
                                        freq='MS').shift(14, freq='D'))
    )
Results:
month id dates
0 1 1593 2020-06-15
1 2 1593 2020-07-15
2 3 1593 2020-08-15
3 1 8964 2020-06-15
4 2 8964 2020-07-15
5 1 58944 2020-06-15
6 2 58944 2020-07-15
7 3 58944 2020-08-15
8 4 58944 2020-09-15
Test data:
import io
import pandas as pd
dataf = pd.read_csv(io.StringIO("""
month,id
1,1593
2,1593
3,1593
1,8964
2,8964
1,58944
2,58944
3,58944
4,58944""")).astype(int)
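A related sketch on this test data: number the rows within each id with cumcount and index into a precomputed list of dates. This avoids transform and also works when ids have unequal row counts (assuming rows are in month order within each id):
# cumcount numbers rows 0, 1, 2, ... within each id, so it can be
# used to index directly into a precomputed list of dates.
dates = pd.date_range(start='2020/6/1', periods=120, freq='MS').shift(14, freq='D')
dataf['dates'] = dates[dataf.groupby('id').cumcount()]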

Calculating moving median within group

I want to compute a rolling median on the price column over the 4 days back, with the data grouped by date. So basically I want to take the prices for a given day plus all prices from the 4 days before it, and calculate the median of these values.
Here are the sample data:
id date price
1637027 2020-01-21 7045204.0
280955 2020-01-11 3590000.0
782078 2020-01-28 2600000.0
1921717 2020-02-17 5500000.0
1280579 2020-01-23 869000.0
2113506 2020-01-23 628869.0
580638 2020-01-25 650000.0
1843598 2020-02-29 969000.0
2300960 2020-01-24 5401530.0
1921380 2020-02-19 1220000.0
853202 2020-02-02 2990000.0
1024595 2020-01-27 3300000.0
565202 2020-01-25 3540000.0
703824 2020-01-18 3990000.0
426016 2020-01-26 830000.0
I got close by combining rolling and groupby:
df.groupby('date').rolling(window = 4, on = 'date')['price'].median()
But this seems to add one row per index value, and given how the median works I cannot simply merge these rows afterwards to produce one result per day.
Result now looks like this:
date date
2020-01-10 2020-01-10 NaN
2020-01-10 NaN
2020-01-10 NaN
2020-01-10 3070000.0
2020-01-10 4890000.0
...
2020-03-11 2020-03-11 4290000.0
2020-03-11 3745000.0
2020-03-11 3149500.0
2020-03-11 3149500.0
2020-03-11 3149500.0
Name: price, Length: 389716, dtype: float64
It seems it just dropped the first 3 values and then printed the price values unchanged.
Is it possible to get one lagged/moving median value per date?
You can use rolling with a frequency window of 5 days to get today plus the last 4 days, then drop_duplicates to keep the last row per day. First create a copy (if you want to keep the original), sort_values by date, and make sure the date column is datetime:
#sort and change to datetime
df_f = df[['date','price']].copy().sort_values('date')
df_f['date'] = pd.to_datetime(df_f['date'])
#create the column rolling
df_f['price'] = df_f.rolling('5D', on='date')['price'].median()
#drop_duplicates and keep the last row per day
df_f = df_f.drop_duplicates(['date'], keep='last').reset_index(drop=True)
print (df_f)
date price
0 2020-01-11 3590000.0
1 2020-01-18 3990000.0
2 2020-01-21 5517602.0
3 2020-01-23 869000.0
4 2020-01-24 3135265.0
5 2020-01-25 2204500.0
6 2020-01-26 849500.0
7 2020-01-27 869000.0
8 2020-01-28 2950000.0
9 2020-02-02 2990000.0
10 2020-02-17 5500000.0
11 2020-02-19 3360000.0
12 2020-02-29 969000.0
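If you instead want a strictly lagged median that excludes the current day's own prices, rolling also accepts a closed argument. A hedged sketch that rebuilds the sorted raw frame (since the price column of df_f was overwritten above); adjust the '5D' offset for a different lookback:
raw = df[['date', 'price']].copy()
raw['date'] = pd.to_datetime(raw['date'])
raw = raw.sort_values('date')
# closed='left' drops the right endpoint of each window, so a day's
# own prices are excluded from its median
raw['lagged_median'] = raw.rolling('5D', on='date', closed='left')['price'].median()
lagged_per_day = raw.drop_duplicates(['date'], keep='last')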
This is a step-by-step process, and there are probably more efficient methods of getting what you want. Note that if your dates carry time-of-day information, you would need to drop it before grouping by date.
import pandas as pd
import statistics as stat
import numpy as np
# Replace with you data import
df = pd.read_csv('random_dates_prices.csv')
# Convert your date to a datetime
df['date'] = pd.to_datetime(df['date'])
# Sort your data by date
df = df.sort_values(by = ['date'])
# Create group by object
dates = df.groupby('date')
# Reformat dataframe for one row per day, with prices in a nested list
df = pd.DataFrame(dates['price'].apply(lambda s: s.tolist()))
# Extract price lists to a separate list
prices = df['price'].tolist()
# Initialize list to store past four days of prices for current day
four_days = []
# Loop over the prices list, combining the last four days into a single list
for i in range(3, len(prices), 1):
    x = i - 1
    y = i - 2
    z = i - 3
    four_days.append(prices[i] + prices[x] + prices[y] + prices[z])
# Initialize a list to store median values
medians = []
# Loop through the four_days list and calculate the median of the last four days for the current date
for i in range(len(four_days)):
    medians.append(stat.median(four_days[i]))
# Create dummy zero values so the new lists line up with the data frame rows
four_days.insert(0, 0)
four_days.insert(0, 0)
four_days.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
# Add both new lists to the data frame
df['last_four_day_prices'] = four_days
df['last_four_days_median'] = medians
# Replace dummy zeros with np.nan
df[['last_four_day_prices', 'last_four_days_median']] = df[['last_four_day_prices', 'last_four_days_median']].replace(0, np.nan)
# Clean the data frame so you only have a single date and the median value for the past four days
df_clean = df.drop(['price', 'last_four_day_prices'], axis=1)

PANDAS Time Series Window Labels

I currently have a process for windowing time series data, but I am wondering if there is a vectorized, in-place approach for performance/resource reasons.
I have two lists holding the start and end dates of 30-day windows:
start_dts = ['2014-01-01', ...]
end_dts = ['2014-01-30', ...]
I have a dataframe with a field called 'transaction_dt'.
What I am trying to accomplish is a method to add two new columns ('start_dt' and 'end_dt') to each row where the transaction_dt falls between a pair of 'start_dt' and 'end_dt' values. Ideally, this would be vectorized and in-place if possible.
EDIT:
As requested here is some sample data of my format:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
IIUC, by using IntervalIndex:
df2.index = pd.IntervalIndex.from_arrays(df2['Start'], df2['End'], closed='both')
df[['End', 'Start']] = df2.loc[df['transaction_dt']].values
df
Out[457]:
transaction_dt End Start
0 2017-01-02 2017-01-31 2017-01-01
1 2017-03-02 2017-03-31 2017-03-01
2 2017-04-02 2017-04-30 2017-04-01
3 2017-05-02 2017-05-31 2017-05-01
Data input:
df=pd.DataFrame({'transaction_dt':['2017-01-02','2017-03-02','2017-04-02','2017-05-02']})
df['transaction_dt']=pd.to_datetime(df['transaction_dt'])
list1=['2017-01-01','2017-02-01','2017-03-01','2017-04-01','2017-05-01']
list2=['2017-01-31','2017-02-28','2017-03-31','2017-04-30','2017-05-31']
df2=pd.DataFrame({'Start':list1,'End':list2})
df2.Start=pd.to_datetime(df2.Start)
df2.End=pd.to_datetime(df2.End)
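For comparison, the same matching can be sketched with pd.merge_asof on the df and df2 defined above: each transaction is joined to the latest window Start at or before it, and rows that overshoot their window's End are then dropped.
# merge_asof matches each transaction_dt to the closest earlier Start;
# the filter keeps only rows that fall inside [Start, End].
out = pd.merge_asof(df.sort_values('transaction_dt'),
                    df2.sort_values('Start'),
                    left_on='transaction_dt', right_on='Start')
out = out[out['transaction_dt'] <= out['End']]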
If you want the start and end of each calendar month, we can use this (see "Extracting the first day of month of a datetime type column in pandas"):
import io
import pandas as pd
import datetime
string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25"""
df = pd.read_csv(io.StringIO(string))
df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])
df["start"] = df['transaction_dt'].dt.floor('d') - pd.offsets.MonthBegin(1)
df["end"] = df['transaction_dt'].dt.floor('d') + pd.offsets.MonthEnd(1)
df
Returns
customer_id transaction_dt product price units start end
0 1 2004-01-02 thing1 25 47 2004-01-01 2004-01-31
1 1 2004-01-17 thing2 150 8 2004-01-01 2004-01-31
2 2 2004-01-29 thing2 150 25 2004-01-01 2004-01-31
new approach:
import io
import pandas as pd
import datetime
string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-06-29,thing2,150,25"""
df = pd.read_csv(io.StringIO(string))
df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])
# Get all timestamps that are necessary
# This assumes dates are sorted
# if not we should change [0] -> min_dt and [-1] --> max_dt
timestamps = [df.iloc[0]["transaction_dt"].floor('d') - pd.offsets.MonthBegin(1)]
while df.iloc[-1]["transaction_dt"].floor('d') > timestamps[-1]:
    timestamps.append(timestamps[-1] + datetime.timedelta(days=30))
# We store all ranges here
ranges = list(zip(timestamps, timestamps[1:]))
# Loop through all values and fill in the start and end columns
for ind, value in enumerate(df["transaction_dt"]):
    for i, (start, end) in enumerate(ranges):
        if value >= start and value <= end:
            df.loc[ind, "start"] = start
            df.loc[ind, "end"] = end
            # When a match is found, also remove the ranges that have
            # already been passed. This relies on dates being sorted,
            # but should speed things up for large datasets.
            for _ in range(i):
                ranges.pop(0)
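The inner loop over ranges can also be replaced with pd.cut, reusing the timestamps list built above. A sketch; note that bins are right-closed by default, so this assumes every transaction falls strictly inside the bin edges:
# cut assigns each transaction to an interval in one vectorized call;
# the interval edges then become the start/end columns.
bins = pd.cut(df["transaction_dt"], pd.DatetimeIndex(timestamps))
df["start"] = [iv.left for iv in bins]
df["end"] = [iv.right for iv in bins]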

How to slice a pandas dataframe at a regular interval

I am new to python and I have a list of five climate data replicates that I would like to separate into individual replicates. Each replicate has a length of 42734, and the total length of the data frame (df) is 213,674.
Each replicate is separated by a line where the first entry is “replicate”. I have shown the titles of each column of data above the separating line.
Index year Month Day Rain Evap Max_Temp
42734 Replicate # 2 nan nan nan
I have tried the following code, which is extremely clunky; since I will eventually have to generate 100 climate replicates, it is not practical. I know there is an easier way to do this, but I do not have enough experience with python yet to figure it out.
Here is the code I wrote:
# Import replicate .txt file into a dataframe
df = pd.read_table('5_replicates.txt', sep=r"\s*",
                   skiprows=12, engine='python', header=None,
                   names=['year', 'Month', 'Day', 'Rain', 'Evap', 'Max_T'])
len(df)
i = 42734
num_replicates = 5
# Replicate 1
replicate_1 = df[0:i]
print("length of replicate_1:", len(replicate_1))
# Replicate 2
replicate_2 = df[i+1 : 2*i+1]
print("length of replicate_2:", len(replicate_2))
# Replicate 3
replicate_3 = df[2*i+2 : 3*i+2]
print("length of replicate_3:", len(replicate_3))
# Replicate 4
replicate_4 = df[3*i+3 : 4*i+3]
print("length of replicate_4:", len(replicate_4))
# Replicate 5
replicate_5 = df[4*i+4 : 5*i+4]
print("length of replicate_5:", len(replicate_5))
Any help would be much appreciated!
## create the example data frame
df = pd.DataFrame({'year': pd.date_range(start='2016-01-01', end='2017-01-01', freq='H'),
                   'rain': np.random.randn(8785),
                   'max_temp': np.random.randn(8785)})
df.year = df.year.astype(str)  # make the year column of str type
## add the indexes at which we enter a replicate
## (.loc replaces the deprecated .ix)
df.loc[np.floor(np.linspace(0, df.shape[0] - 1, 5)).astype(int), 'year'] = "Replicate"
In [7]: df.head()
Out[7]:
max_temp rain year
0 -1.068354 0.959108 Replicate
1 -0.219425 0.777235 2016-01-01 01:00:00
2 -0.262994 0.472665 2016-01-01 02:00:00
3 -1.761527 -0.515135 2016-01-01 03:00:00
4 -2.038738 -1.452385 2016-01-01 04:00:00
Here, I do the following: 1) find the indexes at which the word "Replicate" appears and record them; 2) create a Python range for each block, stored in the dictionary idx_dict, which records which rows belong to which replicate; 3) finally, assign the replicate number to each block, though once you have the range objects you don't strictly need this step.
# 1) find where the word "Replicate" is featured
indexes = df[df.year == 'Replicate'].index
# 2) create the range objects
idx_dict = {}
for i in range(0, indexes.shape[0] - 1):
    idx_dict[i] = range(indexes[i], indexes[i + 1] - 1)
# 3) set the replicate number in some column
df.loc[:, 'rep_num'] = np.nan  # preset a value for the 'rep_num' column
for i in range(0, 4):
    print(i)
    df.loc[idx_dict[i], 'rep_num'] = i
# fill in the NAs because my indexing algorithm isn't splendid
df.rep_num.fillna(method='ffill', inplace=True)
Now, you can just subset the df as you please by the replicate number or store portions elsewhere.
#get the number of rows in each replicate:
In [26]: df.groupby("rep_num").count()
Out[26]:
max_temp rain year
rep_num
0.0 2196 2196 2196
1.0 2196 2196 2196
2.0 2196 2196 2196
3.0 2197 2197 2197
#get the portion with the first replicate
In [27]: df.loc[df.rep_num==0,:].head()
Out[27]:
max_temp rain year rep_num
0 0.976052 0.896358 Replicate 0.0
1 -0.875221 -1.110111 2016-01-01 01:00:00 0.0
2 -0.305727 0.495230 2016-01-01 02:00:00 0.0
3 0.694737 -0.356541 2016-01-01 03:00:00 0.0
4 0.325071 0.669536 2016-01-01 04:00:00 0.0
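For the original question, a more compact sketch of the same idea, assuming separator rows can be recognized by the year column starting with "Replicate": flag the separators, number the blocks with a cumulative sum, then drop the separator rows.
# Each separator row bumps the cumulative sum, so all rows of a
# replicate share one rep_num; the separators themselves are dropped.
sep = df['year'].astype(str).str.startswith('Replicate')
df['rep_num'] = sep.cumsum()
replicates = {k: g for k, g in df[~sep].groupby('rep_num')}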
