How to slice a pandas dataframe at a regular interval - python

I am new to python and I have a list of five climate data replicates that I would like to separate into individual replicates. Each replicate has a length of 42734, and the total length of the data frame (df) is 213,674.
Each replicate is separated by a line where the first entry is “replicate”. I have shown the titles of each column of data above the separating line.
Index year Month Day Rain Evap Max_Temp
42734 Replicate # 2 nan nan nan
I have tried the following code, which is extremely clunky and, since I have to generate 100 climate replicates, not practical. I know there is an easier way to do this, but I do not have enough experience with python yet to figure it out.
Here is the code I wrote:
# Import replicate .txt file into a dataframe
df = pd.read_table('5_replicates.txt', sep=r"\s+",
                   skiprows=12, engine='python', header=None,
                   names=['year', 'Month', 'Day', 'Rain', 'Evap', 'Max_T'])
len(df)
i = 42734
num_replicates = 5
# Replicate 1
replicate_1 = df[0:i]
print("length of replicate_1:", len(replicate_1))

# Replicate 2
replicate_2 = df[i+1 : 2*i+1]
print("length of replicate_2:", len(replicate_2))

# Replicate 3
replicate_3 = df[2*i+2 : 3*i+2]
print("length of replicate_3:", len(replicate_3))

# Replicate 4
replicate_4 = df[3*i+3 : 4*i+3]
print("length of replicate_4:", len(replicate_4))

# Replicate 5
replicate_5 = df[4*i+4 : 5*i+4]
print("length of replicate_5:", len(replicate_5))
Any help would be much appreciated!

## create the example data frame
import numpy as np
import pandas as pd

df = pd.DataFrame({'year': pd.date_range(start='2016-01-01', end='2017-01-01', freq='H'),
                   'rain': np.random.randn(8785),
                   'max_temp': np.random.randn(8785)})
df.year = df.year.astype(str)  # make the year column of str type

## add marker rows at which we enter a replicate
## (.loc with integer positions; .ix is deprecated)
df.loc[np.linspace(0, df.shape[0] - 1, 5, dtype=int), 'year'] = "Replicate"
In [7]: df.head()
Out[7]:
max_temp rain year
0 -1.068354 0.959108 Replicate
1 -0.219425 0.777235 2016-01-01 01:00:00
2 -0.262994 0.472665 2016-01-01 02:00:00
3 -1.761527 -0.515135 2016-01-01 03:00:00
4 -2.038738 -1.452385 2016-01-01 04:00:00
Here, I do the following: 1) find the indexes at which the word "Replicate" appears; 2) build a Python range for each block, stored in the dictionary idx_dict, indexing which rows belong to which replicate; 3) finally, assign the replicate number to each block, though once you have the range objects you don't really need this step.
# 1) find where the word "Replicate" is featured
indexes = df[df.year == 'Replicate'].index

# 2) create the range objects
idx_dict = {}
for i in range(0, indexes.shape[0] - 1):
    idx_dict[i] = range(indexes[i], indexes[i+1] - 1)

# 3) set the replicate number in some column
df.loc[:, 'rep_num'] = np.nan  # preset a value for the 'rep_num' column
for i in range(0, 4):
    print(i)
    df.loc[idx_dict[i], 'rep_num'] = i

# fill in the NAs because my indexing algorithm isn't splendid
df.rep_num.fillna(method='ffill', inplace=True)
Now, you can just subset the df as you please by the replicate number or store portions elsewhere.
#get the number of rows in each replicate:
In [26]: df.groupby("rep_num").count()
Out[26]:
max_temp rain year
rep_num
0.0 2196 2196 2196
1.0 2196 2196 2196
2.0 2196 2196 2196
3.0 2197 2197 2197
#get the portion with the first replicate
In [27]: df.loc[df.rep_num==0,:].head()
Out[27]:
max_temp rain year rep_num
0 0.976052 0.896358 Replicate 0.0
1 -0.875221 -1.110111 2016-01-01 01:00:00 0.0
2 -0.305727 0.495230 2016-01-01 02:00:00 0.0
3 0.694737 -0.356541 2016-01-01 03:00:00 0.0
4 0.325071 0.669536 2016-01-01 04:00:00 0.0
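For completeness, steps 1)-3) can be collapsed into a cumulative sum over the marker rows (a sketch, assuming every block boundary is a row whose year equals "Replicate"):

# every "Replicate" marker row bumps the block counter;
# any rows before the first marker fall into block 0
df['rep_num'] = (df['year'] == 'Replicate').cumsum()
# drop the marker rows themselves, then collect one frame per replicate
replicates = [g for _, g in df[df['year'] != 'Replicate'].groupby('rep_num')]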

Related

How to summarize missing values in time series data in a Pandas Dataframe?

I have a timeseries dataset like the following:
As seen, there are three columns for channel values paired against the same set of timestamps.
Each channel has sets of NaN values.
My objective is to create a summary of these NaN values as follows:
My approach (inefficient): create a for loop to go across each channel column first, and then another nested for loop to go across each row of the channel. When it stumbles across a run of NaN values, it registers the start timestamp, the end timestamp and the duration as individual rows (or lists), which I can eventually stack together as the final output (sketched just below).
But my logic seems pretty inefficient and slow, especially considering that my original dataset has 200 channel columns and 10k rows. I'm sure there should be a better approach than this in Python.
Can anyone please help me out with an appropriate way to deal with this - using Pandas in Python?
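For reference, the loop-based idea described above might look roughly like this (a sketch; the column names date_time and the channel columns are illustrative, not taken from the original data):

import pandas as pd

rows = []
for col in [c for c in df.columns if c != 'date_time']:
    start = None
    for i in range(len(df)):
        if pd.isna(df[col].iloc[i]) and start is None:
            start = df['date_time'].iloc[i]      # a gap opens here
        elif pd.notna(df[col].iloc[i]) and start is not None:
            end = df['date_time'].iloc[i]        # first valid reading after the gap
            rows.append((col, start, end, end - start))
            start = None
summary = pd.DataFrame(rows, columns=['Channel No.', 'Starting_Timestamp',
                                      'Ending_Timestamp', 'Duration'])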
Use DataFrame.melt to reshape the DataFrame, then filter the consecutive groups of missing values (plus the first valid value after each gap) and create a new DataFrame by aggregating the min and max timestamps:
df['date_time'] = pd.to_datetime(df['date_time'])
df1 = df.melt('date_time', var_name='Channel No.')

m = df1['value'].shift(fill_value=False).notna()  # True where the previous value is valid
mask = df1['value'].isna() | ~m                   # NaN rows plus the first valid row after each gap

df1 = (df1.groupby([m.cumsum()[mask], 'Channel No.'])
          .agg(Starting_Timestamp=('date_time', 'min'),
               Ending_Timestamp=('date_time', 'max'))
          .assign(Duration=lambda x: x['Ending_Timestamp'].sub(x['Starting_Timestamp']))
          .droplevel(0)
          .reset_index())
print(df1)
Channel No. Starting_Timestamp Ending_Timestamp Duration
0 Channel_1 2019-09-19 10:59:00 2019-09-19 14:44:00 0 days 03:45:00
1 Channel_1 2019-09-19 22:14:00 2019-09-19 23:29:00 0 days 01:15:00
2 Channel_2 2019-09-19 13:59:00 2019-09-19 19:44:00 0 days 05:45:00
3 Channel_3 2019-09-19 10:59:00 2019-09-19 12:44:00 0 days 01:45:00
4 Channel_3 2019-09-19 15:14:00 2019-09-19 16:44:00 0 days 01:30:00
Use:
inds = df[df['g'].isna()].index.to_list()
gs = []
s = 0
for i, x in enumerate(inds):
    if i < len(inds) - 1:
        if x + 1 != inds[i+1]:
            gs.append(inds[s:i+1])
            s = i + 1
    else:
        gs.append(inds[s:i+1])
ses = []
for g in gs:
    ses.append([df.iloc[g[0]]['date'], df.iloc[g[-1]+1]['date']])
res = pd.DataFrame(ses, columns=['st', 'et'])
res['d'] = res['et'] - res['st']
And a more efficient solution:
import pandas as pd
import numpy as np

df = pd.DataFrame({'date': pd.date_range('2021-01-01', '2021-12-01', 12), 'g': range(12)})
df.loc[0:3, 'g'] = np.nan
df.loc[5:7, 'g'] = np.nan

# split points: one row past the first valid value after each NaN run
inds = df[df['g'].isna().astype(int).diff() == -1].index + 1
pd.DataFrame([(x.iloc[0]['date'], x.iloc[-1]['date'])
              for x in np.array_split(df, inds)
              if np.isnan(x['g'].iloc[0])])

Boxplot of Multiindex df

I want to do 2 things:
I want to create one boxplot per date/day with all the values for MeanTravelTimeSeconds in that date. The number of MeanTravelTimeSeconds elements varies from date to date (e.g. one day might have a count of 300 values while another, 400).
Also, I want to transform the rows in my multiindex series into columns because I don't want the rows to repeat every time. If it stays like this I'd have tens of millions of unnecessary rows.
Here is the resulting series after using df.stack() on a df indexed by date (date is a datetime object index):
Date
2016-01-02 NumericIndex 1611664
OriginMovementID 4744
DestinationMovementID 5084
MeanTravelTimeSeconds 1233
RangeLowerBoundTravelTimeSeconds 756
...
2020-03-31 DestinationMovementID 3594
MeanTravelTimeSeconds 1778
RangeLowerBoundTravelTimeSeconds 1601
RangeUpperBoundTravelTimeSeconds 1973
DayOfWeek Tuesday
Length: 11281655, dtype: object
When I use seaborn to plot the boxplot I get a bunch of errors after playing with different selections.
If I try to do df.stack().unstack() or df.stack().T I get the following error:
Index contains duplicate entries, cannot reshape
How do I plot the boxplot and how do I turn the rows into columns?
You really do need to make your index unique for the functions you want to use to work. I suggest a sequential number that resets at every change in the other two key columns.
import datetime as dt
import random
import numpy as np
import pandas as pd

cat = ["NumericIndex", "OriginMovementID", "DestinationMovementID",
       "MeanTravelTimeSeconds", "RangeLowerBoundTravelTimeSeconds"]
df = pd.DataFrame(
    [{"Date": d, "Observation": cat[random.randint(0, len(cat)-1)],
      "Value": random.randint(1000, 10000)}
     for i in range(random.randint(5, 20))
     for d in pd.date_range(dt.datetime(2016, 1, 2), dt.datetime(2016, 3, 31), freq="14D")])

# starting point....
df = df.sort_values(["Date", "Observation"]).set_index(["Date", "Observation"])

# generate an array that is sequential within change of key
seq = np.full(df.index.shape, 0)
s = 0
p = ""
for i, v in enumerate(df.index):
    if i == 0 or p != v:
        s = 0
    else:
        s += 1
    seq[i] = s
    p = v
df["SeqNo"] = seq

# add to index - now unstack works as required
dfdd = df.set_index(["SeqNo"], append=True)
dfdd.unstack(0).loc["MeanTravelTimeSeconds"].boxplot()
print(dfdd.unstack(1).head().to_string())
output
Value
Observation DestinationMovementID MeanTravelTimeSeconds NumericIndex OriginMovementID RangeLowerBoundTravelTimeSeconds
Date SeqNo
2016-01-02 0 NaN NaN 2560.0 5324.0 5085.0
1 NaN NaN 1066.0 7372.0 NaN
2016-01-16 0 NaN 6226.0 NaN 7832.0 NaN
1 NaN 1384.0 NaN 8839.0 NaN
2 NaN 7892.0 NaN NaN NaN
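As an aside, the hand-rolled sequence loop above can usually be replaced with groupby().cumcount(), applied to the flat frame before the index is set (a sketch against the same df):

# equivalent sequential counter per (Date, Observation) key
df = df.reset_index()
df["SeqNo"] = df.groupby(["Date", "Observation"]).cumcount()
dfdd = df.set_index(["Date", "Observation", "SeqNo"])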

Calculating moving median within group

I want to perform a rolling median on the price column over the 4 days back, with the data grouped by date. So basically I want to take the prices for a given day and all prices from the 4 days before it, and calculate the median of these values.
Here are the sample data:
id date price
1637027 2020-01-21 7045204.0
280955 2020-01-11 3590000.0
782078 2020-01-28 2600000.0
1921717 2020-02-17 5500000.0
1280579 2020-01-23 869000.0
2113506 2020-01-23 628869.0
580638 2020-01-25 650000.0
1843598 2020-02-29 969000.0
2300960 2020-01-24 5401530.0
1921380 2020-02-19 1220000.0
853202 2020-02-02 2990000.0
1024595 2020-01-27 3300000.0
565202 2020-01-25 3540000.0
703824 2020-01-18 3990000.0
426016 2020-01-26 830000.0
I got close with combining rolling and groupby:
df.groupby('date').rolling(window = 4, on = 'date')['price'].median()
But this seems to add one row per index value, and since the median is a single summary value per window, I am not able to merge these rows to produce one result per row.
Result now looks like this:
date date
2020-01-10 2020-01-10 NaN
2020-01-10 NaN
2020-01-10 NaN
2020-01-10 3070000.0
2020-01-10 4890000.0
...
2020-03-11 2020-03-11 4290000.0
2020-03-11 3745000.0
2020-03-11 3149500.0
2020-03-11 3149500.0
2020-03-11 3149500.0
Name: price, Length: 389716, dtype: float64
It seems it just dropped the first 3 values and then simply echoed the price values.
Is it possible to get one lagged / moving median value per one date?
You can use rolling with a frequency window of 5 days to get today plus the last 4 days, then drop_duplicates to keep the last row per day. First create a copy (if you want to keep the original), sort_values by date, and ensure the date column is datetime:
#sort and change to datetime
df_f = df[['date','price']].copy().sort_values('date')
df_f['date'] = pd.to_datetime(df_f['date'])
# compute the rolling median over a 5-day window on the date column
df_f['price'] = df_f.rolling('5D', on='date')['price'].median()
#drop_duplicates and keep the last row per day
df_f = df_f.drop_duplicates(['date'], keep='last').reset_index(drop=True)
print (df_f)
date price
0 2020-01-11 3590000.0
1 2020-01-18 3990000.0
2 2020-01-21 5517602.0
3 2020-01-23 869000.0
4 2020-01-24 3135265.0
5 2020-01-25 2204500.0
6 2020-01-26 849500.0
7 2020-01-27 869000.0
8 2020-01-28 2950000.0
9 2020-02-02 2990000.0
10 2020-02-17 5500000.0
11 2020-02-19 3360000.0
12 2020-02-29 969000.0
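If you want the rolling median attached back onto every original row rather than one row per day, a left merge on date should do it (a sketch reusing df_f from above; the column name price_median is hypothetical):

# attach the per-day rolling median back onto the original rows
df['date'] = pd.to_datetime(df['date'])
out = df.merge(df_f.rename(columns={'price': 'price_median'}), on='date', how='left')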
This is a step by step process. There are probably more efficient methods of getting what you want. Note, if you have time information for your dates, you would need to drop that information before grouping by date.
import pandas as pd
import statistics as stat
import numpy as np
# Replace with your data import
df = pd.read_csv('random_dates_prices.csv')
# Convert your date to a datetime
df['date'] = pd.to_datetime(df['date'])
# Sort your data by date
df = df.sort_values(by = ['date'])
# Create group by object
dates = df.groupby('date')
# Reformat dataframe for one row per day, with prices in a nested list
df = pd.DataFrame(dates['price'].apply(lambda s: s.tolist()))
# Extract price lists to a separate list
prices = df['price'].tolist()
# Initialize list to store past four days of prices for current day
four_days = []
# Loop over the prices list to combine the last four days to a single list
for i in range(3, len(prices), 1):
    x = i - 1
    y = i - 2
    z = i - 3
    four_days.append(prices[i] + prices[x] + prices[y] + prices[z])

# Initialize a list to store median values
medians = []

# Loop through four_days list and calculate the median of the last four days for the current date
for i in range(len(four_days)):
    medians.append(stat.median(four_days[i]))
# Create dummy zero values so the new lists line up with the dataframe rows
four_days.insert(0, 0)
four_days.insert(0, 0)
four_days.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
# Add both new lists to data frames
df['last_four_day_prices'] = four_days
df['last_four_days_median'] = medians
# Replace dummy zeros with np.nan
df[['last_four_day_prices', 'last_four_days_median']] = df[['last_four_day_prices', 'last_four_days_median']].replace(0, np.nan)
# Clean the data frame so you only have a single date and the median value for the past four days
df_clean = df.drop(['price', 'last_four_day_prices'], axis=1)

Create conditional column for Date Difference based on matching values in two columns

I have a dataframe and I am struggling to create a column based on other columns. I will share the problem with some sample data.
Date Target1 Close
0 2019-04-17 209.2440 203.130005
1 2019-04-17 212.2155 203.130005
2 2019-04-17 213.6330 203.130005
3 2019-04-17 213.0555 203.130005
4 2019-04-17 212.6250 203.130005
5 2019-04-17 212.9820 203.130005
6 2019-04-17 213.1395 203.130005
7 2019-04-16 209.2860 199.250000
8 2019-04-16 209.9055 199.250000
9 2019-04-16 210.3045 199.250000
I want to create another column for each observation (called days_to_hit_target, for example) that holds the number of days until Close hits (or comes very close to) that row's Target1; when Close does get that close, the difference in days goes into days_to_hit_target.
This should work:
daysAboveTarget = []
for i in range(len(df.Date)):
    try:
        dayAboveTarget = df.iloc[i:].loc[(df.Close > df.Target1[i])]['Date'].iloc[0]
    except IndexError:
        dayAboveTarget = None
    daysAboveTarget.append(dayAboveTarget)
daysAboveTarget = pd.Series(daysAboveTarget)
df['days_to_hit_target'] = daysAboveTarget - df.Date
I sort of overused iloc and loc here, so let me explain.
The variable dayAboveTarget gets the date when the price closes above the target. The first iloc subsets the dataframe to only future dates, the first loc finds the actual results, the second iloc gets only the first result. We need the exception for days where the price never goes above target.
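For small frames, the same first-hit search can also be vectorized with a pairwise comparison instead of the Python loop (a sketch, assuming df is sorted by Date ascending with a default RangeIndex; the n-by-n matrix makes this memory-hungry for large data):

import numpy as np

close = df['Close'].to_numpy()
target = df['Target1'].to_numpy()
dates = pd.to_datetime(df['Date']).to_numpy()

# hit[i, j]: does day j's close exceed day i's target? keep only j >= i
hit = np.triu(close[None, :] > target[:, None])
any_hit = hit.any(axis=1)
first_hit = hit.argmax(axis=1)   # column of first True per row (0 when none)

days = np.full(len(df), np.nan)
days[any_hit] = (dates[first_hit[any_hit]] - dates[any_hit]) / np.timedelta64(1, 'D')
df['days_to_hit_target'] = days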
NOTE I use python 3.7.1 and pandas 0.23.4. I came up with something very dirty; I am sure there is a neater and more efficient way of doing this.
### Create sample data
import numpy as np
import pandas as pd

date_range = pd.date_range(start="1/1/2018", end="20/1/2018", freq="6H", closed="right")
target1 = np.random.uniform(10, 30, len(date_range))
close = [[i]*4 for i in np.random.uniform(10, 30, len(date_range)//4)]
close_flat = np.array([item for sublist in close for item in sublist])
df = pd.DataFrame(np.array([np.array(date_range.date), target1, close_flat]).transpose(),
                  columns=["date", "target", "close"])
### Create the column you need
# iterating over the days and finding days when the difference between
# "close" of current day and all "target" is lower than 0.25 OR the "target"
# value is greater than "close" value.
thresh = 0.25
date_diff_arr = np.zeros(len(df))
for i in range(0, len(df), 4):
    diff_lt_thresh = df[(abs(df.target - df.close.iloc[i]) < thresh) | (df.target > df.close.iloc[i])]
    # only keep the findings from the next day onwards
    diff_lt_thresh = diff_lt_thresh.loc[i+4:]
    if not diff_lt_thresh.empty:
        # find day difference only if something under thresh is found
        days_diff = (diff_lt_thresh.iloc[0].date - df.iloc[i].date).days
    else:
        # otherwise write it as nan
        days_diff = np.nan
    # fill in the np.array which will be used to write to the df
    date_diff_arr[i:i+4] = days_diff
df["date_diff"] = date_diff_arr
Sample output:
0 2018-01-01 21.64 26.7319 2.0
1 2018-01-01 22.9047 26.7319 2.0
2 2018-01-01 26.0945 26.7319 2.0
3 2018-01-02 10.2155 26.7319 2.0
4 2018-01-02 17.5602 11.0507 1.0
5 2018-01-02 12.0368 11.0507 1.0
6 2018-01-02 19.5923 11.0507 1.0
7 2018-01-03 21.8168 11.0507 1.0
8 2018-01-03 11.5433 16.8862 1.0
9 2018-01-03 27.3739 16.8862 1.0
10 2018-01-03 26.9073 16.8862 1.0
11 2018-01-04 19.6677 16.8862 1.0
12 2018-01-04 25.3599 27.3373 1.0
13 2018-01-04 22.7479 27.3373 1.0
14 2018-01-04 18.7246 27.3373 1.0
15 2018-01-05 25.4122 27.3373 1.0
16 2018-01-05 28.3294 23.8469 1.0
maybe a little faster solution:
import pandas as pd

# df is your DataFrame
df["Date"] = pd.to_datetime(df["Date"])
# reset the index so x.name below is a positional offset
df = df.sort_values("Date").reset_index(drop=True)

def days_to_hit(x, no_hit_default=None):
    return next(
        ((df["Date"].iloc[j + x.name] - x["Date"]).days
         for j in range(len(df) - x.name)
         if df["Close"].iloc[j + x.name] >= x["Target1"]),
        no_hit_default)

df["days_to_hit_target"] = df.apply(days_to_hit, axis=1)

Data Cleaning in Python/Pandas to iterate through months combinations

I am doing some data cleaning to do some machine learning on a data set.
Basically I would like to predict next 12 months values based on last 12 months.
I have a data set with values per month (example below).
I would like to train my model by iterating over every possible window of 12 consecutive months.
For example, I want to train it on 2014-01 to 2014-12 to populate 2015-01 to 2015-12, but also to train it on 2014-02 to 2015-01 to populate 2015-02 to 2016-01, etc.
But I struggle to populate all these possibilities.
I show below where I currently am in my code, and an example of what I would like to get (with just 6 months instead of 12).
import pandas as pd
import numpy as np
data = [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
         13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]]
Months = ['201401','201402','201403','201404','201405','201406',
          '201407','201408','201409','201410','201411','201412',
          '201501','201502','201503','201504','201505','201506',
          '201507','201508','201509','201510','201511','201512']
df = pd.DataFrame(data, columns=Months)
Here is the part that I can't get to work:
X = np.array([])
Y = np.array([])
for month in Months:
    loc = df.columns.get_loc(month)
    print(month, loc)
    if loc + 11 <= df.shape[1]:
        X = np.append(X, df.iloc[:, loc:loc+5].values, axis=0)
        Y = np.append(Y, df.iloc[:, loc+6:loc+1].values, axis=0)
This is what I am expecting (for the first 3 iterations):
### RESULTS EXPECTED ####
X = [[1,2,3,4,5,6],[2,3,4,5,6,7],[3,4,5,6,7,8]]
Y = [[7,8,9,10,11,12],[8,9,10,11,12,13],[9,10,11,12,13,14]]
To generate date ranges like the ones you describe in your explanation (rather than the ones shown in your sample output), you could use Pandas functionality like so:
import pandas as pd
months = pd.Series([
'201401','201402','201403','201404','201405','201406',
'201407','201408','201409','201410','201411','201412',
'201501','201502','201503','201504','201505','201506',
'201507','201508','201509','201510','201511','201512'
])
# this function converts strings like "201401"
# to datetime objects, and then uses DateOffset
# and date_range to generate a sequence of months
def date_range(month):
    date = pd.to_datetime(month, format="%Y%m")
    return pd.date_range(date, date + pd.DateOffset(months=11), freq='MS')
# apply function to original Series
# and then apply pd.Series to expand resulting arrays
# into DataFrame columns
month_ranges = months.apply(date_range).apply(pd.Series)
# sample of output:
# 0 1 2 3 4 5 \
# 0 2014-01-01 2014-02-01 2014-03-01 2014-04-01 2014-05-01 2014-06-01
# 1 2014-02-01 2014-03-01 2014-04-01 2014-05-01 2014-06-01 2014-07-01
# 2 2014-03-01 2014-04-01 2014-05-01 2014-06-01 2014-07-01 2014-08-01
# 3 2014-04-01 2014-05-01 2014-06-01 2014-07-01 2014-08-01 2014-09-01
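To actually build the X and Y arrays shown in the expected results, a plain sliding window over the row of values would do it (a sketch using the 6-month window from the example; swap in 12 for the real case):

import numpy as np

values = df.values.ravel()   # the single row of 24 monthly values
window = 6

n_windows = len(values) - 2 * window + 1
X = np.array([values[i:i + window] for i in range(n_windows)])
Y = np.array([values[i + window:i + 2 * window] for i in range(n_windows)])
# X[0] -> [1 2 3 4 5 6], Y[0] -> [7 8 9 10 11 12]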
