I am working on processing a very large data set with pandas, splitting it into more manageable data frames. I have a loop that goes through and splits the data frame into smaller data frames based on a leading ID number, and I then sort each one by its date column. However, I notice that after everything runs there are still some issues with dates not being sorted correctly. I want to create a manual filter that loops through the date column and checks whether the next date is greater than or equal to the previous date. Ideally this would eliminate cases where the date column goes something like (obviously in more of a data frame format):
[2017,2017,2018,2018,2018,2017,2018,2018]
I am writing some code to take care of this; however, I keep getting errors and was hoping someone could point me in the right direction.
for i in range(len(Rcols)):
    dfs[i] = data.filter(regex=f'{Rcols[i]}-')
    dfs[i]['Engine'] = data['Operation_ID:-PARAMETER_NAME:']
    dfs[i].set_index('Engine', inplace=True)
    dfs[i][f'{Rcols[i]}-DATE_TIME_START'] = pd.to_datetime(dfs[i][f'{Rcols[i]}-DATE_TIME_START'], errors='ignore')
    dfs[i].sort_values(by=f'{Rcols[i]}-DATE_TIME_START', ascending=True, inplace=True)
    for index, item in enumerate(dfs[i][f'{Rcols[i]}-DATE_TIME_START']):
        if dfs[i][f'{Rcols[i]}-DATE_TIME_START'][index + 1] >= dfs[i][f'{Rcols[i]}-DATE_TIME_START'][index]:
            continue
        else:
            dfs[i].drop(dfs[i][index])
Where Rcols is just a list of the leading column-header IDs, and dfs is a large list of pandas data frames.
Thanks
This isn't particularly "manual", but you can use pd.Series.shift. Here's a minimal example, but the principle works equally well with a series of dates:
import pandas as pd

df = pd.DataFrame({'Years': [2017, 2017, 2018, 2018, 2018, 2017, 2018, 2018]})
mask = df['Years'].shift() > df['Years']
df = df[~mask]
print(df)
Years
0 2017
1 2017
2 2018
3 2018
4 2018
6 2018
7 2018
Notice how the row with index 5 has been dropped since 2017 < 2018 (the row before). You can extend this for multiple columns via a for loop.
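For instance, here is a rough sketch of how the same mask could slot into the loop from your question; it assumes the dfs list and Rcols names defined there and is untested against your data, so treat it as a starting point:
# Hedged sketch: drop rows whose date goes backwards, one sub-frame at a time.
# Assumes dfs and Rcols as defined in the question.
for i in range(len(Rcols)):
    col = f'{Rcols[i]}-DATE_TIME_START'
    mask = dfs[i][col].shift() > dfs[i][col]
    dfs[i] = dfs[i][~mask]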
You should under no circumstances modify rows while you are iterating over them. This is spelt out in the docs for pd.DataFrame.iterrows:
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.
However, this becomes irrelevant when there is a vectorised solution available, as described above.
I have a dataframe that has Date as its index. The dataframe holds stock market data, so the dates are not continuous. If I want to move, let's say, 120 rows up in the dataframe, how do I do that? For example:
If I want to get the data starting from 120 trading days before the start of year 2018, how would I modify the selection below:
df['2018-01-01':'2019-12-31']
Thanks
Try this:
df.iloc[df.index.get_loc('2018-01-01'):df.index.get_loc('2019-12-31') + 1]
Get the integer location of both dates in the index and slice between them to get the desired range.
UPDATE:
Based on your requirement, here are some small modifications of the above.
Yearly Indexing
>>> df.iloc[df.index.get_loc('2018').start:df.index.get_loc('2019').stop]
Above, df.index.get_loc('2018') gives back a slice object covering all of 2018, so its .start attribute is the position of the first row of 2018; similarly, df.index.get_loc('2019').stop is the position just past the last row of 2019.
Monthly Indexing
Now suppose you want data for the first 6 months of 2018 (without knowing what the first day is); the same can be done using:
>>> df.iloc[df.index.get_loc('2018-01').start:df.index.get_loc('2018-06').stop]
As you can see, we have indexed the first 6 months of 2018 using the same logic.
Assuming you are using pandas, the dates are stored in a 'dates' column, and the dataframe is sorted by dates, a very simple way would be:
initial_date = '2018-01-01'
initial_date_index = df.loc[df['dates']==initial_date].index[0]
offset=120
start_index = initial_date_index-offset
new_df = df.loc[start_index:]
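If, as in the question, the dates are the index itself rather than a column, a similar positional approach can be sketched (this assumes a sorted DatetimeIndex; it is untested against your data):
import pandas as pd

# Hedged sketch: find where 2018 starts in the sorted DatetimeIndex,
# then step back 120 positions (trading days) with iloc.
start_pos = df.index.searchsorted(pd.Timestamp('2018-01-01'))
new_df = df.iloc[max(start_pos - 120, 0):]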
I am trying to pivot a pandas dataframe, but the data is following a strange format that I cannot seem to pivot. The data is structured as below:
Date, Location, Action1, Quantity1, Action2, Quantity2, ... ActionN, QuantityN
<date> 1 Lights 10 CFloor 1 ... Null Null
<date2> 2 CFloor 2 CWalls 4 ... CBasement 15
<date3> 2 CWalls 7 CBasement 4 ... NUll Null
Essentially, each action will always have a quantity attached to it (which may be 0), but null actions will never have a quantity (the quantity will just be null). The format I am trying to achieve is the following:
Lights CFloor CBasement CWalls
1 10 1 0 0
2 0 2 19 11
The index of the rows becomes the location while the columns become any unique action found across the multiple activity columns. When pulling the data together, the value of each row/column is the sum of each quantity associated with the action (i.e. Action1 corresponds to Quantity1). Is there a way to do this with the native pandas pivot function?
My current code performs a ravel across all the activity columns to get a list of all unique activities. It will also grab all the unique locations from the Location column. Once I have the unique columns, I create an empty dataframe and fill it with zeros:
Lights CFloor CBasement CWalls
1 0 0 0 0
2 0 0 0 0
I then iterate back over the old data frame with the itertuples() method (I was told it was significantly faster than iterrows()) and populate the new dataframe. This empty dataframe acts as a template that is stored in memory and filled later.
import pandas as pd

# Creates a template from the dataframe
def create_template(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    activities = df[act_cols]
    flat_acts = activities.values.ravel('K')
    unique_locations = pd.unique(df['Location'])
    unique_acts = pd.unique(flat_acts)
    pivot_template = pd.DataFrame(index=unique_locations, columns=unique_acts).fillna(0)
    return pivot_template

# Fills the template from the dataframe
def create_pivot(df, pivot_frmt):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    for row in df.itertuples():
        for act, quantity in zip(act_cols, quant_cols):
            act_val = getattr(row, act)
            if pd.notna(act_val):
                quantity_val = getattr(row, quantity)
                location = getattr(row, 'Location')
                pivot_frmt.loc[location, act_val] += quantity_val
    return pivot_frmt
While my solution works, it is incredibly slow when dealing with a large dataset and has taken 10 seconds or more to complete this type of operation. Any help would be greatly appreciated!
After experimenting with various pandas functions, such as melt and pivot on multiple columns simultaneously, I found a solution that worked for me:
For every quantity-activity pair, I build a partial frame of the final dataset and store it in a list. Once every pair has been addressed, I end up with multiple dataframes that all have the same row counts but potentially different column counts. I solve this by concatenating along the columns and, where any column names are repeated, summing them to get the final result.
import numpy as np
import pandas as pd

def test_pivot(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    dfs = []
    for act, quant in zip(act_cols, quant_cols):
        partial = pd.crosstab(index=df['Location'], columns=df[act],
                              values=df[quant], aggfunc=np.sum).fillna(0)
        dfs.append(partial)
    finalDf = pd.concat(dfs, axis=1)
    finalDf = finalDf.groupby(finalDf.columns, axis=1).sum()
    return finalDf
There are two assumptions that I make during this approach:
The indexes maintain their order across all partial dataframes
There are an equivalent number of indexes across all partial dataframes
While this is probably not the most elegant solution, it achieves the desired result and reduced the processing time by a very significant margin (from roughly 10 s to 0.2 s on ~4k rows). If anybody has a better way to deal with this type of scenario and do the process outlined above in one shot, then I would love to see your response!
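For what it's worth, here is a rough one-shot sketch using pd.wide_to_long followed by a single pivot_table; it assumes the same ActivityNN/QuantityNN column naming as above and has not been timed against the real data:
import pandas as pd

def melt_pivot(df):
    # Reshape each Activity/Quantity pair into long form in one call,
    # then pivot once, summing quantities per (Location, Activity).
    long_df = pd.wide_to_long(df.reset_index(), stubnames=['Activity', 'Quantity'],
                              i='index', j='pair', suffix=r'\d+')
    long_df = long_df.dropna(subset=['Activity'])
    return long_df.pivot_table(index='Location', columns='Activity',
                               values='Quantity', aggfunc='sum', fill_value=0)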
Rather than a question, this is more of a workaround for a problem I was having when reading tmy3 files. I hope it can be of use to some of you. I am still a novice when it comes to coding and Python, so there may be simpler ways.
PROBLEM
Upon using the iotools.read_tmy3() function, I followed examples and code outlined by others. This included:
1. Reading the tmy3 datafile;
2. Coercing the year to 2015 (or any other year you like); and,
3. Shifting my index by 30 minutes to match the solar radiation simulation.
The code I used was:
tmy_data, metadata = pvlib.iotools.read_tmy3(datapath, coerce_year=2015)
tmy_data = tmy_data.shift(freq='-30Min')['2015']
tmy_data.index.name = 'Time'
By using this code, you lose the final row of your DataFrame. Removing the ['2015'] in line two resolves this, but then the final date falls in 2014. For the purpose of my work, I needed the final index to remain consistent.
WHAT I TRIED
I attempted to shift index of the final row, unsuccessfully, using the shift method.
I attempted to reindex by setting the index of the final row equal to the DatetimeIndex I wanted.
I attempted to remove the timezone data from the index, modify the index, then reapply the timezone data.
All of these were overly complicated, and did not help resolve my issue.
WHAT WORKED FOR ME
The code for what I did is shown below. These were my steps:
1. Identify the index from my final row, and copy its data;
2. Drop the final row of my tmy_data DataFrame;
3. Create a new dataframe with the shifted date and copied data; and,
4. Append the row to my tmy_data DataFrame
It is a bit tedious, but it is an easy fix when reading multiple tmy3 files with multiple time zones.
# Remove the specific year from line 2 in the code above
tmy_data = tmy_data.shift(freq='-30Min')

# Identify the last row, and copy the data
last_date = tmy_data.iloc[[-1]].index
copied_data = tmy_data.iloc[[-1]]
copied_data = copied_data.to_dict(orient='list')

# Drop the row with the incorrect index
tmy_data.drop(last_date, inplace=True)

# Shift the date of the last row forward by one year
last_date = last_date.shift(n=1, freq='A')

# Create a new DataFrame with the shifted date and copied data
last_row = pd.DataFrame(data=copied_data, index=last_date)
tmy_data = tmy_data.append(last_row)
After doing this, my final indices are:
2015-01-01 00:30:00-05:00
....
2015-12-31 22:30:00-05:00
2015-12-31 23:30:00-05:00
This way, the DataFrame contains 8760 rows, as opposed to 8759 through the previous method.
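A possibly shorter alternative, which I have not tested and which assumes the same tmy_data frame from read_tmy3, would be to leave the rows alone and simply push the lone 2014 timestamp forward by a year in the index itself:
import pandas as pd

# Hedged sketch: rewrite only the last index label instead of
# dropping and re-appending the final row.
idx = tmy_data.index.to_series()
idx.iloc[-1] = idx.iloc[-1] + pd.DateOffset(years=1)
tmy_data.index = pd.DatetimeIndex(idx)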
I hope this helps others!
I have a pandas.DataFrame object containing 2 time series. One series is much shorter than the other.
I want to determine the first date for which data is available in the shorter series, and remove the rows in both columns before that date.
What is the most pythonic way to do that?
(I apologize that I don't really follow the SO guideline for submitting questions)
Here is a fragment of my dataframe:
osr go
Date
1990-08-17 NaN 239.75
1990-08-20 NaN 251.50
1990-08-21 352.00 265.00
1990-08-22 353.25 274.25
1990-08-23 351.75 290.25
In this case, I want to get rid of all rows before 1990-08-21 (I should add that there may be NAs in one of the columns for more recent dates).
You can reverse the column with df['osr'][::-1], use idxmax on isnull to find the latest date with a missing value, and then take a subset of df:
print df
# osr go
#Date
#1990-08-17 NaN 239.75
#1990-08-20 NaN 251.50
#1990-08-21 352.00 265.00
#1990-08-22 353.25 274.25
#1990-08-23 351.75 290.25
s = df['osr'][::-1]
print s
#Date
#1990-08-23 351.75
#1990-08-22 353.25
#1990-08-21 352.00
#1990-08-20 NaN
#1990-08-17 NaN
#Name: osr, dtype: float64
maxnull = s.isnull().idxmax()
print maxnull
#1990-08-20 00:00:00
print df[df.index > maxnull]
# osr go
#Date
#1990-08-21 352.00 265.00
#1990-08-22 353.25 274.25
#1990-08-23 351.75 290.25
EDIT: New answer based upon comments/edits
It sounds like the data is sequential and once you have lines that don't have data you want to throw them out. This can be done easily with dropna.
df = df.dropna()
This answer assumes that once you are past the bad rows, they stay good, or that you don't care about dropping rows in the middle. It depends on how sequential you need to be. If the data needs to be sequential and your input is well formed, jezrael's answer is good.
Original answer
You haven't given much by way of structure in your dataframe, so I am going to make assumptions. I'm going to assume you have many columns, two of which, time_series_1 and time_series_2, are the ones you referred to in your question, and that this is all stored in df.
First we can find the shorter series (the one with fewer non-missing values) by just using
shorter_col = df['time_series_1'] if df['time_series_1'].count() < df['time_series_2'].count() else df['time_series_2']
Now we want the last date in that
remove_date = max(shorter_col)
Now we want to remove data before that date
mask = (df['time_series_1'] > remove_date) | (df['time_series_2'] > remove_date)
df = df[mask]
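For completeness, here is a shorter sketch along the lines of the example frame in the question (columns osr and go indexed by Date); it is a hedged suggestion, not tested against the full data:
# Hedged sketch: keep rows from the later of the two columns' first valid dates.
start = max(df['osr'].first_valid_index(), df['go'].first_valid_index())
df = df.loc[start:]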
Potentially a slightly misleading title but the problem is this:
I have a large dataframe with multiple columns. This looks a bit like
df =
id date value
A 01-01-2015 1.0
A 03-01-2015 1.2
...
B 01-01-2015 0.8
B 02-01-2015 0.8
...
What I want to do is, within each of the IDs, identify the date one week earlier and place the value from that date into e.g. a 'lagvalue' column. The problem is that not all dates exist for all ids, so a simple .shift(7) won't pull the correct value [in this instance I guess I should put a NaN in].
I can do this with a lot of horrible iterating over the dates and ids to find the value, for example some rough idea
[
    df[
        df['date'] == df['date'].iloc[i] - datetime.timedelta(weeks=1)
    ][
        df['id'] == df['id'].iloc[i]
    ]['value']
    for i in range(len(df.index))
]
but I'm certain there is a 'better' way to do it that cuts down on time and processing that I just can't think of right now.
I could write a function using a groupby on the id and then look within that, and I'm certain that would reduce the time the operation takes. Is there a much quicker, simpler way [aka am I having a dim day]?
Basic strategy is, for each id, to:
Use date index
Use reindex to expand the data to include all dates
Use shift to shift 7 spots
Use ffill to do last-value interpolation. I'm not sure if you want this, or possibly bfill, which will use the last value less than a week in the past. But it is simple to change. Alternatively, if you want NaN when no value is available 7 days in the past, you can just remove the *fill completely.
Drop unneeded data
This algorithm gives NaN when the lag is too far in the past.
There are a few assumptions here. In particular that the dates are unique inside each id and they are sorted. If not sorted, then use sort_values to sort by id and date. If there are duplicate dates, then some rules will be needed to resolve which values to use.
import pandas as pd
import numpy as np

dates = pd.date_range('2001-01-01', periods=100)
dates = dates[::3]
A = pd.DataFrame({'date': dates,
                  'id': ['A'] * len(dates),
                  'value': np.random.randn(len(dates))})

dates = pd.date_range('2001-01-01', periods=100)
dates = dates[::5]
B = pd.DataFrame({'date': dates,
                  'id': ['B'] * len(dates),
                  'value': np.random.randn(len(dates))})

df = pd.concat([A, B])

with_lags = []
for id, group in df.groupby('id'):
    group = group.set_index(group.date)
    index = group.index
    group = group.reindex(pd.date_range(group.index[0], group.index[-1]))
    group = group.ffill()
    group['lag_value'] = group.value.shift(7)
    group = group.loc[index]
    with_lags.append(group)

with_lags = pd.concat(with_lags, axis=0)
with_lags.index = np.arange(with_lags.shape[0])
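As an aside, a merge-based sketch can also do this without reindexing: join each row to the row dated exactly 7 days earlier within the same id, leaving NaN where no such date exists (which matches the behaviour asked for in the question). This assumes the df built above and is offered as a hedged alternative rather than a drop-in replacement:
# Hedged alternative: self-merge on (id, date + 7 days).
lagged = df.assign(date=df['date'] + pd.Timedelta(days=7))
lagged = lagged.rename(columns={'value': 'lag_value'})
merged = df.merge(lagged, on=['id', 'date'], how='left')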