I have a data frame covering a date range (start="2018-09-09", end="2020-02-02") with values from 1 to 513.
I have another data frame in the same format, but with only 3 dates.
For each date in the second data frame, I want the 2 dates before it and the 1 date after it from the first data frame. What I mean is this:
Edited: Corrected answer as per the question
If you do this:
keep = []
for val in df2['value']:
    keep += [val-3, val-2, val-1, val]
df_final = df1.take(keep)
Assumption: Your value column always starts from 1 and is sequential. Also, its datatype is integer, not string.
What it does:
The row number (index) of every date = value of that row - 1, since indices start from 0.
So this keeps only the value-3 (2 days before), value-2 (1 day before), value-1 (the day present in df2) and value (1 day after) indices in the keep list.
Then DataFrame.take(indices) does the work for us: it takes from df1 only the rows whose positions are listed in the indices argument.
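A minimal sketch of the approach above on a small, made-up frame (10 dates with values 1 to 10; this is hypothetical data, not the asker's), assuming the value column is the 1-based row number:

```python
import pandas as pd

# Hypothetical stand-ins for the two frames in the question
df1 = pd.DataFrame({
    "date": pd.date_range("2018-09-09", periods=10),
    "value": range(1, 11),
})
df2 = df1[df1["value"].isin([3, 7])]

keep = []
for val in df2["value"]:
    # value v sits at position v - 1, so these positions are:
    # 2 days before, 1 day before, the day itself, 1 day after
    keep += [val - 3, val - 2, val - 1, val]

df_final = df1.take(keep)
print(df_final)
```

Note that take counts negative positions from the end of the frame, so a date in the first two rows would silently pull rows from the bottom, and a date in the last row would raise an IndexError. Those edge cases would need clipping.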
I have a dataframe from yahoo finance
import pandas as pd
import yfinance
ticker = yfinance.Ticker("INFY.NS")
df = ticker.history(period = '1y')
print(df)
This gives me df as,
If I specify,
date = "2021-04-23"
I need a subset of df containing the row with index label "2021-04-23",
the rows for the 2 days before that date,
and the row for the 1 day after that date.
The important thing here is that we cannot calculate before and after using date strings, because df may be missing some dates; the rows must be selected by position (i.e. the 2 rows at the previous index positions and the 1 row at the next index position).
For example, df has no "2021-04-21", but it does have "2021-04-20".
How can we implement this?
You can go for integer-based indexing. First find the integer location of the desired date and then take the desired subset with iloc:
import numpy as np

def get_subset(df, date):
    # get the integer positions of the matching date(s)
    matching_dates_inds, = np.nonzero(df.index == date)
    # and take the first one (works in case of duplicates)
    first_matching_date_ind = matching_dates_inds[0]
    # take the 4-row subset: 2 rows before, the date itself, 1 row after
    # (max guards against a negative start when the date is in the first two rows)
    start = max(first_matching_date_ind - 2, 0)
    desired_subset = df.iloc[start: first_matching_date_ind + 2]
    return desired_subset
If you need the rows before and after by position (and the date always exists in the DatetimeIndex), use DataFrame.iloc with the position returned by Index.get_loc. Wrap the bounds in max and min so that the selection still works when there are fewer than 2 rows before the date or no row after it, as in this sample data:
df = pd.DataFrame({'a':[1,2,3]},
                  index=pd.to_datetime(['2021-04-21','2021-04-23','2021-04-25']))
date = "2021-04-23"
pos = df.index.get_loc(date)
df = df.iloc[max(0, pos-2):min(len(df), pos+2)]
print (df)
            a
2021-04-21  1
2021-04-23  2
2021-04-25  3
Notice:
max and min are added so that the selection still works when the date is the first row (no 2 rows before it), the second row (only 1 row before it), or the last row (no row after it).
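The same idea can be wrapped in a small helper. This is only a sketch: window_around and its before/after parameters are names made up here, and it assumes a unique, timezone-naive DatetimeIndex that contains the requested date.

```python
import pandas as pd

def window_around(df, date, before=2, after=1):
    """Return the row at `date` plus `before` rows above and `after` rows below it, by position."""
    # get_loc raises KeyError if the date is not in the index
    pos = df.index.get_loc(pd.Timestamp(date))
    start = max(0, pos - before)              # clip at the top of the frame
    stop = min(len(df), pos + after + 1)      # clip at the bottom of the frame
    return df.iloc[start:stop]

# Hypothetical sample with gaps in the dates (no 2021-04-21, no weekend)
sample = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5, 6]},
    index=pd.to_datetime(["2021-04-19", "2021-04-20", "2021-04-22",
                          "2021-04-23", "2021-04-26", "2021-04-27"]),
)
subset = window_around(sample, "2021-04-23")
print(subset)
```

Because the window is computed from positions rather than calendar arithmetic, the missing dates do not matter: the two rows before 2021-04-23 are simply whatever rows precede it in the frame.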
I have a pandas dataframe with a datetime index and some column, 'value'. I would like to compare the 'value' value at a given time of day to the value at a different time of the same day. E.g. compare the 10am value to the 10pm value.
Right now I can get the value at either side using:
mask = df[(df.index.hour == hour)]
The problem is that this returns a dataframe indexed by the timestamps at that hour. So doing mask1.value - mask2.value returns NaNs, since the indexes are different.
I can get around this in a convoluted way:
out = mask.value.loc["2020-07-15"].reset_index() - mask2.value.loc["2020-07-15"].reset_index() #assuming mask2 is the same as the mask call but at a different hour
but this is tiresome to loop over for a dataset that spans years. (Obviously I could increment a timedelta in the loop to avoid the hard-coded dates.)
I don't actually care if some NaNs get into the end result when some values, e.g. at 10am, are missing.
Edit:
Initial dataframe:
index values
2020-05-10T10:00:00 23
2020-05-10T11:00:00 20
2020-05-10T12:00:00 5
.....
2020-05-30T22:00:00 8
2020-05-30T23:00:00 8
2020-05-30T24:00:00 9
Expected dataframe:
index date newval
0 2020-05-10 18
.....
x 2020-05-30 1
where newval is the difference between the values at the two times I described above (e.g. the 10am measurement minus the 12pm measurement, so 23 - 5 = 18); the second entry is made up.
It doesn't matter to me whether date is a separate column or the index.
A workaround:
mask1 = df[(df.index.hour == hour1)]
mask2 = df[(df.index.hour == hour2)]
out = mask1.values - mask2.values # .values returns a NumPy array without the index; both masks must have the same number of rows
result_df = pd.DataFrame(index=pd.date_range(start, end), data=out)
It should save you the effort of looping over the dates
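If some days are missing one of the two readings, the raw arrays no longer line up. A possible alternative, sketched here under the assumption that there is at most one reading per hour per day, is to re-index both selections by calendar date and let pandas align them. The helper name hour_difference and the sample data are made up for illustration.

```python
import pandas as pd

def hour_difference(df, hour1, hour2, column="value"):
    """Per-day difference between the reading at hour1 and the reading at hour2."""
    s = df[column]
    first = s[s.index.hour == hour1]
    second = s[s.index.hour == hour2]
    # Drop the time of day so both selections are indexed by date only
    first.index = first.index.normalize()
    second.index = second.index.normalize()
    # Subtraction aligns on the date; days missing either reading become NaN
    return (first - second).rename("newval")

# Hypothetical readings: 2020-05-11 has no 12:00 value
readings = pd.DataFrame(
    {"value": [23, 5, 7, 10, 4]},
    index=pd.to_datetime([
        "2020-05-10 10:00", "2020-05-10 12:00",
        "2020-05-11 10:00",
        "2020-05-12 10:00", "2020-05-12 12:00",
    ]),
)
diff = hour_difference(readings, 10, 12)
print(diff)
```

Calling diff.reset_index() afterwards would turn the date into a regular column, if that layout is preferred.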
I have electricity usage data. During power outages the recorded value is 0.
I want to replace those 0s with the value from the same time in the previous week, which is 168 rows ahead of or behind the current row in the dataset.
In the code below, I save the index of every zero, then run a loop that should copy the value lying 168 rows ahead in the dataset into the current row.
Index_Zero = data[data["Total"]==0].index.to_list() #Output = list of indexes where all the zeros lie
print(Index_Zero[0]) #Output = 2
for i in Index_Zero:
    data.loc[(Index_Zero[i]), 'Total']=data.loc[(Index_Zero[i+168]), 'Total']
Also, if I print
data.loc[(Index_Zero[0]), 'Total']=data.loc[(Index_Zero[2]), 'Total']
print(data.loc[(Index_Zero[0]), 'Total'])
Output: 0.0
DataSet:
   Date         Time         Total
0  23-Jan-2019  12:00:00 AM  18343.00
1  23-Jan-2019  01:00:00 AM  18188.00
2  23-Jan-2019  02:00:00 AM      0.00
3  23-Jan-2019  03:00:00 AM  23394.00
4  23-Jan-2019  04:00:00 AM  20037.00
I think a more natural solution is to:
Set the index to "true" datetime, derived from Date and Time columns.
Run a loop over indices of rows with Total == 0.
Retrieve the value from a row with index 1 week back.
Save this value as Total in the row with current index.
Finally reset the index to what it was before.
To perform this, run:
df.set_index(pd.to_datetime(df.Date + ' ' + df.Time), inplace=True)
for ind in df[df.Total.eq(0)].index:
    df.loc[ind, 'Total'] = df.loc[ind - pd.Timedelta('1W'), 'Total']
df.reset_index(drop=True, inplace=True)
Note that the loop must be based only on indices, not on full rows.
The reason is that a power outage could occur at the same weekday and hour in 2 (or more) consecutive weeks.
A loop based on full rows (for ... in df[df.Total.eq(0)].iterrows():) would always retrieve the original Total values: while processing the row for the later week, it would not see the update already made to the row for the earlier week (assuming that both rows initially contained 0).
Another remark
Assuming that your rows are ordered by Date / Time, your original code should:
Refer to the current index minus 168 (one week before, not after).
Subtract 168 from the current index (Index_Zero[i]) itself.
So this fragment of code should actually be data.loc[(Index_Zero[i] - 168), 'Total'].
But my solution is robust to any missing rows in the DataFrame, so I advise using it.
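The loop above raises a KeyError when no row exists exactly one week earlier (for example, an outage during the first week of data). A hedged sketch with a guard for that case, using synthetic hourly data invented for illustration (192 rows, Total equal to the 1-based row number):

```python
import numpy as np
import pandas as pd

# Synthetic data: 8 days of hourly readings starting 23-Jan-2019
idx = pd.date_range("2019-01-23", periods=192, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"Total": np.arange(1, 193, dtype=float)}, index=idx)

# Simulate two outages: one in the first week, one in the second week
df.iloc[[5, 170], 0] = 0.0

one_week = pd.Timedelta(weeks=1)
for ts in df.index[df["Total"].eq(0)]:
    prev = ts - one_week
    # Only fill when a row exists exactly one week earlier
    if prev in df.index:
        df.loc[ts, "Total"] = df.loc[prev, "Total"]

print(df.iloc[[5, 170]])
```

Here the outage at position 170 is filled from position 2 (one week earlier), while the outage at position 5 stays 0 because there is no earlier week to copy from. How to handle that remaining case (look one week ahead, interpolate, or leave as is) is a separate decision.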
Here's what I think is the problem. You are replacing the value of data.loc[(Index_Zero[i]), 'Total'] with the value at the row labelled Index_Zero[i+168], and every entry of the Index_Zero list points to a row whose Total is 0 (you built the list exactly for that). I think this is an innocent mistake.
Change your code to this:
Index_Zero = data[data["Total"]==0].index.to_list() #Output = list of indexes where all the zeros lie
print(Index_Zero[0]) #Output = 2
for i in Index_Zero:
    # i is already a row label taken from Index_Zero, so use it directly
    data.loc[i, 'Total'] = data.loc[i + 168, 'Total']
The problem was in the range of the for loop. It was iterating beyond the end of the list.
Index_Zero = data[data["Total"]==0].index.to_list()
for items in range(len(Index_Zero)):
    # offset the row label (not the list position) by 168 rows
    data.loc[Index_Zero[items], 'Total'] = data.loc[Index_Zero[items] + 168, 'Total']
I want to update the NaN values in the mergeAllGB.Intensity column with values from another dataframe where ID, weekday and hour match. I'm trying:
mergeAllGB.Intensity[mergeAllGB.Intensity.isnull()] = precip_hourly[precip_hourly.SId == mergeAllGB.SId & precip_hourly.Hour == mergeAllGB.Hour & precip_hourly.Weekday == mergeAllGB.Weekday].Intensity
However, this returns ValueError: Series lengths must match to compare. How could I do this?
Minimal example:
Inputs:
_______
mergeAllGB
SId  Hour  Weekday  Intensity
1    12    5        NaN
2    5     6        3
precip_hourly
SId  Hour  Weekday  Intensity
1    12    5        2
Desired output:
________
mergeAllGB
SId  Hour  Weekday  Intensity
1    12    5        2
2    5     6        3
TL;DR this will (hopefully) work:
# Set the index to compare by
df = mergeAllGB.set_index(["SId", "Hour", "Weekday"])
fill_df = precip_hourly.set_index(["SId", "Hour", "Weekday"])
# Fill the nulls with the relevant values of intensity
df["Intensity"] = df.Intensity.fillna(fill_df.Intensity)
# Cancel the special indexes
mergeAllGB = df.reset_index()
Alternatively, the line before the last could be
df.loc[df.Intensity.isnull(), "Intensity"] = fill_df.Intensity
Assignment and comparison in pandas are done by index (which isn't shown in your example).
In the example, running precip_hourly.SId == mergeAllGB.SId results in ValueError: Can only compare identically-labeled Series objects. This is because pandas compares the two columns label by label, but precip_hourly doesn't have a row with index 1 (default indexing starts at 0), so the comparison fails.
Even if we assume the comparison succeeded, the assignment stage is problematic.
Pandas tries to assign according to the index, but with the default index this doesn't have the intended meaning.
Luckily, we can use this to our benefit: by setting the index to ["SId", "Hour", "Weekday"], any comparison and assignment will be done relative to this index. So running df.Intensity = fill_df.Intensity will assign to df.Intensity the values in fill_df.Intensity wherever the index matches, that is, wherever they have the same ["SId", "Hour", "Weekday"].
In order to assign only to the places where Intensity is NA, we need to filter first (or use fillna). Note that filtering by df.Intensity[df.Intensity.isnull()] will work, but assignment to it will probably fail if you have several rows with the same (SId, Hour, Weekday) values.
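For reference, here is the fillna variant run end to end on the minimal example from the question. It is a sketch that assumes each (SId, Hour, Weekday) combination appears at most once in precip_hourly; the intermediate names keys, filled and lookup are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# The two frames from the question's minimal example
mergeAllGB = pd.DataFrame({
    "SId": [1, 2], "Hour": [12, 5], "Weekday": [5, 6],
    "Intensity": [np.nan, 3.0],
})
precip_hourly = pd.DataFrame({
    "SId": [1], "Hour": [12], "Weekday": [5], "Intensity": [2.0],
})

keys = ["SId", "Hour", "Weekday"]
filled = mergeAllGB.set_index(keys)
lookup = precip_hourly.set_index(keys)["Intensity"]

# fillna aligns `lookup` on the (SId, Hour, Weekday) index,
# so only NaN rows with a matching key are filled
filled["Intensity"] = filled["Intensity"].fillna(lookup)
result = filled.reset_index()
print(result)
```

Rows whose key has no match in precip_hourly simply keep their NaN, and existing non-null values are left untouched.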
My dataframe1 contains the day column which has numeric data from 1 to 7 for each day of the week. 1 - Monday, 2 - Tuesday...etc.
This day column is the day of Departure of a flight.
I need to create a new column dayOfBooking in a second dataframe2, which gives the day of the week on which a person booked the flight, based on the number of days before departure that the booking was made and the day of departure of the flight.
For that I've written this function:
def findDay(dayOfDeparture, beforeDay):
    beforeDay = int(beforeDay)
    beforeDay = beforeDay % 7
    if (dayOfDeparture - beforeDay) > 0:
        dayAns = dayOfDeparture - beforeDay
    else:
        dayAns = 7 - abs(dayOfDeparture - beforeDay)
    return dayAns
I want something like:
dataframe2["dayOfBooking"] = findDay(dataframe1["day"], i)
where i is the scalar value.
I can see that findDay receives the entire day column of dataframe1 instead of a single value for each row.
Is there an easy way to accomplish this, like when we want a third column to be the sum of two other columns for each row and can just write this:
dataframe["sum"] = dataframe2["val1"] + dataframe2["val2"]
EDIT: Figured it out. Answer and explanation below.
df2["colname"] = df.apply(lambda row: findDay(row['col'], i), axis = 1)
We can use the apply function when we want to pass each row's value of a particular column to a user-defined function.
axis = 1 means the function is called once per row (rather than once per column).
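As a side note, the if/else in findDay can be expressed with modular arithmetic, which works on a whole column at once and so avoids apply altogether. The sketch below is an assumption-laden rewrite rather than the original function: find_day is a new name, and it relies on days being numbered 1 to 7 as in the question.

```python
import pandas as pd

def find_day(day_of_departure, days_before):
    """Day of week (1-7) that falls `days_before` days before `day_of_departure`."""
    # Shift to 0-6, step back, wrap around with % 7, shift back to 1-7
    return (day_of_departure - int(days_before) - 1) % 7 + 1

# Hypothetical departure days and a scalar booking lead time
dataframe1 = pd.DataFrame({"day": [1, 2, 3, 7]})
i = 3

# Row-by-row, as in the answer above
via_apply = dataframe1.apply(lambda row: find_day(row["day"], i), axis=1)

# Whole column at once: arithmetic operators broadcast over the Series
vectorized = find_day(dataframe1["day"], i)

print(via_apply.tolist(), vectorized.tolist())
```

Both calls give the same result here; the vectorized form is usually faster on large frames because it skips the per-row Python function call.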