I have a dataframe of daily temperatures, called PORResult, where rows are years and each column is a day (121 rows x 365 columns). I also have an array of threshold temperatures, called Percentile_90, with one value per day (length 365). For every day of every year in PORResult, I want to find out whether the value for that day is higher than the corresponding value in Percentile_90, and store the results in a new dataframe called Count (121 rows x 365 columns). Count starts out full of zeros; if the daily value in PORResult is greater than the daily value in Percentile_90, I want to change the corresponding value in Count to 1.
This is what I'm starting with:
for i in range(len(PORResult)):
    if PORResult.loc[i] > Percentile_90[i]:
        CountResult[i] += 1
But when I try this I get KeyError: 0. What else can I try?
Depending on your data structure, I think
CountResult = PORResult.gt(Percentile_90, axis=1).astype(int)
should do the trick (axis=1 because your threshold array lines up with the columns, one value per day). Generally, the toolset provided in pandas is rich enough that for-looping over a dataframe is unnecessary (as well as remarkably inefficient).
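A minimal, self-contained sketch of the same idea on made-up data (the 3x4 shape here is just for illustration):

import numpy as np
import pandas as pd

# Toy stand-ins: 3 "years" x 4 "days" instead of 121 x 365
PORResult = pd.DataFrame(np.random.default_rng(0).uniform(0, 40, (3, 4)))
Percentile_90 = np.full(4, 20.0)  # one threshold per day (column)

# Compare each row against the per-day thresholds; axis=1 aligns the
# array with the columns, and astype(int) turns True/False into 1/0
CountResult = PORResult.gt(Percentile_90, axis=1).astype(int)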
I have a data frame with start_date and end_date columns (e.g. 01-02-2020); based on these two dates a record can be daily (if start and end are one day apart), and similarly monthly, quarterly, or yearly. There is also a Value column (e.g. 3.5).
Now suppose there is one monthly record with value 2.5, one quarterly record with 4.5, multiple daily records like 1.5, and one yearly record like 0.5.
Then I need one row per date, e.g. 01-01-2020, summing the values of every record whose range covers that date (2.5 + 4.5 + 1.5 + 0.5 = 9), hence 9 is total_value on 01-01-2020.
There are years of data like this, with multiple records covering the same time period, and I need the aggregated value, date by date, for every distinct 'name'.
I have been trying to do this in Python with no success so far. Any help is appreciated.
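One way to approach this (a sketch on made-up data, assuming columns named name, start_date, end_date, and value) is to expand each record into one row per day it covers, then sum per name and day:

import pandas as pd

# Hypothetical records: a daily, a monthly, a quarterly, and a yearly row,
# all covering 2020-01-01
df = pd.DataFrame({
    "name": ["A", "A", "A", "A"],
    "start_date": ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01"],
    "end_date": ["2020-01-01", "2020-01-31", "2020-03-31", "2020-12-31"],
    "value": [1.5, 2.5, 4.5, 0.5],
})
df["start_date"] = pd.to_datetime(df["start_date"])
df["end_date"] = pd.to_datetime(df["end_date"])

# One row per covered day, then sum per (name, date)
df["date"] = [pd.date_range(s, e) for s, e in zip(df["start_date"], df["end_date"])]
daily = (df.explode("date")
           .groupby(["name", "date"], as_index=False)["value"]
           .sum()
           .rename(columns={"value": "total_value"}))
print(daily.head())  # total_value for 2020-01-01 is 1.5+2.5+4.5+0.5 = 9.0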
I have a pandas dataframe whose index column is time with hourly precision. I want to create a new column that compares the value of the "Sales number" column at each hour with its value at the exact same time one week earlier.
I know that it can be written using the shift function:
df['compare'] = df['Sales'] - df['Sales'].shift(7*24)
But I wonder how I can take advantage of the datetime format of the index. I mean, is there an alternative to shift(7*24) when the index is a datetime?
Try something with
df['Sales'].shift(7, freq='D')
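A quick sketch on toy data showing why this works: with freq='D', shift moves the index forward seven days instead of shifting positions, so the subtraction aligns on timestamps and tolerates missing hours:

import numpy as np
import pandas as pd

# Two weeks of hourly data with a made-up Sales column
idx = pd.date_range("2023-01-01", periods=24 * 14, freq="h")
df = pd.DataFrame({"Sales": np.arange(len(idx), dtype=float)}, index=idx)

# Index-based shift: each timestamp is compared with itself minus 7 days
df["compare"] = df["Sales"] - df["Sales"].shift(7, freq="D")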
I have a dataframe that contains GPS locations of vehicles received at various times in a day. For each vehicle, I want to resample to hourly data such that I have the median report (according to the timestamp) for each hour of the day. For hours with no corresponding rows, I want a blank row.
I am using the following code:
for i, j in enumerate(list(df.id.unique())):
    data = df.loc[df.id == j]
    data['hour'] = data['timestamp'].dt.hour
    data_grouped = data.groupby(['imo', 'hour']).median().reset_index()
    data = data_grouped.set_index('hour').reindex(idx).reset_index()  # idx is a list of integers from 0 to 23
Since my dataframe has millions of ids, it takes a lot of time to iterate through all of them. Is there an efficient way of doing this?
Unlike Pandas reindex dates in Groupby, I have multiple rows for each hour, in addition to some hours having no rows at all.
Tested in the latest version of pandas: convert the hour column to a categorical with all possible categories, then aggregate without a loop (passing observed=False keeps a row for hours that never occur):
df['hour'] = pd.Categorical(df['timestamp'].dt.hour, categories=range(24))
df1 = df.groupby(['id','imo','hour'], observed=False).median().reset_index()
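A self-contained demo of the idea on made-up data (two vehicles, a few reports each; the column names are assumptions, and the question's extra 'imo' key is dropped for brevity):

import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "timestamp": pd.to_datetime([
        "2023-01-01 05:10", "2023-01-01 05:40",
        "2023-01-01 09:05", "2023-01-01 12:30",
    ]),
    "lat": [10.0, 10.2, 10.4, 20.0],
})

df["hour"] = pd.Categorical(df["timestamp"].dt.hour, categories=range(24))
# observed=False keeps all 24 hour categories per id, so hours without
# any report come out as NaN rows -- the blank rows the question asks for
df1 = df.groupby(["id", "hour"], observed=False)["lat"].median().reset_index()
print(len(df1))  # 48 rows: 2 ids x 24 hours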
Ok, here is my situation (leaving out uninteresting things):
Dataframe from a csv file, where I get information about the inventory of stores, like
Date,StoreID,…,InventoryCount
The rows are sorted by Date, but not by StoreID, and the number of stores can vary over this time series.
What I want:
I want to add a column to the dataframe with the change in InventoryCount from one day to the previous one.
For that I was trying:
for name, group in df.groupby(["StoreID"]):
    for i in range(1, len(group)):
        group.loc[i, 'InventoryChange'] = group.loc[i, 'InventoryCount'] - group.loc[i - 1, 'InventoryCount']
Your code explicitly iterates through the rows, which is a terrible idea in pandas, both aesthetically and performance-wise. Instead, replace the last two lines with:
group['InventoryChange'] = group['InventoryCount'].diff(n)
where n is the number of days you are interested in (1 in your example, 8 in your comment).
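Note that assigning to group inside the loop only modifies a copy, so a cleaner variant (a sketch on made-up data) skips the loop entirely and lets groupby apply diff per store:

import pandas as pd

# Made-up inventory rows, sorted by Date but not by StoreID
df = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "StoreID": [2, 1, 1, 2],
    "InventoryCount": [20, 10, 7, 25],
})
df = df.sort_values(["StoreID", "Date"])

# diff() never crosses store boundaries, so each store's first row is NaN
df["InventoryChange"] = df.groupby("StoreID")["InventoryCount"].diff()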
I have a sheet like this. I need to calculate the absolute value of "CURRENT HIGH" - "PREVIOUS DAY CLOSE PRICE" per "INSTRUMENT" and "SYMBOL".
So I used the .shift(1) method of the pandas dataframe to create a lagged close column, and then subtracted it from the current HIGH. But that also subtracts across two different "INSTRUMENT" and "SYMBOL" values; when a new SYMBOL or INSTRUMENT appears, I want the first row to be NULL instead of subtracting the current HIGH and the lagged close of the previous group.
What should I do?
I believe you need this, assuming all days are consecutive within each group (with .abs() for the absolute value you asked for):
df['new'] = df['HIGH'].sub(df.groupby(['INSTRUMENT','SYMBOL'])['CLOSE'].shift()).abs()
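A quick demonstration on toy data (the column names follow the question; the values are made up):

import pandas as pd

df = pd.DataFrame({
    "INSTRUMENT": ["FUT", "FUT", "FUT", "FUT"],
    "SYMBOL": ["AAA", "AAA", "BBB", "BBB"],
    "HIGH": [105.0, 108.0, 50.0, 52.0],
    "CLOSE": [100.0, 107.0, 49.0, 51.0],
})

# shift() inside the groupby leaves each group's first row as NaN, so the
# subtraction never crosses an (INSTRUMENT, SYMBOL) boundary
prev_close = df.groupby(["INSTRUMENT", "SYMBOL"])["CLOSE"].shift()
df["new"] = df["HIGH"].sub(prev_close).abs()
print(df)  # rows 0 and 2 get NaN; rows 1 and 3 get |108-100| and |52-49|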