How to select weekday / hour in pandas when doing iteration - python

I'm filtering a dataframe by hour and weekday:
if type == 'daily':
    hour = data.index.hour
    day = data.index.weekday
    selector = (hour != 17) | ((day != 5) & (day != 6))
    data = data[selector]
if type == 'weekly':
    day = data.index.weekday
    selector = (day != 5) & (day != 6)
    data = data[selector]
Then I'm using a for loop in which I need to write a conditional based on the weekday/hour, but row.index doesn't carry any of that information. What can I do in this case?
I need to do something like this (it won't work as written, since row.index doesn't have weekday or hour info):
for index, row in data.iterrows():
    if (type == 'weekly' and row.index.weekday == 1 and row.index.hour == 0 and row.index.minute == 0) or \
       (type == 'daily' and row.index.hour == 18 and row.index.minute == 0):
Thanks in advance.

I know this is not the most elegant way, but you could create your variables as columns:
df['Hour'] = df.index.hour
If you need a min or a max based on those variables, you could create another column and use rolling_min or similar rolling functions.
Once you have your columns, you can iterate as you please with the iteration you suggested.
There's info about the index properties here
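A minimal sketch of that approach (assuming data has a DatetimeIndex; the Weekday/Minute column names are just illustrative):

data['Hour'] = data.index.hour
data['Weekday'] = data.index.weekday
data['Minute'] = data.index.minute

# Each row now carries its own time info during iteration.
for index, row in data.iterrows():
    if type == 'weekly' and row['Weekday'] == 1 and row['Hour'] == 0 and row['Minute'] == 0:
        ...  # weekly case
    elif type == 'daily' and row['Hour'] == 18 and row['Minute'] == 0:
        ...  # daily case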

Related

Matching 3 different column elements of 2 different dataframes

I am trying to solve a problem where I have two dataframes, df1 and df2, with the same columns. I want to check whether df1['column1'] == df2['column1'], df1['column2'] == df2['column2'], and df1['column3'] == df2['column3'], and if this is true, get the index of both dataframes where the condition matches. I tried the code below, but it takes a long time because I have around 250,000 rows. Can anyone suggest a more efficient way to do this?
Tried solution :
from datetime import datetime
MS_counter = 0
matched_ws_index = []
start = datetime.now()
for MS_id in Mastersheet_df["Index"]:
    WS_counter = 0
    for WS_id in Weekly_sheet_df["Index"]:
        if (Weekly_sheet_df.loc[WS_counter, "Trial ID"] == Mastersheet_df.loc[MS_counter, "Trial ID"]) \
                and (Mastersheet_df.loc[MS_counter, "Biomarker Type"] == Weekly_sheet_df.loc[WS_counter, "Biomarker Type"]) \
                and (WS_id == MS_id):  # match trial id, biomarker type and index
            print("Trial id, index and biomarker type are matched")
            print(WS_counter)
            print(MS_counter)
            matched_ws_index.append(WS_counter)
        WS_counter += 1
    MS_counter += 1
end = datetime.now()
print("The time of execution of above program is:", str(end - start)[5:])
Expected output: if the above three conditions are true, it should give the dataframe index positions like this:
Matched df1 index is = 170
Matched df2 index is = 658
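One vectorized way to avoid the nested loops is an inner merge on the three key columns. This is only a sketch, under the assumption that the column names from the snippet above ('Trial ID', 'Biomarker Type', 'Index') are the three match keys and that the dataframes' own indexes are unnamed:

import pandas as pd

# Bring each dataframe's positional index in as a column so it survives the merge.
left = Mastersheet_df.reset_index().rename(columns={'index': 'MS_index'})
right = Weekly_sheet_df.reset_index().rename(columns={'index': 'WS_index'})

# An inner merge keeps only the rows where all three key columns agree.
matched = left.merge(right, on=['Trial ID', 'Biomarker Type', 'Index'])

matched_ms_index = matched['MS_index'].tolist()
matched_ws_index = matched['WS_index'].tolist()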

Pandas for Loop Optimization (Vectorization) when looking at previous row value

I'm looking to optimize the time taken by a function with a for loop. The code below is OK for smaller dataframes, but for larger dataframes it takes too long. The function effectively creates a new column based on calculations using other column values and parameters. The calculation also considers the value of a previous row for one of the columns. I read that the most efficient way is to use pandas vectorization, but I'm struggling to understand how to implement this when my for loop is considering the previous row value of one column to populate a new column on the current row. I'm a complete novice, but have looked around and can't find anything that suits this specific problem, though I'm searching from a position of relative ignorance, so may have missed something.
The function is below, and I've created a test dataframe and random parameters too. It would be great if someone could point me in the right direction to get the processing time down. Thanks in advance.
def MODE_Gain(Data, rated, MODELim1, MODEin, Normalin, NormalLim600, NormalLim1):
    print('Calculating Gains')
    df = Data
    df.fillna(0, inplace=True)
    df['MODE'] = ""
    df['Nominal'] = ""
    df.iloc[0, df.columns.get_loc('MODE')] = 0
    for i in range(1, len(df.index)):
        print('Computing Status {i}/{r}'.format(i=i, r=len(df.index)))
        if (df['MODE'].loc[i-1] == 1) & (df['A'].loc[i] > Normalin):
            df['MODE'].loc[i] = 1
        elif ((df['MODE'].loc[i-1] == 0) & (df['A'].loc[i] > NormalLim600)) | ((df['B'].loc[i] > NormalLim1) & (df['B'].loc[i] < MODELim1)):
            df['MODE'].loc[i] = 1
        else:
            df['MODE'].loc[i] = 0
    df[''] = df['C'] / 6
    for i in range(len(df.index)):
        print('Computing MODE Gains {i}/{r}'.format(i=i, r=len(df.index)))
        if (df['A'].loc[i] > MODEin) & (df['A'].loc[i] < NormalLim600) & (df['B'].loc[i] < NormalLim1):
            df['Nominal'].loc[i] = rated / 6
        else:
            df['Nominal'].loc[i] = 0
    df["Upgrade"] = df[""] - df["Nominal"]
    return df
import numpy as np
import pandas as pd

A = np.random.randint(0, 28, size=8000)
B = np.random.randint(0, 45, size=8000)
C = np.random.randint(0, 2300, size=8000)
df = pd.DataFrame()
df['A'] = pd.Series(A)
df['B'] = pd.Series(B)
df['C'] = pd.Series(C)
MODELim600 = 32
MODELim30 = 28
MODELim1 = 39
MODEin = 23
Normalin = 20
NormalLim600 = 25
NormalLim1 = 32
rated = 2150
finaldf = MODE_Gain(df, rated, MODELim1, MODEin, Normalin, NormalLim600, NormalLim1)
Your second loop doesn't evaluate the prior row, so you should be able to use this instead:
df['Nominal'] = 0
df.loc[(df['A'] > MODEin) & (df['A'] < NormalLim600) & (df['B'] < NormalLim1), 'Nominal'] = rated/6
For your first loop, the elif statement evaluates (df['B'] > NormalLim1) & (df['B'] < MODELim1) and sets MODE to 1 regardless of the other conditions, so you can pull that condition out and vectorize it. I didn't try it, but this should do it:
df.loc[(df['B'] > NormalLim1) & (df['B'] < MODELim1), 'MODE'] = 1
Then you may be able to collapse the other conditions into one statement using |.
Not sure how much all of that will save you, but you should cut the time in half by getting rid of the second loop.
For vectorizing it, I suggest you first shift your column into another one:
df['MODE_1'] = df['MODE'].shift(1)
and then use:
(df['MODE_1'].loc[i] == 1)
in place of (df['MODE'].loc[i-1] == 1).
After that you should be able to vectorize
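A caveat on both answers: because MODE[i] depends on MODE[i-1], which is itself being computed, the first loop cannot be fully vectorized with a single shift. Here is a sketch that precomputes the vectorizable masks and keeps a plain loop only for the recursive part (my own synthesis, untested against the original data):

import numpy as np

a = df['A'].to_numpy()
b = df['B'].to_numpy()

always_on = (b > NormalLim1) & (b < MODELim1)  # sets MODE to 1 regardless of the prior row
stay_on = a > Normalin                         # keeps MODE at 1 when the prior MODE was 1
turn_on = a > NormalLim600                     # flips MODE to 1 when the prior MODE was 0

mode = np.zeros(len(df), dtype=int)
for i in range(1, len(df)):
    if always_on[i]:
        mode[i] = 1
    elif mode[i - 1] == 1:
        mode[i] = int(stay_on[i])
    else:
        mode[i] = int(turn_on[i])
df['MODE'] = mode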

Conditional operations in dataframe (if else)

I have a dataframe column called Install_Date. I want to assign values to another column called Asset_Age under two conditions: if the value in Install_Date is null, then age = current year - plant construct year; if the value is not null, then age = current year - INPUT_Asset["Install_Date"].
This is the code I have. The first condition works fine, but the second condition still gives 0 as values:
Plant_Construct_Year = 1975
this_year = 2020
for i in INPUT_Asset["Install_Date"]:
    if i != 0.0:
        INPUT_Asset["Asset_Age"] = this_year - INPUT_Asset["Install_Date"]
    else:
        INPUT_Asset["Asset_Age"] = this_year - Plant_Construct_Year
INPUT_Asset["Install_Date"] = pd.to_numeric(INPUT_Asset["Install_Date"], errors='coerce').fillna(0)
INPUT_Asset["Asset_Age"] = np.where(INPUT_Asset["Install_Date"] ==0.0, this_year- Plant_Construct_Year,INPUT_Asset["Asset_Age"])
INPUT_Asset["Asset_Age"] = np.where(INPUT_Asset["Install_Date"] !=0.0, this_year- INPUT_Asset["Install_Date"],INPUT_Asset["Asset_Age"])
print(INPUT_Asset["Asset_Age"])
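For what it's worth, the two np.where calls can be collapsed into one vectorized step. A sketch, assuming Install_Date holds years and that NaN and 0 both mean "unknown" (this is my addition, not code from the question):

import numpy as np
import pandas as pd

# Coerce to numbers; unparseable entries become NaN.
install = pd.to_numeric(INPUT_Asset["Install_Date"], errors="coerce")

# One pass: fall back to the plant construction year when the date is missing.
INPUT_Asset["Asset_Age"] = np.where(
    install.isna() | (install == 0.0),
    this_year - Plant_Construct_Year,
    this_year - install,
)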

Python / Pandas / Data Frame / Calculate date difference

I have a data Frame, and I am doing the following:
def calculate_planungsphase(audit, phase1, phase2):
    datum_first_milestone = data_audit[(data_audit.Audit == audit) & (data_audit.Meilenstein == phase1)]
    datum_second_milestone = data_audit[(data_audit.Audit == audit) & (data_audit.Meilenstein == phase2)]
    print(datum_first_milestone['GeplantesErledigungsdatum'])
    print(datum_second_milestone['GeplantesErledigungsdatum'])
    print(datum_first_milestone['GeplantesErledigungsdatum'] - datum_second_milestone['GeplantesErledigungsdatum'])
The result of print(datum_first_milestone['GeplantesErledigungsdatum']) is:
0   2018-01-01
Name: GeplantesErledigungsdatum, dtype: datetime64[ns]
The result of print(datum_second_milestone['GeplantesErledigungsdatum']) is:
1   2018-01-02
Name: GeplantesErledigungsdatum, dtype: datetime64[ns]
The result of the difference calculation is:
0   NaT
1   NaT
Name: GeplantesErledigungsdatum, dtype: timedelta64[ns]
Why is the result of the calculation NaT? And why do I get two results (index 0 and index 1) when I am doing only one calculation?
Thank you for your help!
The problem is different index values, so in the subtraction the Series are not aligned.
A possible solution, if both filtered Series have the same size, is to assign the same index values:
datum_first_milestone.index = datum_second_milestone.index
The solution can also be simplified if you only need to filter one column, using loc + the column name:
datum_first_milestone = data_audit.loc[(data_audit.Audit == audit) & (data_audit.Meilenstein == phase1), 'GeplantesErledigungsdatum']
datum_second_milestone = data_audit.loc[(data_audit.Audit == audit) & (data_audit.Meilenstein == phase2), 'GeplantesErledigungsdatum']
print(datum_first_milestone)
print(datum_second_milestone)
And if a one-element Series is always returned, Series.item() gives you the scalar:
print(datum_first_milestone.item() - datum_second_milestone.item())
More generally, if there are one or more values, you can select the first value as a scalar:
print (datum_first_milestone.iat[0] - datum_second_milestone.iat[0])
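To see the alignment issue in isolation, here is a tiny self-contained example (toy data, not from the question):

import pandas as pd

s1 = pd.Series(pd.to_datetime(['2018-01-01']), index=[0])
s2 = pd.Series(pd.to_datetime(['2018-01-02']), index=[1])

# Subtraction aligns on index labels; labels 0 and 1 have no partner
# in the other Series, so both positions come back as NaT.
print(s1 - s2)
# 0   NaT
# 1   NaT
# dtype: timedelta64[ns]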

subsetting a Python DataFrame

I am transitioning from R to Python. I just began using Pandas. I have some R code that subsets nicely:
k1 <- subset(data, Product == p.id & Month < mn & Year == yr, select = c(Time, Product))
Now, I want to do similar stuff in Python. This is what I have got so far:
import pandas as pd
data = pd.read_csv("../data/monthly_prod_sales.csv")
# first, index the dataset by Product, and get all that matches a given 'p.id' and time
data.set_index('Product')
k = data.ix[[p.id, 'Time']]
# then, index this subset by Time and do more subsetting...
I am beginning to feel that I am doing this the wrong way; perhaps there is an elegant solution. Can anyone help? I need to extract month and year from the timestamp I have and do subsetting. Perhaps there is a one-liner that will accomplish all this:
k1 <- subset(data, Product == p.id & Time >= start_time & Time < end_time, select = c(Time, Product))
Thanks.
I'll assume that Time and Product are columns in a DataFrame, df is an instance of DataFrame, and that the other variables are scalar values.
For now, you'll have to reference the DataFrame instance:
k1 = df.loc[(df.Product == p_id) & (df.Time >= start_time) & (df.Time < end_time), ['Time', 'Product']]
The parentheses are also necessary because of the precedence of the & operator vs. the comparison operators: & is an overloaded bitwise operator that binds more tightly than the comparison operators, so without the parentheses the expression would be grouped the wrong way.
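A tiny illustration of the pitfall (illustrative names only):

# Without parentheses, & binds tighter than ==, so this parses as the
# chained comparison df.Product == (p_id & df.Time) >= start_time,
# which is not the intended filter and typically raises an error:
# bad = df[df.Product == p_id & df.Time >= start_time]

# Grouping each comparison explicitly gives the intended element-wise AND:
good = df[(df.Product == p_id) & (df.Time >= start_time)]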
In pandas 0.13, the experimental DataFrame.query() method became available. It's extremely similar to subset, modulo the select argument. With query() you'd do it like this (local variables are referenced with an @ prefix):
df[['Time', 'Product']].query('Product == @p_id and Month < @mn and Year == @yr')
Here's a simple example:
In [9]: df = DataFrame({'gender': np.random.choice(['m', 'f'], size=10), 'price': poisson(100, size=10)})
In [10]: df
Out[10]:
gender price
0 m 89
1 f 123
2 f 100
3 m 104
4 m 98
5 m 103
6 f 100
7 f 109
8 f 95
9 m 87
In [11]: df.query('gender == "m" and price < 100')
Out[11]:
gender price
0 m 89
4 m 98
9 m 87
The final query that you're interested in can even take advantage of chained comparisons, like this:
k1 = df[['Time', 'Product']].query('Product == @p_id and @start_time <= Time < @end_time')
Just for someone looking for a solution more similar to R:
df[(df.Product == p_id) & (df.Time > start_time) & (df.Time < end_time)][['Time', 'Product']]
No need for .loc or query, but I do think it is a bit long.
I've found that you can use any subset condition for a given column by wrapping it in []. For instance, say you have a df with columns ['Product', 'Time', 'Year', 'Color'], and you want to include products made before 2014. You could write
df[df['Year'] < 2014]
to return all the rows where this is the case. You can chain on further conditions:
df[df['Year'] < 2014][df['Color'] == 'Red']
Then just choose the columns you want, as directed above. For instance, the product and color for the df above:
df[df['Year'] < 2014][df['Color'] == 'Red'][['Product', 'Color']]
Regarding some points mentioned in previous answers, and to improve readability:
"No need for .loc or query, but I do think it is a bit long."
"The parentheses are also necessary, because of the precedence of the & operator vs. the comparison operators."
I like to write such expressions as follows - fewer brackets, faster to type, easier to read. Closer to R, too:
# c is just a convenience, mimicking R's c()
c = lambda v: v.split(',')
q_product = df.Product == p_id
q_start = df.Time > start_time
q_end = df.Time < end_time
df.loc[q_product & q_start & q_end, c('Time,Product')]
Creating an empty dataframe with known column names:
Names = ['Col1', 'ActivityID', 'TransactionID']
df = pd.DataFrame(columns=Names)
Creating a dataframe from a csv:
df = pd.read_csv('...../file_name.csv')
Creating a dynamic filter to subset the dataframe:
i = 12
df[df['ActivityID'] <= i]
Creating a dynamic filter to subset required columns of the dataframe:
df[df['ActivityID'] == i][['TransactionID', 'ActivityID']]
