Python / Pandas / DataFrame / Calculate date difference

I have a DataFrame, and I am doing the following:
def calculate_planungsphase(audit, phase1, phase2):
    datum_first_milestone = data_audit[(data_audit.Audit == audit) & (data_audit.Meilenstein == phase1)]
    datum_second_milestone = data_audit[(data_audit.Audit == audit) & (data_audit.Meilenstein == phase2)]
    print(datum_first_milestone['GeplantesErledigungsdatum'])
    print(datum_second_milestone['GeplantesErledigungsdatum'])
    print(datum_first_milestone['GeplantesErledigungsdatum'] - datum_second_milestone['GeplantesErledigungsdatum'])
The result of print(datum_first_milestone['GeplantesErledigungsdatum']) =
2018-01-01
Name: GeplantesErledigungsdatum, dtype: datetime64[ns]
The result of print(datum_second_milestone['GeplantesErledigungsdatum']) =
2018-01-02
Name: GeplantesErledigungsdatum, dtype: datetime64[ns]
The result of the difference calculation is:
0 NaT
1 NaT
Name: GeplantesErledigungsdatum, dtype: timedelta64[ns]
Why is the result of the calculation NaT? And why do I get two results when I am doing only one calculation (index 0 and index 1 are both NaT)?
Thank you for your help!

The problem is that the two Series have different index values, so they are not aligned in the subtraction.
A possible solution, if both filtered Series have the same size, is to assign the same index values:
datum_first_milestone.index = datum_second_milestone.index
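To see why the result is NaT (and why there are two rows), here is a minimal sketch with made-up data; the index labels 0 and 1 are assumptions matching the output in the question:
import pandas as pd

s1 = pd.Series(pd.to_datetime(['2018-01-01']), index=[0])
s2 = pd.Series(pd.to_datetime(['2018-01-02']), index=[1])

# Subtraction aligns on the index: label 0 exists only in s1 and label 1
# only in s2, so the result has both labels and both values are NaT.
print(s1 - s2)

# After copying the index, the values line up and the difference appears.
s1.index = s2.index
print(s1 - s2)  # -1 days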
The solution can also be simplified: if you need to filter only one column, use loc with the column name:
datum_first_milestone = data_audit.loc[(data_audit.Audit == audit) & (data_audit.Meilenstein == phase1), 'GeplantesErledigungsdatum']
datum_second_milestone = data_audit.loc[(data_audit.Audit == audit) & (data_audit.Meilenstein == phase2), 'GeplantesErledigungsdatum']
print(datum_first_milestone)
print(datum_second_milestone)
If the filter always returns a one-element Series, Series.item returns the scalar:
print (datum_first_milestone.item() - datum_second_milestone.item())
More generally, if there are one or more values, you can select the first value as a scalar:
print (datum_first_milestone.iat[0] - datum_second_milestone.iat[0])
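The difference between the two: item() raises a ValueError when the Series has more than one element, while iat[0] silently takes the first match. A small sketch with made-up values:
import pandas as pd

s = pd.Series([1, 2])
print(s.iat[0])  # 1 -- first value, regardless of length
s.item()         # raises ValueError: Series has more than one element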

Related

Matched 3 different column element of 2 different dataframe

I am trying to solve a problem where I have two dataframes, df1 and df2. Both dataframes have the same columns. I want to check whether df1['column1'] == df2['column1'], df1['column2'] == df2['column2'], and df1['column3'] == df2['column3']; if this is true, I want to get the index of both dataframes where the condition is matched. I tried the approach below, but it takes a long time because I have dataframes of around 250,000 rows. Can anyone suggest a more efficient way?
Tried solution:
from datetime import datetime

MS_counter = 0
matched_ws_index = []
start = datetime.now()
for MS_id in Mastersheet_df["Index"]:
    WS_counter = 0
    for WS_id in Weekly_sheet_df["Index"]:
        # match trial ID, biomarker type and index
        if (Weekly_sheet_df.loc[WS_counter, "Trial ID"] == Mastersheet_df.loc[MS_counter, "Trial ID"]) \
                and (Mastersheet_df.loc[MS_counter, "Biomarker Type"] == Weekly_sheet_df.loc[WS_counter, "Biomarker Type"]) \
                and (WS_id == MS_id):
            print("Trial id, index and biomarker type are matched")
            print(WS_counter)
            print(MS_counter)
            matched_ws_index.append(WS_counter)
        WS_counter += 1
    MS_counter += 1
end = datetime.now()
print("The time of execution of above program is :", str(end - start)[5:])
The expected output: if the three conditions above are true, it should give the dataframe index positions, like this:
Matched df1 index is = 170
Matched df2 index is = 658
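A faster alternative to the nested loop is a single inner merge on the three key columns. This is a hedged sketch: the column names are taken from the question, and it assumes both dataframes have a default RangeIndex so that reset_index() yields an 'index' column:
import pandas as pd

keys = ["Trial ID", "Biomarker Type", "Index"]

# reset_index() keeps each frame's original row position as a column,
# so the matched index pairs can be read off after the merge.
matches = pd.merge(
    Mastersheet_df.reset_index(),
    Weekly_sheet_df.reset_index(),
    on=keys,
    suffixes=("_ms", "_ws"),
)

matched_ms_index = matches["index_ms"].tolist()
matched_ws_index = matches["index_ws"].tolist()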

Selecting Rows in pandas Dataframe based on condition doesn't work

I have a dataframe with a "Timestamp" column like this:
df[Timestamp]
0 1.341709
1 1.343688
2 1.344503
3 1.344593
4 1.344700
...
1263453 413.056745
1263454 413.056836
1263455 413.056945
1263456 413.057046
1263457 413.057153
Name: Timestamp, Length: 1263458, dtype: float64
Now I have two variables to define the start and end of an interval, like so:
start = 10
end = 15
To select all rows in the Dataframe where the Timestamp lies between "start" and "end" I use a query approach:
df_want = df.query("#start <= Timestamp < #end")
This gives me a TypeError though:
TypeError: '<=' not supported between instances of 'int' and 'type'
Why does this not work? Shouldn't Timestamp be of type 'float64'? Why is it just 'type'?
This most likely happens because Timestamp collides with a name pandas itself places in the query namespace (pandas.Timestamp, which is a type), so the name does not resolve to your column. You need to do the following:
df_want = df[df['Timestamp'].between(start,end)]
With the variables:
df[(df['Timestamp'] >= start) & (df['Timestamp'] <= end)]
With the values hardcoded:
df[(df['Timestamp'] >= 10) & (df['Timestamp'] <= 15)]
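One detail worth noting: between is inclusive on both ends by default, while the original query used a half-open interval (start <= Timestamp < end). A minimal sketch with made-up data; the inclusive='left' argument (available in recent pandas) reproduces the half-open behaviour:
import pandas as pd

df = pd.DataFrame({'Timestamp': [1.3, 10.0, 12.5, 15.0, 400.2]})
start, end = 10, 15

a = df[df['Timestamp'].between(start, end)]                    # 10.0, 12.5, 15.0
b = df[(df['Timestamp'] >= start) & (df['Timestamp'] < end)]   # 10.0, 12.5
c = df[df['Timestamp'].between(start, end, inclusive='left')]  # same as b
assert b.equals(c)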

Converting Data Frame entry to float in Python/Pandas

I'm trying to save the values from the populationEst column in float variables using Python 3 & Pandas.
I have the following table:
Name         populationEst
Amsterdam    872757
Netherlands  17407585
I have tried to separate both values as follows:
populationAM = pops['populationEst'][pops.Name == 'Amsterdam']
populationNL = pops['populationEst'][pops.Name == 'Netherlands']
However, when I try to print the value with print(populationAM), I get this output:
0 872757
Name: PopulationEstimate2020-01-01, dtype: int64
and I think that populationAM & populationNL are not int values, because whenever I try to run some arithmetic operation on them I do not get the desired value.
For example, I have tried to calculate the fraction of populationAM against populationNL using this formula:
frac = populationAM.astype(float) * 100 / populationNL.astype(float)
and I did not get the desired output, which should be 5.013659276, but I got this instead:
0    NaN
1    NaN
Name: PopulationEst, dtype: float64
Can anybody tell me where I am going wrong here, or how I can save these values in simple float variables?
Is this what you are looking for?
populationAM = pops.loc[pops.Name == 'Amsterdam', 'populationEst'].iloc[0]
populationNL = pops.loc[pops.Name == 'Netherlands', 'populationEst'].iloc[0]
frac = populationAM * 100 / populationNL
The value of frac here is 5.013659275539944, while populationAM and populationNL are the integers corresponding to the respective populations (as you can see, the type of these variables is not an obstacle to computing the correct value of frac). In your code, the issue is that populationAM and populationNL are pandas Series instead of integers; iloc[0] retrieves the value at the first position of the Series.
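The NaN output in the question is an index-alignment effect: the two one-element Series carry different index labels, so arithmetic aligns them into a two-label result with no overlap. A minimal sketch with the question's numbers (the labels 0 and 1 are assumed from the printed output):
import pandas as pd

populationAM = pd.Series([872757], index=[0])
populationNL = pd.Series([17407585], index=[1])

# Division aligns on the index; labels 0 and 1 have no common entry,
# so both positions become NaN, exactly as in the question.
print(populationAM * 100 / populationNL)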
Is this what you are trying to do?
populationAM = pops[pops['Name'] == 'Amsterdam']['populationEst']
populationNL = pops[pops['Name'] == 'Netherlands']['populationEst']
Maybe try this:
populationAM = pops['populationEst'][pops['Name'] == 'Amsterdam']
populationNL = pops['populationEst'][pops['Name'] == 'Netherlands']
It will have dtype int, but you can easily transform it to float.

How to select weekday / hour in pandas when doing iteration

I'm filtering a dataframe by hour and weekday:
if type == 'daily':
    hour = data.index.hour
    day = data.index.weekday
    selector = (hour != 17) | ((day != 5) & (day != 6))
    data = data[selector]
if type == 'weekly':
    day = data.index.weekday
    selector = (day != 5) & (day != 6)
    data = data[selector]
Then I'm using a for loop where I need to write some conditionals based on the weekday/hour, but row.index doesn't have any of that information. What can I do in this case?
I need to do something like this (it won't work, since row.index doesn't have weekday or hour info):
for index, row in data.iterrows():
    if type == 'weekly' & row.index.weekday == 1 & row.index.hour == 0 & row.index.min == 0 | \
       type == 'daily' & row.index.hour == 18 & row.index.min == 0:
Thx in advance
I know this is not the most elegant way, but you could create your variables in columns:
df['Hour'] = df.index.hour
If you need a min or a max based on those variables, you could create another column and use rolling_min or similar rolling-type formulas.
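Note that rolling_min belongs to the old pandas API and was removed in later versions; the current equivalent is the .rolling accessor. A minimal sketch, assuming the numeric 'Hour' column created above:
# old API: pd.rolling_min(df['Hour'], window=3)
df['HourMin'] = df['Hour'].rolling(window=3).min()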
Once you have your columns, you can iterate as you please with the iteration you suggested.
There's info about the index properties here
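An alternative that needs no extra columns: with a DatetimeIndex, the index variable yielded by iterrows is itself a Timestamp, so it carries the weekday/hour/minute directly (row.index holds the column names, which is why the attempt in the question fails). A sketch under that assumption, with scalar and/or in place of the bitwise operators:
for index, row in data.iterrows():
    # index is the row label (a Timestamp); weekday() is a method,
    # and the minutes attribute is .minute, not .min
    if (type == 'weekly' and index.weekday() == 1 and index.hour == 0 and index.minute == 0) or \
       (type == 'daily' and index.hour == 18 and index.minute == 0):
        ...  # handle the matching row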

Subsetting a Python DataFrame

I am transitioning from R to Python. I just began using Pandas. I have R code that subsets nicely:
k1 <- subset(data, Product == p.id & Month < mn & Year == yr, select = c(Time, Product))
Now, I want to do similar stuff in Python. This is what I have got so far:
import pandas as pd
data = pd.read_csv("../data/monthly_prod_sales.csv")
#first, index the dataset by Product. And, get all that matches a given 'p.id' and time.
data.set_index('Product')
k = data.ix[[p.id, 'Time']]
# then, index this subset with Time and do more subsetting..
I am beginning to feel that I am doing this the wrong way. Perhaps there is an elegant solution. Can anyone help? I need to extract month and year from the timestamp I have and do subsetting. Perhaps there is a one-liner that will accomplish all this:
k1 <- subset(data, Product == p.id & Time >= start_time & Time < end_time, select = c(Time, Product))
thanks.
I'll assume that Time and Product are columns in a DataFrame, df is an instance of DataFrame, and that other variables are scalar values:
For now, you'll have to reference the DataFrame instance:
k1 = df.loc[(df.Product == p_id) & (df.Time >= start_time) & (df.Time < end_time), ['Time', 'Product']]
The parentheses are also necessary because of the precedence of the & operator vs. the comparison operators. & is actually an overloaded bitwise operator here, and bitwise operators bind more tightly than comparisons in Python, so without parentheses the comparisons would not be grouped as intended.
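A minimal illustration of the grouping, using the same hypothetical names:
# Without parentheses,
#     df.Product == p_id & df.Time >= start_time
# groups around (p_id & df.Time) as a chained comparison and raises an
# error (which one depends on the dtypes) instead of filtering.
mask = (df.Product == p_id) & (df.Time >= start_time)  # correct grouping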
In pandas 0.13 a new experimental DataFrame.query() method became available. It's extremely similar to subset, modulo the select argument.
With query() you'd do it like this (in current pandas, local variables are referenced with @, and the column selection comes after the query so that Month and Year are still available to filter on):
df.query('Product == @p_id and Month < @mn and Year == @yr')[['Time', 'Product']]
Here's a simple example:
In [9]: df = DataFrame({'gender': np.random.choice(['m', 'f'], size=10), 'price': poisson(100, size=10)})
In [10]: df
Out[10]:
gender price
0 m 89
1 f 123
2 f 100
3 m 104
4 m 98
5 m 103
6 f 100
7 f 109
8 f 95
9 m 87
In [11]: df.query('gender == "m" and price < 100')
Out[11]:
gender price
0 m 89
4 m 98
9 m 87
The final query that you're interested in will even be able to take advantage of chained comparisons, like this:
k1 = df.query('Product == @p_id and @start_time <= Time < @end_time')[['Time', 'Product']]
Just for someone looking for a solution more similar to R:
df[(df.Product == p_id) & (df.Time > start_time) & (df.Time < end_time)][['Time', 'Product']]
No need for data.loc or query, but I do think it is a bit long.
I've found that you can use any subset condition for a given column by wrapping it in []. For instance, say you have a df with columns ['Product', 'Time', 'Year', 'Color'],
and you want to include products made before 2014. You could write:
df[df['Year'] < 2014]
To return all the rows where this is the case. You can add different conditions.
df[df['Year'] < 2014][df['Color'] == 'Red']
Then just choose the columns you want as directed above. For instance, the product color and key for the df above,
df[df['Year'] < 2014][df['Color'] == 'Red'][['Product','Color']]
Regarding some points mentioned in previous answers, and to improve readability:
No need for data.loc or query, but I do think it is a bit long.
The parentheses are also necessary, because of the precedence of the & operator vs. the comparison operators.
I like to write such expressions as follows: fewer brackets, faster to type, easier to read. Closer to R, too.
# c is just a convenience
c = lambda v: v.split(',')

q_product = df.Product == p_id
q_start = df.Time > start_time
q_end = df.Time < end_time
df.loc[q_product & q_start & q_end, c('Time,Product')]
Creating an empty DataFrame with known column names:
Names = ['Col1','ActivityID','TransactionID']
df = pd.DataFrame(columns = Names)
Creating a DataFrame from a csv:
df = pd.read_csv('...../file_name.csv')
Creating a dynamic filter to subset a dataframe:
i = 12
df[df['ActivityID'] <= i]
Creating a dynamic filter to subset required columns of a dataframe:
df[df['ActivityID'] == i][['TransactionID','ActivityID']]
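The filter and the column selection can also be combined in one step with loc, which avoids chained indexing; a sketch reusing the same hypothetical columns:
df.loc[df['ActivityID'] == i, ['TransactionID', 'ActivityID']]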
