Python pandas pyhaystack

I am using a module called pyhaystack to retrieve data (REST API) from a building automation system based on 'tags'. Python returns a dictionary of the data. I'm trying to use pandas with an if/else statement further below that I am having trouble with. The pyhaystack part works just fine to get the data...
This connects me to the automation system: (works just fine)
from pyhaystack.client.niagara import NiagaraHaystackSession
import pandas as pd
session = NiagaraHaystackSession(uri='http://0.0.0.0', username='Z', password='z', pint=True)
This code finds my tags called 'znt', converts the dictionary to a pandas DataFrame, and filters for time: (works just fine for the two points)
znt = session.find_entity(filter_expr='znt').result
znt = session.his_read_frame(znt, rng= '2018-01-01,2018-02-12').result
znt = pd.DataFrame.from_dict(znt)
znt.index.names=['Date']
znt = znt.fillna(method = 'ffill').fillna(method = 'bfill').between_time('08:00','17:00')
What I am most interested in is the column name, where ultimately I want Python to return the column name based on conditions:
print(znt.columns)
print(znt.values)
Returns:
Index(['C.Drivers.NiagaraNetwork.Adams_Friendship.points.A-Section.AV1.AV1ZN~2dT', 'C.Drivers.NiagaraNetwork.points.A-Section.AV2.AV2ZN~2dT'], dtype='object')
[[ 65.9087 66.1592]
[ 65.9079 66.1592]
[ 65.9079 66.1742]
...,
[ 69.6563 70.0198]
[ 69.6563 70.2873]
[ 69.5673 70.2873]]
I am most interested in this column name of the pandas DataFrame: C.Drivers.NiagaraNetwork.Adams_Friendship.points.A-Section.AV1.AV1ZN~2dT
For my two arrays, I am subtracting the setpoint value of 70 from the data in the DataFrame. (works just fine)
znt_sp = 70
deviation = znt - znt_sp
deviation = deviation.abs()
deviation
And this is where I am getting tripped up in pandas. I want Python to print the name of the column if the deviation is greater than four, else print that the zone is Normal. Any tips would be greatly appreciated.
if (deviation > 4).any():
    print('Zone %f does not make setpoint' % deviation)
else:
    print('Zone %f is Normal' % deviation)
The column names in pandas look like this:
C.Drivers.NiagaraNetwork.Adams_Friendship.points.A-Section.AV1.AV1ZN~2dT

I think DataFrame would be a good way to handle what you want.
Starting with znt you can do all the calculations there:
deviation = znt - 70
deviation = deviation.abs()
# and the cool part is filtering in the df
problem_zones = deviation[deviation['C.Drivers.NiagaraNetwork.Adams_Friendship.points.A-Section.AV1.AV1ZN~2dT'] > 4]
You can play with this and figure out a way to iterate through columns, like:
for each in df.columns:
    # if in this column, more than 10 occurrences of deviation GT 4...
    if len(df[df[each] > 4]) > 10:
        print('This zone has a lot of trouble:', each)
Edit:
I like adding columns to a DataFrame instead of just building an external Series.
df['error_for_a'] = df['a'] - 70
This opens possibilities and keeps everything together. One could use
df[df['error_for_a'] > 4]
Again, all() or any() can be useful, but in a real-life scenario we would probably need to trigger the "fault detection" only when a certain number of errors are present.
If the schedule has been set to 'occupied' at 8 AM, maybe the first entries won't be correct (any() would trigger an error even if the situation gets better 30 minutes later). Another scenario would be a conference room where the error is tiny... but as soon as there are people in it, things go bad (all() would not see that).
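For example, a minimal sketch of that idea, assuming the deviation frame computed above and a purely hypothetical tolerance of 10 samples:
threshold = 4
min_count = 10  # hypothetical tolerance; tune it to your data

exceedances = (deviation > threshold).sum()           # number of samples above threshold, per column
faulty_zones = exceedances[exceedances > min_count].index

for zone in faulty_zones:
    print('Zone %s does not make setpoint (%d exceedances)' % (zone, exceedances[zone]))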

Solution:
You can iterate over columns
for col in df.columns:
    if (df[col] > 4).any():  # or .all() if needed
        print('Zone %s does not make setpoint' % col)
    else:
        print('Zone %s is Normal' % col)
Or by defining a function and using apply
def _print(x):
    if (x > 4).any():
        print('Zone %s does not make setpoint' % x.name)
    else:
        print('Zone %s is Normal' % x.name)

df.apply(lambda x: _print(x))
# you can even do
[_print(df[col]) for col in df.columns]
Advice:
Maybe you would rather keep the result in another structure; change the function to return a boolean meaning "is normal", and applying it gives a boolean Series:
def is_normal(x):
    return not (x > 4).any()

s = df.apply(lambda x: is_normal(x))
# or directly
s = df.apply(lambda x: not (x > 4).any())
This returns a Series s whose index is the column names of your df and whose values are booleans corresponding to your condition.
You can then use it to get all the Normal column names with s[s].index, or the non-normal ones with s[~s].index.
Ex: I want only the normal columns of my df: df[s[s].index]
A complete example
For the example I will use a sample df with a different condition from yours (a column is Normal if no element is greater than or equal to 4; otherwise it does not make the setpoint).
df = pd.DataFrame(dict(a=[1,2,3],b=[2,3,4],c=[3,4,5])) # A sample
print(df)
a b c
0 1 2 3
1 2 3 4
2 3 4 5
Your use case: Print if normal or not - Solution
for col in df.columns:
    if (df[col] >= 4).any():
        print('Zone %s does not make setpoint' % col)
    else:
        print('Zone %s is Normal' % col)
Result
Zone a is Normal
Zone b does not make setpoint
Zone c does not make setpoint
To illustrate my advice: keep the is_normal result in a Series
s = df.apply(lambda x: not (x >= 4).any())  # Build the series
print(s)
a True
b False
c False
dtype: bool
print(df[s[~s].index])  # False columns of df
b c
0 2 3
1 3 4
2 4 5
print(df[s[s].index])  # True columns of df
a
0 1
1 2
2 3


Efficient way to loop through GroupBy DataFrame

Since my last post lacked information:
Example of my df (the important columns):
deviceID: unique ID of the vehicle. Vehicles send data every X minutes.
mileage: the distance moved since the last message (in km)
position_timestamp_measure: Unix timestamp of the time the record was created.
deviceID mileage position_timestamp_measure
54672 10 1600696079
43423 20 1600696079
42342 3 1600701501
54672 3 1600702102
43423 2 1600702701
My goal is to validate the mileage by comparing it to the max speed of the vehicle (80 km/h), calculating the speed of the vehicle from the timestamp and the mileage. The result should then be written into the original dataset.
What I've done so far is the following:
df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
# create new col and set all values to false
df_ori['valid'] = 0
for group_name, group in df:
    # sort group by time
    group = group.sort_values(by='position_timestamp_measure')
    group = group.reset_index()
    # since I can't validate the first point in the group, I set it to valid
    df_ori.loc[df_ori.index == group.dataIndex.values[0], 'valid'] = 1
    # iterate through each record in the group
    for i in range(1, len(group)):
        timeGoneSec = abs(group.position_timestamp_measure.values[i] - group.position_timestamp_measure.values[i-1])
        timeHours = (timeGoneSec / 60) / 60
        # calculate speed
        if (group.mileage.values[i] / timeHours) < maxSpeedKMH:
            df_ori.loc[df_ori.index == group.dataIndex.values[i], 'valid'] = 1

df_ori.valid.value_counts()
It definitely works the way I want it to; however, its performance is poor. The df contains nearly 700k rows (already cleaned). I am still a beginner and can't figure out a better solution. I would really appreciate any help.
If I got it right, no for-loops are needed here. Here is what I've transformed your code into:
df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
# create new col and set all values to false
df_ori['valid'] = 0
df_ori = df_ori.sort_values(['position_timestamp_measure'])
# Subtract the preceding value from the current value, per device
df_ori['timeGoneSec'] = \
    df_ori.groupby('device_id')['position_timestamp_measure'].diff()
# The operation above produces NaN for the first value in each group;
# fill 'valid' with 1 for those rows, as in the original code
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1
df_ori['timeHours'] = df_ori['timeGoneSec'] / 3600  # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) <= maxSpeedKMH
df_ori.loc[df_ori['flag'], 'valid'] = 1
# Remove helper columns
df_ori = df_ori.drop(columns=['flag', 'timeHours', 'timeGoneSec'])
The basic idea is to use vectorized operations as much as possible and to avoid for-loops, especially row-by-row iteration, which can be insanely slow.
Since I can't get the context of your code, please double check the logic and make sure it works as desired.
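As a small illustration of the groupby + diff pattern used above (made-up data, not your dataset), this sketch shows why the first row of each device ends up NaN:
import pandas as pd

demo = pd.DataFrame({
    'device_id': [1, 1, 1, 2, 2],
    'position_timestamp_measure': [100, 160, 280, 50, 170],
})
demo = demo.sort_values('position_timestamp_measure')
demo['timeGoneSec'] = demo.groupby('device_id')['position_timestamp_measure'].diff()
print(demo)
# the first row of each device is NaN, which is why those rows are marked valid directly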

How can I count the elements in specific intervals in a dataframe?

I've got a dataframe like the one below, where column c01 holds the start time and c04 the end of each time interval:
c01 c04
1742 8.444991 14.022029
3786 29.91143 31.422439
3951 29.91143 31.145099
5402 37.81136 42.689595
8230 63.12394 65.34602
also a list like this (it's actually way longer):
8.522494
8.54471
8.578426
8.611193
8.644996
8.678053
8.710918
8.744901
8.777851
8.811053
8.844867
8.878389
8.912099
8.944729
8.977601
9.011232
9.04492
9.078157
9.111946
9.144788
9.177663
9.211054
9.245265
9.27805
9.311766
9.344647
9.377612
9.411709
I'd like to count how many elements of the list fall in the intervals given by the dataframe, which I coded like this:
count = 0
for index, row in speech.iterrows():
    count += gtls.count(lambda i: i in [row['c01'], row['c04']])
The code runs as a whole, but every 'count' turns out to be 0. Would you please tell me where I messed up?
I took the liberty of converting your list into a numpy array (I called it arr). Then you can use the apply function to create your count column. Let's assume your dataframe is called df.
def get_count(row):  # the logic for your summation is here
    return np.sum([(row['c01'] < arr) & (row['c04'] >= arr)])

df['C_sum'] = df.apply(get_count, axis=1)
print(df)
Output:
c01 c04 C_sum
0 8.444991 14.022029 28
1 29.911430 31.422439 0
2 29.911430 31.145099 0
3 37.811360 42.689595 0
4 63.123940 65.346020 0
You can also do the whole thing in one line using lambda:
df['C_sum'] = df.apply(lambda row: np.sum([(row['c01'] < arr) & (row['c04'] >= arr)]), axis = 1)
Welcome to Stack Overflow! The i in [row['c01'], row['c04']] doesn't do what you seem to think; it checks whether element i can be found in the two-element list, rather than checking the range between row['c01'] and row['c04']. For checking whether a floating point number is within a range, use row['c01'] < i < row['c04'].
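For completeness, a minimal sketch of the corrected loop, assuming speech is the intervals dataframe and gtls is the list of floats from the question:
count = 0
for index, row in speech.iterrows():
    count += sum(1 for i in gtls if row['c01'] < i < row['c04'])
print(count)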

Rolling sum and mean in a dataframe in python

I have this input df
import pandas as pd
df = pd.DataFrame([[0,'B','A',1,0], [1,'B','C',0,0], [2,'A','B',3,2],[3,'A','B',5,2],[4,'A','C',2,1],[5,'B','A',0,1],[6,'C','B',5,5]], columns=['events','Runner 1','Runner 2','dist_R1','dist_R2'])
print(df)
and I'd like to add 4 more rolling calculated columns as below:
import pandas as pd
df = pd.DataFrame([[0,'B','A',1,0,0,0,0,0], [1,'B','C',0,0,1,0,1,0], [2,'A','B',3,2,0,1,0,0.5],[3,'A','B',5,2,3,3,2,1],[4,'A','C',2,1,8,0,2.67,0],[5,'B','A',0,1,5,10,1.25,2.5],[6,'C','B',5,5,1,5,0.5,1]], columns=['events','Runner 1','Runner 2','dist_R1','dist_R2','sum_dist_last_2_by_R1','sum dist last 2 by R2','mean dist last 2 by R1','mean dist last 2 by R2'])
print(df)
(sorry, but I'm learning how to format a df on Stack Overflow)
I want to calculate the last 4 columns.
In detail, at the start of event n I need to know the sum and the mean of the km that Runner 1 and Runner 2 completed during the last two events they joined, among those from event 0 to n-1.
I think it is quite challenging.
Any help is welcome.
Thanks in advance,
M
You wrote "rolling", but as a matter of fact it is a "very special type"
of rolling calculation (including only rows for runners from the current
row), so you can not use "pandasonic" rolling functions.
Instead you should compute the result other way.
Start from preparatory computation:
Generate 2 auxiliary DataFrames - results for runner 1 and runner 2:
wrk1 = df[['events', 'Runner 1', 'dist_R1']]
wrk1.columns = ['events', 'Runner', 'dist']
wrk2 = df[['events', 'Runner 2', 'dist_R2']]
wrk2.columns = ['events', 'Runner', 'dist']
Concatenate them into a wrk DataFrame and delete the 2 previous DataFrames:
wrk = pd.concat([wrk1, wrk2]).sort_values('events')
del wrk1, wrk2
Then define the 2 following functions:
Get statistics (sum and mean) for the given runner (rnr), from the last 2 events before the given event (ev):
def getStat(rnr, ev):
    res = wrk.query('Runner == @rnr and events < @ev').dist.iloc[-2:]
    return res.sum(), res.mean()
Get additional columns for the current row:
def getAddCols(row):
    td_r1, md_r1 = getStat(row['Runner 1'], row.events)
    td_r2, md_r2 = getStat(row['Runner 2'], row.events)
    return pd.Series([td_r1, td_r2, md_r1, md_r2],
                     index=['tot dist_R1', 'tot dist_R2', 'mean dist_R1', 'mean dist_R2'])
And to get the result, run:
df.join(df.apply(getAddCols, axis=1).fillna(0))\
.astype({'tot dist_R1': int, 'tot dist_R2': int})
Note that a Series returned by getAddCols contains some float values,
so all 4 new columns are coerced to float.
To convert both total columns back to int, the last step (astype)
is needed.
The detailed results are a bit different from what you wrote in your post, but I assume your hand computation was off in some cases.
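If you want to spot-check a single value against your expectations, you can call getStat directly; a small sketch using runner 'A' and event 4 from the sample df:
tot, mean = getStat('A', 4)
print(tot, mean)  # sum and mean of A's distances over its last 2 events before event 4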

Performance issues with pandas iterrows

I am having performance issues with iterrows on my dataframe as I start to scale up my data analysis.
Here is the current loop that I am using.
dl = []
for ii, i in a.iterrows():
    for ij, j in a.iterrows():
        if ii != ij:
            if i['DOCNO'][-5:] == j['DOCNO'][4:9]:
                if i['RSLTN1'] > j['RSLTN1']:
                    dl.append(ij)
                else:
                    dl.append(ii)
            elif i['DOCNO'][-5:] == j['DOCNO'][-5:]:
                if i['RSLTN1'] > j['RSLTN1']:
                    dl.append(ij)
                else:
                    dl.append(ii)
c = a.drop(a.index[dl])
The point of the loop is to find 'DOCNO' values that are different in the dataframe but are known to be equivalent, as indicated by 5 characters that match but are positioned differently in the string. When found, I want to drop the row with the smaller number in the associated 'RSLTN1' column. Additionally, my data set may have multiple entries for a unique 'DOCNO', for which I also want to drop the lower 'RSLTN1' result.
I was successful running this with small quantities of data (~1000 rows), but as I scale up 10x I am running into performance issues. Any suggestions?
Sample from dataset
In [107]:a[['DOCNO','RSLTN1']].sample(n=5)
Out[107]:
DOCNO RSLTN1
6815 MP00064958 72386.0
218 MP0059189A 65492.0
8262 MP00066187 96497.0
2999 MP00061663 43677.0
4913 MP00063387 42465.0
How does this fit your needs?
import pandas as pd
from io import StringIO

s = '''\
DOCNO RSLTN1
MP00059189 72386.0
MP0059189A 65492.0
MP00066187 96497.0
MP00061663 43677.0
MP00063387 42465.0'''

# Recreate dataframe
df = pd.read_csv(StringIO(s), sep=r'\s+')

# Create mask
# We sort to make sure we keep only the highest value
# Remove all non-digits according to: https://stackoverflow.com/questions/44117326/
m = (df.sort_values(by='RSLTN1', ascending=False)['DOCNO']
       .str.extract(r'(\d+)', expand=False)
       .astype(int)
       .duplicated())

# Apply the inverted mask with `~`
df = df.loc[~m]
Resulting df:
DOCNO RSLTN1
0 MP00059189 72386.0
2 MP00066187 96497.0
3 MP00061663 43677.0
4 MP00063387 42465.0
In this example the following row was removed:
MP0059189A 65492.0
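To see why the two "equivalent" DOCNOs collide after stripping the non-digits, here is a tiny sketch of just the extraction step:
import pandas as pd

docno = pd.Series(['MP00059189', 'MP0059189A'])
print(docno.str.extract(r'(\d+)', expand=False).astype(int))
# both rows become 59189, so duplicated() flags the one with the lower RSLTN1
# (it comes second after the descending sort)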

subsetting a Python DataFrame

I am transitioning from R to Python. I just began using Pandas. I have an R code that subsets nicely:
k1 <- subset(data, Product = p.id & Month < mn & Year == yr, select = c(Time, Product))
Now, I want to do similar stuff in Python. this is what I have got so far:
import pandas as pd
data = pd.read_csv("../data/monthly_prod_sales.csv")
#first, index the dataset by Product. And, get all that matches a given 'p.id' and time.
data.set_index('Product')
k = data.ix[[p.id, 'Time']]
# then, index this subset with Time and do more subsetting..
I am beginning to feel that I am doing this the wrong way. Perhaps there is an elegant solution. Can anyone help? I need to extract the month and year from the timestamp I have and do subsetting. Perhaps there is a one-liner that will accomplish all this:
k1 <- subset(data, Product = p.id & Time >= start_time & Time < end_time, select = c(Time, Product))
thanks.
I'll assume that Time and Product are columns in a DataFrame, df is an instance of DataFrame, and that other variables are scalar values:
For now, you'll have to reference the DataFrame instance:
k1 = df.loc[(df.Product == p_id) & (df.Time >= start_time) & (df.Time < end_time), ['Time', 'Product']]
The parentheses are also necessary because of the precedence of the & operator vs. the comparison operators. The & operator is actually an overloaded bitwise operator, which has higher precedence than the comparison operators (though lower than the arithmetic operators).
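A minimal sketch of why the parentheses matter, using a toy frame (hypothetical data, not your columns):
import pandas as pd

toy = pd.DataFrame({'Product': [1, 2, 1], 'Time': [5, 6, 7]})
p_id = 1

ok = toy[(toy.Product == p_id) & (toy.Time > 5)]   # correct: one row
# toy[toy.Product == p_id & toy.Time > 5]          # & binds first, producing a chained
#                                                  # comparison between Series, which raises
#                                                  # "ValueError: The truth value of a Series is ambiguous"
print(ok)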
In pandas 0.13 a new experimental DataFrame.query() method will be available. It's extremely similar to subset modulo the select argument:
With query() you'd do it like this:
df[['Time', 'Product']].query('Product == @p_id and Month < @mn and Year == @yr')
Here's a simple example:
In [9]: df = DataFrame({'gender': np.random.choice(['m', 'f'], size=10), 'price': poisson(100, size=10)})
In [10]: df
Out[10]:
gender price
0 m 89
1 f 123
2 f 100
3 m 104
4 m 98
5 m 103
6 f 100
7 f 109
8 f 95
9 m 87
In [11]: df.query('gender == "m" and price < 100')
Out[11]:
gender price
0 m 89
4 m 98
9 m 87
The final query that you're interested in can even take advantage of chained comparisons, like this:
k1 = df[['Time', 'Product']].query('Product == @p_id and @start_time <= Time < @end_time')
Just for someone looking for a solution more similar to R:
df[(df.Product == p_id) & (df.Time> start_time) & (df.Time < end_time)][['Time','Product']]
No need for data.loc or query, but I do think it is a bit long.
I've found that you can use any subset condition for a given column by wrapping it in []. For instance, you have a df with columns ['Product','Time', 'Year', 'Color']
And let's say you want to include products made before 2014. You could write,
df[df['Year'] < 2014]
To return all the rows where this is the case. You can add different conditions.
df[df['Year'] < 2014][df['Color'] == 'Red']
Then just choose the columns you want as directed above. For instance, the Product and Color columns for the df above,
df[df['Year'] < 2014][df['Color'] == 'Red'][['Product','Color']]
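A note in the same spirit: the two conditions can also be combined into a single boolean mask with &, which avoids filtering twice; a small sketch with the hypothetical columns above:
mask = (df['Year'] < 2014) & (df['Color'] == 'Red')
df.loc[mask, ['Product', 'Color']]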
Regarding some points mentioned in previous answers, and to improve readability:
No need for data.loc or query, but I do think it is a bit long.
The parentheses are also necessary, because of the precedence of the & operator vs. the comparison operators.
I like to write such expressions as follows - fewer brackets, faster to type, easier to read. Closer to R, too.
# c is just a convenience
c = lambda v: v.split(',')

q_product = df.Product == p_id
q_start = df.Time > start_time
q_end = df.Time < end_time
df.loc[q_product & q_start & q_end, c('Time,Product')]
Creating an empty DataFrame with known column names:
Names = ['Col1','ActivityID','TransactionID']
df = pd.DataFrame(columns = Names)
Creating a dataframe from a csv:
df = pd.read_csv('...../file_name.csv')
Creating a dynamic filter to subset a dataframe:
i = 12
df[df['ActivityID'] <= i]
Creating a dynamic filter to subset required columns of a dataframe:
df[df['ActivityID'] == i][['TransactionID','ActivityID']]
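Putting these recipes together, a short self-contained sketch (hypothetical column values):
import pandas as pd

Names = ['Col1', 'ActivityID', 'TransactionID']
df = pd.DataFrame(columns=Names)
df.loc[0] = ['x', 10, 100]
df.loc[1] = ['y', 15, 200]

i = 12
print(df[df['ActivityID'] <= i][['TransactionID', 'ActivityID']])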
