Taking single value from a grouped data frame in Pandas - python

I am a new Python convert (from Matlab). I am using the pandas groupby function, and I am getting tripped up by a seemingly easy problem. I have written a custom function that I apply to the grouped df that returns 4 different values. Three of the values are working great, but the other value is giving me an error. Here is the original df:
Index,SN,Date,City,State,ID,County,Age,A,B,C
0,32,9/1/16,X,AL,360,BB County,29.0,negative,positive,positive
1,32,9/1/16,X,AL,360,BB County,1.0,negative,negative,negative
2,32,9/1/16,X,AL,360,BB County,10.0,negative,negative,negative
3,32,9/1/16,X,AL,360,BB County,11.0,negative,negative,negative
4,35,9/1/16,X,AR,718,LL County,67.0,negative,negative,negative
5,38,9/1/16,X,AR,728-13,JJ County,3.0,negative,negative,negative
6,38,9/1/16,X,AR,728-13,JJ County,8.0,negative,negative,negative
7,30,9/1/16,X,AR,728-13,JJ County,8.0,negative,negative,negative
8,30,9/1/16,X,AR,728-13,JJ County,14.0,negative,negative,negative
9,30,9/1/16,X,AR,728-13,JJ County,5.0,negative,negative,negative
...
This is the function that transforms the data. Basically, it counts the number of 'positive' values and the total number of observations in the group. I also want it to return the ID value, and this is where the problem is:
def _ct_id_pos(grp):
    return grp['ID'][0], grp[grp.A == 'positive'].shape[0], grp[grp.B == 'positive'].shape[0], grp.shape[0]
I apply the _ct_id_pos function to the data grouped by Date and SN:
FullMx_prime = FullMx.groupby(['Date', 'SN']).apply(_ct_id_pos).reset_index()
So, the method should return something like this:
Date SN ID 0
0 9/1/16 32 360 (360,2,1,4)
1 9/1/16 35 718 (718,0,0,1)
2 9/2/16 38 728 (728,1,0,2)
3 9/3/16 30 728 (728,2,0,3)
But, I keep getting the following error:
...
KeyError: 0
Obviously, it does not like this part of the function: grp['ID'][0]. I just want to take the first value of grp['ID'] because--if there are multiple values--they should all be the same (i.e., I could just as well take the last; it does not matter). I have tried other ways to index, but to no avail.

Change grp['ID'][0] to grp.iloc[0]['ID']
The problem you are having is due to grp['ID'], which selects a column and returns a pandas.Series. That is straightforward enough, and you could reasonably expect that [0] would select the first element. But [0] actually selects based on the index of the Series, and in this case that index comes from the dataframe that was grouped. So 0 is not always going to be a valid index label.
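To see the difference (a tiny illustration, not part of the original answer), consider a Series whose index labels were inherited from the parent frame:
import pandas as pd

s = pd.Series(['360', '718'], index=[4, 9])  # index labels carried over from the grouped frame
# s[0]       # KeyError: 0 -- label-based lookup, and 0 is not one of the labels
s.iloc[0]    # '360' -- positional lookup, always valid for a non-empty Series
s.iat[0]     # same value, scalar-optimized positional accessor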
Code:
def _ct_id_pos(grp):
    id = grp.iloc[0]['ID']
    a = grp[grp.A == 'positive'].shape[0]
    b = grp[grp.B == 'positive'].shape[0]
    sz = grp.shape[0]
    return id, a, b, sz
Test Code:
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(u"""
Index,SN,Date,City,State,ID,County,Age,A,B,C
0,32,9/1/16,X,AL,360,BB County,29.0,negative,positive,positive
1,32,9/1/16,X,AL,360,BB County,1.0,negative,negative,negative
2,32,9/1/16,X,AL,360,BB County,10.0,negative,negative,negative
3,32,9/1/16,X,AL,360,BB County,11.0,negative,negative,negative
4,35,9/1/16,X,AR,718,LL County,67.0,negative,negative,negative
5,38,9/1/16,X,AR,728-13,JJ County,3.0,negative,negative,negative
6,38,9/1/16,X,AR,728-13,JJ County,8.0,negative,negative,negative
7,30,9/1/16,X,AR,728-13,JJ County,8.0,negative,negative,negative
8,30,9/1/16,X,AR,728-13,JJ County,14.0,negative,negative,negative
9,30,9/1/16,X,AR,728-13,JJ County,5.0,negative,negative,negative
"""), header=0, index_col=0)
print(df.groupby(['Date', 'SN']).apply(_ct_id_pos).reset_index())
Results:
Date SN 0
0 9/1/16 30 (728-13, 0, 0, 3)
1 9/1/16 32 (360, 0, 1, 4)
2 9/1/16 35 (718, 0, 0, 1)
3 9/1/16 38 (728-13, 0, 0, 2)
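As a side note (not part of the original answer), returning a pandas Series instead of a tuple makes apply expand the result into separate, named columns; the helper name below is hypothetical:
def _ct_id_pos_named(grp):
    # same counts as above, but packed into a named Series
    return pd.Series({'ID': grp.iloc[0]['ID'],
                      'A_pos': grp[grp.A == 'positive'].shape[0],
                      'B_pos': grp[grp.B == 'positive'].shape[0],
                      'n': grp.shape[0]})

print(df.groupby(['Date', 'SN']).apply(_ct_id_pos_named).reset_index())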

Related

Downsample time series whenever abs(difference) since the previous sample exceeds threshold

I have a timeseries of intraday tick-by-tick stock prices that change gradually over time. Whenever there is a small change (e.g. the price increases by $0.01), a new row of data is created. This leads to a very large data series which is slow to plot. I want to downsample so that small changes (e.g. the price goes up/down/up/down/up/down and is unchanged after 50 rows of data) are ignored, which improves plotting speed without sacrificing the qualitative accuracy of the graph. I only want to sample if the price goes up/up/up/up so that I am only displaying obvious changes.
import pandas as pd
import numpy as np
prices = pd.DataFrame(np.random.randint(0,1000, size=(100, 1))/100+1000, columns=list('A'))
I wish to sample whenever the difference with the previous sample exceeds some threshold. So, I will sample row 0 by default. If row 1, 2, 3 and 4 are too close to row 0, I want to throw them away. Then, if row 5 is sufficiently far away from row 0, I will sample that. Then, row 5 becomes my new anchor point, and I will repeat the same process described immediately above.
Is there a way to do this, ideally without a loop?
You could apply a down-sampling masking function that checks whether the cumulative distance has been exceeded, then use that mask to select the applicable rows.
Here is the down-sampling masking function:
def down_mask(x, max_dist=3):
    global cum_diff
    # if NaN return True
    if x != x:
        return True
    cum_diff += x
    if abs(cum_diff) > max_dist:
        cum_diff = 0
        return True
    return False
Then apply it and use it as a mask to get the entries that you want (here the data is assumed to be a dataframe df with a 'prices' column):
cum_diff = 0
df[df['prices'].diff().apply(down_mask, max_dist=5)]
prices
0 1002.07
1 1007.37
2 1000.09
6 1008.08
10 1001.57
14 1006.74
18 1000.42
19 1006.98
21 1001.30
26 1008.89
28 1003.77
38 1009.04
40 1000.52
44 1007.06
47 1001.21
48 1009.38
49 1001.81
51 1008.64
52 1002.72
55 1008.84
56 1000.86
57 1007.17
67 1001.31
68 1006.33
79 1001.14
98 1009.74
99 1000.53
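If you would rather avoid the global variable, the same running-distance idea can be wrapped in a closure (a sketch along the lines of the answer above, not the original code, again assuming a dataframe df with a 'prices' column):
def make_down_mask(max_dist=3):
    state = {'cum_diff': 0.0}
    def down_mask(x):
        if x != x:                      # NaN marks the first row: always keep it
            return True
        state['cum_diff'] += x
        if abs(state['cum_diff']) > max_dist:
            state['cum_diff'] = 0.0     # reset the running distance at each kept sample
            return True
        return False
    return down_mask

df[df['prices'].diff().apply(make_down_mask(max_dist=5))]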
Not exactly what was asked for. I offer two options: a plain threshold against the previous row, and the same threshold measured against the value a fixed number of rows (a sliding period) earlier.
import pandas as pd
import numpy as np
prices = pd.DataFrame(np.random.randint(0,1000, size=(100, 1))/100+1000, columns=list('A'))
threshold_ = 3
index = np.abs(prices['A'].values[1:] - prices['A'].values[:-1]) > threshold_
index = np.insert(index, 0, True)
print(prices[index == True], len(prices[index == True]))
period = 5
hist = len(prices)
index = np.abs(prices['A'].values[period:] - prices['A'].values[:hist-period]) > threshold_
index = np.insert(index, 0, np.ones(period, dtype=bool))  # keep the first `period` rows, which have nothing to compare against
print(prices[index == True], len(prices[index == True]))

How can I count the element in specific intervals in a dataframe?

I've got a dataframe like below where columns in c01 represent the start time and c04 the end for time intervals:
c01 c04
1742 8.444991 14.022029
3786 29.91143 31.422439
3951 29.91143 31.145099
5402 37.81136 42.689595
8230 63.12394 65.34602
also a list like this (it's actually way longer):
8.522494
8.54471
8.578426
8.611193
8.644996
8.678053
8.710918
8.744901
8.777851
8.811053
8.844867
8.878389
8.912099
8.944729
8.977601
9.011232
9.04492
9.078157
9.111946
9.144788
9.177663
9.211054
9.245265
9.27805
9.311766
9.344647
9.377612
9.411709
I'd like to count how many elements in the list fall in the intervals shown in the dataframe. I coded it like this:
count = 0
for index, row in speech.iterrows():
    count += gtls.count(lambda i : i in [row['c01'], row['c04']])
The code runs, but every 'count' turns out to be 0. Would you please tell me where I messed up?
I took the liberty of converting your list into a numpy array (I called it arr). Then you can use the apply function to create your count column. Let's assume your dataframe is called df.
def get_count(row):  # the logic for your summation is here
    return np.sum([(row['c01'] < arr) & (row['c04'] >= arr)])

df['C_sum'] = df.apply(get_count, axis=1)
print(df)
Output:
c01 c04 C_sum
0 8.444991 14.022029 28
1 29.911430 31.422439 0
2 29.911430 31.145099 0
3 37.811360 42.689595 0
4 63.123940 65.346020 0
You can also do the whole thing in one line using lambda:
df['C_sum'] = df.apply(lambda row: np.sum([(row['c01'] < arr) & (row['c04'] >= arr)]), axis = 1)
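If arr is sorted (the sample list is), the row-wise apply can also be replaced with np.searchsorted; this is a sketch, not part of the original answer, counting elements with c01 < x <= c04 just like the expression above:
df['C_sum'] = (np.searchsorted(arr, df['c04'].values, side='right')
               - np.searchsorted(arr, df['c01'].values, side='right'))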
Welcome to Stack Overflow! The i in [row['c01'], row['c04']] doesn't do what you seem to think; it checks whether element i can be found in the two-element list rather than checking the range between row['c01'] and row['c04']. For checking whether a floating point number is within a range, use row['c01'] < i < row['c04'].
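Applied to the original loop, that fix would look something like this (a sketch; note that list.count does not take a predicate, so a generator expression is used for the counting instead):
count = 0
for index, row in speech.iterrows():
    count += sum(1 for i in gtls if row['c01'] < i < row['c04'])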

Reading values from dataframe.iloc is too slow, and a problem with dataframe.values

I use Python and I have data of 35,000 rows. I need to change values in a loop, but it takes too much time.
PS: I have columns named succes_1, succes_2, succes_5, succes_7, ..., succes_120, so I get the name of the column from the other loop; the values depend on the other column.
Example:
SK_1 Sk_2 Sk_5 .... SK_120 Succes_1 Succes_2 ... Succes_120
1 0 1 0 1 0 0
1 1 0 1 2 1 1
for i in range(len(data_jeux)):
    for d in range(len(succ_len)):
        ids = succ_len[d]
        if data_jeux['SK_%s' % ids][i] == 1:
            data_jeux.iloc[i]['Succes_%s' % ids] = 1+i
Is there a faster way to do this? I tried:
data_jeux.values[i, ('Succes_%s' % ids)] = 1+i
but it returns the following error; maybe it doesn't accept a string index.
You can define the columns and then use loc to increment. It's not clear whether your columns are naturally ordered; if they aren't, you can use sorted with a custom key function, since plain string sorting would put '100' before '20'.
def splitter(x):
    return int(x.rsplit('_', maxsplit=1)[-1])

cols = df.columns
sk_cols = sorted(cols[cols.str.startswith('SK')], key=splitter)
succ_cols = sorted(cols[cols.str.startswith('Succes')], key=splitter)
df.loc[df[sk_cols] == 1, succ_cols] += 1
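If the goal is literally what the question's loop does (write 1 + the row position into Succes_n wherever SK_n is 1), a per-column vectorized sketch could look like this, reusing the question's data_jeux and succ_len names:
import numpy as np

row_pos = np.arange(len(data_jeux))
for ids in succ_len:
    mask = (data_jeux['SK_%s' % ids] == 1).values
    data_jeux.loc[mask, 'Succes_%s' % ids] = 1 + row_pos[mask]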

Find first time a value occurs in the dataframe

I have a dataframe with year-quarter (e.g. 2015-Q4), the customer_ID, and amount booked, and many other columns irrelevant for now. I want to create a column that has the first time each customer made a booking. I tried this:
alldata.sort_values(by=['Total_Apps_Reseller_Bookings_USD', 'Year_Quarter'],
                    ascending=[1, 1],
                    inplace=True)
first_q = alldata[['Customer_ID', 'Year_Quarter']].groupby(by='Customer_ID').first()
but I am not sure it worked.
Also, I then want another column that tells me how many quarters after the first booking each booking was made. I failed using replace and a dictionary, so I used a merge. I create a numeric id for each quarter of booking and for the first quarter from above, and then subtract the two:
q_booking_num = pd.DataFrame({'Year_Quarter': x, 'First_Quarter_id': np.arange(28)})
alldata = pd.merge(alldata, q_booking_num, on='Year_Quarter', how='outer')
q_first_num = pd.DataFrame({'First_Quarter': x, 'First_Quarter_id': np.arange(28)})
alldata = pd.merge(alldata, q_first_num, on='First_Quarter', how='outer')
this doesn't seem to have worked at all as I see 'first quarters' that are after some bookings that were already made.
You need to specify which column to use for taking the first value:
first_q = (alldata[['Customer_ID', 'Year_Quarter']]
           .groupby(by='Customer_ID')
           .Year_Quarter
           .first()
           )
Here is some sample data for three customers:
df = pd.DataFrame({'customer_ID': [1,
                                   2, 2,
                                   3, 3, 3],
                   'Year_Quarter': ['2010-Q1',
                                    '2010-Q1', '2011-Q1',
                                    '2010-Q1', '2011-Q1', '2012-Q1'],
                   'Total_Apps_Reseller_Bookings_USD': [1,
                                                        2, 3,
                                                        4, 5, 6]})
Below, I convert text quarters (e.g. '2010-Q1') to a numeric equivalent by taking the int value of the first four characters (df.Year_Quarter.str[:4].astype(int)). I then multiply it by four and add the value of the quarter. This value is only used for differencing to determine the total number of quarters since the first order.
Next, I use transform on the groupby to take the min value of these quarters we just calculated. Using transform keeps this value in the same shape as the original dataframe.
I then calculate quarters_since_first_order as the difference between the quarter and the first quarter.
df['quarters'] = df.Year_Quarter.str[:4].astype(int) * 4 + df.Year_Quarter.str[-1].astype(int)
first_order_quarter_no = df.groupby('customer_ID').quarters.transform(min)
df['quarters_since_first_order'] = df['quarters'] - first_order_quarter_no
del df['quarters'] # Clean-up.
>>> df
Total_Apps_Reseller_Bookings_USD Year_Quarter customer_ID quarters_since_first_order
0 1 2010-Q1 1 0
1 2 2010-Q1 2 0
2 3 2011-Q1 2 4
3 4 2010-Q1 3 0
4 5 2011-Q1 3 4
5 6 2012-Q1 3 8
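As a side note (not part of the original answer), if what you want for part 1 is a first-booking column on alldata itself, transform keeps the per-customer first quarter aligned with the original rows; a sketch, assuming the 'YYYY-Qn' strings sort chronologically:
alldata['First_Quarter'] = (alldata.sort_values('Year_Quarter')
                            .groupby('Customer_ID')['Year_Quarter']
                            .transform('first'))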
For part 1:
I think you need to sort a little differently to get your desired outcome:
alldata.sort_values(by=['Customer_ID', 'Year_Quarter',
                        'Total_Apps_Reseller_Bookings_USD'],
                    ascending=[1, 1, 1], inplace=True)
first_q = alldata[['Customer_ID','Year_Quarter']].groupby(by='Customer_ID').head(1)
For part 2:
Continuing from part 1, you can merge the values back onto the original dataframe. At that point, you can write a custom function to subtract your date strings and then apply it to each row.
Something like:
def qt_sub(val, first):
    # both arguments are strings like '2015-Q4'
    year_dif = int(val[0:4]) - int(first[0:4])
    qt_dif = int(val[6]) - int(first[6])
    return 4 * year_dif + qt_dif
alldata['diff_from_first'] = alldata.apply(lambda x: qt_sub(x['Year_Quarter'],
                                                            x['First_Sale']),
                                           axis=1)
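The x['First_Sale'] column used above is assumed to come from merging the part-1 result back onto alldata; one possible sketch of that step:
first_q = (alldata[['Customer_ID', 'Year_Quarter']]
           .sort_values('Year_Quarter')
           .groupby('Customer_ID', as_index=False)
           .first()
           .rename(columns={'Year_Quarter': 'First_Sale'}))
alldata = alldata.merge(first_q, on='Customer_ID', how='left')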

pandas groupby - return min() along with time where min() occurs

My data is organized in multi-index dataframes. I am trying to groupby the "Sweep" index and return both the min (or max) in a specific time range, along with the time at which that time occurs.
Data looks like:
Time Primary Secondary BL LED
Sweep
Sweep1 0 0.00000 -28173.828125 -0.416565 -0.000305
1 0.00005 -27050.781250 -0.416260 0.000305
2 0.00010 -27490.234375 -0.415955 -0.002441
3 0.00015 -28222.656250 -0.416260 0.000305
4 0.00020 -28759.765625 -0.414429 -0.002136
Getting the min or max is very straightforward.
def find_groupby_peak(voltage_df, start_time, end_time, peak="min"):
    boolean_vr = (voltage_df.Time >= start_time) & (voltage_df.Time <= end_time)
    df_subset = voltage_df[boolean_vr]
    grouped = df_subset.groupby(level="Sweep")
    if peak == "min":
        peak = grouped.Primary.min()
    elif peak == "max":
        peak = grouped.Primary.max()
    return peak
Which gives (partial output):
Sweep
Sweep1 -92333.984375
Sweep10 -86523.437500
Sweep11 -85205.078125
Sweep12 -87109.375000
Sweep13 -77929.687500
But I need to time where those peaks occur as well. I know I could iterate over the output and find where in the original dataset those values occur, but that seems like a rather brute-force way to do it. I also could write a different function to apply to the grouped object that returns both the max and the time where that max occurs (at least in theory - haven't tried to do this, but I assume it's pretty straightforward).
Other than those two options, is there a simpler way to pass the outputs from grouped.Primary.min() (i.e. the peak values) to return where in Time those values occur?
You could consider using the transform function with groupby. If you had data that looked a bit like this:
import pandas as pd
sweep = ["sweep1", "sweep1", "sweep1", "sweep1",
"sweep2", "sweep2", "sweep2", "sweep2",
"sweep3", "sweep3", "sweep3", "sweep3",
"sweep4", "sweep4", "sweep4", "sweep4"]
Time = [0.009845, 0.002186, 0.006001, 0.00265,
0.003832, 0.005627, 0.002625, 0.004159,
0.00388, 0.008107, 0.00813, 0.004813,
0.003205, 0.003225, 0.00413, 0.001202]
Primary = [-2832.013203, -2478.839133, -2100.671551, -2057.188346,
-2605.402055, -2030.195497, -2300.209967, -2504.817095,
-2865.320903, -2456.0049, -2542.132906, -2405.657053,
-2780.140743, -2351.743053, -2232.340363, -2820.27356]
s_count = [ 0, 1, 2, 3,
0, 1, 2, 3,
0, 1, 2, 3,
0, 1, 2, 3]
df = pd.DataFrame({ 'Time' : Time,
'Primary' : Primary}, index = [sweep, s_count])
Then you could write a very simple transform function that will return for each group of data (grouped by the sweep index), the row at which the minimum value of 'Primary' is located. This you would do with simple boolean slicing. That would look like this:
def trans_function(df):
    return df[df.Primary == min(df.Primary)]
Then to use this function simply call it inside the transform method:
df.groupby(level = 0).transform(trans_function)
And that gives me the following output:
Primary Time
sweep1 0 -2832.013203 0.009845
sweep2 0 -2605.402055 0.003832
sweep3 0 -2865.320903 0.003880
sweep4 3 -2820.273560 0.001202
Obviously you could incorporate that into your function that is acting on some subset of the data, if that is what you require.
As an alternative you could index the group by using the argmin() function. I tried to do this with transform, but it was just returning the entire dataframe. I'm not sure why that should be; it does, however, work with apply:
def trans_function2(df):
    # note: in recent pandas versions argmin() returns a position, so idxmin() is the label-based equivalent
    return df.loc[df['Primary'].argmin()]
df.groupby(level = 0).apply(trans_function2)
That again gives me:
Primary Time
sweep1 -2832.013203 0.009845
sweep2 -2605.402055 0.003832
sweep3 -2865.320903 0.003880
sweep4 -2820.273560 0.001202
I'm not totally sure why this function does not work with transform - perhaps someone will enlighten us.
I do not know if this will work with your multi-index frame, but it is worth a try; working with:
>>> df
tag tick val
z C 2014-09-07 32
y C 2014-09-08 67
x A 2014-09-09 49
w A 2014-09-10 80
v B 2014-09-11 51
u B 2014-09-12 25
t C 2014-09-13 22
s B 2014-09-14 8
r A 2014-09-15 76
q C 2014-09-16 4
find the indexer using idxmax and then use .loc:
>>> i = df.groupby('tag')['val'].idxmax()
>>> df.loc[i]
tag tick val
w A 2014-09-10 80
v B 2014-09-11 51
y C 2014-09-08 67
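Applied to the question's multi-index frame, the same idxmin/idxmax idea could go inside the helper; a rough sketch, not tested against the original data:
def find_groupby_peak_loc(voltage_df, start_time, end_time, peak="min"):
    subset = voltage_df[(voltage_df.Time >= start_time) & (voltage_df.Time <= end_time)]
    grouped = subset.groupby(level="Sweep")["Primary"]
    idx = grouped.idxmin() if peak == "min" else grouped.idxmax()
    # rows where the per-sweep extreme occurs, with both the value and its time
    return subset.loc[idx, ["Time", "Primary"]]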
