Pandas cumsum aggregate between column values - python

I have the following dataframe dfgeo:
x y z zt n k pv span geometry
0 6574878.210 4757530.610 1152.588 1 8 4 90 57.63876043929083 POINT (6574878.210 4757530.610)
1 6574919.993 4757570.314 1174.724 0 138.6733617172676 POINT (6574919.993 4757570.314)
2 6575020.518 4757665.839 1177.339 0 302.14812028088545 POINT (6575020.518 4757665.839)
3 6575239.548 4757873.972 1160.156 1 8 4 90 154.5778555448033 POINT (6575239.548 4757873.972)
4 6575351.603 4757980.452 1202.418 0 125.77721657819234 POINT (6575351.603 4757980.452)
5 6575442.780 4758067.093 1199.297 0 131.65377203050443 POINT (6575442.780 4758067.093)
6 6575538.217 4758157.782 1192.914 1 8 4 90 99.73509645559476 POINT (6575538.217 4758157.782)
7 6575594.625 4758240.033 1217.442 0 254.95055120769572 POINT (6575594.625 4758240.033)
8 6575738.820 4758450.289 1174.477 0 198.23448987983204 POINT (6575738.820 4758450.289)
I want to sum the values of the span column between rows where zt == 1:
def summarize(group):
    s = group['zt'].eq(1).cumsum()
    return group.groupby(s).agg(
        D=('span', 'sum')
    )

dfzp = summarize(dfgeo)
print(dfzp)
Print output:
zt
1 57.63876043929083138.6733617172676302.14812028...
2 154.5778555448033125.77721657819234131.6537720...
3 99.73509645559476254.95055120769572198.2344898...
4 137.49102047762113226.75941023488875102.731299...
5 223.552487532538871.61932167407961
6 217.28304840632796141.34049561326185237.708809...
The desired output is the sum of span for each block of rows between zt == 1 markers:
zt
1 498.44
2 412.007
3 (sum between zt==1 )
...

The span column is stored as strings, which is why sum concatenates the values instead of adding them. First use pd.to_numeric to convert the span column to a numeric dtype, then use Series.groupby on span with the cumulative-sum grouper and aggregate with sum:
df['span'] = pd.to_numeric(df['span'], errors='coerce')
s = df['span'].groupby(df['zt'].eq(1).cumsum()).sum()
Result:
print(s)
zt
1 498.460242
2 412.008844
3 552.920138
Name: span, dtype: float64
EDIT (For multiple columns):
cols = ['x', 'y', 'span']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
s = df[cols].groupby(df['zt'].eq(1).cumsum()).sum()
Result:
x y span
zt
1 1.972482e+07 1.427277e+07 57.638760
2 1.972603e+07 1.427392e+07 412.008844
3 1.972687e+07 1.427485e+07 552.920138
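If you would rather keep the summarize helper from the question, with its named aggregation, the numeric conversion can be folded into it. A minimal sketch, not part of the original answers, assuming dfgeo is the frame shown above:
import pandas as pd

def summarize(group):
    # span is stored as strings, so convert it first; invalid values become NaN
    group = group.assign(span=pd.to_numeric(group['span'], errors='coerce'))
    # start a new group at every row where zt == 1
    s = group['zt'].eq(1).cumsum()
    return group.groupby(s).agg(D=('span', 'sum'))

dfzp = summarize(dfgeo)  # column D now holds the numeric sums per block
print(dfzp)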

If the desired result is simply the sum of 'span' over the subset of 'dfgeo' where zt == 1, I would try:
a = dfgeo[dfgeo['zt']==1]
x = a['span'].sum()
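Note that this sums span only over the rows where zt == 1, and it assumes span is already numeric; if span is stored as strings, the same pd.to_numeric conversion applies here as well (a small sketch, not part of the original answer):
a = dfgeo[dfgeo['zt'] == 1]
x = pd.to_numeric(a['span'], errors='coerce').sum()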

Related

Pandas sum last n rows of df.count() results into one row

I am looking for a way to generate nice summary statistics of a dataframe. Consider the following example:
>> df = pd.DataFrame({"category":['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
>> df['category'].value_counts()
z 4
x 4
y 3
u 2
v 1
w 1
>> ??
count pct
z 4 27%
x 4 27%
y 3 20%
Other (3) 4 27%
The result sums up the value counts of the last n=3 rows, deletes them, and then adds them back as one row to the original value counts. Also, it would be nice to have everything as percentages. Any ideas how to implement this? Cheers!
For a DataFrame with percentages, slice the counts with Series.iloc, create the DataFrame with Series.to_frame, then add the new row and a new column filled with the percentages:
s = df['category'].value_counts()
n = 3
out = s.iloc[:-n].to_frame('count')
out.loc[f'Other ({n})'] = s.iloc[-n:].sum()
out['pct'] = out['count'].div(out['count'].sum()).apply(lambda x: f"{x:.0%}")
print(out)
count pct
z 4 27%
x 4 27%
y 3 20%
Other (3) 4 27%
I would use tail(-3) to get all the values except the first 3:
counts = df['category'].value_counts()
others = counts.tail(-3)
counts[f'Others ({len(others)})'] = others.sum()
counts.drop(others.index, inplace=True)
counts.to_frame(name='count').assign(pct=lambda d: d['count'].div(d['count'].sum()).mul(100).round())
Output:
count pct
z 4 27.0
x 4 27.0
y 3 20.0
Others (3) 4 27.0
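If the pct column should be formatted like the desired output (e.g. 27%), a percent string format can be applied instead of round; a small sketch building on the counts Series from this answer (the formatting choice is an assumption):
out = counts.to_frame(name='count')
out['pct'] = out['count'].div(out['count'].sum()).apply('{:.0%}'.format)
print(out)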
This snippet
df = pd.DataFrame({"category": ['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
cutoff_index = 3
category_counts = pd.DataFrame([df['category'].value_counts(),
                                df['category'].value_counts(normalize=True)],
                               index=["Count", "Percent"]).T.reset_index()
other_rows = category_counts[cutoff_index:].set_index("index")
category_counts = category_counts[:cutoff_index].set_index("index")
summary_table = pd.concat([category_counts,
                           pd.DataFrame(other_rows.sum(), columns=[f"Other ({len(other_rows)})"]).T])
summary_table = summary_table.astype({'Count': 'int'})
summary_table['Percent'] = summary_table['Percent'].apply(lambda x: "{0:.2f}%".format(x * 100))
print(summary_table)
will give you what you need. Also in a nice format;)
Count Percent
z 4 26.67%
x 4 26.67%
y 3 20.00%
Other (3) 4 26.67%

group column values with difference of 3(say) digit in python

I am new to Python. The problem statement is that we have the below data as a dataframe:
df = pd.DataFrame({'Diff':[1,1,2,3,4,4,5,6,7,7,8,9,9,10], 'value':[x,x,y,x,x,x,y,x,z,x,x,y,y,z]})
Diff value
1 x
1 x
2 y
3 x
4 x
4 x
5 y
6 x
7 z
7 x
8 x
9 y
9 y
10 z
We need to group the Diff column into bins of 3 (let's say), like 0-3, 3-6, 6-9, >=9, and count the value column per bin.
The expected output is like:
Diff x y z
0-3 2 1
3-6 3 1
6-9 3 1
>=9 2 1
Example
The example code in the question is wrong; anyone who wants to try this exercise can use the following code instead:
df = pd.DataFrame({'Diff': [1,1,2,3,4,4,5,6,7,7,8,9,9,10],
                   'value': 'x,x,y,x,x,x,y,x,z,x,x,y,y,z'.split(',')})
Code
labels = ['0-3', '3-6', '6-9', '>=9']
grouper = pd.cut(df['Diff'], bins=[0, 3, 6, 9, float('inf')], right=False, labels=labels)
pd.crosstab(grouper, df['value'])
output:
value x y z
Diff
0-3 2 1 0
3-6 3 1 0
6-9 3 0 1
>=9 0 2 1

In Pandas, how can I group by every N rows within a key, saving the last value of one column and calculating another based on all rows in that 'set'?

I have data that is structured like below with a time, category, active indicator and a numerical value.
Input
i time cat. active item_count
0 00:00:00 X TRUE 2
1 00:00:06 X FALSE 4
2 00:00:08 X TRUE 13
3 00:00:25 Y FALSE 11
4 00:01:10 Y TRUE 2
5 00:01:58 Y TRUE 6
6 00:02:53 Y TRUE 2
7 07:40:29 X FALSE 1
8 08:34:52 X FALSE 2
9 11:50:48 X TRUE 5
10 11:55:42 X TRUE 3
I want to calculate the rate of active items for every 2 rows within a category, and copy the time of the last row in each 2-row set to get this output:
Output
time cat. rate
00:00:06 X 0.33 (2/(2+4))
07:40:29 X 13/14
00:01:10 Y 2/13
00:02:53 Y 8/8
11:50:48 X 5/7
11:55:42 X 3/3
The 'sets' in the input would be the rows [[0,1], [2,7], [8,9], [10]] for category X and [[3,4],[5,6]] for category Y.
How would I set this up? Sort by category, then time, then step through every N items? I found GroupBy.nth while searching for a solution, though I am not sure if it applies here.
First create a helper Series with cumcount, pass it to another groupby and aggregate item_count with sum and time with last; lastly do some data cleaning with reset_index and rename.
Also, for the rate column you need to sum only the True (active) values and then divide from the right side with rdiv by the sum of all values:
g = df.groupby('cat.').cumcount() // 2
df1 = (df.groupby(['cat.', g], sort=False)
         .agg({'item_count': 'sum', 'time': 'last'}))
print (df1)
item_count time
cat.
X 0 6 00:00:06
1 14 07:40:29
Y 0 13 00:01:10
1 8 00:02:53
X 2 7 11:50:48
3 3 11:55:42
s = df[df['active']].groupby(['cat.', g], sort=False)['item_count'].sum()
print (s)
cat.
X 0 2
1 13
Y 0 2
1 8
X 2 5
3 3
Name: item_count, dtype: int64
df1['rate'] = df1.pop('item_count').rdiv(s, axis=0)
d= {'time_last':'time'}
df1 = df1.reset_index(level=1, drop=True).reset_index().rename(columns=d)
print (df1)
cat. time rate
0 X 00:00:06 0.333333
1 X 07:40:29 0.928571
2 Y 00:01:10 0.153846
3 Y 00:02:53 1.000000
4 X 11:50:48 0.714286
5 X 11:55:42 1.000000
Here is a way to do it. I'm not really using the tools that pandas provides, but it's a (seemingly) working solution until one based on pandas tools comes along.
def rate_dataframe(df):
    df_sorted = df.sort_values(['cat.', 'time', 'active'])
    prev_row = df_sorted.iloc[0]
    cat_count, active_count, not_active_count = 0, 0, 0
    ratio_rows = list()
    for _, row in df_sorted.iterrows():
        if row['active']:
            active_count += row['item_count']
        else:
            not_active_count += row['item_count']
        if cat_count == 1 and prev_row['cat.'] == row['cat.']:
            ratio = active_count / (active_count + not_active_count)
            ratio_rows.append([row['time'], row['cat.'], ratio])
            cat_count, active_count, not_active_count = 0, 0, 0
        elif cat_count == 0:
            cat_count += 1
        elif cat_count == 1 and prev_row['cat.'] != row['cat.']:
            # handle last row in cat if nbCatRows is odd
            if row['active']:
                active_count, not_active_count = row['item_count'], 0
            else:
                active_count, not_active_count = 0, row['item_count']
            ratio_rows.append([
                prev_row['time'],
                prev_row['cat.'],
                int(prev_row['active'])
            ])
        prev_row = row
    return pd.DataFrame(ratio_rows, columns=['time', 'cat.', 'rate'])
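A minimal usage sketch, assuming the input table from the question with the active column held as real booleans (the construction below is illustrative, not part of the original answer):
import pandas as pd

df = pd.DataFrame({
    'time': ['00:00:00', '00:00:06', '00:00:08', '00:00:25', '00:01:10', '00:01:58',
             '00:02:53', '07:40:29', '08:34:52', '11:50:48', '11:55:42'],
    'cat.': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'Y', 'X', 'X', 'X', 'X'],
    'active': [True, False, True, False, True, True, True, False, False, True, True],
    'item_count': [2, 4, 13, 11, 2, 6, 2, 1, 2, 5, 3],
})
print(rate_dataframe(df))  # one row of time, cat. and rate per 2-row set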

How to count the row values greater than a specific value in pandas?

For example, I have a Pandas DataFrame dff, and I want to count, for each row, the values greater than 0.
dff = pd.DataFrame(np.random.randn(9,3),columns=['a','b','c'])
dff
a b c
0 -0.047753 -1.172751 0.428752
1 -0.763297 -0.539290 1.004502
2 -0.845018 1.780180 1.354705
3 -0.044451 0.271344 0.166762
4 -0.230092 -0.684156 -0.448916
5 -0.137938 1.403581 0.570804
6 -0.259851 0.589898 0.099670
7 0.642413 -0.762344 -0.167562
8 1.940560 -1.276856 0.361775
I am using an inefficient approach. How can this be done more efficiently?
dff['count'] = 0
for m in range(len(dff)):
    og = 0
    for i in dff.columns:
        if dff[i][m] > 0:
            og += 1
    dff['count'][m] = og
dff
a b c count
0 -0.047753 -1.172751 0.428752 1
1 -0.763297 -0.539290 1.004502 1
2 -0.845018 1.780180 1.354705 2
3 -0.044451 0.271344 0.166762 2
4 -0.230092 -0.684156 -0.448916 0
5 -0.137938 1.403581 0.570804 2
6 -0.259851 0.589898 0.099670 2
7 0.642413 -0.762344 -0.167562 1
8 1.940560 -1.276856 0.361775 2
You can create a boolean mask of your DataFrame that is True wherever a value is greater than your threshold (in this case 0), and then sum along axis 1 to count the True values in each row.
dff.gt(0).sum(1)
0 1
1 1
2 2
3 2
4 0
5 2
6 2
7 1
8 2
dtype: int64
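To get the count column the question asks for, the row-wise sums can simply be assigned back to the frame; a minimal sketch, assuming dff contains only the numeric columns a, b and c:
import numpy as np
import pandas as pd

dff = pd.DataFrame(np.random.randn(9, 3), columns=['a', 'b', 'c'])
dff['count'] = dff.gt(0).sum(axis=1)  # per-row count of values greater than 0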

Conditional length of a binary data series in Pandas

Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to limit the length of each run of 1's by some limiting value? Let's say the limit is 2; then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
One fully-vectorized solution is to use the shift-groupby-cumsum-cumcount combination [1] to indicate where consecutive runs are shorter than 2 (or whatever limiting value you like). Then, & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A).astype(int)  # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
[1] See the pandas cookbook; the section on grouping, "Grouping like Python’s itertools.groupby".
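The hard-coded <= 1 corresponds to a limit of 2. A small sketch of the same idea wrapped in a function so the limiting value becomes a parameter (cap_runs is a hypothetical helper name, not from the answer):
import pandas as pd

def cap_runs(a, limit):
    # label each run of consecutive equal values
    run_id = (a != a.shift()).cumsum()
    # 0-based position of every row inside its run
    pos_in_run = a.groupby(run_id).cumcount()
    # keep a 1 only if it is one of the first `limit` elements of its run
    return ((pos_in_run < limit) & a.astype(bool)).astype(int)

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
df['B'] = cap_runs(df['A'], 2)  # reproduces the B column shown above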
Another way (checking if previous two are 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = list(map(lambda x: df['A'][x] if x < limit else int(not all(y == 1 for y in df['A'][x - limit:x])), range(len(df))))
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object, it can just be a normal Numpy array)
import numpy

a = df['A'].to_numpy(copy=True)  # an independent NumPy copy of the column
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
def trim_runs(array, cutoff):
    a = numpy.asarray(array)
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a
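A minimal usage sketch with the example column from the question; note that trim_runs writes into the array it is given, so passing an explicit copy keeps the original column intact (the copy is my assumption, not part of the answer):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
df['B'] = trim_runs(df['A'].to_numpy(copy=True), 2)  # runs of 1's longer than 2 are truncated to 2
print(df)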
