I am looking for a way to generate nice summary statistics of a dataframe. Consider the following example:
>> df = pd.DataFrame({"category":['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
>> df['category'].value_counts()
z 4
x 4
y 3
u 2
v 1
w 1
>> ??
count pct
z 4 27%
x 4 27%
y 3 20%
Other (3) 4 27%
The result sums up the value counts of the last n=3 rows, removes them, and appends them as a single row to the original value counts. It would also be nice to have everything as percentages. Any ideas how to implement this? Cheers!
For a DataFrame with percentages, use Series.iloc for positional indexing, create the DataFrame with Series.to_frame, then add a new row and a new column filled with percentages:
s = df['category'].value_counts()
n = 3
out = s.iloc[:-n].to_frame('count')
out.loc[f'Other ({n})'] = s.iloc[-n:].sum()
out['pct'] = out['count'].div(out['count'].sum()).apply(lambda x: f"{x:.0%}")
print (out)
count pct
z 4 27%
x 4 27%
y 3 20%
Other (3) 4 27%
I would use tail(-3) to get all values except the first 3:
counts = df['category'].value_counts()
others = counts.tail(-3)
counts[f'Others ({len(others)})'] = others.sum()
counts.drop(others.index, inplace=True)
counts.to_frame(name='count').assign(pct=lambda d: d['count'].div(d['count'].sum()).mul(100).round())
Output:
count pct
z 4 27.0
x 4 27.0
y 3 20.0
Others (3) 4 27.0
This snippet
df = pd.DataFrame({"category":['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
cutoff_index = 3
category_counts = pd.DataFrame([df['category'].value_counts(), df['category'].value_counts(normalize=True)], index=["Count", "Percent"]).T.reset_index()
other_rows = category_counts[cutoff_index:].set_index("index")
category_counts = category_counts[:cutoff_index].set_index("index")
summary_table = pd.concat([category_counts, pd.DataFrame(other_rows.sum(), columns=[f"Other ({len(other_rows)})"]).T])
summary_table = summary_table.astype({'Count':'int'})
summary_table['Percent'] = summary_table['Percent'].apply(lambda x: "{0:.2f}%".format(x*100))
print(summary_table)
will give you what you need, in a nice format ;)
Count Percent
z 4 26.67%
x 4 26.67%
y 3 20.00%
Other (3) 4 26.67%
I have following dataframe dfgeo:
x y z zt n k pv span geometry
0 6574878.210 4757530.610 1152.588 1 8 4 90 57.63876043929083 POINT (6574878.210 4757530.610)
1 6574919.993 4757570.314 1174.724 0 138.6733617172676 POINT (6574919.993 4757570.314)
2 6575020.518 4757665.839 1177.339 0 302.14812028088545 POINT (6575020.518 4757665.839)
3 6575239.548 4757873.972 1160.156 1 8 4 90 154.5778555448033 POINT (6575239.548 4757873.972)
4 6575351.603 4757980.452 1202.418 0 125.77721657819234 POINT (6575351.603 4757980.452)
5 6575442.780 4758067.093 1199.297 0 131.65377203050443 POINT (6575442.780 4758067.093)
6 6575538.217 4758157.782 1192.914 1 8 4 90 99.73509645559476 POINT (6575538.217 4758157.782)
7 6575594.625 4758240.033 1217.442 0 254.95055120769572 POINT (6575594.625 4758240.033)
8 6575738.820 4758450.289 1174.477 0 198.23448987983204 POINT (6575738.820 4758450.289)
I want to sum the values of the span column between rows where zt==1:
def summarize(group):
    s = group['zt'].eq(1).cumsum()
    return group.groupby(s).agg(
        D=('span', 'sum')
    )

dfzp = summarize(dfgeo)
print(dfzp)
Print output:
zt
1 57.63876043929083138.6733617172676302.14812028...
2 154.5778555448033125.77721657819234131.6537720...
3 99.73509645559476254.95055120769572198.2344898...
4 137.49102047762113226.75941023488875102.731299...
5 223.552487532538871.61932167407961
6 217.28304840632796141.34049561326185237.708809...
The example desired output is the sum of span for each sub-dataframe between rows where zt==1:
zt
1 498.44
2 412.007
3 (sum between zt==1 )
...
First use pd.to_numeric to convert the dtype of column span to numeric type, then use Series.groupby on column span and aggregate using sum:
df['span'] = pd.to_numeric(df['span'], errors='coerce')
s = df['span'].groupby(df['zt'].eq(1).cumsum()).sum()
Result:
print(s)
zt
1 498.460242
2 412.008844
3 552.920138
Name: span, dtype: float64
EDIT (For multiple columns):
cols = ['x', 'y', 'span']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
s = df[cols].groupby(df['zt'].eq(1).cumsum()).sum()
Result:
x y span
zt
1 1.972482e+07 1.427277e+07 57.638760
2 1.972603e+07 1.427392e+07 412.008844
3 1.972687e+07 1.427485e+07 552.920138
If the desired result is the sum of 'span' on a subset of 'dfgeo', conditional on zt==1, I would try:
a = dfgeo[dfgeo['zt']==1]
x = a['span'].sum()
I have huge data with a lot of duplicates, so I want to remove every value that appears fewer than 5 times according to value_counts().
Anything like this, or less frequent, I want to remove.
If you want to remove values from the counts Series, use boolean indexing:
y = pd.Series(['a'] * 5 + ['b'] * 2 + ['c'] * 3 + ['d'] * 7)
s = y.value_counts()
out = s[s > 4]
print (out)
d 7
a 5
dtype: int64
If you want to remove values from the original Series, use Series.isin:
y1 = y[y.isin(out.index)]
print (y1)
0 a
1 a
2 a
3 a
4 a
10 d
11 d
12 d
13 d
14 d
15 d
16 d
dtype: object
Thank you Mr. jezrael, your answer is very helpful. I will add a small tip: after gathering the values, this is how you can replace the remaining ones with 'Other':
s = y.value_counts()
x = s[s > 5]
for z in y:
    if z not in x:
        y = y.replace([z], 'Other')
    else:
        continue
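For a large Series the same replacement can be done without the Python loop. A minimal vectorized sketch, assuming y is the original Series and the threshold of 5 used in the tip above:
keep = y.value_counts()
keep = keep[keep > 5].index             # values that occur often enough
y = y.where(y.isin(keep), 'Other')      # everything else becomes 'Other'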
from itertools import product
import pandas as pd
df = pd.DataFrame.from_records(product(range(10), range(10)))
df = df.sample(90)
df.columns = "c1 c2".split()
df = df.sort_values(df.columns.tolist()).reset_index(drop=True)
# c1 c2
# 0 0 0
# 1 0 1
# 2 0 2
# 3 0 3
# 4 0 4
# .. .. ..
# 85 9 4
# 86 9 5
# 87 9 7
# 88 9 8
# 89 9 9
#
# [90 rows x 2 columns]
How do I quickly find, identify, and remove the last duplicate of all symmetric pairs in this data frame?
An example of symmetric pair is that '(0, 1)' is equal to '(1, 0)'. The latter should be removed.
The algorithm must be fast, so using numpy is recommended. Converting to Python objects is not allowed.
You can sort the values, then groupby:
a = np.sort(df.to_numpy(), axis=1)
df.groupby([a[:,0], a[:,1]], as_index=False, sort=False).first()
Option 2: If you have a lot of pairs c1, c2, groupby can be slow. In that case, we can assign new values and filter by drop_duplicates:
a = np.sort(df.to_numpy(), axis=1)
(df.assign(one=a[:, 0], two=a[:, 1])   # one and two can be changed
   .drop_duplicates(['one', 'two'])    # taken from above
   .reindex(df.columns, axis=1)
)
One way is using np.unique with return_index=True and use the result to index the dataframe:
a = np.sort(df.values)
_, ix = np.unique(a, return_index=True, axis=0)
print(df.iloc[ix, :])
c1 c2
0 0 0
1 0 1
20 2 0
3 0 3
40 4 0
50 5 0
6 0 6
70 7 0
8 0 8
9 0 9
11 1 1
21 2 1
13 1 3
41 4 1
51 5 1
16 1 6
71 7 1
...
frozenset
mask = pd.Series(map(frozenset, zip(df.c1, df.c2))).duplicated()
df[~mask]
I will do
df[~pd.DataFrame(np.sort(df.values,1)).duplicated().values]
From pandas crosstab and numpy triu:
s = pd.crosstab(df.c1, df.c2)
s = s.mask(np.triu(np.ones(s.shape)).astype(bool) & s==0).stack().reset_index()
Here's one NumPy based one for integers -
def remove_symm_pairs(df):
    a = df.to_numpy(copy=False)
    b = np.sort(a, axis=1)
    idx = np.ravel_multi_index(b.T, (b.max(0)+1))
    sidx = idx.argsort(kind='mergesort')
    p = idx[sidx]
    m = np.r_[True, p[:-1] != p[1:]]
    a_out = a[np.sort(sidx[m])]
    df_out = pd.DataFrame(a_out)
    return df_out
If you want to keep the index data as it is, use return df.iloc[np.sort(sidx[m])].
For generic numbers (ints/floats, etc.), we will use a view-based one -
# https://stackoverflow.com/a/44999009/ #Divakar
def view1D(a):  # a is array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()
and simply replace the step to get idx with idx = view1D(b) in remove_symm_pairs.
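For reference, a minimal sketch of the generic variant with that substitution (the function name is illustrative; it assumes the view1D helper above is defined):
def remove_symm_pairs_generic(df):
    a = df.to_numpy(copy=False)
    b = np.sort(a, axis=1)
    idx = view1D(b)                      # view-based key instead of ravel_multi_index
    sidx = idx.argsort(kind='mergesort')
    p = idx[sidx]
    m = np.r_[True, p[:-1] != p[1:]]     # first occurrence of each key
    return pd.DataFrame(a[np.sort(sidx[m])])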
If this needs to be fast, and if your variables are integer, then the following trick may help: let v,w be the columns of your vector; construct [v+w, np.abs(v-w)] =: [x, y]; then sort this matrix lexicographically, remove duplicates, and finally map it back to [v, w] = [(x+y), (x-y)]/2.
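A minimal sketch of that trick on this frame (variable names are illustrative; it assumes the two columns hold non-negative integers small enough that v + w does not overflow):
v = df['c1'].to_numpy()
w = df['c2'].to_numpy()
key = np.stack([v + w, np.abs(v - w)], axis=1)      # symmetric pairs share the same key
_, ix = np.unique(key, axis=0, return_index=True)   # first occurrence of each key
deduped = df.iloc[np.sort(ix)]                      # keep original row order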
I have data that is structured like below with a time, category, active indicator and a numerical value.
Input
i time cat. active item_count
0 00:00:00 X TRUE 2
1 00:00:06 X FALSE 4
2 00:00:08 X TRUE 13
3 00:00:25 Y FALSE 11
4 00:01:10 Y TRUE 2
5 00:01:58 Y TRUE 6
6 00:02:53 Y TRUE 2
7 07:40:29 X FALSE 1
8 08:34:52 X FALSE 2
9 11:50:48 X TRUE 5
10 11:55:42 X TRUE 3
I want to calculate the rate of active items for every 2 rows within a category, and copy the time of the last row in each 2-row set to get this output:
Output
time cat. rate
00:00:06 X 0.33 (2/(2+4))
07:40:29 X 13/14
00:01:10 Y 2/13
00:02:53 Y 8/8
11:50:48 X 5/7
11:55:42 X 3/3
The 'sets' in the input would be the rows [[0,1], [2,7], [8,9], [10]] for category X and [[3,4],[5,6]] for category Y.
How would I set this up? Sort by category, then time, then step through every N items? I found GroupBy.nth while searching for a solution though am not sure if it applies here.
First create a helper Series with cumcount, pass it to another groupby and aggregate with sum and last, then do some data cleaning - reset_index with rename:
Also, for the rate column you need to sum only the True (active) values and divide from the right side with rdiv by the sum of all values.
g = df.groupby('cat.').cumcount() // 2
df1 = (df.groupby(['cat.', g], sort=False)
         .agg({'item_count': 'sum', 'time': 'last'}))
print (df1)
item_count time
cat.
X 0 6 00:00:06
1 14 07:40:29
Y 0 13 00:01:10
1 8 00:02:53
X 2 7 11:50:48
3 3 11:55:42
s = df[df['active']].groupby(['cat.', g], sort=False)['item_count'].sum()
print (s)
cat.
X 0 2
1 13
Y 0 2
1 8
X 2 5
3 3
Name: item_count, dtype: int64
df1['rate'] = df1.pop('item_count').rdiv(s, axis=0)
d= {'time_last':'time'}
df1 = df1.reset_index(level=1, drop=True).reset_index().rename(columns=d)
print (df1)
cat. time rate
0 X 00:00:06 0.333333
1 X 07:40:29 0.928571
2 Y 00:01:10 0.153846
3 Y 00:02:53 1.000000
4 X 11:50:48 0.714286
5 X 11:55:42 1.000000
Here is a way to do it. I'm not really using the tools pandas provides, but it's a (seemingly) working solution until one using pandas tools comes along.
def rate_dataframe(df):
    df_sorted = df.sort_values(['cat.', 'time', 'active'])
    prev_row = df_sorted.iloc[0]
    cat_count, active_count, not_active_count = 0, 0, 0
    ratio_rows = list()
    for _, row in df_sorted.iterrows():
        if row['active']:
            active_count += row['item_count']
        else:
            not_active_count += row['item_count']
        if cat_count == 1 and prev_row['cat.'] == row['cat.']:
            ratio = active_count / (active_count + not_active_count)
            ratio_rows.append([row['time'], row['cat.'], ratio])
            cat_count, active_count, not_active_count = 0, 0, 0
        elif cat_count == 0:
            cat_count += 1
        elif cat_count == 1 and prev_row['cat.'] != row['cat.']:
            # handle last row in cat if nbCatRows is odd
            if row['active']:
                active_count, not_active_count = row['item_count'], 0
            else:
                active_count, not_active_count = 0, row['item_count']
            ratio_rows.append([
                prev_row['time'],
                prev_row['cat.'],
                int(prev_row['active'])
            ])
        prev_row = row
    return pd.DataFrame(ratio_rows, columns=['time', 'cat.', 'rate'])
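Called on the example data, something like the following should produce the rate table (assuming df holds the input above with a boolean active column):
dfr = rate_dataframe(df)
print(dfr)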
I want to compute the "carryover" of a series. This computes a value for each row and then adds it to the previously computed value (for the previous row).
How do I do this in pandas?
decay = 0.5
test = pd.DataFrame(np.random.randint(1,10,12),columns = ['val'])
test
val
0 4
1 5
2 7
3 9
4 1
5 1
6 8
7 7
8 3
9 9
10 7
11 2
decayed = []
for i, v in test.iterrows():
    if i == 0:
        decayed.append(v.val)
        continue
    d = decayed[i-1] + v.val*decay
    decayed.append(d)

test['loop_decay'] = decayed
test.head()
val loop_decay
0 4 4.0
1 5 6.5
2 7 10.0
3 9 14.5
4 1 15.0
Consider a vectorized version with cumsum() where you cumulatively sum (val * decay) with the very first val.
However, you then need to subtract the very first (val * decay) since cumsum() includes it:
test['loop_decay'] = test['val'].iloc[0] + (test['val']*decay).cumsum() - (test['val'].iloc[0]*decay)
You can utilize pd.Series.shift() to create a dataframe with val[i] and val[i-1] and then apply your function across a single axis (1 in this case):
# Create a series that shifts the rows by 1
test['val2'] = test.val.shift()
# Set the first row of the shifted series to 0
test.loc[0, 'val2'] = 0
# Apply the decay formula:
test['loop_decay'] = test.apply(lambda x: x['val'] + x['val2'] * 0.5, axis=1)