I am trying to do some aggregation on a multi-index DataFrame based on a DatetimeIndex generated from pandas.date_range.
My DatetimeIndex looks like this:
DatetimeIndex(['2000-05-30', '2000-05-31', '2000-06-01', ..., '2001-01-31'])
And my multi-index DataFrame looks like this:
               value
date       id
2000-05-31 1       0
           2       1
           3       1
2000-06-30 2       1
           3       0
           4       0
2000-07-30 2       1
           4       0
           1       0
2002-09-30 1       1
           3       1
The dates in the DataFrame's date index may or may not be present in the DatetimeIndex.
I need to retrieve every id for which the fraction of rows with value == 1 is greater than or equal to some decimal threshold (e.g. 0.6), counting only the rows whose date is in the DatetimeIndex.
For example, if the threshold is 0.5, then the output should be [2, 3] or some DataFrame containing 2 and 3.
id 1 does not meet the requirement because its only row with value == 1 falls on 2002-09-30, which is not in the DatetimeIndex.
I have a solution using loops and dictionaries to keep track of how often value == 1 for each id, but it runs very slowly.
How can I utilize pandas to perform this aggregation?
Thank you.
You can use:
#define the date range
rng = pd.date_range('2000-05-30', '2000-07-01')
#keep only rows whose date is in the range
df = df[df.index.get_level_values('date').isin(rng)]
#fraction of rows with value == 1 per id
s = df.groupby('id')['value'].mean()
print (s)
id
1 0.0
2 1.0
3 0.5
4 0.0
Name: value, dtype: float64
#get the ids that meet the threshold
a = s.index[s >= 0.5].tolist()
print (a)
[2, 3]
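For reference, here is a self-contained sketch of the same steps; the sample frame is reconstructed from the question, so the exact ids and values are assumed:

import pandas as pd

#sample data approximating the frame in the question (ids and values assumed)
df = pd.DataFrame({
    'date': pd.to_datetime(['2000-05-31', '2000-05-31', '2000-05-31',
                            '2000-06-30', '2000-06-30', '2000-06-30',
                            '2000-07-30', '2000-07-30', '2000-07-30',
                            '2002-09-30', '2002-09-30']),
    'id':    [1, 2, 3, 2, 3, 4, 2, 4, 1, 1, 3],
    'value': [0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1],
}).set_index(['date', 'id'])

rng = pd.date_range('2000-05-30', '2000-07-01')
#keep rows whose date is in rng, take the mean of value per id, apply the threshold
s = df[df.index.get_level_values('date').isin(rng)].groupby('id')['value'].mean()
print(s.index[s >= 0.5].tolist())   # [2, 3]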
I have a data frame like this:
A
0
1
0
2
and I would like to sum the values "so far" of the dataframe in a cumulative format, so if A increases by 1 then I would like the sum to increase by 1 as well, like so:
A Sum
0 0
1 1
0 1
2 2
I have to keep a record of when this change occurs for the analysis, so I can't just sum the entire column at once.
I thought about doing:
df = df.assign(A_before=df.A.shift(1))
df['change'] = (df.A - df.A_before)
df['sum'] = df['A'] + df['A_before']
but this only adds values within the same row; it does not accumulate the values from the previous rows.
Any solutions? Thank you.
You can use diff with cumsum:
df.A.diff().ge(1).cumsum()
0 0
1 1
2 1
3 2
Name: A, dtype: int64
df['sum'] = df.A.diff().ge(1).cumsum()
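Putting it together on the example data from the question (assuming the column is named A as above):

import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0, 2]})
#a row counts whenever A increased by at least 1 versus the previous row;
#cumsum turns those True/False flags into a running total
df['Sum'] = df.A.diff().ge(1).cumsum()
print(df)
#   A  Sum
#0  0    0
#1  1    1
#2  0    1
#3  2    2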
When I do df.isnull().sum(), I get the count of null values in a column. But the default axis for .sum() is None, or 0 - which should be summing across the columns.
Why does .sum() calculate the sum down the columns, instead of the rows, when the default says to sum across axis = 0?
Thanks!
I'm seeing the opposite behavior from what you described:
Sums across the columns
In [3309]: df1.isnull().sum(1)
Out[3309]:
0 0
1 1
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
dtype: int64
Sums down the columns
In [3310]: df1.isnull().sum()
Out[3310]:
date 0
variable 1
value 0
dtype: int64
Hmm, this is not the behavior I am seeing. Let's look at a small example.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [np.nan, np.nan, 3], 'B': [1, 1, 3]}, index=[*'abc'])
print(df)
print(df.isnull().sum())
print(df.sum())
Note the columns are uppercase 'A' and 'B', and the index (the row labels) is lowercase.
Output:
A B
a NaN 1
b NaN 1
c 3.0 3
A 2
B 0
dtype: int64
A 3.0
B 5.0
dtype: float64
Per docs:
axis : {index (0), columns (1)} Axis for the function to be applied on.
The axis parameter is perpendicular to the direction in which you want to sum: it names the axis that gets collapsed. With axis=0 (the default), each column is collapsed into a single value, so the sum runs down the rows and you get one result per column.
Unfortunately, the pandas documentation for sum doesn't currently make this clear, but the documentation for count does:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.count.html
Parameters
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.
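A minimal sketch of what that means in practice (the column names here are made up):

import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [10, 20]})
print(df.sum())        #axis=0 (default): collapses the rows, one result per column
#x     3
#y    30
#dtype: int64
print(df.sum(axis=1))  #axis=1: collapses the columns, one result per row
#0    11
#1    22
#dtype: int64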
Is there a Pandas function equivalent to the MS Excel fill handle?
It fills data down or extends a series if more than one cell is selected. My specific application is filling down with a set value in a specific column from a specific row in the dataframe, not necessarily filling a series.
This simple function essentially does what I want. I think it would be nice if ffill could be modified to fill in this way...
def fill_down(df, col, val, start, end=0, interval=1):
    #adds val to col for every interval-th row from start to end
    if not end:
        end = len(df)
    for i in range(start, end, interval):
        df[col].iloc[i] += val
    return df
As others commented, there isn't a direct equivalent of the Excel fill handle in pandas, but ffill gives the functionality you're looking for. You can also use ffill with groupby for more powerful functionality. For example:
>>> df
A B
0 12 1
1 NaN 1
2 4 2
3 NaN 2
>>> df.A = df.groupby('B').A.ffill()
>>> df
A B
0 12 1
1 12 1
2 4 2
3 4 2
Edit: If you don't have NaNs, you can always create NaNs where you want to fill down. For example:
>>> df
Out[8]:
A B
0 1 2
1 3 3
2 4 5
>>> df.replace(3, np.nan)
Out[9]:
A B
0 1.0 2.0
1 NaN NaN
2 4.0 5.0
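So one way to get fill-handle-like behavior without pre-existing NaNs is to blank the cells you want overwritten and then forward-fill; a minimal sketch (the row range and column name are assumed for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 4], 'B': [2, 3, 5]})
df.loc[1:, 'A'] = np.nan     #rows to "fill down" into (assumed)
df['A'] = df['A'].ffill()    #propagate the last valid value downward
print(df)
#     A  B
#0  1.0  2
#1  1.0  3
#2  1.0  5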
I have the code below that creates a summary table of missing values in each column of my data frame. I would like to build a similar table counting unique values, but DataFrame does not have a unique() method; it is only available on individual columns.
def missing_values_table(df):
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    return mis_val_table_ren_columns
(source: https://stackoverflow.com/a/39734251/7044473)
How can I accomplish the same for unique values?
You can use the nunique() method to get the unique count for all columns:
df = pd.DataFrame(np.random.randint(0, 3, (4, 3)))
print(df)
0 1 2
0 2 0 2
1 1 2 1
2 1 2 2
3 1 1 2
count=df.nunique()
print(count)
0 2
1 3
2 2
dtype: int64
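If you want a summary table in the same shape as missing_values_table from the question, a minimal sketch built on nunique (the function name is made up) could look like this:

import pandas as pd

def unique_values_table(df):
    #one row per column: distinct count plus distinct count as a share of all rows
    uniq = df.nunique()
    uniq_percent = 100 * uniq / len(df)
    table = pd.concat([uniq, uniq_percent], axis=1)
    return table.rename(columns={0: 'Unique Values', 1: '% of Total Values'})

Calling unique_values_table(df) on the example frame above returns one row per column with its distinct count and the corresponding percentage.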
You can create a series of unique value counts using the pd.unique function. For example:
>>> df = pd.DataFrame(np.random.randint(0, 3, (4, 3)))
>>> print(df)
0 1 2
0 2 0 2
1 1 2 1
2 1 2 2
3 1 1 2
>>> pd.Series({col: len(pd.unique(df[col])) for col in df})
0 2
1 3
2 2
dtype: int64
If you actually want the number of times each value appears in each column, you can do a similar thing with pd.value_counts:
>>> pd.DataFrame({col: pd.value_counts(df[col]) for col in df}).fillna(0)
0 1 2
0 0.0 1 0.0
1 3.0 1 1.0
2 1.0 2 3.0
This is not exactly what you asked for, but may be useful for your analysis.
def diversity_percentage(df, columns):
    """
    This function returns the number of different elements in each column as a
    percentage of the total elements in the group.
    A low value indicates there are many repeated elements.
    Example 1: a value of 0 indicates all values are the same.
    Example 2: a value of 100 indicates all values are different.
    """
    diversity = dict()
    for col in columns:
        diversity[col] = len(df[col].unique())
    diversity_series = pd.Series(diversity)
    return (100 * diversity_series / len(df)).sort_values()
>>> diversity_percentage(df, selected_columns)
operationdate 0.002803
payment 1.076414
description 16.933901
customer_id 17.536581
customer_name 48.895554
customer_email 62.129282
token 68.290632
id 100.000000
transactionid 100.000000
dtype: float64
However, you can always return diversity_series directly to obtain just the counts.
I have data of type pd.DataFrame which looks like the following:
type date sum
A Jan-1 1
A Jan-3 2
B Feb-1 1
B Feb-2 3
B Feb-5 6
The task is to build a continuous daily time series for each type (each missing date should be filled with 0).
The expected result is:
type date sum
A Jan-1 1
A Jan-2 0
A Jan-3 2
B Feb-1 1
B Feb-2 3
B Feb-3 0
B Feb-4 0
B Feb-5 6
Is it possible to do that with pandas or other Python tools?
The real dataset has millions of rows.
First convert the date column to datetime and move it into the index so you can take advantage of resampling; afterwards you can convert the dates back to their original format.
# change to datetime
df['date'] = pd.to_datetime(df.date, format="%b-%d")
df = df.set_index('date')
# resample to fill in missing dates
df1 = df.groupby('type').resample('d')['sum'].asfreq().fillna(0)
df1 = df1.reset_index()
# change back to original date format
df1['date'] = df1.date.dt.strftime('%b-%d')
Output:
type date sum
0 A Jan-01 1.0
1 A Jan-02 0.0
2 A Jan-03 2.0
3 B Feb-01 1.0
4 B Feb-02 3.0
5 B Feb-03 0.0
6 B Feb-04 0.0
7 B Feb-05 6.0
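End to end, a sketch reproducing the example above; the frame is reconstructed from the question, and note that with no year in the data pd.to_datetime defaults the year to 1900, which is harmless here since it is stripped off again by strftime:

import pandas as pd

df = pd.DataFrame({'type': ['A', 'A', 'B', 'B', 'B'],
                   'date': ['Jan-1', 'Jan-3', 'Feb-1', 'Feb-2', 'Feb-5'],
                   'sum':  [1, 2, 1, 3, 6]})

df['date'] = pd.to_datetime(df.date, format="%b-%d")
df = df.set_index('date')
#resample each type to daily frequency and fill the gaps with 0
df1 = df.groupby('type').resample('d')['sum'].asfreq().fillna(0)
df1 = df1.reset_index()
df1['date'] = df1.date.dt.strftime('%b-%d')
print(df1)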