Pandas: Difference between largest and smallest value within group

Given a data frame that looks like this
GROUP VALUE
1 5
2 2
1 10
2 20
1 7
I would like to compute the difference between the largest and smallest value within each group. That is, the result should be
GROUP DIFF
1 5
2 18
What is an easy way to do this in Pandas?
What is a fast way to do this in Pandas for a data frame with about 2 million rows and 1 million groups?

Using @unutbu's df. Per the timings below, @unutbu's solution is the best over large data sets.
import pandas as pd
import numpy as np
df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})
df.groupby('GROUP')['VALUE'].agg(np.ptp)
GROUP
1 5
2 18
Name: VALUE, dtype: int64
np.ptp returns the range of an array, i.e. its maximum minus its minimum (see the np.ptp docs).
Timing setups:
small df: the five-row frame above
large df:
df = pd.DataFrame(dict(GROUP=np.arange(1000000) % 100, VALUE=np.random.rand(1000000)))
large df, many groups:
df = pd.DataFrame(dict(GROUP=np.arange(1000000) % 10000, VALUE=np.random.rand(1000000)))
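The original timing plots are not reproduced here; a minimal sketch of how such a benchmark could be rerun with timeit is below (the function names are mine, and actual numbers will vary by machine and pandas version):
import timeit

import numpy as np
import pandas as pd

# "large df, many groups" setup from above
df = pd.DataFrame(dict(GROUP=np.arange(1000000) % 10000,
                       VALUE=np.random.rand(1000000)))

def ptp_agg():
    # aggregate each group with np.ptp (max - min)
    return df.groupby('GROUP')['VALUE'].agg(np.ptp)

def max_minus_min():
    # built-in 'max'/'min' aggregators, then subtract
    agg = df.groupby('GROUP')['VALUE'].agg(['max', 'min'])
    return agg['max'] - agg['min']

for fn in (ptp_agg, max_minus_min):
    print(fn.__name__, timeit.timeit(fn, number=10))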

groupby/agg generally performs best when you take advantage of the built-in aggregators such as 'max' and 'min'. So to obtain the difference, first compute the max and min and then subtract:
import pandas as pd
df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})
result = df.groupby('GROUP')['VALUE'].agg(['max','min'])
result['diff'] = result['max']-result['min']
print(result[['diff']])
yields
diff
GROUP
1 5
2 18
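As a variant, on pandas 0.25 or later the same computation can be written with named aggregation (a sketch; the vmax/vmin labels are my own):
result = df.groupby('GROUP')['VALUE'].agg(vmax='max', vmin='min')
result['diff'] = result['vmax'] - result['vmin']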

Note: this will get the job done, but @piRSquared's answer has faster methods.
You can use groupby() with apply(), taking the difference of max() and min() in each group:
df.groupby('GROUP')['VALUE'].apply(lambda g: g.max() - g.min())
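To get the exact GROUP/DIFF layout from the question, the resulting Series can be turned back into a frame, for example (a sketch):
diff = df.groupby('GROUP')['VALUE'].apply(lambda g: g.max() - g.min())
print(diff.reset_index(name='DIFF'))
#    GROUP  DIFF
# 0      1     5
# 1      2    18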

Related

Fastest way of filtering index values based on list values from multiple columns in a Pandas DataFrame?

I have a data frame like the one below. It has two columns, col1 and col2; from these two columns I want to filter a few values (combinations from two lists) and get their index. I wrote the logic for it, but it will be too slow when filtering a larger data frame. Is there any faster way to filter the data and get the list of indexes?
Data frame:-
import pandas as pd
d = {'col1': [11, 20,90,80,30], 'col2': [30, 40,50,60,90]}
df = pd.DataFrame(data=d)
print(df)
col1 col2
0 11 30
1 20 40
2 90 50
3 80 60
4 30 90
l1=[11,90,30]
l2=[30,50,90]
final_result=[]
for i, j in zip(l1, l2):
    res = df[(df['col1'] == i) & (df['col2'] == j)]
    final_result.append(res.index[0])
print(final_result)
[0, 2, 4]
You can just use the underlying NumPy array and create a boolean mask:
import numpy as np
mask = (df[['col1', 'col2']].values[:, None] == np.vstack([l1, l2]).T).all(-1).any(1)
# mask
# array([ True, False, True, False, True])
df.index[mask]
# prints
# Int64Index([0, 2, 4], dtype='int64')
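For readability, the same broadcasting comparison can be spelled out step by step; the intermediate names below are my own (a sketch):
import numpy as np

pairs = np.vstack([l1, l2]).T        # shape (3, 2): the wanted (col1, col2) pairs
rows = df[['col1', 'col2']].values   # shape (5, 2): the data
# broadcast rows (5, 1, 2) against pairs (3, 2) -> comparison of shape (5, 3, 2)
eq = rows[:, None] == pairs
# a row matches a pair if both columns are equal (all over the last axis),
# and the row is kept if it matches any of the pairs (any over the pair axis)
mask = eq.all(-1).any(1)
print(df.index[mask].tolist())       # [0, 2, 4]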
You can use:
condition_1 = df['col1'].astype(str).str.contains('|'.join(map(str, l1)))
condition_2 = df['col2'].astype(str).str.contains('|'.join(map(str, l2)))
final_result = df.loc[condition_1 & condition_2].index.to_list()
Here is one way to do it: merge the two DataFrames and keep the rows whose values exist in both.
# create a DF of the lists you would like to match against
df2 = pd.DataFrame({'col1': l1, 'col2': l2})
# merge the two DF
df3 = df.merge(df2, how='left',
               on=['col1', 'col2'], indicator='foundIn')
# keep the rows that are found in both
out = df3[df3['foundIn'].eq('both')].index.to_list()
out
[0, 2, 4]
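For reference, the merged frame with its indicator column would look roughly like this (a sketch of the intermediate result):
print(df3)
#    col1  col2    foundIn
# 0    11    30       both
# 1    20    40  left_only
# 2    90    50       both
# 3    80    60  left_only
# 4    30    90       both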

Scalable approach to turn values in a list into column values in a pandas dataframe in Python

I have a pandas dataframe which has only one column; the value of each cell in the column is a list/array of numbers. This list has length 100, and this length is consistent across all the cell values.
We need to convert each list item into its own column value, in other words build a dataframe which has 100 columns, where each column value is a list/array item.
Something like a single column of lists becomes a frame with one column per list item.
It can be done with iterrows() as shown below, but we have around 1.5 million rows and need a scalable solution, as iterrows() would take a lot of time.
cols = [f'col_{i}' for i in range(0, 4)]
df_inter = pd.DataFrame(columns=cols)
for index, row in df.iterrows():
    df_inter.loc[len(df_inter)] = row['message']
You can do this:
In [28]: df = pd.DataFrame({'message':[[1,2,3,4,5], [3,4,5,6,7]]})
In [29]: df
Out[29]:
message
0 [1, 2, 3, 4, 5]
1 [3, 4, 5, 6, 7]
In [30]: res = pd.DataFrame(df.message.tolist(), index= df.index)
In [31]: res
Out[31]:
0 1 2 3 4
0 1 2 3 4 5
1 3 4 5 6 7
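If the col_0, col_1, ... names from the question are wanted, they can be assigned afterwards (a sketch):
res.columns = [f'col_{i}' for i in range(res.shape[1])]
print(res.columns.tolist())
# ['col_0', 'col_1', 'col_2', 'col_3', 'col_4']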
I think this would work:
df.message.apply(pd.Series)
To use dask to scale (assuming it is installed):
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=8)
ddf.message.apply(pd.Series, meta={0: 'object'})
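dask is lazy, so the apply above only builds a task graph; calling compute() materializes the result as a pandas object. A sketch (note that meta is dask's description of the expected output, so for real data it should list every output column and dtype; the int64 entries below are an assumption for this toy example):
expanded = ddf.message.apply(
    pd.Series,
    meta={i: 'int64' for i in range(5)},  # assumed: one int64 column per list element
)
result = expanded.compute()  # executes the graph and returns a pandas DataFrame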

Pandas - Recode variable based on value_counts() results

This is my dataframe:
df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 3, 4],
'col2': [1, 3, 2, 4, 6, 5, 7]})
I am trying to recode values based on how often they appear in my dataset; here I want to relabel every value which occurs only once as "other". This is the desired output:
#desired
"col1": [1,1,1,2,2,"other", "other"]
I tried this but it did not work:
df["recoded"] = np.where(df["col1"].value_counts() > 1, df["col1"], "other")
My idea is to save the value counts, filter them and then loop over the result array, but this seems overly complicated. Is there an easy "pythonic/pandas" way to achieve this?
You are close - you need Series.map so that the Series has the same length as the original DataFrame:
df["recoded"] = np.where(df["col1"].map(df["col1"].value_counts()) > 1, df["col1"], "other")
Or use GroupBy.transform to broadcast the per-group counts from GroupBy.size:
df["recoded"] = np.where(df.groupby('col1')["col1"].transform('size') > 1,
df["col1"],
"other")
If you need to check for duplicates, use Series.duplicated with keep=False to return a mask of all duplicated values:
df["recoded"] = np.where(df["col1"].duplicated(keep=False), df["col1"], "other")
print(df)
   col1  col2 recoded
0     1     1       1
1     1     3       1
2     1     2       1
3     2     4       2
4     2     6       2
5     3     5   other
6     4     7   other

Normalize each column of a pandas DataFrame

Each column of the DataFrame needs its values to be normalized according to the value of the first element in that column.
for timestamp, prices in data.items():
    normalizedPrices = prices / prices[0]
    print(normalizedPrices)  # how do we update the DataFrame with this Series?
However, how do we update the DataFrame once we have created the normalized column of data? I believe if we do prices = normalizedPrices we are merely acting on a copy/view of the DataFrame rather than on the original DataFrame itself.
It might be simplest to normalize the entire DataFrame in one go (and avoid looping over rows/columns altogether):
>>> df = pd.DataFrame({'a': [2, 4, 5], 'b': [3, 9, 4]}, dtype=float)  # a DataFrame
>>> df
     a    b
0  2.0  3.0
1  4.0  9.0
2  5.0  4.0
>>> df = df.div(df.loc[0]) # normalise DataFrame and bind back to df
>>> df
a b
0 1.0 1.000000
1 2.0 3.000000
2 2.5 1.333333
Assign to data[col]:
for col in data:
    data[col] /= data[col].iloc[0]
Or, working on the underlying arrays (this divides every row by the first row):
data[0:] = data[0:].values / data[0:1].values

Operations on Pandas DataFrame Index

How can I easily perform an operation on a Pandas DataFrame Index? Let's say I create a DataFrame like so:
df = DataFrame(rand(5,3), index=[0, 1, 2, 4, 5])
and I want to find the mean sampling rate. The way I do this now doesn't seem quite right.
fs = 1./np.mean(np.diff(df.index.values.astype(float)))
I feel like there must be a better way to do this, but I can't figure it out.
Thanks for any help.
@BrenBarn is correct, it is better to make a column in the frame, but you can do this:
In [2]: df = DataFrame(np.random.rand(5,3), index=[0, 1, 2, 4, 5])
In [3]: df.index.to_series()
Out[3]:
0 0
1 1
2 2
4 4
5 5
dtype: int64
In [4]: s = df.index.to_series()
In [5]: 1./s.diff().mean()
Out[5]: 0.80000000000000004
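A sketch of the column-based variant @BrenBarn suggests, copying the index into an ordinary column first (the 'time' column name is my own):
df['time'] = df.index            # keep the sample times as a regular column
fs = 1./df['time'].diff().mean()
# fs == 0.8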
