Aggregation of pandas groupby objects - python

I am trying to aggregate some statistics from a groupby object on chunks of data. I have to chunk the data because there are many (18 million) rows. I want to find the number of rows in each group in each chunk, then sum them together. I can add the grouped counts together, but when a group is not present in one of the terms, the result is NaN. See this case:
>>> df = pd.DataFrame({'X': ['A','B','C','A','B','C','B','C','D','B','C','D'],
...                    'Y': range(12)})
>>> df
    X   Y
0   A   0
1   B   1
2   C   2
3   A   3
4   B   4
5   C   5
6   B   6
7   C   7
8   D   8
9   B   9
10  C  10
11  D  11
>>> df[0:6].groupby(['X']).count() + df[6:].groupby(['X']).count()
     Y
X
A  NaN
B    4
C    4
D  NaN
But I want to see:
>>> df[0:6].groupby(['X']).count() + df[6:].groupby(['X']).count()
   Y
X
A  2
B  4
C  4
D  2
Is there a good way to do this? Note that in the real code I am looping through a chunked iterator, grouping about a million rows at a time.

Call add and pass fill_value=0. You could iteratively add like this while chunking:
In [98]:
df = pd.DataFrame({'X': ['A','B','C','A','B','C','B','C','D','B','C','D'],
                   'Y': np.arange(12)})
df[0:6].groupby(['X']).count().add(df[6:].groupby(['X']).count(), fill_value=0)
Out[98]:
   Y
X
A  2
B  4
C  4
D  2
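For the real use case, the same idea extends to a loop over chunks. Here is a minimal sketch, assuming the chunks come from read_csv with chunksize (the file name, chunk size, and the grouping column 'X' are placeholders):

running = None
for chunk in pd.read_csv('big_file.csv', chunksize=1_000_000):  # hypothetical file
    counts = chunk.groupby('X').count()
    # add with fill_value=0 so groups missing from either side contribute 0, not NaN
    running = counts if running is None else running.add(counts, fill_value=0)
# fill_value promotes the result to float, so cast back to int at the end if desired
running = running.astype(int)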

Related

Comparing 2 columns group by group in pandas or python

I currently have a dataset where I am unsure how to compare whether the groups have similar values. Here is a sample of my dataset:
type  value
a     1
a     2
a     3
a     4
b     2
b     3
b     4
b     5
c     1
c     3
c     4
d     2
d     3
d     4
I want to know which rows are similar, in the sense that all the values in one type are present in another type. For example, type d has values 2, 3, 4 and type a has values 1, 2, 3, 4,
so d can be considered 'similar' to (or the same as) a, and I would like output that tells me d is similar to a.
Expected output should be like this
type  value  similarity
a     1      a is similar to b and d
a     2
a     3
a     4
b     2      b is similar to a and d
b     3
b     4
b     5
c     1      c is similar to a
c     3
c     4
d     2      d is similar to a and b
d     3
d     4
Not sure if this can be done in python or pandas, but guidance is really appreciated as I'm really lost and not sure where to begin.
The output also does not have to be what I just put as an example here; it can just be another csv that tells me which types are similar.
I would use set operations.
Assuming similarity means at least N items in common:
from itertools import combinations
# define minimum number of common items
N = 3
# aggregate as sets
s = df.groupby('type')['value'].agg(set)
# generate all combinations of sets
# and check if the intersection has at least N items
out = (pd.Series([len(a & b) >= N for a, b in combinations(s, 2)],
                 index=pd.MultiIndex.from_tuples(combinations(s.index, 2)))
      )
# concat and add the reversed combinations (a/b -> b/a)
# we could have used a product in the first part but this
# would have required performing the computations twice
similarity = (
    pd.concat([out, out.swaplevel()])
      .loc[lambda x: x].reset_index(-1)
      .groupby(level=0)['level_1'].apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)
# update the first row of each group with the string
df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)
print(df)
Output:
   type  value               similarity
0     a      1  a is similar to b, c, d
1     a      2                      NaN
2     a      3                      NaN
3     a      4                      NaN
4     b      2     b is similar to d, a
5     b      3                      NaN
6     b      4                      NaN
7     b      5                      NaN
8     c      1        c is similar to a
9     c      3                      NaN
10    c      4                      NaN
11    d      2     d is similar to a, b
12    d      3                      NaN
13    d      4                      NaN
Assuming similarity means that one set is a subset of the other:
from itertools import combinations
s = df.groupby('type')['value'].agg(set)
out = (pd.Series([a.issubset(b) or b.issubset(a) for a, b in combinations(s, 2)],
                 index=pd.MultiIndex.from_tuples(combinations(s.index, 2)))
      )
similarity = (
    pd.concat([out, out.swaplevel()])
      .loc[lambda x: x].reset_index(-1)
      .groupby(level=0)['level_1'].apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)
df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)
print(df)
Output:
   type  value            similarity
0     a      1  a is similar to c, d
1     a      2                   NaN
2     a      3                   NaN
3     a      4                   NaN
4     b      2     b is similar to d
5     b      3                   NaN
6     b      4                   NaN
7     b      5                   NaN
8     c      1     c is similar to a
9     c      3                   NaN
10    c      4                   NaN
11    d      2  d is similar to a, b
12    d      3                   NaN
13    d      4                   NaN
You can use:
# Group all rows and transform as set
df1 = df.groupby('type', as_index=False)['value'].agg(set)
# Get all combinations
df1 = df1.merge(df1, how='cross').query('type_x != type_y')
# Compute the intersection between sets
df1['similarity'] = [row.value_x.intersection(row.value_y)
                     for row in df1[['value_x', 'value_y']].itertuples()]
# Keep rows with at least 3 similarities then export report
sim = (df1.loc[df1['similarity'].str.len() >= 3].groupby('type_x')['type_y']
          .agg(', '.join).rename('similarity').rename_axis(index='type')
          .reset_index())
Output:
>>> sim
  type similarity
0    a    b, c, d
1    b       a, d
2    c          a
3    d       a, b
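Since a separate csv report is acceptable, the sim frame above can simply be written out; a minimal sketch (the file name is just a placeholder):

# write the per-type similarity report to its own csv file
sim.to_csv('similarity_report.csv', index=False)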

Replace specific values in a dataframe by column mean in pandas

I'm a python beginner and I'm trying to do some operations with dataframes that I usually do in the R language.
I have a large dataframe with 2592 rows and 205 columns, and I want to replace the 0.0 values with half the minimum value of their column.
An example with a random dataframe would be:
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(1)
>>> df = pd.DataFrame(np.random.randint(0,10, size=(3,5)), columns = ['A', 'B', 'C', 'D', 'E'])
>>> print(df)
   A  B  C  D  E
0  5  8  9  5  0
1  0  1  7  6  9
2  2  4  5  2  4
And the result I'm looking for is:
   A  B  C  D  E
0  5  8  9  5  2
1  1  1  7  6  9
2  2  4  5  2  4
Intuitively I would do it like this:
>>> for column in df:
...     for element in column:
...         if element == 0:
...             element = df[column].min()/2
But it doesn't work... any help?
Thank you!
Use DataFrame.mask and replace with each column's minimum (computed without the 0s) divided by 2:
df1 = df.mask(df.eq(0), df.replace(0, np.nan).min().div(2), axis=1)
print(df1)
   A  B  C  D  E
0  5  8  9  5  2
1  1  1  7  6  9
2  2  4  5  2  4
A more efficient solution is possible (thanks @mozway):
m = df.eq(0)
df1 = df.mask(m, df[~m].min().div(2), axis=1)
To make your "intuitive" approach work, this is how to do it.
Use a function to perform the logic you need and apply it column by column.
pandas' .apply is convenient and should be sufficiently fast for a frame of this size.
import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame(np.random.randint(0,10, size=(3,5)), columns = ['A', 'B', 'C', 'D', 'E'])
def make_half_minimum(value, dataseries):
    if value == 0:
        dataseries_ = dataseries[dataseries != 0]
        return dataseries_.min()/2
    else:
        return value

for column_name in df.columns:
    df[column_name] = df[column_name].apply(lambda x: make_half_minimum(x, df[column_name]))
print(df)
     A  B  C  D    E
0  5.0  8  9  5  2.0
1  1.0  1  7  6  9.0
2  2.0  4  5  2  4.0
[Finished in 521ms]

pandas DataFrame isin and following row

For a given DataFrame, sorted by b and index reset:
df = pd.DataFrame({'a': list('abcdef'),
                   'b': [0, 2, 7, 3, 9, 15]}
                  ).sort_values('b').reset_index(drop=True)

   a   b
0  a   0
1  b   2
2  d   3
3  c   7
4  e   9
5  f  15
and a list, v
v = list('adf')
I would like to pull out just the rows in v and the following row (if there is one), similar to grep -A1:
   a   b
0  a   0
1  b   2
2  d   3
3  c   7
5  f  15
I can do this by concatenating the index from isin and the index from isin plus one, like so:
df[df.index.isin(
    np.concatenate(
        (df[df['a'].isin(v)].index,
         df[df['a'].isin(v)].index + 1)))]
But this is long and not too easy to understand. Is there a better way?
You can combine the isin condition with its shift (the next row) to create the boolean mask you need:
df[df.a.isin(v).pipe(lambda x: x | x.shift())]
#    a   b
# 0  a   0
# 1  b   2
# 2  d   3
# 3  c   7
# 5  f  15
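If the pipe feels opaque, the same mask can be built explicitly; an equivalent sketch of the same idea (not part of the original answer):

# rows whose 'a' value is in v, or rows immediately after such a row
m = df['a'].isin(v)
df[m | m.shift(fill_value=False)]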

Delete pandas column if column name begins with a number

I have a pandas DataFrame with about 200 columns. Roughly, I want to do this
for col in df.columns:
    if col begins with a number:
        df.drop(col)
I'm not sure what the best practices are when it comes to handling pandas DataFrames, so how should I handle this? Will my pseudocode work, or is it not recommended to modify a pandas dataframe in a for loop?
I think the simplest approach is to select all columns that do not start with a number, using filter with a regex - ^ matches the start of the string and \D matches a non-digit:
df1 = df.filter(regex='^\D')
Similar alternative:
df1 = df.loc[:, df.columns.str.contains('^\D')]
Or inverse condition and select numbers:
df1 = df.loc[:, ~df.columns.str.contains('^\d')]
df1 = df.loc[:, ~df.columns.str[0].str.isnumeric()]
If you want to use your pseudocode:
for col in df.columns:
    if col[0].isnumeric():
        df = df.drop(col, axis=1)
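A close variant of that loop collects the offending columns first and drops them in a single call (the same logic, just restructured):

# gather columns whose names start with a digit, then drop them in one go
to_drop = [col for col in df.columns if col[0].isnumeric()]
df = df.drop(columns=to_drop)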
Sample:
df = pd.DataFrame({'2A':list('abcdef'),
                   '1B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D3':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
   1B 2A  C  D3  E  F
0   4  a  7   1  5  a
1   5  b  8   3  3  a
2   4  c  9   5  6  a
3   5  d  4   7  9  b
4   5  e  2   1  2  b
5   4  f  3   0  4  b

df1 = df.filter(regex='^\D')
print (df1)
   C  D3  E  F
0  7   1  5  a
1  8   3  3  a
2  9   5  6  a
3  4   7  9  b
4  2   1  2  b
5  3   0  4  b
An alternative can be this:
columns = [x for x in df.columns if not x[0].isdigit()]
df = df[columns]

Python method to compare 1 value_id against another columns' value_ids in separate dataframes?

I have 2 csv files. Each contains a data set with multiple columns and an ASSET_ID column. I used pandas to read each csv file in as df1 and df2. My problem has been trying to define a function that iterates over the ASSET_ID values in df1 and compares each value against all the ASSET_ID values in df2. From there I want to return all the rows of df1 whose ASSET_IDs matched df2. Any help would be appreciated; I've been working on this for hours with little to show for it. dtypes are float or int.
My configuration = windows xp, python 2.7 anaconda distribution
Create a boolean mask of the values; this will index the rows where the 2 dfs match. There is no need to iterate, and it is much faster.
Example:
# define a list of values
a = list('abcdef')
b = range(6)
df = pd.DataFrame({'X':pd.Series(a),'Y': pd.Series(b)})
# c has x values for 'a' and 'd' so these should not match
c = list('xbcxef')
df1 = pd.DataFrame({'X':pd.Series(c),'Y': pd.Series(b)})
print(df)
print(df1)
   X  Y
0  a  0
1  b  1
2  c  2
3  d  3
4  e  4
5  f  5

[6 rows x 2 columns]

   X  Y
0  x  0
1  b  1
2  c  2
3  x  3
4  e  4
5  f  5

[6 rows x 2 columns]

In [4]:
# now index your df using boolean condition on the values
df[df.X == df1.X]
Out[4]:
   X  Y
1  b  1
2  c  2
4  e  4
5  f  5

[4 rows x 2 columns]
EDIT:
So if you have Series of different lengths then that won't work, in which case you can use isin:
So create 2 dataframes of different lengths:
a = list('abcdef')
b = range(6)
d = range(10)
df = pd.DataFrame({'X':pd.Series(a),'Y': pd.Series(b)})
c = list('xbcxefxghi')
df1 = pd.DataFrame({'X':pd.Series(c),'Y': pd.Series(d)})
print(df)
print(df1)
   X  Y
0  a  0
1  b  1
2  c  2
3  d  3
4  e  4
5  f  5

[6 rows x 2 columns]

   X  Y
0  x  0
1  b  1
2  c  2
3  x  3
4  e  4
5  f  5
6  x  6
7  g  7
8  h  8
9  i  9

[10 rows x 2 columns]
Now use isin to select rows from df1 where the id's exist in df:
In [7]:
df1[df1.X.isin(df.X)]
Out[7]:
   X  Y
1  b  1
2  c  2
4  e  4
5  f  5

[4 rows x 2 columns]
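Back to the original setup, the same isin idea pulls the matching rows directly, and a merge can additionally bring in df2's columns. A sketch assuming both frames have an ASSET_ID column, as described in the question:

# rows of df1 whose ASSET_ID also appears in df2
matched = df1[df1['ASSET_ID'].isin(df2['ASSET_ID'])]

# or, if the columns from df2 are wanted alongside, an inner merge on the shared key
combined = df1.merge(df2, on='ASSET_ID', how='inner')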
