I am grouping data in a pandas dataframe and using some aggregation functions to generate results data. Input data:
A B C D E F
0 aa 5 3 2 2 2
1 aa 3 2 2 3 3
2 ac 2 0 2 7 7
3 ac 9 2 3 8 8
4 ac 2 3 7 3 3
5 ad 0 0 0 1 1
6 ad 9 9 9 9 9
7 ad 6 6 6 6 6
8 ad 3 3 3 3 3
The pandas groupby object seems to operate on only one column at a time, but I want to generate the statistic on all columns in my df. For example, I can use grouped['C'].agg([np.mean, len]) to generate the statistics on column 'C', but what if I want to generate these statistics on all of columns B - F?
The output from this is:
A count_C mean_C
0 aa 2 2.500000
1 ac 3 1.666667
2 ad 4 4.500000
But what I want is:
A count_B mean_B count_C mean_C count_D mean_D etc...
0 aa 2 4.000000 2 2.500000 2 2.0 etc...
1 ac 3 4.333333 3 1.666667 3 4.0
2 ad 4 4.500000 4 4.500000 4 4.5
Is there any easy way to do the group by with aggregation in a single command? If not, is there an easy way to iterate over all columns and merge in new aggregation statistics results for each column?
Here's my full code so far:
import pandas as pd
import numpy as np
import pprint as pp
test_dataframe = pd.DataFrame({
'A' : ['aa', 'aa', 'ac', 'ac', 'ac', 'ad', 'ad', 'ad', 'ad'],
'B' : [5, 3, 2, 9, 2, 0, 9, 6, 3],
'C' : [3, 2, 0, 2, 3, 0, 9, 6, 3],
'D' : [2, 2, 2, 3, 7, 0, 9, 6, 3],
'E' : [2, 3, 7, 8, 3, 1, 9, 6, 3],
'F' : [2, 3, 7, 8, 3, 1, 9, 6, 3]
})
#group, aggregate, convert object to df, sort index
grouped = test_dataframe.groupby(['A'])
grouped_stats = grouped['C'].agg([np.mean, len])
grouped_stats = pd.DataFrame(grouped_stats).reset_index()
grouped_stats.rename(columns = {'mean':'mean_C', 'len':'count_C'}, inplace=True)
grouped_stats.sort_index(axis=1, inplace=True)
print "Input: "
pp.pprint(test_dataframe)
print "Output: "
pp.pprint(grouped_stats)
You don't have to call grouped['B'], grouped['C'], and so on one by one; simply pass your entire groupby object and pandas will apply the aggregate functions to all columns.
import pandas as pd
test_dataframe = pd.DataFrame({
'A' : ['aa', 'aa', 'ac', 'ac', 'ac', 'ad', 'ad', 'ad', 'ad'],
'B' : [5, 3, 2, 9, 2, 0, 9, 6, 3],
'C' : [3, 2, 0, 2, 3, 0, 9, 6, 3],
'D' : [2, 2, 2, 3, 7, 0, 9, 6, 3],
'E' : [2, 3, 7, 8, 3, 1, 9, 6, 3],
'F' : [2, 3, 7, 8, 3, 1, 9, 6, 3]
})
agg_funcs = ['count', 'mean']
test_dataframe = test_dataframe.groupby(['A']).agg(agg_funcs)
# flatten the resulting MultiIndex columns into count_B, mean_B, count_C, ...
columns = 'B C D E F'.split()
names = [y + '_' + x for x in columns for y in agg_funcs]
test_dataframe.columns = names
Out[89]:
count_B mean_B count_C mean_C count_D mean_D count_E mean_E count_F mean_F
A
aa 2 4.0000 2 2.5000 2 2.0 2 2.50 2 2.50
ac 3 4.3333 3 1.6667 3 4.0 3 6.00 3 6.00
ad 4 4.5000 4 4.5000 4 4.5 4 4.75 4 4.75
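If you'd rather not hardcode the column list, the MultiIndex that agg() returns can be flattened directly. A minimal sketch of that variant, assuming test_dataframe is the original frame from the question (before the reassignment above):
grouped = test_dataframe.groupby('A').agg(['count', 'mean'])
# grouped.columns is a MultiIndex of (column, function) pairs such as ('B', 'count')
grouped.columns = [f'{func}_{col}' for col, func in grouped.columns]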
Related
Given the dataframe:
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11], 'col4': [12, 13, 14, 15, 16]}
What is the easiest way to append the third column to the first and the fourth column to the second?
The result should look like:
d = {'col1': [1, 2, 3, 4, 7, 7, 8, 12, 1, 11], 'col2': [4, 5, 6, 9, 5, 12, 13, 14, 15, 16]}
I need to use this for a script with different column names, thus referencing columns by name is not possible. I have tried something along the lines of df.iloc[:,x] to achieve this.
You can use:
# pair the columns by position, rename each pair, then stack the pieces vertically
# (column-wise groupby with axis=1 is deprecated in recent pandas versions)
out = pd.concat([subdf.set_axis(['col1', 'col2'], axis=1)
                 for _, subdf in df.groupby(pd.RangeIndex(df.shape[1]) // 2, axis=1)])
print(out)
# Output
col1 col2
0 1 4
1 2 5
2 3 6
3 4 9
4 7 5
0 7 12
1 8 13
2 12 14
3 1 15
4 11 16
You can change the column names and concat:
pd.concat([df[['col1', 'col2']],
df[['col3', 'col4']].set_axis(['col1', 'col2'], axis=1)])
Add ignore_index=True to reset the index in the process.
Output:
col1 col2
0 1 4
1 2 5
2 3 6
3 4 9
4 7 5
0 7 12
1 8 13
2 12 14
3 1 15
4 11 16
Or, using numpy:
N = 2
pd.DataFrame(
    df.values.reshape((-1, df.shape[1] // N, N))  # split each row into pairs of N values
      .reshape(-1, N, order='F'),                 # Fortran order stacks the pairs vertically
    columns=df.columns[:N]
)
This may not be the most efficient solution, but you can do it using the pd.concat() function in pandas.
First convert your initial dict d into a pandas DataFrame and then apply the concat function:
import pandas as pd
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11], 'col4': [12, 13, 14, 15, 16]}
df = pd.DataFrame(d)
d_2 = {'col1': pd.concat([df.iloc[:, 0], df.iloc[:, 2]]), 'col2': pd.concat([df.iloc[:, 1], df.iloc[:, 3]])}
d_2 is your required dict. Convert it to a dataframe if you need to:
df_2 = pd.DataFrame(d_2)
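Since the real script has unknown column names, here is a purely positional sketch of the same idea, assuming an even number of columns whose right half should be stacked under the left half:
import pandas as pd
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11], 'col4': [12, 13, 14, 15, 16]}
df = pd.DataFrame(d)
half = df.shape[1] // 2
left = df.iloc[:, :half]
# rename the right half to match the left half by position, then stack vertically
right = df.iloc[:, half:].set_axis(left.columns, axis=1)
out = pd.concat([left, right], ignore_index=True)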
I have a dataframe like the one below and I want to add another column whose value is repeated until a certain condition is met.
sample_df = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'n' : [ 1, 2, 3],
'v' : [ 10, 13, 8],
'z' : [5, 3, 6],
'g' : [8, 8, 10]
})
Now I want to add another column which contains additional information about the dataframe. For instance, I want to repeat 'Yes' through id B, 'No' for ids C and G, and 'Maybe' for ids D and E.
The output I am expecting is as follows:
sample_df = pd.DataFrame(data={
'id': ['A', 'B', 'C','G','D','E'],
'n' : [ 1, 2, 3, 5, 5, 9],
'v' : [ 10, 13, 8, 8, 4 , 3],
'z' : [5, 3, 6, 9, 9, 8],
'New Info': ['Yes','Yes','No','No','Maybe','Maybe']
})
sample_df
id n v z New Info
0 A 1 10 5 Yes
1 B 2 13 3 Yes
2 C 3 8 6 No
3 G 5 8 9 No
4 D 5 4 9 Maybe
5 E 9 3 8 Maybe
How can I achieve this in Python?
You can use np.select to return results based on conditions. Since you were talking more about positional conditions, I used df.index:
import numpy as np
import pandas as pd
sample_df = pd.DataFrame(data={
'id': ['A', 'B', 'C', 'G', 'D', 'E'],
'n' : [1, 2, 3, 5, 5, 9],
'v' : [10, 13, 8, 8, 4, 3],
'z' : [5, 3, 6, 9, 9, 8]
})
sample_df['New Info'] = np.select([sample_df.index < 2, sample_df.index < 4], ['Yes', 'No'], 'Maybe')
sample_df
sample_df
Out[1]:
id n v z New Info
0 A 1 10 5 Yes
1 B 2 13 3 Yes
2 C 3 8 6 No
3 G 5 8 9 No
4 D 5 4 9 Maybe
5 E 9 3 8 Maybe
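If the rule is really about the id values rather than row positions, the same np.select pattern can be keyed on 'id' instead. A sketch, assuming the ids shown above:
conditions = [
    sample_df['id'].isin(['A', 'B']),  # the 'Yes' rows
    sample_df['id'].isin(['C', 'G']),  # the 'No' rows
]
sample_df['New Info'] = np.select(conditions, ['Yes', 'No'], default='Maybe')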
I have a dataframe as
df=pd.DataFrame({'A':[1, 3, 3, 4, 5, 3, 3],
'B':[0, 2, 3, 4, 5, 6, 7],
'C':[7, 2, 2, 5, 7, 2, 2]})
I would like to drop the duplicated values from columns A and C. However, I only want to drop duplicates that appear in consecutive rows.
If I use
df.drop_duplicates(subset=['A','C'], keep='first')
It will drop rows 2, 5, and 6. However, I only want to drop rows 2 and 6. The desired result is:
df=pd.DataFrame({'A':[1, 3, 4, 5, 3],
'B':[0, 2, 4, 5, 6],
'C':[7, 2, 5, 7, 2]})
Here's how you can do this, using shift:
# keep a row whenever A or C differs from the row directly above it
df.loc[(df[["A", "C"]].shift() != df[["A", "C"]]).any(axis=1)].reset_index(drop=True)
Output:
A B C
0 1 0 7
1 3 2 2
2 4 4 5
3 5 5 7
4 3 6 2
You can just keep every other occurrence of each (A, C) pair:
df=df.loc[df.groupby(["A", "C"]).cumcount()%2==0]
Outputs:
A B C
0 1 0 7
1 3 2 2
3 4 4 5
4 5 5 7
5 3 6 2
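Note that this keeps the first, third, fifth, ... occurrence of each (A, C) pair counted globally, which happens to match the desired output here; unlike the shift-based approach above, it does not actually test whether the duplicated rows are adjacent.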
I have a dataframe that contains some NaN-values in a t-column. The values in the t-column belong to a certain id and should be the same per id:
df = pd.DataFrame({"t" : [4, 4, 1, 1, float('nan'), 2, 2, 2, float('nan'), 10],
"id": [1, 1, 2, 2, 3, 3, 3 , 3, 4, 4]})
Therefore, I would like to overwrite the NaN values in t with the non-NaN t value for the respective id and ultimately end up with:
df = pd.DataFrame({"t" : [4, 4, 1, 1, 2, 2, 2, 2, 10, 10],
"id": [1, 1, 2, 2, 3, 3, 3 , 3, 4, 4]})
New strategy: create a map from id to t by dropping the NaN rows, then reassign using loc and a mask.
import pandas as pd
df = pd.DataFrame({"t" : [4, 4, 1, 1, float('nan'), 2, 2, 2, float('nan'), 10],
"id": [1, 1, 2, 2, 3, 3, 3 , 3, 4, 4]})
# create mask
m = pd.isna(df['t'])
# create a map from id to its non-NaN t value
d = df[~m].set_index('id')['t'].to_dict()
# assign map to the slice of the dataframe containing nan
df.loc[m,'t'] = df.loc[m,'id'].map(d)
print(df)
df returns:
id t
0 1 4.0
1 1 4.0
2 2 1.0
3 2 1.0
4 3 2.0
5 3 2.0
6 3 2.0
7 3 2.0
8 4 10.0
9 4 10.0
Use sort_values with groupby and transform('first') on the same column:
df['t'] = df.sort_values(['id','t']).groupby('id')['t'].transform('first')
An alternative solution is to map via a Series created by dropna with drop_duplicates:
df['t'] = df['id'].map(df.dropna(subset=['t']).drop_duplicates('id').set_index('id')['t'])
print (df)
id t
0 1 4.0
1 1 4.0
2 2 1.0
3 2 1.0
4 3 2.0
5 3 2.0
6 3 2.0
7 3 2.0
8 4 10.0
9 4 10.0
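Since groupby's 'first' aggregation already skips NaN within each group, a transform alone is also enough here. A minimal sketch, assuming each id carries a single distinct non-NaN t (as in the question's data):
# 'first' ignores NaN, so every row receives its group's non-missing t
df['t'] = df.groupby('id')['t'].transform('first')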
I have a data set in which a certain column is a combination of a couple of independent values, as in the example below:
id age marks
1 5 3,6,7
2 7 1,2
3 4 34,78,2
Thus the column by itself is composed of multiple values, and I need to pass the vector into a machine learning algorithm. I cannot really combine the values to assign a single value like:
3,6,7 => 1
1,2 => 2
34,78,2 => 3
making my new vector as
id age marks
1 5 1
2 7 2
3 4 3
and then subsequently pass it to the algorithm, as the number of such combinations would be infinite and also might not really capture the real meaning of the data.
How do I handle such a situation, where an individual feature is a combination of multiple values?
Note: the values in column marks are just examples; it could be any list of values, e.g. a list of integers, a list of strings, or a string composed of multiple strings separated by commas.
You can pd.factorize tuples
Assuming marks is a list
df
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 5 [3, 6, 7]
Apply tuple and factorize
df.assign(new=pd.factorize(df.marks.apply(tuple))[0] + 1)
id age marks new
0 1 5 [3, 6, 7] 1
1 2 7 [1, 2] 2
2 3 4 [34, 78, 2] 3
3 4 5 [3, 6, 7] 1
setup df
df = pd.DataFrame([
[1, 5, ['3', '6', '7']],
[2, 7, ['1', '2']],
[3, 4, ['34', '78', '2']],
[4, 5, ['3', '6', '7']]
], [0, 1, 2, 3], ['id', 'age', 'marks']
)
UPDATE: I think we can use CountVectorizer in this case:
assuming we have the following DF:
In [33]: df
Out[33]:
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 11 [3, 6, 7]
In [34]: %paste
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TreebankWordTokenizer
vect = CountVectorizer(ngram_range=(1,1), stop_words=None, tokenizer=TreebankWordTokenizer().tokenize)
X = vect.fit_transform(df.marks.apply(' '.join))
r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names())
## -- End pasted text --
Result:
In [35]: r
Out[35]:
1 2 3 34 6 7 78
0 0 0 1 0 1 1 0
1 1 1 0 0 0 0 0
2 0 1 0 1 0 0 1
3 0 0 1 0 1 1 0
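Note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use vect.get_feature_names_out() instead.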
OLD answer:
you can first convert your list to string and then categorize it:
In [119]: df
Out[119]:
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 11 [3, 6, 7]
In [120]: df['new'] = pd.Categorical(pd.factorize(df.marks.str.join('|'))[0])
In [121]: df
Out[121]:
id age marks new
0 1 5 [3, 6, 7] 0
1 2 7 [1, 2] 1
2 3 4 [34, 78, 2] 2
3 4 11 [3, 6, 7] 0
In [122]: df.dtypes
Out[122]:
id int64
age int64
marks object
new category
dtype: object
This will also work if marks is a column of strings:
In [124]: df
Out[124]:
id age marks
0 1 5 3,6,7
1 2 7 1,2
2 3 4 34,78,2
3 4 11 3,6,7
In [125]: df['new'] = pd.Categorical(pd.factorize(df.marks.str.join('|'))[0])
In [126]: df
Out[126]:
id age marks new
0 1 5 3,6,7 0
1 2 7 1,2 1
2 3 4 34,78,2 2
3 4 11 3,6,7 0
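A pandas-only alternative to the CountVectorizer approach is str.get_dummies, which builds one indicator column per distinct mark. A sketch, assuming marks is a comma-separated string (if it is a list of strings, join it first with df.marks.str.join(',')):
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'age': [5, 7, 4, 11],
                   'marks': ['3,6,7', '1,2', '34,78,2', '3,6,7']})
# one 0/1 column per distinct value appearing in 'marks'
indicators = df['marks'].str.get_dummies(sep=',')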
To access them as either [[x, y, z], [x, y, z]] or [[x, x], [y, y], [z, z]] (whichever is most appropriate for the function you need to call), use:
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(a=[1, 2, 3, 4], b=[3, 4, 3, 4], c=[[1,2,3], [1,2], [], [2]]))
df.values
list(zip(*df.values))
where
>>> df
a b c
0 1 3 [1, 2, 3]
1 2 4 [1, 2]
2 3 3 []
3 4 4 [2]
>>> df.values
array([[1, 3, [1, 2, 3]],
[2, 4, [1, 2]],
[3, 3, []],
[4, 4, [2]]], dtype=object)
>>> list(zip(*df.values))
[(1, 2, 3, 4), (3, 4, 3, 4), ([1, 2, 3], [1, 2], [], [2])]
To convert a column try this:
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(a=[1, 2], b=[3, 4], c=[[1,2,3], [1,2]]))
df['c'] = df['c'].apply(np.mean)  # assign back so the column is actually replaced
before:
>>> df
a b c
0 1 3 [1, 2, 3]
1 2 4 [1, 2]
after:
>>> df
a b c
0 1 3 2.0
1 2 4 1.5