I am grouping data in a pandas dataframe and using some aggregation functions to generate results data. Input data:
A B C D E F
0 aa 5 3 2 2 2
1 aa 3 2 2 3 3
2 ac 2 0 2 7 7
3 ac 9 2 3 8 8
4 ac 2 3 7 3 3
5 ad 0 0 0 1 1
6 ad 9 9 9 9 9
7 ad 6 6 6 6 6
8 ad 3 3 3 3 3
The pandas groupby object seems to operate on only one column at a time, but I want to generate the statistic on all columns in my df. For example, I can use grouped['C'].agg([np.mean, len]) to generate the statistics on column 'C', but what if I want to generate these statistics on all of columns B - F?
The output from this is:
A count_C mean_C
0 aa 2 2.500000
1 ac 3 1.666667
2 ad 4 4.500000
But what I want is:
A count_B mean_B count_C mean_C count_D mean_D etc...
0 aa 2 4.000000 2 2.500000 2 2.0 etc...
1 ac 3 4.333333 3 1.666667 3 4.0
2 ad 4 4.500000 4 4.500000 4 4.5
Is there any easy way to do the group by with aggregation in a single command? If not, is there an easy way to iterate over all columns and merge in new aggregation statistics results for each column?
Here's my full code so far:
import pandas as pd
import numpy as np
import pprint as pp
test_dataframe = pd.DataFrame({
'A' : ['aa', 'aa', 'ac', 'ac', 'ac', 'ad', 'ad', 'ad', 'ad'],
'B' : [5, 3, 2, 9, 2, 0, 9, 6, 3],
'C' : [3, 2, 0, 2, 3, 0, 9, 6, 3],
'D' : [2, 2, 2, 3, 7, 0, 9, 6, 3],
'E' : [2, 3, 7, 8, 3, 1, 9, 6, 3],
'F' : [2, 3, 7, 8, 3, 1, 9, 6, 3]
})
#group, aggregate, convert object to df, sort index
grouped = test_dataframe.groupby(['A'])
grouped_stats = grouped['C'].agg([np.mean, len])
grouped_stats = pd.DataFrame(grouped_stats).reset_index()
grouped_stats.rename(columns = {'mean':'mean_C', 'len':'count_C'}, inplace=True)
grouped_stats.sort_index(axis=1, inplace=True)
print "Input: "
pp.pprint(test_dataframe)
print "Output: "
pp.pprint(grouped_stats)
You don't have to call grouped['B'], grouped['C'], and so on one by one; simply pass your entire groupby object and pandas will apply the aggregate functions to all columns.
import pandas as pd
test_dataframe = pd.DataFrame({
'A' : ['aa', 'aa', 'ac', 'ac', 'ac', 'ad', 'ad', 'ad', 'ad'],
'B' : [5, 3, 2, 9, 2, 0, 9, 6, 3],
'C' : [3, 2, 0, 2, 3, 0, 9, 6, 3],
'D' : [2, 2, 2, 3, 7, 0, 9, 6, 3],
'E' : [2, 3, 7, 8, 3, 1, 9, 6, 3],
'F' : [2, 3, 7, 8, 3, 1, 9, 6, 3]
})
agg_funcs = ['count', 'mean']
test_dataframe = test_dataframe.groupby(['A']).agg(agg_funcs)
# flatten the resulting MultiIndex columns into count_B, mean_B, count_C, ...
columns = 'B C D E F'.split()
names = [y + '_' + x for x in columns for y in agg_funcs]
test_dataframe.columns = names
Out[89]:
count_B mean_B count_C mean_C count_D mean_D count_E mean_E count_F mean_F
A
aa 2 4.0000 2 2.5000 2 2.0 2 2.50 2 2.50
ac 3 4.3333 3 1.6667 3 4.0 3 6.00 3 6.00
ad 4 4.5000 4 4.5000 4 4.5 4 4.75 4 4.75
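If you'd rather not hardcode the column list, the MultiIndex that agg() returns can be flattened directly. A minimal sketch of that variant, assuming test_dataframe is the original frame from the question (before the reassignment above):
grouped = test_dataframe.groupby('A').agg(['count', 'mean'])
# grouped.columns is a MultiIndex of (column, function) pairs such as ('B', 'count')
grouped.columns = [f'{func}_{col}' for col, func in grouped.columns]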
Related
Given the dataframe:
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11], 'col4': [12, 13, 14, 15, 16]}
What is the easiest way to append the third column to the first and the fourth column to the second?
The result should look like:
d = {'col1': [1, 2, 3, 4, 7, 7, 8, 12, 1, 11], 'col2': [4, 5, 6, 9, 5, 12, 13, 14, 15, 16]}
I need to use this for a script with different column names, thus referencing columns by name is not possible. I have tried something along the lines of df.iloc[:,x] to achieve this.
You can use:
# pair the columns by position, rename each pair, then stack the pieces vertically
# (column-wise groupby with axis=1 is deprecated in recent pandas versions)
out = pd.concat([subdf.set_axis(['col1', 'col2'], axis=1)
                 for _, subdf in df.groupby(pd.RangeIndex(df.shape[1]) // 2, axis=1)])
print(out)
# Output
col1 col2
0 1 4
1 2 5
2 3 6
3 4 9
4 7 5
0 7 12
1 8 13
2 12 14
3 1 15
4 11 16
You can change the column names and concat:
pd.concat([df[['col1', 'col2']],
df[['col3', 'col4']].set_axis(['col1', 'col2'], axis=1)])
Add ignore_index=True to reset the index in the process.
Output:
col1 col2
0 1 4
1 2 5
2 3 6
3 4 9
4 7 5
0 7 12
1 8 13
2 12 14
3 1 15
4 11 16
Or, using numpy:
N = 2
pd.DataFrame(
    df.values.reshape((-1, df.shape[1] // N, N))  # split each row into pairs of N values
      .reshape(-1, N, order='F'),                 # Fortran order stacks the pairs vertically
    columns=df.columns[:N]
)
This may not be the most efficient solution, but you can do it using the pd.concat() function in pandas.
First convert your initial dict d into a pandas DataFrame and then apply the concat function:
import pandas as pd
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11], 'col4': [12, 13, 14, 15, 16]}
df = pd.DataFrame(d)
d_2 = {'col1': pd.concat([df.iloc[:, 0], df.iloc[:, 2]]), 'col2': pd.concat([df.iloc[:, 1], df.iloc[:, 3]])}
d_2 is your required dict. Convert it to a dataframe if you need to:
df_2 = pd.DataFrame(d_2)
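Since the real script has unknown column names, here is a purely positional sketch of the same idea, assuming an even number of columns whose right half should be stacked under the left half:
import pandas as pd
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11], 'col4': [12, 13, 14, 15, 16]}
df = pd.DataFrame(d)
half = df.shape[1] // 2
left = df.iloc[:, :half]
# rename the right half to match the left half by position, then stack vertically
right = df.iloc[:, half:].set_axis(left.columns, axis=1)
out = pd.concat([left, right], ignore_index=True)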
I have a dataframe like the one below and I want to add another column whose value is repeated until a certain condition is met.
sample_df = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'n' : [ 1, 2, 3],
'v' : [ 10, 13, 8],
'z' : [5, 3, 6],
'g' : [8, 8, 10]
})
Now I want to add another column which contains additional information about the dataframe. For instance, I want to repeat 'Yes' through id B, 'No' for ids C and G, and 'Maybe' for ids D and E.
The output I am expecting is as follows:
sample_df = pd.DataFrame(data={
'id': ['A', 'B', 'C','G','D','E'],
'n' : [ 1, 2, 3, 5, 5, 9],
'v' : [ 10, 13, 8, 8, 4 , 3],
'z' : [5, 3, 6, 9, 9, 8],
'New Info': ['Yes','Yes','No','No','Maybe','Maybe']
})
sample_df
id n v z New Info
0 A 1 10 5 Yes
1 B 2 13 3 Yes
2 C 3 8 6 No
3 G 5 8 9 No
4 D 5 4 9 Maybe
5 E 9 3 8 Maybe
How can I achieve this in Python?
You can use np.select to return results based on conditions. Since you were talking more about positional conditions, I used df.index:
import numpy as np
import pandas as pd
sample_df = pd.DataFrame(data={
'id': ['A', 'B', 'C', 'G', 'D', 'E'],
'n' : [1, 2, 3, 5, 5, 9],
'v' : [10, 13, 8, 8, 4, 3],
'z' : [5, 3, 6, 9, 9, 8]
})
sample_df['New Info'] = np.select([sample_df.index < 2, sample_df.index < 4], ['Yes', 'No'], 'Maybe')
sample_df
sample_df
Out[1]:
id n v z New Info
0 A 1 10 5 Yes
1 B 2 13 3 Yes
2 C 3 8 6 No
3 G 5 8 9 No
4 D 5 4 9 Maybe
5 E 9 3 8 Maybe
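If the rule is really about the id values rather than row positions, the same np.select pattern can be keyed on 'id' instead. A sketch, assuming the ids shown above:
conditions = [
    sample_df['id'].isin(['A', 'B']),  # the 'Yes' rows
    sample_df['id'].isin(['C', 'G']),  # the 'No' rows
]
sample_df['New Info'] = np.select(conditions, ['Yes', 'No'], default='Maybe')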
I have a dataframe as
df=pd.DataFrame({'A':[1, 3, 3, 4, 5, 3, 3],
'B':[0, 2, 3, 4, 5, 6, 7],
'C':[7, 2, 2, 5, 7, 2, 2]})
I would like to drop the duplicated values from columns A and C. However, I only want to drop duplicates that appear in consecutive rows.
If I use
df.drop_duplicates(subset=['A','C'], keep='first')
It will drop rows 2, 5, and 6. However, I only want to drop rows 2 and 6. The desired result is:
df=pd.DataFrame({'A':[1, 3, 4, 5, 3],
'B':[0, 2, 4, 5, 6],
'C':[7, 2, 5, 7, 2]})
Here's how you can do this, using shift:
# keep a row whenever A or C differs from the row directly above it
df.loc[(df[["A", "C"]].shift() != df[["A", "C"]]).any(axis=1)].reset_index(drop=True)
Output:
A B C
0 1 0 7
1 3 2 2
2 4 4 5
3 5 5 7
4 3 6 2
You can just keep every other occurrence of each (A, C) pair:
df=df.loc[df.groupby(["A", "C"]).cumcount()%2==0]
Outputs:
A B C
0 1 0 7
1 3 2 2
3 4 4 5
4 5 5 7
5 3 6 2
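Note that this keeps the first, third, fifth, ... occurrence of each (A, C) pair counted globally, which happens to match the desired output here; unlike the shift-based approach above, it does not actually test whether the duplicated rows are adjacent.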
I have a dataframe that contains some NaN-values in a t-column. The values in the t-column belong to a certain id and should be the same per id:
df = pd.DataFrame({"t" : [4, 4, 1, 1, float('nan'), 2, 2, 2, float('nan'), 10],
"id": [1, 1, 2, 2, 3, 3, 3 , 3, 4, 4]})
Therefore, I would like to overwrite the NaN values in t with the non-NaN t value for the respective id and ultimately end up with:
df = pd.DataFrame({"t" : [4, 4, 1, 1, 2, 2, 2, 2, 10, 10],
"id": [1, 1, 2, 2, 3, 3, 3 , 3, 4, 4]})
New strategy: create a map from id to t by dropping the NaN rows, then reassign using loc and a mask.
import pandas as pd
df = pd.DataFrame({"t" : [4, 4, 1, 1, float('nan'), 2, 2, 2, float('nan'), 10],
"id": [1, 1, 2, 2, 3, 3, 3 , 3, 4, 4]})
# create mask
m = pd.isna(df['t'])
# create a map from id to its non-NaN t value
d = df[~m].set_index('id')['t'].to_dict()
# assign map to the slice of the dataframe containing nan
df.loc[m,'t'] = df.loc[m,'id'].map(d)
print(df)
df returns:
id t
0 1 4.0
1 1 4.0
2 2 1.0
3 2 1.0
4 3 2.0
5 3 2.0
6 3 2.0
7 3 2.0
8 4 10.0
9 4 10.0
Use sort_values with groupby and transform('first') on the same column:
df['t'] = df.sort_values(['id','t']).groupby('id')['t'].transform('first')
An alternative solution is to map via a Series created by dropna with drop_duplicates:
df['t'] = df['id'].map(df.dropna(subset=['t']).drop_duplicates('id').set_index('id')['t'])
print (df)
id t
0 1 4.0
1 1 4.0
2 2 1.0
3 2 1.0
4 3 2.0
5 3 2.0
6 3 2.0
7 3 2.0
8 4 10.0
9 4 10.0
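Since groupby's 'first' aggregation already skips NaN within each group, a transform alone is also enough here. A minimal sketch, assuming each id carries a single distinct non-NaN t (as in the question's data):
# 'first' ignores NaN, so every row receives its group's non-missing t
df['t'] = df.groupby('id')['t'].transform('first')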
I have a data set in which a certain column is a combination of a couple of independent values, as in the example below:
id age marks
1 5 3,6,7
2 7 1,2
3 4 34,78,2
Thus the column by itself is composed of multiple values, and I need to pass the vector into a machine learning algorithm. I cannot really combine the values to assign a single value like:
3,6,7 => 1
1,2 => 2
34,78,2 => 3
making my new vector as
id age marks
1 5 1
2 7 2
3 4 3
and then subsequently pass it to the algorithm, as the number of such combinations would be infinite and also might not really capture the real meaning of the data.
How do I handle such a situation, where an individual feature is a combination of multiple values?
Note: the values in column marks are just examples; it could be any list of values, e.g. a list of integers, a list of strings, or a string composed of multiple strings separated by commas.
You can pd.factorize tuples
Assuming marks is a list
df
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 5 [3, 6, 7]
Apply tuple and factorize
df.assign(new=pd.factorize(df.marks.apply(tuple))[0] + 1)
id age marks new
0 1 5 [3, 6, 7] 1
1 2 7 [1, 2] 2
2 3 4 [34, 78, 2] 3
3 4 5 [3, 6, 7] 1
setup df
df = pd.DataFrame([
[1, 5, ['3', '6', '7']],
[2, 7, ['1', '2']],
[3, 4, ['34', '78', '2']],
[4, 5, ['3', '6', '7']]
], [0, 1, 2, 3], ['id', 'age', 'marks']
)
UPDATE: I think we can use CountVectorizer in this case:
assuming we have the following DF:
In [33]: df
Out[33]:
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 11 [3, 6, 7]
In [34]: %paste
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TreebankWordTokenizer
vect = CountVectorizer(ngram_range=(1,1), stop_words=None, tokenizer=TreebankWordTokenizer().tokenize)
X = vect.fit_transform(df.marks.apply(' '.join))
r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names())
## -- End pasted text --
Result:
In [35]: r
Out[35]:
1 2 3 34 6 7 78
0 0 0 1 0 1 1 0
1 1 1 0 0 0 0 0
2 0 1 0 1 0 0 1
3 0 0 1 0 1 1 0
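Note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use vect.get_feature_names_out() instead.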
OLD answer:
you can first convert your list to string and then categorize it:
In [119]: df
Out[119]:
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 11 [3, 6, 7]
In [120]: df['new'] = pd.Categorical(pd.factorize(df.marks.str.join('|'))[0])
In [121]: df
Out[121]:
id age marks new
0 1 5 [3, 6, 7] 0
1 2 7 [1, 2] 1
2 3 4 [34, 78, 2] 2
3 4 11 [3, 6, 7] 0
In [122]: df.dtypes
Out[122]:
id int64
age int64
marks object
new category
dtype: object
This will also work if marks is a column of strings:
In [124]: df
Out[124]:
id age marks
0 1 5 3,6,7
1 2 7 1,2
2 3 4 34,78,2
3 4 11 3,6,7
In [125]: df['new'] = pd.Categorical(pd.factorize(df.marks.str.join('|'))[0])
In [126]: df
Out[126]:
id age marks new
0 1 5 3,6,7 0
1 2 7 1,2 1
2 3 4 34,78,2 2
3 4 11 3,6,7 0
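A pandas-only alternative to the CountVectorizer approach is str.get_dummies, which builds one indicator column per distinct mark. A sketch, assuming marks is a comma-separated string (if it is a list of strings, join it first with df.marks.str.join(',')):
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'age': [5, 7, 4, 11],
                   'marks': ['3,6,7', '1,2', '34,78,2', '3,6,7']})
# one 0/1 column per distinct value appearing in 'marks'
indicators = df['marks'].str.get_dummies(sep=',')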
To access them as either [[x, y, z], [x, y, z]] or [[x, x], [y, y], [z, z]] (whichever is most appropriate for the function you need to call), use:
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(a=[1, 2, 3, 4], b=[3, 4, 3, 4], c=[[1,2,3], [1,2], [], [2]]))
df.values
list(zip(*df.values))
where
>>> df
a b c
0 1 3 [1, 2, 3]
1 2 4 [1, 2]
2 3 3 []
3 4 4 [2]
>>> df.values
array([[1, 3, [1, 2, 3]],
[2, 4, [1, 2]],
[3, 3, []],
[4, 4, [2]]], dtype=object)
>>> list(zip(*df.values))
[(1, 2, 3, 4), (3, 4, 3, 4), ([1, 2, 3], [1, 2], [], [2])]
To convert a column try this:
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(a=[1, 2], b=[3, 4], c=[[1,2,3], [1,2]]))
df['c'] = df['c'].apply(np.mean)  # assign back so the column is actually replaced
before:
>>> df
a b c
0 1 3 [1, 2, 3]
1 2 4 [1, 2]
after:
>>> df
a b c
0 1 3 2.0
1 2 4 1.5