How to use groupby agg and rename functions for all columns

Question
How do I get the following result without having to assign a function dictionary for every column?
df.groupby(level=0).agg({'one': {'SUM': 'sum', 'HowMany': 'count'},
'two': {'SUM': 'sum', 'HowMany': 'count'}})
What I've done so far
Consider the df:
import pandas as pd
import numpy as np
idx = pd.MultiIndex.from_product([['A', 'B'], ['One', 'Two']],
names=['Alpha', 'Numeric'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), idx, ['one', 'two'])
df
I want to use groupby().agg() where I run the set of functions and rename their output columns.
This works fine.
df.groupby(level=0).agg({'one': {'SUM': 'sum', 'HowMany': 'count'}})
But I want to do this for all columns. I could do this:
df.groupby(level=0).agg(['sum', 'count'])
But I'm missing the great renaming I've done. I'd hoped that this would work:
df.groupby(level=0).agg({'SUM': 'sum', 'HowMany': 'count'})
But it doesn't. I get this error:
KeyError: 'SUM'
This makes sense. Pandas looks at the keys of the passed dictionary for column names. That's how I got the example at the start to work.

You can use set_levels:
g = df.groupby(level=0).agg(['sum', 'count'])
g.columns = g.columns.set_levels(['SUM', 'HowMany'], level=1)
g
>>>
      one          two
      SUM HowMany  SUM HowMany
Alpha
A       2       2    4       2
B      10       2   12       2

Is using .rename() an option for you?
In [7]: df.groupby(level=0).agg(['sum', 'count']).rename(columns=dict(sum='SUM', count='HowMany'))
Out[7]:
      one          two
      SUM HowMany  SUM HowMany
Alpha
A       2       2    4       2
B      10       2   12       2
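A variant of the rename idea, sketched under the assumption that your pandas version supports the level argument of DataFrame.rename. It keeps the function names and their new labels in one mapping, so the same renaming applies to every column. The name new_names is only illustrative.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['A', 'B'], ['One', 'Two']],
                                 names=['Alpha', 'Numeric'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), idx, ['one', 'two'])

# Hypothetical mapping: aggregation function name -> desired output label.
new_names = {'sum': 'SUM', 'count': 'HowMany'}

# Aggregate every column with the same functions, then relabel only the
# inner column level so the outer labels ('one', 'two') are left alone.
result = (df.groupby(level=0)
            .agg(list(new_names))
            .rename(columns=new_names, level=1))
print(result)
```

Restricting the rename to level=1 avoids accidentally renaming a data column that happens to be called 'sum' or 'count'.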

This is an ugly answer:
gb = df.stack(0).groupby(level=[0, -1])
df1 = gb.agg(SUM='sum', HowMany='count')
df1.unstack().swaplevel(0, 1, axis=1).sort_index(axis=1, level=0)
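If flat column names are acceptable instead of a two-level header, named aggregation may be another route. This is a different technique from the answers above, sketched on the assumption that your pandas version supports named aggregation; funcs and named are illustrative names.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['A', 'B'], ['One', 'Two']],
                                 names=['Alpha', 'Numeric'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), idx, ['one', 'two'])

# Hypothetical (label, function) pairs to apply to every column.
funcs = [('SUM', 'sum'), ('HowMany', 'count')]

# Build one named aggregation per (column, function) pair,
# e.g. 'one_SUM' -> ('one', 'sum').
named = {f'{col}_{label}': (col, func)
         for col in df.columns
         for label, func in funcs}

flat = df.groupby(level=0).agg(**named)
print(flat)
```

The dictionary comprehension does the per-column bookkeeping, so no column has to be spelled out by hand.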

Related

Frequencies of combinations of columns in Python

I have several columns that contain specific diseases. Here is an example of part of it:
I want to make all possible combinations so I can check which combination of diseases occurs most often. So I want to make all combinations of 2 columns (A&B, A&C, A&D, B&C, B&D, C&D), but also combinations of 3 and 4 columns (A&B&C, B&C&D and so on). I have the following script for this:
from itertools import combinations
df.join(pd.concat({'_'.join(x): df[x[0]].str.cat(df[list(x[1:])].astype(str),
sep='')
for i in (2, 3, 4)
for x in combinations(df, i)}, axis=1))
But that generates a lot of extra columns in my dataset, and I still haven't got the frequencies of all combinations. This is the output that I would like to get:
What script can I use for this?

Use DataFrame.stack, join the values of each row with a groupby aggregation, and then count the results with Series.value_counts:
s = df.stack().groupby(level=0).agg(','.join).value_counts()
print (s)
artritis,asthma                         2
cancer,artritis,heart_failure,asthma    1
cancer,heart_failure                    1
dtype: int64
If you need a two-column DataFrame:
df = s.rename_axis('vals').reset_index(name='count')
print (df)
                                   vals  count
0                       artritis,asthma      2
1  cancer,artritis,heart_failure,asthma      1
2                  cancer,heart_failure      1
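A row-wise variant that does not depend on stack dropping missing values might look like the following. It is only a sketch: the sample data is assumed to match the four disease columns from the question, and combos and freq are illustrative names.

```python
import pandas as pd

# Assumed sample data: one column per disease, None where it is absent.
df = pd.DataFrame({'A': ['cancer', 'cancer', None, None],
                   'B': ['artritis', None, 'artritis', 'artritis'],
                   'C': ['heart_failure', 'heart_failure', None, None],
                   'D': ['asthma', None, 'asthma', 'asthma']})

# Join the non-missing diseases of each row into a single label,
# then count how often each label occurs.
combos = df.apply(lambda row: ','.join(row.dropna()), axis=1)
freq = combos.value_counts()
print(freq)
```

Because each row becomes exactly one label, this counts the combinations that actually occur rather than every possible subset of columns.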

You can create a pivot table:
def index_agg_fn(x):
    x = [e for e in x if e != '']
    return ','.join(x)
df = pd.DataFrame({'A': ['cancer', 'cancer', None, None],
'B': ['artritis', None, 'artritis', 'artritis'],
'C': ['heart_failure', 'heart_failure', None, None],
'D': ['asthma', None, 'asthma', 'asthma']})
df['count'] = 1
ptable = pd.pivot_table(df.fillna(''), index=['A', 'B', 'C', 'D'], values=['count'], aggfunc='sum')
ptable.index = list(map(index_agg_fn, ptable.index))
print(ptable)
Result
count
artritis,asthma 2
cancer,heart_failure 1
cancer,artritis,heart_failure,asthma 1

Error when creating a data frame in Python: ValueError [duplicate]

This may be a simple question, but I cannot figure out how to do this. Let's say that I have two variables as follows.
a = 2
b = 3
I want to construct a DataFrame from this:
df2 = pd.DataFrame({'A':a,'B':b})
This generates an error:
ValueError: If using all scalar values, you must pass an index
I tried this also:
df2 = (pd.DataFrame({'a':a,'b':b})).reset_index()
This gives the same error message.

The error message says that if you're passing scalar values, you have to pass an index. So you can either not use scalar values for the columns -- e.g. use a list:
>>> df = pd.DataFrame({'A': [a], 'B': [b]})
>>> df
A B
0 2 3
or use scalar values and pass an index:
>>> df = pd.DataFrame({'A': a, 'B': b}, index=[0])
>>> df
A B
0 2 3
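The failure and the two fixes can be put side by side in a small sketch. The variable names message, from_lists and with_index are illustrative only.

```python
import pandas as pd

a = 2
b = 3

# All values are scalars and no index is given, so pandas cannot
# tell how many rows the frame should have.
try:
    pd.DataFrame({'A': a, 'B': b})
except ValueError as exc:
    message = str(exc)
    print(message)

# Fix 1: wrap each scalar in a list, giving a one-row frame.
from_lists = pd.DataFrame({'A': [a], 'B': [b]})

# Fix 2: keep the scalars and say which rows exist.
with_index = pd.DataFrame({'A': a, 'B': b}, index=[0])

print(from_lists)
print(with_index)
```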

You may try wrapping your dictionary into a list:
my_dict = {'A':1,'B':2}
pd.DataFrame([my_dict])
A B
0 1 2

You can also use pd.DataFrame.from_records which is more convenient when you already have the dictionary in hand:
df = pd.DataFrame.from_records([{ 'A':a,'B':b }])
You can also set index, if you want, by:
df = pd.DataFrame.from_records([{ 'A':a,'B':b }], index='A')

You need to create a pandas series first. The second step is to convert the pandas series to pandas dataframe.
import pandas as pd
data = {'a': 1, 'b': 2}
pd.Series(data).to_frame()
You can even provide a column name.
pd.Series(data).to_frame('ColumnName')

Maybe Series would provide all the functions you need:
pd.Series({'A':a,'B':b})
A DataFrame can be thought of as a collection of Series, so you can:
Concatenate multiple Series into one DataFrame
Add a Series to an existing DataFrame

Pandas magic at work. All logic is out.
The error message "ValueError: If using all scalar values, you must pass an index" says you must pass an index.
This does not necessarily mean that passing an index makes pandas do what you want it to do.
When you pass an index, pandas will treat your dictionary keys as column names and the values as what the column should contain for each of the values in the index.
a = 2
b = 3
df2 = pd.DataFrame({'A':a,'B':b}, index=[1])
A B
1 2 3
Passing a larger index:
df2 = pd.DataFrame({'A':a,'B':b}, index=[1, 2, 3, 4])
A B
1 2 3
2 2 3
3 2 3
4 2 3
An index is usually generated automatically by a dataframe when none is given. However, pandas does not know how many rows of 2 and 3 you want. You can, however, be more explicit about it:
df2 = pd.DataFrame({'A':[a]*4,'B':[b]*4})
df2
A B
0 2 3
1 2 3
2 2 3
3 2 3
The default index is 0 based though.
I would recommend always passing a dictionary of lists to the dataframe constructor when creating dataframes. It's easier for other developers to read. Pandas has a lot of caveats; don't make other developers have to be experts in all of them in order to read your code.

You could try:
df2 = pd.DataFrame.from_dict({'a':a,'b':b}, orient = 'index')
From the documentation on the 'orient' argument: If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
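To make the effect of orient visible, here is a minimal sketch. The column label 'value' and the names by_rows and one_row are assumptions for illustration.

```python
import pandas as pd

a = 2
b = 3

# With orient='index' each dict key becomes a row label, so the
# scalars end up in a single column (labelled 'value' here).
by_rows = pd.DataFrame.from_dict({'a': a, 'b': b},
                                 orient='index',
                                 columns=['value'])
print(by_rows)

# Transposing turns that into one row with 'a' and 'b' as columns.
one_row = by_rows.T
print(one_row)
```

Note that this gives one row per key, which is not the one-row layout the question asks for unless you transpose afterwards.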

I usually use the following to quickly create a small table from dicts.
Let's say you have a dict where the keys are filenames and the values their corresponding filesizes, you could use the following code to put it into a DataFrame (notice the .items() call on the dict):
files = {'A.txt':12, 'B.txt':34, 'C.txt':56, 'D.txt':78}
filesFrame = pd.DataFrame(list(files.items()), columns=['filename','size'])
print(filesFrame)
filename size
0 A.txt 12
1 B.txt 34
2 C.txt 56
3 D.txt 78

You need to provide iterables as the values for the Pandas DataFrame columns:
df2 = pd.DataFrame({'A':[a],'B':[b]})

I had the same problem with numpy arrays and the solution is to flatten them:
data = {
'b': array1.flatten(),
'a': array2.flatten(),
}
df = pd.DataFrame(data)

import pandas as pd
a=2
b=3
data = {'A': a, 'B': b}
pd.DataFrame(pd.Series(data)).T
# .T transposes the DataFrame
Result:
A B
0 2 3

To understand this ValueError, you need to understand how DataFrame treats "scalar values".
To create a DataFrame from a dict without passing an index, at least one value must be array-like.
In my opinion, this is because an array is already indexed by position.
Therefore, if at least one value is array-like, there is no need to specify an index.
E.g. the elements of ['a', 's', 'd', 'f'] have the positions 0, 1, 2, 3 respectively.
df_array_like = pd.DataFrame({
'col' : 10086,
'col_2' : True,
'col_3' : "'at least one array'",
'col_4' : ['one array is arbitrary length', 'multi arrays should be the same length']})
print("df_array_like: \n", df_array_like)
Output:
df_array_like:
col col_2 col_3 col_4
0 10086 True 'at least one array' one array is arbitrary length
1 10086 True 'at least one array' multi arrays should be the same length
As the output shows, the index of the DataFrame is 0 and 1,
which coincides with the positions in the array ['one array is arbitrary length', 'multi arrays should be the same length'].
If you comment out 'col_4', it will raise
ValueError("If using all scalar values, you must pass an index")
because scalar values (integer, bool, and string) do not have an index.
Note that Index(...) must be called with a collection of some kind.
Since the index is used to locate all the rows of the DataFrame,
it should be an array, e.g.
df_scalar_value = pd.DataFrame({
'col' : 10086,
'col_2' : True,
'col_3' : "'at least one array'"
}, index = ['fst_row','snd_row','third_row'])
print("df_scalar_value: \n", df_scalar_value)
Output:
df_scalar_value:
col col_2 col_3
fst_row 10086 True 'at least one array'
snd_row 10086 True 'at least one array'
third_row 10086 True 'at least one array'
I'm a beginner, I'm learning python and English. 👀

I tried transpose() and it worked.
Downside: You create a new object.
testdict1 = {'key1':'val1','key2':'val2','key3':'val3','key4':'val4'}
df = pd.DataFrame.from_dict(data=testdict1,orient='index')
print(df)
print(f'ID for DataFrame before Transpose: {id(df)}\n')
df = df.transpose()
print(df)
print(f'ID for DataFrame after Transpose: {id(df)}')
Output
0
key1 val1
key2 val2
key3 val3
key4 val4
ID for DataFrame before Transpose: 1932797100424
key1 key2 key3 key4
0 val1 val2 val3 val4
ID for DataFrame after Transpose: 1932797125448

The input does not have to be a list of records; it can be a single dictionary as well:
pd.DataFrame.from_records({'a':1,'b':2}, index=[0])
a b
0 1 2
Which seems to be equivalent to:
pd.DataFrame({'a':1,'b':2}, index=[0])
a b
0 1 2

This is because a DataFrame has two intuitive dimensions - the columns and the rows.
You are only specifying the columns using the dictionary keys.
If you only want to specify one dimensional data, use a Series!

If you intend to convert a dictionary of scalars, you have to include an index:
import pandas as pd
alphabets = {'A': 'a', 'B': 'b'}
index = [0]
alphabets_df = pd.DataFrame(alphabets, index=index)
print(alphabets_df)
Although an index is not required for a dictionary of lists, you can pass one there as well:
planets = {'planet': ['earth', 'mars', 'jupiter'], 'length_of_day': ['1', '1.03', '0.414']}
index = [0, 1, 2]
planets_df = pd.DataFrame(planets, index=index)
print(planets_df)
Of course, for the dictionary of lists, you can build the dataframe without an index:
planets_df = pd.DataFrame(planets)
print(planets_df)

Change your 'a' and 'b' values to a list, as follows:
a = [2]
b = [3]
then execute the same code as follows:
df2 = pd.DataFrame({'A':a,'B':b})
df2
and you'll get:
A B
0 2 3

The simplest option is:
data = {'A': a, 'B': b}
df = pd.DataFrame(data, index=np.arange(1))

Another option is to convert the scalars into lists on the fly using a dictionary comprehension:
df = pd.DataFrame(data={k: [v] for k, v in mydict.items()})
The expression {...} creates a new dict in which each value is a one-element list, such as:
In [20]: mydict
Out[20]: {'a': 1, 'b': 2}
In [21]: mydict2 = { k: [v] for k, v in mydict.items()}
In [22]: mydict2
Out[22]: {'a': [1], 'b': [2]}

Convert the dictionary to a DataFrame:
col_dict_df = pd.Series(col_dict).to_frame('new_col').reset_index()
Give the columns new names:
col_dict_df.columns = ['col1', 'col2']

You could try this:
df2 = pd.DataFrame.from_dict({'a':a,'b':b}, orient = 'index')

If you have a dictionary you can turn it into a pandas data frame with the following line of code:
pd.DataFrame({"key": d.keys(), "value": d.values()})

Just pass the dict on a list:
a = 2
b = 3
df2 = pd.DataFrame([{'A':a,'B':b}])

How to replace df.loc with df.reindex without KeyError

I have a huge dataframe which I get from a .csv file. After defining the columns, I want to keep only the ones I need. This worked fine with Python 3.8.1, although it raised this warning: "FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative."
If I try to do the same in Python 3.10.x, I now get a KeyError: "['empty'] not in index"
To get rid of the columns I don't need, I use .loc like this:
df = df.loc[:, ['laenge','Timestamp', 'Nick']]
How can I get the same result with .reindex function (or any other) without getting the KeyError?
Thanks

If you need only the columns that exist in the DataFrame, use numpy.intersect1d:
df = df[np.intersect1d(['laenge','Timestamp', 'Nick'], df.columns)]
The same columns are kept if you use DataFrame.reindex and then drop the columns that contain only missing values:
df = df.reindex(['laenge','Timestamp', 'Nick'], axis=1).dropna(how='all', axis=1)
Sample:
df = pd.DataFrame({'laenge': [0,5], 'col': [1,7], 'Nick': [2,8]})
print (df)
laenge col Nick
0 0 1 2
1 5 7 8
df = df[np.intersect1d(['laenge','Timestamp', 'Nick'], df.columns)]
print (df)
Nick laenge
0 2 0
1 8 5
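Note that numpy.intersect1d returns its result sorted, which is why Nick comes before laenge in the output above. If the requested order matters, a plain list comprehension is one possible alternative. This is a sketch; wanted, present and subset are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'laenge': [0, 5], 'col': [1, 7], 'Nick': [2, 8]})

# Columns we would like to keep; 'Timestamp' does not exist in df.
wanted = ['laenge', 'Timestamp', 'Nick']

# Keep only the labels that are really present, in the requested order,
# so that .loc never sees a missing label.
present = [c for c in wanted if c in df.columns]
subset = df.loc[:, present]
print(subset)
```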

Use reindex:
df = pd.DataFrame({'A': [0], 'B': [1], 'C': [2]})
# A B C
# 0 0 1 2
df.reindex(['A', 'C', 'D'], axis=1)
output:
A C D
0 0 2 NaN
If you need to get only the common columns, you can use Index.intersection:
cols = ['A', 'C', 'E']
df[df.columns.intersection(cols)]
output:
A C
0 0 2
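Another option that may be worth trying is DataFrame.filter with the items argument, which keeps only the labels that exist and does not raise for missing ones. A minimal sketch, with kept as an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'A': [0], 'B': [1], 'C': [2]})

# 'E' is not a column of df; filter(items=...) skips it instead of
# raising a KeyError or adding a NaN column.
kept = df.filter(items=['A', 'C', 'E'])
print(kept)
```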

Aggregate columns based on values and sum

import pandas as pd
test = pd.DataFrame({'Area': ['Tipperary','Tipperary','Cork','Dublin'],
'Deaths': [11,33,44,55]}
)
I have this problem on a much larger scale, but for readability I have created a smaller version. What groupby logic do I need to group by the Area column and sum, so that I end up with 3 rows instead of 4, because Tipperary appears twice? If I had 6 columns altogether, how would I do this and keep my existing dataframe as it is, i.e. just reduce the row count because of the duplicated values in 'Area'?

If the other columns have more than just numbers, you can use .groupby and .agg with different functions for each column. If you do not want to move the grouping column to the index, you can set the parameter as_index = False in groupby.
import pandas as pd
test = pd.DataFrame({'Area': ['Tipperary', 'Tipperary', 'Cork', 'Dublin'],
'Deaths': [11, 33, 44, 55],
'Text': ['a', 'b', 'c', 'd'],
'Numbers': [1, 4, 3, 2]}
)
out = test.groupby('Area', as_index=False).agg({'Deaths': 'sum', 'Text': ','.join, 'Numbers': 'max'})
print(out)
Prints:
Area Deaths Text Numbers
0 Cork 44 c 3
1 Dublin 55 d 2
2 Tipperary 44 a,b 4

You can simply use the .groupby method:
import pandas as pd
test = pd.DataFrame({'Area': ['Tipperary','Tipperary','Cork','Dublin'],
'Deaths': [11,33,44,55]}
)
test.groupby('Area').sum()
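Since groupby returns a new object, the original frame is left untouched. A short sketch that keeps Area as a regular column and checks that the source data still has four rows (summed is an illustrative name):

```python
import pandas as pd

test = pd.DataFrame({'Area': ['Tipperary', 'Tipperary', 'Cork', 'Dublin'],
                     'Deaths': [11, 33, 44, 55]})

# as_index=False keeps 'Area' as a column instead of moving it to the index.
summed = test.groupby('Area', as_index=False).sum()
print(summed)

# The original DataFrame is not modified by the groupby.
print(len(test))
```

The groups come back sorted by Area, so Cork appears first and Tipperary last.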

Creating a dataframe in a for loop based on another dataframe

I have a data frame, df, and I'd like to get all the columns in it and the count of unique values in it and save it as another data frame. I can't seem to find a way to do that. I can, however, print what I want on the console. Here's what I mean:
def counting_unique_values_in_df(df):
    for evry_colm in df:
        print (evry_colm, "-", df[evry_colm].value_counts().count())
Now that prints what I want just fine. Instead of printing, if I do something like newdf = pd.DataFrame(evry_colm, df[evry_colm].value_counts().count(), columns = ('a', 'b')), it throws an error that reads "TypeError: object of type 'numpy.int32' has no len()". Obviously, that isn't right.
So, how can I make a data frame with columns like columnName and UniqueCounts?

To count unique values per column, you can use apply with the nunique function on the data frame.
Something like:
import pandas as pd
df = pd.DataFrame([
{'a': 1, 'b': 2},
{'a': 2, 'b': 2}
])
count_series = df.apply(lambda col: col.nunique())
# returned object is pandas Series
# a 2
# b 1
# to map it to DataFrame try
pd.DataFrame(count_series).T

import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})
print(df)
print()
df = pd.DataFrame({col: [df[col].nunique()] for col in df})
print(df)
Output:
A B
0 1 1
1 1 2
2 2 3
3 2 4
A B
0 2 4
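To get the long layout asked for in the question, with one row per column name, something like the following might work. The column labels columnName and UniqueCounts come from the question; unique_counts is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})

# nunique() gives one count per column as a Series indexed by column name;
# rename_axis/reset_index turn that Series into a two-column DataFrame.
unique_counts = (df.nunique()
                   .rename_axis('columnName')
                   .reset_index(name='UniqueCounts'))
print(unique_counts)
```

By default nunique ignores missing values, which matches what value_counts().count() does in the question's loop.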
