dask: how to groupby, aggregate without losing column used for groupby - python

How does one get a SQL-style grouped output when grouping the following data:
item frequency
A 5
A 9
B 2
B 4
C 6
df.groupby(by = ["item"]).sum()
results in this:
item frequency
A 14
B 6
C 6
In pandas this is achieved by setting as_index=False, but dask's groupby doesn't support that argument. It currently omits the item column (moving it into the index) and returns only the frequency column.

Perhaps call .reset_index() afterwards?
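A minimal sketch of that suggestion (the tiny example frame here is made up to mirror the question's data):
import pandas as pd
import dask.dataframe as dd
pdf = pd.DataFrame({"item": ["A", "A", "B", "B", "C"],
                    "frequency": [5, 9, 2, 4, 6]})
ddf = dd.from_pandas(pdf, npartitions=2)
# group on "item", aggregate, then move the group key back into a column
result = ddf.groupby("item").sum().reset_index().compute()
print(result)
# expected output (row order may vary):
#   item  frequency
# 0    A         14
# 1    B          6
# 2    C          6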


Calculate quantile for each observation in a dataframe

I am new to Python and I have the following dataframe structure:
data = {'name': ["a","b","c","d","e","f","g","h"], 'value1': [1,2,3,4,5,6,7,8],'value2': [1,2,3,4,5,6,7,8]}
data = pd.DataFrame.from_dict(data)
data = data.transpose()
What I want to calculate is a new dataframe where, for each row, each column has a value corresponding to that entry's quantile in the data. In other words, I am trying to understand how to apply pd.quantile to return a dataframe in which each entry equals the quantile value of that entry within its column.
I tried the following, but I don't think it works:
x.quantile(q = 0.9,axis =0)
or:
x.apply(quantile,axis=0)
Many thanks in advance.
This is because you transposed your data. As per the pandas documentation (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html):
"When the DataFrame has mixed dtypes, we get a transposed DataFrame with the object dtype."
Your dataframe after loading looks like below, which means it has 'mixed dtypes' (the name column is object and the other two are integer columns).
name value1 value2
0 a 1 1
1 b 2 2
2 c 3 3
3 d 4 4
4 e 5 5
5 f 6 6
6 g 7 7
7 h 8 8
In this case, transposing converts everything to the object dtype, which means the quantile function no longer treats the values as numbers.
Try removing the transposing step and use the axis argument to choose the direction along which quantiles are calculated.
By the way, you can do transposition with:
df = df.T
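For completeness, a minimal sketch of the suggested fix, reusing the question's data without the transpose (rank(pct=True) is one reading of "quantile for each observation"):
import pandas as pd
data = pd.DataFrame({'name': ["a", "b", "c", "d", "e", "f", "g", "h"],
                     'value1': [1, 2, 3, 4, 5, 6, 7, 8],
                     'value2': [1, 2, 3, 4, 5, 6, 7, 8]})
# without transposing, the numeric columns keep their dtypes
print(data.quantile(q=0.9, numeric_only=True))
# percentile rank of each entry within its own column
print(data[['value1', 'value2']].rank(pct=True))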

Different aggregation for dataframe with several columns

I am looking for a shortcut to reduce the manual grouping required:
I have a dataframe with many columns. When grouping the dataframe by 'Level', I want to aggregate two columns using nunique(), but all other columns (ca. 60 columns representing years from 2021 onward) using mean().
Does anyone have an idea how to define 'the rest' of the columns?
Thanks!
I would do it the following way:
import pandas as pd
df = pd.DataFrame({'X':[1,1,1,2,2,2],'A':[1,2,3,4,5,6],'B':[1,2,3,4,5,6],'C':[7,8,9,10,11,12],'D':[13,14,15,16,17,18],'E':[19,20,21,22,23,24]})
aggdct = dict.fromkeys(df.columns, pd.Series.mean)
del aggdct['X']
aggdct['A'] = pd.Series.nunique
print(df.groupby('X').agg(aggdct))
output
A B C D E
X
1 3 2 8 14 20
2 3 5 11 17 23
Explanation: I prepare the aggregation dict with dict.fromkeys, which gives a dict whose keys are the column names and whose values are all pd.Series.mean. I then remove the column used in the groupby and change the selected column to hold pd.Series.nunique rather than pd.Series.mean.
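Applied to the setup described in the question, it might look like the sketch below (the column names are hypothetical, since the real ones weren't given):
import pandas as pd
df = pd.DataFrame({'Level': [1, 1, 2, 2],
                   'id_a': ['x', 'y', 'y', 'z'],  # hypothetical nunique column
                   'id_b': ['p', 'p', 'q', 'q'],  # hypothetical nunique column
                   '2021': [1.0, 2.0, 3.0, 4.0],
                   '2022': [5.0, 6.0, 7.0, 8.0]})
# default every non-grouping column to mean, then override the exceptions
aggdct = dict.fromkeys(df.columns.drop('Level'), 'mean')
aggdct['id_a'] = 'nunique'
aggdct['id_b'] = 'nunique'
print(df.groupby('Level').agg(aggdct))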

selecting rows with min and max values of a defined column in pandas

I have the following dataframe:
A,B,C,D
10,1,2,3
1,4,7,3
10,5,2,3
40,7,9,3
9,9,9,9
I would like to create another dataframe starting from the previous one which have only two row. The selection of these two rows is based on the minimum and maximum value in the column "A". I would like to get:
A,B,C,D
1,4,7,3
40,7,9,3
Do you think I should work with something like index.min and index.max and then select only those two rows and append them to a new dataframe? Do you have some other suggestions?
Thanks for any kind of help,
Best
IIUC you can simply subset the dataframe with an OR condition on df.A.min() and df.A.max():
df = df[(df.A==df.A.min())|(df.A==df.A.max())]
df
A B C D
1 1 4 7 3
3 40 7 9 3
Yes, you can use idxmin/idxmax and then use loc:
df.loc[df['A'].agg(['idxmin','idxmax'])]
Output:
A B C D
1 1 4 7 3
3 40 7 9 3
Note that this gives only one row for the min and one for the max. If you want all rows with those values, you should use #CHRD's boolean-mask solution above.
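A small illustration of that difference, with made-up data where the minimum is tied:
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 40], 'B': [4, 9, 7]})
# the boolean mask keeps every tied row (three rows here)
print(df[(df.A == df.A.min()) | (df.A == df.A.max())])
# idxmin/idxmax return only the first match each (two rows)
print(df.loc[df['A'].agg(['idxmin', 'idxmax'])])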

How to get a column of group id values for a pandas DataFrame based on the groups produced by a groupby operation

I have a DataFrame which I want to groupby with a few columns. I know how to aggregate the data after that, or view each index tuple. However, I am unsure of the best way to just append the "group number" of each group in a column on the original dataframe:
For example, I have a dataframe, df, with two indices (a_id and b_id) which I want to use for grouping the df using groupby.
import pandas as pd
a = pd.DataFrame({'a_id': ['q','q','q','q','q','r','r','r','r','r'],
                  'b_id': ['m','m','j','j','j','g','g','f','f','f'],
                  'val': [1,2,3,4,5,6,7,8,9,8]})
# Output:
a_id b_id val
0 q m 1
1 q m 2
2 q j 3
3 q j 4
4 q j 5
5 r g 6
6 r g 7
7 r f 8
8 r f 9
9 r f 8
When I do the groupby, rather than aggregate everything, I just want to add a column group_id that has an integer representing the group. However, I am not sure if there is a simple way to do this. My current solution involves inverting the GroupBy.indices dictionary, turning that into a series, and appending it to the dataframe as follows:
gb = a.groupby(['a_id','b_id'])
dict_g = dict(enumerate(gb.indices.values()))
dict_g_reversed = {x:k for k,v in dict_g.items() for x in v}
group_ids = pd.Series(dict_g_reversed)
a['group_id'] = group_ids
This gives me sort of what I want, although the group_id values are not in the order I'd like. This seems like it should be a simple function, but apparently it isn't. MATLAB, for example, has findgroups, which does exactly what I would like. So far I haven't been able to find an equivalent in pandas. How can this be done with a pandas DataFrame?
You can use ngroup, which numbers each group (pass sort=False to the groupby to number them in order of first appearance):
a.groupby(['a_id','b_id']).ngroup()
Or using factorize on the two key columns:
a['group_id'] = pd.factorize(list(map(tuple, a[['a_id','b_id']].values.tolist())))[0] + 1
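Coming back to ngroup: with the question's dataframe a, passing sort=False to the groupby numbers the groups in order of first appearance, much like MATLAB's findgroups:
a['group_id'] = a.groupby(['a_id', 'b_id'], sort=False).ngroup()
print(a['group_id'].tolist())
# [0, 0, 1, 1, 1, 2, 2, 3, 3, 3]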

Transpose all rows in one column of dataframe to multiple columns based on certain conditions

I would like to convert one column of data to multiple columns in dataframe based on certain values/conditions.
Please find the code to generate the input dataframe
df1 = pd.DataFrame({'VARIABLE': ['studyid', 1, 'age_interview', 65,
                                 'Gender', '1.Male', '2.Female',
                                 'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']})
The data is a single VARIABLE column holding the values shown above.
Please note that I may not know the column names in advance, but the data usually follows this format. What I have shown above is sample data; the real data might have around 600-700 columns arranged in this fashion.
What I would like to do is turn the values that start with a non-digit (a character) into new columns in a dataframe. It can be a new dataframe.
I attempted to write a for loop but it failed with an error. Can you please help me achieve this outcome?
for i in range(3, len(df1)):
    # str(df1['VARIABLE'][i].contains('^\d'))
    if (df1['VARIABLE'][i].astype(str).contains('^\d') == True):
Through the above loop, I was trying to check whether the first character is a digit; if yes, retain it as a value (e.g. 1, 2, 3, etc.), and if it's a character (e.g. gender, ethnicity, etc.), create a new column. But I guess this is an incorrect and lengthy approach.
In the example above, the new columns would be studyid, age_interview, Gender, and Ethnicity.
The final output would be a wide table with those columns.
Can you please let me know if there is an elegant approach to do this?
You can use groupby to do something like:
m = ~df1['VARIABLE'].str[0].str.isdigit().fillna(True)
new_df = (pd.DataFrame(df1.groupby(m.cumsum()).VARIABLE.apply(list).values.tolist())
          .set_index(0).T)
print(new_df.rename_axis(None, axis=1))
studyid age_interview Gender Ethnicity
1 1 65 1.Male 1.Chinese
2 None None 2.Female 2.Indian
3 None None None 3.Malay
Explanation: m is a helper series that separates the groups:
print(m.cumsum())
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 4
8 4
9 4
10 4
Then we group VARIABLE by this helper series and collect each group into a list:
df1.groupby(m.cumsum()).VARIABLE.apply(list)
VARIABLE
1 [studyid, 1]
2 [age_interview, 65]
3 [Gender, 1.Male, 2.Female]
4 [Ethnicity, 1.Chinese, 2.Indian, 3.Malay]
Name: VARIABLE, dtype: object
At this point we have each group as a list with the column name as the first entry.
So we create a dataframe from these lists, set the first column as the index, and transpose to get our desired output.
Use itertools.groupby and then construct pd.DataFrame:
import pandas as pd
import itertools
l = ['studyid', 1, 'age_interview', 65, 'Gender', '1.Male', '2.Female',
     'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']
l = list(map(str, l))
grouped = [list(g) for k, g in itertools.groupby(l, key=lambda x:x[0].isnumeric())]
d = {k[0]: v for k,v in zip(grouped[::2],grouped[1::2])}
pd.DataFrame.from_dict(d, orient='index').T
Output:
Gender studyid age_interview Ethnicity
0 1.Male 1 65 1.Chinese
1 2.Female None None 2.Indian
2 None None None 3.Malay
