My dataframe looks like the below:
# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14],['tom', 10], ['juli', 15] ]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])
Name Age
0 tom 10
1 nick 15
2 juli 14
3 tom 10
4 juli 15
I want to group by 'Name' and get both the count and the unique count of 'Age'.
Using pandas I got the result:
Age
count nunique
Name
juli 2 2
nick 1 1
tom 2 1
Pandas code:
types = ['count', 'nunique']
df.groupby('Name').agg({'Age': types})
How can I achieve this in Dask?
In Dask, I can do either count or nunique on its own:
ddf = dd.from_pandas(df, npartitions=4)
ddf.groupby('Name').Age.count().to_frame().compute()
Age
Name
nick 1
tom 2
juli 2
The advantage of lazy computations is that you can specify them one at a time, but the actual computation will be done with some optimization to avoid redundant calculations.
Specifically, you can create lazy computations for nunique and count separately, then combine the computed results:
# calculation with dask
dask_series = ddf.groupby("Name")["Age"]
# these are lazy results that will need to be computed
lazy_results = [
    dask_series.nunique().to_frame(name="age_nunique"),
    dask_series.count().to_frame(name="age_count"),
]
# note that concatenation happens on computed results
print(pd.concat(*dd.compute(lazy_results), axis=1))
Here's the full snippet:
import dask.dataframe as dd
import pandas as pd
# initialize list of lists
data = [["tom", 10], ["nick", 15], ["juli", 14], ["tom", 10], ["juli", 15]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=["Name", "Age"])
# calculation with pandas
types = ["count", "nunique"]
print(df.groupby("Name").agg({"Age": types}))
# Age
# count nunique
# Name
# juli 2 2
# nick 1 1
# tom 2 1
# calculation with dask
ddf = dd.from_pandas(df, npartitions=4)
dask_series = ddf.groupby("Name")["Age"]
# these are lazy results that will need to be computed
lazy_results = [
    dask_series.nunique().to_frame(name="age_nunique"),
    dask_series.count().to_frame(name="age_count"),
]
# note that concatenation happens on computed results
print(pd.concat(*dd.compute(lazy_results), axis=1))
# age_nunique age_count
# Name
# nick 1 1
# tom 1 2
# juli 2 2
What I want to do
I am having trouble cleaning my data because some values were not input correctly.
import pandas as pd
data = [[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]]
index = ['100: Test', '100: test', '101: FOO', '102: WWW', '101: foo foo']
columns = ['column1', 'column2']
df = pd.DataFrame(data, index=index, columns=columns)
print(df)
## Current output!!!!
# column1 column2
#100: Test 1 2
#100: test 2 4
#101: FOO 3 6
#102: WWW 4 8
#101: foo foo 5 10
## DO SOMETHING!!!!
print(df)
## Expected output!!!!
# column1 column2
#100: Test 2 4
#101: FOO 8 16
#102: WWW 4 8
My DataFrame.index consists of "ID" + "Name". However, names are not correct, so one ID may show up in more than one row.
Two requests
Sum up rows with the same ID.
Choose one name for the result. (For example, I can use either "Test" or "test" for ID=100.)
What I tried
I tried to use the groupby function, but it doesn't seem to support regular expressions.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
df2 = df.groupby(level=0).sum()
print(df2)
## Output
# column1 column2
#100: Test 1 2
#100: test 2 4
#101: FOO 3 6
#101: foo foo 5 10
#102: WWW 4 8
Environment
Python 3.10.5
Pandas 1.4.3
Your expected output for Test does not reflect that you are trying to do a summation, but from what I can gather this is what you want. groupby can take a function or a mapping or even a series as the by argument. Here, you just want the lowercase version of the index:
df.groupby(df.index.str.lower()).sum()
which gives
column1 column2
100: test 3 6
101: foo 8 16
102: www 4 8
Here, what I've done is passed it the lowercase index, and it simply groups the rows based on matching elements in the series.
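As an aside, here is a minimal sketch of the function form mentioned above; groupby calls the callable on each index label, which here is equivalent to passing the lowercased index:
print(df.groupby(str.lower).sum())  # str.lower is applied to every index label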
Edit
Based on the updated question, to match the numbers, you can use regular expressions:
df.groupby(df.index.str.extract(r"(\d+):", expand=False)).sum()
which gives
column1 column2
100 3 6
101 8 16
102 4 8
It isn't clear which label would take precedence, 101: foo foo or 101: FOO, but it seems the numbers are the important part here regardless.
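If you also want to keep one representative "ID: Name" label per ID (the second request), here is a minimal sketch, assuming the first label seen for each ID is an acceptable choice:
# group by the extracted ID, sum the numbers, then restore one "ID: Name" label per group
ids = df.index.str.extract(r"(\d+):", expand=False)
names = df.index.to_series().groupby(ids).first()   # e.g. '100: Test', '101: FOO'
summed = df.groupby(ids).sum()
summed.index = names.loc[summed.index]
print(summed)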
import numpy as np
import pandas as pd
# Data Import
data = [[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]]
index = ['100: Test', '100: test', '101: FOO', '102: WWW', '101: foo']
columns = ['column1', 'column2']
df = pd.DataFrame(data, index=index, columns=columns)
# Data Pre-process
df.reset_index(inplace=True)
df.rename(columns={'index':'ID_Name'},inplace=True)
df['ID'] = df['ID_Name'].str.split(':').str[0]
df.sort_values(['ID','ID_Name'],inplace=True)
df_group = df.groupby(['ID'])[['column1','column2']].sum().reset_index()
df_group
df = pd.merge(df,df_group,how='left',left_on='ID',right_on='ID')
df_final = df.groupby(['ID']).first()
# Data Clean Process
df_final.rename(columns={'column1_y':'column1','column2_y':'column2'},inplace= True)
df_final.drop(['column1_x','column2_x'],axis = 1 , inplace=True)
# Output Display
df_final
Please try the code above, and let me know if you still have any questions.
I have a dataframe where every cell/row in one column (we'll call it info) contains another dataframe. I want to loop through all the rows in this column and literally stack the nested dataframes on top of each other, because they all have the same columns.
How would I go about this?
You could try as follows:
import pandas as pd
length=5
# some dfs
nested_dfs = [pd.DataFrame({'a': [*range(length)],
                            'b': [*range(length)]}) for x in range(length)]
print(nested_dfs[0])
a b
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
# df with nested_dfs in info
df = pd.DataFrame({'info_col': nested_dfs})
# code to be implemented
lst_dfs = df['info_col'].values.tolist()
df_final = pd.concat(lst_dfs,axis=0, ignore_index=True)
df_final.tail()
a b
20 0 0
21 1 1
22 2 2
23 3 3
24 4 4
This method should be a bit faster than the solution offered by nandoquintana, which also works.
Incidentally, it is ill-advised to name a df column info, because df.info is actually a method. E.g., normally df['col_name'].values.tolist() can also be written as df.col_name.values.tolist(). However, if you try this with df.info.values.tolist(), you will run into an error:
AttributeError: 'function' object has no attribute 'values'
You also run the risk of shadowing the method if you start assigning values via attribute access, on top of doing something you probably don't want to do. E.g.:
print(type(df.info))
<class 'method'>
df.info=1
# column is unaffected, you just create an int variable
print(type(df.info))
<class 'int'>
# but:
df['info']=1
# your column now has all 1's
print(type(df['info']))
<class 'pandas.core.series.Series'>
This is the solution that I came up with, although it's not the fastest, which is why I am still leaving the question unanswered:
df1 = pd.DataFrame()
for frame in df['Info'].tolist():
    df1 = pd.concat([df1, frame], axis=0).reset_index(drop=True)
Our dataframe has three columns (col1, col2 and info).
In info, each row has a nested df as value.
import pandas as pd
nested_d1 = {'coln1': [11, 12], 'coln2': [13, 14]}
nested_df1 = pd.DataFrame(data=nested_d1)
nested_d2 = {'coln1': [15, 16], 'coln2': [17, 18]}
nested_df2 = pd.DataFrame(data=nested_d2)
d = {'col1': [1, 2], 'col2': [3, 4], 'info': [nested_df1, nested_df2]}
df = pd.DataFrame(data=d)
We can combine all the nested df rows by appending each nested df to a list (since their schema is constant) and concatenating them afterwards.
nested_dfs = []
for index, row in df.iterrows():
    nested_dfs.append(row['info'])
result = pd.concat(nested_dfs, sort=False).reset_index(drop=True)
print(result)
This would be the result:
coln1 coln2
0 11 13
1 12 14
2 15 17
3 16 18
How can I change the shape of my multiindexed dataframe from:
to something like this, but with all cell values, not only those of the first index:
I have tried to do it, but with this code I somehow get only the dataframe as above:
numbers = [100, 50, 20, 10, 5, 2, 1]
dfj = {}
for number in numbers:
    dfj[number] = df['First_column_value_name'].xs(key=number, level='Second_multiindex_column_name')
list_of_columns_position = []
for number in numbers:
    R_string = '{}_R'.format(number)
    list_of_columns_position.append(R_string)
df_positions_as_columns = pd.concat(dfj.values(), ignore_index=True, axis=1)
df_positions_as_columns.columns = list_of_columns_position
Split your first column into 2 parts, then join the result with the second column, and finally pivot your dataframe:
Setup:
data = {'A': ['TLM_1/100', 'TLM_1/50', 'TLM_1/20',
'TLM_2/100', 'TLM_2/50', 'TLM_2/20'],
'B': [11, 12, 13, 21, 22, 23]}
df = pd.DataFrame(data)
print(df)
# Output:
A B
0 TLM_1/100 11
1 TLM_1/50 12
2 TLM_1/20 13
3 TLM_2/100 21
4 TLM_2/50 22
5 TLM_2/20 23
>>> df[['B']].join(df['A'].str.split('/', expand=True)) \
.pivot(index=0, columns=1, values='B') \
.rename_axis(index=None, columns=None) \
.add_suffix('_R')
100_R 20_R 50_R
TLM_1 11 13 12
TLM_2 21 23 22
Use a regular expression to split the label column into two columns, a and b, then group by column a and unstack the grouping.
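For reference, a minimal sketch of that approach, assuming the same df as in the setup above (the group names a and b are arbitrary):
# split 'TLM_1/100' into parts a and b, group by both, then unstack b into columns
tmp = df['A'].str.extract(r'(?P<a>[^/]+)/(?P<b>\d+)')
out = df['B'].groupby([tmp['a'], tmp['b']]).first().unstack('b').add_suffix('_R')
print(out)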
I have a df similar to the one below:
name age sex
1 john 12 m
2 mary 13 f
3 joseph 12 m
4 maria 14 f
How can I make a new column based on the index? For example, for indexes 1 and 2 I want them to have the label cluster1, and for indexes 3 and 4 I want them labeled cluster2, like so:
name age sex label
1 john 12 m cluster1
2 mary 13 f cluster1
3 joseph 12 m cluster2
4 maria 14 f cluster2
Should I use something like (df.index.isin([1, 2])) == 'cluster1'? I don't think it's possible to do df['target'] = (df.index.isin([1, 2])) == 'cluster1', assuming the label column doesn't exist to begin with.
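A minimal sketch of that idea, assuming rows with index 1 and 2 form the first cluster (this uses numpy.where, which none of the code above relies on):
import numpy as np
# build the label values in one go from the index membership test
df['label'] = np.where(df.index.isin([1, 2]), 'cluster1', 'cluster2')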
I think this is what you are looking for? You can use lists for different clusters to make your labels arbitrary in this way.
import pandas as pd
data = {'name':['bob','sue','mary','steve'], 'age':[11, 23, 53, 44]}
df = pd.DataFrame(data)
print(df)
df['label'] = 0
cluster1 = [0, 3]
cluster2 = [1, 2]
df.loc[cluster1, 'label'] = 1
df.loc[cluster2, 'label'] = 2
#another way
#df.iloc[cluster1, df.columns.get_loc('label')] = 1
#df.iloc[cluster2, df.columns.get_loc('label')] = 2
print(df)
output:
name age
0 bob 11
1 sue 23
2 mary 53
3 steve 44
name age label
0 bob 11 1
1 sue 23 2
2 mary 53 2
3 steve 44 1
You can make the initial column creation anything. So you can either have it be one of the cluster values (then you only have to set the other cluster manually instead of both), or have it be None so you can easily check, after assigning labels, that you didn't miss any rows.
If the assignment to clusters is truly arbitrary I don't think you'll be able to automate it much more than this.
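For example, a short sketch of the None-initialisation check described above, using the same df and cluster lists as before:
df['label'] = None                     # unassigned rows stay None
df.loc[cluster1, 'label'] = 1
df.loc[cluster2, 'label'] = 2
print(df[df['label'].isna()])          # any rows that never received a cluster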
Is this the solution you are looking for? I doubled the data so you can try different sequences. Here, if you call create_label(df, 3) instead of 2, it will assign labels three rows at a time, so you have a parametric solution.
import pandas as pd
df = pd.DataFrame({'name': ['john', 'mary', 'joseph', 'maria', 'john', 'mary', 'joseph', 'maria'],
'age': [12, 13, 12, 14, 12, 13, 12, 14],
'sex': ['m', 'f','m', 'f', 'm', 'f','m', 'f']})
df.index = df.index + 1
df['label'] = pd.Series(dtype=object)
def create_label(data, each_row):
    i = 0
    j = 1
    while i < len(data):
        # assign the next label to the next block of each_row rows (positional)
        data.iloc[i:i + each_row, data.columns.get_loc('label')] = 'label' + str(j)
        i += each_row
        j += 1
    return data
df_new = create_label(df, 2)
For a small dataframe or dataset you can use the code below:
Label = pd.Series(['cluster1', 'cluster1', 'cluster2', 'cluster2'])
df['label'] = Label.values  # .values makes the assignment positional rather than index-aligned
You can use a for loop and a list to build a new column with the desired data:
import pandas as pd
df = pd.read_csv("dataset.csv")
list1 = []
for i in range(len(df.name)):
    if i < 2:
        list1.append('cluster1')
    else:
        list1.append('cluster2')
label = pd.Series(list1)
df['label'] = label
You can simply use iloc and assign the values for the columns:
import pandas as pd
df = pd.read_csv('test.txt',sep='\+', engine = "python")
df["label"] = "" # adds empty "label" column
df["label"].iloc[0:2] = "cluster1"
df["label"].iloc[2:4] = "cluster2"
Since the values do not follow a certain order, as per your comments, you'd have to assign each "cluster" value manually.
With the nice indexing methods in Pandas I have no problems extracting data in various ways. On the other hand I am still confused about how to change data in an existing DataFrame.
In the following code I have two DataFrames and my goal is to update values in a specific row in the first df from values of the second df. How can I achieve this?
import pandas as pd
df = pd.DataFrame({'filename' : ['test0.dat', 'test2.dat'],
'm': [12, 13], 'n' : [None, None]})
df2 = pd.DataFrame({'filename' : 'test2.dat', 'n':16}, index=[0])
# this overwrites the first row but we want to update the second
# df.update(df2)
# this does not update anything
df.loc[df.filename == 'test2.dat'].update(df2)
print(df)
gives
filename m n
0 test0.dat 12 None
1 test2.dat 13 None
[2 rows x 3 columns]
but how can I achieve this:
filename m n
0 test0.dat 12 None
1 test2.dat 13 16
[2 rows x 3 columns]
So first of all, pandas updates using the index. When an update command does not update anything, check both left-hand side and right-hand side. If you don't update the indices to follow your identification logic, you can do something along the lines of
>>> df.loc[df.filename == 'test2.dat', 'n'] = df2[df2.filename == 'test2.dat'].loc[0]['n']
>>> df
Out[331]:
filename m n
0 test0.dat 12 None
1 test2.dat 13 16
If you want to do this for the whole table, I suggest a method I believe is superior to the previously mentioned ones: since your identifier is filename, set filename as your index, and then use update() as you wanted to. Both merge and the apply() approach contain unnecessary overhead:
>>> df.set_index('filename', inplace=True)
>>> df2.set_index('filename', inplace=True)
>>> df.update(df2)
>>> df
Out[292]:
m n
filename
test0.dat 12 None
test2.dat 13 16
In SQL, I would have done it in one shot as
update table1 set col1 = new_value where col1 = old_value
but in pandas, we can just do this:
data = [['ram', 10], ['sam', 15], ['tam', 15]]
kids = pd.DataFrame(data, columns = ['Name', 'Age'])
kids
which will generate the following output :
Name Age
0 ram 10
1 sam 15
2 tam 15
now we can run:
kids.loc[kids.Age == 15,'Age'] = 17
kids
which will show the following output
Name Age
0 ram 10
1 sam 17
2 tam 17
which should be equivalent to the following SQL
update kids set age = 17 where age = 15
If you have one large dataframe and only a few update values, I would use apply like this:
import pandas as pd
df = pd.DataFrame({'filename' : ['test0.dat', 'test2.dat'],
'm': [12, 13], 'n' : [None, None]})
data = {'filename' : 'test2.dat', 'n':16}
def update_vals(row, data=data):
    if row.filename == data['filename']:
        row.n = data['n']
    return row
df = df.apply(update_vals, axis=1)  # apply returns a new DataFrame, so assign it back
Update null elements with values in the same location in other.
combine_first combines two DataFrame objects by filling null values in one DataFrame with non-null values from the other DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two.
df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
df1.combine_first(df2)
A B
0 1.0 3.0
1 0.0 4.0
More information is available in the pandas documentation for combine_first.
There are probably a few ways to do this, but one approach would be to merge the two dataframes on the filename and m columns, then populate column n from the right dataframe where a match was found. The n_x and n_y in the code refer to the left and right dataframes in the merge.
In[100] : df = pd.merge(df1, df2, how='left', on=['filename','m'])
In[101] : df
Out[101]:
filename m n_x n_y
0 test0.dat 12 None NaN
1 test2.dat 13 None 16
In[102] : df['n'] = df['n_y'].fillna(df['n_x'])
In[103] : df = df.drop(['n_x','n_y'], axis=1)
In[104] : df
Out[104]:
filename m n
0 test0.dat 12 None
1 test2.dat 13 16
If you want to put anything (for example a dict) in the ii-th row, add square brackets:
df.loc[df.iloc[ii].name, 'filename'] = [{'anything': 0}]
I needed to update and add a suffix to a few rows of the dataframe, conditionally, based on another column's value in the same dataframe:
a df with columns Feature and Entity, where Entity needs to be updated for a specific feature type
df.loc[df.Feature == 'dnb', 'Entity'] = 'duns_' + df.loc[df.Feature == 'dnb','Entity']