Using `.groupby().apply()` instead of `.groupby().agg()` - python

Suppose I have a dataframe like this
d = {'User': ['A', 'A', 'B'],
     'time': [1, 2, 3],
     'state': ['CA', 'CA', 'OR'],
     'type': ['cd', 'dvd', 'cd']}
df = pd.DataFrame(data=d)
I want to create a function to which I will pass a single user's dataframe, so for example
user_df = df[df['User'] == 'A']
Then the function will return a single-row dataframe that will look like this
d = {'User': ['A'],
     'avg_time': [1.5],
     'state': ['CA'],
     'cd': [1],
     'dvd': [1]}
res_df = pd.DataFrame(data=d)
Then that function will be applied across the entire dataframe of users, so I will have
def some_function(user_df):
    ...
Then I will write df.groupby('User').apply(some_function), and I will have this as the resulting new dataframe
d = {'User': ['A', 'B'],
     'avg_time': [1.5, 3],
     'state': ['CA', 'OR'],
     'cd': [1, 1],
     'dvd': [1, 0]}
final_df = pd.DataFrame(data=d)
I know I can grab values from the df like this
avg_time = user_df['time'].mean()
state = user_df['state'].iloc[0]
type_counts = user_df['type'].value_counts().to_dict()
But I am not sure how to transform this into a one-row results dataframe. Any help is appreciated. The reason I want to do it this way instead of with .agg() is that I am going to parallelize this function to make it run faster, since I will have a very large dataframe.

IIUC,
def aggUser(df):
    a = pd.DataFrame({'avg_time': df['time'].mean(),
                      'state': [df['state'].iloc[0]]})
    b = df['type'].value_counts().to_frame().T.reset_index(drop=True)
    return pd.concat([a, b], axis=1).set_axis(df['User'].iloc[[0]])

pd.concat([aggUser(df.query('User == "A"')),
           aggUser(df.query('User == "B"'))])
Output:
      avg_time state  cd  dvd
User
A          1.5    CA   1  1.0
B          3.0    OR   1  NaN
df.groupby('User', group_keys=False).apply(aggUser)
Output:
      avg_time state  cd  dvd
User
A          1.5    CA   1  1.0
B          3.0    OR   1  NaN

Related

Create a dataframe from outputs of a function in python

I want to make a dataframe from all outputs of a Python function, so that I end up with a dataset like the df below. Any ideas?
import pandas as pd

def test(input):
    kl = len(input)
    return kl

test("kilo")
test("pound")
# initialize list of lists
data = [[4], [5]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=["Name"])
Assuming this input and function:
words = [['abc', 'd'], ['efgh', '']]

def test(input):
    kl = len(input)
    return kl

You can create the DataFrame:
df = pd.DataFrame(words)
      0  1
0   abc  d
1  efgh
then applymap your function (warning: applymap is slow; in this particular case, getting the length, there are much faster vectorized methods):
df2 = df.applymap(test)
   0  1
0  3  1
1  4  0
Or run your function in Python before creating the DataFrame:
df = pd.DataFrame([[test(x) for x in l] for l in words])
   0  1
0  3  1
1  4  0
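As a sketch of the faster vectorized route mentioned above for this particular length case, one `.str.len()` call per column replaces a Python-level call per cell:

```python
import pandas as pd

words = [['abc', 'd'], ['efgh', '']]
df = pd.DataFrame(words)

# .str.len() is vectorized within each column; apply runs it once
# per column instead of once per cell as applymap does.
df2 = df.apply(lambda col: col.str.len())
```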
A related approach would be to repeatedly call your function to make a list and then form the dataframe from it:
import pandas as pd

words = ['kilo', 'pound', 'ton']

def test(input):
    kl = len(input)
    return kl

data = []  # create empty list
for entry in words:
    data.append(test(entry))
df = pd.DataFrame(data, columns=['names'])

Python Pandas: group by and average, count, median

Suppose I have a dataframe that looks like this
d = {'User' : ['A', 'A', 'B', 'C', 'C', 'C'],
'time':[1,2,3,4,4,4],
'state':['CA', 'CA', 'ID', 'OR','OR','OR']}
df = pd.DataFrame(data = d)
Now suppose I want to create a new dataframe that takes the average and median of time, grabs the user's state, and also generates a new column that counts the number of times that user appears in the User column, i.e.
d = {'User': ['A', 'B', 'C'],
     'avg_time': [1.5, 3, 4],
     'median_time': [1.5, 3, 4],
     'state': ['CA', 'ID', 'OR'],
     'user_count': [2, 1, 3]}
df_res = pd.DataFrame(data=d)
I know that I can do a group by mean statement like this
df.groupby(['User'], as_index=False).mean().groupby('User')['time'].mean()
This gives me a pandas series, and I assume I can make it into a dataframe if I wanted, but how would I do the same for all the other columns I am interested in?
Try using pd.NamedAgg:
df.groupby('User').agg(avg_time=('time', 'mean'),
                       median_time=('time', 'median'),
                       state=('state', 'first'),
                       user_count=('time', 'count')).reset_index()
Output:
  User  avg_time  median_time state  user_count
0    A       1.5          1.5    CA           2
1    B       3.0          3.0    ID           1
2    C       4.0          4.0    OR           3
You can even pass multiple aggregate functions for the columns in the form of a dictionary, something like this:
out = df.groupby('User').agg({'time': ['mean', 'median'], 'state': ['first']})
     time        state
     mean median first
User
A     1.5    1.5    CA
B     3.0    3.0    ID
C     4.0    4.0    OR
It gives multi-level columns; you can either drop a level or just join them:
>>> out.columns = ['_'.join(col) for col in out.columns]
      time_mean  time_median  state_first
User
A           1.5          1.5           CA
B           3.0          3.0           ID
C           4.0          4.0           OR
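The keyword=(column, aggfunc) tuples used above are shorthand for explicit pd.NamedAgg objects; a minimal sketch of the spelled-out form on the same data:

```python
import pandas as pd

d = {'User': ['A', 'A', 'B', 'C', 'C', 'C'],
     'time': [1, 2, 3, 4, 4, 4],
     'state': ['CA', 'CA', 'ID', 'OR', 'OR', 'OR']}
df = pd.DataFrame(data=d)

# keyword=pd.NamedAgg(column=..., aggfunc=...) is equivalent to
# keyword=(column, aggfunc); the keyword becomes the output column name.
res = df.groupby('User').agg(
    avg_time=pd.NamedAgg(column='time', aggfunc='mean'),
    median_time=pd.NamedAgg(column='time', aggfunc='median'),
    state=pd.NamedAgg(column='state', aggfunc='first'),
    user_count=pd.NamedAgg(column='time', aggfunc='count'),
).reset_index()
```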

Dataframe to Sankey Diagram

I want to generate a Sankey Diagram from product data looking like this.
id  begin_date  status
1   01.02.2020  a
1   10.02.2020  b
1   17.02.2020  c
2   02.02.2020  d
2   06.03.2020  b
2   17.04.2020  c
For your experimentation:
pd.DataFrame([[1, '2020-02-01', 'a'], [1, '2020-02-10', 'b'], [1, '2020-02-17', 'c'], [2, '2020-02-02', 'd'], [2, '2020-03-06', 'b'],[2, '2020-04-17', 'c']], columns=['id', 'begin_date', 'status'])
After looking at this explanation:
Draw Sankey Diagram from dataframe
I want to construct the "Source-Target-Value"-Dataframe looking like this. To improve understanding, I did not convert Source and Target to integers.
# with Source = previous status
# with Target = next status
# with Value = count of IDs that transition from Source to Target
Source  Target  Value  Link Color
a       b       1      rgba(127, 194, 65, 0.2)
b       c       2      rgba(127, 194, 65, 0.2)
d       b       1      rgba(211, 211, 211, 0.5)
The problem lies in generating Source, Target, and Value.
The Source and Target should be the status transition from a to b. The Value is the count of ids doing that transition.
What is the best way to do this?
EDIT: Using an online generator, the result would look like this:
Found the answer!
# assuming df is sorted by begin_date
import pandas as pd

df = pd.read_csv(r"path")
dfs = []
unique_ids = df["id"].unique()
for uid in unique_ids:
    df_t = df[df["id"] == uid].copy()
    df_t["status_next"] = df_t["status"].shift(-1)
    df_t["status_append"] = df_t["status"] + df_t["status_next"]
    df_t = df_t.groupby("status_append").agg(Value=("status_append", "count")).reset_index()
    dfs.append(df_t)

df = pd.concat(dfs, ignore_index=True)
df = df.groupby("status_append").agg(Value=("Value", "sum")).reset_index()
df["Source"] = df["status_append"].astype(str).str[0]
df["Target"] = df["status_append"].astype(str).str[1]
df = df.drop("status_append", axis=1)
df = df[["Source", "Target", "Value"]]
yields
Source  Target  Value
a       b       1
b       c       2
d       b       1
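The same Source-Target-Value table can be built without the per-id loop; a sketch using groupby plus shift (assuming, as in the answer above, the frame is sorted by begin_date within each id):

```python
import pandas as pd

df = pd.DataFrame([[1, '2020-02-01', 'a'], [1, '2020-02-10', 'b'],
                   [1, '2020-02-17', 'c'], [2, '2020-02-02', 'd'],
                   [2, '2020-03-06', 'b'], [2, '2020-04-17', 'c']],
                  columns=['id', 'begin_date', 'status'])

# Pair every status with the next status of the same id, then count
# the (Source, Target) pairs; last rows of each id have no successor
# and are dropped.
df['Target'] = df.groupby('id')['status'].shift(-1)
res = (df.dropna(subset=['Target'])
         .rename(columns={'status': 'Source'})
         .groupby(['Source', 'Target'], as_index=False)
         .size()
         .rename(columns={'size': 'Value'}))
```

shift(-1) inside the groupby never pairs the last status of one id with the first status of the next, which is the pitfall a plain df['status'].shift(-1) would hit.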

Access multi index column names in groupby objects in pandas in for loop

Say I have a dataframe like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["ss", "ss", "ss", "xx", "xx", "xx"],
                   "num": [1, 1, 2, 1, 1, 2],
                   "m": [1, 2, 3, 4, 5, 6]})

def somefunction(m):
    mean = np.mean(np.array(list(m)))
    return mean

result = []
for i, group in df.groupby(['name', 'num'], as_index=False):
    mean = somefunction(group['m'])
    row_result = [group['name'], group['num'], mean]
    result.append(row_result)
headers = ['name', 'num', 'm']
stats1 = pd.DataFrame(result, columns=headers)
stats1
With the above piece of code, my resultant dataframe looks like this.
But what I would really like to have is the following:
If I had used apply to perform the function, I could have just used reset_index to get what I wanted. But what I am computing has similar work flow to the example I gave below. Using group['name'] or group['num'] returns the entire series. How can I just get the group name and num in my final dataframe?
Ignore the function definition (it's just dummy), mine is much more complex than computing mean.
Let's use groupby, mean, and reset_index:
df.groupby(['name','num']).mean().reset_index()
Output:
  name  num    m
0   ss    1  1.5
1   ss    2  3.0
2   xx    1  4.5
3   xx    2  6.0
Using your code, you can get the group keys via i:
def somefunction(m):
    mean = np.mean(np.array(list(m)))
    return mean

result = []
for i, group in df.groupby(['name', 'num'], as_index=False):
    mean = somefunction(group['m'])
    row_result = [i[0], i[1], mean]
    result.append(row_result)
headers = ['name', 'num', 'm']
stats1 = pd.DataFrame(result, columns=headers)
stats1
Output:
  name  num    m
0   ss    1  1.5
1   ss    2  3.0
2   xx    1  4.5
3   xx    2  6.0
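Equivalently, since the group key is a tuple when grouping by several columns, it can be unpacked directly in the for statement, which avoids the i[0]/i[1] indexing:

```python
import pandas as pd

df = pd.DataFrame({"name": ["ss", "ss", "ss", "xx", "xx", "xx"],
                   "num": [1, 1, 2, 1, 1, 2],
                   "m": [1, 2, 3, 4, 5, 6]})

# Grouping by two columns yields a (name, num) tuple as the key,
# so it can be unpacked right in the loop header.
result = []
for (name, num), group in df.groupby(['name', 'num']):
    result.append([name, num, group['m'].mean()])
stats1 = pd.DataFrame(result, columns=['name', 'num', 'm'])
```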

How do I change a single index value in pandas dataframe?

energy.loc['Republic of Korea']
I want to change the value of index from 'Republic of Korea' to 'South Korea'.
But the dataframe is too large and it is not possible to change every index value. How do I change only this single value?
#EdChum's solution looks good.
Here's one using rename, which would replace all these values in the index.
energy.rename(index={'Republic of Korea':'South Korea'},inplace=True)
Here's an example
>>> import numpy as np
>>> example = pd.DataFrame({'key1': ['a','a','a','b','a','b'],
...                         'data1': [1,2,2,3,np.nan,4],
...                         'data2': list('abcdef')})
>>> example.set_index('key1', inplace=True)
>>> example
      data1 data2
key1
a       1.0     a
a       2.0     b
a       2.0     c
b       3.0     d
a       NaN     e
b       4.0     f
>>> example.rename(index={'a':'c'})  # can also use inplace=True
      data1 data2
key1
c       1.0     a
c       2.0     b
c       2.0     c
b       3.0     d
c       NaN     e
b       4.0     f
You want to do something like this:
as_list = df.index.tolist()
idx = as_list.index('Republic of Korea')
as_list[idx] = 'South Korea'
df.index = as_list
Basically, you get the index as a list, change that one element, and then replace the existing index.
Try This
df.rename(index={'Republic of Korea':'South Korea'},inplace=True)
If you have MultiIndex DataFrame, do this:
# input DataFrame
import pandas as pd

t = pd.DataFrame(data={'i1': [0,0,0,0,1,1,1,1,2,2,2,2],
                       'i2': [0,1,2,3,0,1,2,3,0,1,2,3],
                       'x': [1.,2.,3.,4.,5.,6.,7.,8.,9.,10.,11.,12.]})
t.set_index(['i1','i2'], inplace=True)
t.sort_index(inplace=True)

# changes index level 'i1' values 0 to -1
t.rename(index={0: -1}, level='i1', inplace=True)
Here's another good one, using replace on the column (assigning the result back avoids the chained-assignment warning that inplace=True on a column can trigger):
df.reset_index(inplace=True)
df.drop('index', axis=1, inplace=True)
df["Country"] = df["Country"].replace("Republic of Korea", "South Korea")
df.set_index("Country", inplace=True)
Here's another idea based on locating the row and assigning with .loc (DataFrame.set_value was removed in pandas 1.0):
df = df.reset_index()
df.drop('index', axis=1, inplace=True)
index = df.index[df["Country"] == "Republic of Korea"]
df.loc[index, "Country"] = "South Korea"
df = df.set_index("Country")
df["Country"] = df.index
We can use the rename function to change a row index or a column name. Here is an example. Suppose the data frame looks like this:
       student_id  marks
index
1              12     33
2              23     98
To change index 1 to 5, we use axis=0, which refers to rows:
df.rename({1: 5}, axis=0)
df refers to the data frame variable. The output will be:
       student_id  marks
index
5              12     33
2              23     98
To change a column name, we have to use axis=1:
df.rename({"marks": "student_marks"}, axis=1)
So the changed data frame is:
       student_id  student_marks
index
5              12             33
2              23             98
This seems to work too:
energy.index.values[energy.index.tolist().index('Republic of Korea')] = 'South Korea'
No idea though whether this is recommended or discouraged.
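For a documented alternative to mutating the index array in place, Index.where can build a replacement index with just that one label substituted. A minimal sketch (the energy frame here is made up for illustration, since the question's data isn't shown):

```python
import pandas as pd

# Stand-in for the question's dataframe.
energy = pd.DataFrame({'value': [1, 2, 3]},
                      index=['France', 'Republic of Korea', 'Japan'])

# An Index is immutable, but .where returns a new one: each label is
# kept where the condition holds, and replaced where it doesn't.
energy.index = energy.index.where(energy.index != 'Republic of Korea',
                                  'South Korea')
```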
