When grouping a Pandas DataFrame, when should I use transform and when should I use aggregate? How do
they differ with respect to their application in practice and which one do you
consider more important?
Consider the DataFrame df:
import pandas as pd

df = pd.DataFrame(dict(A=list('aabb'), B=[1, 2, 3, 4], C=[0, 9, 0, 9]))
Aggregating after a groupby is the standard use; each group is reduced to a single row:
df.groupby('A').mean()
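For the df above, the aggregated result has one row per group, roughly:
     B    C
A
a  1.5  4.5
b  3.5  4.5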
Maybe you want these values broadcast across the whole group, returning something with the same index as what you started with. Use transform:
df.groupby('A').transform('mean')
df.set_index('A').groupby(level='A').transform('mean')
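A common use of the broadcast result is combining it with the original columns, for instance group-wise de-meaning; a small sketch on the df above (the new column name is just illustrative):
df['B_demeaned'] = df['B'] - df.groupby('A')['B'].transform('mean')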
agg is used when you want to run specific functions on particular columns, or more than one function on the same column:
df.groupby('A').agg(['mean', 'std'])
df.groupby('A').agg(dict(B='sum', C=['mean', 'prod']))
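As a side note (not part of the original answer, and assuming pandas 0.25 or newer), named aggregation gives flat column names instead of the MultiIndex columns the dict form produces:
df.groupby('A').agg(B_sum=('B', 'sum'), C_mean=('C', 'mean'), C_prod=('C', 'prod'))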
Situation:
1. all_task_usage_10_19
all_task_usage_10_19 is a file consisting of 29229472 rows × 20 columns.
There are multiple rows with the same ID inside the column machine_ID with different values in other columns.
Columns:
'start_time_of_the_measurement_period','end_time_of_the_measurement_period', 'job_ID', 'task_index','machine_ID', 'mean_CPU_usage_rate','canonical_memory_usage', 'assigned_memory_usage','unmapped_page_cache_memory_usage', 'total_page_cache_memory_usage', 'maximum_memory_usage','mean_disk_I/O_time', 'mean_local_disk_space_used', 'maximum_CPU_usage','maximum_disk_IO_time', 'cycles_per_instruction_(CPI)', 'memory_accesses_per_instruction_(MAI)', 'sample_portion',
'aggregation_type', 'sampled_CPU_usage'
2. clustering code
I am trying to cluster multiple machine_ID records using the following code, referencing: How to combine multiple rows into a single row with pandas
3. Output
Output is displayed using with option_context, as it allows better visualisation of the content.
My Aim:
I am trying to cluster multiple rows with the same machine_ID into a single record, so I can apply algorithms like Moving averages, LSTM and HW for predicting cloud workloads.
Something like this.
Maybe a Multi-Index is what you're looking for?
df.set_index(['machine_ID', df.index])
Note that by default set_index returns a new dataframe, and does not change the original.
To change the original (and return None) you can pass an argument inplace=True.
Example:
df = pd.DataFrame({'machine_ID': [1, 1, 2, 2, 3],
                   'a': [1, 2, 3, 4, 5],
                   'b': [10, 20, 30, 40, 50]})
new_df = df.set_index(['machine_ID', df.index]) # not in-place
df.set_index(['machine_ID', df.index], inplace=True) # in-place
For me, it does create a multi-index: first level is 'machine_ID', second one is the previous range index:
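The result for the example above looks roughly like this:
              a   b
machine_ID
1          0  1  10
           1  2  20
2          2  3  30
           3  4  40
3          4  5  50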
The below code worked for me:
all_task_usage_10_19.groupby('machine_ID')[
    ['start_time_of_the_measurement_period', 'end_time_of_the_measurement_period',
     'job_ID', 'task_index', 'mean_CPU_usage_rate', 'canonical_memory_usage',
     'assigned_memory_usage', 'unmapped_page_cache_memory_usage',
     'total_page_cache_memory_usage', 'maximum_memory_usage',
     'mean_disk_I/O_time', 'mean_local_disk_space_used', 'maximum_CPU_usage',
     'maximum_disk_IO_time', 'cycles_per_instruction_(CPI)',
     'memory_accesses_per_instruction_(MAI)', 'sample_portion',
     'aggregation_type', 'sampled_CPU_usage']
].agg(list).reset_index()
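To see what agg(list) does on a small scale, here is a toy sketch with made-up data and just two of the columns:
import pandas as pd

toy = pd.DataFrame({'machine_ID': [1, 1, 2],
                    'mean_CPU_usage_rate': [0.1, 0.3, 0.2]})
# each machine_ID becomes a single row whose cell holds the list of its values
toy.groupby('machine_ID')['mean_CPU_usage_rate'].agg(list).reset_index()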
What is a good method to sum all Null / NaN values per column of a dataframe when using Koalas?
or stated another way
How might I return, by column, a list of the total null value counts? I am trying to avoid converting the dataframe to Spark or pandas if possible.
NOTE: .sum() omits null values in Koalas (skipna: boolean, default True; it cannot be changed to False), so running df1.isnull().sum() is out of the question.
numpy was listed as an alternative, but because the dataframe is in Koalas, I observed that .sum() was still omitting the NaN values.
Disclaimer: I get that I can run pandas on Spark, but I understand that is counter-productive resource-wise. I hesitate to sum it from a Spark or pandas dataframe and then convert the dataframe into Koalas (again wasting resources, in my opinion). I'm working with a dataset that contains 73 columns and 4m rows.
You can actually use df.isnull(): it returns an "array" of booleans indicating whether each value is missing. After isnull() there are no missing values left, only True/False, so skipna no longer matters and summing simply counts the True values. Therefore, if you first call isnull and then sum, you get the correct count.
Example:
import databricks.koalas as ks
df = ks.DataFrame([
    [1, 3, 9],
    [2, 3, 7],
    [3, None, 3]
], columns=["c1", "c2", "c3"])
df.isnull().sum()
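For this frame, the per-column null counts should come out roughly as:
c1    0
c2    1
c3    0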
I am having trouble adding several dataframes in a list of dataframes. My goal is to add dataframes from a list of dataframes based on the criteria from another list.
Example: Suppose we have a list of 10 Dataframes, DfList and another list called OrderList.
Suppose OrderList = [3, 2, 1, 4].
Then I would like to obtain a new list of 4 Dataframes in the form [DfList(0) + DfList(1) + DfList(2), DfList(3) + DfList(4), DfList(5), DfList(6) + DfList(7) + DfList(8) + DfList(9)]
I have tried a few ways to do this creating functions using DataFrame.add. Initially, my hope was that I could use the form sum(DfList(0), DfList(1), DfList(2)) to do this but quickly learned that sum() doesn't seem to be supported with DataFrames.
I was hoping to use something like sum(DfList[0:2]) and making OrderList cumulative so I could just use sum(DfList[OrderList[i]:OrderList[i+1]]) but keep getting unsupported operand type errors.
Is there an easy way to do this that I am not considering or is there a different approach entirely that you would suggest?
EDIT: The output I am looking for is another list of DataFrames containing four summed DataFrames based on OrderList (across all columns.) Three DataFrames added together for the first, two for the second, one for the third, and four for the fourth.
If you have a list of DataFrames as you said, you can use the operation sum(DfList[0:2]), but you need to be careful with the column labels in each DataFrame in your list: pandas aligns DataFrames on their column (and index) labels when adding, so values are matched by name rather than by position. Columns that appear in only one of the frames come out as NaN, and columns whose contents are incompatible under the same label raise a TypeError. If you need to, the columns can be renamed or reordered as shown in this other question.
This example illustrates the issue:
import pandas as pd

# same labels, but column 1 holds strings in df2: adding int + str under the same label fails
df1 = pd.DataFrame({1: [1, 23, 4], 2: ['x', 'y', 'z']})
df2 = pd.DataFrame({1: ['x', 'y', 'z'], 2: [1, 23, 4]})
try:
    df1 + df2
except TypeError:
    print("Error")

# same labels and compatible contents: works fine, even with a different column order
df1 = pd.DataFrame({1: [1, 23, 4], 2: ['x', 'y', 'z']})
df2 = pd.DataFrame({2: ['x', 'y', 'z'], 1: [1, 23, 4]})
df1 + df2
Also, the logic that you used for the cumulative sum in sum(DfList[OrderList[i]:OrderList[i+1]]) is not correct. For this to work, OrderList would also need to be cumulative and have one extra element to start from zero, so instead of OrderList = [3, 2, 1, 4] you would have OrderList = [0, 3, 5, 6, 10].
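A minimal sketch of that cumulative idea, assuming (as in the question) that DfList holds numeric DataFrames with matching column labels and OrderList gives the group sizes:
from itertools import accumulate

OrderList = [3, 2, 1, 4]
bounds = [0] + list(accumulate(OrderList))                  # [0, 3, 5, 6, 10]
summed = [sum(DfList[i:j]) for i, j in zip(bounds, bounds[1:])]
# summed[0] == DfList[0] + DfList[1] + DfList[2], and so on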
data_c["dropoff_district"] = "default value"
data_c["distance"] = "default value" #Formed a new column named distance for geocoder
data_c["time_of_day"] = "default value" #Formed a new column named time of the day for timestamps
So I created these columns at the start of the project for plotting and data manipulation. After I edited and filled these columns with certain values, I wanted to perform a groupby operation on data_c.
avg_d = data_c.groupby(by = 'distance').sum().reset_index()
However, when I perform the groupby on data_c, I somehow lose my 'time_of_day' and 'dropoff_district' columns in avg_d. How can I solve this issue?
The problem is that Pandas doesn't know how to add date/time objects together. Thus, when you tell Pandas to groupby and then sum, it throws out the columns it doesn't know what to do with. For example:
df = pd.DataFrame([['2019-01-01', 2, 3], ['2019-02-02', 2, 4], ['2019-02-03', 3, 5]],
                  columns=['day', 'distance', 'duration'])
df.day = pd.to_datetime(df.day)
If I just run your query, I'd get,
>>> df.groupby('distance').sum()
duration
distance
2 7
3 5
You can fix this by telling Pandas you want to do something different with those columns; for example, take the first value:
df.groupby('distance').agg({
'duration': 'sum',
'day': 'first'
})
which brings them back,
duration day
distance
2 7 2019-01-01
3 5 2019-02-03
Groupby does not remove your columns. The sum() call does. If those columns are not numeric, you will not retain them after sum().
So how would you like to retain the columns 'time_of_day' and 'dropoff_district'? Assuming you still want to keep them when they are distinct, put them into the groupby:
data_c.groupby(['distance','time_of_day','dropoff_district']).sum().reset_index()
otherwise, you will have multiple different 'time_of_day' for the same 'distance'. You need to massage your data first.
I've started using Pandas for some large datasets and mostly it works really well. There are some questions I have regarding the indices, though:
1. I have a MultiIndex with three levels - let's say a, b, c. How do I slice along level a? I just want the values where a = 5, 7, 10, 13. Doing df.ix[[5, 7, 10, 13]] does not work, as pointed out in the documentation.
2. I need to have different indices on a DataFrame - can I create these multiple indices without associating them with a dataframe, and use them to give me back the raw ndarray index?
3. Can I slice a MultiIndex on its own, not in a Series or DataFrame?
Thanks in advance
For the first part, you can use boolean indexing using get_level_values:
df[df.index.get_level_values('a').isin([5, 7, 10, 13])]
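A self-contained sketch with made-up data, showing how that boolean filter behaves on a three-level index:
import pandas as pd
import numpy as np

# hypothetical levels for a, b, c
idx = pd.MultiIndex.from_product([[5, 6, 7, 10, 13], ['x', 'y'], [1, 2]],
                                 names=['a', 'b', 'c'])
df = pd.DataFrame({'val': np.arange(len(idx))}, index=idx)

subset = df[df.index.get_level_values('a').isin([5, 7, 10, 13])]  # drops the a == 6 rows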
For the second two, you can inspect the MultiIndex object by calling:
df.index
(and this can be inspected/sliced.)
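Continuing the toy frame above, the MultiIndex object can indeed be handled on its own:
idx = df.index                    # a standalone MultiIndex
idx[:4]                           # positional slicing works directly on the index
idx.get_level_values('a')         # raw values of one level, usable for boolean masks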
Edit: This answer applies to pandas versions lower than 0.10.0 only.
Okay, @hayden had the right idea to start with: an index has the method get_level_values(), which, however, returns an array (in pandas versions < 0.10.0). The isin() method doesn't exist for arrays, but this works:
from pandas import lib
lib.ismember(df.index.get_level_values('a'), set([5, 7, 10, 13]))
That only answers question 1 - but I'll give an update if I crack 2 and 3 (half done with @hayden's help).