Python Pandas groupby removes columns

data_c["dropoff_district"] = "default value"
data_c["distance"] = "default value" #Formed a new column named distance for geocoder
data_c["time_of_day"] = "default value" #Formed a new column named time of the day for timestamps
So I created these columns at the start of the project for plotting and data manipulation. After I edited and filled these columns with certain values, I wanted to perform a groupby operation on data_c.
avg_d = data_c.groupby(by = 'distance').sum().reset_index()
However, when I perform a groupby on data_c, I somehow lose my 'time_of_day' and 'dropoff_district' columns in avg_d. How can I solve this issue?

The problem is that Pandas doesn't know how to add date/time objects together. Thus, when you tell Pandas to groupby and then sum, it throws out the columns it doesn't know what to do with. Example,
df = pd.DataFrame([['2019-01-01', 2, 3], ['2019-02-02', 2, 4], ['2019-02-03', 3, 5]],
                  columns=['day', 'distance', 'duration'])
df.day = pd.to_datetime(df.day)
If I just run your query, I'd get,
>>> df.groupby('distance').sum()
          duration
distance
2                7
3                5
You can fix this by telling Pandas you want to do something different with those columns; for example, take the first value:
df.groupby('distance').agg({
    'duration': 'sum',
    'day': 'first'
})
which brings them back,
          duration        day
distance
2                7 2019-01-01
3                5 2019-02-03

Groupby does not remove your columns. The sum() call does. If those columns are not numeric, you will not retain them after sum().
So how would you like to retain the 'time_of_day' and 'dropoff_district' columns? Assuming you want to keep them where they are distinct, put them into the groupby:
data_c.groupby(['distance','time_of_day','dropoff_district']).sum().reset_index()
Otherwise, you will have multiple different 'time_of_day' values for the same 'distance', and you will need to massage your data first.
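For illustration, a minimal sketch with toy data standing in for data_c (the fare column is made up; the other column names follow the question):
import pandas as pd

data_c = pd.DataFrame({
    'distance':         [2, 2, 3],
    'fare':             [7.0, 9.0, 5.0],          # hypothetical numeric column
    'time_of_day':      ['morning', 'morning', 'evening'],
    'dropoff_district': ['A', 'A', 'B'],
})

# Depending on your pandas version, a plain .sum() either drops the text columns
# or needs numeric_only=True; either way they don't survive:
print(data_c.groupby('distance').sum(numeric_only=True))

# Keeping the text columns in the groupby key retains them in the result:
avg_d = (data_c
         .groupby(['distance', 'time_of_day', 'dropoff_district'])
         .sum()
         .reset_index())
print(avg_d)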


How to combine multiple rows into a single row with many columns in pandas using an id (clustering multiple records with same id into one record)

Situation:
1. all_task_usage_10_19
all_task_usage_10_19 is a file consisting of 29229472 rows × 20 columns.
There are multiple rows with the same ID in the machine_ID column but different values in the other columns.
Columns:
'start_time_of_the_measurement_period','end_time_of_the_measurement_period', 'job_ID', 'task_index','machine_ID', 'mean_CPU_usage_rate','canonical_memory_usage', 'assigned_memory_usage','unmapped_page_cache_memory_usage', 'total_page_cache_memory_usage', 'maximum_memory_usage','mean_disk_I/O_time', 'mean_local_disk_space_used', 'maximum_CPU_usage','maximum_disk_IO_time', 'cycles_per_instruction_(CPI)', 'memory_accesses_per_instruction_(MAI)', 'sample_portion',
'aggregation_type', 'sampled_CPU_usage'
2. clustering code
I am trying to cluster multiple machine_ID records using the following code, referencing: How to combine multiple rows into a single row with pandas
3. Output
Output displayed using with option_context, as it allows better visualisation of the content
My Aim:
I am trying to cluster multiple rows with the same machine_ID into a single record, so I can apply algorithms like Moving averages, LSTM and HW for predicting cloud workloads.
Something like this.
Maybe a Multi-Index is what you're looking for?
df.set_index(['machine_ID', df.index])
Note that by default set_index returns a new dataframe, and does not change the original.
To change the original (and return None) you can pass an argument inplace=True.
Example:
df = pd.DataFrame({'machine_ID': [1, 1, 2, 2, 3],
                   'a': [1, 2, 3, 4, 5],
                   'b': [10, 20, 30, 40, 50]})
new_df = df.set_index(['machine_ID', df.index]) # not in-place
df.set_index(['machine_ID', df.index], inplace=True) # in-place
For me, it does create a multi-index: the first level is 'machine_ID', the second one is the previous range index:
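For reference, a rough sketch of the resulting frame for the toy df above (exact display may vary with your pandas version):
new_df
              a   b
machine_ID
1          0  1  10
           1  2  20
2          2  3  30
           3  4  40
3          4  5  50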
The below code worked for me:
all_task_usage_10_19.groupby('machine_ID')[['start_time_of_the_measurement_period',
    'end_time_of_the_measurement_period', 'job_ID', 'task_index',
    'mean_CPU_usage_rate', 'canonical_memory_usage', 'assigned_memory_usage',
    'unmapped_page_cache_memory_usage', 'total_page_cache_memory_usage',
    'maximum_memory_usage', 'mean_disk_I/O_time', 'mean_local_disk_space_used',
    'maximum_CPU_usage', 'maximum_disk_IO_time', 'cycles_per_instruction_(CPI)',
    'memory_accesses_per_instruction_(MAI)', 'sample_portion',
    'aggregation_type', 'sampled_CPU_usage']].agg(list).reset_index()
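For illustration, a minimal sketch with made-up data showing what groupby(...).agg(list) produces; the real call above just uses the full column list from the question:
import pandas as pd

toy = pd.DataFrame({
    'machine_ID':          [1, 1, 2],
    'mean_CPU_usage_rate': [0.10, 0.20, 0.30],
    'sampled_CPU_usage':   [0.05, 0.07, 0.09],
})

print(toy.groupby('machine_ID')[['mean_CPU_usage_rate', 'sampled_CPU_usage']]
         .agg(list)
         .reset_index())
#    machine_ID mean_CPU_usage_rate sampled_CPU_usage
# 0           1          [0.1, 0.2]      [0.05, 0.07]
# 1           2               [0.3]            [0.09]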

How to sort a dataFrame in python pandas by two or more columns based on a list of values? [duplicate]

So I have a pandas DataFrame, df, with columns that represent taxonomic classification (i.e. Kingdom, Phylum, Class, etc.). I also have a list of taxonomic labels that correspond to the order I would like the DataFrame to be ordered by.
The list looks something like this:
class_list=['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes', 'Clostridia', 'Bacilli', 'Actinobacteria', 'Betaproteobacteria', 'delta/epsilon subdivisions', 'Synergistia', 'Mollicutes', 'Nitrospira', 'Spirochaetia', 'Thermotogae', 'Aquificae', 'Fimbriimonas', 'Gemmatimonadetes', 'Dehalococcoidia', 'Oscillatoriophycideae', 'Chlamydiae', 'Nostocales', 'Thermodesulfobacteria', 'Erysipelotrichia', 'Chlorobi', 'Deinococci']
This list corresponds to the DataFrame column df['Class']. I would like to sort all the rows of the whole dataframe based on the order of the list, since df['Class'] is currently in a different order. What would be the best way to do this?
You could make the Class column your index column
df = df.set_index('Class')
and then use df.loc to reindex the DataFrame with class_list:
df.loc[class_list]
Minimal example:
>>> df = pd.DataFrame({'Class': ['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes'], 'Number': [3, 5, 6]})
>>> df
                 Class  Number
0  Gammaproteobacteria       3
1        Bacteroidetes       5
2        Negativicutes       6
>>> df = df.set_index('Class')
>>> df.loc[['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']]
                     Number
Bacteroidetes             5
Negativicutes             6
Gammaproteobacteria       3
Alex's solution doesn't work if your original dataframe does not contain all of the elements in the ordered list; i.e., if your input data at some point does not contain "Negativicutes", that script will fail. One way to get past this is to append your df's to a list and concatenate them at the end. For example:
ordered_classes = ['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']
df_list = []
for i in ordered_classes:
    df_list.append(df[df['Class'] == i])
ordered_df = pd.concat(df_list)
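A quick, hedged check that the loop simply skips classes that are absent from df (toy data; newer pandas versions may warn about concatenating empty frames):
import pandas as pd

# Toy frame deliberately missing 'Negativicutes'
df = pd.DataFrame({'Class': ['Gammaproteobacteria', 'Bacteroidetes'],
                   'Number': [3, 5]})
ordered_classes = ['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']
df_list = []
for i in ordered_classes:
    df_list.append(df[df['Class'] == i])   # empty selection for the missing class
ordered_df = pd.concat(df_list)
print(ordered_df)
#                  Class  Number
# 1        Bacteroidetes       5
# 0  Gammaproteobacteria       3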

Pandas: best way to add a total row that calculates the sum of specific (multiple) columns while preserving the data type

I am trying to create a row at the bottom of a dataframe to show the sum of certain columns. I am under the impression that this should be a really simple operation, but to my surprise, none of the methods I found on SO works for me in one step.
The methods that I've found on SO:
df.loc['TOTAL'] = df.sum()
This doesn't work for me when there are non-numeric columns in the dataframe; I need to select the numeric columns first and then concat the non-numeric columns back.
df.append(df.sum(numeric_only=True), ignore_index=True)
This won't preserve my data types; integer columns will be converted to float.
df3.loc['Total', 'ColumnA']= df['ColumnA'].sum()
I can only use this to sum one column.
I must have missed something in the process as this is not that hard an operation. Please let me know how I can add a sum row while preserving the data type of the dataframe.
Thanks.
Edit:
First off, sorry for the late update. I was on the road for the last weekend
Example:
df1 = pd.DataFrame(data={'CountyID': [77, 95], 'Acronym': ['LC', 'NC'],
                         'Developable': [44490, 56261], 'Protected': [40355, 35943],
                         'Developed': [66806, 72211]},
                   index=['Lehigh', 'Northampton'])
What I want to get would be
Please ignore the differences of the index.
It's a little tricky for me because I don't need the sum of the 'CountyID' column, since it is used for indexing. So the question is more about getting the sum of specific numeric columns.
Thanks again.
Here is some toy data to use as an example:
df = pd.DataFrame({'A':[1.0,2.0,3.0],'B':[1,2,3],'C':['A','B','C']})
So that we can preserve the dtypes after the sum, we will store them as d
d = df.dtypes
Next, since we only want to sum numeric columns, pass numeric_only=True to sum(), but follow similar logic to your first attempt
df.loc['Total'] = df.sum(numeric_only=True)
And finally, reset the dtypes of your DataFrame to their original values (note that astype returns a new DataFrame rather than modifying in place).
df.astype(d)
         A  B    C
0      1.0  1    A
1      2.0  2    B
2      3.0  3    C
Total  6.0  6  NaN
To select the numeric columns, you can do
df_numeric = df.select_dtypes(include = ['int64', 'float64'])
df_num_cols = df_numeric.columns
Then do what you did first (using what I found here)
df.loc['Total'] = pd.Series(df[df_num_cols].sum(), index = [df_num_cols])
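A hedged sketch applying this to the df1 from the question, leaving CountyID out of the total as the OP wants (num_cols is a name introduced here for illustration):
num_cols = df1.select_dtypes(include=['int64', 'float64']).columns.drop('CountyID')
df1.loc['Total'] = pd.Series(df1[num_cols].sum(), index=num_cols)
# 'CountyID' and 'Acronym' are NaN in the new 'Total' row, which upcasts CountyID to float.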

Why is 'series' type not a special case of 'dataframe' type in pandas? [duplicate]

In Pandas, when I select a label that only has one entry in the index I get back a Series, but when I select an entry that has more than one entry I get back a DataFrame.
Why is that? Is there a way to ensure I always get back a data frame?
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
In [3]: type(df.loc[3])
Out[3]: pandas.core.frame.DataFrame
In [4]: type(df.loc[1])
Out[4]: pandas.core.series.Series
Granted, the behavior is inconsistent, but I think it's easy to imagine cases where this is convenient. Anyway, to get a DataFrame every time, just pass a list to loc. There are other ways, but in my opinion this is the cleanest.
In [2]: type(df.loc[[3]])
Out[2]: pandas.core.frame.DataFrame
In [3]: type(df.loc[[1]])
Out[3]: pandas.core.frame.DataFrame
The TLDR
When using loc
df.loc[:] = DataFrame
df.loc[label] = Series if the label appears once in the index, DataFrame if it appears more than once
df.loc[:, ["col_name"]] = DataFrame
df.loc[:, "col_name"] = Series
Not using loc
df["col_name"] = Series
df[["col_name"]] = DataFrame
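A quick, non-exhaustive sanity check of these rules on a small made-up frame:
import pandas as pd

df = pd.DataFrame({'col_name': [1, 2], 'other': [3, 4]})
type(df.loc[:])                 # DataFrame
type(df.loc[0])                 # Series (the label 0 appears once in the index)
type(df.loc[:, ['col_name']])   # DataFrame
type(df.loc[:, 'col_name'])     # Series
type(df['col_name'])            # Series
type(df[['col_name']])          # DataFrame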
You have an index with three entries labelled 3. For this reason, df.loc[3] returns a DataFrame.
The reason is that you don't specify the column. So df.loc[3] selects three rows across all columns (here there is only column 0), while df.loc[3, 0] returns a Series. Similarly, df.loc[1:2] also returns a DataFrame, because you slice the rows.
Selecting a single row (as df.loc[1]) returns a Series with the column names as the index.
If you want to be sure to always have a DataFrame, you can slice like df.loc[1:1]. Another option is boolean indexing (df.loc[df.index==1]) or the take method (df.take([0]), but this uses locations, not labels!).
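A small check of these options on the df from the question:
import pandas as pd

df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
type(df.loc[1:1])            # DataFrame: label-based slicing keeps the frame
type(df.loc[df.index == 1])  # DataFrame: boolean indexing
type(df.take([0]))           # DataFrame: take works on positions, not labels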
Use df['columnName'] to get a Series and df[['columnName']] to get a DataFrame.
You wrote in a comment to joris' answer:
"I don't understand the design
decision for single rows to get converted into a series - why not a
data frame with one row?"
A single row isn't converted into a Series.
It IS a Series: no, I don't think so, in fact; see the edit below.
The best way to think about the pandas data structures is as flexible
containers for lower dimensional data. For example, DataFrame is a
container for Series, and Panel is a container for DataFrame objects.
We would like to be able to insert and remove objects from these
containers in a dictionary-like fashion.
http://pandas.pydata.org/pandas-docs/stable/overview.html#why-more-than-1-data-structure
The data model of Pandas objects has been chosen like that. The reason certainly lies in the fact that it ensures some advantages I don't know about (I don't fully understand the last sentence of the citation; maybe that's the reason)
.
Edit: I don't agree with myself
A DataFrame can't be composed of elements that would be Series, because the following code gives the same type, "Series", for a row as well as for a column:
import pandas as pd
df = pd.DataFrame(data=[11,12,13], index=[2, 3, 3])
print '-------- df -------------'
print df
print '\n------- df.loc[2] --------'
print df.loc[2]
print 'type(df.loc[2]) : ',type(df.loc[2])
print '\n--------- df[0] ----------'
print df[0]
print 'type(df[0]) : ',type(df[0])
result
-------- df -------------
    0
2  11
3  12
3  13

------- df.loc[2] --------
0    11
Name: 2, dtype: int64
type(df.loc[2]) :  <class 'pandas.core.series.Series'>

--------- df[0] ----------
2    11
3    12
3    13
Name: 0, dtype: int64
type(df[0]) :  <class 'pandas.core.series.Series'>
So it makes no sense to pretend that a DataFrame is composed of Series, because what would those Series be supposed to be: columns or rows? Stupid question and vision.
.
Then what is a DataFrame?
In the previous version of this answer, I asked this question, trying to find the answer to the "Why is that?" part of the OP's question and the similar interrogation "single rows to get converted into a series - why not a data frame with one row?" in one of his comments,
while the "Is there a way to ensure I always get back a data frame?" part has been answered by Dan Allan.
Then, as the Pandas docs cited above say that the pandas data structures are best seen as containers of lower dimensional data, it seemed to me that understanding the why would be found in the characteristics of the nature of DataFrame structures.
However, I realized that this cited advice must not be taken as a precise description of the nature of Pandas' data structures.
This advice doesn't mean that a DataFrame is a container of Series.
It expresses that the mental representation of a DataFrame as a container of Series (either rows or columns, according to the option considered at a given moment of reasoning) is a good way to consider DataFrames, even if it isn't strictly the case in reality. "Good" meaning that this vision enables DataFrames to be used efficiently. That's all.
.
Then what is a DataFrame object?
The DataFrame class produces instances that have a particular structure originated in the NDFrame base class, itself derived from the PandasContainer base class that is also a parent class of the Series class.
Note that this is correct for Pandas until version 0.12. In the upcoming version 0.13, Series will derive also from NDFrame class only.
# with pandas 0.12
from pandas import Series
print 'Series :\n',Series
print 'Series.__bases__ :\n',Series.__bases__
from pandas import DataFrame
print '\nDataFrame :\n',DataFrame
print 'DataFrame.__bases__ :\n',DataFrame.__bases__
print '\n-------------------'
from pandas.core.generic import NDFrame
print '\nNDFrame.__bases__ :\n',NDFrame.__bases__
from pandas.core.generic import PandasContainer
print '\nPandasContainer.__bases__ :\n',PandasContainer.__bases__
from pandas.core.base import PandasObject
print '\nPandasObject.__bases__ :\n',PandasObject.__bases__
from pandas.core.base import StringMixin
print '\nStringMixin.__bases__ :\n',StringMixin.__bases__
result
Series :
<class 'pandas.core.series.Series'>
Series.__bases__ :
(<class 'pandas.core.generic.PandasContainer'>, <type 'numpy.ndarray'>)
DataFrame :
<class 'pandas.core.frame.DataFrame'>
DataFrame.__bases__ :
(<class 'pandas.core.generic.NDFrame'>,)
-------------------
NDFrame.__bases__ :
(<class 'pandas.core.generic.PandasContainer'>,)
PandasContainer.__bases__ :
(<class 'pandas.core.base.PandasObject'>,)
PandasObject.__bases__ :
(<class 'pandas.core.base.StringMixin'>,)
StringMixin.__bases__ :
(<type 'object'>,)
So my understanding is now that a DataFrame instance has certain methods that have been crafted in order to control the way data are extracted from rows and columns.
The ways these extracting methods work are described in this page:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing
We find in it the method given by Dan Allan and other methods.
Why have these extracting methods been crafted as they were?
Certainly because they have been appraised as the ones giving the best possibilities and ease in data analysis.
It's precisely what is expressed in this sentence:
The best way to think about the pandas data structures is as flexible
containers for lower dimensional data.
The why of the extraction of data from a DataFrame instance doesn't lie in its structure; it lies in the why of this structure. I guess that the structure and functioning of Pandas' data structures have been chiseled to be as intellectually intuitive as possible, and that to understand the details, one must read Wes McKinney's blog.
If the objective is to get a subset of the data set using the index, it is best to avoid using loc or iloc. Instead you should use syntax similar to this:
df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
result = df[df.index == 3]
isinstance(result, pd.DataFrame) # True
result = df[df.index == 1]
isinstance(result, pd.DataFrame) # True
Every time we put [['column name']] it returns a Pandas DataFrame object;
if we put ['column name'] we get a Pandas Series object.
If you also select on the index of the dataframe, the result can be a DataFrame, a Series, or a scalar (single value).
This function ensures that you always get a list from your selection (if the df, index and column are valid):
def get_list_from_df_column(df, index, column):
    df_or_series = df.loc[index, [column]]
    # df.loc[index, column] is also possible and returns a series or a scalar
    if isinstance(df_or_series, pd.Series):
        resulting_list = df_or_series.tolist()  # get list from series
    else:
        # use the column key to get a series from the dataframe
        resulting_list = df_or_series[column].tolist()
    return resulting_list
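A hypothetical usage of the helper above on a toy frame with a duplicated index label:
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30]}, index=[1, 2, 2])
get_list_from_df_column(df, 1, 'x')  # -> [10]      (unique label: the Series branch)
get_list_from_df_column(df, 2, 'x')  # -> [20, 30]  (duplicated label: the DataFrame branch)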

Pandas recalculate index after a concatenation

I have a problem where I produce a pandas dataframe by concatenating along the row axis (stacking vertically).
Each of the constituent dataframes has an autogenerated index (ascending numbers).
After concatenation, my index is screwed up: it counts up to n (where n is the shape[0] of the corresponding dataframe), and restarts at zero at the next dataframe.
I am trying to "re-calculate the index, given the current order", or "re-index" (or so I thought). Turns out that isn't exactly what DataFrame.reindex seems to be doing.
Here is what I tried to do:
train_df = pd.concat(train_class_df_list)
train_df = train_df.reindex(index=[i for i in range(train_df.shape[0])])
It failed with "cannot reindex from a duplicate axis." I don't want to change the order of my data... just need to delete the old index and set up a new one, with the order of rows preserved.
If your index is autogenerated and you don't want to keep it, you can use the ignore_index option.
train_df = pd.concat(train_class_df_list, ignore_index=True)
This will autogenerate a new index for you, and my guess is that this is exactly what you are after.
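A minimal sketch of the effect, with toy frames standing in for train_class_df_list:
import pandas as pd

train_class_df_list = [pd.DataFrame({'a': [1, 2]}), pd.DataFrame({'a': [3, 4]})]
train_df = pd.concat(train_class_df_list, ignore_index=True)
print(train_df)
#    a
# 0  1
# 1  2
# 2  3
# 3  4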
After vertical concatenation, if you get an index of [0, n) followed by [0, m), all you need to do is call reset_index:
train_df.reset_index(drop=True)
(you can do this in place using inplace=True).
>>> import pandas as pd
>>> pd.concat([
...     pd.DataFrame({'a': [1, 2]}),
...     pd.DataFrame({'a': [1, 2]})]).reset_index(drop=True)
   a
0  1
1  2
2  1
3  2
This should work:
train_df.reset_index(inplace=True, drop=True)
Set drop to True to avoid an additional column in your dataframe.
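A small sketch of the difference drop makes (made-up two-frame concat):
import pandas as pd

df = pd.concat([pd.DataFrame({'a': [1, 2]}), pd.DataFrame({'a': [3, 4]})])
df.reset_index()           # keeps the old index as a new 'index' column
df.reset_index(drop=True)  # discards the old index entirely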
