DataFrame.drop not dropping expected rows in Pandas - python

I have a Pandas DataFrame that includes rows that I want to drop based on values in a column "population":
data['population'].value_counts()
general population 21
developmental delay 20
sibling 2
general population + developmental delay 1
dtype: int64
Here, I want to drop the two rows that have 'sibling' as the value. So I believe the following should do the trick:
data = data.drop(data.population=='sibling', axis=0)
It does drop 2 rows, as you can see in the resulting value counts, but they were not the rows with the specified value.
data.population.value_counts()
developmental delay 20
general population 19
sibling 2
general population + developmental delay 1
dtype: int64
Any idea what is going on here?

DataFrame.drop accepts labels (an index or list of labels) as a parameter, not a boolean mask.
To use drop you should do:
data = data.drop(data.index[data.population == 'sibling'])
However, it is much simpler to do
data = data[data.population != 'sibling']
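For illustration, here is a minimal, self-contained sketch of both working approaches on a toy frame (the column and values mirror the question; the variable names are just for the example):
import pandas as pd
data = pd.DataFrame({'population': ['general population', 'sibling',
                                    'developmental delay', 'sibling']})
# Option 1: translate the boolean mask into index labels, then drop those labels
dropped = data.drop(data.index[data.population == 'sibling'])
# Option 2: simply keep the rows you want (usually the clearest)
filtered = data[data.population != 'sibling']
print(filtered['population'].value_counts())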

Related

Arithmetic operations across rows in pandas dataframe

How do I perform an arithmetic operation across rows and columns for a data frame like the one shown below?
For example I want to calculate gross margin (gross profit/Revenue) - this is basically dividing one row by another row. I want to do this across all columns.
I think you need to restructure your dataframe a little bit to do this most effectively. If you transposed your dataframe such that Revenue, etc. were columns and the years were the index, you could do:
df["gross_margin"] = df["Gross profit"] / df["Revenue"]
If you don't want to make so many changes, you should at least set the metric as the index.
df = df.set_index("Metric")
And then you could:
gross_margin = df.loc["Gross profit", :] / df.loc["Revenue", :]
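For a minimal end-to-end sketch of that approach (the numbers below are made up purely for illustration):
import pandas as pd
df = pd.DataFrame({
    'Metric': ['Revenue', 'Cost_of_Sales', 'Gross profit'],
    '2015': [100.0, -60.0, 40.0],
    '2016': [120.0, -72.0, 48.0],
}).set_index('Metric')
gross_margin = df.loc['Gross profit', :] / df.loc['Revenue', :]
# 2015    0.4
# 2016    0.4
# dtype: float64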
Here is one way to do it:
# Transpose so metrics become columns and years become the index
df2 = df.T
# Divide Gross_profit (column 2) by Revenue (column 0), skipping the 'Metric' header row
df2['3'] = df2.iloc[1:, 2] / df2.iloc[1:, 0]
# Transpose back and label the new row
df2 = df2.T
df2.iloc[3, 0] = 'Gross Margin'
df2
Metric 2012 2013 2014 2015 2016
0 Revenue 116707394.0 133084076.0 143328982.0 151271526.0 181910977.0
1 Cost_of_Sales -66538762.0 -76298147.0 -82099051.0 -83925957.0 -106583385.0
2 Gross_profit 501686320.0 56785929.0 612299310.0 67345569.0 75327592.0
3 Gross Margin 4.298668 0.426692 4.271985 0.445197 0.41409

Why do groupby operations behave differently

When using pandas groupby functions and manipulating the output after the groupby, I've noticed that some functions behave differently in terms of what is returned as the index and how this can be manipulated.
Say we have a dataframe with the following information:
Name Type ID
0 Book1 ebook 1
1 Book2 paper 2
2 Book3 paper 3
3 Book1 ebook 1
4 Book2 paper 2
If we do
df.groupby(["Name", "Type"]).sum()
we get a DataFrame:
ID
Name Type
Book1 ebook 2
Book2 paper 4
Book3 paper 3
which contains a MultiIndex with the columns used in the groupby:
MultiIndex([('Book1', 'ebook'),
('Book2', 'paper'),
('Book3', 'paper')],
names=['Name', 'Type'])
and one column called ID.
But if I apply a size() function, the result is a Series:
Name Type
Book1 ebook 2
Book2 paper 2
Book3 paper 1
dtype: int64
And finally, if I do a pct_change(), we get only the resulting DataFrame column:
ID
0 NaN
1 NaN
2 NaN
3 0.0
4 0.0
TL;DR: I want to know why some functions return a Series whilst others return a DataFrame, as this confused me when dealing with different operations on the same DataFrame.
From the documentation for size:
Returns
Series
Number of rows in each group.
For sum, since you did not select a column to aggregate, it returns a DataFrame of the remaining (non-key) columns, with the groupby keys as the index; selecting the column first gives a Series:
df.groupby(["Name", "Type"])['ID'].sum()  # returns a Series
Functions like diff and pct_change are not aggregations: they return values with the same index as the original DataFrame. count, mean and sum are aggregations, so their results are indexed by the groupby keys.
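A quick way to see the difference with the example df from the question: an aggregation returns one value per group, while pct_change returns one value per original row.
df.groupby(["Name", "Type"])['ID'].sum().shape         # (3,) -> one value per group
df.groupby(["Name", "Type"])['ID'].pct_change().shape  # (5,) -> one value per original row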
The outputs are different because the aggregations are different, and those are what mostly control what is returned. Think of the array equivalent. The data are the same, but one "aggregation" returns a single scalar value while the other returns an array the same size as the input:
import numpy as np
np.array([1,2,3]).sum()
#6
np.array([1,2,3]).cumsum()
#array([1, 3, 6], dtype=int32)
The same thing goes for aggregations of a DataFrameGroupBy object. All the first part of the groupby does is create a mapping from the DataFrame to the groups. Since this doesn't really do anything there's no reason why the same groupby with a different operation needs to return the same type of output (see above).
gp = df.groupby(["Name", "Type"])
# Haven't done any aggregations yet...
The other important part here is that we have a DataFrameGroupBy object. There are also SeriesGroupBy objects, and that difference can change the return.
gp
#<pandas.core.groupby.generic.DataFrameGroupBy object>
So what happens when you aggregate?
With a DataFrameGroupBy, when you choose an aggregation (like sum) that collapses to a single value per group, the return will be a DataFrame whose index is the unique grouping keys. The return is a DataFrame because we provided a DataFrameGroupBy object. DataFrames can have multiple columns, and had there been another numeric column it would have been aggregated too, necessitating the DataFrame output.
gp.sum()
# ID
#Name Type
#Book1 ebook 2
#Book2 paper 4
#Book3 paper 3
On the other hand if you use a SeriesGroupBy object (select a single column with []) then you'll get a Series back, again with the index of unique group keys.
df.groupby(["Name", "Type"])['ID'].sum()
|------- SeriesGroupBy ----------|
#Name Type
#Book1 ebook 2
#Book2 paper 4
#Book3 paper 3
#Name: ID, dtype: int64
For aggregations that return arrays (like cumsum or pct_change), a DataFrameGroupBy will return a DataFrame and a SeriesGroupBy will return a Series. But the index is no longer the unique group keys. This is because that would make little sense; typically you'd want to do a calculation within the group and then assign the result back to the original DataFrame. As a result, the return is indexed like the original DataFrame you provided for aggregation. This makes creating these columns very simple, as pandas handles all of the alignment:
df['ID_pct_change'] = gp.pct_change()
# Name Type ID ID_pct_change
#0 Book1 ebook 1 NaN
#1 Book2 paper 2 NaN
#2 Book3 paper 3 NaN
#3 Book1 ebook 1 0.0 # Calculated from row 0 and aligned.
#4 Book2 paper 2 0.0
But what about size? That one is a bit weird. The size of a group is a scalar. It doesn't matter how many columns the group has or whether values in those columns are missing, so sending it a DataFrameGroupBy or SeriesGroupBy object is irrelevant. As a result, pandas will always return a Series. Again, being a group-level aggregation that returns a scalar, it makes sense for the return to be indexed by the unique group keys:
gp.size()
#Name Type
#Book1 ebook 2
#Book2 paper 2
#Book3 paper 1
#dtype: int64
Finally, for completeness: though aggregations like sum return a single scalar value per group, it can often be useful to bring those values back to every row of that group in the original DataFrame. However, the return of a normal .sum has a different index, so it won't align. You could merge the values back on the unique keys, but pandas provides the ability to transform these aggregations. Since the intent here is to bring the result back to the original DataFrame, the Series/DataFrame is indexed like the original input:
gp.transform('sum')
# ID
#0 2 # Row 0 is Book1 ebook which has a group sum of 2
#1 4
#2 3
#3 2 # Row 3 is also Book1 ebook which has a group sum of 2
#4 4
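For example, a common use of transform is to attach the per-group total as a new column aligned with the original rows (the column name ID_group_sum here is just an illustrative choice):
df['ID_group_sum'] = df.groupby(["Name", "Type"])['ID'].transform('sum')
#     Name   Type  ID  ID_group_sum
# 0  Book1  ebook   1             2
# 1  Book2  paper   2             4
# 2  Book3  paper   3             3
# 3  Book1  ebook   1             2
# 4  Book2  paper   2             4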

Find Gaps in a Pandas Dataframe

I have a DataFrame which has a column for minutes and a correlated value. The frequency is about 79 seconds, but sometimes there is missing data for a period (no rows at all). I want to detect whether there is a gap of 25 or more minutes and delete the dataset if so.
How do I test whether such a gap exists?
The dataframe looks like this:
INDEX minutes data
0 23.000 1.456
1 24.185 1.223
2 27.250 0.931
3 55.700 2.513
4 56.790 1.446
... ... ...
So there is an irregular but short gap and one that exceeds 25 minutes. In this case I want the dataset to be empty.
I am quite new to Python, and especially to Pandas, so an explanation would be helpful to learn.
You can use numpy.roll to create a column with shifted values (i.e. the first value from the original column becomes the second value, the second becomes the third, etc):
import pandas as pd
import numpy as np
df = pd.DataFrame({'minutes': [23.000, 24.185, 27.250, 55.700, 56.790]})
np.roll(df['minutes'], 1)
# output: array([56.79 , 23. , 24.185, 27.25 , 55.7 ])
Add this as a new column to your dataframe and subtract the new column from the original column.
We also drop the first row beforehand, since we don't want to calculate the difference from your first timepoint in the original column and your last timepoint that got rolled to the start of the new column.
Then we just ask whether any of the values resulting from the subtraction are above your threshold:
df['rolled_minutes'] = np.roll(df['minutes'], 1)
dropped_df = df.drop(index=0)
diff = dropped_df['minutes'] - dropped_df['rolled_minutes']
(diff > 25).any()
# output: True
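If you prefer to stay within pandas, the same check can also be written with Series.diff, which computes the difference between consecutive rows directly; has_gap below is just an illustrative helper name, and the 25-minute threshold comes from the question:
def has_gap(df, threshold=25):
    # Difference between consecutive 'minutes' values; the first entry is NaN and is ignored by the comparison
    return (df['minutes'].diff() > threshold).any()

if has_gap(df):
    df = df.iloc[0:0]  # empty the dataset but keep the columns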

Select columns in a DataFrame conditional on row

I am attempting to generate a dataframe (or series) based on another dataframe, selecting a different column from the first frame depending on the row, using another series. In the simplified example below, I want the frame1 values from 'a' for the first three rows, and 'b' for the final two (the picked_values series).
frame1=pd.DataFrame(np.random.randn(10).reshape(5,2),index=range(5),columns=['a','b'])
picked_values=pd.Series(['a','a','a','b','b'])
Frame1
a b
0 0.283519 1.462209
1 -0.352342 1.254098
2 0.731701 0.236017
3 0.022217 -1.469342
4 0.386000 -0.706614
Trying to get to the series:
0 0.283519
1 -0.352342
2 0.731701
3 -1.469342
4 -0.706614
I was hoping frame1[picked_values] would work, but this ends up with five columns.
In the real-life example, picked_values is a lot larger and calculated.
Thank you for your time.
Use df.lookup
pd.Series(frame1.lookup(picked_values.index,picked_values))
0 0.283519
1 -0.352342
2 0.731701
3 -1.469342
4 -0.706614
dtype: float64
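Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0. If it is not available in your version, an equivalent sketch using plain integer indexing (same frame1 and picked_values as above):
import numpy as np
import pandas as pd

rows = np.arange(len(frame1))
cols = frame1.columns.get_indexer(picked_values)
result = pd.Series(frame1.to_numpy()[rows, cols], index=frame1.index)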
Here's a NumPy based approach using integer indexing and Series.searchsorted:
frame1.values[frame1.index, frame1.columns.searchsorted(picked_values.values)]
# array([0.22095278, 0.86200616, 1.88047197, 0.49816937, 0.10962954])

Pivot across multiple columns with repeating values in each column

I am trying to pivot a pandas dataframe, but the data is following a strange format that I cannot seem to pivot. The data is structured as below:
Date, Location, Action1, Quantity1, Action2, Quantity2, ... ActionN, QuantityN
<date> 1 Lights 10 CFloor 1 ... Null Null
<date2> 2 CFloor 2 CWalls 4 ... CBasement 15
<date3> 2 CWalls 7 CBasement 4 ... NUll Null
Essentially, each action will always have a quantity attached to it (which may be 0), but null actions will never have a quantity (the quantity will just be null). The format I am trying to achieve is the following:
Lights CFloor CBasement CWalls
1 10 1 0 0
2 0 2 19 11
The index of the rows becomes the location while the columns become any unique action found across the multiple activity columns. When pulling the data together, the value of each row/column is the sum of each quantity associated with the action (i.e. Action1 corresponds to Quantity1). Is there a way to do this with the native pandas pivot function?
My current code performs a ravel across all the activity columns to get a list of all unique activities. It will also grab all the unique locations from the Location column. Once I have the unique columns, I create an empty dataframe and fill it with zeros:
Lights CFloor CBasement CWalls
1 0 0 0 0
2 0 0 0 0
I then iterate back over the old data frame with the itertuples() method (I was told it was significantly faster than iterrows()) and populate the new dataframe. This empty dataframe acts as a template that is stored in memory and filled later.
#Creates a template from the dataframe
def create_template(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    activities = df[act_cols]
    flat_acts = activities.values.ravel('K')
    unique_locations = pd.unique(df['Location'])
    unique_acts = pd.unique(flat_acts)
    pivot_template = pd.DataFrame(index=unique_locations, columns=unique_acts).fillna(0)
    return pivot_template

#Fills the template from the dataframe
def create_pivot(df, pivot_frmt):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    for row in df.itertuples():
        for act, quantity in zip(act_cols, quant_cols):
            act_val = getattr(row, act)
            if pd.notna(act_val):
                quantity_val = getattr(row, quantity)
                location = getattr(row, 'Location')
                pivot_frmt.loc[location, act_val] += quantity_val
    return pivot_frmt
While my solution works, it is incredibly slow when dealing with a large dataset and has taken 10 seconds or more to complete this type of operation. Any help would be greatly appreciated!
After experimenting with various pandas functions, such as melt and pivoting on multiple columns simultaneously, I found a solution that worked for me:
For every quantity-activity pair, I build a partial frame of the final dataset and store it in a list. Once every pair has been addressed, I end up with multiple dataframes that all have the same row counts but potentially different column counts. I solved this by concatenating the columns and, if any columns are repeated, summing them to get the final result.
import numpy as np
import pandas as pd

def test_pivot(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    dfs = []
    for act, quant in zip(act_cols, quant_cols):
        # One partial pivot per Activity/Quantity pair
        partial = pd.crosstab(index=df['Location'], columns=df[act],
                              values=df[quant], aggfunc=np.sum).fillna(0)
        dfs.append(partial)
    # Concatenate the partials, then sum any repeated activity columns
    finalDf = pd.concat(dfs, axis=1)
    finalDf = finalDf.groupby(finalDf.columns, axis=1).sum()
    return finalDf
There are two assumptions that I make during this approach:
The indexes maintain their order across all partial dataframes
There are an equivalent number of indexes across all partial dataframes
While this is probably not the most elegant solution, it achieves the desired result and reduced the processing time by a very significant margin (from about 10 s to about 0.2 s on ~4k rows). If anybody has a better way to deal with this type of scenario and do the process outlined above in one shot, I would love to see your response!
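For reference, here is a hedged sketch of a melt-style, single-pass variant under the same assumptions about the Activity/Quantity column names (untested against the real data):
import pandas as pd

act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']

# Stack every Activity/Quantity pair into one long frame, then pivot once
long_df = pd.concat(
    [df[['Location', a, q]].rename(columns={a: 'Activity', q: 'Quantity'})
     for a, q in zip(act_cols, quant_cols)],
    ignore_index=True,
).dropna(subset=['Activity'])

result = long_df.pivot_table(index='Location', columns='Activity',
                             values='Quantity', aggfunc='sum', fill_value=0)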
