Combine only certain rows in dataframe efficiently

I have a dataframe that holds the start and end times of certain activities in consecutive rows that share the same id and activity. Every now and then there is a row without a matching end, which I want to drop eventually (ids 3 and 5 in this example). The paired rows (id/act pairs 1/10, 2/10 and 1/10 again at a later time) can be merged, i.e. the second row of each pair can be dropped. I can add the end times simply by shifting one column, but I am having a hard time getting rid of the unnecessary rows without iterating through the whole dataframe.
import pandas as pd
df = pd.DataFrame([[1,10,20],[1,10,25],[2,10,40],[2,10,41],[3,10,42],[1,10,45],[1,10,45],[5,10,50]], columns=['id','act','time'])
df["time 2"]=df["time"].shift(-1)

Thank you so much for the quick reply, but I actually fixed this myself with a very simple solution:
df = pd.DataFrame([[1,10,20],[1,10,25],[2,10,40],[2,10,41],[3,10,42],[1,10,45],[1,10,45],[5,10,50]], columns=['id','act','time'])
id act time
0 1 10 20
1 1 10 25
2 2 10 40
3 2 10 41
4 3 10 42
5 1 10 45
6 1 10 45
7 5 10 50
df["end"]=df["time"].shift(-1)
df["id 2"]=df["id"].shift(-1)
df["act 2"]=df["act"].shift(-1)
df.drop(df.index[len(df)-1],inplace=True)
id act time end id 2 act 2
0 1 10 20 25.0 1.0 10.0
1 1 10 25 40.0 2.0 10.0
2 2 10 40 41.0 2.0 10.0
3 2 10 41 42.0 3.0 10.0
4 3 10 42 45.0 1.0 10.0
5 1 10 45 45.0 1.0 10.0
6 1 10 45 50.0 5.0 10.0
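df=df.loc[(df["id"]==df["id 2"]) & (df["act"]==df["act 2"])].copy()  # keep rows whose next row has the same id AND the same act
df.drop(columns=["id 2","act 2"],inplace=True)  # the helper columns are no longer needed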
id act time end
0 1 10 20 25.0
2 2 10 40 41.0
5 1 10 45 45.0
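The same idea can be sketched without helper columns. This is only a minimal illustration, assuming (as in the example) that a pair always consists of exactly two consecutive rows; the names `nxt`, `is_start` and `paired` are made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 10, 20], [1, 10, 25], [2, 10, 40], [2, 10, 41],
     [3, 10, 42], [1, 10, 45], [1, 10, 45], [5, 10, 50]],
    columns=["id", "act", "time"],
)

# Look one row ahead; the last row gets NaN and therefore never matches.
nxt = df.shift(-1)

# A row opens a pair when the next row has the same id and the same act.
is_start = df["id"].eq(nxt["id"]) & df["act"].eq(nxt["act"])

# Keep only the opening rows and attach the end time taken from the next row.
paired = df.loc[is_start].assign(end=nxt.loc[is_start, "time"])
print(paired)
```

If three or more consecutive rows could share the same id and act, this would treat them as overlapping pairs, so the assumption above matters.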

Related

Python Pandas: "Series" objects are mutable, thus cannot be hashed when using .groupby

I want to take the 2nd derivative of the column ['Value'] and place it into another column. There is also another column called ['Cycle#'] that organizes the data into various cycles. So for each cycle, I want to take the 2nd derivative of that set of numbers.
I have tried using this:
Data3['Diff2'] = Data3.groupby('Cycle#').apply(Data3['Value'] - 2*Data3['Value'].shift(1) + Data3['Value'].shift(2))
The expression gives me the 2nd derivative before I add the groupby, but with the groupby I am getting the error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Anyone know why?

import numpy as np
import pandas as pd
rng = np.random.default_rng(seed=42)
df = pd.DataFrame(
{"Cycle#": rng.integers(1,4, size=12),
"Value": rng.integers(1,11, size=12)*10
})
df
###
Cycle# Value
0 1 80
1 3 80
2 2 80
3 2 80
4 2 60
5 3 20
6 1 90
7 3 50
8 1 60
9 1 40
10 2 20
11 3 100
df['Diff2'] = df.groupby('Cycle#', as_index=False)['Value'].transform(lambda x:x - 2*x.shift(1) + x.shift(2))
df
###
Cycle# Value Diff2
0 1 80 NaN
1 3 80 NaN
2 2 80 NaN
3 2 80 NaN
4 2 60 -20.0
5 3 20 NaN
6 1 90 NaN
7 3 50 90.0
8 1 60 -40.0
9 1 40 10.0
10 2 20 -20.0
11 3 100 20.0
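A note on why the original call fails: groupby(...).apply() expects a function, whereas the expression in the question is evaluated first and hands apply() a finished Series. The sketch below is an alternative formulation, not the answer's own code: it relies on the fact that x - 2*x.shift(1) + x.shift(2) is the difference of the difference, so two grouped diff() calls give the same numbers. The data is copied from the printed frame above.

```python
import pandas as pd

df = pd.DataFrame({
    "Cycle#": [1, 3, 2, 2, 2, 3, 1, 3, 1, 1, 2, 3],
    "Value": [80, 80, 80, 80, 60, 20, 90, 50, 60, 40, 20, 100],
})

# First difference within each cycle.
first = df.groupby("Cycle#")["Value"].diff()

# Difference of the first difference, again within each cycle.
df["Diff2"] = first.groupby(df["Cycle#"]).diff()
print(df)
```

The first two rows of every cycle stay NaN, because a second difference needs two earlier values from the same cycle.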

How to filter a dataframe and identify records based on a condition on multiple other columns

id zone price
0 0000001 1 33.0
1 0000001 2 24.0
2 0000001 3 34.0
3 0000001 4 45.0
4 0000001 5 51.0
I have the above pandas dataframe. It contains multiple ids (only one id is shown here). Each id has 5 zones and 5 prices, and the prices should follow the pattern below:
p1 (price of zone 1) < p2 < p3 < p4 < p5
If anything is out of order, we should identify the anomalous records and print them to a file.
In this example p3 < p4 < p5 holds, but p1 and p2 are erroneous (p1 > p2, whereas p1 < p2 is expected).
Therefore the first 2 records should be printed to a file.
Likewise, this has to be done across the entire dataframe for all unique ids in it.
My dataframe is huge. What is the most efficient way to do this filtering and identify the erroneous records?

You can compute the diff per group after sorting the values to ensure the zones are increasing. If the diff is ≤ 0 the price is not strictly increasing and the rows should be flagged:
s = (df.sort_values(by=['id', 'zone']) # sort rows
.groupby('id') # group by id
['price'].diff() # compute the diff
.le(0) # flag those ≤ 0 (not increasing)
)
df[s|s.shift(-1)] # slice flagged rows + previous row
Example output:
id zone price
0 1 1 33.0
1 1 2 24.0
Example input:
id zone price
0 1 1 33.0
1 1 2 24.0
2 1 3 34.0
3 1 4 45.0
4 1 5 51.0
5 2 1 20.0
6 2 2 24.0
7 2 3 34.0
8 2 4 45.0
9 2 5 51.0
Saving to a file:
df[s|s.shift(-1)].to_csv('incorrect_prices.csv')
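For reference, here is a self-contained sketch of the diff-based check on the example input. Passing fill_value=False to shift() is my own addition so that the mask stays boolean; the names `ordered`, `drop`, `flagged` and `bad_ids` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1] * 5 + [2] * 5,
    "zone": [1, 2, 3, 4, 5] * 2,
    "price": [33.0, 24.0, 34.0, 45.0, 51.0, 20.0, 24.0, 34.0, 45.0, 51.0],
})

# Sort so that zones increase within each id, then diff the prices per id.
ordered = df.sort_values(["id", "zone"])
drop = ordered.groupby("id")["price"].diff().le(0)

# Flag the offending row and the row just before it.
# The first row of each id is never flagged (its diff is NaN),
# so shifting the mask cannot leak across ids.
flagged = ordered[drop | drop.shift(-1, fill_value=False)]
bad_ids = flagged["id"].unique()
print(flagged)

# To write the result out: flagged.to_csv("incorrect_prices.csv", index=False)
```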

Another way would be to first sort your dataframe by id and zone in ascending order, then compare each price with the previous price of the same id using groupby.shift(), storing the result in a new column. You can then print the rows whose price has fallen:
import numpy as np
import pandas as pd
df = df.sort_values(by=['id','zone'],ascending=True)  # sort_values returns a new frame, so assign it back
df['increase'] = np.where(df.zone.eq(1),'no change',
np.where(df.groupby('id')['price'].shift(1) < df['price'],'inc','dec'))
>>> df
id zone price increase
0 1 1 33 no change
1 1 2 24 dec
2 1 3 34 inc
3 1 4 45 inc
4 1 5 51 inc
5 2 1 34 no change
6 2 2 56 inc
7 2 3 22 dec
8 2 4 55 inc
9 2 5 77 inc
10 3 1 44 no change
11 3 2 55 inc
12 3 3 44 dec
13 3 4 66 inc
14 3 5 33 dec
>>> df.loc[df.increase.eq('dec')]
id zone price increase
1 1 2 24 dec
7 2 3 22 dec
12 3 3 44 dec
14 3 5 33 dec
I have added some extra IDs to try and mimic your real data.

How do I aggregate rows with an upper bound on column value?

I have a pd.DataFrame I'd like to transform:
id values days time value_per_day
0 1 15 15 1 1
1 1 20 5 2 4
2 1 12 12 3 1
I'd like to aggregate these into equal buckets of 10 days. Since days at time 1 is larger than 10, the remainder should spill into the next bucket, so that the value/day of the 2nd bucket is the average of the 1st and the 2nd rows.
Here is the resulting output, where (values, 0) = 15*(10/15) = 10 and (values, 1) = 5 + 20 = 25 (the 5 left over from the 1st row plus the 20 of the 2nd row):
id values days value_per_day
0 1 10 10 1.0
1 1 25 10 2.5
2 1 10 10 1.0
3 1 2 2 1.0
I've tried pd.Grouper:
df.set_index('days').groupby([pd.Grouper(freq='10D', label='right'), 'id']).agg({'values': 'mean'})
Out[146]:
values
days id
5 days 1 16
15 days 1 10
But I'm clearly using it incorrectly.
csv for convenience:
id,values,days,time
1,15,15,1
1,20,5,2
1,12,12,3

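Note: this solution is costly in time and memory, because it expands the frame to one row per day.
import numpy as np
newdf=df.reindex(df.index.repeat(df.days))  # one row per day
v=np.arange(df.days.sum())//10  # 10-day bucket number of each day
dd=pd.DataFrame({'value_per_day': newdf.groupby(v).value_per_day.mean(),'days':np.bincount(v)})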
dd
Out[102]:
days value_per_day
0 10 1.0
1 10 2.5
2 10 1.0
3 2 1.0
dd.assign(value=dd.days*dd.value_per_day)
Out[103]:
days value_per_day value
0 10 1.0 10.0
1 10 2.5 25.0
2 10 1.0 10.0
3 2 1.0 2.0
I did not include a groupby on id here. If you need that for your real data, you can loop over df.groupby('id') and apply the steps above within the loop.
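The per-id loop mentioned above could look roughly like this. It is a sketch only: `bucket_by_days` is a hypothetical helper name, and value_per_day is assumed to be values / days as in the question's table.

```python
import numpy as np
import pandas as pd

def bucket_by_days(group: pd.DataFrame, size: int = 10) -> pd.DataFrame:
    """Expand one id to one row per day, then average per bucket of `size` days."""
    daily = group.loc[group.index.repeat(group["days"])]
    bucket = np.arange(len(daily)) // size  # bucket number of each day
    out = daily.groupby(bucket).agg(
        days=("value_per_day", "size"),
        value_per_day=("value_per_day", "mean"),
    )
    out["values"] = out["days"] * out["value_per_day"]
    return out

df = pd.DataFrame({
    "id": [1, 1, 1],
    "values": [15, 20, 12],
    "days": [15, 5, 12],
    "time": [1, 2, 3],
})
df["value_per_day"] = df["values"] / df["days"]

# Run the helper once per id and stack the pieces under an (id, bucket) index.
result = pd.concat(
    {key: bucket_by_days(group) for key, group in df.groupby("id")},
    names=["id", "bucket"],
)
print(result)
```

The expansion step still creates one row per day, so the cost warning above applies to every id.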

Change Cells in Pandas DataFrame Based on Conditional Slices

I'm playing around with the Titanic dataset, and what I'd like to do is fill in all the NaN/Null values of the Age column with the median value based on that Pclass.
Here is some data:
train
PassengerId Pclass Age
0 1 3 22
1 2 1 35
2 3 3 26
3 4 1 35
4 5 3 35
5 6 1 NaN
6 7 1 54
7 8 3 2
8 9 3 27
9 10 2 14
10 11 1 NaN
Here is what I would like to end up with:
PassengerId Pclass Age
0 1 3 22
1 2 1 35
2 3 3 26
3 4 1 35
4 5 3 35
5 6 1 35
6 7 1 54
7 8 3 2
8 9 3 27
9 10 2 14
10 11 1 35
The first thing I came up with is this - In the interest of brevity I have only included one slice for Pclass equal to 1 rather than including 2 and 3:
Pclass_1 = train['Pclass']==1
train[Pclass_1]['Age'].fillna(train[train['Pclass']==1]['Age'].median(), inplace=True)
As far as I understand, this method creates a view rather than editing train itself (I don't quite understand how this is different from a copy, or if they are analogous in terms of memory -- that is an aside I would love to hear about if possible). I particularly like this Q/A on the topic View vs Copy, How Do I Tell? but it doesn't include the insight I'm looking for.
Looking through Pandas docs I learned why you want to use .loc to avoid this pitfall. However I just can't seem to get the syntax right.
Pclass_1 = train.loc[:,['Pclass']==1]
Pclass_1.Age.fillna(train[train['Pclass']==1]['Age'].median(),inplace=True)
I'm getting lost in indices. This one ends up looking for a column named False which obviously doesn't exist. I don't know how to do this without chained indexing. train.loc[:,train['Pclass']==1] returns an exception IndexingError: Unalignable boolean Series key provided.

In this part of the line,
train.loc[:,['Pclass']==1]
the part ['Pclass'] == 1 is comparing the list ['Pclass'] to the value 1, which returns False. The .loc[] is then evaluated as .loc[:,False] which is causing the error.
I think you mean:
train.loc[train['Pclass']==1]
which selects all of the rows where Pclass is 1. This fixes the error, but it will still give you the "SettingWithCopyWarning".
EDIT 1
(old code removed)
Here is an approach that uses groupby with transform to create a Series containing the median Age for each Pclass. The Series is then used as the argument to fillna() to replace the missing values with the median. Using this approach will correct all passenger classes at the same time, which is what the OP originally requested. The solution comes from the answer to Python-pandas Replace NA with the median or mean of a group in dataframe
import pandas as pd
from io import StringIO
tbl = """PassengerId Pclass Age
0 1 3 22
1 2 1 35
2 3 3 26
3 4 1 35
4 5 3 35
5 6 1
6 7 1 54
7 8 3 2
8 9 3 27
9 10 2 14
10 11 1
"""
train = pd.read_table(StringIO(tbl), sep=r'\s+')  # raw string avoids an invalid-escape warning
print('Original:\n', train)
median_age = train.groupby('Pclass')['Age'].transform('median') #median Ages for all groups
train['Age'].fillna(median_age, inplace=True)
print('\nNaNs replaced with median:\n', train)
The code produces:
Original:
PassengerId Pclass Age
0 1 3 22.0
1 2 1 35.0
2 3 3 26.0
3 4 1 35.0
4 5 3 35.0
5 6 1 NaN
6 7 1 54.0
7 8 3 2.0
8 9 3 27.0
9 10 2 14.0
10 11 1 NaN
NaNs replaced with median:
PassengerId Pclass Age
0 1 3 22.0
1 2 1 35.0
2 3 3 26.0
3 4 1 35.0
4 5 3 35.0
5 6 1 35.0
6 7 1 54.0
7 8 3 2.0
8 9 3 27.0
9 10 2 14.0
10 11 1 35.0
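One thing to note is that this line, which uses inplace=True:
train['Age'].fillna(median_age, inplace=True)
can be replaced with assignment using .loc:
train.loc[:,'Age'] = train['Age'].fillna(median_age)
Recent pandas versions warn about calling an inplace method on a selected column like this, because the selection may behave as a copy and leave train unchanged, so the assignment form is the safer of the two.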
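To tie this back to the .loc syntax the question was struggling with, here is a small sketch showing both variants on the sample data: a single-class fill with one .loc call (row mask plus column label), and the all-classes fill by plain assignment. The names `one_class`, `in_first` and `class_median` are illustrative.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "PassengerId": range(1, 12),
    "Pclass": [3, 1, 3, 1, 3, 1, 1, 3, 3, 2, 1],
    "Age": [22, 35, 26, 35, 35, np.nan, 54, 2, 27, 14, np.nan],
})

# Single class, as attempted in the question: select rows and column in ONE .loc call.
one_class = train.copy()
in_first = one_class["Pclass"].eq(1)
one_class.loc[in_first & one_class["Age"].isna(), "Age"] = one_class.loc[in_first, "Age"].median()

# All classes at once: per-row median of the row's own Pclass, then plain assignment.
class_median = train.groupby("Pclass")["Age"].transform("median")
train["Age"] = train["Age"].fillna(class_median)
print(train)
```

Both variants write through a single indexing step on the frame itself, which is what avoids the chained-indexing problem.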

DataFrame: fillna() with running sum of valid values

I'm working with a pandas DataFrame that looks like this:
0 Data
1
2
3
4 5
5
6
7
8 21
9
10 2
11
12
13
14
15
I'm trying to fill the blanks with the next valid value using df.fillna(method='backfill'). This works, but I also need to add each valid value to the next valid values below it, accumulating from the bottom up, such as:
0 Data
1 28
2 28
3 28
4 28
5 23
6 23
7 23
8 23
9 2
10 2
11
12
13
14
15
I can get this to work by looping over it, but is there a method within pandas that can do this?
Thanks a lot!

You could reverse the df, then fillna(0) and then cumsum and reverse again:
In [12]:
df = df[::-1].fillna(0).cumsum()[::-1]
df
Out[12]:
Data
0 28.0
1 28.0
2 28.0
3 28.0
4 23.0
5 23.0
6 23.0
7 23.0
8 2.0
9 2.0
10 0.0
11 0.0
12 0.0
13 0.0
14 0.0
Here we use slicing notation to reverse the df, then replace all NaN with 0, perform the cumsum and reverse back.

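Another simple way to do this, without reversing: df.sum()-df.fillna(0).cumsum().shift(fill_value=0). The shift is needed so that each valid value is still counted on its own row; without it the rows holding 5, 21 and 2 would show 23, 2 and 0.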
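Below is a runnable sketch of both approaches on a Series shaped like the answer's output (15 rows, valid values at positions 3, 7 and 9). The final where() step is my own addition, for the case where the trailing rows should stay blank as in the question rather than become 0.

```python
import numpy as np
import pandas as pd

data = [np.nan] * 15
data[3], data[7], data[9] = 5, 21, 2
s = pd.Series(data, name="Data")

# Reverse, treat blanks as 0, accumulate, and reverse back.
suffix_sum = s[::-1].fillna(0).cumsum()[::-1]

# Same numbers without reversing: total minus everything strictly above the row.
alt = s.sum() - s.fillna(0).cumsum().shift(fill_value=0)

# Optional: keep rows after the last valid value blank instead of 0.
result = suffix_sum.where(s.bfill().notna())
print(result)
```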
