Iterating and averaging pandas data frame - python

I have a database with a lot of rows such as:
timestamp  name  price  profit
           bob       5       4
           jim       3       2
           jim       2       6
           bob       6       7
           jim       4       1
           jim       6       3
           bob       3       1
The database is sorted by timestamp. I would like to be able to add a new column where it would take the last 2 values in the price column before the current value and average them into a new column. So the first three rows would look something like this with the new column:
timestamp  name  price  profit  new column
           bob       5       4         4.5
           jim       3       2         3
           jim       2       6         5
(6+3)/2 = 4.5
(2+4)/2 = 3
(4+6)/2 = 5
This isn't for a school project or anything; it's just something I'm working on on my own. I've tried asking a similar question before, but I don't think I was very clear. Thanks in advance!

def shift_n_roll(df):
    return df.shift(-1).rolling(2).mean().shift(-1)

df['new column'] = df.groupby('name').price.apply(shift_n_roll)
df

Looking at the result you want, I'm guessing you want the average of the two prices following the current one, rather than the "2 values in the price column before the current value".
I made up the timestamp values that you omitted, to be clear.
print(df)
    timestamp name  price  profit
0  2016-01-01  bob      5       4
1  2016-01-02  jim      3       2
2  2016-01-03  jim      2       6
3  2016-01-04  bob      6       7
4  2016-01-05  jim      4       1
5  2016-01-06  jim      6       3
6  2016-01-07  bob      3       1
#No need to sort if you already did.
#df.sort_values(['name','timestamp'], inplace=True)
df['new column'] = (df.groupby('name')['price'].shift(-1) + df.groupby('name')['price'].shift(-2)) / 2
print(df.dropna())
    timestamp name  price  profit  new column
0  2016-01-01  bob      5       4         4.5
1  2016-01-02  jim      3       2         3.0
2  2016-01-03  jim      2       6         5.0
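A transform-based variant keeps the result explicitly aligned with the original index, which is the safer idiom on recent pandas. A minimal self-contained sketch (assuming pandas >= 1.0):

import pandas as pd

df = pd.DataFrame({
    "name":   ["bob", "jim", "jim", "bob", "jim", "jim", "bob"],
    "price":  [5, 3, 2, 6, 4, 6, 3],
    "profit": [4, 2, 6, 7, 1, 3, 1],
})

# Mean of the two prices that follow the current row, per name.
df["new column"] = df.groupby("name")["price"].transform(
    lambda s: s.shift(-1).rolling(2).mean().shift(-1)
)
print(df.dropna())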

How to assign identical random IDs conditionally to "related" rows in pandas?

New to Python, I'm struggling with the problem of assigning random IDs to "related" rows,
where the relation is simply their proximity in time (within 14 days), grouped by user. In this example I chose uuid without any specific intention; it could be any other random ID uniquely identifying conceptually related rows.
import pandas as pd
import uuid
import numpy as np
Here is a dummy dataframe:
dummy_df = pd.DataFrame({"transactionid": [1, 2, 3, 4, 5, 6, 7, 8],
                         "user": ["michael",
                                  "michael",
                                  "michael",
                                  "tom",
                                  "tom",
                                  "tom",
                                  "tom",
                                  "tom"],
                         "transactiontime": pd.to_datetime(["2022-01-01",
                                                            "2022-01-02",
                                                            "2022-01-03",
                                                            "2022-09-01",
                                                            "2022-09-13",
                                                            "2022-10-17",
                                                            "2022-10-20",
                                                            "2022-11-17"])})
dummy_df.head(10)
transactionid user transactiontime
0 1 michael 2022-01-01
1 2 michael 2022-01-02
2 3 michael 2022-01-03
3 4 tom 2022-09-01
4 5 tom 2022-09-13
5 6 tom 2022-10-17
6 7 tom 2022-10-20
7 8 tom 2022-11-17
Here I sort transactions and calculate their difference in days:
dummy_df = dummy_df.assign(
    timediff = dummy_df
        .sort_values('transactiontime')
        .groupby(["user"])['transactiontime'].diff() / np.timedelta64(1, 'D')
).fillna(0)
dummy_df.head(10)
transactionid user transactiontime timediff
0 1 michael 2022-01-01 0.0
1 2 michael 2022-01-02 1.0
2 3 michael 2022-01-03 1.0
3 4 tom 2022-09-01 0.0
4 5 tom 2022-09-13 12.0
5 6 tom 2022-10-17 34.0
6 7 tom 2022-10-20 3.0
7 8 tom 2022-11-17 28.0
Here I create a new column with a random ID for each related transaction, though it does not work as expected:
dummy_df.assign(related_transaction = np.where(
    (dummy_df.timediff >= 0) & (dummy_df.timediff < 15),
    uuid.uuid4(),
    dummy_df.transactionid))
transactionid user transactiontime timediff related_transaction
0 1 michael 2022-01-01 0.0 fd630f07-6564-4773-aff9-44ecb1e4211d
1 2 michael 2022-01-02 1.0 fd630f07-6564-4773-aff9-44ecb1e4211d
2 3 michael 2022-01-03 1.0 fd630f07-6564-4773-aff9-44ecb1e4211d
3 4 tom 2022-09-01 0.0 fd630f07-6564-4773-aff9-44ecb1e4211d
4 5 tom 2022-09-13 12.0 fd630f07-6564-4773-aff9-44ecb1e4211d
5 6 tom 2022-10-17 34.0 6
6 7 tom 2022-10-20 3.0 fd630f07-6564-4773-aff9-44ecb1e4211d
7 8 tom 2022-11-17 28.0 8
What I would expect is something like this, given that within each user group the difference between transactions is within 14 days:
transactionid user transactiontime timediff related_transaction
0 1 michael 2022-01-01 0.0 ad2a8f23-05a5-49b1-b45e-cbf3f0ba23ff
1 2 michael 2022-01-02 1.0 ad2a8f23-05a5-49b1-b45e-cbf3f0ba23ff
2 3 michael 2022-01-03 1.0 ad2a8f23-05a5-49b1-b45e-cbf3f0ba23ff
3 4 tom 2022-09-01 0.0 b1da2251-7770-4756-8863-c82f90657542
4 5 tom 2022-09-13 12.0 b1da2251-7770-4756-8863-c82f90657542
5 6 tom 2022-10-17 34.0 485a8d97-80d1-4184-8fc8-99523f471527
6 7 tom 2022-10-20 3.0 485a8d97-80d1-4184-8fc8-99523f471527
7 8 tom 2022-11-17 28.0 8
Taking the idea from Luise, we start with an empty column for related_transaction. Then we iterate through each row. For each date, we check whether it is already part of a transaction; if so, we continue. Otherwise, we assign a new transaction ID to that date and to all other dates within the 15 following days for the same user:
import datetime

df = dummy_df
df['related_transaction'] = None
for i, row in dummy_df.iterrows():
    if df.loc[i].related_transaction is not None:
        # We already assigned that row
        continue
    df.loc[  # Select where:
        (df.transactiontime <= row.transactiontime + datetime.timedelta(days=15)) &  # Current row + 15 days
        (df.user == row.user) &  # Same user
        (pd.isna(df.related_transaction)),  # Don't overwrite anything already assigned
        'related_transaction'  # Set this column to:
    ] = uuid.uuid4()  # Assign new UUID
This gives the output:
transactionid user transactiontime related_transaction
0 1 michael 2022-01-01 82d28e10-149b-481e-ba41-f5833662ba99
1 2 michael 2022-01-02 82d28e10-149b-481e-ba41-f5833662ba99
2 3 michael 2022-01-03 82d28e10-149b-481e-ba41-f5833662ba99
3 4 tom 2022-09-01 fa253663-8615-419a-afda-7646906024f0
4 5 tom 2022-09-13 fa253663-8615-419a-afda-7646906024f0
5 6 tom 2022-10-17 d6152d4b-1560-40e0-8589-bd8e3da363db
6 7 tom 2022-10-20 d6152d4b-1560-40e0-8589-bd8e3da363db
7 8 tom 2022-11-17 2a93d78d-b6f6-4f0f-bb09-1bc18361aa21
In your example the dates are already sorted; that's an important assumption I'm making here!
The mismatch between your code and your desired result is that uuid.uuid4() creates an ID a single time and assigns it to all the relevant rows selected by np.where(). Instead, you need to generate the IDs in a vectorized way.
Try the following approach:
df.loc[ROW_CONDITIONS, COLUMNS] = VECTORIZED_ID_GENERATOR
which for your example would be
dummy_df.loc[
    (dummy_df['timediff'] >= 0) & (dummy_df['timediff'] < 15),
    'related_transaction'
] = dummy_df.apply(lambda _: uuid.uuid4(), axis=1)
Take into account that this only answers your question of how to conditionally assign random IDs with uuid in pandas. It looks to me that you also need to generate the same ID for the same user for transactions within every 15 days. My advice for that would be to generate a dataframe where every row is a combination of two transactions, and to add a condition saying that the users of both transactions need to be the same.
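For completeness, a vectorized sketch of that grouping idea (not from the original answers): start a new group whenever the gap to the previous transaction of the same user exceeds 14 days, then draw one uuid4 per group. Note that singleton transactions get their own UUID here rather than keeping their transactionid, so adjust if you need that exact behavior.

import uuid

# Days since the user's previous transaction (NaN for each user's first row).
gap = dummy_df.groupby("user")["transactiontime"].diff().dt.days

# A new group starts at each user's first row, or after a gap of more than 14 days.
group_id = (gap.isna() | (gap > 14)).cumsum()

# One fresh UUID per group, broadcast back to the rows.
dummy_df["related_transaction"] = group_id.map(
    {g: str(uuid.uuid4()) for g in group_id.unique()}
)

Like the loop above, this assumes the frame is sorted by transactiontime within each user.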

Add or Subtract two columns in a dataframe on basis of column?

I have a df which has three columns: name, amount and type.
I'm trying to add or subtract values for each user on the basis of type.
Here's my sample df
name amount type
0 John 10 ADD
1 John 20 ADD
2 John 50 ADD
3 John 50 SUBRACT
4 Adam 15 ADD
5 Adam 25 ADD
6 Adam 5 ADD
7 Adam 30 SUBRACT
8 Mary 100 ADD
My resultant df
name amount
0 John 30
1 Adam 15
2 Mary 100
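For reference, a minimal sketch to reproduce this sample frame (the SUBRACT spelling is kept exactly as it appears in the data):

import pandas as pd

df = pd.DataFrame({
    "name":   ["John"] * 4 + ["Adam"] * 4 + ["Mary"],
    "amount": [10, 20, 50, 50, 15, 25, 5, 30, 100],
    "type":   ["ADD", "ADD", "ADD", "SUBRACT",
               "ADD", "ADD", "ADD", "SUBRACT", "ADD"],  # 'SUBRACT' as spelled in the data
})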
The idea is to multiply by 1 if ADD and by -1 if SUBRACT via the type column, and then aggregate by sum:
df1 = (df['amount'].mul(df['type'].map({'ADD':1, 'SUBRACT':-1}))
                   .groupby(df['name'], sort=False)
                   .sum()
                   .reset_index(name='amount'))
print(df1)
name amount
0 John 30
1 Adam 15
2 Mary 100
Detail:
print(df['type'].map({'ADD':1, 'SUBRACT':-1}))
0 1
1 1
2 1
3 -1
4 1
5 1
6 1
7 -1
8 1
Name: type, dtype: int64
It is also possible to flag only the SUBRACT values with numpy.where, multiplying those by -1 and all others by 1:
df1 = (df['amount'].mul(np.where(df['type'].eq('SUBRACT'), -1, 1))
                   .groupby(df['name'], sort=False)
                   .sum()
                   .reset_index(name='amount'))
print(df1)
name amount
0 John 30
1 Adam 15
2 Mary 100
One idea could be to use Series.where to change the sign of amount accordingly and then groupby.sum:
df.amount.where(df.type.eq('ADD'), -df.amount).groupby(df.name).sum().reset_index()
name amount
0 Adam 15
1 John 30
2 Mary 100

Disproportionate stratified sampling in Pandas

How can I randomly select one row from each group (column Name) in the following dataframe:
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
Expected result:
Distance Name Time Order
4 31 John 9 1
0 23 Kate 3 0
2 32 Peter 2 0
You can use a groupby on the Name column and apply sample:
df.groupby('Name',as_index=False).apply(lambda x:x.sample()).reset_index(drop=True)
Distance Name Time Order
0 31 John 9 1
1 15 Kate 7 1
2 32 Peter 2 0
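On pandas 1.1 or newer there is also a built-in per-group sampler that does this in one call; a sketch, assuming that version:

# One random row per Name group; pass random_state=... for reproducibility.
df.groupby('Name').sample(n=1)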
You can shuffle all rows using, for example, the numpy function random.permutation, then group by Name and take the first N rows of each group:
df.iloc[np.random.permutation(len(df))].groupby('Name').head(1)
You can achieve that by keeping one row per unique Name: shuffle the dataframe,
df.sample(frac=1)
and then drop duplicated rows:
df.sample(frac=1).drop_duplicates(subset='Name')
Note that df.drop_duplicates(subset='Name') on its own keeps the first row per name, so without the shuffle it is not a random choice:
   Distance   Name  Time  Order
1        16   John     5      0
0        23   Kate     3      0
2        32  Peter     2      0
How about using the random module? Like this:
Import your provided data,
df=pd.read_csv('random_data.csv', header=0)
which looks like this,
Distance Name Time Order
1 16 John 5 0
4 3 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
then get a random column name (this needs the standard random module),
import random
colname = df.columns[random.randint(1, 3)]
and here it happened to select 'Name':
print(df[colname])
1 John
4 John
0 Kate
3 Kate
Name: Name, dtype: object
Of course, I could have condensed this to:
print(df[df.columns[random.randint(1, 3)]])

How can we fill the empty values in the column?

I have table A with 3 columns. The column val has some empty values. The question is: is there any way to fill the empty values based on the previous value using Python? For example, Alex and John take value 20 and Sam takes value 100.
A = [ id  name    val
      1   jack     10
      2   mec      20
      3   alex
      4   john
      5   sam     250
      6   tom     100
      7   sam
      8   hellen  300 ]
You can read your data into a pandas DataFrame and use the built-in function fillna() to solve your problem. For example,
df = ...  # your data
df.fillna(method='pad')
Would return a dataframe like,
id name val
0 1 jack 10
1 2 mec 20
2 3 alex 20
3 4 john 20
4 5 sam 250
5 6 tom 100
6 7 sam 100
7 8 hellen 300
You can refer to the pandas fillna documentation for more information.
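Note that fillna(method='pad') is deprecated in recent pandas (2.x); DataFrame.ffill is the equivalent forward fill. A minimal self-contained sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 2, 3, 4, 5, 6, 7, 8],
    "name": ["jack", "mec", "alex", "john", "sam", "tom", "sam", "hellen"],
    "val":  [10, 20, np.nan, np.nan, 250, 100, np.nan, 300],
})

# Each missing val takes the last non-missing value above it.
df["val"] = df["val"].ffill()
print(df)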

how to create a row_number based on some conditions in pandas

I have a data frame like this:
Clinic Number date
0 1 2015-05-05
1 1 2015-05-05
2 1 2016-01-01
3 2 2015-05-05
4 2 2016-05-05
5 3 2017-05-05
6 3 2017-05-05
I want to create a new column and fill it based on some conditions. So the new data frame should look like this:
Clinic Number date row_number
0 1 2015-05-05 1
1 1 2015-05-05 1
2 1 2016-01-01 2
3 2 2015-05-05 3
4 2 2016-05-05 4
5 3 2017-05-05 5
6 3 2017-05-05 5
The rule for filling the new column:
where Clinic Number and date are the same, rows get the same number; when either changes, the number increases.
For example, here 1 2015-05-05 has two rows with the same Clinic Number and date, so they both get 1. The next row has Clinic Number=1 but a different date, so it gets 2.
Where Clinic Number=2 there is no other row with Clinic Number=2 and the same date, so it gets 3, and the next row gets 4...
So far I have tried something like this:
def createnumber(x):
    x['row_number'] = i

d['row_number'] = pd1.groupby(['Clinic Number','date']).apply(createnumber)
but I do not know how to implement this function.
I'd appreciate it if you can help me with this :)
I have also looked at similar questions, but they are not dynamic (I mean, here the row number should increase based on the conditions above).
Instead of a groupby, you could just do something like this, naming your conditions separately: if the date shifts OR the Clinic Number changes, you return True, and then take the cumsum of those True values:
df['row_number'] = (df.date.ne(df.date.shift()) | df['Clinic Number'].ne(df['Clinic Number'].shift())).cumsum()
>>> df
Clinic Number date row_number
0 1 2015-05-05 1
1 1 2015-05-05 1
2 1 2016-01-01 2
3 2 2015-05-05 3
4 2 2016-05-05 4
5 3 2017-05-05 5
You'll need to make sure your dataframe is sorted by Clinic Number and date first (you could do df.sort_values(['Clinic Number', 'date'], inplace=True) if it's not sorted already).
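Since the numbering is really "the index of the (Clinic Number, date) group in order of appearance", groupby.ngroup is an equivalent one-liner; a sketch, assuming the frame is sorted as above:

# ngroup() numbers groups 0, 1, 2, ... in order of appearance (sort=False).
df['row_number'] = df.groupby(['Clinic Number', 'date'], sort=False).ngroup() + 1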
