Randomly selecting rows from dataframe column by date - python

For a given dataframe column, I would like to randomly select roughly 60% of the values within each day and add them to a new column, add the remaining 40% to another column, multiply the 40% column by -1, and create a new column that merges these back together for each day (so that each day I end up with a 60/40 ratio):
I have asked the same question without the daily specification here: Randomly selecting rows from dataframe column
The example below illustrates this (although the ratio there is not exactly 60/40):
import pandas as pd
import numpy as np

# starting point
dict0 = {'date': ['1/1/2019','1/1/2019','1/1/2019','1/2/2019','1/1/2019','1/2/2019'], 'x1': [1,2,3,4,5,6]}
df = pd.DataFrame(dict0)
df['date'] = pd.to_datetime(df['date']).dt.date

# step 1: per day, ~60% of x1 goes to x2 and the rest to x3
dict1 = {'date': ['1/1/2019','1/1/2019','1/1/2019','1/2/2019','1/1/2019','1/2/2019'], 'x1': [1,2,3,4,5,6], 'x2': [1,np.nan,3,np.nan,5,6], 'x3': [np.nan,2,np.nan,4,np.nan,np.nan]}
df = pd.DataFrame(dict1)
df['date'] = pd.to_datetime(df['date']).dt.date

# step 2: x3 multiplied by -1
dict2 = {'date': ['1/1/2019','1/1/2019','1/1/2019','1/2/2019','1/1/2019','1/2/2019'], 'x1': [1,2,3,4,5,6], 'x2': [1,np.nan,3,np.nan,5,6], 'x3': [np.nan,-2,np.nan,-4,np.nan,np.nan]}
df = pd.DataFrame(dict2)
df['date'] = pd.to_datetime(df['date']).dt.date

# step 3: x2 and x3 merged back together into x4
dict3 = {'date': ['1/1/2019','1/1/2019','1/1/2019','1/2/2019','1/1/2019','1/2/2019'], 'x1': [1,2,3,4,5,6], 'x2': [1,np.nan,3,np.nan,5,6], 'x3': [np.nan,-2,np.nan,-4,np.nan,np.nan], 'x4': [1,-2,3,-4,5,6]}
df = pd.DataFrame(dict3)
df['date'] = pd.to_datetime(df['date']).dt.date

You can use groupby and sample to get the index values of the ~60% selection per day, then create column x4 with loc, and fill the NaNs with the column multiplied by -1:
idx = df.groupby('date').apply(lambda x: x.sample(frac=0.6)).index.get_level_values(1)
df.loc[idx, 'x4'] = df.loc[idx, 'x1']
df['x4'] = df['x4'].fillna(-df['x1'])
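Note that sample also accepts a random_state argument, so the 60/40 split can be made reproducible across runs if needed.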

Related

pandas groupby column with rolling mean, limited between datetimes, without iterating over each row

I have data in a dataframe as follows:
import pandas
import numpy

ROWS = 1000
df = pandas.DataFrame()
df['DaT'] = pandas.date_range('2000-1-1', periods=ROWS, freq='H')
df['cat'] = numpy.random.choice(['a','b','c'], size=ROWS)
df['val'] = numpy.random.randint(2, size=ROWS)
df['r10'] = df.groupby(['cat'])['val'].apply(lambda x: x.rolling(10).mean())
I need to calculate a column that is grouped by the category column 'cat' and is a rolling (10-period) mean of the 'val' column, with the restriction that the rolling mean for a given row cannot include values from the day that row occurs on.
The desired result ('wanted') can be generated as follows:
df['wanted'] = numpy.nan
for idx, row in df.iterrows():
    Rdate = row['DaT'].normalize()
    Rcat = row['cat']
    try:
        df.loc[idx, 'wanted'] = df[(df['DaT'] < Rdate) & (df['cat'] == Rcat)]['val'].rolling(10).mean().iloc[-1]
    except:
        df.loc[idx, 'wanted'] = numpy.nan
The above is an awful solution, but it gets the result. It is very slow for the 100,000+ rows it needs to go through. Is there a more elegant solution?
I have tried using combinations of shift and even quantize to get a more efficient solution, but no success yet.
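One vectorized approach (a sketch of mine, not from the original thread; the column names day, roll, and wanted_fast are invented here): compute the per-category rolling mean once, take its value at the end of each day, then shift it forward by one day within each category, so a row only ever sees data from before its own day.
df['day'] = df['DaT'].dt.normalize()
# rolling mean over the raw hourly series, per category
df['roll'] = df.groupby('cat')['val'].transform(lambda s: s.rolling(10).mean())
# last rolling value at the end of each (cat, day)
daily = df.groupby(['cat', 'day'])['roll'].last().rename('wanted_fast').reset_index()
# shift within each category: a day only sees the previous day's closing value
daily['wanted_fast'] = daily.groupby('cat')['wanted_fast'].shift(1)
df = df.merge(daily, on=['cat', 'day'], how='left')
This matches the loop's semantics: days with fewer than 10 prior observations, and the first day of each category, come out as NaN.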

Filter each column by having the same value three times or more

I have a dataset that contains dates as an index, and each column is the name of an item with a count as its value. I'm trying to figure out how to filter each column for stretches of more than 3 consecutive days where the count is zero. I was thinking of using a for loop; any help is appreciated. I'm using Python for this project.
I'm fairly new to Python; so far I have tried for loops, but did not get them to work in any way:
for i in a.index:
    if a.loc[i, 'name'] == 3 == df.loc[i+1, 'name'] == df.loc[i+2, 'name']:
        print(a.loc[i, 'name'])
This raises: Cannot add integral value to Timestamp without freq.
It would be better if you included a sample dataframe and the desired output in your question; please do so next time. As it stands, I have to guess what your data looks like and may not be answering your question. I assume the values are integers. Does your dataframe have a row for every day? I will assume it might not, and make it so that every day in the last delta days has a row. I created a sample dataframe like this:
import pandas as pd
import numpy as np
import datetime
# Here I am just creating random data from your description
delta = 365
start_date = datetime.datetime.now() - datetime.timedelta(days=delta)
end_date = datetime.datetime.now()
datetimes = [end_date - diff for diff in [datetime.timedelta(days=i) for i in range(delta,0,-1)]]
# This is the list of dates we will have in our final dataframe (includes all days)
dates = pd.Series([date.strftime('%Y-%m-%d') for date in datetimes], name='Date', dtype='datetime64[ns]')
# random integer dataframe
df = pd.DataFrame(np.random.randint(0, 5, size=(delta,4)), columns=['item' + str(i) for i in range(4)])
df = pd.concat([df, dates], axis=1).set_index('Date')
# Create a missing day
df = df.drop(df.loc['2019-08-01'].name)
# Reindex so that index has all consecutive days
df = df.reindex(index=dates)
Now that we have a sample dataframe, the rest is straightforward. I check whether each value in the dataframe equals 0 and then do a rolling sum with a window of 4 (> 3); this way I can avoid for loops. The resulting dataframe has all the rows where at least one of the items had a value of 0 for 4 consecutive days. If there is a 0 for more than 4 consecutive rows, it will show up as multiple rows whose dates are one day apart. I hope that makes sense.
# custom function as I want "np.nan" returned if a value does not equal "test_value"
def equals(df_value, test_value=0):
    return 1 if df_value == test_value else np.nan

# apply the function to every value in the dataframe,
# then for each row calculate the sum of the four rows ending there (> 3)
df = df.applymap(equals).rolling(window=4).sum()
# if there was np.nan in the window, the sum is np.nan, so it can be dropped
# keep the rows where there is at least 1 value
df = df.dropna(thresh=1)
# drop all columns that don't have any values
df = df.dropna(thresh=1, axis=1)
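The same check can also be written without the helper function, as a boolean mask (a sketch of mine, assuming df is still the reindexed frame from above, before the applymap line):
# True where the 4 rows ending at this date (inclusive) are all zero
mask = df.eq(0).rolling(window=4).sum().eq(4)
# keep rows where at least one item just completed a 4-day run of zeros
runs = df[mask.any(axis=1)]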

Aggregation sum Pandas

I have the following dataframe
How can I aggregate the number of tickets (summing) for every month?
I tried:
df_res[df_res["type"]=="other"].groupby(["type","date"])["n_tickets"].sum()
The date column is of object dtype.
You need to assign the filtered rows to a new DataFrame first, so that the Series created by Series.dt.month has the same size:
#if necessary, convert to datetimes first
df_res['date'] = pd.to_datetime(df_res['date'])
df = df_res[df_res["type"]=="pax"]
#type is the same for every row after filtering, so it can be omitted
out = df.groupby(df["date"].dt.month)["n_tickets"].sum()
#if you need a column with the same value 'pax':
#out = df.groupby(['type', df["date"].dt.month])["n_tickets"].sum()
If you want to group by pax versus non-pax:
import numpy as np
types = np.where(df_res["type"]=="pax", 'pax', 'no pax')
df_res.groupby([types, df_res["date"].dt.month])["n_tickets"].sum()
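Note that dt.month folds the same month from different years together. If the data spans more than one year, grouping by a monthly period keeps the years apart (a sketch, not part of the original answer):
out = df.groupby(df['date'].dt.to_period('M'))['n_tickets'].sum()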

Python setting a value in one dataframe based on contiguous dates ("intervals") in a "lookup" dataframe

If I have a dataframe with a value for each date "interval", and another dataframe of consecutive dates, how can I set a value in the second dataframe based on the date interval it falls into in the first dataframe?
# first dataframe (the "lookup", if you will)
df1 = pd.DataFrame(np.random.random((10, 1)))
df1['date'] = pd.date_range('2017-1-1', periods=10, freq='10D')
# second dataframe
df2 = pd.DataFrame(np.arange(0,100))
df2['date'] = pd.date_range('2016-12-29', periods=100, freq='D')
So if a df2 date is greater than or equal to a df1 date and less than the next date in df1, we would say something like:
df2['multiplier'] = df1[0], for the element that fits within those dates.
I'm also not sure how the upper boundary should be handled, i.e. if a df2 date is greater than the greatest date in df1, it should get the last value in df1.
This feels dirty, so with apologies to the arts of element-wise operations, here's my go at it.
# create an "end date" second column by shifting the date
df1['end_date'] = df1['date'].shift(-1) + pd.DateOffset(-1)
# create a simple list by nested iteration
multiplier = []
for elem, row in df2.iterrows():
if row['date'] < min(df1['date']):
# kinda don't care about this instance
multiplier.append(0)
elif row['date'] < max(df1['date']):
tmp_mult = df1[(df1['date'] <= row['date']) & (row['date'] <= df1['end_date'])][0].values[0]
multiplier.append(tmp_mult)
# for l_elem, l_row in df1.iterrows():
# if l_row.date <= row['date'] <= l_row.end_date:
# multiplier.append(l_row[0])
else:
multiplier.append(df1.loc[df1.index.max(), 0])
# set the list as a new column in the dataframe
df2['multiplier'] = multiplier
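A vectorized alternative (a sketch I'm adding, not from the original post) is pd.merge_asof, which picks for each df2 date the most recent df1 row whose date is less than or equal to it. Dates past the last df1 date naturally get the last value, and dates before the first get NaN, which can then be filled with 0 to mirror the loop:
df1m = df1[['date', 0]].rename(columns={0: 'multiplier'}).sort_values('date')
df2 = pd.merge_asof(df2.sort_values('date'), df1m, on='date', direction='backward')
# mirror the loop's "kinda don't care" zero for dates before the first interval
df2['multiplier'] = df2['multiplier'].fillna(0)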

Creating an empty Pandas DataFrame column with a fixed first value then filling it with a formula

I'd like to create an empty column in an existing DataFrame, with the first value of that column set to 100. After that I'd like to iterate and fill the rest of the column with a formula, like C[t-1] * (1 + B[t]).
very similar to:
Creating an empty Pandas DataFrame, then filling it?
But the difference is that the first value of column 'C' is fixed to 100, rather than the whole column being formulas.
import datetime
import pandas as pd
import numpy as np
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')
columns = ['A','B','C']
df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0)
data = np.array([np.arange(10)]*3).T
df = pd.DataFrame(data, index=index, columns=columns)
df['B'] = df['A'].pct_change()
df['C'] = df['C'].shift() * (1+df['B'])
## how do I set 2016-10-03 in column 'C' to equal 100 and then calculate consecutively from there?
df
Try this. Unfortunately, something like a for loop is needed, because each new row is calculated from the prior row's value, which has to be carried along as you move down the rows (c_column in my example):
c_column = []
c_column.append(100)
for x, i in enumerate(df['B']):
    if x > 0:
        c_column.append(c_column[x-1] * (1 + i))
df['C'] = c_column
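Since each step only multiplies the previous value by (1 + B[t]), the loop can also be collapsed into a cumulative product (a sketch of mine, equivalent to the loop above):
# the first pct_change value is NaN, so treat it as "no growth"
growth = (1 + df['B']).fillna(1)
df['C'] = 100 * growth.cumprod()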
