I have rows representing ranges (from->to). Here is a subset of the data.
import pandas as pd
from pandas import DataFrame

df = DataFrame({'from': ['2015-08-24','2015-08-24'], 'to': ['2015-08-26','2015-08-31']})
         from          to
0  2015-08-24  2015-08-26
1  2015-08-24  2015-08-31
I want to count, for each business day, how many of these ranges include it. Here is my code.
# Creating a business date index by taking the min and max values from the ranges
b_range = pd.bdate_range(start=min(df['from']), end=max(df['to']))
# Init of a new DataFrame with this index and the count at 0
result = DataFrame(0, index=b_range, columns=['count'])
# Iterating over the ranges to select the index in the result and update the count column
for index, row in df.iterrows():
    result.loc[pd.bdate_range(row['from'], row['to']), 'count'] += 1
print(result)
            count
2015-08-24      2
2015-08-25      2
2015-08-26      2
2015-08-27      1
2015-08-28      1
2015-08-31      1
It works, but does anyone know a more pythonic way of doing this (i.e. without the for loop)?
Caveat: I sort of hate this answer, but on this tiny dataframe it is over 2x faster, so I'll throw it out there as a workable, if not elegant, alternative.
import numpy as np

df2 = df.apply(lambda x: [pd.bdate_range(x['from'], x['to'])], axis=1)
arr = np.unique(np.hstack(df2.values), return_counts=True)
result = pd.DataFrame(arr[1], index=arr[0])
Basically all I'm doing here is making a column with all the dates in it and then using numpy's unique (the analog of pandas' value_counts) to add everything up. I was hoping to come up with something more elegant and readable, but this is what I have at the moment.
Here is a method that uses cumsum(). It should be faster than the for loop if you have a lot of ranges:
import pandas as pd

df = pd.DataFrame({
    'from': ['2015-08-24','2015-08-24'],
    'to': ['2015-08-26','2015-08-31']})
df = df.apply(pd.to_datetime)

from_date = min(df['from'])
to_date = max(df['to'])
b_range = pd.bdate_range(start=from_date, end=to_date)
d_range = pd.date_range(start=from_date, end=to_date)

# +1 on each range start, -1 on the day after each range end,
# then a running total gives the number of open ranges per day
s = pd.Series(0, index=d_range)
from_count = df["from"].value_counts()
to_count = df["to"].value_counts()
(s.add(from_count, fill_value=0)
 .sub(to_count.shift(freq="D"), fill_value=0)
 .cumsum()
 .reindex(b_range))
I was not completely satisfied with these solutions, so I kept searching, and I think I found a rather elegant and fast one.
It's inspired by the section "Pivoting 'long' to 'wide' format" in Wes McKinney's book Python for Data Analysis.
I have put a lot of comments in my code, but I think it's preferable to print out each step to understand it.
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas.tseries.offsets import Day

df = DataFrame({'from': ['2015-08-24','2015-08-24'], 'to': ['2015-08-26','2015-08-31']})
# Convert boundaries to datetime
df['from'] = pd.to_datetime(df['from'], format='%Y-%m-%d')
df['to'] = pd.to_datetime(df['to'], format='%Y-%m-%d')
# Resetting the index to create a row id named 'index'
df = df.reset_index(level=0)
# Pivoting the data to obtain 'from' as row index and the row id ('index') as column,
# each cell containing the 'to' date.
# In consequence each range (from - to pair) is split into its own column.
pivoted = df.pivot(index='from', columns='index', values='to')
# Reindexing the table with a range of business dates (i.e. working days)
pivoted = pivoted.reindex(index=pd.bdate_range(start=min(df['from']),
                                               end=max(df['to'])))
# Filling the NA values forward to copy the 'to' date down:
# now each row of each column contains the corresponding 'to' date
pivoted = pivoted.ffill()
# Computing ('to' + 1 day - index date) for each column and row, converted to days,
# to obtain the number of days between the date in the index and the 'to' date.
# Note: one day is added to include the right side of the interval
pivoted = pivoted.apply(lambda x: (x + Day() - x.index) / np.timedelta64(1, 'D'),
                        axis=0)
# Clipping values lower than 0 (not in the range) to 0
# and values greater than 1 to 1 (at most one count per day and per id)
pivoted = pivoted.clip(lower=0, upper=1)
# Summing along the columns and that's it
pivoted.sum(axis=1)
I have a data frame. Under the same index, I have "early_date" & "late_date", which are of "int" dtype. I want to create additional values between the "early_date" & "late_date" row values and stack the generated values into new rows between them.
Here is how I did it:
import pandas as pd

df = pd.DataFrame({'index': [1,1,2,2,3,3],
                   'variable': ['early_date', 'late_date']*3,
                   'value': [201952,202001,202002,202004,202006,202012]})
# This is what your data looks like unmelted
df_p = df.pivot(index='index', columns='variable', values='value').reset_index()
df_p.columns.name = ''
df_p['new'] = [list(range(x, y+1)) for x, y in zip(df_p.pop('early_date'), df_p.pop('late_date'))]
This is the result: in the column "new", the filling between 201952 & 202001 for index 1 has become 201952, 201953, 201954, ..., 202001.
However, the "new" column actually represents years and weeks. In the index 1 case, it should not fill anything between 201952 & 202001, and the result should be [201952, 202001], since week 52 is the end of the year.
What can I do to handle these cases?
IIUC, you can add a condition in your list comprehension:
df_p['new'] = [list(range(x, y+1)) if str(x)[-2:] != '52' else [x, y]
               for x, y in zip(df_p.pop('early_date'), df_p.pop('late_date'))]
print(df_p)
   index                                                new
0      1                                   [201952, 202001]
1      2                           [202002, 202003, 202004]
2      3  [202006, 202007, 202008, 202009, 202010, 20201...
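The condition above special-cases week 52, but some ISO years have 53 weeks. If your values are ISO year-week integers, a more general helper could look like this rough sketch (week_range is a hypothetical name, and it assumes early_date/late_date have not been popped yet):
import datetime

def week_range(start, end):
    """Expand YYYYWW integers into consecutive year-week values,
    rolling over after the last ISO week (52 or 53) of each year."""
    out, (y, w) = [], divmod(start, 100)
    while y * 100 + w < end:
        out.append(y * 100 + w)
        # Dec 28 always falls in the last ISO week of its year, so its
        # week number tells us whether the year has 52 or 53 weeks
        last_week = datetime.date(y, 12, 28).isocalendar()[1]
        y, w = (y + 1, 1) if w >= last_week else (y, w + 1)
    out.append(end)
    return out

df_p['new'] = [week_range(x, y)
               for x, y in zip(df_p.pop('early_date'), df_p.pop('late_date'))]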
I'm trying to figure out how to calculate the difference between the index of the current row and the index of the row where a certain column has a certain value.
i.e.
I have a dataframe:
import pandas as pd
# pandas settings
pd.set_option('display.max_columns', 320)
pd.set_option('display.max_rows', 1320)
pd.set_option('display.width', 320)
df = pd.read_csv('https://www.dropbox.com/s/hy94jp4d7qwmv04/eurusd_df1.csv?dl=1')
So I would like to calculate how many indexes back the row with candle = candle - 20 is.
So for instance, if the current row is 583185 and its candle value is 119, then the candle we are interested in is 99. We need to figure out current_index - index (of the first occurrence where candle == 99).
I hope I made myself clear, cheers =)
EDIT:
OK, I gave a pretty bad explanation above.
I believe I'm actually quite close to solving this myself. Have a look:
x = df.index[df.candle == df.candle - 20][0]
df['test'] = df.bid.rolling(int(x)).mean()
So the 'test' column should be the mean() of the last X rows of df.bid, where X is how many rows lie between the current df.candle and the one 20 candles back (the first occurrence, hence the [0], since there are many rows with the same candle value).
But the code above gives an error:
IndexError: index 0 is out of bounds for axis 0 with size 0
Here is a method to accomplish this:
import numpy as np
import pandas as pd

# Generate example data
np.random.seed(0)
df = pd.Series(np.round(np.random.rand(1000000)*1000), dtype=int, name='candle').to_frame()

# Compute the row index where df.candle is 20 less than candle_value at current_index
current_index = 583185
candle_value = df.loc[current_index, 'candle']  # = 119 in your df
index = df.index[df.candle == candle_value - 20][0]
print(index)
835
Edit: To compute the difference in indexes, just subtract them:
X = current_index - index
print(X)
582350
Then you can compute your formula:
b = 0.015 * TP.rolling(X).std()
I have a big dataframe (~10 million rows). Each row has:
category
start position
end position
If two rows are in the same category and their start and end positions overlap within a ±5 tolerance, I want to keep just one of the rows.
For example
1, cat1, 10, 20
2, cat1, 12, 21
3, cat2, 10, 25
I want to filter out 1 or 2.
What I'm doing right now isn't very efficient:
import pandas as pd

df = pd.read_csv('data.csv', sep='\t', header=None)

dfs = {}
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]

discard = []
rows = []
for index, row in df.iterrows():
    if index in discard:
        continue
    df_2 = dfs[row.category]
    res = df_2[(abs(df_2.start - row.start) <= params['min_distance']) &
               (abs(df_2.end - row.end) <= params['min_distance'])]
    if len(res.index) > 1:
        discard.extend(res.index.values)
    rows.append(row)

df = pd.DataFrame(rows)
I've also tried a different approach making use of a sorted version of the dataframe.
my_index = 0
indexes = []
discard = []
count = 0
curr = 0
total_len = len(df.index)
while my_index < total_len - 1:
    row = df.iloc[[my_index]]
    cond = True
    next_index = 1
    while cond:
        second_row = df.iloc[[my_index + next_index]]
        c1 = (row.iloc[0].category == second_row.iloc[0].category)
        c2 = (abs(second_row.iloc[0].sstart - row.iloc[0].sstart) <= params['min_distance'])
        c3 = (abs(second_row.iloc[0].send - row.iloc[0].send) <= params['min_distance'])
        cond = c1 and c2 and c3
        if cond and (c2 and c3):
            indexes.append(my_index)
            cond = True
            next_index += 1
    indexes.append(my_index)
    my_index += next_index
indexes.append(total_len - 1)
The problem is that this solution is not perfect; sometimes it misses a row because the overlapping row could be several rows ahead, not the next one.
I'm looking for ideas on how to approach this problem in a more pandas-friendly way, if one exists.
The approach here should be this:
1. pandas.groupby by category
2. agg(Func) on the groupby result
3. Func should implement the logic of finding the best range inside each category (sorted search, balanced trees, or anything else); a rough sketch follows below
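A minimal sketch of that idea, assuming the question's column names (category, start, end) and the ±5 tolerance; apply is used rather than agg because the function returns rows, not a scalar:
import pandas as pd

def pick_ranges(group, tol=5):
    """Keep the first row of each cluster whose start and end both lie
    within `tol` of the previously kept row."""
    group = group.sort_values(['start', 'end'])
    keep, last_start, last_end = [], None, None
    for idx, row in group.iterrows():
        if (last_start is None
                or abs(row['start'] - last_start) > tol
                or abs(row['end'] - last_end) > tol):
            keep.append(idx)
            last_start, last_end = row['start'], row['end']
    return group.loc[keep]

df = pd.DataFrame({'category': ['cat1', 'cat1', 'cat2'],
                   'start': [10, 12, 10],
                   'end': [20, 21, 25]})
print(df.groupby('category', group_keys=False).apply(pick_ranges))
On the question's sample this keeps the first cat1 row and the cat2 row, i.e. one of the two overlapping cat1 rows is dropped.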
Do you want to merge all similar rows, or only 2 consecutive ones?
If all similar, I suggest you first order the rows by category, then by the 2 other columns, and squash the similar ones into a single row.
If only 2 consecutive, then check whether the next value is in the range you set and, if yes, merge it. Here you can see how:
merge rows pandas dataframe based on condition
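For the "only 2 consecutive" variant, a rough vectorized sketch (again assuming columns named category, start, end and a tolerance of 5); sorting first makes near-duplicates adjacent, and as noted it will not catch overlaps that are several rows apart:
df = df.sort_values(['category', 'start', 'end'])
same_cat = df['category'].eq(df['category'].shift())
close_start = df['start'].sub(df['start'].shift()).abs().le(5)
close_end = df['end'].sub(df['end'].shift()).abs().le(5)
# drop a row only when it is close to the immediately preceding row of the same category
deduped = df[~(same_cat & close_start & close_end)]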
I don't believe the numeric comparisons can be made without a loop, but you can make at least part of this cleaner and more efficient:
dfs = {}
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]
Instead of this, use df.groupby('category').apply(drop_duplicates).droplevel(0), where drop_duplicates is a function containing your second loop. The function will then be called separately for each category, with a dataframe that contains only that category's rows. The outputs will be combined back into a single dataframe, which will have a MultiIndex with the value of "category" as the outer level; that level can be removed with droplevel(0).
Secondly, within each category you could sort by the first of the two numeric columns for another small speed-up:
def drop_duplicates(df):
    df = df.sort_values("sstart")
    ...
This allows you to stop the inner loop as soon as the sstart column value is out of range, instead of comparing every row to every other row.
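Putting these two suggestions together, here is a sketch of what drop_duplicates could look like; it assumes the question's sstart/send columns and params['min_distance'] tolerance, and is only an illustration of the early break, not a drop-in implementation:
def drop_duplicates(df):
    df = df.sort_values("sstart")
    starts = df["sstart"].to_numpy()
    ends = df["send"].to_numpy()
    tol = params['min_distance']
    keep, discarded = [], set()
    for i in range(len(df)):
        if i in discarded:
            continue
        keep.append(i)
        for j in range(i + 1, len(df)):
            # rows are sorted by sstart, so once the start is out of range
            # nothing further down can overlap: stop early
            if starts[j] - starts[i] > tol:
                break
            if abs(ends[j] - ends[i]) <= tol:
                discarded.add(j)
    return df.iloc[keep]

result = df.groupby('category').apply(drop_duplicates).droplevel(0)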
I have a dataset structured like this:
"Date","Time","Open","High","Low","Close","Volume"
This time series represents the values of a generic stock market.
I want to calculate the difference in percentage between two rows of the column "Close" (in fact, I want to know how much the value of the stock increased or decreased; each row represents a day).
I've done this with a for loop (which is terrible with pandas on a big data problem), and it creates the right results, but in a different DataFrame:
rows_number = df_stock.shape[0]

percentage_df = pd.DataFrame(columns=['Date', 'Percentage'])
# The first row will be 1 because it is calculated as a percentage; if there is no yesterday, the value must be 1
percentage_df = percentage_df.append({'Date': df_stock.iloc[0]['Date'], 'Percentage': 1}, ignore_index=True)

# For each day, calculate the market trend in percentage
for index in range(1, rows_number):
    # n_yesterday : 100 = (n_today - n_yesterday) : x
    n_today = df_stock.iloc[index]['Close']
    n_yesterday = df_stock.iloc[index-1]['Close']
    difference = n_today - n_yesterday
    percentage = (100 * difference) / n_yesterday
    percentage_df = percentage_df.append({'Date': df_stock.iloc[index]['Date'], 'Percentage': percentage}, ignore_index=True)
How could I refactor this to take advantage of the DataFrame API, removing the for loop and creating a new column in place?
df['Change'] = df['Close'].pct_change()
or if you want to calculate the change in reverse order:
df['Change'] = df['Close'].pct_change(-1)
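Note that pct_change() returns a fractional change and leaves NaN in the first row (there is no previous day), so to get percentages on the same scale as the loop in the question you can simply rescale:
df['Change'] = df['Close'].pct_change() * 100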
I would suggest first making the Date column a DateTime index; for this you can use:
df_stock = df_stock.set_index(['Date'])
df_stock.index = pd.to_datetime(df_stock.index, dayfirst=True)
Then you can access any row and column using datetime indexing and do whatever operations you want, for example to calculate the difference in percentage between two rows of the column "Close":
df_stock['percentage'] = ((df_stock.loc['15-07-2019', 'Close'] - df_stock.loc['14-07-2019', 'Close'])
                          / df_stock.loc['14-07-2019', 'Close']) * 100
You can also use a for loop to do the operations for each date or row:
for Dt in df_stock.index:
    ...  # operate on df_stock.loc[Dt] here
Using diff
(-df['Close'].diff())/df['Close'].shift()
This is the closest to what I'm looking for that I've found.
Let's say my dataframe looks something like this:
import pandas as pd

d = {'item_number': ['K208UL','AKD098008','DF900A','K208UL','AKD098008'],
     'Comp_ID': ['998798098','988797387','12398787','998798098','988797387'],
     'date': ['2016-11-12','2016-11-13','2016-11-17','2016-11-13','2016-11-14']}
df = pd.DataFrame(data=d)
I would like to count the number of times the same item_number and Comp_ID were observed on consecutive days.
I imagine this will look something along the lines of:
g = df.groupby(['Comp_ID','item_number'])
g.apply(lambda x: x.loc[x.iloc[i,'date'].shift(-1) - x.iloc[i,'date'] == 1].count())
However, I would need to extract the day from each date as an int before comparing, which I'm also having trouble with.
for i in df.index:
    wbc_seven.iloc[i, 'day_column'] = datetime.datetime.strptime(df.iloc[i,'date'], '%Y-%m-%d').day
Apparently location-based indexing only allows for integers? How could I solve this problem?
However, I would need to extract the day from each date as an int
before comparing, which I'm also having trouble with.
Why?
To fix your code, you need:
consecutive['date'] = pd.to_datetime(consecutive['date'])
g = consecutive.groupby(['Comp_ID','item_number'])
g['date'].apply(lambda x: sum(abs((x.shift(-1) - x)) == pd.to_timedelta(1, unit='D')))
Note the following:
The code above avoids repetitions. That is a basic programming principle: Don't Repeat Yourself
It converts 1 to timedelta for proper comparison.
It takes the absolute difference.
Tip: write a top-level function for your work instead of a lambda, as it gives better readability, brevity, and aesthetics:
def differencer(grp, day_dif):
    """Counts rows in grp separated by day_dif day(s)"""
    d = abs(grp.shift(-1) - grp)
    return sum(d == pd.to_timedelta(day_dif, unit='D'))

g['date'].apply(differencer, day_dif=1)
Explanation:
It is pretty straightforward. The dates are converted to Timestamp type and then subtracted. The difference results in a timedelta, which needs to be compared with a timedelta object, hence the conversion of 1 (or day_dif) to timedelta. The result of that comparison is a Boolean Series. Booleans are represented by 0 for False and 1 for True, so the sum of a Boolean Series returns the total number of True values in the Series.
One solution would be to use pivot tables to count the number of times a Comp_ID and an item_number were observed on consecutive days.
import pandas as pd

d = {'item_number': ['K208UL','AKD098008','DF900A','K208UL','AKD098008'],
     'Comp_ID': ['998798098','988797387','12398787','998798098','988797387'],
     'date': ['2016-11-12','2016-11-13','2016-11-17','2016-11-13','2016-11-14']}
# sort by date as well, so that shift(1) compares consecutive days within each pair
df = pd.DataFrame(data=d).sort_values(['item_number','Comp_ID','date'])
df['date'] = pd.to_datetime(df['date'])
df['delta'] = df['date'] - df['date'].shift(1)
df = (df[(df['delta'] == pd.Timedelta(days=1)) &
         (df['Comp_ID'] == df['Comp_ID'].shift(1)) &
         (df['item_number'] == df['item_number'].shift(1))]
      .pivot_table(index=['item_number','Comp_ID'], values=['date'], aggfunc='count')
      .reset_index())
df.rename(columns={'date': 'consecutive_days'}, inplace=True)
Results in
  item_number    Comp_ID  consecutive_days
0   AKD098008  988797387                 1
1      K208UL  998798098                 1