Calculate pandas dataframe index difference based on the value of another column - python

I'm trying to figure out how to calculate the difference between the index of the current row and the index of the row where a certain column has a certain value.
i.e.
I have a dataframe:
import pandas as pd
# pandas settings
pd.set_option('display.max_columns', 320)
pd.set_option('display.max_rows', 1320)
pd.set_option('display.width', 320)
df = pd.read_csv('https://www.dropbox.com/s/hy94jp4d7qwmv04/eurusd_df1.csv?dl=1')
So I would like to calculate how many index positions behind the current row is the row whose candle value is 20 less (candle - 20).
So for instance, if the current row is 583185 and its candle value is 119, then the candle we are interested in is 99. We need to figure out current_index - index (where candle == 99, first occurrence).
I hope I made myself clear, cheers =)
EDIT:
Ok, I did a pretty bad job explaining above..
I believe I'm actually quite close to solving this myself. Have a look:
x = df.index[df.candle == df.candle - 20][0]
df['test'] = df.bid.rolling(int(x)).mean()
So the 'test' column should be the mean() of the last X rows of df.bid, where X is the number of rows between the current df.candle and the candle that is 20 back (first occurrence, hence the [0], since there are many rows with the same candle value).
But the code above gives an error:
IndexError: index 0 is out of bounds for axis 0 with size 0

Here is a method to accomplish this:
import numpy as np
import pandas as pd

# Generate example data
np.random.seed(0)
df = pd.Series(np.round(np.random.rand(1000000)*1000), dtype=int, name='candle').to_frame()
# Compute row index where df.candle is 20 less than candle_value at current_index
current_index = 583185
candle_value = df.loc[current_index, 'candle'] # = 119 in your df
index = df.index[df.candle == candle_value - 20][0]
print(index)
835
Edit: To compute the difference in indexes, just subtract them:
X = current_index - index
print(X)
582350
Then you can compute your formula:
b = 0.015 * TP.rolling(X).std()
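To tie this back to the rolling mean in the question, here is a minimal sketch (my own addition, not part of the original answer) that guards against the empty-match case behind the IndexError; it assumes the question's column names candle and bid:
# Hypothetical helper: distance (in rows) back to the first row whose candle
# value is 20 less than the candle at current_index; returns None if no match.
def rows_back_to_candle(df, current_index, offset=20):
    target = df.loc[current_index, 'candle'] - offset
    matches = df.index[df['candle'] == target]
    if len(matches) == 0:
        return None
    return current_index - matches[0]

X = rows_back_to_candle(df, current_index)
if X is not None and X > 0:
    df['test'] = df['bid'].rolling(int(X)).mean()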

Related

How to compute occurrences of a specific value and its percentage for each column, based on a condition, in a pandas dataframe?

I have the following dataframe df, in which I highlighted in green the cells with values of interest:
[image: dataframe with the cells of interest highlighted in green]
and I would like to obtain, for each column (therefore considering the whole dataframe), the following statistics: the count of values less than or equal to 0.5 (the green cells in the dataframe), excluding NaN values, and its percentage within the column, so that I can use, say, 50% as a benchmark.
For the point asked I tried value_counts, like (df['A'].value_counts()/df['A'].count())*100, but this returns a partial result, not the way I would like, and only for specific columns; I was also thinking about using filter or a lambda function like df.loc[lambda x: x <= 0.5], but clearly that is not the result I wanted.
The goal/output will be a dataframe, as shown below, displaying just the columns that "beat" the benchmark (recall: at least half (50%) of their values <= 0.5).
[image: expected output dataframe]
e.g. in column A the count would be 2 and the percentage: 2/3 * 100 = 66%, while in column B the count would be 4 and the percentage: 4/8 * 100 = 50% (the same goes for columns X, Y and Z). On the other hand, column C, where 2/8 * 100 = 25%, doesn't beat the benchmark and is therefore not included in the output.
Is there a suitable way to achieve this? Apologies in advance if this is a somewhat duplicated question, but I found no other questions able to help me out. Thanks to any saviour.
I believe I have understood your ask in the code below...
It would be good if you could provide an expected output in your question so that it is easier to follow.
Anyway, the first part of the code below is just setup, so it can be ignored since you already have your data set up.
Basically I have created a quick function for you that returns the percentage of values under a threshold that you can define.
This function is called in a loop over all the columns in your dataframe, and if this percentage is more than the output threshold (again, you can define it) the column is kept for the actual output.
import pandas as pd
import numpy as np
import random
import datetime

### SET UP ###
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(10)]

def rand_num_list(length):
    peak = [round(random.uniform(0,1),1) for i in range(length)] + [0] * (10-length)
    random.shuffle(peak)
    return peak

df = pd.DataFrame(
    {
        'A':rand_num_list(3),
        'B':rand_num_list(5),
        'C':rand_num_list(7),
        'D':rand_num_list(2),
        'E':rand_num_list(6),
        'F':rand_num_list(4)
    },
    index=date_list
)
df = df.replace({0:np.nan})
##############

print(df)

def less_than_threshold(thresh_df, thresh_col, threshold):
    if len(thresh_df[thresh_col].dropna()) == 0:
        return 0
    return len(thresh_df.loc[thresh_df[thresh_col]<=threshold]) / len(thresh_df[thresh_col].dropna())

output_dict = {'cols':[]}
col_threshold = 0.5
output_threshold = 0.5

for col in df.columns:
    if less_than_threshold(df, col, col_threshold) >= output_threshold:
        output_dict['cols'].append(col)

df_output = df.loc[:,output_dict.get('cols')]
print(df_output)
Hope this achieves your goal!
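As a compact alternative, here is a loop-free sketch (my own addition, assuming the same df as in the setup above): le() is NaN-aware, so it gives the per-column count directly, and notna().sum() gives the non-NaN denominator.
# Sketch: fraction of non-NaN values <= 0.5 per column, then keep the columns
# that meet the 50% benchmark (columns that are entirely NaN would need extra care).
pct = df.le(0.5).sum() / df.notna().sum()
df_output = df.loc[:, pct[pct >= 0.5].index]
print(pct)
print(df_output)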

Using pandas, how to filter rows with similar values in two columns

I have a big dataframe (~10 millon rows). Each row has:
category
start position
end position
If two rows are in the same category and the start and end position overlap with a +-5 tolerance, I want to keep just one of the rows.
For example
1, cat1, 10, 20
2, cat1, 12, 21
3, cat2, 10, 25
I want to filter out 1 or 2.
What I'm doing right now isn't very efficient:
import pandas as pd
df = pd.read_csv('data.csv', sep='\t', header=None)
dfs = {}
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]
discard = []
rows = []
for index, row in df.iterrows():
    if index in discard:
        continue
    df_2 = dfs[row.category]
    res = df_2[(abs(df_2.start - row.start) <= params['min_distance']) & (abs(df_2.end - row.end) <= params['min_distance'])]
    if len(res.index) > 1:
        discard.extend(res.index.values)
    rows.append(row)
df = pd.DataFrame(rows)
I've also tried a different approach making use of a sorted version of the dataframe.
my_index = 0
indexes = []
discard = []
count = 0
curr = 0
total_len = len(df.index)
while my_index < total_len - 1:
    row = df.iloc[[my_index]]
    cond = True
    next_index = 1
    while cond:
        second_row = df.iloc[[my_index + next_index]]
        c1 = (row.iloc[0].category == second_row.iloc[0].category)
        c2 = (abs(second_row.iloc[0].sstart - row.iloc[0].sstart) <= params['min_distance'])
        c3 = (abs(second_row.iloc[0].send - row.iloc[0].send) <= params['min_distance'])
        cond = c1 and c2 and c3
        if cond and (c2 and c3):
            indexes.append(my_index)
            cond = True
            next_index += 1
    indexes.append(my_index)
    my_index += next_index
indexes.append(total_len - 1)
The problem is that this solution is not perfect; sometimes it misses a row, because the overlapping row could be several rows ahead rather than the next one.
I'm looking for ideas on how to approach this problem in a more pandas-friendly way, if one exists.
The approach here should be this:
pandas.groupby by categories
agg(Func) on groupby result
the Func should implement the logic of finding the best range inside categories (sorted search, balanced trees or anything else)
Do you want to merge all similar rows, or only 2 consecutive ones?
If all similar, I suggest you first order the rows by category, then by the 2 other columns, and squash similar rows into a single row.
If only 2 consecutive, then check whether the next value is in the range you set and, if yes, merge it. Here you can see how:
merge rows pandas dataframe based on condition
I don't believe the numeric comparisons can be made without a loop, but you can make at least part of this cleaner and more efficient:
dfs = {}
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]
Instead of this, use df.groupby('category').apply(drop_duplicates).droplevel(0), where drop_duplicates is a function containing your second loop. The function will then be called separately for each category, with a dataframe that contains only that category's rows. The outputs will be combined back into a single dataframe. The result will have a MultiIndex with the value of "category" as an outer level; this can be removed with droplevel(0).
Secondly, within the category you could sort by the first of the two numeric columns for another small speed-up:
def drop_duplicates(df):
    df = df.sort_values("sstart")
    ...
This will allow you to stop the inner loop as soon as the sstart column value is out of range, instead of comparing every row to every other row.
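Here is a minimal sketch of the groupby + sorting approach described above (my own illustration, not tested against your data); it assumes the column names category, sstart and send from your second snippet and a tolerance of 5, and for simplicity it only compares each row with the most recently kept row after sorting:
import pandas as pd

def drop_duplicates(group, tol=5):
    # Sort by start position so near-duplicate ranges end up adjacent
    group = group.sort_values('sstart')
    keep = []
    last_start = last_end = None
    for idx, row in group.iterrows():
        if (last_start is not None
                and abs(row['sstart'] - last_start) <= tol
                and abs(row['send'] - last_end) <= tol):
            continue  # overlaps the previously kept row, skip it
        keep.append(idx)
        last_start, last_end = row['sstart'], row['send']
    return group.loc[keep]

# deduped = df.groupby('category').apply(drop_duplicates).droplevel(0)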

Count the number of days in several ranges

I have rows representing ranges (from->to). Here is a subset of the data.
df = DataFrame({'from': ['2015-08-24','2015-08-24'], 'to': ['2015-08-26','2015-08-31']})
from to
0 2015-08-24 2015-08-26
1 2015-08-24 2015-08-31
I want to count the number of business days for each day in the ranges. Here is my code.
# Creating a business time index by taking the min and max values from the ranges
b_range = pd.bdate_range(start=min(df['from']), end=max(df['to']))
# Init of a new DataFrame with this index and the count at 0
result = DataFrame(0, index=b_range, columns=['count'])
# Iterating over the range to select the index in the result and update the count column
for index, row in df.iterrows():
    result.loc[pd.bdate_range(row['from'],row['to']),'count'] += 1
print(result)
count
2015-08-24 2
2015-08-25 2
2015-08-26 2
2015-08-27 1
2015-08-28 1
2015-08-31 1
It works, but does anyone know a more pythonic way of doing that (i.e. without the for loop) ?
Caveat: I sort of hate this answer, but on this tiny dataframe it is over 2x faster, so I'll throw it out there as a workable, if not elegant, alternative.
df2 = df.apply( lambda x: [ pd.bdate_range( x['from'], x['to'] ) ], axis=1 )
arr = np.unique( np.hstack( df2.values ), return_counts=True )
result = pd.DataFrame( arr[1], index=arr[0] )
Basically all I'm doing here is making a column with all the dates in it and then using numpy unique (the analog of pandas value_counts) to add everything up. I was hoping to come up with something more elegant and readable, but this is what I have at the moment.
Here is a method that uses cumsum(). It should be faster than the for loop if you have a lot of ranges:
import pandas as pd
df = pd.DataFrame({
    'from': ['2015-08-24','2015-08-24'],
    'to': ['2015-08-26','2015-08-31']})
df = df.apply(pd.to_datetime)
from_date = min(df['from'])
to_date = max(df['to'])
b_range = pd.bdate_range(start=from_date, end=to_date)
d_range = pd.date_range(start=from_date, end=to_date)
# running counter initialized to 0 over every calendar day
s = pd.Series(0, index=d_range)
# number of ranges starting / ending on each date
from_count = df["from"].value_counts()
to_count = df["to"].value_counts()
# +starts, -ends (shifted one day so the end date itself is still counted),
# then cumsum() gives the number of open ranges per day; keep business days only
s.add(from_count, fill_value=0).sub(to_count.shift(freq="D"), fill_value=0).cumsum().reindex(b_range)
I was not completely satisfied by these solutions. So I kept searching and I think I found a rather elegant and fast solution.
It's inspired by the section "Pivoting 'long' to 'wide' format" explained in the Wes McKinney book: Python for Data Analysis.
I have put a lot of comments in my code but I think it's preferable to print out each step to understand it.
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas.tseries.offsets import Day

df = DataFrame({'from': ['2015-08-24','2015-08-24'], 'to': ['2015-08-26','2015-08-31']})
# Convert boundaries to datetime
df['from'] = pd.to_datetime(df['from'], format='%Y-%m-%d')
df['to'] = pd.to_datetime(df['to'], format='%Y-%m-%d')
# Resetting index to create a row id named 'index'
df = df.reset_index(level=0)
# Pivoting data to obtain 'from' as row index and row id ('index') as column,
# each cell containing the 'to' date.
# In consequence each range (from - to pair) gets its own column.
pivoted = df.pivot('from', 'index', 'to')
# Reindexing the table with a range of business dates (i.e. working days)
pivoted = pivoted.reindex(index=pd.bdate_range(start=min(df['from']),
                                               end=max(df['to'])))
# Filling the NA values forward to copy the 'to' date down,
# so each row of each column contains the corresponding 'to' date
pivoted = pivoted.fillna(method='ffill')
# Computing, for each column and each row, the difference between the 'to' date
# and the date in the index, converted to days.
# Note: one day is added to include the right side of the interval
pivoted = pivoted.apply(lambda x: (x + Day() - x.index) / np.timedelta64(1, 'D'),
                        axis=0)
# Clipping values lower than 0 (not in the range) to 0
# and values greater than 0 to 1 (only one per day and per id)
pivoted = pivoted.clip_lower(0).clip_upper(1)
# Summing along the columns and that's it
pivoted.sum(axis=1)

Pandas: Filter dataframe for values that are too frequent or too rare

On a pandas dataframe, I know I can groupby on one or more columns and then filter values that occur more/less than a given number.
But I want to do this on every column of the dataframe. I want to remove values that are too infrequent (let's say values that occur less than 5% of the time) or too frequent. As an example, consider a dataframe with the following columns: city of origin, city of destination, distance, type of transport (air/car/foot), time of day, price-interval.
import pandas as pd
import string
import numpy as np
vals = [(c, np.random.choice(list(string.lowercase), 100, replace=True)) for c in
'city of origin', 'city of destination', 'distance, type of transport (air/car/foot)', 'time of day, price-interval']
df = pd.DataFrame(dict(vals))
>> df.head()
city of destination city of origin distance, type of transport (air/car/foot) time of day, price-interval
0 f p a n
1 k b a f
2 q s n j
3 h c g u
4 w d m h
If this is a big dataframe, it makes sense to remove rows that have spurious items, for example, if time of day = night occurs only 3% of the time, or if foot mode of transport is rare, and so on.
I want to remove all such values from all columns (or a list of columns). One idea I have is to do a value_counts on every column, transform and add one column for each value_counts; then filter based on whether they are above or below a threshold. But I think there must be a better way to achieve this?
This procedure will go through each column of the DataFrame and eliminate rows whose value in that column occurs less often than a given threshold percentage, shrinking the DataFrame on each loop.
This answer is similar to that provided by @Ami Tavory, but with a few subtle differences:
It normalizes the value counts so you can just use a percentile threshold.
It calculates counts just once per column instead of twice. This results in faster execution.
Code:
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]
Code timing:
df2 = pd.DataFrame(np.random.choice(list(string.lowercase), [1e6, 4], replace=True),
                   columns=list('ABCD'))
%%timeit df=df2.copy()
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]
1 loops, best of 3: 485 ms per loop
%%timeit df=df2.copy()
m = 0.03 * len(df)
for c in df:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]
1 loops, best of 3: 688 ms per loop
I would go with one of the following:
Option A
m = 0.03 * len(df)
df[np.all(
    df.apply(
        lambda c: c.isin(c.value_counts()[c.value_counts() > m].index).as_matrix()),
    axis=1)]
Explanation:
m = 0.03 * len(df) is the threshold (it's nice to take the constant out of the complicated expression)
df[np.all(..., axis=1)] retains the rows where some condition was obtained across all columns.
df.apply(...).as_matrix applies a function to all columns, and makes a matrix of the results.
c.isin(...) checks, for each column item, whether it is in some set.
c.value_counts()[c.value_counts() > m].index is the set of all values in a column whose count is above m.
Option B
m = 0.03 * len(df)
for c in df.columns:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]
The explanation is similar to the one above.
Tradeoffs:
Personally, I find B more readable.
B creates a new DataFrame for each filtering of a column; for large DataFrames, it's probably more expensive.
I am new to Python and using Pandas. I came up with the following solution below. Maybe other people might have a better or more efficient approach.
Assuming your DataFrame is DF, you can use the code below to filter out all infrequent values. Just be sure to update the col and bin_freq variables. DF_Filtered is your new filtered DataFrame.
# Column you want to filter
col = 'time of day'
# Set your frequency to filter out. Currently set to 5%
bin_freq = float(5)/float(100)
DF_Filtered = pd.DataFrame()
for i in DF[col].unique():
    counts = DF[DF[col]==i].count()[col]
    total_counts = DF[col].count()
    freq = float(counts)/float(total_counts)
    if freq > bin_freq:
        DF_Filtered = pd.concat([DF[DF[col]==i],DF_Filtered])
print(DF_Filtered)
DataFrames support clip_lower(threshold, axis=None) and clip_upper(threshold, axis=None), which clip all values below or above (respectively) a certain threshold to that threshold.
We can also replace all the rare categories with one label, say "Rare", and remove them later if this doesn't add value to prediction.
# function finds the labels that occur more than a certain percentage/threshold
def get_freq_labels(df, var, rare_perc):
    df = df.copy()
    tmp = df.groupby(var)[var].count() / len(df)
    return tmp[tmp > rare_perc].index

vars_cat = [val for val in data.columns if data[val].dtype=='O']
for var in vars_cat:
    # find the frequent categories
    frequent_cat = get_freq_labels(data, var, 0.05)
    # replace rare categories by the string "Rare"
    data[var] = np.where(data[var].isin(frequent_cat), data[var], 'Rare')

How to change values in a dataframe efficiently

# Incorporate delisting return
i = 0
for tc, col in dlret.iloc[:,0:50].iteritems():
    idx = col.index[col.notnull()]
    if len(idx) != 0:
        tr = idx[0]
        val = col.ix[tr]
        #ret.ix[tr, tc] = val #this line is too slow
    i += 1
    if math.floor(i/10) > math.floor((i-1)/10):
        print i
The dlret DataFrame has 600 or so rows and 25,000+ columns. I iterate through the columns to look for the first non-null value (the delisting return) and then find the corresponding location in the ret DataFrame to set the value to that of the delisting return. However, the code runs painfully slowly when using ix to index the corresponding location. Any suggestion on how to achieve this efficiently?
According to your comment, what you want is to iterate through the columns, look for the first non-null value in each column, and update the ret DataFrame accordingly.
You can do this with the following code:
mask_first_nonnull = dlret.notnull() & (dlret.notnull().cumsum()==1)
ret[mask_first_nonnull]=dlret[mask_first_nonnull]
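For illustration, here is a small self-contained example (my own toy data, with made-up column names) showing how the mask keeps only the first non-null value in each column:
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for dlret and ret
dlret = pd.DataFrame({'A': [np.nan, 0.1, 0.2], 'B': [np.nan, np.nan, -0.3]})
ret = pd.DataFrame(0.0, index=dlret.index, columns=dlret.columns)

# True only at the first non-null cell of each column
mask_first_nonnull = dlret.notnull() & (dlret.notnull().cumsum() == 1)
ret[mask_first_nonnull] = dlret[mask_first_nonnull]
# ret now contains 0.1 in column A (row 1) and -0.3 in column B (row 2),
# and zeros everywhere else.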
