I have a data frame whose elements are lists. For each row, I want to find the value closest to a given number, but only if it lies within a percentage tolerance of that number.
My code:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A':[[1,2],[4,5,6]]})
df
A
0 [1, 2]
1 [4, 5, 6]
# in each row, let's find the value and its index that match 5 within a 20% tolerance
val = 5
tol = 0.2 # find values matching 5, or within 20% of 5 (i.e. between 4 and 6)
df['Matching_index'] = (df['A'].map(np.array)-val).map(abs).map(np.argmin)
Present solution:
df
A Matching_index
0 [1, 2] 1 # 2 is the value closest to 5, but it is outside the tolerance, so this is wrong
1 [4, 5, 6] 1 # 5 matches with 5, correct.
Expected solution:
df
A Matching_index
0 [1, 2] NaN # No matching value, hence NaN
1 [4, 5, 6] 1 # 5 matches with 5, correct.
The idea is to take the absolute difference from val and replace the values that do not meet the tolerance with missing values. np.nanargmin then gives the position of the closest match, but it raises an error if all values are missing, so the m.any() condition is added:
def f(x):
    a = np.abs(np.array(x) - val)
    m = a <= val * tol
    return np.nanargmin(np.where(m, a, np.nan)) if m.any() else np.nan

df['Matching_index'] = df['A'].map(f)
print (df)
A Matching_index
0 [1, 2] NaN
1 [4, 5, 6] 1.0
Pandas solution:
df1 = pd.DataFrame(df['A'].tolist(), index=df.index).sub(val).abs()
df['Matching_index'] = df1.where(df1 <= val * tol).dropna(how='all').idxmin(axis=1)
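For reference, a minimal end-to-end sketch of this pandas approach, with imports and the output it should produce, assuming the same df, val and tol as above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [[1, 2], [4, 5, 6]]})
val, tol = 5, 0.2

# expand the lists into columns, take the absolute distance from val
df1 = pd.DataFrame(df['A'].tolist(), index=df.index).sub(val).abs()

# keep only distances within the tolerance, then take the column (position) of the minimum
df['Matching_index'] = df1.where(df1 <= val * tol).dropna(how='all').idxmin(axis=1)
print(df)
#            A  Matching_index
# 0     [1, 2]             NaN
# 1  [4, 5, 6]             1.0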
I'm not sure if you want all the indexes or just a count.
Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[[1,2],[4,5,6,7,8]]})
val = 5
tol = 0.3
def closest(arr, val, tol):
    idxs = [idx for idx, el in enumerate(arr) if np.abs(el - val) < val * tol]
    result = len(idxs) if len(idxs) != 0 else np.nan
    return result

df['Matching_index'] = df['A'].apply(closest, args=(val, tol))
df
If you want all the indexes, just return idxs instead of len(idxs).
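For example, a hedged sketch of that variant (closest_all is just an illustrative name) could look like this; an empty list means nothing was within tolerance, and the caller can still map empty lists to NaN if needed:
def closest_all(arr, val, tol):
    # return every index whose value lies within val*tol of val
    return [idx for idx, el in enumerate(arr) if np.abs(el - val) < val * tol]

df['Matching_indexes'] = df['A'].apply(closest_all, args=(val, tol))
# 0           []        <- no value of [1, 2] is within tolerance of 5
# 1    [0, 1, 2]        <- with tol = 0.3, the values 4, 5 and 6 all qualify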
Normally, when you want to turn a set of data into a DataFrame, you make a list for each column, create a dictionary from those lists, and then create the DataFrame from the dictionary.
The DataFrame I want to create has 75 columns, all with the same number of rows. Defining the lists one by one isn't going to work. Instead, I decided to make a single list and iteratively put a chunk of it into each column of the DataFrame.
Here I will make an example where I turn a list into a data frame:
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Example list
df =
a b c d e
0 0 2 4 6 8
1 1 3 5 7 9
# Result I want from the example list
Here is my test code:
import pandas as pd
import numpy as np
dict = {'a':[], 'b':[], 'c':[], 'd':[], 'e':[]}
df = pd.DataFrame(dict)
# Here is my test data frame, it contains 5 columns and no rows.
lst = np.arange(10).tolist()
# This is my test list, it looks like this lst = [0, 1, …, 9]
for i in range(len(lst)):
    df.iloc[:, i] = df.iloc[:, i]\
        .append(pd.Series(lst[2 * i:2 * i + 2]))
# This code is supposed to put two entries per column for the whole data frame.
# For the first column, i = 0, so [2 * (0):2 * (0) + 2] = [0:2]
# df.iloc[:, 0] = lst[0:2], so df.iloc[:, 0] = [0, 1]
# Second column i = 1, so [2 * (1):2 * (1) + 2] = [2:4]
# df.iloc[:, 1] = lst[2:4], so df.iloc[:, 1] = [2, 3]
# This is how the code was supposed to allocate lst to df.
# However it outputs an error.
When I run this code I get this error:
ValueError: cannot reindex from a duplicate axis
When I add ignore_index = True such that I have
for i in range(len(lst)):
    df.iloc[:, i] = df.iloc[:, i]\
        .append(pd.Series(lst[2 * i:2 * i + 2]), ignore_index=True)
I get this error:
IndexError: single positional indexer is out-of-bounds
After running the code, I check the results of df. The output is the same whether I ignore index or not.
In: df
Out:
a b c d e
0 0 NaN NaN NaN NaN
1 1 NaN NaN NaN NaN
It seems that the first loop runs fine, but the error occurs when trying to fill the second column.
Does anybody know how to get this to work? Thank you.
IIUC:
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
alst = np.array(lst)
df = pd.DataFrame(alst.reshape(2,-1, order='F'), columns = [*'abcde'])
print(df)
Output:
a b c d e
0 0 2 4 6 8
1 1 3 5 7 9
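The key detail is order='F', which fills the reshaped array column by column (Fortran order), so consecutive pairs from the list end up stacked vertically. Assuming the same lst, an equivalent sketch without the order argument is to reshape into pairs and transpose:
import numpy as np
import pandas as pd

lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# reshape into rows of consecutive pairs, then transpose so each pair becomes a column
df = pd.DataFrame(np.array(lst).reshape(-1, 2).T, columns=[*'abcde'])
print(df)
#    a  b  c  d  e
# 0  0  2  4  6  8
# 1  1  3  5  7  9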
I have defined a function to create a dataframe, but I end up with a whole list in each column. How could I get each element of the list as a separate row in the dataframe, as shown below?
a = [1, 2, 3, 4]
def function():
    result = []
    for i in range(0, len(a)):
        number = [i for i in a]
        operation = [8*i for i in a]
        result.append({'number': number, 'operation': operation})
        df = pd.DataFrame(result, columns=['number', 'operation'])
        return df

function()
Result:
number operation
0 [1, 2, 3, 4] [8, 16, 24, 32]
What I really want is:
number operation
0 1 8
1 2 16
2 3 24
3 4 32
Can anyone help me please? :)
Your problem is twofold: firstly, you are pushing the entire list of values (instead of the "current" value) into the result list on each pass through your for loop, and secondly, you are overwriting the dataframe each time as well. It would be simpler to use a list comprehension to generate the values for the dataframe:
import pandas as pd
a = [1, 2, 3, 4]
def function():
    result = [{'number': i, 'operation': 8*i} for i in a]
    df = pd.DataFrame(result)
    return df
print(function())
Output:
number operation
0 1 8
1 2 16
2 3 24
3 4 32
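If the intermediate list of dicts isn't needed, a shorter sketch of the same idea, building the columns directly from the source list, could be:
import pandas as pd

a = [1, 2, 3, 4]
# build both columns directly from the source list
df = pd.DataFrame({'number': a, 'operation': [8 * i for i in a]})
print(df)
#    number  operation
# 0       1          8
# 1       2         16
# 2       3         24
# 3       4         32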
import numpy as np
import pandas as pd

a = [1, 2, 3, 4]

def function():
    for i in range(0, len(a)):
        number = [i for i in a]
        operation = [8*i for i in a]
    v = np.rot90(np.array((number, operation)))
    result = np.flipud(v)
    df = pd.DataFrame(result, columns=['number', 'operation'])
    return df

print(function())
number operation
0 1 8
1 2 16
2 3 24
3 4 32
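For what it's worth, np.rot90 followed by np.flipud is, for this 2xN array, the same as a plain transpose, so the same result can be sketched more directly (assuming the same number and operation lists):
import numpy as np
import pandas as pd

number = [1, 2, 3, 4]
operation = [8, 16, 24, 32]
# stacking the two lists as rows and transposing puts each (number, operation) pair on its own row
df = pd.DataFrame(np.array((number, operation)).T, columns=['number', 'operation'])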
You are almost there. Just replace number = [i for i in a] with number = a[i], and operation = [8*i for i in a] with operation = 8 * a[i].
(FYI: there is no need to create the pandas DataFrame inside the loop; you get the same output by creating it once, outside the loop.)
Refer to the below code:
import pandas as pd

a = [1, 2, 3, 4]

def function():
    result = []
    for i in range(0, len(a)):
        number = a[i]
        operation = 8 * a[i]
        result.append({'number': number, 'operation': operation})
    df = pd.DataFrame(result, columns=['number', 'operation'])
    return df

function()
number operation
0 1 8
1 2 16
2 3 24
3 4 32
Given a dataframe my goal is to sample rows such that values in one column are as balanced as possible.
Say I have the dataframe below, the sample size is 3, and the target column is c:
a | b | c
1 | 2 | 0
3 | 4 | 0
5 | 6 | 1
7 | 8 | 2
9 | 10| 2
11| 12| 2
One possible sample would be:
a | b | c
1 | 2 | 0
5 | 6 | 1
7 | 8 | 2
If the sample size is not a multiple of the number of unique classes, it is fine to have a difference of one item or so per class.
How would I approach this in pandas?
EDIT: posted the solution that worked for me in the answers.
I first generated sample sizes for each unique value of column c so that the result is balanced. The remainder is distributed over the first few classes:
unique_values = df['c'].unique()
sample_sizes = [k // len(unique_values)] * len(unique_values)

i = 0
while i < k % len(unique_values):
    sample_sizes[i] += 1
    i = i + 1
This bit generates the samples based on the generated sample sizes
df2= pd.concat([df.loc[df['c'] == unique_values[i]].sample() for i in range(len(sample_sizes)) for j in range(sample_sizes[i])])
You can just sample each group of the target column, using the minimum class count as the sample size:
column = 'c'
df = df.groupby(column).sample(n=df[column].value_counts().min(), random_state=42)
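On the question's example data the classes 0, 1 and 2 occur 2, 1 and 3 times, so the minimum count is 1 and this draws one row per class. A quick sketch, assuming a pandas version where GroupBy.sample is available (1.1+):
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 5, 7, 9, 11],
                   'b': [2, 4, 6, 8, 10, 12],
                   'c': [0, 0, 1, 2, 2, 2]})

column = 'c'
balanced = df.groupby(column).sample(n=df[column].value_counts().min(), random_state=42)
print(balanced)  # one randomly chosen row for each value of c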
First, we create your example dataframe
columns = ['a', 'b', 'c']
data = [[1, 2, 0], [3, 4, 0], [5, 6, 1], [7, 8, 2], [9, 10, 2], [11, 12, 2]]
df = pd.DataFrame(data = data, columns = columns)
Now, with the following function you can do what you want
def balanced_sample(dataframe, sample_size, target_column):
    # extract the existing classes
    target_columns_values = dataframe.loc[:, target_column].unique().tolist()
    # count the number of classes
    target_columns_unique_classes_size = len(target_columns_values)
    # check whether the sample size is a multiple of the number of classes
    if sample_size % target_columns_unique_classes_size != 0:
        print('Sample size is not a multiple of the number of unique classes')
    # allow a difference of one item or so
    instances_per_class = round(sample_size / target_columns_unique_classes_size)
    # another possibility is to use
    # sample_size // target_columns_unique_classes_size instead of round(...),
    # but then instances_per_class will always be <=
    # sample_size / target_columns_unique_classes_size
    # check that there are enough examples per class
    values_per_class = dataframe.loc[:, target_column].value_counts()
    for idx in values_per_class.index:
        if instances_per_class > values_per_class[idx]:
            print('Class {} has only {} examples, so it is impossible to use a '
                  'sample size of {}, i.e., {} per class'.format(
                      idx, values_per_class[idx], sample_size, instances_per_class))
            return pd.DataFrame(columns=dataframe.columns)
    # build the result dataframe
    data = []
    for classes in target_columns_values:
        class_values = dataframe[dataframe.loc[:, target_column] ==
                                 classes].sample(instances_per_class).values.tolist()
        data += class_values
    result_dataframe = pd.DataFrame(columns=dataframe.columns, data=data)
    return result_dataframe
Now we check the function:
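A call matching the question's setup (sample size 3, target column c) might look like the sketch below; the exact rows vary because the sampling is random:
print(balanced_sample(df, 3, 'c'))
# one row per class, e.g.
#    a  b  c
# 0  1  2  0
# 1  5  6  1
# 2  7  8  2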
And with other options:
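For a sample size that is not a multiple of the number of classes, or one that is too large for a class, the warnings kick in; a couple of illustrative calls on the same df:
# 4 is not a multiple of 3 classes: prints the warning, then samples round(4/3) = 1 per class
print(balanced_sample(df, 4, 'c'))

# 9 would need 3 rows per class, but class 1 has only one row:
# prints the message and returns an empty DataFrame
print(balanced_sample(df, 9, 'c'))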
I hope you find it useful; if you have any doubts, comment here and I will try to answer.
The question is a bit ambiguous, but let's say you want to randomly select one row for each category in column c; one could do:
import pandas as pd
data = [
[1, 2, 0], [1, 4, 0], [2, 2, 1],
[4, 5, 1], [3, 7, 2], [3, 3, 2],
[1, 2, 6], [3, 2, 6], [5, 2, 6]
]
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
sample = df.groupby('c').apply(lambda x: x.sample(n=1).squeeze())
a b c
c
0 1 4 0
1 2 2 1
2 3 3 2
6 1 2 6
I am posting the solution that works for me. It is not the most beautiful or efficient code. But that's honest work.
df = pd.read_csv(path)
target_col = 't'
unique_values = df[target_col].unique()
k = 8  # sample size

per_class_sample_size = int(k / unique_values.shape[0])
arr_samples_per_class = [0] * len(unique_values)
leftover = k - (per_class_sample_size * len(unique_values))

for i, v in enumerate(unique_values):
    occ = df[df[target_col] == v].shape[0]
    if leftover > 0 and occ > per_class_sample_size:
        sz = per_class_sample_size + 1
        leftover -= 1
    else:
        sz = per_class_sample_size if occ >= per_class_sample_size else occ
    arr_samples_per_class[i] = sz

fdf = None
for v, sz in zip(unique_values, arr_samples_per_class):
    ss = df.loc[df[target_col] == v].sample(sz)
    fdf = ss if fdf is None else pd.concat([fdf, ss], axis=0)
I have a dataframe:
A B C D
1 0 0 2
0 1 0 0
0 0 0 0
I need to select all values which are greater than 0 and put them in a list.
If a row doesn't contain any positive value, 0 should be written to the list.
So, the output for given dataframe should look like this:
[1,2,1,0]
How can this be resolved?
Here is a simple loop you could use (looping through df.values gives us rows as arrays):
output = []
for ar in df.values:
    nonzeros = ar[ar > 0]
    # If nonzeros is not empty, extend the output with them
    if nonzeros.size:
        output.extend(nonzeros)
    # If not, add a 0
    else:
        output.append(0)
print(output)
returns:
[1, 2, 1, 0]
We can make extensive use of pandas + numpy here:
Mask all values which are greater than 0
m = df.gt(0)
A B C D
0 True False False True
1 False True False False
2 False False False False
Mask rows which don't contain any values above 0:
s1 = m.any(axis=1).astype(int).values
Get all the values greater than 0 in an array:
s2 = df.values[m]
Finally concat both arrays with each other:
np.concatenate([s2, s1[s1==0]]).tolist()
Output
[1, 2, 1, 0]
In your case, first stack the df, then apply your condition: if the row contains non-zero values we select them; if the row is all zeros, we keep a single zero.
df.stack().groupby(level=0).apply(lambda x : x.head(1) if all(x==0) else x[x!=0]).tolist()
[1, 2, 1, 0]
Or without apply
np.concatenate(df.mask(df==0).stack().groupby(level=0).apply(list).reindex(df.index,fill_value=[0]).values)
array([1., 2., 1., 0.])
Shorten the process
np.concatenate(list(map(lambda x : [x[0]] if all(x==0) else x[x!=0],df.values)))
array([1, 2, 1, 0])
You could apply a custom function which processes each row of the DataFrame and returns a list, then sum the returned lists.
In [1]: import pandas as pd
In [2]: df = pd.read_clipboard()
In [3]: df
Out[3]:
A B C D
0 1 0 0 2
1 0 1 0 0
2 0 0 0 0
In [4]: def get_positive_values(row):
   ...:     # If all elements in a row are zeros
   ...:     # then return a list with a single zero
   ...:     if row.eq(0).all():
   ...:         return [0]
   ...:     # Else return a list with positive values only.
   ...:     return row[row.gt(0)].tolist()
   ...:
   ...:
In [5]: df.apply(get_positive_values, axis=1).sum()
Out[5]: [1, 2, 1, 0]
I have a large dataframe of stock price data with df.columns = ['open', 'high', 'low', 'close'].
Problem definition:
When an EMA crossover happens, I mark it with df['cross'] = 'cross'. Every time a crossover happens, if we label the current crossover as crossover 4, I want to check whether the minimum value of df['low'] between crossovers 3 and 4 IS GREATER THAN the minimum value of df['low'] between crossovers 1 and 2. I have made an attempt at the code based on the help I have received from 'Gherka' so far: I have indexed the crossovers and found the minimum values between consecutive crossovers.
So, every time a crossover happens, it has to be compared with the previous three crossovers, and I need to check MIN(CROSS4, CROSS3) > MIN(CROSS2, CROSS1).
I would really appreciate it if you guys could help me complete this.
import pandas as pd
import numpy as np
import bisect as bs

data = pd.read_csv("Nifty.csv")
df = pd.DataFrame(data)
df['5EMA'] = df['Close'].ewm(span=5).mean()
df['10EMA'] = df['Close'].ewm(span=10).mean()

condition1 = df['5EMA'].shift(1) < df['10EMA'].shift(1)
condition2 = df['5EMA'] > df['10EMA']
df['cross'] = np.where(condition1 & condition2, 'cross', None)
cross_index_array = df.loc[df['cross'] == 'cross'].index

def find_index(a, x):
    i = bs.bisect_left(a, x)
    return a[i-1]

def min_value(x):
    """Find the minimum value of 'Low' between crossovers 1 and 2, crossovers 3 and 4, etc..."""
    cur_index = x.name
    prev_cross_index = find_index(cross_index_array, cur_index)
    return df.loc[prev_cross_index:cur_index, 'Low'].min()

df['min'] = None
df['min'][df['cross'] == 'cross'] = df.apply(min_value, axis=1)
print(df)
This should do the trick:
import pandas as pd
df = pd.DataFrame({'open': [1, 2, 3, 4, 5],
                   'high': [5, 6, 6, 5, 7],
                   'low': [1, 3, 3, 4, 4],
                   'close': [3, 5, 3, 5, 6]})

df['day'] = df.apply(lambda x: 'bull' if (
    x['close'] > x['open']) else None, axis=1)

df['min'] = None
df['min'][df['day'] == 'bull'] = pd.rolling_min(
    df['low'][df['day'] == 'bull'], window=2)
print(df)
# close high low open day min
# 0 3 5 1 1 bull NaN
# 1 5 6 3 2 bull 1
# 2 3 6 3 3 None None
# 3 5 5 4 4 bull 3
# 4 6 7 4 5 bull 4
Open for comments!
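Note that pd.rolling_min was deprecated and later removed from pandas; on a recent version the same assignment can be sketched with the .rolling accessor instead (the same chained-assignment caveats apply):
# modern pandas equivalent: rolling minimum of 'low' over the last two 'bull' days
df['min'][df['day'] == 'bull'] = df['low'][df['day'] == 'bull'].rolling(window=2).min()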
If I understand your question correctly, you need a dynamic "rolling window" over which to calculate the minimum value. Assuming your index is a default one, meaning it is sorted in ascending order, you can try the following approach:
import pandas as pd
import numpy as np
from bisect import bisect_left
df = pd.DataFrame({'open': [1, 2, 3, 4, 5],
                   'high': [5, 6, 6, 5, 7],
                   'low': [1, 3, 2, 4, 4],
                   'close': [3, 5, 3, 5, 6]})
This uses the same sample data as mommermi, but with low on the third day changed to 2, as the third day should also be included in the "rolling window".
df['day'] = np.where(df['close'] > df['open'], 'bull', None)
We calculate the day column using a vectorized numpy operation, which should be a little faster.
bull_index_array = df.loc[df['day'] == 'bull'].index
We store the index values of the rows (days) that we've flagged as bulls.
def find_index(a, x):
    i = bisect_left(a, x)
    return a[i-1]
Bisect from the standard library lets us find the index of the previous bull day efficiently. This requires that the index is sorted, which it is by default.
def min_value(x):
    cur_index = x.name
    prev_bull_index = find_index(bull_index_array, cur_index)
    return df.loc[prev_bull_index:cur_index, 'low'].min()
Next, we define a function that will create our "dynamic" rolling window by slicing the original dataframe by previous and current index.
df['min'] = df.apply(min_value, axis=1)
Finally, we apply the min_value function row-wise to the dataframe, yielding this:
open high low close day min
0 1 5 1 3 bull NaN
1 2 6 3 5 bull 1.0
2 3 6 2 3 None 2.0
3 4 5 4 5 bull 2.0
4 5 7 4 6 bull 4.0
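The original question also asks to compare the latest window minimum with the one from two crossovers earlier (MIN between crossovers 3 and 4 versus MIN between crossovers 1 and 2). A hedged sketch of that last step, reusing the 'min' column built above and the 'bull' flag in place of the OP's 'cross' marker (the names mins_at_cross and higher_low are just illustrative):
# take the window minimums only at the flagged rows, in order of occurrence
mins_at_cross = df.loc[df['day'] == 'bull', 'min']

# compare each window minimum with the one two crossovers earlier;
# shift(2) lines up MIN(cross3, cross4) with MIN(cross1, cross2)
df.loc[df['day'] == 'bull', 'higher_low'] = mins_at_cross > mins_at_cross.shift(2)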