data frame and list operation - python

There are three columns in df: mins, maxs, and col. I would like to generate a binary list according to the following rule: if col[i] is smaller than or equal to mins[i], add a "1" to the list and keep adding "1" for each row until col[i+n] is greater than or equal to maxs[i+n]. After reaching maxs[i+n], add "0" to the list for each row until finding the next row where col[i] is smaller than or equal to mins[i]. Repeat this entire process until all rows have been covered.
For example,
col  mins  maxs
 2     1     6    (0)
 4     2     6    (0)
 2     3     7    (1)
 5     5     6    (1)
 4     3     8    (1)
 4     2     5    (1)
 5     3     5    (0)
 4     0     5    (0)
 3     3     8    (1)
......
So the list would be [0,0,1,1,1,1,0,0,1]. Does this make sense?
I gave it a shot and wrote the following, which unfortunately did not achieve what I wanted.
def get_list(col, mins, maxs):
    l = []
    i = 0
    while i <= len(col):
        if col[i] <= mins[i]:
            l.append(1)
            while col[i+1] <= maxs[i+1]:
                l.append(1)
                i += 1
                break
            break
    return l
Thank you so much folks!

My answer may not be elegant but should work according to your expectation.
Import the pandas library.
import pandas as pd
Create the dataframe from the data provided.
input_data = {
    'col':  [2, 4, 2, 5, 4, 4, 5, 4, 3],
    'mins': [1, 2, 3, 5, 3, 2, 3, 0, 3],
    'maxs': [6, 6, 7, 6, 8, 5, 5, 5, 8]
}
dataframe_ = pd.DataFrame(data=input_data)
Using a for loop, iterate over the rows. The switch variable changes depending on the conditions provided, which results in the binary column being populated.
binary_switch = False
for index, row in dataframe_.iterrows():
    if row['col'] <= row['mins']:
        binary_switch = True
    elif row['col'] >= row['maxs']:
        binary_switch = False
    binary_output = 1 if binary_switch else 0
    dataframe_.at[index, 'binary'] = binary_output
dataframe_['binary'] = dataframe_['binary'].astype('int')
print(dataframe_)
Output from code.
col mins maxs binary
0 2 1 6 0
1 4 2 6 0
2 2 3 7 1
3 5 5 6 1
4 4 3 8 1
5 4 2 5 1
6 5 3 5 0
7 4 0 5 0
8 3 3 8 1

Your rules give the following decision tree:
1: is col <= mins?
   True:  l.append(1)
   False: next question
2: was col <= mins before?
   False: l.append(0)
   True:  next question
3: is col >= maxs?
   True:  l.append(0)
   False: l.append(1)
Making this into a function with an if/else tree, you get this:
def make_binary_list(df):
    l = []
    col_lte_mins = False
    for index, row in df.iterrows():
        col = row["col"]
        mins = row["mins"]
        maxs = row["maxs"]
        if col <= mins:
            col_lte_mins = True
            l.append(1)
        else:
            if col_lte_mins:
                if col >= maxs:
                    col_lte_mins = False
                    l.append(0)
                else:
                    l.append(1)
            else:
                l.append(0)
    return l
make_binary_list(df) gives [0, 0, 1, 1, 1, 1, 0, 0, 1]
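For completeness, a minimal self-contained run of the function against the sample data from the question (the same values used in the first answer above):

import pandas as pd

df = pd.DataFrame({'col':  [2, 4, 2, 5, 4, 4, 5, 4, 3],
                   'mins': [1, 2, 3, 5, 3, 2, 3, 0, 3],
                   'maxs': [6, 6, 7, 6, 8, 5, 5, 5, 8]})

print(make_binary_list(df))   # [0, 0, 1, 1, 1, 1, 0, 0, 1]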

Related

pandas for loop for running average does not work

I tried to make a kind of running average: out of 90 rows, every 3 rows in column A should be averaged, and that average should be written to the same rows in column B.
For example:
From this:
   A  B
   2  0
   3  0
   4  0
   7  0
   9  0
   8  0
to this:
   A  B
   2  3
   3  3
   4  3
   7  8
   9  8
   8  8
I tried running this code:
x = 0
for i in df['A']:
    if x < 90:
        y = (df['A'][x] + df['A'][x + 1] + df['A'][x + 2]) / 3
        df['B'][x] = y
        df['B'][x + 1] = y
        df['B'][x + 2] = y
        x = x + 3
        print(y)
It does print the correct y, but it does not change B.
I know there is a better way to do it, and if anyone knows one, it would be great if they shared it. But the more important thing for me is to understand why what I wrote has no effect on the df.
You could group by the index divided by 3, then use transform to compute the mean of those values and assign to B:
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8], 'B': [0, 0, 0, 0, 0, 0]})
df['B'] = df.groupby(df.index // 3)['A'].transform('mean')
Output:
A B
0 2 3
1 3 3
2 4 3
3 7 8
4 9 8
5 8 8
Note that this relies on the index being of the form 0,1,2,3,4,.... If that is not the case, you could either reset the index (df.reset_index(drop=True)) or use np.arange(df.shape[0]) (with numpy imported as np) instead, i.e.
df['B'] = df.groupby(np.arange(df.shape[0]) // 3)['A'].transform('mean')
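As for the "why" part of the question: df['B'][x] = y is chained indexing. Depending on the pandas version (and especially with copy-on-write enabled), that assignment can land on a temporary copy of the column rather than on df itself, which is what SettingWithCopyWarning is about, so y prints correctly but B never changes. Below is a minimal sketch of the same block-averaging loop with the assignment routed through a single indexer instead (the variable names are mine, not the OP's):

import pandas as pd

df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8], 'B': [0.0] * 6})

for start in range(0, len(df), 3):                         # walk the frame in blocks of 3 rows
    y = df['A'].iloc[start:start + 3].mean()               # average of the block
    df.iloc[start:start + 3, df.columns.get_loc('B')] = y  # single-step assignment, no chained indexing
print(df)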
i = 0
batch_size = 3
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8, 9, 10], 'B': [-1] * 8})
while i < len(df):
    j = min(i + batch_size - 1, len(df) - 1)   # label of the last row in this batch
    avg = sum(df.loc[i:j, 'A']) / (j - i + 1)  # average of the batch
    df.loc[i:j, 'B'] = [avg] * (j - i + 1)     # .loc slicing is label-based and inclusive of j
    i += batch_size
df
For the corner case where len(df) % batch_size != 0, this assumes we take the average of just the leftover rows.

Compare two pandas DataFrames in the most efficient way

Let's consider two pandas dataframes:
import numpy as np
import pandas as pd
df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
I want to do the following:
If df[1] > check_df[1] or df[2] > check_df[1] or df[3] > check_df[1], then we assign 1 to df, and 0 otherwise.
If df[2] > check_df[2] or df[3] > check_df[2] or df[4] > check_df[2], then we assign 1 to df, and 0 otherwise.
We apply the same algorithm up to the end of the DataFrame.
My primitive code is the following:
df_copy = df.copy()
for i in range(len(df) - 3):
    moving_df = df.iloc[i:i+3]
    if (moving_df > check_df.iloc[i]).any()[0]:
        df_copy.iloc[i] = 1
    else:
        df_copy.iloc[i] = -1
df_copy
0
0 -1
1 1
2 -1
3 1
4 1
5 -1
6 3
7 6
8 7
Could you please give me some advice on whether there is any possibility to do this without a loop?
IIUC, this is easily done with a rolling max:
N = 3
df['out'] = np.where(df[0].rolling(N, min_periods=1).max().shift(1-N).gt(check_df[0]),
                     1, -1)
output:
0 out
0 1 -1
1 2 1
2 3 -1
3 2 1
4 5 1
5 4 -1
6 3 1
7 6 -1
8 7 -1
To keep the last items as is:
m = df[0].rolling(N).max().shift(1-N)
df['out'] = np.where(m.gt(check_df[0]), 1, -1)
df['out'] = df['out'].mask(m.isna(), df[0])
output:
0 out
0 1 -1
1 2 1
2 3 -1
3 2 1
4 5 1
5 4 -1
6 3 1
7 6 6
8 7 7
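A quick way to see what the shift(1-N) in the answer above is doing: it turns the trailing rolling max into a forward-looking one, so each row is compared against the max of itself and the next N-1 values, and the trailing rows (whose forward window is incomplete) become NaN and fall through to -1. A small sketch with the same data, assuming N = 3:

import pandas as pd

s = pd.Series([1, 2, 3, 2, 5, 4, 3, 6, 7])
N = 3

# forward-looking max over the window [i, i+1, ..., i+N-1] for each position i
fwd_max = s.rolling(N, min_periods=1).max().shift(1 - N)
print(fwd_max.tolist())   # [3.0, 3.0, 5.0, 5.0, 5.0, 6.0, 7.0, nan, nan]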
Although @mozway has already provided a very smart solution, I would like to share my approach as well, which was inspired by this post.
You could create your own object that compares a series with a rolling series. The comparison could be performed by typical operators, i.e. >, < or ==. If at least one comparison holds, the object would return a pre-defined value (given in list returns_tf, where the first element would be returned if the comparison is true, and the second if it's false).
Possible Code:
import numpy as np
import pandas as pd
df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
class RollingComparison:
    def __init__(self, comparing_series: pd.Series, rolling_series: pd.Series, window: int):
        self.comparing_series = comparing_series.values[:-1*window]
        self.rolling_series = rolling_series.values
        self.window = window

    def rolling_window_mask(self, option: str = "smaller"):
        shape = self.rolling_series.shape[:-1] + (self.rolling_series.shape[-1] - self.window + 1, self.window)
        strides = self.rolling_series.strides + (self.rolling_series.strides[-1],)
        rolling_window = np.lib.stride_tricks.as_strided(self.rolling_series, shape=shape, strides=strides)[:-1]
        rolling_window_mask = (
            self.comparing_series.reshape(-1, 1) < rolling_window if option == "smaller" else (
                self.comparing_series.reshape(-1, 1) > rolling_window if option == "greater" else self.comparing_series.reshape(-1, 1) == rolling_window
            )
        )
        return rolling_window_mask.any(axis=1)

    def assign(self, option: str = "rolling", returns_tf: list = [1, -1]):
        mask = self.rolling_window_mask(option)
        return np.concatenate((np.where(mask, returns_tf[0], returns_tf[1]), self.rolling_series[-1*self.window:]))
The assignments can be achieved as follows:
roller = RollingComparison(check_df[0], df[0], 3)
check_df["rolling_smaller_checking"] = roller.assign(option="smaller")
check_df["rolling_greater_checking"] = roller.assign(option="greater")
check_df["rolling_equals_checking"] = roller.assign(option="equal")
Output (the column rolling_smaller_checking equals your desired output):
0 rolling_smaller_checking rolling_greater_checking rolling_equals_checking
0 3 -1 1 1
1 2 1 -1 1
2 5 -1 1 1
3 4 1 1 1
4 3 1 -1 1
5 6 -1 1 1
6 4 3 3 3
7 2 6 6 6
8 1 7 7 7

Checking the length of a part of a dataframe in conditional row selection in pandas

Suppose I have a pandas dataframe like this:
first second third
1 2 2 1
2 2 1 0
3 3 4 5
4 4 6 3
5 5 4 3
6 8 8 4
7 3 4 2
8 5 6 6
and could be created with the code:
dataframe = pd.DataFrame(
    {
        'first': [2, 2, 3, 4, 5, 8, 3, 5],
        'second': [2, 1, 4, 6, 4, 8, 4, 6],
        'third': [1, 0, 5, 3, 3, 4, 2, 6]
    }
)
I want to select the rows in which the value of the second column is greater than the value of the first column, and at the same time the values in the third column are less than the values in the second column for the k consecutive rows immediately before that row, where k can be any integer between 2 and 4 (closed interval).
So, the output should be rows:
3, 7, 8
To get the above-mentioned result using conditional row selection in pandas, I know I should write a code like this:
dataframe[(dataframe['first'] < dataframe['second']) & (second_condition)].index
But I don't know what to write for the second_condition which I have explained above. Can anyone help me with this?
The trick here is to calculate a rolling sum on a boolean mask to find out the number of values in the k previous rows where the third column is less than the second column:
k = 2
m1 = df['second'].gt(df['first'])
m2 = df['third'].lt(df['second']).shift(fill_value=0).rolling(k).sum().eq(k)
print(df[m1 & m2])
first second third
3 3 4 5
7 3 4 2
8 5 6 6
I will center my answer on the second part of your question. You need to use the shift function to compare; it allows you to shift rows.
Assuming your k is fixed at 2, you should do something like this:
import pandas as pd
df = pd.DataFrame(
    {
        'first': [2, 2, 3, 4, 5, 8, 3, 5],
        'second': [2, 1, 4, 6, 4, 8, 4, 6],
        'third': [1, 0, 5, 3, 3, 4, 2, 6]
    }
)
# this is the line
df[(df['third'] < df['second'].shift(1)) & (df['third'] < df['second'].shift(2))]
What's going on?
Start by comparing 'third' with the previous value of 'second' by shifting one row, and then shift it two places in the second condition.
Note this only works for fixed values of k. What if k is variable?
In such a case, you need to write your condition dynamically. The following code assumes that the condition must be met for all values of n in [1, k]:
k = 2 # pick any k > 1
df[~pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)]).any(level=0)].index
What's going on here? Long answer:
First, using the shift trick, we check which rows meet your criteria for each value of n in [1, k]:
In [1]: [df['third'] < df['second'].shift(n) for n in range(1, k+1)]
Out[1]:
[0 False
1 True
2 False
3 True
4 True
5 False
6 True
7 False
dtype: bool,
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 True
dtype: bool]
then, we concatenate them into a single Series, with one block of rows (repeating the original index) for each of the k values.
In [2]: pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)])
Out[2]:
0 False
1 True
2 False
3 True
4 True
5 False
6 True
7 False
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 True
dtype: bool
finally, we keep as the index all rows that meet the criteria for any value of n. So: if it is true for any n, we will return it:
In [3]: pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)]).any(level=0)
Out[3]:
0 False
1 True
2 False
3 True
4 True
5 True
6 True
7 True
dtype: bool
Then, all you need to do is to project over your original dataframe and pick up the index.
In [3]: df[~pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)]).any(level=0)].index
Out[3]: Int64Index([0, 2], dtype='int64')
Final note
If the criteria must be met for all the values n in [1, k], then replace .any with .all.
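As a concrete sketch of that swap, assuming k = 2 and the same df as above (note that on recent pandas versions .any(level=0)/.all(level=0) have been removed, and the equivalent is .groupby(level=0).any()/.all()):

import pandas as pd

df = pd.DataFrame({'first':  [2, 2, 3, 4, 5, 8, 3, 5],
                   'second': [2, 1, 4, 6, 4, 8, 4, 6],
                   'third':  [1, 0, 5, 3, 3, 4, 2, 6]})

k = 2
stacked = pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k + 1)])

any_mask = stacked.groupby(level=0).any()   # met for at least one n in [1, k]
all_mask = stacked.groupby(level=0).all()   # met for every n in [1, k]

print(df[all_mask].index.tolist())          # [4, 6] for this data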
# First condition is easy
cond1 = df["second"] > df["first"]
# Since the second condition compares the second and third columns, let's compute
# the result beforehand for convenience
s = df["third"] < df["second"]
# Now we are gonna run a rolling window over `s`. What we want is that the
# previous `k` rows of `s` are all True.
# A rolling window always ends on the current row but you want it to end on the
# previous row. So we will increase the window size by 1 and exclude the last
# element from comparison.
all_true = lambda arr: arr[:-1].all()
cond2_with_k_equal_2 = s.rolling(3).apply(all_true, raw=True)
cond2_with_k_equal_3 = s.rolling(4).apply(all_true, raw=True)
cond2_with_k_equal_4 = s.rolling(5).apply(all_true, raw=True)
cond2 = cond2_with_k_equal_2 | cond2_with_k_equal_3 | cond2_with_k_equal_4
# Or you can consolidate them into a loop
cond2 = pd.Series(False, df.index)
for k in range(2, 5):
    cond2 |= s.rolling(k+1).apply(all_true, raw=True)
# Get the result
df[cond1 & cond2]

How to create a new column through a specific condition?

I have a column like this:
1
0
0
0
0
1
0
0
0
1
0
0
and need as result:
1
1
1
1
1
2
2
2
2
3
3
3
I need a method/algorithm that splits the column into groups running from one 1 to the next and assigns each group a successive value.
Any idea?
You can loop through the list and use a counter to update the column value, incrementing it every time you find the number 1.
def rank(lst):
    counter = 0
    for i, column in enumerate(lst):
        if column == 1:
            counter += 1
        lst[i] = counter  # every element is overwritten with the current counter value
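A quick in-place check of this version (note that rank mutates the list it is given rather than returning a new one), using the column from the question:

lst = [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
rank(lst)
print(lst)   # [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]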
def fill_arr(arr):
    curr = 1
    for i in range(1, len(arr)):
        arr[i] = curr
        if i < len(arr)-1 and arr[i+1] == 1:
            curr += 1
    return arr
A quick test
arr = [1,0,0,0,1,0,0,0,1,0,0]
fill_arr(arr)
[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
The idea is as follows:
keep track of the number of 1s we encounter in curr by looking ahead and incrementing it as we see new 1s.
set the element at the current index to curr.
we start at index 1 since we know that there is a one at index zero. This helps us reduce edge cases and makes the algorithm easier to manage.
What you are looking for is usually called the cumulative sum; or, as a verb, you're looking to accumulate the values in the list.
For a python list:
import itertools
l1 = [1,0,0,0,1,0,0,0,1,0,0]
l2 = list(itertools.accumulate(l1))
print(l2)
# [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
For a numpy array:
import numpy
a1 = numpy.array([1,0,0,0,1,0,0,0,1,0,0])
a2 = a1.cumsum()
print(a2)
# [1 1 1 1 2 2 2 2 3 3 3]
For a column in a pandas dataframe:
import pandas
df = pandas.DataFrame({'col1': [1,0,0,0,1,0,0,0,1,0,0]})
df['col2'] = df['col1'].cumsum()
print(df)
# col1 col2
# 0 1 1
# 1 0 1
# 2 0 1
# 3 0 1
# 4 1 2
# 5 0 2
# 6 0 2
# 7 0 2
# 8 1 3
# 9 0 3
# 10 0 3
Documentation:
itertools.accumulate;
numpy.cumsum;
numpy.ndarray.cumsum;
pandas.DataFrame.cumsum;
pandas.Series.cumsum.

Compute the last (decimal) digit of x1 ^ (x2 ^ (x3 ^ (... ^ xn))) [duplicate]

This question already has answers here:
Last digit of power list
(2 answers)
Closed 4 years ago.
I need to find the unit digit of x1 ^ (x2 ^ (x3 ^ (... ^ xn))) from integers passed into the function as a list.
For example the input [3, 4, 2] would return 1 because 3 ^ (4 ^ 2) = 3 ^ 16 = 43046721 the last digit of which is 1.
The function needs to be as efficient as possible because obviously trying to calculate 767456 ^ 981242 is not very quick.
I have tried a few methods but I think the best way to solve this is using sequences. For example any number ending in a 1, when raised to a power, will always end in 1. For 2, the resulting number will end in either 2, 4, 6 or 8.
If a number is raised to a power, the last digit of the resulting number will follow a pattern based on the last digit of the base:
1: Sequence is 1
2: Sequence is 2, 4, 8, 6
3: Sequence is 3, 9, 7, 1
4: Sequence is 4, 6
5: Sequence is 5
6: Sequence is 6
7: Sequence is 7, 9, 3, 1
8: Sequence is 8, 4, 2, 6
9: Sequence is 9, 1
0: Sequence is 0
I think the easiest way to calculate the overall last digit is to work backwards through the list and calculate the last digit of each calculation one at a time until I get back to the start, but I am not sure how to do this.
If anyone could help or suggest another method that is equally or more efficient, that would be appreciated.
I have this code so far, but it does not work for very large numbers:
def last_digit(lst):
    if lst == []:
        return 1
    total = lst[len(lst)-2] ** lst[len(lst)-1]
    for n in reversed(range(len(lst)-2)):
        total = pow(lst[n], total)
    return total % 10
Edit: 0 ^ 0 should be assumed to be 1
x^n ends in the same digit as x^(n%4) (the n%4 == 0 case needs the adjustment described below), because the last digits always repeat with a period that divides 4.
x   ^2  ^3  ^4  ^5
1    1   1   1   1
2    4   8   6   2
3    9   7   1   3
4    6   4   6   4
5    5   5   5   5
6    6   6   6   6
7    9   3   1   7
8    4   2   6   8
9    1   9   1   9
As you can see, all 9 digits repeat with a period of 4 (or a divisor of 4), so we can use %4 to make calculations easier.
There's also a pattern if we look at x^n modulo 4.
x   ^0  ^1  ^2  ^3  ^4  ^5  ^6  ^7  ^8  ^9    (all values taken %4)
1    1   1   1   1   1   1   1   1   1   1
2    1   2   0   0   0   0   0   0   0   0
3    1   3   1   3   1   3   1   3   1   3
4    1   0   0   0   0   0   0   0   0   0
5    1   1   1   1   1   1   1   1   1   1
6    1   2   0   0   0   0   0   0   0   0
7    1   3   1   3   1   3   1   3   1   3
8    1   0   0   0   0   0   0   0   0   0
9    1   1   1   1   1   1   1   1   1   1
As shown, there is a pattern for each x when n>1. Therefore, you can see that (x^n)%4 = (x^(n+4k))%4 when n>1. We can then avoid the issues that arise from n=0 and n=1 by adding 4 to n. This is because, if (x^n)%4 = (x^(n+4k))%4, then (x^n)%4 = (x^(n%4+4))%4 as well.
powers = [3, 9, 7, 1]
lastDigit = 1
for i in range(len(powers) - 1, -1, -1):
    if lastDigit == 0:
        lastDigit = 1
    elif lastDigit == 1:
        lastDigit = powers[i]
    else:
        lastDigit = powers[i]**(lastDigit%4+4)
print(lastDigit % 10)
This is more math than programming. Notice that all the sequences you listed have length either 1, 2, or 4. More precisely, x^4 always ends with 0, 1, 5, or 6, as does x^(4k). So if you know x^(m mod 4) mod 10, you know x^m mod 10.
Now, to compute x2^(x3^(...^xn)) mod 4. The story is very similar: x^2 mod 4 is either 0 if x=2k or 1 if x=2k+1 (why?). So x2^(x3^(...^xn)) mod 4:
is 0 if x2 == 0
is 1 if x2 > 0 and x3 == 0
if x2 is even, then it is either 2 or 0, with 2 occurring only when x2 mod 4 == 2 and (x3==1 or (any of x4,...,xn == 0)).
if x2 is odd, then x2^2 mod 4 == 1, so we get 1 if x3 is even else x2 mod 4.
Enough math, let's talk coding. There might be corner cases that I haven't covered, but it should work for most cases.
def last_digit(lst):
    if len(lst) == 0:
        return 1
    x = lst[0] % 10
    if len(lst) == 1:
        return x
    # these numbers never change
    if x in [0, 1, 5, 6]:
        return x
    # now we care about lst[1] mod 4:
    x1 = lst[1] % 4
    # only lst[0] and lst[1]
    if len(lst) == 2 or x1 == 0:
        return lst[0] ** x1 % 10
    # now that lst[2] comes into the picture
    if x1 % 2:  # == 1
        x1_pow_x2 = x1 if (lst[2] % 2) else 1
    else:
        x1_pow_x2 = 2 if (x1 == 2 and lst[2] % 2 == 1) else 0
    # we're almost done:
    ret = x ** x1_pow_x2 % 10
    # now, there's a catch here: if lst[1]^(lst[2]^...^lst[n-1]) >= 4,
    # we need to multiply ret with the last digit of x ** 4
    if lst[1] >= 4 or (lst[1] > 1 and lst[2] > 1):
        ret = (ret * x**4) % 10
    return ret
Working off of your sequences idea and fleshing it out, you'd want to create a dictionary that can map all relevant sequences.
mapping = {}
for i in range(1, 10):
    mapping[i] = [i]
    last_digit = i
    while True:
        last_digit *= i
        last_digit = last_digit % 10
        if last_digit in mapping[i]:
            break
        else:
            mapping[i].append(last_digit)
print(mapping)
This produces the following mapping:
{1: [1],
2: [2, 4, 8, 6],
3: [3, 9, 7, 1],
4: [4, 6],
5: [5],
6: [6],
7: [7, 9, 3, 1],
8: [8, 4, 2, 6],
9: [9, 1]}
Now the real logic can start. The key takeaway is that the pattern repeats itself after the sequence is completed. So it does not matter how big the power is; just use a modulo to figure out which position of the sequence it should occupy.
def last_digit_func(lst, mapping):
    if lst == []:  # taken from OP
        return 1
    last_digit = lst[0] % 10
    if 0 in lst[1:]:  # edge case 0 as a power
        return 1
    if last_digit == 0:  # edge case 0 at start
        return last_digit
    for current_power in lst[1:]:
        if len(mapping[last_digit]) == 1:
            return last_digit
        ind = current_power % len(mapping[last_digit])
        ind -= 1  # zero indexing, but powers start from 1
        last_digit = mapping[last_digit][ind]
    return last_digit
test1 = [3, 4, 2]
print(last_digit_func(test1, mapping)) #prints 1
I verified this by calculating the powers in python.
test2 = [767456 , 981242]
print(last_digit_func(test2, mapping)) #prints 6
And I tried to verify this by running it in Python... I now have instant regrets, and my program is still trying to work it out. Oh well :)
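As an aside (not part of the original answers): Python's built-in three-argument pow does modular exponentiation, so both test cases can be sanity-checked directly without ever materialising the huge intermediate numbers. This works here because both towers have an exponent that is small enough to write down; the general problem still needs the digit-cycle reasoning above.

print(pow(3, 4 ** 2, 10))        # 1 -> matches the [3, 4, 2] example
print(pow(767456, 981242, 10))   # 6 -> finishes instantly, agrees with last_digit_func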
