I am trying to select individual rows from a MultiIndex dataframe using a list of index tuples.
For example, I have the following dataframe:
Col1
A B C
1 1 1 -0.148593
2 2.043589
2 3 -1.696572
4 -0.249049
2 1 5 2.012294
6 -1.756410
2 7 0.476035
8 -0.531612
I would like to select all 'C' values where (A, B) is in [(1, 1), (2, 2)]:
Col1
A B C
1 1 1 -0.148593
2 2.043589
2 2 7 0.476035
8 -0.531612
My flawed code for this is as follows:
import pandas as pd
import numpy as np
arrays = [np.array([1, 1, 1, 1, 2, 2, 2, 2]), np.array([1, 1, 2, 2, 1, 1, 2, 2]), np.array([1, 2, 3, 4, 5, 6, 7, 8])]
df = pd.DataFrame(np.random.randn(8), index=arrays, columns=['Col1'])
df.rename_axis(['A','B','C'], inplace=True)
print(df)
idx_lst = [(1,1), (2,2)]
test = df.loc(axis=0)[idx_lst]
print(test)
One option is to use pd.DataFrame.query:
res = df.query('((A == 1) & (B == 1)) | ((A == 2) & (B == 2))')
print(res)
Col1
A B C
1 1 1 0.981483
2 0.851543
2 2 7 -0.522760
8 -0.332099
For a more generic solution, you can use f-strings (Python 3.6+), which should perform better than str.format or manual concatenation.
filters = [(1,1), (2,2)]
filterstr = '|'.join(f'(A=={i})&(B=={j})' for i, j in filters)
res = df.query(filterstr)
print(filterstr)
(A==1)&(B==1)|(A==2)&(B==2)
The following might help:
idx_lst = [(1,1), (2,2)]
df.loc(0)[[ z for z in df.index if (z[0], z[1]) in idx_lst ]]
# Out[941]:
# Col1
# A B C
# 1 1 1 0.293952
# 2 0.197045
# 2 2 7 2.007493
# 8 0.937420
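Another option, not used in the question, is to drop the C level from the index and build a boolean mask with MultiIndex.isin; this is a sketch assuming the levels are named 'A', 'B' and 'C' as above:
mask = df.index.droplevel('C').isin(idx_lst)  # True where the (A, B) pair is in idx_lst
print(df[mask])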
I am trying to compute a kind of running block average: out of 90 rows, every 3 values in column A should be averaged, and that average should fill those same 3 rows in column B.
For example:
From this:
df =
   A  B
   2  0
   3  0
   4  0
   7  0
   9  0
   8  0
to this:
df =
   A  B
   2  3
   3  3
   4  3
   7  8
   9  8
   8  8
I tried running this code:
x = 0
for i in df['A']:
    if x < 90:
        y = (df['A'][x] + df['A'][x + 1] + df['A'][x + 2]) / 3
        df['B'][x] = y
        df['B'][x + 1] = y
        df['B'][x + 2] = y
        x = x + 3
        print(y)
It does print the correct y, but it does not change column B.
I know there is a better way to do this, and if anyone knows one it would be great if they shared it. But the more important thing for me is to understand why what I wrote has no effect on the df.
You could group by the index floor-divided by 3, then use transform to compute the mean of each group and assign it to B:
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8], 'B': [0, 0, 0, 0, 0, 0]})
df['B'] = df.groupby(df.index // 3)['A'].transform('mean')
Output:
A B
0 2 3
1 3 3
2 4 3
3 7 8
4 9 8
5 8 8
Note that this relies on the index being of the form 0, 1, 2, 3, 4, .... If that is not the case, you could either reset the index (df.reset_index(drop=True)) or use np.arange(df.shape[0]) instead, i.e.
df['B'] = df.groupby(np.arange(df.shape[0]) // 3)['A'].transform('mean')
i = 0
batch_size = 3
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8, 9, 10], 'B': [-1] * 8})
while i < len(df):
    j = min(i + batch_size - 1, len(df) - 1)
    avg = sum(df.loc[i:j, 'A']) / (j - i + 1)
    df.loc[i:j, 'B'] = [avg] * (j - i + 1)
    i += batch_size
df
For the corner case where len(df) % batch_size != 0, this assumes we take the average of the leftover rows.
Let's consider two pandas dataframes:
import numpy as np
import pandas as pd
df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
I want to do the following:
If df[1] > check_df[1] or df[2] > check_df[1] or df[3] > check_df[1], then we assign 1 to df, and 0 otherwise.
If df[2] > check_df[2] or df[3] > check_df[2] or df[4] > check_df[2], then we assign 1 to df, and 0 otherwise.
We apply the same algorithm until the end of the DataFrame.
My primitive code is the following:
df_copy = df.copy()
for i in range(len(df) - 3):
    moving_df = df.iloc[i:i+3]
    if (moving_df > check_df.iloc[i]).any()[0]:
        df_copy.iloc[i] = 1
    else:
        df_copy.iloc[i] = -1
df_copy
0
0 -1
1 1
2 -1
3 1
4 1
5 -1
6 3
7 6
8 7
Could you please advise whether there is any possibility to do this without a loop?
IIUC, this is easily done with a rolling max:
N = 3  # window size
df['out'] = np.where(df[0].rolling(N, min_periods=1).max().shift(1-N).gt(check_df[0]), 1, -1)
output:
0 out
0 1 -1
1 2 1
2 3 -1
3 2 1
4 5 1
5 4 -1
6 3 1
7 6 -1
8 7 -1
To keep the last items as-is:
m = df[0].rolling(N).max().shift(1-N)
df['out'] = np.where(m.gt(check_df[0]), 1, -1)
df['out'] = df['out'].mask(m.isna(), df[0])
output:
0 out
0 1 -1
1 2 1
2 3 -1
3 2 1
4 5 1
5 4 -1
6 3 1
7 6 6
8 7 7
Although @mozway has already provided a very smart solution, I would like to share my approach as well, which was inspired by this post.
You could create your own object that compares a series with a rolling series. The comparison could be performed by typical operators, i.e. >, < or ==. If at least one comparison holds, the object would return a pre-defined value (given in list returns_tf, where the first element would be returned if the comparison is true, and the second if it's false).
Possible Code:
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])

class RollingComparison:
    def __init__(self, comparing_series: pd.Series, rolling_series: pd.Series, window: int):
        self.comparing_series = comparing_series.values[:-1*window]
        self.rolling_series = rolling_series.values
        self.window = window

    def rolling_window_mask(self, option: str = "smaller"):
        shape = self.rolling_series.shape[:-1] + (self.rolling_series.shape[-1] - self.window + 1, self.window)
        strides = self.rolling_series.strides + (self.rolling_series.strides[-1],)
        rolling_window = np.lib.stride_tricks.as_strided(self.rolling_series, shape=shape, strides=strides)[:-1]
        rolling_window_mask = (
            self.comparing_series.reshape(-1, 1) < rolling_window if option == "smaller" else (
                self.comparing_series.reshape(-1, 1) > rolling_window if option == "greater"
                else self.comparing_series.reshape(-1, 1) == rolling_window
            )
        )
        return rolling_window_mask.any(axis=1)

    def assign(self, option: str = "rolling", returns_tf: list = [1, -1]):
        mask = self.rolling_window_mask(option)
        return np.concatenate((np.where(mask, returns_tf[0], returns_tf[1]), self.rolling_series[-1*self.window:]))
The assignments can be achieved as follows:
roller = RollingComparison(check_df[0], df[0], 3)
check_df["rolling_smaller_checking"] = roller.assign(option="smaller")
check_df["rolling_greater_checking"] = roller.assign(option="greater")
check_df["rolling_equals_checking"] = roller.assign(option="equal")
Output (the column rolling_smaller_checking equals your desired output):
0 rolling_smaller_checking rolling_greater_checking rolling_equals_checking
0 3 -1 1 1
1 2 1 -1 1
2 5 -1 1 1
3 4 1 1 1
4 3 1 -1 1
5 6 -1 1 1
6 4 3 3 3
7 2 6 6 6
8 1 7 7 7
I'm using Flask and getting an error at set_value. I'm reading the input from HTML and passing it to the code:
@app.route('/home', methods=['POST'])
def first():
    source = request.files['first']
    destination = request.files['second']
    df = pd.read_csv(source)
    df1 = pd.read_csv(destination)
    val1 = int(request.form['val1'])
    val2 = int(request.form['val2'])
    val3 = int(request.form['val3'])
    target = request.form['str']
    df2 = df[df.columns[val2]]
    count = 0
    for j in df[df.columns[val1]]:
        x = df1.loc[df1[df1.columns[val3]] == j].index.values
        for i in x:
            df1.set_value(i, target, df2[count])
        count = count + 1
    df1.to_csv('result.csv', index=False)
Check your pandas version.
df.set_value() has been deprecated since pandas 0.21.0 and was removed in pandas 1.0.
Instead, use df.at:
import pandas as pd
df = pd.DataFrame({"A":[1, 5, 3, 4, 2],
"B":[3, 2, 4, 3, 4],
"C":[2, 2, 7, 3, 4],
"D":[4, 3, 6, 12, 7]})
df.at[2,'B']=100
A B C D
0 1 3 2 4
1 5 2 2 3
2 3 100 7 6
3 4 3 3 12
4 2 4 4 7
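Applied to the loop from the question, the deprecated call could be replaced roughly like this (a sketch reusing the question's own variable names, not a tested drop-in):
for i in x:
    df1.at[i, target] = df2[count]  # .at sets a single cell by row label and column label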
I need to populate a dataframe with a matrix built from a single list, but the math and python syntax are beyond me. I essentially need to perform some math operations as if the same list were both the rows and the columns.
So it should look something like this....
#Input
list = [1,2,3,4]
create a matrix using some math on the list, like matrix[i,j] = list[i] * list[j]
#output
np.matrix([[1,2,3,4], [2,4,6,8], [3,6,9,12], [4,8,12,16]])
df = pd.dataframe[np.matrix]
Broadcasted multiplication will work here:
arr = np.array([1, 2, 3, 4])
pd.DataFrame(arr * arr[:,None])
0 1 2 3
0 1 2 3 4
1 2 4 6 8
2 3 6 9 12
3 4 8 12 16
Alternatively, most NumPy binary ufuncs (such as np.multiply) define an .outer method:
pd.DataFrame(np.multiply.outer(arr, arr))
0 1 2 3
0 1 2 3 4
1 2 4 6 8
2 3 6 9 12
3 4 8 12 16
data = [1,2,3,4]
Nested for loops would work:
import numpy as np

a = []
for n in data:
    row = []
    for m in data:
        math = some_operation_on(m, n)  # placeholder: e.g. m * n for the example above
        row.append(math)
    a.append(row)
a = np.array(a)
For simple operations like your example, use numpy.meshgrid.
In [21]: a = [1,2,3,4]
In [22]: x,y = np.meshgrid(a,a)
In [23]: x*y
Out[23]:
array([[ 1, 2, 3, 4],
[ 2, 4, 6, 8],
[ 3, 6, 9, 12],
[ 4, 8, 12, 16]])
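To get back to a DataFrame, as the question asks, the meshgrid (or broadcast) result can simply be wrapped; a minimal sketch:
import pandas as pd
df = pd.DataFrame(x * y)  # x, y from the meshgrid example above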
I'd like to create a new dataframe using the same values from another dataframe, unless there is a 0 value. If there is a 0 value, I'd like to find the average of the entry before and after.
For Example:
df =
   A  B  C
   5  2  1
   3  4  5
   2  1  0
   6  8  7
I'd like the result to look like the df below:
df_new =
   A  B  C
   5  2  1
   3  4  5
   2  1  6
   6  8  7
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[5, 3, 2, 6], 'B':[2, 4, 1, 8], 'C':[1, 5, 0, 7]})
Nrows = len(df)
def run(col):
    originalValues = list(df[col])
    values = list(np.where(np.array(list(df[col])) == 0)[0])
    # only replace zeros that have a neighbour on both sides
    indices2replace = filter(lambda x: 0 < x < Nrows - 1, values)
    for index in indices2replace:
        originalValues[index] = 0.5 * (originalValues[index + 1] + originalValues[index - 1])
    return originalValues
newDF = pd.DataFrame([run(col) for col in df.columns]).transpose()
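A more vectorized sketch of the same idea, using mask with shifted rows (assuming, as in the example, that zeros never occur in the first or last row):
df_new = df.mask(df.eq(0), (df.shift() + df.shift(-1)) / 2)
print(df_new)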