Multiply each column by their relative factor using prefix column name - python

I have a matrix like
id | v1_m1  v2_m1  v3_m1  f_m1  v1_m2  v2_m2  v3_m2  f_m2
 1 |  0      0.5    0.5    4     0.1    0.3    0.6    4
 2 |  0.3    0.3    0.4    8     0.2    0.4    0.4    7
What I want is to multiply each "v" column with the suffix "_m1" by the f_m1 column, and each "v" column with the suffix "_m2" by the f_m2 column.
The output that I expect is something like this:
id | v1_m1  v2_m1  v3_m1  v1_m2  v2_m2  v3_m2
 1 |  0      2      2      0.4    1.2    2.4
 2 |  2.4    2.4    3.2    1.4    2.8    2.8

# assuming maxm/maxv are the largest suffix numbers
for m in range(1, maxm + 1):
    for i in range(1, maxv + 1):
        df["v{}_m{}".format(i, m)] = df["v{}_m{}".format(i, m)] * df["f_m{}".format(m)]
df = df.drop(columns=["f_m{}".format(m) for m in range(1, maxm + 1)])

You could do this with some fancy dataframe reshaping:
df.columns = pd.MultiIndex.from_arrays(list(zip(*df.columns.str.split('_'))))
df = df.stack()
df_mul = df.filter(like='v').mul(df.filter(like='f').squeeze(), axis=0)
df_mul = df_mul.unstack().sort_index(level=1, axis=1)
df_mul.columns = [f'{i}_{j}' for i, j in df_mul.columns]
df_mul
Output:
    v1_m1  v2_m1  v3_m1  v1_m2  v2_m2  v3_m2
id
1     0.0    2.0    2.0    0.4    1.2    2.4
2     2.4    2.4    3.2    1.4    2.8    2.8
Details:
- Create MultiIndex column headers by splitting the names on '_'.
- Reshape the dataframe, stacking the m# level to rows, leaving four columns: one f and three v's.
- Using filter, select the v columns and multiply by the f series, created by selecting the single f column and using squeeze to turn a one-column dataframe into a pd.Series.
- unstack the m# level back to columns.
- Flatten the MultiIndex column header back to a single level using an f-string in a list comprehension.

Assuming that your matrix is a pandas dataframe called df, here is my nomination: a list-comprehension approach, if you enjoy them.
import itertools

items = [(i[0][0], i[0][1].multiply(i[1][1]))
         for i in itertools.product(df.items(), repeat=2)
         if i[0][0][-2:] == i[1][0][-2:]
         and i[1][0][:1] == 'f'
         and i[0][0][:1] != 'f']
df_mul = pd.DataFrame.from_dict({i[0]: i[1] for i in items})
It should be fast on larger versions of this problem, though note that the cross-product grows quadratically with the number of columns.
Explanation -
- Creates a generator for the cross-product between each pair of columns as (c1, c2) tuples.
- Keeps only the pairs where the last two characters are the same for both c1 and c2, AND c2 starts with 'f', AND c1 doesn't start with 'f' (leaving you with the column pairs you want to operate on as individual tuples), something like this: [('v1_m1', 'f_m1'), ('v2_m1', 'f_m1'), ('v1_m2', 'f_m2')]
- Multiplies the columns, attaches a column name, and saves them as items (similar in structure to df.items())
- Turns the items into a dataframe
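As a cross-check, a plainer sketch of the same task is to loop over the suffixes, multiply the matching v columns, and drop the f columns (the dataframe below just rebuilds the question's example; the suffix list is an assumption):

```python
import pandas as pd

# Rebuild the example matrix from the question
df = pd.DataFrame({'v1_m1': [0.0, 0.3], 'v2_m1': [0.5, 0.3], 'v3_m1': [0.5, 0.4],
                   'f_m1': [4, 8],
                   'v1_m2': [0.1, 0.2], 'v2_m2': [0.3, 0.4], 'v3_m2': [0.6, 0.4],
                   'f_m2': [4, 7]},
                  index=pd.Index([1, 2], name='id'))

for suffix in ['m1', 'm2']:  # assumed list of m-group suffixes
    vcols = [c for c in df.columns if c.startswith('v') and c.endswith('_' + suffix)]
    df[vcols] = df[vcols].mul(df['f_' + suffix], axis=0)  # scale the v's by their factor

df = df.drop(columns=['f_m1', 'f_m2'])  # the factor columns are no longer needed
```

This reproduces the expected output above without any reshaping, at the cost of an explicit loop over the suffixes.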

Related

Pandas Slice Columns and select subsets based on between condition

I have a dataframe as follows:
                     100  105  110
timestamp
2020-11-01 12:00:00  0.2  0.5  0.1
2020-11-01 12:01:00  0.3  0.8  0.2
2020-11-01 12:02:00  0.8  0.9  0.4
2020-11-01 12:03:00  1    0    0.4
2020-11-01 12:04:00  0    1    0.5
2020-11-01 12:05:00  0.5  1    0.2
I want to select columns from the dataframe where the values are greater than or equal to 0.5 and less than or equal to 1, and I want the index/timestamp at which these occurrences happened. Each column could have multiple such occurrences. So 100, say, can be between 0.5 and 1 from 12:00 to 12:03 and then again from 12:20 to 12:30. It needs to reset when it hits 0. The column names are variable.
I also want the length of time for which the column value was between 0.5 and 1; from the above, that was 3 minutes and then 10 minutes.
The expected output would be with a dict for ranges the indexes appeared in:
                     100  105  110
timestamp
2020-11-01 12:00:00  NaN  0.5  NaN
2020-11-01 12:01:00  NaN  0.8  NaN
2020-11-01 12:02:00  0.8  0.9  NaN
2020-11-01 12:03:00  1    NaN  NaN
2020-11-01 12:04:00  NaN  1    0.5
2020-11-01 12:05:00  0.5  1    NaN
and probably a way to calculate the minutes which could be in a dict/list of dicts:
{"105":
   [{"from": "2020-11-01 12:00:00", "to": "2020-11-01 12:02:00"},
    {"from": "2020-11-01 12:04:00", "to": "2020-11-01 12:05:00"}],
 ...
}
Essentially I want the dicts at the end to evaluate.
Basically, it would be best if you got the ordered sequence of timestamps first; then you can manipulate it to get the differences. If the question is only about pandas slicing and not about timestamp operations, the following gives the matching timestamps (since timestamp is the index here, we take .index rather than a column, and combine the two conditions instead of chaining the selections):
df[(df["100"] >= 0.5) & (df["100"] <= 1)].index.values
Pandas data frame comparison operations
For pandas data frames, normal comparison operations are overridden. If you do dataframe_instance >= 0.5, the result is a sequence of boolean values; each value in the sequence results from comparing an individual data frame value to 0.5.
Pandas data frame slicing
This boolean sequence can be used to filter a subsequence from your data frame. It is possible because pandas slicing is also overridden and implemented as a filtering operation.
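Beyond the slicing itself, the masking plus run-detection the question asks for can be sketched as follows. The example data and the [0.5, 1] band are from the question; the run-labelling trick via cumsum is the only new piece:

```python
import pandas as pd

# Rebuild the question's dataframe (timestamps form the index)
df = pd.DataFrame(
    {"100": [0.2, 0.3, 0.8, 1.0, 0.0, 0.5],
     "105": [0.5, 0.8, 0.9, 0.0, 1.0, 1.0],
     "110": [0.1, 0.2, 0.4, 0.4, 0.5, 0.2]},
    index=pd.date_range("2020-11-01 12:00", periods=6, freq="min"),
)

# Keep only values in [0.5, 1]; everything else becomes NaN
masked = df.where((df >= 0.5) & (df <= 1))

# For each column, label consecutive non-NaN runs and record their spans
ranges = {}
for col in masked:
    s = masked[col]
    ok = s.notna()
    run_id = (ok != ok.shift()).cumsum()[ok]  # increments whenever the mask flips
    ranges[col] = [{"from": str(g.index[0]), "to": str(g.index[-1])}
                   for _, g in s[ok].groupby(run_id)]
```

masked reproduces the expected NaN table, and ranges["105"] contains the two from/to spans shown above; the durations then follow from subtracting the timestamps.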

Only want to consider a dataframe up to the present point

I have a dataframe and I am trying to do something along the lines of
df['foo'] = np.where(myfunc(df) == 1, 10, 20)
but I only want to consider the dataframe up to the present, for example if my dataframe looked like
A B C
1 0.3 0.3 1.6
2 0.6 0.6 0.4
3 0.9 0.9 1.2
4 1.2 1.2 0.8
and I was generating the value of 'foo' for the third row, I would be looking at the dataframe's first through third rows, but not the fourth row. Is it possible to accomplish this?
It is certainly possible. The dataframe up to the present row is given by
df.iloc[:present]
and you can do whatever you want with it, in particular use where, as described here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html
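A minimal sketch of that idea, using a hypothetical myfunc (anything that maps a partial dataframe to 0 or 1) and generating foo row by row from df.iloc[:i + 1]:

```python
import pandas as pd

df = pd.DataFrame({"A": [0.3, 0.6, 0.9, 1.2],
                   "B": [0.3, 0.6, 0.9, 1.2],
                   "C": [1.6, 0.4, 1.2, 0.8]})

# Hypothetical condition: 1 if the running mean of column A so far exceeds 0.5
def myfunc(partial):
    return 1 if partial["A"].mean() > 0.5 else 0

# For each row i, evaluate myfunc only on the rows up to and including i
df["foo"] = [10 if myfunc(df.iloc[:i + 1]) == 1 else 20
             for i in range(len(df))]
```

For the third row this evaluates myfunc on rows one through three only, exactly as the question requires; the fourth row is never seen at that point.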

Manual Feature Engineering in Pandas - Mean of 1 Column vs All Other Columns

Hard to describe this one, but: for every column in a dataframe, create a new column that contains the mean of the current column and the one next to it, then the mean of that first column and the next one down the line. Running Python 3.6.
For example, given this dataframe:
I would like to get this output:
That exact order of the added columns at the end isn't important, but it needs to be able to handle every possible combination of means between all columns, with a depth of 2 (i.e. compare one column to another). Ideally, I would like to have the depth set as a separate variable, so I could have a depth of 3, where it would do this but compare 3 columns to one another.
Ideas? Thanks!
UPDATE
I got this to work, but I'm wondering if there's a computationally faster way of doing it. I basically just created two nested loops to compare one column against the rest, skipping same-column comparisons:
eng_features = pd.DataFrame()
for col in df.columns:
    for col2 in df.columns:
        # Don't compare same columns, or inversed same columns
        if col == col2 or (str(col2) + '_' + str(col)) in eng_features:
            continue
        eng_features[str(col) + '_' + str(col2)] = df[[col, col2]].mean(axis=1)
df = pd.concat([df, eng_features], axis=1)
Use itertools, a Python built-in utility package for iterators:
from itertools import permutations

for col1, col2 in permutations(df.columns, r=2):
    df[f'Mean_of_{col1}-{col2}'] = df[[col1, col2]].mean(axis=1)
and you will get what you need:
a b c Mean_of_a-b Mean_of_a-c Mean_of_b-a Mean_of_b-c Mean_of_c-a \
0 1 1 0 1.0 0.5 1.0 0.5 0.5
1 0 1 0 0.5 0.0 0.5 0.5 0.0
2 1 1 0 1.0 0.5 1.0 0.5 0.5
Mean_of_c-b
0 0.5
1 0.5
2 0.5
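Since the mean is symmetric, the permutations version computes every pair twice (Mean_of_a-b equals Mean_of_b-a). If only unique pairs are wanted, as in the original nested loop, combinations can be swapped in; a sketch on the same example data:

```python
import pandas as pd
from itertools import combinations

df = pd.DataFrame({'a': [1, 0, 1], 'b': [1, 1, 1], 'c': [0, 0, 0]})

# combinations yields each unordered pair exactly once, so no reversed duplicates
for col1, col2 in combinations(df.columns, r=2):
    df[f'Mean_of_{col1}-{col2}'] = df[[col1, col2]].mean(axis=1)
```

For the "depth of 3" variant, combinations(df.columns, r=3) with df[list(cols)].mean(axis=1) generalises the same idea.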

How to replace values in a 4-dimensional array?

Let's say I have a 4D array with shape
(600, 1, 3, 3)
If you take the first 2 elements, they may look like this:
0 1 0
1 1 1
0 1 0
2 2 2
3 3 3
1 1 1
etc
I have a list that contains certain weights with which I want to replace specific values in the array. My intention is to use the index of each list element to match the value in the array. Therefore, this list
[0.1, 1.1, 1.2, 1.3]
when applied against my array would give this result:
0.1 1.1 0.1
1.1 1.1 1.1
0.1 1.1 0.1
1.2 1.2 1.2
1.3 1.3 1.3
1.1 1.1 1.1
etc
This method would have to run through the entire 600 elements of the array.
I can do this in a clunky way using a for loop and array[array==x] = y or np.place, but I wanted to avoid a loop and perhaps use a method that replaces all values at once. Is there such an approach?
Quoting from #Divakar's solution in the comments, which solves the issue in a very efficient manner:
Simply index into the array version: np.asarray(vals)[idx], where vals is the list and idx is the array. Or use np.take(vals, idx) to do the array conversion under the hood.
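A quick check of that advice on the question's example, with vals as the weight list and idx as (the first two blocks of) the integer array:

```python
import numpy as np

vals = [0.1, 1.1, 1.2, 1.3]

# The first two (1, 3, 3) blocks of the (600, 1, 3, 3) array
idx = np.array([[[[0, 1, 0],
                  [1, 1, 1],
                  [0, 1, 0]]],
                [[[2, 2, 2],
                  [3, 3, 3],
                  [1, 1, 1]]]])

out = np.take(vals, idx)  # looks up vals[i] for every i in idx, keeping idx's shape
```

np.asarray(vals)[idx] produces the identical result; both replace the loop over all 600 elements with a single vectorised call.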

Find points in cells through pandas dataframes of coordinates

I have to find which points are inside a grid of square cells, given the points coordinates and the coordinates of the bounds of the cells, through two pandas dataframes.
I'm calling dfc the dataframe containing the code and the boundary coordinates of the cells (I have simplified the problem; in the real analysis I have a big grid of geographical cells and tons of points to check):
Code,minx,miny,maxx,maxy
01,0.0,0.0,2.0,2.0
02,2.0,2.0,3.0,3.0
and dfp the dataframe containing an Id and the coordinates of the points:
Id,x,y
0,1.5,1.5
1,1.1,1.1
2,2.2,2.2
3,1.3,1.3
4,3.4,1.4
5,2.0,1.5
Now I would like to perform a search returning, in the dfp dataframe, a new column (called 'GridCode') with the code of the cell each point falls in. The cells should be perfectly square, so I would like to perform an analysis through:
a = np.where((dfp['x'] > dfc['minx']) &
             (dfp['x'] < dfc['maxx']) &
             (dfp['y'] > dfc['miny']) &
             (dfp['y'] < dfc['maxy']),
             dfc['Code'],
             'na')
avoiding several loops over the dataframes. The lengths of the dataframes are not the same. The resulting dataframe should be as follows:
Id x y GridCode
0 0 1.5 1.5 01
1 1 1.1 1.1 01
2 2 2.2 2.2 02
3 3 1.3 1.3 01
4 4 3.4 1.4 na
5 5 2.0 1.5 na
Thanks in advance for your help!
Probably a better way, but since this has been sitting out there for a while...
Use pandas boolean indexing to filter the dfc data frame instead of np.where():
def findGrid(dfp):
    c = dfc[(dfp['x'] > dfc['minx']) &
            (dfp['x'] < dfc['maxx']) &
            (dfp['y'] > dfc['miny']) &
            (dfp['y'] < dfc['maxy'])].Code
    if len(c) == 0:
        return None
    return c.iat[0]
Then use the pandas apply() function
dfp['GridCode'] = dfp.apply(findGrid,axis=1)
Will yield this
Id x y GridCode
0 0 1.5 1.5 1
1 1 1.1 1.1 1
2 2 2.2 2.2 2
3 3 1.3 1.3 1
4 4 3.4 1.4 NaN
5 5 2.0 1.5 NaN
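If apply() ever becomes the bottleneck, one vectorised alternative is to broadcast every point against every cell with NumPy; a sketch on the question's data (the strict < and > reproduce the 'na' rows for points lying on a cell edge):

```python
import numpy as np
import pandas as pd

dfc = pd.DataFrame({'Code': ['01', '02'],
                    'minx': [0.0, 2.0], 'miny': [0.0, 2.0],
                    'maxx': [2.0, 3.0], 'maxy': [2.0, 3.0]})
dfp = pd.DataFrame({'Id': range(6),
                    'x': [1.5, 1.1, 2.2, 1.3, 3.4, 2.0],
                    'y': [1.5, 1.1, 2.2, 1.3, 1.4, 1.5]})

# Compare every point with every cell at once: boolean matrix of shape (n_points, n_cells)
x = dfp['x'].to_numpy()[:, None]
y = dfp['y'].to_numpy()[:, None]
inside = ((x > dfc['minx'].to_numpy()) & (x < dfc['maxx'].to_numpy()) &
          (y > dfc['miny'].to_numpy()) & (y < dfc['maxy'].to_numpy()))

# Take the first matching cell per point, 'na' where no cell matches
hit = inside.any(axis=1)
dfp['GridCode'] = np.where(hit, dfc['Code'].to_numpy()[inside.argmax(axis=1)], 'na')
```

This allocates an n_points x n_cells boolean matrix, so it trades memory for speed; for very large grids a spatial index would be the next step.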
