Only want to consider a dataframe up to the present point - python

I have a dataframe and I am trying to do something along the lines of
df['foo'] = np.where(myfunc(df) == 1, 10, 20)
but I only want to consider the dataframe up to the present. For example, if my dataframe looked like
     A    B    C
1  0.3  0.3  1.6
2  0.6  0.6  0.4
3  0.9  0.9  1.2
4  1.2  1.2  0.8
and I was generating the value of 'foo' for the third row, I would be looking at the dataframe's first through third rows, but not the fourth row. Is it possible to accomplish this?

It is certainly possible. The dataframe up to (and including) the current row position is given by
df.iloc[:present + 1]
(iloc slices exclude the stop position, hence the + 1), and you can do whatever you want with that slice; in particular, you can use where, as described here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html
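As a minimal sketch of how that slice can drive a row-by-row computation (myfunc below is a hypothetical stand-in for whatever you compute over the rows seen so far):

import pandas as pd

df = pd.DataFrame({'A': [0.3, 0.6, 0.9, 1.2],
                   'B': [0.3, 0.6, 0.9, 1.2],
                   'C': [1.6, 0.4, 1.2, 0.8]})

def myfunc(partial):
    # placeholder logic: 1 if the running mean of C exceeds 1, else 0
    return 1 if partial['C'].mean() > 1 else 0

# row i only "sees" rows 0..i, because df.iloc[:i + 1] stops at row i
df['foo'] = [10 if myfunc(df.iloc[:i + 1]) == 1 else 20
             for i in range(len(df))]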

Related

Pandas Slice Columns and select subsets based on between condition

I have a dataframe as follows:
                     100  105  110
timestamp
2020-11-01 12:00:00  0.2  0.5  0.1
2020-11-01 12:01:00  0.3  0.8  0.2
2020-11-01 12:02:00  0.8  0.9  0.4
2020-11-01 12:03:00  1.0  0.0  0.4
2020-11-01 12:04:00  0.0  1.0  0.5
2020-11-01 12:05:00  0.5  1.0  0.2
I want to select columns of the dataframe where the values are greater than or equal to 0.5 and less than or equal to 1, and I want the index/timestamp at which these occurrences happened. Each column could have multiple such occurrences: column 100 could be between 0.5 and 1 from 12:00 to 12:03, and then again from 12:20 to 12:30. A run resets when the value hits 0. The column names are variable.
I also want the duration for which each column's value stayed between 0.5 and 1; in that example it would be 3 minutes, and then 10 minutes.
The expected output would be the masked dataframe, plus a dict of the ranges in which each column's values appeared:
                     100  105  110
timestamp
2020-11-01 12:00:00  NaN  0.5  NaN
2020-11-01 12:01:00  NaN  0.8  NaN
2020-11-01 12:02:00  0.8  0.9  NaN
2020-11-01 12:03:00  1.0  NaN  NaN
2020-11-01 12:04:00  NaN  1.0  0.5
2020-11-01 12:05:00  0.5  1.0  NaN
and probably a way to calculate the minutes, which could be in a dict / list of dicts:
{"105":
 [{"from": "2020-11-01 12:00:00", "to": "2020-11-01 12:02:00"},
  {"from": "2020-11-01 12:04:00", "to": "2020-11-01 12:05:00"}],
 ...
}
Essentially, these dicts are what I want to end up with.
Basically, it would be best if you got the ordered sequence of timestamps; then you can manipulate it to get the differences. If the question is only about Pandas slicing and not about timestamp operations, the operation you need looks like this (the timestamps are the index, so take .index rather than a column, and combine the two conditions with & so they stay aligned):
df.loc[(df["100"] >= 0.5) & (df["100"] <= 1)].index.values
Pandas data frame comparison operations
For Pandas data frames, the normal comparison operators are overridden. If you do dataframe_instance >= 0.5, the result is a sequence of boolean values; an individual value in the sequence results from comparing an individual data frame value to 0.5.
Pandas data frame slicing
This boolean sequence can be used to filter a subsequence from your data frame. That is possible because Pandas slicing is also overridden and implemented as a filtering operation driven by that boolean mask.
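To go all the way to the masked frame and the from/to ranges, here is a minimal sketch (assuming df has a DatetimeIndex and the numeric columns shown above; the shift/cumsum run labelling is one common idiom, not the only option):

import pandas as pd

mask = df.ge(0.5) & df.le(1)   # True where a value lies in [0.5, 1]
masked = df.where(mask)        # values outside the range become NaN

ranges = {}
for col in df.columns:
    m = mask[col]
    run_id = (m != m.shift()).cumsum()   # label each contiguous run of True
    times = df.index.to_series()[m]
    for _, block in times.groupby(run_id[m]):
        ranges.setdefault(col, []).append(
            {"from": str(block.iloc[0]), "to": str(block.iloc[-1])})

The duration of each run is then pd.Timestamp(r["to"]) - pd.Timestamp(r["from"]) for each entry r in a column's list.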

Multiply each column by their relative factor using prefix column name

I have a matrix like
id | v1_m1  v2_m1  v3_m1  f_m1  v1_m2  v2_m2  v3_m2  f_m2
1  |   0.0    0.5    0.5     4    0.1    0.3    0.6     4
2  |   0.3    0.3    0.4     8    0.2    0.4    0.4     7
What I want is to multiply each of the v columns with the suffix "_m1" by the f_m1 column, and all the v columns with the suffix "_m2" by the f_m2 column.
The output that I expect is something like this:
id | v1_m1  v2_m1  v3_m1  v1_m2  v2_m2  v3_m2
1  |   0.0    2.0    2.0    0.4    1.2    2.4
2  |   2.4    2.4    3.2    1.4    2.8    2.8
What I have is a pair of loops:
for m in range(1, maxm + 1):
    for i in range(1, maxv + 1):
        df["v{}_m{}".format(i, m)] = df["v{}_m{}".format(i, m)] * df["f_m{}".format(m)]
for m in range(1, maxm + 1):
    df = df.drop(columns=["f_m{}".format(m)])
You could do this with some fancy dataframe reshaping:
df.columns = pd.MultiIndex.from_arrays(list(zip(*df.columns.str.split('_'))))
df = df.stack()
df_mul = df.filter(like='v').mul(df.filter(like='f').squeeze(), axis=0)
df_mul = df_mul.unstack().sort_index(level=1, axis=1)
df_mul.columns = [f'{i}_{j}' for i, j in df_mul.columns]
df_mul
Output:
    v1_m1  v2_m1  v3_m1  v1_m2  v2_m2  v3_m2
id
1     0.0    2.0    2.0    0.4    1.2    2.4
2     2.4    2.4    3.2    1.4    2.8    2.8
Details:
- Create MultiIndex column headers by splitting the original headers on '_'
- Reshape the dataframe, stacking the m# level to rows, leaving four columns: f and the three v's
- Using filter, select the v columns and multiply by the f series, which is created by selecting the single f column and using squeeze to turn a one-column dataframe into a pd.Series
- unstack the m# level back to columns
- Flatten the MultiIndex column header back to a single level using an f-string in a list comprehension
Assuming that your matrix is a pandas dataframe called df, I would like to nominate a list comprehension approach, if you enjoy them.
import itertools

items = [(i[0][0], i[0][1].multiply(i[1][1]))
         for i in itertools.product(df.items(), repeat=2)
         if i[0][0][-2:] == i[1][0][-2:]   # same _m# suffix
         and i[1][0][:1] == 'f'            # second column is a factor
         and i[0][0][:1] != 'f']           # first column is a v
df_mul = pd.DataFrame.from_dict({i[0]: i[1] for i in items})
It should be quite fast on larger versions of this problem, though note that it enumerates all pairs of columns, so it scales quadratically with the column count.
Explanation -
- Creates a generator for the cross-product between the columns, as (c1, c2) tuples
- Keeps only the pairs where the last 2 characters are the same for both c1 and c2, AND c2 starts with 'f', AND c1 doesn't start with 'f', leaving you with the column pairs you want to operate on as individual tuples, something like [('v1_m1', 'f_m1'), ('v2_m1', 'f_m1'), ('v1_m2', 'f_m2')]
- Multiplies the columns, attaches a column name, and saves them as items (similar in structure to df.items())
- Turns the items into a dataframe

How to find the minimum distance when two points belong to the same distance

I have a dataframe like this:
A B
1 0.1
1 0.2
1 0.3
2 0.2
2 0.5
2 0.3
3 0.8
3 0.6
3 0.1
How can I find the minimum B value belonging to each point 1, 2, 3 in A, with the constraint that there is no conflict, meaning points 1 and 2 should not both be assigned the same value, e.g. 0.3?
If I understand correctly, you want to do two things:
- find the minimum B per distinct A, and
- make sure that they don't collide. You didn't specify what to do in case of collision, so I assume you just want to know if there is one.
The first can be achieved with Rarblack's answer (though you should use min and not max in your case).
For the second, you can use the .nunique() method: check how many unique B values there are (this should equal the number of unique A values).
import pandas as pd

# set up the example dataframe
df = pd.DataFrame.from_dict({
    'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'B': [0.1, 0.2, 0.3, 0.2, 0.5, 0.3, 0.8, 0.6, 0.1]
})
# find the minimum B per A
x = df.groupby('A')['B'].min()
# assert that there are no collisions
if x.nunique() != len(x):
    print("Conflicting values")
You can use groupby and the max function:
df.groupby('A').B.max()

Print Tab Separated rows of Python Dataframe

I want to print the following dataframe as a tab-delimited string:
sku  ids  output
1    a    0.1
2    b    0.2
3    d    0.4
Output:
1 a 0.1
2 b 0.2
3 d 0.4
It should be an iterative process that prints all the rows. I have tried str.join(), but it is not giving me the output that I am looking for. Any help would be appreciated. Thanks.
Apply a function on each row:
def applytab(row):
    print('\t'.join(map(str, row.values)))

# print('\t'.join(map(str, df.columns)))  # to print the column names first, if required
df.apply(applytab, axis=1)
Output
1   a   0.1
2   b   0.2
3   d   0.4
I am very new to Pandas/Dataframes and my answer can certainly be improved, but one way to achieve your required result is the following:
def printDataFrame(df):
    for i in range(len(df.index)):
        row = list(df.iloc[i])
        print("\t".join(map(str, row)))

printDataFrame(df)
This function loops through all the rows; for each row, it joins the elements with a tab and prints the result as a single string.
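For completeness, pandas can also emit tab-separated text directly, without an explicit loop: to_csv returns a string when no path is given, and the header and index can be suppressed to match the expected output above.

print(df.to_csv(sep='\t', index=False, header=False))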

How to replace values in a 4-dimensional array?

Let's say I have a 4D array whose shape is (600, 1, 3, 3).
If you take the first 2 elements they may look like this:
0 1 0
1 1 1
0 1 0
2 2 2
3 3 3
1 1 1
etc
I have a list that contains certain weights with which I want to replace specific values in the array. My intention is to use the index of each list element to match the corresponding value in the array. Therefore, this list
[0.1 1.1 1.2 1.3]
when applied against my array would give this result:
0.1 1.1 0.1
1.1 1.1 1.1
0.1 1.1 0.1
1.2 1.2 1.2
1.3 1.3 1.3
1.1 1.1 1.1
etc
This method would have to run through the entire 600 elements of the array.
I can do this in a clunky way using a for loop and array[array==x] = y or np.place, but I wanted to avoid a loop and perhaps use a method that replaces all the values at once. Is there such an approach?
Quoting from #Divakar's solution in the comments, which solves the issue in a very efficient manner:
Simply index into the array version: np.asarray(vals)[idx], where vals is the list and idx is the array. Or use np.take(vals, idx) to do the array conversion under the hood.
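A minimal runnable sketch of that idea on the example above (a small first dimension stands in for the 600; this assumes the array holds small non-negative integers that are valid indices into the list):

import numpy as np

arr = np.array([[[[0, 1, 0],
                  [1, 1, 1],
                  [0, 1, 0]]],
                [[[2, 2, 2],
                  [3, 3, 3],
                  [1, 1, 1]]]])   # shape (2, 1, 3, 3)
vals = [0.1, 1.1, 1.2, 1.3]

# fancy indexing replaces every value v with vals[v] in one vectorized step
out = np.asarray(vals)[arr]      # equivalently: np.take(vals, arr)
print(out[0, 0])                 # [[0.1 1.1 0.1] [1.1 1.1 1.1] [0.1 1.1 0.1]]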
