major minor col
0     0     5
      1     6
      2     4
0     0     8
      1     5
      2     6
1     0     3
      1     6
      2     9
1     0     5
      1     1
      2     7
First, I'd like to get
major minor col
0     0     5
      1     6
      2     4
0     0     8
      1     5
      2     6
and then select over both major '0's, that is, choose the first major 0 or the second:
major minor col
0     0     5
      1     6
      2     4
or
major minor col
0     0     8
      1     5
      2     6
Unfortunately, df.xs(0, level=0, drop_level=False) doesn't quite fit the job, since it keeps the major '1' labels in the index, even though they are empty. Any ideas?
I still do not understand your data structure. I'm currently working with
                val
major minor col
0     0     5     1
      1     6     1
      2     4     1
      0     8     1
      1     5     1
      2     6     1
I still don't understand how in your case you have two major zeros, since I get only one with the same structure. Therefore, I can't tell you exactly how you could pick any of the *major*s.
Using traditional boolean slicing, you can get the rows where df.major == 0 using
df[df.major == 0]
In order to select any single one of the subgroups, it depends on how they differ. Do they have another unique feature? Then you could do
df[(df.major == 0) & (df.someColumn == someValue)]
(note the parentheses around each condition; pandas needs the element-wise & here, not Python's and). Otherwise, if you know there are 3 rows per group, df[df.major == 0].iloc[:3] (or .iloc[3:]) will give you the records.
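A runnable sketch of that idea, assuming major and minor are plain columns (if they are index levels, df.reset_index() gets you there first):
import pandas as pd

# Rebuild of the example from the question, flattened to ordinary columns;
# two consecutive blocks share major == 0.
df = pd.DataFrame({
    "major": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "minor": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
    "col":   [5, 6, 4, 8, 5, 6, 3, 6, 9, 5, 1, 7],
})

zeros = df[df.major == 0]   # rows of both major-0 blocks
first = zeros.iloc[:3]      # the first block (assuming 3 rows per block)
second = zeros.iloc[3:]     # the second block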
Also, have a look at the (currently experimental) df.query() (see the documentation).
Generally, you can do stuff such as
df[df.major == 0]
to get all the rows where the major is zero. Whether it's a (labeled) index or a normal column does not matter. You can also stack these conditions, e.g.
df[(df.major == 0) & (df.minor == 0)]
I start with
In [264]: df
Out[264]:
                val
major minor col
0     0     5     1
      1     6     1
      2     4     1
      0     8     1
      1     5     1
      2     6     1
1     0     3     1
      1     6     1
      2     9     1
      0     5     1
      1     1     1
      2     7     1
and then I do
In [263]: df.query('major == 0')
Out[263]:
                val
major minor col
0     0     5     1
      1     6     1
      2     4     1
      0     8     1
      1     5     1
      2     6     1
I want to get all the users from a dataframe where a specific column goes from 1 to 0.
For example, with the following dataframe I want to keep only users 1 and 2, as their values go from 1 to 0.
Relevant rows
Row 6 to 7 for user 1
Row 9 to 10 for user 2
user value
0 0 0
1 0 0
2 0 1
3 0 1
4 1 0
5 1 1
6 1 1
7 1 0
8 2 1
9 2 1
10 2 0
11 2 0
Desired Result
user value
4 1 0
5 1 1
6 1 1
7 1 0
8 2 1
9 2 1
10 2 0
11 2 0
I have tried window functions and conditions but for some reason I cannot get the desired result.
Let us try cummax
df.loc[df.user.isin(df.loc[df.value != df.groupby('user')['value'].cummax(),'user'])]
Out[769]:
user value
4 1 0
5 1 1
6 1 1
7 1 0
8 2 1
9 2 1
10 2 0
11 2 0
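For readability, here is the same logic as that one-liner, spelled out step by step (an equivalent reading, not a different method):
running_max = df.groupby('user')['value'].cummax()   # running max of value per user
drops = df['value'] != running_max                   # True where a 0 appears after a 1
users = df.loc[drops, 'user'].unique()               # users with at least one 1 -> 0 drop
result = df[df['user'].isin(users)]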
You can use shift to check whether the previous value is 1 (df.value.shift(1).eq(1)), and combine that with a mask checking whether the current value is 0 (df.value.eq(0)). Then, group by 'user' and transform('any') to create the appropriate mask:
filtered = df[(df.value.eq(0) & df.value.shift(1).eq(1)).groupby(df.user).transform('any')]
Output:
>>> filtered
user value
4 1 0
5 1 1
6 1 1
7 1 0
8 2 1
9 2 1
10 2 0
11 2 0
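One caveat: a plain df.value.shift(1) compares across user boundaries, so the first row of one user sees the last row of the previous user. It happens not to matter for this data, but a safer variant shifts within each group; a minimal sketch:
prev = df.groupby('user')['value'].shift(1)   # previous value, per user
mask = (df['value'].eq(0) & prev.eq(1)).groupby(df['user']).transform('any')
filtered = df[mask]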
You can use GroupBy.filter: if any diff (difference between successive values) equals -1 (i.e., 0 - 1), keep the group.
df.groupby('user').filter(lambda g: g['value'].diff().eq(-1).any())
NB: this assumes you only have 0s and 1s; if other numbers can occur, you need two explicit conditions instead: (g['value'].eq(1) & g['value'].shift(-1).eq(0)).any()
output:
user value
4 1 0
5 1 1
6 1 1
7 1 0
8 2 1
9 2 1
10 2 0
11 2 0
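For completeness, the two-condition form from the note above as a runnable call (same data assumptions):
df.groupby('user').filter(
    lambda g: (g['value'].eq(1) & g['value'].shift(-1).eq(0)).any()
)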
I have a dataframe with many rows and columns, looking like this:
index  col1  col2
1      0     1
2      5     1
3      5     4
4      5     4
5      3     4
6      2     4
7      2     1
8      2     2
I would like to keep only the values that are different from the previous row and replace the others with 0. On the example dataframe, it would be:
index  col1  col2
1      0     1
2      5     0
3      0     4
4      0     0
5      3     0
6      2     0
7      0     1
8      0     2
What is a solution that works for any number of rows and columns?
So you'd like to keep the values where the difference from the previous row is not 0 (i.e., where the value changed), and put 0 everywhere else:
>>> df.where(df.diff().ne(0), other=0)
col1 col2
index
1 0 1
2 5 0
3 0 4
4 0 0
5 3 0
6 2 0
7 0 1
8 0 2
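A self-contained version of the above, with one subtlety worth a comment:
import pandas as pd

df = pd.DataFrame(
    {"col1": [0, 5, 5, 4, 3, 2, 2, 2], "col2": [1, 1, 4, 4, 4, 4, 1, 2]},
    index=pd.Index(range(1, 9), name="index"),
)

# diff() compares each row with the previous one; the first row has no
# predecessor, yields NaN, and NaN.ne(0) is True, so it is always kept as-is.
out = df.where(df.diff().ne(0), other=0)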
I want to add a DataFrame a (containing a load profile) to some of the columns of another DataFrame b (also containing one load profile per column). So some columns (load profiles) of b should be overlaid with the load profile of a.
So let's say my DataFrames look like:
a:
P[kW]
0 0
1 0
2 0
3 8
4 8
5 0
b:
P1[kW] P2[kW] ... Pn[kW]
0 2 2 2
1 3 3 3
2 3 3 3
3 4 4 4
4 2 2 2
5 2 2 2
Now I want to overlay some columns of b:
b.iloc[:, [1]] += a.iloc[:, 0]
I would expect this:
b:
P1[kW] P2[kW] ... Pn[kW]
0 2 2 2
1 3 3 3
2 3 3 3
3 4 12 4
4 2 10 2
5 2 2 2
but what I actually get:
b:
P1[kW] P2[kW] ... Pn[kW]
0 2 nan 2
1 3 nan 3
2 3 nan 3
3 4 nan 4
4 2 nan 2
5 2 nan 2
That's not exactly what my code and data look like, but the principle is the same as in this abstract example.
Any guesses, what could be the problem?
Many thanks for any help in advance!
EDIT:
I actually have to overlay more than one column. Another example:
import pandas as pd

load = [0, 0, 0, 0, 0, 0, 0]
data = pd.DataFrame(load)
for i in range(1, 10):
    data[i] = data[0]   # duplicate the base column into columns 1..9
data
overlay = pd.DataFrame([0, 0, 0, 0, 6, 6, 0])
overlay
data.iloc[:, [1, 2, 4, 5, 7, 8]] += overlay.iloc[:, 0]
data
What?! The result looks completely wrong. Columns 1 and 2 aren't changed at all, columns 4 and 5 are changed in every row, and columns 7 and 8 are all NaN. What am I missing?
This is what I would expect the result to look like:
   0  1  2  3  4  5  6  7  8  9
0  0  0  0  0  0  0  0  0  0  0
1  0  0  0  0  0  0  0  0  0  0
2  0  0  0  0  0  0  0  0  0  0
3  0  0  0  0  0  0  0  0  0  0
4  0  6  6  0  6  6  0  6  6  0
5  0  6  6  0  6  6  0  6  6  0
6  0  0  0  0  0  0  0  0  0  0
Please do not pass the column index '1' of dataframe 'b' as a list; pass it as a plain element.
Code
b.iloc[:, 1] += a.iloc[:, 0]
b
Output
P1[kW] P2[kW] Pn[kW]
0 2 2 2
1 3 3 3
2 3 3 3
3 4 12 4
4 2 10 2
5 2 2 2
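The reason the list version fails is worth spelling out: b.iloc[:, [1]] returns a one-column DataFrame, and DataFrame + Series aligns the Series index against the DataFrame's columns. A small illustration:
# DataFrame + Series aligns on *columns*: the label 'P2[kW]' never matches
# the Series index 0..5, so the result is all NaN.
b.iloc[:, [1]] + a.iloc[:, 0]

# b.iloc[:, 1] is a Series, and Series + Series aligns on the row index:
b.iloc[:, 1] + a.iloc[:, 0]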
Edit
It seems this is what we are looking for, i.e., adding the overlay df to certain columns of the data df.
Two Options
Option 1
cols=[1,2,4,5,7,8]
data[cols] = data[cols] + overlay.values
data
Option 2, if we want to use iloc
cols=[1,2,4,5,7,8]
data[cols] = data.iloc[:,cols] + overlay.iloc[:].values
data
Output
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 6 6 0 6 6 0 6 6 0
5 0 6 6 0 6 6 0 6 6 0
6 0 0 0 0 0 0 0 0 0 0
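Incidentally, the surprising result from the edit above is the same alignment rule at work: data.iloc[:, [1, 2, 4, 5, 7, 8]] is a DataFrame, so adding the Series overlay.iloc[:, 0] matches the Series index (0..6) against the column labels, roughly:
# column 1 gets overlay[1] == 0 added to every row (looks unchanged)
# column 2 gets overlay[2] == 0 added to every row (looks unchanged)
# column 4 gets overlay[4] == 6 added to every row
# column 5 gets overlay[5] == 6 added to every row
# columns 7 and 8 have no matching labels in overlay's index, hence NaN
Using .values, as both options above do, sidesteps this label alignment entirely.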
I have the dataframe shown below. The column named target is my desired column:
group value target
1 1 0
1 2 0
1 3 2
1 4 0
1 5 1
2 1 0
2 2 0
2 3 0
2 4 1
2 5 3
Now I want to find the first non-zero value in the target column for each group and remove rows before that row in each group. So the output should be like this:
group value target
1 3 2
1 4 0
1 5 1
2 4 1
2 5 3
I have seen this post, but I don't know how to change the code to get my desired result.
How can I do this?
In the groupby, set sort to False, get the cumsum, then filter for the rows where it is not equal to 0:
df.loc[df.groupby(["group"], sort=False).target.cumsum() != 0]
group value target
2 1 3 2
3 1 4 0
4 1 5 1
8 2 4 1
9 2 5 3
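Spelled out, the trick is that within each group the cumulative sum of target stays 0 until the first non-zero value appears:
csum = df.groupby(["group"], sort=False)["target"].cumsum()  # 0 until the first non-zero target
out = df.loc[csum != 0]                                      # drops the leading zero rows per group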
This should do. I'm sure you can do it with fewer reset_index() calls, but that shouldn't affect speed too much if your dataframe isn't too big:
# index label of the first non-zero target row within each group
idx = dff[dff.target.ne(0)].reset_index().groupby('group')['index'].first()
# keep rows whose original position is at or after that first non-zero row
mask = dff.reset_index().set_index('group')['index'].ge(idx).values
df_final = dff[mask]
Output:
    group  value  target
3       1      3       2
4       1      4       0
5       1      5       1
9       2      4       1
10      2      5       3
I have the following data set:
PID,RUN_START_DATE,PUSHUP_START_DATE,SITUP_START_DATE,PULLUP_START_DATE
1,2013-01-24,2013-01-02,,2013-02-03
2,2013-01-30,2013-01-21,2013-01-13,2013-01-06
3,2013-01-29,2013-01-28,2013-01-01,2013-01-29
4,2013-02-16,2013-02-12,2013-01-04,2013-02-11
5,2013-01-06,2013-02-07,2013-02-25,2013-02-12
6,2013-01-26,2013-01-28,2013-02-12,2013-01-10
7,2013-01-26,,2013-01-12,2013-01-30
8,2013-01-03,2013-01-24,2013-01-19,2013-01-02
9,2013-01-22,2013-01-13,2013-02-03,
10,2013-02-06,2013-01-16,2013-02-07,2013-01-11
I know I can use numpy.argsort to return the sorted indexes of the values:
SQ_AL_INDX = numpy.argsort(df_sequence[['RUN_START_DATE', 'PUSHUP_START_DATE', 'SITUP_START_DATE', 'PULLUP_START_DATE']], axis=1)
...returns...
RUN_START_DATE PUSHUP_START_DATE SITUP_START_DATE PULLUP_START_DATE
0 2 1 0 3
1 3 2 1 0
2 2 1 0 3
3 2 3 1 0
4 0 1 3 2
5 3 0 1 2
6 1 2 0 3
7 3 0 2 1
8 3 1 0 2
9 3 1 0 2
But it seems to put pandas.NaT values into the first position. So in this example, where PID == 1, the sort order returns 2 1 0 3, yet the index placed first (2, i.e. SITUP_START_DATE) refers to a pandas.NaT value.
How can I get the sorted indexes while skipping the pandas.NaT values (e.g., the return index values would be 2 1 np.NaN 3 or 2 1 pandas.NaT 3 or better yet 1 0 2 for PID 1 instead of 2 1 0 3)?
Pass numpy.argsort to the apply method instead of using it directly. This way, NaNs/NaTs persist. For your example:
In [2]: df_sequence[['RUN_START_DATE', 'PUSHUP_START_DATE', 'SITUP_START_DATE', 'PULLUP_START_DATE']].apply(numpy.argsort, axis=1)
Out[2]:
RUN_START_DATE PUSHUP_START_DATE SITUP_START_DATE PULLUP_START_DATE
0 1 0 NaN 2
(etc.)
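A self-contained reproduction of that call, using the first rows of the data set above (note that exactly how the NaT slot is represented in the result may vary across pandas versions):
import io
import numpy
import pandas as pd

csv_text = '''PID,RUN_START_DATE,PUSHUP_START_DATE,SITUP_START_DATE,PULLUP_START_DATE
1,2013-01-24,2013-01-02,,2013-02-03
2,2013-01-30,2013-01-21,2013-01-13,2013-01-06
'''
cols = ['RUN_START_DATE', 'PUSHUP_START_DATE', 'SITUP_START_DATE', 'PULLUP_START_DATE']
df_sequence = pd.read_csv(io.StringIO(csv_text), parse_dates=cols)

# Applied row-wise via apply, the missing SITUP date for PID 1 stays missing
# in the result instead of being sorted into the first position.
ranks = df_sequence[cols].apply(numpy.argsort, axis=1)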