I'm stuck and need some help. I have the following dataframe:
+-----+---+---+
|     | A | B |
+-----+---+---+
| 288 | 1 | 4 |
+-----+---+---+
| 245 | 2 | 3 |
+-----+---+---+
| 543 | 3 | 6 |
+-----+---+---+
| 867 | 1 | 9 |
+-----+---+---+
| 345 | 2 | 7 |
+-----+---+---+
| 122 | 3 | 8 |
+-----+---+---+
| 233 | 1 | 1 |
+-----+---+---+
| 346 | 2 | 6 |
+-----+---+---+
| 765 | 3 | 3 |
+-----+---+---+
Column A has repeating values as shown. For each row, I want to add a new column C holding the B value from the next row that has the same value in column A (NaN when there is no later occurrence), as shown below:
+-----+---+---+-----+
| | A | B | C |
+-----+---+---+-----+
| 288 | 1 | 4 | 9 |
+-----+---+---+-----+
| 245 | 2 | 3 | 7 |
+-----+---+---+-----+
| 543 | 3 | 6 | 8 |
+-----+---+---+-----+
| 867 | 1 | 9 | 1 |
+-----+---+---+-----+
| 345 | 2 | 7 | 6 |
+-----+---+---+-----+
| 122 | 3 | 8 | 3 |
+-----+---+---+-----+
| 233 | 1 | 1 | NaN |
+-----+---+---+-----+
| 346 | 2 | 6 | NaN |
+-----+---+---+-----+
| 765 | 3 | 3 | NaN |
+-----+---+---+-----+
Thanks.
Assuming that val is one of the repeated values,
slice = df.loc[df.A == val, 'B'].shift(-1)
will create a Series in which each value has been shifted up to the index of the previous occurrence of val (the last occurrence gets NaN).
Since the slices cover disjoint index values, you can use pandas.concat to stitch them together without fear of losing data, then attach the result as a new column:
df['C'] = pd.concat([df.loc[df['A'] == x, 'B'].shift(-1) for x in [1, 2, 3]])
When the column is assigned, the index values will make everything line up:
A B C
0 1 4 9.0
1 2 3 7.0
2 3 6 8.0
3 1 9 1.0
4 2 7 6.0
5 3 8 3.0
6 1 1 NaN
7 2 6 NaN
8 3 3 NaN
Reverse the dataframe order, apply a groupby transform with the shift function, and reverse it back:
df = df[::-1]
df['C'] = df.groupby(df.columns[0]).transform('shift')
df = df[::-1]
df
A B C
0 1 4 9.0
1 2 3 7.0
2 3 6 8.0
3 1 9 1.0
4 2 7 6.0
5 3 8 3.0
6 1 1 NaN
7 2 6 NaN
8 3 3 NaN
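For reference, the same result can be obtained in one line, without reversing the frame or listing the repeated values by hand; a minimal sketch assuming the columns are named A and B as above:
df['C'] = df.groupby('A')['B'].shift(-1)
Within each group of equal A values, shift(-1) pulls the next B value up to the current row, which is exactly the C column shown.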
More visually, I would like to move from this dataframe:
|   | A\B\C\D | Unnamed:1 | Unnamed:2 | Unnamed:3 | Unnamed:4 |
|---|---------|-----------|-----------|-----------|-----------|
| 0 | 1\2\3\4 | NaN       | NaN       | NaN       | NaN       |
| 1 | 1\2\3\4 | NaN       | NaN       | NaN       | NaN       |
| 2 | a\2\7\C | NaN       | NaN       | NaN       | NaN       |
| 3 | d\2\u\4 | NaN       | NaN       | NaN       | NaN       |
to this one:
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 |
| 1 | 1 | 2 | 3 | 4 |
| 2 | a | 2 | 7 | C |
| 3 | d | 2 | u | 4 |
Thanks!
Try splitting the values first and then splitting the column name:
df2 = df.iloc[:, 0].str.split('\\', expand=True)
df2.columns = df.columns[0].split('\\')
df2
result:
A B C D
0 1 2 3 4
1 1 2 3 4
2 a 2 7 C
3 d 2 u 4
You can use DataFrame constructor:
out = pd.DataFrame(df.iloc[:, 0].str.split('\\').tolist(),
                   columns=df.columns[0].split('\\'))
print(out)
# Output
A B C D
0 1 2 3 4
1 1 2 3 4
2 a 2 7 C
3 d 2 u 4
The real question is: why do you have such input in the first place? Are you reading your data from a CSV file without using the right separator?
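If that is the case, reading the file with the proper separator avoids the split step entirely; a minimal sketch, where data.csv is a hypothetical file name and the file is assumed to be backslash-delimited:
import pandas as pd
# assumption: the source file really uses '\' as its delimiter
df = pd.read_csv('data.csv', sep='\\')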
The challenge I have, and don't know how to approach, is to average five, ten, or however many rows above each target value, plus the target row itself.
Dataset
target | A | B |
----------------------
nan | 6 | 4 |
nan | 2 | 7 |
nan | 4 | 9 |
nan | 7 | 3 |
nan | 3 | 7 |
nan | 6 | 8 |
nan | 7 | 6 |
53 | 4 | 5 |
nan | 6 | 4 |
nan | 2 | 7 |
nan | 3 | 3 |
nan | 4 | 9 |
nan | 7 | 3 |
nan | 3 | 7 |
51 | 1 | 3 |
Desired format:
target | A    | B    |
----------------------
 53    | 5.16 | 6.33 |
 51    | 3.33 | 5.33 |
Try this: [::-1] reverses the dataframe so it runs bottom to top, which lets us group each valid target with the rows "above" it:
df.groupby(df['target'].notna()[::-1].cumsum()[::-1]).apply(lambda x: x.tail(6).mean())
Output:
target A B
target
1 51.0 3.333333 5.333333
2 53.0 5.166667 6.333333
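To take five, ten, or any other number of rows above the target, only the tail size changes; a minimal sketch assuming you want n rows above plus the target row itself:
n = 5  # rows above the target
groups = df['target'].notna()[::-1].cumsum()[::-1]
out = df.groupby(groups).apply(lambda g: g.tail(n + 1).mean())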
I have the following pandas dataframe, where the first column is the datetime index. I am trying to produce the desired_output column, which increments every time the flag changes from 0 to 1 or from 1 to 0. I have done this kind of thing in SQL, but pandasql's sqldf strangely changes the values of the field being partitioned, so I am now trying to achieve it with regular pandas/Python syntax.
Any help would be much appreciated.
+-------------+------+----------------+
| date(index) | flag | desired_output |
+-------------+------+----------------+
| 1/01/2020 | 0 | 1 |
| 2/01/2020 | 0 | 1 |
| 3/01/2020 | 0 | 1 |
| 4/01/2020 | 1 | 2 |
| 5/01/2020 | 1 | 2 |
| 6/01/2020 | 0 | 3 |
| 7/01/2020 | 1 | 4 |
| 8/01/2020 | 1 | 4 |
| 9/01/2020 | 1 | 4 |
| 10/01/2020 | 1 | 4 |
| 11/01/2020 | 1 | 4 |
| 12/01/2020 | 1 | 4 |
| 13/01/2020 | 0 | 5 |
| 14/01/2020 | 0 | 5 |
| 15/01/2020 | 0 | 5 |
| 16/01/2020 | 0 | 5 |
| 17/01/2020 | 1 | 6 |
| 18/01/2020 | 0 | 7 |
| 19/01/2020 | 0 | 7 |
| 20/01/2020 | 0 | 7 |
| 21/01/2020 | 0 | 7 |
| 22/01/2020 | 1 | 8 |
| 23/01/2020 | 1 | 8 |
+-------------+------+----------------+
Use diff and cumsum:
print (df["flag"].diff().ne(0).cumsum())
0 1
1 1
2 1
3 2
4 2
5 3
6 4
7 4
8 4
9 4
10 4
11 4
12 5
13 5
14 5
15 5
16 6
17 7
18 7
19 7
20 7
21 8
22 8
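To attach it as the new column from the question:
# diff() is non-zero (or NaN on the first row) wherever the flag changes,
# ne(0) marks those change points as True, and cumsum() numbers the runs
df["desired_output"] = df["flag"].diff().ne(0).cumsum()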
How can I achieve the following:
I have a table like so:
|----------------------|
| Date | A | B | C | D |
|------+---+---+---+---|
| 2000 | 1 | 2 | 5 | 4 |
|------+---+---+---+---|
| 2001 | 2 | 2 | 7 | 4 |
|------+---+---+---+---|
| 2002 | 3 | 1 | 7 | 7 |
|------+---+---+---+---|
| 2003 | 4 | 1 | 5 | 7 |
|----------------------|
and turn it into a multi-index type dataframe:
|------------------------------------|
| Column Name | Date | Value | C | D |
|-------------+------+-------+---+---|
| A | 2000 | 1 | 5 | 4 |
| |------+-------+---+---|
| | 2001 | 2 | 7 | 4 |
| |------+-------+---+---|
| | 2002 | 3 | 7 | 7 |
| |------+-------+---+---|
| | 2003 | 4 | 5 | 7 |
|-------------+------+-------+---+---|
| B | 2000 | 2 | 5 | 4 |
| |------+-------+---+---|
| | 2001 | 2 | 7 | 4 |
| |------+-------+---+---|
| | 2002 | 1 | 7 | 7 |
| |------+-------+---+---|
| | 2003 | 1 | 5 | 7 |
|------------------------------------|
I have tried using the melt function on the dataframe but could not figure out how to achieve this desired look. I think I would also then have to apply a groupby to the melted dataframe.
You can use melt with set_index. By passing C and D as id_vars, those columns keep their structure; then just set the columns of interest as the index to get a MultiIndex dataframe:
df.melt(id_vars=['Date', 'C', 'D']).set_index(['variable', 'Date'])
C D value
variable Date
A 2000 5 4 1
2001 7 4 2
2002 7 7 3
2003 5 7 4
B 2000 5 4 2
2001 7 4 2
2002 7 7 1
2003 5 7 1
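If you also want the first index level to be labelled Column Name as in the desired layout, melt's var_name parameter handles the renaming:
df.melt(id_vars=['Date', 'C', 'D'], var_name='Column Name').set_index(['Column Name', 'Date'])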
I have a DataFrame that looks something like this:
|   | event_type | object_id |
|---|------------|-----------|
| 0 | A          | 1         |
| 1 | D          | 1         |
| 2 | A          | 1         |
| 3 | D          | 1         |
| 4 | A          | 2         |
| 5 | A          | 2         |
| 6 | D          | 2         |
| 7 | A          | 3         |
| 8 | D          | 3         |
| 9 | A          | 3         |
What I want is, for each row, the index of the next row where event_type is A and object_id is still the same; as an additional column this would look like this:
|   | event_type | object_id | next_A |
|---|------------|-----------|--------|
| 0 | A          | 1         | 2      |
| 1 | D          | 1         | 2      |
| 2 | A          | 1         | NaN    |
| 3 | D          | 1         | NaN    |
| 4 | A          | 2         | 5      |
| 5 | A          | 2         | NaN    |
| 6 | D          | 2         | NaN    |
| 7 | A          | 3         | 9      |
| 8 | D          | 3         | 9      |
| 9 | A          | 3         | NaN    |
and so on.
I want to avoid using .apply() because my DataFrame is quite large; is there a vectorized way to do this?
EDIT: for multiple A/D pairs for the same object_id, I'd like it to always use the next index of A, like this:
|   | event_type | object_id | next_A |
|---|------------|-----------|--------|
| 0 | A          | 1         | 2      |
| 1 | D          | 1         | 2      |
| 2 | A          | 1         | 4      |
| 3 | D          | 1         | 4      |
| 4 | A          | 1         | NaN    |
You can do it with groupby like:
def populate_next_a(object_df):
    # index of each 'A' row, NaN elsewhere
    object_df['a_index'] = pd.Series(object_df.index, index=object_df.index)[object_df.event_type == 'A']
    # backfill so every row sees the index of the current or next 'A' row
    object_df['a_index'] = object_df['a_index'].bfill()
    # 'A' rows should point at the following 'A' index rather than their own
    object_df['next_A'] = object_df['a_index'].where(object_df.event_type != 'A', object_df['a_index'].shift(-1))
    object_df = object_df.drop('a_index', axis=1)
    return object_df
result = df.groupby(['object_id']).apply(populate_next_a)
print(result)
event_type object_id next_A
0 A 1 2.0
1 D 1 2.0
2 A 1 NaN
3 D 1 NaN
4 A 2 5.0
5 A 2 NaN
6 D 2 NaN
7 A 3 9.0
8 D 3 9.0
9 A 3 NaN
GroupBy.apply will not have as much overhead as a simple apply.
Note that you cannot (yet) store integers alongside NaN (http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na), so the indices end up as float values.
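If you want to avoid apply altogether, here is a vectorized sketch of the same idea (a_idx and shifted are just intermediate names introduced for illustration):
# index of each 'A' row, NaN elsewhere
a_idx = df.index.to_series().where(df['event_type'] == 'A')
# within each object_id, look one row ahead for an 'A' index, then backfill it
shifted = a_idx.groupby(df['object_id']).shift(-1)
df['next_A'] = shifted.groupby(df['object_id']).bfill()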