I have a dataframe that looks like this:
Step  Text      Parameter
15    print     1
16    control   2
17    printout  3
18    print2    1
19    Nan       2
20    Nan       3
21    Nan       4
22    Nan       1
23    Nan       2
24    Nan       1
And I want my dataframe to look like this:
Step  Text      Parameter
15    print     1
15    print     2
15    print     3
16    control   1
16    control   2
17    control   3
17    control   4
18    printout  1
18    printout  2
19    print2    1
So basically, whenever the Parameter column is 1, I need to move on to the next value of Step and Text.
Any ideas?:)
You can use repeat on a custom group:
# ensure NaN
df['Text'] = df['Text'].replace('Nan', pd.NA)
# get the number of rows per group starting with 1
n = df.groupby(df['Parameter'].eq(1).cumsum()).size()
# repeat the index of the non NaN values as many times
idx = df['Text'].dropna().index.repeat(n)
# replace the values ignoring the index
# (using the underlying numpy array)
df[['Step', 'Text']] = df.loc[idx, ['Step', 'Text']].to_numpy()
output:
Step Text Parameter
0 15 print 1
1 15 print 2
2 15 print 3
3 16 control 1
4 16 control 2
5 16 control 3
6 16 control 4
7 17 printout 1
8 17 printout 2
9 18 print2 1
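For anyone who wants to run this end to end, here is a self-contained sketch of the same idea. The frame is rebuilt from the question's table (missing Text stored as None), and the names group_id, sizes and src are only illustrative. It assumes there are exactly as many non-null Text rows as there are groups starting with Parameter == 1.

```python
import pandas as pd

# Input rebuilt from the question; missing Text values are stored as None.
df = pd.DataFrame({
    "Step": list(range(15, 25)),
    "Text": ["print", "control", "printout", "print2"] + [None] * 6,
    "Parameter": [1, 2, 3, 1, 2, 3, 4, 1, 2, 1],
})

# A new group starts at every row where Parameter == 1.
group_id = df["Parameter"].eq(1).cumsum()

# Number of rows in each group, in group order.
sizes = df.groupby(group_id).size()

# Repeat the label of the k-th non-null Text row once per row of group k.
src = df.index[df["Text"].notna()].repeat(sizes.to_numpy())

# Overwrite by position; to_numpy() drops the index so nothing is aligned.
df[["Step", "Text"]] = df.loc[src, ["Step", "Text"]].to_numpy()
df["Step"] = df["Step"].astype(int)

print(df)
```

If the counts do not match (more groups than non-null Text rows, or the reverse), the repeat call or the assignment will raise, so it is worth checking len(sizes) against df['Text'].notna().sum() first.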
My dataframe with Quarter and Week as MultiIndex:
Quarter Week X Y Z
Q1 Q1-W01 1 1 1
Q1-W02 2 2 2
Q1-W03 3 3 3
Q1-W04 4 4 4
Q2 Q2-W15 15 15 15
Q2-W16 16 16 16
Q2-W17 17 17 17
Q2-W18 18 18 18
I am trying to add the last row in Q1 (Q1-W04) to all the rows in Q2 (Q2-W15 through Q2-W18). This is what I would like the dataframe to look like:
Quarter Week X Y Z
Q1 Q1-W01 1 1 1
Q1-W02 2 2 2
Q1-W03 3 3 3
Q1-W04 4 4 4
Q2 Q2-W15 19 19 19
Q2-W16 20 20 20
Q2-W17 21 21 21
Q2-W18 22 22 22
When I specify only the level 0 index and add the specific row, all Q2 values become NaN:
df.loc['Q2'] += df.loc['Q1','Q1-W04']
Quarter Week X Y Z
Q1 Q1-W01 1 1 1
Q1-W02 2 2 2
Q1-W03 3 3 3
Q1-W04 4 4 4
Q2 Q2-W15 NaN NaN NaN
Q2-W16 NaN NaN NaN
Q2-W17 NaN NaN NaN
Q2-W18 NaN NaN NaN
I have figured out that if I specify both the level 0 and level 1 index, there is no problem.
df.loc['Q2','Q2-W15'] += df.loc['Q1','Q1-W04']
Quarter Week X Y Z
Q1 Q1-W01 1 1 1
Q1-W02 2 2 2
Q1-W03 3 3 3
Q1-W04 4 4 4
Q2 Q2-W15 19 19 19
Q2-W16 16 16 16
Q2-W17 17 17 17
Q2-W18 18 18 18
Is there a way to sum the specific row to all the rows within the Q2 Level 0 index without having to call out each row individually by its level 1 index?
Any insight/guidance would be greatly appreciated.
Thank you.
Try this:
df.loc['Q2'] = (df.loc['Q2'] + df.loc['Q1', 'Q1-W04']).values.tolist()
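df.loc['Q2'] returns a DataFrame indexed only by Week, so assigning the sum back most likely tries to align it with the two-level index, finds no matching labels, and writes NaN. A plain list or array carries no index, so the values are written by position. Hence the above.
In your case, we should remove the effect of the index: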
df.loc['Q2','Q2-W15'] += df.loc['Q1','Q1-W04'].values
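As a runnable illustration of taking the index out of the assignment, here is a minimal sketch that rebuilds the frame from the question and selects the Q2 rows with a boolean mask instead of df.loc['Q2']. The mask approach and the names is_q2 and last_q1 are my own choices, not something from the answers above.

```python
import pandas as pd

# Frame rebuilt from the question, with Quarter and Week as a MultiIndex.
index = pd.MultiIndex.from_tuples(
    [("Q1", f"Q1-W0{i}") for i in range(1, 5)]
    + [("Q2", f"Q2-W{i}") for i in range(15, 19)],
    names=["Quarter", "Week"],
)
values = [1, 2, 3, 4, 15, 16, 17, 18]
df = pd.DataFrame({"X": values, "Y": values, "Z": values}, index=index)

# The row to add: a Series indexed by the column names X, Y, Z.
last_q1 = df.loc[("Q1", "Q1-W04")]

# Boolean mask over the rows whose level-0 label is Q2.
is_q2 = df.index.get_level_values("Quarter") == "Q2"

# The Series is broadcast across rows (it aligns on columns);
# to_numpy() then strips the index so the write-back is positional.
df.loc[is_q2] = (df.loc[is_q2] + last_q1).to_numpy()

print(df)
```

The Q1 rows are left untouched, and every Q2 row has the Q1-W04 row added to it.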
I have a dataframe like the one below. The goal of this program is to replace some specific values with the previous one.
import pandas as pd
test = pd.DataFrame([2,2,3,1,1,2,4,6,43,23,4,1,3,3,1,1,1,4,5], columns = ['A'])
If one wants to replace every 1 with the previous value, a possible solution is:
for li in test[test['A'] == 1].index:
    test['A'].iloc[li] = test['A'].iloc[li-1]
However, it is very inefficient. Can you suggest a more efficient solution?
IIUC, replace 1 with np.nan, then ffill (this needs import numpy as np):
test.replace(1,np.nan).ffill().astype(int)
Out[881]:
A
0 2
1 2
2 3
3 3
4 3
5 2
6 4
7 6
8 43
9 23
10 4
11 4
12 3
13 3
14 3
15 3
16 3
17 4
18 5
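A self-contained variant of the same idea, as a sketch: it uses Series.mask instead of replace so that no NumPy import is needed. It assumes the first value is not 1, since there would be nothing to fill it from.

```python
import pandas as pd

test = pd.DataFrame(
    [2, 2, 3, 1, 1, 2, 4, 6, 43, 23, 4, 1, 3, 3, 1, 1, 1, 4, 5],
    columns=["A"],
)

# Blank out every 1 (it becomes NaN), carry the last kept value forward,
# then go back to integers because NaN forced the column to float.
result = test["A"].mask(test["A"].eq(1)).ffill().astype(int)

print(result.tolist())
```

If the column could start with 1, the leading NaN would survive ffill and astype(int) would raise, so that case needs its own rule (for example a fillna with a default).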
I have a dataset with a number of values like below.
>>> a.head()
value freq
3 9 1
2 11 1
0 12 4
1 15 2
I need to fill in the values between the integers in the value column. For example, I need to insert one new row between 9 & 11 filled with zeroes, then another two between 12-15. The end result should be the dataset with 9-15 with 'missing' rows as zeroes across the board.
Is there any way to insert a new row at a specific location without replacing data? The only methods I've found involve slicing the dataframe at a location, appending a new row, and then concatenating the remainder.
UPDATE: The index is completely irrelevant so don't worry about that.
You didn't say what should happen to your Index, so I'm assuming it's unimportant.
In [12]: df.index = df['value']
In [15]: df.reindex(np.arange(df.value.min(), df.value.max() + 1)).fillna(0)
Out[15]:
value freq
value
9 9 1
10 0 0
11 11 1
12 12 4
13 0 0
14 0 0
15 15 2
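If the value column itself should stay filled in (10, 13, 14 rather than 0), one possible variation is to move value into the index before reindexing and bring it back afterwards. This is a sketch under that assumption; the names full_range and filled are illustrative.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"value": [9, 11, 12, 15], "freq": [1, 1, 4, 2]})

# Every integer from the smallest to the largest value, inclusive.
full_range = np.arange(a["value"].min(), a["value"].max() + 1)

filled = (
    a.set_index("value")
     .reindex(full_range, fill_value=0)  # new rows get freq 0
     .rename_axis("value")               # make sure the index keeps its name
     .reset_index()
)

print(filled)
```

Using fill_value=0 in reindex, rather than a later fillna(0), keeps freq as an integer column.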
Another option is to create a second dataframe with values from min to max, and outer join this to your dataframe:
import pandas as pd
a = pd.DataFrame({'value':[9,11,12,15], 'freq':[1,1,4,2]})
# value freq
#0 9 1
#1 11 1
#2 12 4
#3 15 2
b = pd.DataFrame({'value': range(a.value.min(), a.value.max()+1)})
# value
#0 9
#1 10
#2 11
#3 12
#4 13
#5 14
#6 15
a = pd.merge(left=a, right=b, on='value', how='outer').fillna(0).sort_values(by='value')
# value freq
#0 9 1.0
#4 10 0.0
#1 11 1.0
#2 12 4.0
#5 13 0.0
#6 14 0.0
#3 15 2.0
I want to make a new column holding the 5-day return for a stock, let's say. I am using a pandas dataframe. I computed a moving average using the rolling_mean function, but I'm not sure how to reference rows the way I would in a spreadsheet (B6-B1, for example). Does anyone know how I can do this kind of index reference and subtraction?
sample data frame:
day price 5-day-return
1 10 -
2 11 -
3 15 -
4 14 -
5 12 -
6 18 i want to find this ((day 5 price) -(day 1 price) )
7 20 then continue this down the list
8 19
9 21
10 22
Are you wanting this:
In [10]:
df['5-day-return'] = (df['price'] - df['price'].shift(5)).fillna(0)
df
Out[10]:
day price 5-day-return
0 1 10 0
1 2 11 0
2 3 15 0
3 4 14 0
4 5 12 0
5 6 18 8
6 7 20 9
7 8 19 4
8 9 21 7
9 10 22 10
shift moves the values by a given offset, so shift(5) lines each row up with the price from five rows earlier, which we then subtract from the current price. fillna(0) fills the NaN values in the first five rows, where no earlier price exists yet.
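Here is a runnable sketch of the calculation on the sample data from the question. It also shows Series.diff(5), which should give the same numbers as subtracting shift(5) by hand; the column name via_diff is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "day": range(1, 11),
    "price": [10, 11, 15, 14, 12, 18, 20, 19, 21, 22],
})

# Current price minus the price five rows earlier; the first five rows
# have no earlier price, so they come out as NaN and are set to 0.
df["5-day-return"] = (df["price"] - df["price"].shift(5)).fillna(0).astype(int)

# diff(5) computes the same difference in one call.
df["via_diff"] = df["price"].diff(5).fillna(0).astype(int)

print(df)
```

Note that filling with 0 makes "no data yet" look like "no change"; leaving the NaN values in place may be safer if the column feeds later calculations.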
I have a pandas DataFrame in the following format:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
I want to append a calculated row that performs some math based on a given item's index value, e.g. adding a row that sums the values of all items with an index value < 2, with the new row having an index label of 'Red'. Ultimately, I am trying to add three rows that group the index values into categories:
A row with the sum of item values where index value are < 2, labeled as 'Red'
A row with the sum of item values where index values are 1 < x < 4, labeled as 'Blue'
A row with the sum of item values where index values are > 3, labeled as 'Green'
Ideal output would look like this:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
Red 3 5 7
Blue 15 17 19
Green 27 29 31
My current solution involves transposing the DataFrame, applying a map function for each calculated column and then re-transposing, but I would imagine pandas has a more efficient way of doing this, likely using .append().
EDIT:
My inelegant pre-set list solution (originally used .transpose() but I improved it using .groupby() and .append()):
df = pd.DataFrame(np.arange(18).reshape((6,3)),columns=['a', 'b', 'c'])
df['x'] = ['Red', 'Red', 'Blue', 'Blue', 'Green', 'Green']
df2 = df.groupby('x').sum()
df = df.append(df2)
del df['x']
I much prefer the flexibility of BrenBarn's answer (see below).
Here is one way:
def group(ix):
    if ix < 2:
        return "Red"
    elif 2 <= ix < 4:
        return "Blue"
    else:
        return "Green"
>>> print(d)
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
>>> print(d.append(d.groupby(d.index.to_series().map(group)).sum()))
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
Blue 15 17 19
Green 27 29 31
Red 3 5 7
For the general case, you need to define a function (or dict) to handle the mapping to different groups. Then you can just use groupby and its usual abilities.
For your particular case, it can be done more simply by directly slicing on the index value as Dan Allan showed, but that will break down if you have a more complex case where the groups you want are not simply definable in terms of contiguous blocks of rows. The method above will also easily extend to situations where the groups you want to create are not based on the index but on some other column (i.e., group together all rows whose value in column X is within range 0-10, or whatever).
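Since DataFrame.append was deprecated and then removed in pandas 2.0, here is a sketch of the same mapping approach written with pd.concat. The reindex step is my own addition to get the Red, Blue, Green order from the question; groupby would otherwise sort the labels alphabetically.

```python
import numpy as np
import pandas as pd

def group(ix):
    """Map an integer index value to its category label."""
    if ix < 2:
        return "Red"
    elif ix < 4:
        return "Blue"
    return "Green"

d = pd.DataFrame(np.arange(18).reshape((6, 3)), columns=["a", "b", "c"])

# Sum the rows per category, keyed by the mapped index values.
totals = d.groupby(d.index.to_series().map(group)).sum()

# Put the summary rows in the order asked for in the question.
totals = totals.reindex(["Red", "Blue", "Green"])

# pd.concat stacks the summary rows under the original frame.
result = pd.concat([d, totals])

print(result)
```

To group on a column instead of the index, the same pattern should work with d['X'].map(group) as the groupby key.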
The role of "transpose," which you say you used in your unshown solution, might be played more naturally by the orient keyword argument, which is available when you construct a DataFrame from a dictionary.
In [23]: df
Out[23]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
In [24]: sums = {'Red': df.loc[:1].sum(),
                 'Blue': df.loc[2:3].sum(),
                 'Green': df.loc[4:].sum()}
In [25]: DataFrame.from_dict(sums, orient='index')
Out[25]:
a b c
Blue 15 17 19
Green 27 29 31
Red 3 5 7
In [26]: df.append(_)
Out[26]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
Blue 15 17 19
Green 27 29 31
Red 3 5 7
Based on the numbers in your example, I assume that by "> 4" you actually meant ">= 4".
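The slicing approach can be written as a standalone script roughly like this. It is a sketch that swaps in pd.concat for the final step, because DataFrame.append is no longer available in pandas 2.0 and later; the names sums, summary and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(18).reshape((6, 3)), columns=["a", "b", "c"])

# .loc slices on labels and includes both endpoints,
# so [:1] is rows 0-1, [2:3] is rows 2-3, and [4:] is rows 4-5.
sums = {
    "Red": df.loc[:1].sum(),
    "Blue": df.loc[2:3].sum(),
    "Green": df.loc[4:].sum(),
}

# orient='index' turns each dict key into a row label.
summary = pd.DataFrame.from_dict(sums, orient="index")

# Stack the summary rows under the original data.
result = pd.concat([df, summary])

print(result)
```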