I'm trying to replace every value above 1000 in my dataframe with its difference from the previous row's value.
This is the way I tried with pandas:
data_df.replace(data_df.where(data_df["value"] >= 1000), data_df["value"].diff(), inplace=True)
This does not result in an error, but nothing in the dataframe changes. What am I missing?
import numpy as np
import pandas as pd
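d = {'value': [1000, 200002, 50004, 600005]}
data_df = pd.DataFrame(data=d)
# helper column: difference from the previous row
data_df["diff"] = data_df["value"].diff()
# take the difference where the value is above 1000, otherwise keep the value
data_df["value"] = np.where(data_df["value"] > 1000, data_df["diff"], data_df["value"])
data_df.drop(columns='diff', inplace=True)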
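I introduce a helper column "diff" that holds the difference from the previous row.
np.where lets you implement the if/else logic: where the condition is true it takes the difference, otherwise it keeps the original value.
Hope it helps, thanks!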
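For what it's worth, the helper column isn't strictly needed. Here is a minimal sketch of the same idea using Series.mask, assuming the question's threshold of 1000 and the sample values from the answer above. Note that the first row has no previous row, so its difference would be NaN if it were above the threshold.

```python
import pandas as pd

# sample data taken from the answer above
data_df = pd.DataFrame({"value": [1000, 200002, 50004, 600005]})

# difference from the previous row (NaN for the first row)
diffs = data_df["value"].diff()

# where the condition is True take the difference, elsewhere keep the value
data_df["value"] = data_df["value"].mask(data_df["value"] > 1000, diffs)
print(data_df)
```

The column ends up as float because diff() returns floats.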
You can set freq to 1000 or whatever interval you want. I have it at 10 to make the sample easier to see. The idea is to shift the column by one row and, for every row whose index is evenly divisible by the frequency, subtract the shifted value; all other rows are left as is.
import pandas as pd
import numpy as np
freq = 10
df = pd.DataFrame({'data':[x for x in range(30)]})
df['previous'] = df['data'].shift(1)
df['data'] = np.where((df.index % freq==0) & (df.index>0), df['data'] -df['previous'], df['data'])
df.drop(columns='previous', inplace=True)
Output
data
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 6.0
7 7.0
8 8.0
9 9.0
10 1.0
11 11.0
12 12.0
13 13.0
14 14.0
15 15.0
16 16.0
17 17.0
18 18.0
19 19.0
20 1.0
21 21.0
22 22.0
23 23.0
24 24.0
25 25.0
26 26.0
27 27.0
28 28.0
29 29.0
I have this DataFrame
df = pd.DataFrame({'store':[1,1,1,2],'upc':[11,22,33,11],'sales':[14,16,11,29]})
which gives this output
store upc sales
0 1 11 14
1 1 22 16
2 1 33 11
3 2 11 29
I want something like this
store upc_11 upc_22 upc_33
1 14.0 16.0 11.0
2 29.0 NaN NaN
I tried this
newdf = df.pivot(index='store', columns='upc')
newdf.columns = newdf.columns.droplevel(0)
and the output looks like this with multiple headers
upc 11 22 33
store
1 14.0 16.0 11.0
2 29.0 NaN NaN
I also tried
newdf = df.pivot(index='store', columns='upc').reset_index()
This also gives multiple headers
store sales
upc 11 22 33
0 1 14.0 16.0 11.0
1 2 29.0 NaN NaN
Try an f-string with the columns attribute and a list comprehension:
newdf = df.pivot(index='store', columns='upc')
newdf.columns=[f"upc_{y}" for x,y in newdf.columns]
newdf=newdf.reset_index()
OR
In 2 steps:
newdf = df.pivot(index='store', columns='upc').reset_index()
newdf.columns=[f"upc_{y}" if y!='' else f"{x}" for x,y in newdf.columns]
Another option, which is longer than @Anurag's:
(df.pivot(index='store', columns='upc')
.droplevel(axis=1, level=0)
.rename(columns = lambda col: f"upc_{col}")
.rename_axis(index=None, columns=None)
)
upc_11 upc_22 upc_33
1 14.0 16.0 11.0
2 29.0 NaN NaN
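A possible shortcut, as a sketch rather than a definitive answer: passing values='sales' to pivot avoids the extra column level in the first place, and add_prefix then builds the upc_ names. The data is the frame from the question.

```python
import pandas as pd

df = pd.DataFrame({'store': [1, 1, 1, 2],
                   'upc': [11, 22, 33, 11],
                   'sales': [14, 16, 11, 29]})

newdf = (
    df.pivot(index='store', columns='upc', values='sales')  # single-level columns
      .add_prefix('upc_')             # 11 -> 'upc_11', and so on
      .rename_axis(columns=None)      # drop the leftover 'upc' axis name
      .reset_index()                  # turn 'store' back into a regular column
)
print(newdf)
```

Leave out reset_index() if you would rather keep store as the index.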
I am fairly new to Python and Pandas and have been searching for a solution for a couple of days with no luck. Here's the problem:
I have a data set like the one below, and I need to cull the first few values of some rows so that the highest value in each row ends up in column A. In the example below, rows 0 and 3 would drop the value in column A, row 4 would drop the values in columns A and B, and all remaining values would then shift to the left.
A B C D
0 11 23 21 14
1 24 18 17 15
2 22 18 15 13
3 10 13 12 10
4 5 7 14 11
Desired
A B C D
0 23 21 14 NaN
1 24 18 17 15
2 22 18 15 13
3 13 12 10 NaN
4 14 11 NaN NaN
I've looked at the df.shift(), but don't see how I can get that function to work on a unique row by row basis. Should I instead be using an array and a loop function?
Any help is greatly appreciated.
Set every value to the left of the row maximum to np.nan, then move the NaNs to the end of each row using the solution in this question. I use the one from @cs95:
df_final = df[df.eq(df.max(axis=1), axis=0).cummax(axis=1)].apply(lambda x: sorted(x, key=pd.isnull), axis=1, result_type='broadcast')
A B C D
0 23.0 21.0 14.0 NaN
1 24.0 18.0 17.0 15.0
2 22.0 18.0 15.0 13.0
3 13.0 12.0 10.0 NaN
4 14.0 11.0 NaN NaN
You can loop over the unique shifts (fewer of these than rows) with a groupby and join the results back:
import pandas as pd
shifts = df.to_numpy().argmax(1)
pd.concat([gp.shift(-i, axis=1) for i, gp in df.groupby(shifts)]).sort_index()
A B C D
0 23.0 21.0 14.0 NaN
1 24.0 18.0 17.0 15.0
2 22.0 18.0 15.0 13.0
3 13.0 12.0 10.0 NaN
4 14.0 11.0 NaN NaN
One approach is to convert each row of the data frame to a list (excluding the index) and append NaN values. Then keep N elements, starting with the max value.
import numpy as np
import pandas as pd
ncols = len(df.columns)
nans = [np.nan] * ncols
new_rows = list()
for row in df.itertuples():
    # convert each row of the data frame to a list
    # start at 1 to exclude the index;
    # and append list of NaNs
    new_list = list(row[1:]) + nans
    # find index of max value (excluding NaNs we appended)
    k = np.argmax(new_list[:ncols])
    # collect `new row`, starting at max element
    new_rows.append(new_list[k : k+ncols])
# create new data frame
df_new = pd.DataFrame(new_rows, columns=df.columns)
df_new
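import numpy as np
df = df.astype(float)  # NaN needs a float dtype
for i in range(df.shape[0]):
    arr = list(df.iloc[i, :])
    c = 0
    # drop leading values until the row maximum comes first
    while arr[0] != max(arr):
        arr.pop(0)
        c += 1
    # pad with real NaN (not the string "NaN") to keep the row length
    arr.extend([np.nan] * c)
    df.iloc[i, :] = arr
print(df)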
I loop over every row, find its maximum, remove the values before the maximum, and pad the end of the row with NaN so that every row keeps the same number of columns.
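If the frame is large, the per-row shift can also be done without a Python loop. This is only a sketch with NumPy indexing, using the data from the question; the variable names (start, cols, valid) are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [11, 24, 22, 10, 5],
                   'B': [23, 18, 18, 13, 7],
                   'C': [21, 17, 15, 12, 14],
                   'D': [14, 15, 13, 10, 11]})

values = df.to_numpy(dtype=float)
n_cols = values.shape[1]

# position of each row's maximum = how far that row must shift left
start = values.argmax(axis=1)

# for every target cell, the source column it should be read from
cols = start[:, None] + np.arange(n_cols)
valid = cols < n_cols  # source columns past the right edge do not exist

# gather the values (out-of-range positions read column 0 as a placeholder)
shifted = np.take_along_axis(values, np.where(valid, cols, 0), axis=1)
shifted[~valid] = np.nan  # overwrite the placeholders with NaN

result = pd.DataFrame(shifted, index=df.index, columns=df.columns)
print(result)
```

If a row contains its maximum more than once, argmax picks the first occurrence, so the row shifts as little as possible.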
I am looking to perform a forward fill on some dataframe columns.
The ffill method replaces missing values (NaN) with the previous valid value.
In my case, I would like to perform a forward fill, with the difference that I don't want to fill NaN but a specific value (say "*").
Here's an example
import pandas as pd
import numpy as np
d = [{"a":1, "b":10},
{"a":2, "b":"*"},
{"a":3, "b":"*"},
{"a":4, "b":"*"},
{"a":np.nan, "b":50},
{"a":6, "b":60},
{"a":7, "b":70}]
df = pd.DataFrame(d)
with df being
a b
0 1.0 10
1 2.0 *
2 3.0 *
3 4.0 *
4 NaN 50
5 6.0 60
6 7.0 70
The expected result should be
a b
0 1.0 10
1 2.0 10
2 3.0 10
3 4.0 10
4 NaN 50
5 6.0 60
6 7.0 70
If I replace "*" with np.nan and then call ffill, the forward fill is also applied to column a.
Since my data has hundreds of columns, I was wondering if there is a more efficient way than looping over all columns, checking whether each contains "*", and then replacing and forward filling.
You can combine df.mask, df.isin, and df.replace:
df.mask(df.isin(['*']),df.replace('*',np.nan).ffill())
a b
0 1.0 10
1 2.0 10
2 3.0 10
3 4.0 10
4 NaN 50
5 6.0 60
6 7.0 70
I think you're going in the right direction, but here's a complete solution. What I'm doing is 'marking' the original NaN values, then replacing "*" with NaN, using ffill, and then putting the original NaN values back.
df = df.replace(np.nan, "<special>").replace("*", np.nan).ffill().replace("<special>", np.nan)
output:
a b
0 1.0 10.0
1 2.0 10.0
2 3.0 10.0
3 4.0 10.0
4 NaN 50.0
5 6.0 60.0
6 7.0 70.0
And here's an alternative solution that does the same thing, without the 'special' marking:
original_nan = df.isna()
df = df.replace("*", np.nan).ffill()
df[original_nan] = np.nan
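Since the question mentions hundreds of columns, one variation is to touch only the columns that actually contain "*". This is a sketch built from the same mask/isin/ffill pieces as the answers above; star_cols and filled are names I made up, and I have not timed it against the other solutions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([{"a": 1, "b": 10},
                   {"a": 2, "b": "*"},
                   {"a": 3, "b": "*"},
                   {"a": 4, "b": "*"},
                   {"a": np.nan, "b": 50},
                   {"a": 6, "b": 60},
                   {"a": 7, "b": 70}])

is_star = df.isin(["*"])
# only the columns that contain at least one "*"
star_cols = is_star.columns[is_star.any()]

# blank out the "*" cells in those columns and forward fill
filled = df[star_cols].mask(is_star[star_cols]).ffill()

# copy the filled values back into the "*" cells only;
# original NaNs (such as the one in column a) stay untouched
result = df.copy()
result[star_cols] = df[star_cols].mask(is_star[star_cols], filled)
print(result)
```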
I have a pandas column named 'value' containing the values 10 15 36 95 99. I want to subtract from each value the one before it so that I get the following: 10 5 21 59 4
I've tried to solve this with a for loop over the whole dataframe, but this method was time-consuming.
for i in range(1, length_column):
    df['value'].iloc[i] = df['value'].iloc[i] - df['value'].iloc[i-1]
Is there a straightforward dataframe method that solves this problem quickly?
The output we desire is the following:
['input']
11
15
22
27
36
69
77
['output']
11
4
7
5
9
33
8
Use pandas.Series.diff with fillna:
import pandas as pd
s = pd.Series([11,15,22,27,36,69,77])
s.diff().fillna(s)
Output:
0 11.0
1 4.0
2 7.0
3 5.0
4 9.0
5 33.0
6 8.0
dtype: float64
You can use pandas' shift method to subtract the previous row from each row. Here is how I did it. Let me know if it works.
Code here:
import pandas as pd
df = pd.DataFrame({ 'input': [11, 15, 22, 27, 36, 69, 77]})
df['output']=df['input'] -df['input'].shift(1)
df
#df['output'].dropna()
Explanation:
create the dataframe
create a column output holding each row minus the previous row
print the dataframe
Result:
input output
0 11 NaN
1 15 4.0
2 22 7.0
3 27 5.0
4 36 9.0
5 69 33.0
6 77 8.0
You can remove the NaN with dropna(), but note that this drops the first row rather than keeping its original value.
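To get exactly the output asked for (first value kept, no NaN), one option is shift's fill_value parameter. A minimal sketch using the same input column as above:

```python
import pandas as pd

df = pd.DataFrame({'input': [11, 15, 22, 27, 36, 69, 77]})

# shift by one row; the missing first "previous" value is treated as 0,
# so the first row keeps its original value
df['output'] = df['input'] - df['input'].shift(1, fill_value=0)
print(df)
```

Because no NaN is introduced, the output column stays an integer column.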
I have a dataframe populated with zeros and nonzero values. For each row I want to apply the following condition:
If the value in a given cell is nonzero AND the value in the cell to its right is zero, then copy the value into the cell to the right.
The example would be the following:
This is the one of the rows in the dataframe now:
[0,0,0,20,0,0,0,33,3,0,5,0,0,0,0,0]
The function would convert it to the following:
[0,0,0,20,20,20,20,33,3,3,5,5,5,5,5,5]
I want to apply this to the whole dataframe.
Your help would be much appreciated!
Thank you.
Since you imply you are using Pandas, I would leverage a bit of the built-in muscle in the library.
import pandas as pd
import numpy as np
s = pd.Series([0,0,0,20,0,0,0,33,3,0,5,0,0,0,0,0])
s.replace(0, np.nan, inplace=True)
s = s.ffill()
Output:
0 NaN
1 NaN
2 NaN
3 20.0
4 20.0
5 20.0
6 20.0
7 33.0
8 3.0
9 3.0
10 5.0
11 5.0
12 5.0
13 5.0
14 5.0
15 5.0
dtype: float64
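To apply this to the whole dataframe row by row and keep the leading zeros as 0 instead of NaN, something along these lines should work. It is a sketch: the first row is the example from the question, and the second row is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [0, 0, 0, 20, 0, 0, 0, 33, 3, 0, 5, 0, 0, 0, 0, 0],  # row from the question
    [7, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 9],    # hypothetical extra row
])

result = (
    df.replace(0, np.nan)   # treat zeros as missing
      .ffill(axis=1)        # carry the last nonzero value to the right, per row
      .fillna(0)            # leading zeros have nothing to their left: restore 0
      .astype(int)          # back to integers
)
print(result)
```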