I want to replicate the data from the same dataframe when a certain condition is fulfilled.
Dataframe:
Hour,Wage
1,15
2,17
4,20
10,25
15,26
16,30
17,40
19,15
I want to replicate rows while looping through the dataframe whenever the difference between consecutive Hour values is greater than 4.
Expected Output:
Hour,Wage
1,15
2,17
4,20
10,25
15,26
16,30
17,40
19,15
2,17
4,20
I want to replicate rows while iterating through all the rows wherever there is a difference greater than 4 between consecutive Hour values.
row.hour[0] = 1 and row.hour[1] = 2: here the difference is 1. But between row.hour[2] = 4 and row.hour[3] = 10 the difference is 6, which is greater than 4. I want to replicate the data above the index where this condition (greater than 4) is fulfilled.
I can replicate the data with df = pd.concat([df]*2, ignore_index=False), but it does not replicate when I run it with an if statement.
I tried the code below, but nothing happens.
for i in range(0, len(df)-1):
    if (df.iloc[i,0] - df.iloc[i+1,0]) > 4:
        df = pd.concat([df]*2, ignore_index=False)
My understanding is: you want to compare 'Hour' values for two successive rows.
If the difference is > 4, you want to add the previous row to the DF.
If that is what you want, try this:
Create a DF:
j = pd.DataFrame({'Hour': [1, 2, 4, 10, 15, 16, 17, 19],
                  'Wage': [15, 17, 20, 25, 26, 30, 40, 15]})
Define a function:
def f1(d):
    dn = d.copy()
    for x in range(len(d)-2):
        # if the Hour gap between the next two rows is > 4,
        # insert a copy of row x at a fractional index just after it
        if abs(d.iloc[x+1].Hour - d.iloc[x+2].Hour) > 4:
            idx = x + 0.5
            dn.loc[idx] = d.iloc[x]['Hour'], d.iloc[x]['Wage']
    # fractional indices sort into place; then renumber
    dn = dn.sort_index().reset_index(drop=True)
    return dn
Call the function passing your DF:
nd = f1(j)
Hour Wage
0 1 15
1 2 17
2 2 17
3 4 20
4 4 20
5 10 25
6 15 26
7 16 30
8 17 40
9 19 15
In the line
if df.iloc[i,0] - df.iloc[i+1,0] > 4
you calculate 4 - 10 instead of 10 - 4, so you check -6 > 4 instead of 6 > 4.
You have to swap the operands:
if df.iloc[i+1,0] - df.iloc[i,0] > 4
or use abs() if you want to replicate in both situations (> 4 and < -4):
if abs(df.iloc[i+1,0] - df.iloc[i,0]) > 4
If you used print(df.iloc[i,0] - df.iloc[i+1,0]) (or a debugger), you would see it.
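As a quick illustration, here is the asker's loop with the corrected check; this is a minimal sketch, and the break is my addition so the frame is only doubled once (without it, every qualifying gap doubles the frame again):
import pandas as pd

df = pd.DataFrame({'Hour': [1, 2, 4, 10, 15, 16, 17, 19],
                   'Wage': [15, 17, 20, 25, 26, 30, 40, 15]})

for i in range(len(df) - 1):
    # abs() catches both a jump up (> 4) and a jump down (< -4)
    if abs(df.iloc[i + 1, 0] - df.iloc[i, 0]) > 4:
        df = pd.concat([df] * 2, ignore_index=False)
        break  # double the frame once, then stop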
Related
I want to do some complex calculations in pandas while referencing previous values (basically I'm calculating row by row). However, the loops take forever and I wanted to know if there is a faster way. Everybody keeps mentioning shift but I don't understand how that would even work.
df = pd.DataFrame(index=range(500))
df["A"] = 2
df["B"] = 5
df["A"][0] = 1
for i in range(len(df)):
    if i != 0:
        df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25
numpy_ext can be used for expanding calculations; see pandas-rolling-apply-using-multiple-columns for reference.
I have also included a simpler calculation to demonstrate the behaviour in a simpler way.
import numpy as np
import pandas as pd
import numpy_ext as npe

df = pd.DataFrame(index=range(5000))
df["A"] = 2
df["B"] = 5
df["A"][0] = 1

# for i in range(len(df)):
#     if i != 0: df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25

# SO example - function of previous values in A and B
def f(A, B):
    r = np.sum(A[:-1]/3) - np.sum(B[:-1] + 25) if len(A) > 1 else A[0]
    return r

# much simpler example, sum of previous values
def g(A):
    return np.sum(A[:-1])

df["AB_combo"] = npe.expanding_apply(f, 1, df["A"].values, df["B"].values)
df["A_running"] = npe.expanding_apply(g, 1, df["A"].values)
print(df.head(10).to_markdown())
Sample output:

|    |   A |   B |   AB_combo |   A_running |
|---:|----:|----:|-----------:|------------:|
|  0 |   1 |   5 |     1      |           0 |
|  1 |   2 |   5 |   -29.6667 |           1 |
|  2 |   2 |   5 |   -59      |           3 |
|  3 |   2 |   5 |   -88.3333 |           5 |
|  4 |   2 |   5 |  -117.667  |           7 |
|  5 |   2 |   5 |  -147      |           9 |
|  6 |   2 |   5 |  -176.333  |          11 |
|  7 |   2 |   5 |  -205.667  |          13 |
|  8 |   2 |   5 |  -235      |          15 |
|  9 |   2 |   5 |  -264.333  |          17 |
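For the simpler running-sum case there is also a plain pandas route without numpy_ext; a minimal sketch (my addition, not part of the answer above), using the fact that the sum of all previous values is just a cumulative sum shifted down one row:
import pandas as pd

df = pd.DataFrame({"A": [1] + [2] * 9})
# sum of all previous A values = cumulative sum shifted down one row
df["A_running"] = df["A"].cumsum().shift(fill_value=0)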
I have a DataFrame that contains gas concentrations and the corresponding valve number. This data was taken continuously where we switched the valves back and forth (valves=1 or 2) for a certain amount of time to get 10 cycles for each valve value (20 cycles total). A snippet of the data looks like this (I have 2,000+ points and each valve stayed on for about 90 seconds each cycle):
gas1 valveW time
246.9438 2 1
247.5367 2 2
246.7167 2 3
246.6770 2 4
245.9197 1 5
245.9518 1 6
246.9207 1 7
246.1517 1 8
246.9015 1 9
246.3712 2 10
247.0826 2 11
... ... ...
My goal is to save the last N points of each valve's cycle. For example, for the first cycle where valve=1, I want to index and save the last N points before the valve switches to 2, and then average them to find one value representing that first cycle. Then I want to repeat this step for the second cycle when valve=1 again.
I am currently converting from Matlab to Python so here is the Matlab code that I am trying to translate:
% NOAA high
n2o_noaaHigh = [];
co2_noaaHigh = [];
co_noaaHigh = [];
h2o_noaaHigh = [];
ind_noaaHigh_end = zeros(1,length(t_c));
numPoints = 40;
for i = 1:length(valveW_c)-1
    if (valveW_c(i) == 1 && valveW_c(i+1) ~= 1)
        test = (i-numPoints):i;
        ind_noaaHigh_end(test) = 1;
        n2o_noaaHigh = [n2o_noaaHigh mean(n2o_c(test))];
        co2_noaaHigh = [co2_noaaHigh mean(co2_c(test))];
        co_noaaHigh = [co_noaaHigh mean(co_c(test))];
        h2o_noaaHigh = [h2o_noaaHigh mean(h2o_c(test))];
    end
end
ind_noaaHigh_end = logical(ind_noaaHigh_end);
This is what I have so far for Python:
# NOAA high
n2o_noaaHigh = []
co2_noaaHigh = []
co_noaaHigh = []
h2o_noaaHigh = []
t_c_High = []  # time
for i in range(len(valveW_c)):
    # NOAA HIGH
    if valveW_c[i] == 1:
        t_c_High.append(t_c[i])
        n2o_noaaHigh.append(n2o_c[i])
        co2_noaaHigh.append(co2_c[i])
        co_noaaHigh.append(co_c[i])
        h2o_noaaHigh.append(h2o_c[i])
Thanks in advance!
I'm not sure if I understood correctly, but I guess this is what you are looking for:
# First we create a column to show cycles:
df['cycle'] = (df.valveW.diff() != 0).cumsum()
print(df)
gas1 valveW time cycle
0 246.9438 2 1 1
1 247.5367 2 2 1
2 246.7167 2 3 1
3 246.677 2 4 1
4 245.9197 1 5 2
5 245.9518 1 6 2
6 246.9207 1 7 2
7 246.1517 1 8 2
8 246.9015 1 9 2
9 246.3712 2 10 3
10 247.0826 2 11 3
Now you can use the groupby method to get the average of the last n points of each cycle:
n = 3  # we assume this is n
df.groupby('cycle').apply(lambda x: x.iloc[-n:, 0].mean())
Output:
cycle
1    246.9768
2    246.6579
3    246.7269
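An equivalent spelling (my suggestion, not from the answer above) first keeps the last n rows per cycle with groupby().tail(n) and then averages:
# keep the last n rows of each cycle, then average gas1 per cycle
last_n = df.groupby('cycle').tail(n)
cycle_means = last_n.groupby('cycle')['gas1'].mean()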
Let's call your DataFrame df; then you could do:
results = {}
for k, v in df.groupby((df['valveW'].shift() != df['valveW']).cumsum()):
    results[k] = v
    print(f'[group {k}]')
    print(v)
shift(), as its name suggests, shifts the valveW column by one row; comparing the shifted and original values makes it possible to detect where the number sequence changes. Then cumsum() assigns a unique number to each run of identical values. We can then do a groupby() on this column (which was not possible before, because the groups were just ones and twos!).
which gives e.g. for your code snippet (saved in results):
[group 1]
gas1 valveW time
0 246.9438 2 1
1 247.5367 2 2
2 246.7167 2 3
3 246.6770 2 4
[group 2]
gas1 valveW time
4 245.9197 1 5
5 245.9518 1 6
6 246.9207 1 7
7 246.1517 1 8
8 246.9015 1 9
[group 3]
gas1 valveW time
9 246.3712 2 10
10 247.0826 2 11
Then, to get the mean for each cycle, you could e.g. do:
df.groupby((df['valveW'].shift() != df['valveW']).cumsum()).mean()
which gives (again for your code snippet):
gas1 valveW time
valveW
1 246.96855 2.0 2.5
2 246.36908 1.0 7.0
3 246.72690 2.0 10.5
where you wouldn't care much about the time mean but the gas1 one!
Then, based on results you could e.g. do:
n = 3
mean_n_last = []
for k, v in results.items():
    if len(v) < n:
        mean_n_last.append(np.nan)
    else:
        mean_n_last.append(np.nanmean(v.iloc[len(v) - n:, 0]))
which gives [246.9768, 246.65796666666665, nan] for n = 3!
If your dataframe is sorted by time you could get the last N records for each valve like this.
N=2
valve1 = df[df['valveW']==1].iloc[-N:,:]
valve2 = df[df['valveW']==2].iloc[-N:,:]
If it isn't currently sorted you could easily sort it like this.
df = df.sort_values(by=['time'])
I have a dataframe that looks like this
initial year0 year1
0 0 12
1 1 13
2 2 14
3 3 15
Note that the number of year columns year0, year1... (year_count) is completely variable but will be constant throughout this code
I first wanted to apply a function to each of the 'year' columns to generate 'mod' columns like so
def mod(year, scalar):
    return year * scalar
s = 5
year_count = 2
# Generate new columns
df[[f"mod{y}" for y in range (year_count)]] = df[[f"year{y}" for y in range(year_count)]].apply(mod, scalar=s)
initial year0 year1 mod0 mod1
0 0 12 0 60
1 1 13 5 65
2 2 14 10 70
3 3 15 15 75
All good so far. The problem is that I now want to apply another function to both the year column and its corresponding mod column to generate another set of val columns, so something like
def sum_and_scale(year_col, mod_col, scale):
    return (year_col + mod_col) * scale
Then I apply this to each of the columns (year0, mod0), (year1, mod1) etc to generate the next tranche of columns.
With scale = 10 I should end up with
initial year0 year1 mod0 mod1 val0 val1
0 0 12 0 60 0 720
1 1 13 5 65 60 780
2 2 14 10 70 120 840
3 3 15 15 75 180 900
This is where I'm stuck - I don't know how to put two existing df columns together in a function with the same structure as in the first example, and if I do something like
df[['val0', 'val1']] = df['col1', 'col2'].apply(lambda x: sum_and_scale('mod0', 'mod1', scale=10))
I don't know how to generalise this to have arbitrary inputs and outputs and also apply the constant scale parameter. (I know the last piece of code won't work, but it's the other avenue to a solution I've seen.)
The reason I'm asking is that I believe the loop I currently have working is creating performance issues, given the number of columns and the length of each column.
Thanks
IMHO, it's better with a simple for loop:
for i in range(2):
    df[f'val{i}'] = sum_and_scale(df[f'year{i}'], df[f'mod{i}'], scale=10)
A B
0 2002-01-16 0
1 2002-01-16 4
2 2002-01-16 -2
3 2002-01-16 11
4 2002-01-16 12
5 2002-01-17 0
6 2002-01-17 -18
7 2002-01-17 16
8 2002-01-18 0
9 2002-01-18 -1
10 2002-01-18 4
results = {}
grouped = df.groupby("A")
for name, group in grouped:
    if (df["B"] >= 10).any():
        results[name] = df.loc[df["B"] >= 10].head(1)
        print(results[name])
    elif (df["B"] <= -10).any():
        results[name] = df.loc[df["B"] <= -10].head(1)
        print(results[name])
    else:
        results[name] = df.loc[df["B"] > -10, :].tail(1)
        print(results[name])
Output:
A B
3 2002-01-16 11
A B
3 2002-01-16 11
A B
3 2002-01-16 11
I want to iterate and get one result for each A group, with the following conditions:
If any B column value is >= 10 or <= -10, add just the first one to "results" and skip to the next group to continue iterating.
If there is no B column value >= 10 or <= -10, add the last value to "results" and skip to the next group to continue iterating.
The desired output would be:
A B
3 2002-01-16 11
A B
6 2002-01-17 -18
A B
10 2002-01-18 4
Your code contains two errors that prevent the correct output. The first, and most obvious, is that you are not using your group in the for loop; instead you operate on the full df frame. That is why you get the same result for every entry.
When that is fixed, you will get almost the expected result. Not exactly, though, due to your second mistake. According to your description, you want to treat >= 10 and <= -10 alike. Your code, however, executes the greater-than check first, and if that one succeeds, it generates your output. Thus, the result for group 2002-01-17 will be 16 rather than -18.
The fix for the second problem is to ensure that you test both conditions in the same if clause, generally using an or. In your current situation, however, it is possible to collapse the two tests into one using absolute values (the abs() function). This is somewhat of a special case (albeit a pretty common one), though; it is good to understand both this and the more general way, using or.
This reduces the number of cases to two, removing the elif branch. In addition, it is possible to make some minor modifications to increase readability. Taken together, that leaves you with something similar to:
results = {}
grouped = df.groupby("A")
for name, group in grouped:
    if (abs(group["B"]) >= 10).any():
        results[name] = group[abs(group["B"]) >= 10].head(1)
    else:
        results[name] = group.tail(1)
    print(results[name])
which generates the wanted output:
A B
3 2002-01-16 11
A B
6 2002-01-17 -18
A B
10 2002-01-18 4
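For completeness, the inner test can also be written in the more general element-wise form the answer mentions (| is pandas' element-wise or); this drop-in replacement for the loop body is equivalent here:
mask = (group["B"] >= 10) | (group["B"] <= -10)
if mask.any():
    results[name] = group[mask].head(1)
else:
    results[name] = group.tail(1)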
Here's another way, following your approach:
# we'll use this function to get output
def get_values(df):
    # rows where B is >= 10 or <= -10
    mask = (df["B"] >= 10) | (df["B"] <= -10)
    if mask.any():
        # the condition holds somewhere: take the first such value
        val = df.loc[mask, 'B'].head(1)
    else:
        # otherwise take the last value
        val = df['B'].tail(1)
    return val
df.groupby('A').apply(get_values)
A
2002-01-16 3 11
2002-01-17 6 -18
2002-01-18 10 4
Name: B, dtype: int64
If you don't want to use a loop, try this:
df["C"] = df["B"].apply(lambda x: abs(x)>=10)
df.groupby("A", as_index=False).apply(lambda x: x[x["C"]].head(1) if not x[x["C"]].empty else x.tail(1))[["A", "B"]]
result
Out[315]:
A B
0 3 2002-01-16 11
1 6 2002-01-17 -18
2 10 2002-01-18 4
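A minor variation (my suggestion, not part of the answer above) avoids the helper column by building the mask inside the lambda:
out = (df.groupby("A", as_index=False)
         .apply(lambda g: g[g["B"].abs() >= 10].head(1)
                if (g["B"].abs() >= 10).any() else g.tail(1)))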
I want to compute the "carryover" of a series. This computes a value for each row and then adds it to the previously computed value (for the previous row).
How do I do this in pandas?
decay = 0.5
test = pd.DataFrame(np.random.randint(1,10,12),columns = ['val'])
test
val
0 4
1 5
2 7
3 9
4 1
5 1
6 8
7 7
8 3
9 9
10 7
11 2
decayed = []
for i, v in test.iterrows():
    if i == 0:
        decayed.append(v.val)
        continue
    d = decayed[i-1] + v.val*decay
    decayed.append(d)
test['loop_decay'] = decayed
test.head()
val loop_decay
0 4 4.0
1 5 6.5
2 7 10.0
3 9 14.5
4 1 15.0
Consider a vectorized version with cumsum() where you cumulatively sum (val * decay) with the very first val.
However, you then need to subtract the very first (val * decay) since cumsum() includes it:
test['loop_decay'] = test['val'].iloc[0] + (test['val'] * decay).cumsum() - test['val'].iloc[0] * decay
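To sanity-check the formula against the loop result, you can compute it into a separate column first and compare (vec_decay is my name for the check column):
import numpy as np

test['vec_decay'] = test['val'].iloc[0] + (test['val'] * decay).cumsum() - test['val'].iloc[0] * decay
# matches the loop-built column
assert np.allclose(test['vec_decay'], test['loop_decay'])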
You can utilize pd.Series.shift() to create a dataframe with val[i] and val[i-1] and then apply your function across a single axis (1 in this case):
# Create a shifted copy of val so each row can see the previous row's value
test['val2'] = test['val'].shift()
# Set the first row of the shifted series to 0 (it is NaN after the shift)
test.loc[0, 'val2'] = 0
# Apply the decay formula across columns:
test['loop_decay'] = test.apply(lambda x: x['val'] + x['val2'] * 0.5, axis=1)