I would like to add a new column to an existing dataframe. The new column needs to start with a constant value in the first row (-17.3 in the example below) and then add 0.15 to it in each consecutive row, for all 26000 rows. The new column should end up with the following values:
-17.3
-17.15
-17
-16.85
…
…
…
26000 rows
Is there any way to get this done without looping over all the rows?
Thanks,
Yoshiro
You can construct the range like this:
# 26,000 numbers
# step of 0.15
# starting at -17.3
np.arange(26000) * 0.15 - 17.3
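If you want to assign this directly (a minimal sketch, assuming the dataframe is named df; using len(df) avoids hard-coding the row count):
df['new_column'] = np.arange(len(df)) * 0.15 - 17.3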
Assuming your dataframe is named df, you can do it in the following way:
start_value = -17.3
increment_value = 0.15
new_column = [start_value + increment_value * i for i in range(df.shape[0])]
df['new_column'] = new_column
Either use the pandas.Series constructor with pandas.Series.cumsum:
N, S, F = len(df), -17.3, 0.15 # <- len(df) = 26000 in your case
df["approach_1"] = pd.Series([S] + [np.NaN]*(N-1)).fillna(F).cumsum()
Or simply go for numpy.arange as per @tdy's answer:
df["approach_2"] = np.arange(S, S + N*F, F)
Output:
print(df)
approach_1 approach_2
0 -17.30 -17.30
1 -17.15 -17.15
2 -17.00 -17.00
3 -16.85 -16.85
4 -16.70 -16.70
... ... ...
25995 3881.95 3881.95
25996 3882.10 3882.10
25997 3882.25 3882.25
25998 3882.40 3882.40
25999 3882.55 3882.55
[26000 rows x 2 columns]
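Note that with a float step, numpy.arange(S, S + N*F, F) can occasionally return N+1 elements because of floating-point rounding at the endpoint. A safer variant of approach_2 under the same N, S, F definitions (a sketch):
df["approach_2"] = S + F * np.arange(N)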
Related
I have a pandas dataframe df which looks like this:
betasub0 betasub1 betasub2 betasub3 betasub4 betasub5 betasub6 betasub7 betasub8 betasub9 betasub10
0 0.009396 0.056667 0.104636 0.067066 0.009678 0.019402 0.029316 0.187884 0.202597 0.230275 0.083083
1 0.009829 0.058956 0.108205 0.068956 0.009888 0.019737 0.029628 0.187611 0.197627 0.225660 0.083903
2 0.009801 0.058849 0.108092 0.068927 0.009886 0.019756 0.029690 0.188627 0.200235 0.224703 0.081434
3 0.009938 0.059595 0.109310 0.069609 0.009970 0.019896 0.029854 0.189187 0.199424 0.221968 0.081249
4 0.009899 0.059373 0.108936 0.069395 0.009943 0.019852 0.029801 0.188979 0.199893 0.222922 0.081009
Then I have a vector dk that looks like this:
[0.18,0.35,0.71,1.41,2.83,5.66,11.31,22.63,45.25,90.51,181.02]
What I need to do is:
calculate a new vector which is
psik = [np.log2(dki/1e3) for dki in dk]
calculate the sum of each row multiplied with the psik vector (just like the SUMPRODUCT function in Excel)
calculate the log2 of each resulting psig value
expected output should be:
betasub0 betasub1 betasub2 betasub3 betasub4 betasub5 betasub6 betasub7 betasub8 betasub9 betasub10 psig dg
0 0.009396 0.056667 0.104636 0.067066 0.009678 0.019402 0.029316 0.187884 0.202597 0.230275 0.083083 -5.848002631 0.017361042
1 0.009829 0.058956 0.108205 0.068956 0.009888 0.019737 0.029628 0.187611 0.197627 0.22566 0.083903 -5.903532822 0.016705502
2 0.009801 0.058849 0.108092 0.068927 0.009886 0.019756 0.02969 0.188627 0.200235 0.224703 0.081434 -5.908820802 0.016644383
3 0.009938 0.059595 0.10931 0.069609 0.00997 0.019896 0.029854 0.189187 0.199424 0.221968 0.081249 -5.930608559 0.016394906
4 0.009899 0.059373 0.108936 0.069395 0.009943 0.019852 0.029801 0.188979 0.199893 0.222922 0.081009 -5.924408689 0.016465513
I would do that with a for loop cycling over the rows, like this:
psig, dg = [], []
for _, r in df.iterrows():
    psig_i = sum(psik[i] * ri for i, ri in enumerate(r))  # row · psik, like SUMPRODUCT
    psig.append(psig_i)
    dg.append(np.log2(psig_i))
df['psig'] = psig
df['dg'] = dg
Is there any other way to update the df without iterating through its rows?
EDIT: I found the solution and I am ashamed of how simple it is:
df['psig'] = df.mul(psik).sum(axis=1)
df['dg'] = np.log2(df['psig'])
EDIT2: now my df has more entries, so I have to filter it with a regex to find only the columns whose names start with "betasub". I have my array psik and a new column psig in the df. I would like to calculate, for each row (i.e. for each value of psig):
sum(((psik-psig)**2)*betasub[0...n])
I did it like this, but maybe there's a better way?
PsimPsig2 = [[(psik_i - psig_i)**2 for psik_i in psik] for psig_i in list(df['psig'])]
psikmpsigname = ['psikmpsig' + str(i) for i in range(len(psik))]
dfPsimPsig2 = pd.DataFrame(data=PsimPsig2, columns=psikmpsigname)
siggAL = np.power(2, np.power(pd.DataFrame(df.filter(regex=r'^betasub[0-9]', axis=1).values * dfPsimPsig2.values).sum(axis=1), 0.5))
df['siggAL'] = siggAL
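For what it's worth, a possible vectorized alternative (a sketch, assuming psik is array-like and the psig column already exists): numpy broadcasting computes all the squared differences at once, without the intermediate dataframe:
betas = df.filter(regex=r'^betasub[0-9]', axis=1).to_numpy()             # shape (n_rows, n_k)
diff2 = (np.asarray(psik)[None, :] - df['psig'].to_numpy()[:, None])**2  # shape (n_rows, n_k)
df['siggAL'] = 2 ** np.sqrt((betas * diff2).sum(axis=1))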
I am trying to create a new column (B) in a PySpark / Python table.
The new column (B) is the sum of: the current value of column (A) + the previous value of column (B).
Desired output:
Id    a     b
1     977   977
2     3665  4642
3     1746  6388
4     2843  9231
5     200   9431
current Col B = current Col A + previous Col B;
example, Row 4: 9231 (col B) = 2843 (col A) + 6388 (previous col B value)
(for the 1st row, since there is no previous value of B, it is taken as 0)
Please help me with the Python / PySpark code for this.
Without more context I may be wrong, but it seems you're trying to do a cumulative sum of column A:
from pyspark.sql.window import Window
import pyspark.sql.functions as sf
# order by Id so the running total is deterministic
w = Window.orderBy('Id').rowsBetween(Window.unboundedPreceding, 0)
df = df.withColumn('B', sf.sum(df.A).over(w))
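Applied to the example above, this reproduces the desired column B: 977, 4642, 6388, 9231, 9431. Note that without an explicit ordering the running total would be non-deterministic, which is why the window orders by Id.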
EDIT:
If you need to iteratively add new rows based on the last value of B, and assuming the value of B in the dataframe doesn't change in the meantime, I think you'd be better off keeping B in a standard Python variable and building each new row from that.
previous_B = 0
# ... your code that computes new_A goes here ...
previous_B += new_A
# the new row must match df's schema; new_id is a placeholder for however you assign Ids
new_row = spark.createDataFrame([(new_id, new_A, previous_B)], schema=df.schema)
df = df.union(new_row)
Summary
Suppose you apply a function to a groupby object, so that g.apply for every group g in df.groupby(...) gives you a series/dataframe. How do I combine these results into a single dataframe, with the group names as columns?
Details
I have a dataframe event_df that looks like this:
index event note time
0 on C 0.5
1 on D 0.75
2 off C 1.0
...
I want to create a sampling of the event for every note, and the sampling is done at the times given by t_df:
index t
0 0
1 0.5
2 1.0
...
So that I'd get something like this:
t C D
0 off off
0.5 on off
1.0 off on
...
What I've done so far:
def get_t_note_series(notedata_row, t_arr):
    """Return the time index in the sampling that corresponds to the event."""
    t_idx = np.argwhere(t_arr >= notedata_row['time']).flatten()[0]
    return t_idx

def get_t_for_gb(group, **kwargs):
    t_idxs = group.apply(get_t_note_series, args=(t_arr,), axis=1)
    t_idxs.rename('t_arr_idx', inplace=True)
    group_with_t = pd.concat([group, t_idxs], axis=1).set_index('t_arr_idx')
    print(group_with_t)
    return group_with_t
t_arr = np.arange(0,10,0.5)
t_df = pd.DataFrame({'t': t_arr}).rename_axis('t_arr_idx')
gb = event_df.groupby('note')
gb.apply(get_t_for_gb)
So what I get is a number of dataframes, one per note, all of the same size (same as t_df):
t event
0 on
0.5 off
...
t event
0 off
0.5 on
...
How do I go from here to my desired dataframe, with each group corresponding to a column in a new dataframe, and the index being t?
EDIT: Sorry, I didn't take into account that you rescale your time column, and I can't present a whole solution now because I have to leave. But I think you could do the rescaling by using pandas.merge_asof on your two dataframes to get the nearest "rescaled" time, and then apply the code below to the merged dataframe. I hope this is what you wanted.
import pandas as pd
import io
sio= io.StringIO("""index event note time
0 on C 0.5
1 on D 0.75
2 off C 1.0""")
df = pd.read_csv(sio, sep=r'\s+', index_col=0)
df.groupby(['time', 'note']).agg({'event': 'first'}).unstack(-1).fillna('off')
Take the first row in each time-note group via agg({'event': 'first'}), then unstack the note index level so the note values become columns. Finally, fill all cells for which no data points were found with 'off' via fillna.
This outputs:
Out[28]:
event
note C D
time
0.50 on off
0.75 off on
1.00 off off
You might also want to try min or max in case on/off is ambiguous for a combination of time/note (if there are several rows for the same time/note where some have on and some have off) and you prefer one of these values (say, if there is a single on, then no matter how many offs there are, you want an on, etc.). If you want something like a majority vote, I would suggest adding a majority-vote column to the aggregated dataframe (before the unstack()).
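For the rescaling step mentioned in the EDIT above, a minimal pandas.merge_asof sketch (assumptions: event_df has a time column, t_df has a t column, and both are sorted ascending; direction='forward' snaps each event to the next sampling time, mirroring the t_arr >= time lookup in the question):
snapped = pd.merge_asof(event_df.sort_values('time'), t_df, left_on='time', right_on='t', direction='forward')
snapped.groupby(['t', 'note']).agg({'event': 'first'}).unstack(-1).fillna('off')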
Oh so I found it! All I had to do was to unstack the groupby results. Going back to generating the groupby result:
def get_t_note_series(notedata_row, t_arr):
    """Return the time index in the sampling that corresponds to the event."""
    t_idx = np.argwhere(t_arr >= notedata_row['time']).flatten()[0]
    return t_idx

def get_t_for_gb(group, **kwargs):
    t_idxs = group.apply(get_t_note_series, args=(t_arr,), axis=1)
    t_idxs.rename('t_arr_idx', inplace=True)
    group_with_t = pd.concat([group, t_idxs], axis=1).set_index('t_arr_idx')
    ## print(group_with_t) ## unnecessary!
    return group_with_t
t_arr = np.arange(0,10,0.5)
t_df = pd.DataFrame({'t': t_arr}).rename_axis('t_arr_idx')
gb = event_df.groupby('note')
result = gb.apply(get_t_for_gb)
At this point, result is a dataframe with note as an index:
>>> print(result)
event
note t
C 0 off
0.5 on
1.0 off
....
D 0 off
0.5 off
1.0 on
....
Doing result = result.unstack('note') does the trick:
>>> result = result.unstack('note')
>>> print(result)
event
note C D
t
0 off off
0.5 on off
1.0 off on
....
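If you also want plain C and D columns as in the desired output, a small follow-up (dropping the extra 'event' level that unstack leaves in the columns):
result.columns = result.columns.droplevel(0)  # keep only the note names C and D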
I have a large pandas dataframe of time-series data.
I currently manipulate this dataframe to create a new, smaller dataframe that is the average of every 10 rows, i.e. a rolling window technique, like this:
import numpy as np
import pandas as pd

def create_new_df(df):
    features = []
    x = df['X'].astype(float)
    i = x.index.values
    # repeat each index value 10 times, so rows 0-9 get label 0, rows 10-19 get label 1, ...
    time_sequence = [i] * 10
    idx = np.array(time_sequence).T.flatten()[:len(x)]
    x = x.groupby(idx).mean()  # average each block of 10 rows
    x.name = 'X'
    features.append(x)
    new_df = pd.concat(features, axis=1)
    return new_df
Code to test:
columns = ['X']
df_ = pd.DataFrame(columns=columns)
df_ = df_.fillna(0) # with 0s rather than NaNs
data = np.array([np.arange(20)]*1).T
df = pd.DataFrame(data, columns=columns)
test = create_new_df(df)
print(test)
Output:
X
0 4.5
1 14.5
However, I want the function to build the new dataframe using a sliding window with a 50% overlap, so that the output would look like this:
X
0 4.5
1 9.5
2 14.5
How can I do this?
Here's what I've tried:
from itertools import tee

def window(iterable, size):
    iters = tee(iterable, size)
    for i in range(1, size):
        for each in iters[i:]:
            next(each, None)
    return zip(*iters)

for each in window(df, 20):
    print(list(each))  # doesn't have the desired sliding window effect
Some might also suggest using the pandas rolling_mean() method, but if so, I can't see how to use it with window overlap.
Any help would be much appreciated.
I think pandas rolling techniques are fine here. Note that starting with version 0.18.0 of pandas, you would use rolling().mean() instead of rolling_mean().
>>> df=pd.DataFrame({ 'x':range(30) })
>>> df = df.rolling(10).mean() # version 0.18.0 syntax
>>> df[4::5] # take every 5th row
x
4 NaN
9 4.5
14 9.5
19 14.5
24 19.5
29 24.5
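To match the desired output exactly, a small follow-up: drop the leading NaN rows (which the size-10 window cannot fill) and reset the index:
>>> df[4::5].dropna().reset_index(drop=True)
      x
0   4.5
1   9.5
2  14.5
3  19.5
4  24.5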
I am still very new to Pandas, and hence this might be very silly. I have a Pandas dataframe as follows:
>>> data_frame
median quarter status change
0 240 2015-1 BV NaN
1 300 2015-2 BV 0.25
2 300 2015-1 CORR 0.00
3 240 2015-2 CORR -0.20
Now I need only the quarter 2015-2, so I perform the query:
>>> data_frame.query('quarter == "2015-2"')
median quarter status change
1 300 2015-2 BV 0.25
2 240 2015-2 CORR -0.20
That works fine. However, if I need to search via a variable name, it does not work.
>>> completed_quarter = '2015-2'
>>> data_frame.query('quarter == "completed_quarter"')
Empty DataFrame
Columns: [median, quarter, status, change]
Index: []
I tried a few other combinations with single quotes, no quotes, etc., but nothing works. What am I doing wrong? Is there any other way in Pandas through which I can accomplish the same thing?
Try using this:
>>> completed_quarter = '2015-2'
>>> data_frame.query('quarter == "{}"'.format(completed_quarter))
At the moment you are searching for a quarter that equals the literal string "completed_quarter" rather than the value of the completed_quarter variable. Using the string format method replaces the braces with the variable's value.
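On Python 3.6+, an f-string achieves the same thing (a minor variant of the same idea):
>>> data_frame.query(f'quarter == "{completed_quarter}"')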
You can also access the value of the variable directly with the @ prefix, like this:
completed_quarter = '2015-2'
data_frame.query('quarter == #completed_quarter')
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html