Combine Rows in Pandas - python

I have a pandas DataFrame like this
100 200 300
283.1 0.01 0.02 0.40
284.1 0.02 0.03 0.42
285.1 0.05 0.01 0.8
286.1 0.06 0.02 0.9
I need to combine a certain number of consecutive rows, calculating the average value for each column, with a new index that is the average of the indices I combined, in order to obtain something like this:
100 200 300
283.6 0.015 0.025 0.41
285.6 0.055 0.015 0.85
Is there a way to do this with pandas?

Yes -- you can do this in pandas. Here's one way to do it.
Let's say our initial dataframe df looks like
index 100 200 300
0 283.1 0.01 0.02 0.40
1 284.1 0.02 0.03 0.42
2 285.1 0.05 0.01 0.80
3 286.1 0.06 0.02 0.90
Now, calculate the length of the dataframe:
N = len(df.index)
N
4
We create a grp column to be used for aggregation: to aggregate every 2 consecutive rows use [x]*2, and for n consecutive rows use [x]*n.
import itertools
df['grp'] = list(itertools.chain.from_iterable([x]*2 for x in range(N // 2)))
df
index 100 200 300 grp
0 283.1 0.01 0.02 0.40 0
1 284.1 0.02 0.03 0.42 0
2 285.1 0.05 0.01 0.80 1
3 286.1 0.06 0.02 0.90 1
Now, get the means by grouping on the grp column like this:
df.groupby('grp').mean()
index 100 200 300
grp
0 283.6 0.015 0.025 0.41
1 285.6 0.055 0.015 0.85
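As a variation on the same idea (a minimal sketch, not part of the original answer): the group labels can also be built with integer division instead of itertools, and the new index recovered by averaging the old one along with the data. Only the example values and the pair size of 2 come from the question; the rest is an assumption.
import numpy as np
import pandas as pd

# Example data from the question; the old index holds the values 283.1 ... 286.1.
df = pd.DataFrame({100: [0.01, 0.02, 0.05, 0.06],
                   200: [0.02, 0.03, 0.01, 0.02],
                   300: [0.40, 0.42, 0.80, 0.90]},
                  index=[283.1, 284.1, 285.1, 286.1])

n = 2                               # number of consecutive rows to combine
groups = np.arange(len(df)) // n    # 0, 0, 1, 1, ...

# Averaging the (numeric) index together with the columns gives the new
# index as the mean of the indices that were combined.
out = df.reset_index().groupby(groups).mean().set_index('index')
print(out)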

A simple way:
>>> print df
index 100 200 300
0 283.1 0.01 0.02 0.40
1 284.1 0.02 0.03 0.42
2 285.1 0.05 0.01 0.80
3 286.1 0.06 0.02 0.90
break the DataFrame up into the portions that you want and find the mean of the relevant columns:
>>> pieces = [df[:2].mean(), df[2:].mean()]
then put the pieces back together using concat:
>>> avgdf = pd.concat(pieces, axis=1).transpose()
index 100 200 300
0 283.6 0.015 0.025 0.41
1 285.6 0.055 0.015 0.85
Alternatively, you can recombine the data with a list comprehension [i for i in pieces] or a generator expression:
>>> z = (i for i in pieces)
and use this to create your new DataFrame:
>>> avgdf = pd.DataFrame(z)
Finally, to set the index:
>>> avgdf.set_index('index', inplace=True)
>>> print avgdf
100 200 300
index
283.6 0.015 0.025 0.41
285.6 0.055 0.015 0.85
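For reference, here is the whole "pieces" approach as one self-contained snippet (a sketch; it assumes the wavelengths live in a column literally named 'index', as in the printout above):
import pandas as pd

df = pd.DataFrame({'index': [283.1, 284.1, 285.1, 286.1],
                   100: [0.01, 0.02, 0.05, 0.06],
                   200: [0.02, 0.03, 0.01, 0.02],
                   300: [0.40, 0.42, 0.80, 0.90]})

# Average the first two rows and the last two rows, column by column.
pieces = [df[:2].mean(), df[2:].mean()]

# Each piece is a Series; concatenating along axis=1 and transposing
# turns them back into rows.
avgdf = pd.concat(pieces, axis=1).transpose()

# Use the averaged 'index' column as the index of the result.
avgdf.set_index('index', inplace=True)
print(avgdf)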

Related

python Get cumulative sum until condition is met in another column, then reset

I have 3 columns. I want to compute the cumulative return while there is no trade. Once there is a trade, reset the starting point of the cumulative return.
Return    Price         Trade
0.00      400           0
0.08      432.00        0
0.04      419.28        -30
0.02      427.6656      0
0.06      513.325536    60
0.10      564.65809     0
I am trying to compute the cumulative return row by row using iterrows(), but with no luck. Would somebody know how to get this output?
The rows can be divided into groups using the condition ~df.Trade.eq(0) (Trade not equal to 0) as a split point; the cumulative sum is then calculated within each group:
df['Ret_cumsum'] = df.groupby((~df.Trade.eq(0)).cumsum())['Return'].cumsum()
Return Price Trade Ret_cumsum
0 0.00 400.000000 0 0.00
1 0.08 432.000000 0 0.08
2 0.04 419.280000 -30 0.04
3 0.02 427.665600 0 0.06
4 0.06 513.325536 60 0.06
5 0.10 564.658090 0 0.16
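Put together as a self-contained snippet (a sketch using the data from the question):
import pandas as pd

df = pd.DataFrame({
    'Return': [0.00, 0.08, 0.04, 0.02, 0.06, 0.10],
    'Price':  [400, 432.00, 419.28, 427.6656, 513.325536, 564.65809],
    'Trade':  [0, 0, -30, 0, 60, 0],
})

# ~df.Trade.eq(0) is True on rows where a trade happened; its cumulative
# sum therefore increases by one at each trade, giving a group id that
# restarts the running total after every trade.
group_id = (~df.Trade.eq(0)).cumsum()
df['Ret_cumsum'] = df.groupby(group_id)['Return'].cumsum()
print(df)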

Pandas: assigning columns with multiple conditions and date thresholds

Edited:
I have a financial portfolio in a pandas dataframe df, where the index is the date and I have multiple financial stocks per date.
Eg dataframe:
Date Stock Weight Percentile Final weight
1/1/2000 Apple 0.010 0.75 0.010
1/1/2000 IBM 0.011 0.4 0
1/1/2000 Google 0.012 0.45 0
1/1/2000 Nokia 0.022 0.81 0.022
2/1/2000 Apple 0.014 0.56 0
2/1/2000 Google 0.015 0.45 0
2/1/2000 Nokia 0.016 0.55 0
3/1/2000 Apple 0.020 0.52 0
3/1/2000 Google 0.030 0.51 0
3/1/2000 Nokia 0.040 0.47 0
I created Final_weight by assigning the value of Weight whenever Percentile is greater than 0.7.
Now I want this to be a bit more sophisticated. I still want Weight to be assigned to Final_weight when Percentile is > 0.7; however, after that date (at any point in the future), rather than dropping to 0 when a stock's Percentile is not > 0.7, the stock would still get a weight as long as its Percentile stays above 0.5 (i.e. the position is held for longer than just one day).
Then, if the stock goes below 0.5 (in the near future), Final_weight becomes 0.
Eg modified dataframe from above:
Date Stock Weight Percentile Final weight
1/1/2000 Apple 0.010 0.75 0.010
1/1/2000 IBM 0.011 0.4 0
1/1/2000 Google 0.012 0.45 0
1/1/2000 Nokia 0.022 0.81 0.022
2/1/2000 Apple 0.014 0.56 0.014
2/1/2000 Google 0.015 0.45 0
2/1/2000 Nokia 0.016 0.55 0.016
3/1/2000 Apple 0.020 0.52 0.020
3/1/2000 Google 0.030 0.51 0
3/1/2000 Nokia 0.040 0.47 0
The portfolio is different every day and does not always contain the same stocks as the day before.
This solution is more explicit and less pandas-esque, but it involves only a single pass through all rows without creating tons of temporary columns, and is therefore possibly faster. It needs an additional state variable, which I wrapped in a closure to avoid having to write a class.
def closure():
    cur_weight = {}
    def func(x):
        if x["Percentile"] > 0.7:
            next_weight = x["Weight"]
        elif x["Percentile"] < 0.5:
            next_weight = 0
        else:
            next_weight = x["Weight"] if cur_weight.get(x["Stock"], 0) > 0 else 0
        cur_weight[x["Stock"]] = next_weight
        return next_weight
    return func
df["FinalWeight"] = df.apply(closure(), axis=1)
I'd first put 'Stock' into the index
Then unstack to put them into the columns
I'd then split it into w for weights and p for percentiles
Then manipulate with a series of where calls:
d1 = df.set_index('Stock', append=True)
d2 = d1.unstack()
w, p = d2.Weight, d2.Percentile
d1.join(w.where(p > .7, w.where((p.shift() > .7) & (p > .5), 0)).stack().rename('Final Weight'))
Weight Percentile Final Weight
Date Stock
2000-01-01 Apple 0.010 0.75 0.010
IBM 0.011 0.40 0.000
Google 0.012 0.45 0.000
Nokia 0.022 0.81 0.022
2000-02-01 Apple 0.014 0.56 0.014
Google 0.015 0.45 0.000
Nokia 0.016 0.55 0.016
Setup
Dataframe:
Stock Weight Percentile Finalweight
Date
2000-01-01 Apple 0.010 0.75 0
2000-01-01 IBM 0.011 0.40 0
2000-01-01 Google 0.012 0.45 0
2000-01-01 Nokia 0.022 0.81 0
2000-02-01 Apple 0.014 0.56 0
2000-02-01 Google 0.015 0.45 0
2000-02-01 Nokia 0.016 0.55 0
2000-03-01 Apple 0.020 0.52 0
2000-03-01 Google 0.030 0.51 0
2000-03-01 Nokia 0.040 0.57 0
Solution
df = df.reset_index()
#find historical max percentile for a Stock
df['max_percentile'] = df.apply(lambda x: df[df.Stock==x.Stock].iloc[:x.name].Percentile.max() if x.name>0 else x.Percentile, axis=1)
#set weight according to max_percentile and the current percentile
df['Finalweight'] = df.apply(lambda x: x.Weight if (x.Percentile>0.7) or (x.Percentile>0.5 and x.max_percentile>0.7) else 0, axis=1)
Out[1041]:
Date Stock Weight Percentile Finalweight max_percentile
0 2000-01-01 Apple 0.010 0.75 0.010 0.75
1 2000-01-01 IBM 0.011 0.40 0.000 0.40
2 2000-01-01 Google 0.012 0.45 0.000 0.45
3 2000-01-01 Nokia 0.022 0.81 0.022 0.81
4 2000-02-01 Apple 0.014 0.56 0.014 0.75
5 2000-02-01 Google 0.015 0.45 0.000 0.51
6 2000-02-01 Nokia 0.016 0.55 0.016 0.81
7 2000-03-01 Apple 0.020 0.52 0.020 0.75
8 2000-03-01 Google 0.030 0.51 0.000 0.51
9 2000-03-01 Nokia 0.040 0.57 0.040 0.81
Note
In the last row of your example data, Nokia's Percentile is 0.57 while in your results it becomes 0.47. In this example, I used 0.57 so the output is a bit different than yours for the last row.
Here is one method that avoids loops and limited lookback periods.
Using your example:
import pandas as pd
import numpy as np
>>>df = pd.DataFrame([['1/1/2000', 'Apple', 0.010, 0.75],
['1/1/2000', 'IBM', 0.011, 0.4],
['1/1/2000', 'Google', 0.012, 0.45],
['1/1/2000', 'Nokia', 0.022, 0.81],
['2/1/2000', 'Apple', 0.014, 0.56],
['2/1/2000', 'Google', 0.015, 0.45],
['2/1/2000', 'Nokia', 0.016, 0.55],
['3/1/2000', 'Apple', 0.020, 0.52],
['3/1/2000', 'Google', 0.030, 0.51],
['3/1/2000', 'Nokia', 0.040, 0.47]],
columns=['Date', 'Stock', 'Weight', 'Percentile'])
First, identify when stocks would start or stop being tracked in final weight:
>>>df['bought'] = np.where(df['Percentile'] >= 0.7, 1, np.nan)
>>>df['bought or sold'] = np.where(df['Percentile'] < 0.5, 0, df['bought'])
With '1' indicating a stock to buy, and '0' one to sell, if owned.
From this, you can identify whether the stock is owned. Note that this requires the dataframe already be sorted chronologically, if at any point you use it on a dataframe without a date index:
>>>df['own'] = df.groupby('Stock')['bought or sold'].fillna(method='ffill').fillna(0)
'ffill' is forward fill, propagating ownership status forward from buy and sell dates. .fillna(0) catches any stocks that have remained between .5 and .7 for the entirety of the dataframe.
Then, calculate Final Weight
>>>df['Final Weight'] = df['own']*df['Weight']
Multiplication, with df['own'] being the identity or zero, is a little faster than another np.where and gives the same result.
Edit:
Since speed is a concern, doing everything in one column, as suggested by #cronos, does provide a speed boost, coming in at around a 37% improvement at 20 rows in my tests, or 18% at 2,000,000. I could imagine the latter figure being larger if storing the intermediate columns crossed some memory-usage threshold, or if something else system-specific that I didn't experience came into play.
This would look like:
>>>df['Final Weight'] = np.where(df['Percentile'] >= 0.7, 1, np.nan)
>>>df['Final Weight'] = np.where(df['Percentile'] < 0.5, 0, df['Final Weight'])
>>>df['Final Weight'] = df.groupby('Stock')['Final Weight'].fillna(method='ffill').fillna(0)
>>>df['Final Weight'] = df['Final Weight']*df['Weight']
Either using this method or deleting the intermediate fields would give result:
>>>df
Date Stock Weight Percentile Final Weight
0 1/1/2000 Apple 0.010 0.75 0.010
1 1/1/2000 IBM 0.011 0.40 0.000
2 1/1/2000 Google 0.012 0.45 0.000
3 1/1/2000 Nokia 0.022 0.81 0.022
4 2/1/2000 Apple 0.014 0.56 0.014
5 2/1/2000 Google 0.015 0.45 0.000
6 2/1/2000 Nokia 0.016 0.55 0.016
7 3/1/2000 Apple 0.020 0.52 0.020
8 3/1/2000 Google 0.030 0.51 0.000
9 3/1/2000 Nokia 0.040 0.47 0.000
For further improvement, I'd look at adding a way to set an initial condition in which some stocks are already owned, and then breaking the dataframe down into smaller timeframes. This could be done by adding an initial condition for the period covered by one of these smaller dataframes, then changing
>>>df['Final Weight'] = np.where(df['Percentile'] >= 0.7, 1, np.nan)
to something like
>>>df['Final Weight'] = np.where((df['Percentile'] >= 0.7) | (df['Final Weight'] != 0), 1, np.nan)
to allow that to be recognized and propagate.
I think you may want to use the pandas.Series rolling window method.
Perhaps something like this:
import pandas as pd
import numpy as np

grouped = df.groupby('Stock')
df['MaxPercentileToDate'] = np.nan
df.index = df['Date']
for name, group in grouped:
    df.loc[df.Stock==name, 'MaxPercentileToDate'] = group['Percentile'].rolling(min_periods=0, window=4).max()

# Mask selects rows that have ever been greater than 0.75 (including the current row in the max)
# and are currently greater than 0.5
mask = ((df['MaxPercentileToDate'] > 0.75) & (df['Percentile'] > 0.5))
df.loc[mask, 'Finalweight'] = df.loc[mask, 'Weight']
I believe this assumes the values are sorted by date (which your initial dataset seems to be), and you would also have to adjust the window parameter to be at least the maximum number of entries per stock.
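If the fixed window is a concern, one alternative (a sketch, not part of the answer above) is to compute the running maximum per stock with groupby().cummax(), which looks back over the stock's entire history instead of the last 4 rows:
import pandas as pd

# Assumes df has columns Stock, Weight, Percentile and is already sorted by date.
# cummax() gives, for each row, the highest percentile the stock has reached so
# far (including the current row), mirroring the rolling max above without a
# fixed window length.
df['MaxPercentileToDate'] = df.groupby('Stock')['Percentile'].cummax()

mask = (df['MaxPercentileToDate'] > 0.75) & (df['Percentile'] > 0.5)
df['Finalweight'] = df['Weight'].where(mask, 0)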

python pandas change dataframe to pivoted columns

I have a dataframe that looks as following:
Type Month Value
A 1 0.29
A 2 0.90
A 3 0.44
A 4 0.43
B 1 0.29
B 2 0.50
B 3 0.14
B 4 0.07
I want to change the dataframe to following format:
Type A B
1 0.29 0.29
2 0.90 0.50
3 0.44 0.14
4 0.43 0.07
Is this possible?
Use set_index + unstack
df.set_index(['Month', 'Type']).Value.unstack()
Type A B
Month
1 0.29 0.29
2 0.90 0.50
3 0.44 0.14
4 0.43 0.07
To match your exact output
df.set_index(['Month', 'Type']).Value.unstack().rename_axis(None)
Type A B
1 0.29 0.29
2 0.90 0.50
3 0.44 0.14
4 0.43 0.07
Pivot solution:
In [70]: df.pivot(index='Month', columns='Type', values='Value')
Out[70]:
Type A B
Month
1 0.29 0.29
2 0.90 0.50
3 0.44 0.14
4 0.43 0.07
In [71]: df.pivot(index='Month', columns='Type', values='Value').rename_axis(None)
Out[71]:
Type A B
1 0.29 0.29
2 0.90 0.50
3 0.44 0.14
4 0.43 0.07
You have a long-format table that you want to transform to wide format.
This is natively handled in pandas:
df.pivot(index='Month', columns='Type', values='Value')
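For reference, a self-contained version of the example (the data is taken from the question; note that pivot raises an error if a Month/Type pair appears more than once, in which case pivot_table with an aggregation function would be needed):
import pandas as pd

df = pd.DataFrame({
    'Type':  ['A'] * 4 + ['B'] * 4,
    'Month': [1, 2, 3, 4] * 2,
    'Value': [0.29, 0.90, 0.44, 0.43, 0.29, 0.50, 0.14, 0.07],
})

wide = df.pivot(index='Month', columns='Type', values='Value')
print(wide)
# Type      A     B
# Month
# 1      0.29  0.29
# 2      0.90  0.50
# 3      0.44  0.14
# 4      0.43  0.07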

DataFrame of means of top N most correlated columns

I have a dataframe df1 where each column represents a time series of returns. I want to create a new dataframe df2 with a column corresponding to each column in df1, where each column of df2 is defined to be the average of the 5 columns in df1 most correlated with it.
import pandas as pd
import numpy as np
from string import ascii_letters
np.random.seed([3,1415])
df1 = pd.DataFrame(np.random.randn(100, 10).round(2),
columns=list(ascii_letters[26:36]))
print df1.head()
A B C D E F G H I J
0 -2.13 -1.27 -1.97 -2.26 -0.35 -0.03 0.32 0.35 0.72 0.77
1 -0.61 0.35 -0.35 -0.42 -0.91 -0.14 0.75 -1.50 0.61 0.40
2 -0.96 1.49 -0.35 -1.47 1.06 1.06 0.59 0.30 -0.77 0.83
3 1.49 0.26 -0.90 0.38 -0.52 0.05 0.95 -1.03 0.95 0.73
4 1.24 0.16 -1.34 0.16 1.26 0.78 1.34 -1.64 -0.20 0.13
I expect the head of the resulting dataframe rounded to 2 places to look like:
A B C D E F G H I J
0 -0.78 -0.70 -0.53 -0.45 -0.99 -0.10 -0.47 -0.86 -0.31 -0.64
1 -0.49 -0.11 -0.45 -0.03 -0.04 0.10 -0.26 0.11 -0.06 -0.10
2 0.03 0.13 0.54 0.33 -0.13 0.27 0.22 0.32 0.41 0.27
3 -0.22 0.13 0.19 0.58 0.63 0.24 0.34 0.51 0.32 0.22
4 -0.04 0.31 0.23 0.52 0.43 0.24 0.07 0.31 0.73 0.43
For each column in the correlation matrix, take the six largest values and ignore the first one (a column is 100% correlated with itself). Use a dictionary comprehension to do this for each column.
Use another dictionary comprehension to locate these columns in df1 and take their mean. Create a dataframe from the result, and reorder the columns to match those of df1 by appending [df1.columns].
corr = df1.corr()
most_correlated_cols = {col: corr[col].nlargest(6)[1:].index
for col in corr}
df2 = pd.DataFrame({col: df1.loc[:, most_correlated_cols[col]].mean(axis=1)
for col in df1})[df1.columns]
>>> df2.head()
A B C D E F G H I J
0 -0.782 -0.698 -0.526 -0.452 -0.994 -0.102 -0.472 -0.856 -0.310 -0.638
1 -0.486 -0.106 -0.454 -0.032 -0.042 0.100 -0.258 0.108 -0.064 -0.102
2 0.026 0.132 0.544 0.330 -0.130 0.272 0.224 0.320 0.414 0.274
3 -0.224 0.128 0.186 0.582 0.626 0.242 0.344 0.506 0.318 0.224
4 -0.044 0.310 0.230 0.518 0.428 0.238 0.068 0.306 0.734 0.432
%%timeit
corr = df1.corr()
most_correlated_cols = {
col: corr[col].nlargest(6)[1:].index
for col in corr}
df2 = pd.DataFrame({col: df1.loc[:, most_correlated_cols[col]].mean(axis=1)
for col in df1})[df1.columns]
100 loops, best of 3: 10 ms per loop
%%timeit
corr = df1.corr()
df2 = corr.apply(argsort).head(5).apply(lambda x: avg_of(x, df1))
100 loops, best of 3: 16 ms per loop
Setup
import pandas as pd
import numpy as np
from string import ascii_letters
np.random.seed([3,1415])
df1 = pd.DataFrame(np.random.randn(100, 10).round(2),
columns=list(ascii_letters[26:36]))
Solution
corr = df1.corr()

# I don't want a security's correlation with itself to be included.
# Because `corr` is symmetrical, I can assume that a series' name will be in its index.
def remove_self(x):
    return x.loc[x.index != x.name]

# This utilizes `remove_self`, then sorts by correlation
# and returns the index.
def argsort(x):
    return pd.Series(remove_self(x).sort_values(ascending=False).index)

# This reaches into `df1` and gets all columns identified in x,
# then takes the mean.
def avg_of(x, df):
    return df.loc[:, x].mean(axis=1)

# Putting it all together.
df2 = corr.apply(argsort).head(5).apply(lambda x: avg_of(x, df1))
print df2.round(2).head()
A B C D E F G H I J
0 -0.78 -0.70 -0.53 -0.45 -0.99 -0.10 -0.47 -0.86 -0.31 -0.64
1 -0.49 -0.11 -0.45 -0.03 -0.04 0.10 -0.26 0.11 -0.06 -0.10
2 0.03 0.13 0.54 0.33 -0.13 0.27 0.22 0.32 0.41 0.27
3 -0.22 0.13 0.19 0.58 0.63 0.24 0.34 0.51 0.32 0.22
4 -0.04 0.31 0.23 0.52 0.43 0.24 0.07 0.31 0.73 0.43

Creating Multi-hierarchy pivot table in Pandas

1. Background
The .xls files I have contain parameters for multiple pollutants, measured in several ways, for different sites.
I created a simplified dataframe below as an illustration:
Some declarations:
Column Site contains the monitoring site names. In this case, sites S1 and S2 are the only two locations.
Column Time contains the monitoring period for the different sites.
Species A & B represent two chemical pollutants that were detected.
Conc is one key parameter for each species (A & B) and represents the concentration. Notice that the concentration of species A is measured twice, in parallel.
P and Q are two different analysis experiments. Since species A has two samples, it has P1, P2, P3 & Q1, Q2 as its analysis results. Species B has only been analyzed by P, so P1, P2, P3 are its only parameters.
After reading some posts on manipulating pivot_table in pandas, I wanted to give it a try.
2. My target
I mocked up my target file structure manually in Excel; it looks like this:
3. My work
df = pd.ExcelFile("./test_file.xls")
df = df.parse("Sheet1")
pd.pivot_table(df,index = ["Site","Time","Species"])
This is the result:
Update
What I'm trying to figure out is how to create two columns P & Q with sub-columns below them.
I have re-uploaded my test file here. Anyone interested can download it.
The P and Q tests are for each sample of species A respectively.
The Conc test is for both of them.
Any advice would be appreciated!
IIUC
You want the same dataframe, but with a better column index.
To create the first level:
level0 = df.columns.str.extract(r'([^\d]*)', expand=False)
then assign a multiindex to the columns attribute.
df.columns = pd.MultiIndex.from_arrays([level0, df.columns])
Looks like:
print df
Conc P Q
Conc P1 P2 P3 Q1 Q2
Site Time Species
S1 20141222 A 0.79 0.02 0.62 1.05 0.01 1.73
20141228 A 0.13 0.01 0.79 0.44 0.01 1.72
20150103 B 0.48 0.03 1.39 0.84 NaN NaN
20150104 A 0.36 0.02 1.13 0.31 0.01 0.94
20150109 A 0.14 0.01 0.64 0.35 0.00 1.00
20150114 B 0.47 0.08 1.16 1.40 NaN NaN
20150115 A 0.62 0.02 0.90 0.95 0.01 2.63
20150116 A 0.71 0.03 1.72 1.71 0.01 2.53
20150121 B 0.61 0.03 0.67 0.87 NaN NaN
S2 20141222 A 0.23 0.01 0.66 0.44 0.01 1.49
20141228 A 0.42 0.06 0.99 1.56 0.00 2.18
20150103 B 0.09 0.01 0.56 0.12 NaN NaN
20150104 A 0.18 0.01 0.56 0.36 0.00 0.67
20150109 A 0.50 0.03 0.74 0.71 0.00 1.11
20150114 B 0.64 0.06 1.76 0.92 NaN NaN
20150115 A 0.58 0.05 0.77 0.95 0.01 1.54
20150116 A 0.93 0.04 1.33 0.69 0.00 0.82
20150121 B 0.33 0.09 1.33 0.76 NaN NaN
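The same column trick on a small, self-contained stand-in for the parsed sheet (the two rows are taken from the printout above; the construction of the frame itself is just for illustration):
import pandas as pd

# Flat columns Conc, P1, P2, P3, Q1, Q2, indexed by (Site, Time, Species).
df = pd.DataFrame(
    {'Conc': [0.79, 0.48], 'P1': [0.02, 0.03], 'P2': [0.62, 1.39],
     'P3': [1.05, 0.84], 'Q1': [0.01, None], 'Q2': [1.73, None]},
    index=pd.MultiIndex.from_tuples(
        [('S1', 20141222, 'A'), ('S1', 20150103, 'B')],
        names=['Site', 'Time', 'Species']))

# First level: the column name with trailing digits stripped
# ('P1' -> 'P', 'Q2' -> 'Q', 'Conc' -> 'Conc').
level0 = df.columns.str.extract(r'([^\d]*)', expand=False)

# Pair it with the original names to get a two-level column index.
df.columns = pd.MultiIndex.from_arrays([level0, df.columns])
print(df)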
