Filling and renaming a dataset at the same time - Python

I would like to fill a dataset and compute the log returns at the same time.
These are the return column names:
ret_names = ['FTSEMIB_Index_ret', 'FCA_IM_Equity_ret', 'UCG_IM_Equity_ret',
             'ISP_IM_Equity_ret', 'ENI_IM_Equity_ret', 'LUX_IM_Equity_ret']
and this is the DataFrame:
   FTSEMIB_Index  FCA_IM_Equity  UCG_IM_Equity  ISP_IM_Equity  ENI_IM_Equity  LUX_IM_Equity
0       22793.69         14.840         16.430         2.8860         14.040          49.24
1       22991.99         15.150         16.460         2.8780         14.220          48.98
2       23046.05         15.290         16.760         2.8660         14.300          48.70
3       23014.13         15.660         16.390         2.8500         14.380          48.72
4       23002.85         15.590         16.300         2.8420         14.500          49.13
My idea is to use enumerate in a for loop:
for index, name in enumerate(ret_names):
    df[name] = np.diff(np.log(df.iloc[:, index]))
but the lengths don't match: taking the returns drops one value (the first one, I suppose), so the result is one element shorter than the DataFrame.
Any idea?

Maybe I found a solution, and I can also see why the previous one doesn't work: np.diff returns an array one element shorter than the column, so it cannot be assigned back to the DataFrame. Using shift keeps the original length instead, with NaN in the first row:
for index, name in enumerate(ret_names):
    df[name] = np.log(df.iloc[:, index]) - np.log(df.iloc[:, index]).shift(1)  # log return: ln(p_t) - ln(p_t-1)
With this you can fill the dataset and assign the new names at the same time.
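Since every price column gets the same transformation, a loop-free variant is also possible. A minimal sketch, assuming df contains exactly the six price columns in the same order as ret_names:
import numpy as np
import pandas as pd

log_ret = np.log(df).diff()  # log-difference all columns at once; row 0 becomes NaN
log_ret.columns = ret_names
df = pd.concat([df, log_ret], axis=1)  # keep the prices and append the return columns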

Incremental counter if the value is the same before the point

I have the following STRING column in a pandas DataFrame:
HOURCENTSEG (string-column)
070026.16169
070026.16169
070026.16169
070026.16169
070052.85555
070052.85555
070109.43620
070202.56430
070202.56431
070202.56434
070202.56434
As you can see, many elements share the same time before the point. To avoid these overlaps I must replace the digits after the point with an incremental counter, as shown in the following output example:
HOURCENTSEG (string-column)
070026.00001
070026.00002
070026.00003
070026.00004
070052.00001
070052.00002
070109.00001 (if there is only one value it's just 00001)
070202.00001
070202.00002
070202.00003
070202.00004
It is a poorly designed legacy application and I have no option other than to solve it this way.
Summary: add an incremental counter after the point whenever the number to the left of the point repeats, with a width of 5 digits, zero-padded on the left.
Use GroupBy.cumcount on the values split by '.' (keeping the first part of each split), then left-pad the counter with zeros via Series.str.zfill:
s = df['HOURCENTSEG'].str.split('.').str[0]
df['HOURCENTSEG'] = s + '.' + s.groupby(s).cumcount().add(1).astype(str).str.zfill(5)
print(df)
HOURCENTSEG
0 070026.00001
1 070026.00002
2 070026.00003
3 070026.00004
4 070052.00001
5 070052.00002
6 070109.00001
7 070202.00001
8 070202.00002
9 070202.00003
10 070202.00004
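To see what each piece contributes, here are the intermediate values over the example data (a sketch; note that cumcount numbers the rows within each group, so this also works when equal keys are not adjacent):
s = df['HOURCENTSEG'].str.split('.').str[0]  # '070026', '070026', ..., '070202'
counter = s.groupby(s).cumcount().add(1)     # 1, 2, 3, 4, 1, 2, 1, 1, 2, 3, 4
df['HOURCENTSEG'] = s + '.' + counter.astype(str).str.zfill(5)  # zfill pads to 5 digits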

Python Lookup - Mapping Dynamic Ranges (fast)

This is an extension to a question I posted earlier: Python Sum lookup dynamic array table with df column
I'm currently investigating a way to efficiently map a decision variable to a dataframe. The main DF and the lookup table are dynamic in length (+15,000 lines and +20 lines, respectively), so I was hoping to avoid a loop, but I'm happy to hear suggestions.
The DF (DF1) will mostly look like the following, where I would like to look up/search for the decision.
The decision value is found in a separate DF (DF0).
For example: the first DF1["ValuesWhereXYcomefrom"] value is 6.915, which falls in the key-table interval 3.8 <= value < 7.4, so the corresponding DF0["Decision"] is -1. The process then repeats until every line is mapped to a decision.
I was thinking of using the Python bisect library, but I have not arrived at a working solution with it, and I am otherwise working with a loop. Now I'm wondering if I am looking at the problem incorrectly, since mapping 15k lines with a loop is time-consuming.
Example Main Data (DF1):
time  Value0     Value1       Value2       ValuesWhereXYcomefrom  Value_toSum   Decision Map
1     41.43      6.579482077                                      0.00531021
2     41.650002  6.756817908  46.72466411  6.915187703            0.001200456   -1
3     41.700001  6.221966706  11.64727001  1.871959552            0.000959257   -1
4     41.740002  6.230847055  8.031374656  7.531485368            0.006228989    1
5     42         6.637399856  8.031374656  1.210018204            0.010238095   -1
6     42.43      7.484894608  16.24547568  2.170434793           -0.007777563   -1
7     42.099998  7.595291765  38.73871244  5.100358702            0.003562993   -1
8     42.25      7.567457423  37.07538953  4.899319211            0.01088755    -1
9     42.709999  8.234795546  64.27986403  7.805884636            0.005151042    1
10    42.93      8.369526407  24.72700129  2.954408659           -0.003028209   -1
11    42.799999  8.146653099  61.52243361  7.55186613             0              1
Example KeyTable (DF0):
ValueX       ValueY       SUM          Decision
0.203627201  3.803627201  0.040294925  -1
3.803627201  7.403627201  0.031630668  -1
7.403627201  11.0036272   0.011841521   1
Here's how I would go about this, assuming your first DataFrame is called df and your second is decision:
def map_func(x):
    # walk the key table top to bottom; return the first decision whose
    # upper bound (ValueY) exceeds x
    for i in range(len(decision)):
        if x < decision["ValueY"].iloc[i]:
            return decision["Decision"].iloc[i]
    return np.nan  # x lies beyond the last interval

df["decision"] = df["ValuesWhereXYcomefrom"].apply(map_func)
This will create a new column in your DataFrame called "decision" that contains the looked-up value. You can then just query it:
df.decision.iloc[row]
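Since the question asks for speed over 15,000 rows, a vectorized lookup may help. Here is a sketch using np.searchsorted; it assumes decision["ValueY"] is sorted ascending and the intervals are contiguous (both hold for the example key table), and it reproduces the loop's semantics, including NaN for values beyond the last interval (and for NaN inputs):
import numpy as np

edges = decision["ValueY"].to_numpy()
vals = df["ValuesWhereXYcomefrom"].to_numpy()
pos = np.searchsorted(edges, vals, side="right")  # first key-table row with ValueY > x
in_range = pos < len(edges)
df["decision"] = np.where(
    in_range,
    decision["Decision"].to_numpy()[np.minimum(pos, len(edges) - 1)],
    np.nan,
)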

How can I count the elements in specific intervals in a dataframe?

I've got a dataframe like the one below, where column c01 holds the start time and c04 the end time of each interval:
c01 c04
1742 8.444991 14.022029
3786 29.91143 31.422439
3951 29.91143 31.145099
5402 37.81136 42.689595
8230 63.12394 65.34602
also a list like this (it's actually way longer):
8.522494
8.54471
8.578426
8.611193
8.644996
8.678053
8.710918
8.744901
8.777851
8.811053
8.844867
8.878389
8.912099
8.944729
8.977601
9.011232
9.04492
9.078157
9.111946
9.144788
9.177663
9.211054
9.245265
9.27805
9.311766
9.344647
9.377612
9.411709
I'd like to count how many elements in the list fall within the intervals given by the dataframe. I coded it like this:
count = 0
for index, row in speech.iterrows():
    count += gtls.count(lambda i : i in [row['c01'], row['c04']])
The script runs as a whole, but count always turns out to be 0. Could you please tell me where I messed up?
I took the liberty of converting your list into a numpy array (I called it arr). Then you can use the apply function to create your count column. Let's assume your dataframe is called df.
def get_count(row):
    # count the elements of arr that fall inside this row's interval
    return np.sum((row['c01'] < arr) & (row['c04'] >= arr))

df['C_sum'] = df.apply(get_count, axis=1)
print(df)
Output:
c01 c04 C_sum
0 8.444991 14.022029 28
1 29.911430 31.422439 0
2 29.911430 31.145099 0
3 37.811360 42.689595 0
4 63.123940 65.346020 0
You can also do the whole thing in one line using a lambda:
df['C_sum'] = df.apply(lambda row: np.sum((row['c01'] < arr) & (row['c04'] >= arr)), axis=1)
Welcome to Stack Overflow! The expression i in [row['c01'], row['c04']] doesn't do what you seem to think: it checks whether element i equals one of the two list items, not whether it lies in the range between row['c01'] and row['c04']. To check whether a floating point number is within a range, use row['c01'] < i < row['c04'].
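Note also that list.count expects a value, not a predicate, so passing a lambda counts occurrences of the lambda object itself and always returns 0. A corrected version of the original loop could look like this (a sketch, assuming gtls is a plain list of floats and speech is the dataframe from the question):
count = 0
for index, row in speech.iterrows():
    # count the list elements strictly inside this row's (c01, c04) interval
    count += sum(1 for i in gtls if row['c01'] < i < row['c04'])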

Python: pass string to pandas dataframe in a specific format

I am not entirely sure if this is possible but I thought I would go ahead and ask. I currently have a string that looks like the following:
myString =
"{"Close":175.30,"DownTicks":122973,"DownVolume":18639140,"High":177.47,"Low":173.66,"Open":177.32,"Status":29,"TimeStamp":"\/Date(1521489600000)\/","TotalTicks":245246,"TotalVolume":33446771,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":122273,"UpVolume":14807630,"OpenInterest":0}
{"Close":175.24,"DownTicks":69071,"DownVolume":10806836,"High":176.80,"Low":174.94,"Open":175.24,"Status":536870941,"TimeStamp":"\/Date(1521576000000)\/","TotalTicks":135239,"TotalVolume":19649350,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":66168,"UpVolume":8842514,"OpenInterest":0}"
The datasets can vary in length (this example has 2 datasets, but there could be more); however, the parameters will always be the same (Close, DownTicks, DownVolume, etc.).
Is there a way to create a dataframe from this string that takes the parameters as the index, and the numbers as the values in the column? So the dataframe would look something like this:
df =
0 1
index
Close 175.30 175.24
DownTicks 122973 69071
DownVolume 18639140 10806836
High 177.47 176.80
Low 173.66 174.94
Open 177.32 175.24
(etc)...
It looks like there are some issues with your input. As mentioned by @lmiguelvargasf, there's a missing comma at the end of the first dictionary. Additionally, there's a \n, which a simple str.replace can fix.
Once those issues have been solved, the process is pretty simple.
import ast
import pandas as pd

myString = '''{"Close":175.30,"DownTicks":122973,"DownVolume":18639140,"High":177.47,"Low":173.66,"Open":177.32,"Status":29,"TimeStamp":"\/Date(1521489600000)\/","TotalTicks":245246,"TotalVolume":33446771,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":122273,"UpVolume":14807630,"OpenInterest":0}
{"Close":175.24,"DownTicks":69071,"DownVolume":10806836,"High":176.80,"Low":174.94,"Open":175.24,"Status":536870941,"TimeStamp":"\/Date(1521576000000)\/","TotalTicks":135239,"TotalVolume":19649350,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":66168,"UpVolume":8842514,"OpenInterest":0}'''
myString = myString.replace('\n', ',')
list_of_dicts = list(ast.literal_eval(myString))
df = pd.DataFrame.from_dict(list_of_dicts).T
df
0 1
Close 175.3 175.24
DownTicks 122973 69071
DownVolume 18639140 10806836
High 177.47 176.8
Low 173.66 174.94
Open 177.32 175.24
OpenInterest 0 0
Status 29 536870941
TimeStamp \/Date(1521489600000)\/ \/Date(1521576000000)\/
TotalTicks 245246 135239
TotalVolume 33446771 19649350
UnchangedTicks 0 0
UnchangedVolume 0 0
UpTicks 122273 66168
UpVolume 14807630 8842514

Difficulty adding up elements in a pandas DataFrame

I'm currently having trouble adding up the rows of the following DataFrame, which I have constructed from the returns of six companies' stocks:
def importdata(data):
    returns = pd.read_excel(data)                    # imports the data from Excel
    returns_with_dates = returns.set_index('Dates')  # sets the Dates as the df index
    return returns_with_dates
which outputs:
Out[345]:
Company 1 Company 2 Company 3 Company 4 Company 5 Company 6
Dates
1997-01-02 31.087620 3.094705 24.058686 31.694404 37.162890 13.462241
1997-01-03 31.896592 3.109631 22.423629 32.064378 37.537013 13.511706
1997-01-06 31.723241 3.184358 18.803148 32.681000 37.038183 13.684925
1997-01-07 31.781024 3.199380 19.503886 33.544272 37.038183 13.660193
1997-01-08 31.607673 3.169431 19.387096 32.927650 37.537013 13.585995
1997-01-09 31.492106 3.199380 19.737465 33.420948 37.038183 13.759214
1997-01-10 32.589996 3.184358 19.270307 34.284219 37.661721 13.858235
1997-01-13 32.416645 3.199380 19.153517 35.147491 38.035844 13.660193
1997-01-14 32.301077 3.184358 19.503886 35.517465 39.407629 13.783946
1997-01-15 32.127726 3.199380 19.387096 35.887438 38.409967 13.759214
1997-01-16 32.532212 3.229232 19.737465 36.257412 39.282921 13.635460
1997-01-17 33.167833 3.259180 20.087835 37.490657 39.033505 13.858235
1997-01-20 33.456751 3.229232 20.438204 35.640789 39.657044 14.377892
1997-01-21 33.225616 3.244158 20.671783 36.010763 40.779413 14.179940
1997-01-22 33.110049 3.289033 21.489312 36.010763 40.654705 14.254138
1997-01-23 32.705563 3.199380 20.905363 35.394140 40.904121 14.229405
1997-01-24 32.127726 3.139579 20.204624 35.764114 40.405290 13.957165
1997-01-27 32.127726 3.094705 20.204624 35.270816 40.779413 13.882968
1997-01-28 31.781024 3.079778 20.788573 34.407544 41.153536 13.684925
1997-01-29 32.185510 3.094705 21.138942 34.654193 41.278244 13.858235
1997-01-30 32.647779 3.094705 21.022153 34.407544 41.652367 13.981898
1997-01-31 32.532212 3.064757 20.204624 34.037570 42.275905 13.858235
For countless hours I have tried summing the rows so that I add up 1997-01-02 through 1997-01-08, then 1997-01-09 through 1997-01-15, and so on: the first five rows, then the following five. Furthermore, I want to keep the date of the 5th element as the index of the summed row, so when adding the rows from 1997-01-02 to 1997-01-08, the index should be 1997-01-08. I have been using five rows as an example, but ideally I want to add up every n rows, then the following n rows, keeping the dates in the same way. I have figured out a way of doing it in array form, shown in the code below, but then I don't get to keep the dates.
def nday_returns_array(data, n, asset_number):
    returns = pd.read_excel(data)                    # imports the data from Excel
    returns_with_dates = returns.set_index('Dates')  # sets the Dates as the df index
    returns_mat = returns_with_dates.as_matrix()
    ndays = int(len(returns_mat) / n)                # number of n-day periods in the sample
    # empty array to accommodate the n-day returns
    nday_returns = np.empty((ndays, min(np.shape(returns_mat))))
    for i in range(1, asset_number + 1):
        for j in range(1, ndays + 1):
            nday_returns[j - 1, i - 1] = np.sum(returns_mat[(n * j) - n:n * j, i - 1])
    return nday_returns
Is there any way of doing this in a DataFrame context whilst maintaining the dates in the way I said before? I've been trying to do this for so long without any kind of success and it's really stressing me out! Everyone seems to find pandas extremely useful and easy to use, but I happen to find it the opposite. Any kind of help would be very much appreciated. Thanks in advance.
Use groupby:
df.groupby(np.arange(len(df)) // 5).sum()
To include the date index as requested:
g = np.arange(len(df)) // 5
i = df.index.to_series().groupby(g).last()
df.groupby(g).sum().set_index(i)
If the dates are missing at a regular rate (here, two weekend days per five trading days), you can resample by the number of calendar days you desire. Using resample keeps the dates in the index, and you can use the loffset parameter to shift them.
df.resample('7D', loffset='6D').sum()
Company 1 Company 2 Company 3 Company 4 Company 5 \
Dates
1997-01-08 158.096150 15.757505 104.176445 162.911704 186.313282
1997-01-15 160.927550 15.966856 97.052271 174.257561 190.553344
1997-01-22 165.492461 16.250835 102.424599 181.410384 199.407588
1997-01-29 160.927549 15.608147 103.242126 175.490807 204.520604
1997-02-05 65.179991 6.159462 41.226777 68.445114 83.928272
Company 6
Dates
1997-01-08 67.905060
1997-01-15 68.820802
1997-01-22 70.305665
1997-01-29 69.612698
1997-02-05 27.840133
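If the groups should always be n consecutive rows regardless of calendar gaps, another option is a rolling sum sampled at every n-th row, which keeps the DatetimeIndex automatically. A sketch (assuming df is the returns frame from the question; trailing rows that do not fill a complete window are dropped):
n = 5
# sum each window of n consecutive rows, then keep every n-th result;
# each kept row is labelled with the last date in its window (e.g. 1997-01-08)
nday_returns = df.rolling(n).sum().iloc[n - 1::n]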
