Python Lookup - Mapping Dynamic Ranges (fast) - python

This is an extension of a question I posted earlier: Python Sum lookup dynamic array table with df column
I'm currently investigating a way to efficiently map a decision variable to a dataframe. The main DF and the lookup table will both be dynamic in length (15,000+ rows and 20+ rows, respectively), so I was hoping not to do this with a loop, but I'm happy to hear suggestions.
The DF (DF1) will mostly look like the following, and I would like to look up/search for the decision for each row.
The decision value is found on a separate DF (DF0).
For example: the first DF1["ValuesWhereXYcomefrom"] value is 6.915, which falls in the 3.8 <= value < 7.4 interval of the key table, so the corresponding DF0["Decision"] value is -1. The process then repeats until every line is mapped to a decision.
I was thinking of using Python's bisect library, but I haven't managed a working solution with it, nor with a loop. Now I'm wondering if I'm looking at the problem incorrectly, since mapping 15k lines with a loop is time-consuming.
Example Main Data (DF1):
time  Value0     Value1       Value2       ValuesWhereXYcomefrom  Value_toSum   Decision Map
1     41.43      6.579482077  NaN          NaN                     0.00531021   NaN
2     41.650002  6.756817908  46.72466411  6.915187703             0.001200456  -1
3     41.700001  6.221966706  11.64727001  1.871959552             0.000959257  -1
4     41.740002  6.230847055  46.92753343  7.531485368             0.006228989   1
5     42         6.637399856  8.031374656  1.210018204             0.010238095  -1
6     42.43      7.484894608  16.24547568  2.170434793            -0.007777563  -1
7     42.099998  7.595291765  38.73871244  5.100358702             0.003562993  -1
8     42.25      7.567457423  37.07538953  4.899319211             0.01088755   -1
9     42.709999  8.234795546  64.27986403  7.805884636             0.005151042   1
10    42.93      8.369526407  24.72700129  2.954408659            -0.003028209  -1
11    42.799999  8.146653099  61.52243361  7.55186613              0             1
Example KeyTable (DF0):
ValueX       ValueY       SUM          Decision
0.203627201  3.803627201  0.040294925  -1
3.803627201  7.403627201  0.031630668  -1
7.403627201  11.0036272   0.011841521   1

Here's how I would go about this, assuming your first DataFrame is called df and your second is decision:
import numpy as np

def map_func(x):
    # Walk the key table in order and return the decision for the first
    # interval whose upper bound exceeds x.
    for i in range(len(decision)):
        if x < decision["ValueY"].iloc[i]:
            return decision["Decision"].iloc[i]
    return np.nan  # x falls beyond the last interval (or is NaN)

df["decision"] = df["ValuesWhereXYcomefrom"].apply(map_func)
This will create a new column in your DataFrame called "decision" that contains the looked-up value. You can then just query it:
df.decision.iloc[row]
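Since you mentioned bisect: the same "first upper bound greater than x" lookup can be done for all 15k rows at once with NumPy's searchsorted, avoiding the Python-level loop entirely. A minimal sketch, assuming the key table's intervals are contiguous and its ValueY column is sorted ascending, as in the example DF0:
import numpy as np

bounds = decision["ValueY"].to_numpy()
choices = decision["Decision"].to_numpy().astype(float)

# For every value, find the index of the first upper bound strictly greater
# than it; side="right" matches the strict x < ValueY comparison above.
idx = np.searchsorted(bounds, df["ValuesWhereXYcomefrom"].to_numpy(), side="right")

# Values beyond the last interval (including NaN) map to NaN.
df["decision"] = np.where(idx < len(bounds), choices[np.minimum(idx, len(bounds) - 1)], np.nan)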

Related

Incremental counter if the value is the same before the point

I have the following string column in a pandas DataFrame:
HOURCENTSEG (string column)
070026.16169
070026.16169
070026.16169
070026.16169
070052.85555
070052.85555
070109.43620
070202.56430
070202.56431
070202.56434
070202.56434
As you can see, many rows share the same time before the decimal point. To avoid duplicate timestamps, I must replace the digits after the point with an incremental counter, as shown in the following output example.
HOURCENTSEG (string column)
070026.00001
070026.00002
070026.00003
070026.00004
070052.00001
070052.00002
070109.00001 (if there is only one value it's just 00001)
070202.00001
070202.00002
070202.00003
070202.00004
The application was poorly designed in the past, and I have no other way to solve this.
Summary: add an incremental counter after the point, five digits wide and zero-padded on the left, restarting whenever the number to the left of the point changes.
Use GroupBy.cumcount on the values split by '.' (keeping the part before the point), then left-pad the counter with zeros using Series.str.zfill:
# Keep only the part before the point, then number the duplicates within
# each group and zero-pad the counter to five digits.
s = df['HOURCENTSEG'].str.split('.').str[0]
df['HOURCENTSEG'] = s + '.' + s.groupby(s).cumcount().add(1).astype(str).str.zfill(5)
print(df)
HOURCENTSEG
0 070026.00001
1 070026.00002
2 070026.00003
3 070026.00004
4 070052.00001
5 070052.00002
6 070109.00001
7 070202.00001
8 070202.00002
9 070202.00003
10 070202.00004

How can I count the elements in specific intervals in a dataframe?

I've got a dataframe like the one below, where column c01 holds the start time and c04 the end time of each interval:
c01 c04
1742 8.444991 14.022029
3786 29.91143 31.422439
3951 29.91143 31.145099
5402 37.81136 42.689595
8230 63.12394 65.34602
I also have a list like this (it's actually way longer):
8.522494
8.54471
8.578426
8.611193
8.644996
8.678053
8.710918
8.744901
8.777851
8.811053
8.844867
8.878389
8.912099
8.944729
8.977601
9.011232
9.04492
9.078157
9.111946
9.144788
9.177663
9.211054
9.245265
9.27805
9.311766
9.344647
9.377612
9.411709
I'd like to count how many elements in the list fall in the intervals given by the dataframe. I coded it like this:
count = 0
for index, row in speech.iterrows():
    count += gtls.count(lambda i : i in [row['c01'], row['c04']])
The code runs, but every count turns out to be 0. Would you please tell me where I messed up?
I took the liberty of converting your list into a NumPy array (I called it arr). Then you can use the apply function to create your count column. Let's assume your dataframe is called df.
import numpy as np

def get_count(row):
    # Count the elements of arr that fall inside the interval (c01, c04].
    return np.sum((row['c01'] < arr) & (row['c04'] >= arr))

df['C_sum'] = df.apply(get_count, axis=1)
print(df)
Output:
c01 c04 C_sum
0 8.444991 14.022029 28
1 29.911430 31.422439 0
2 29.911430 31.145099 0
3 37.811360 42.689595 0
4 63.123940 65.346020 0
You can also do the whole thing in one line with a lambda:
df['C_sum'] = df.apply(lambda row: np.sum((row['c01'] < arr) & (row['c04'] >= arr)), axis=1)
Welcome to Stack Overflow! The i in [row['c01'], row['c04']] doesn't do what you seem to think: it checks whether i equals one of the two elements of that list, not whether i lies in the range between row['c01'] and row['c04']. To check whether a floating-point number is within a range, use row['c01'] < i < row['c04']. Note also that list.count takes a value, not a predicate, so gtls.count(lambda ...) counts how many list elements equal that lambda object, which is always zero.
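Following that advice, a minimal corrected version of the original loop (assuming gtls is the plain Python list of floats and speech is the intervals dataframe from the question):
count = 0
for index, row in speech.iterrows():
    # Count the list elements that lie strictly inside this row's interval.
    count += sum(1 for i in gtls if row['c01'] < i < row['c04'])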

Filling and renaming a dataset at the same time

I would like to fill a dataset and compute the log returns at the same time.
These are the return column names:
ret_names = ['FTSEMIB_Index_ret', 'FCA_IM_Equity_ret', 'UCG_IM_Equity_ret',
             'ISP_IM_Equity_ret', 'ENI_IM_Equity_ret', 'LUX_IM_Equity_ret']
and this is the DataFrame:
FTSEMIB_Index FCA_IM_Equity UCG_IM_Equity ISP_IM_Equity ENI_IM_Equity LUX_IM_Equity
0 22793.69 14.840 16.430 2.8860 14.040 49.24
1 22991.99 15.150 16.460 2.8780 14.220 48.98
2 23046.05 15.290 16.760 2.8660 14.300 48.70
3 23014.13 15.660 16.390 2.8500 14.380 48.72
4 23002.85 15.590 16.300 2.8420 14.500 49.13
So my idea is to use enumerate in a for loop:
for index, name in enumerate(ret_names):
    df[name] = np.diff(np.log(df.iloc[:, index]))
but the lengths don't match, because taking the returns drops one value (the first one, I suppose).
Any idea?
Maybe I found a solution, but I can't figure out why the previous one doesn't work:
for index, name in enumerate(ret_names):
    df[name] = np.log(df.iloc[:, index]) / np.log(df.iloc[:, index]).shift(1)
With this you can fill and assign the name at the same time, and the column keeps its original length (the first row becomes NaN).
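Note that dividing logs is not the standard definition of a log return; log returns are usually the difference of logs, log(P_t) - log(P_{t-1}). A sketch of that variant, which also keeps the original column length by leaving the first row NaN:
import numpy as np

for index, name in enumerate(ret_names):
    # shift(1) aligns each price with the previous row, so the difference of
    # logs has the same length as the column, with NaN in the first row.
    df[name] = np.log(df.iloc[:, index]) - np.log(df.iloc[:, index]).shift(1)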

Reading values from dataframe.iloc is too slow, and a problem with dataframe.values

I use Python and I have data of 35,000 rows. I need to change values in a loop, but it takes too much time.
PS: I have columns named Succes_1, Succes_2, Succes_5, Succes_7, ..., Succes_120, so I get the column name from the inner loop; the values depend on the matching SK_ column.
Example:
SK_1  SK_2  SK_5  ...  SK_120  Succes_1  Succes_2  ...  Succes_120
1     0     1          0       1         0              0
1     1     0          1       2         1              1
for i in range(len(data_jeux)):
    for d in range(len(succ_len)):
        ids = succ_len[d]
        if data_jeux['SK_%s' % ids][i] == 1:
            data_jeux.iloc[i]['Succes_%s' % ids] = 1 + i
Is there a faster way to solve this? I tried:
data_jeux.values[i, ('Succes_%s' % ids)] = 1 + i
but it returns an error; maybe .values doesn't accept a string index.
You can define the columns and then use loc to increment. It's not clear whether your columns are naturally ordered; if they aren't, you can use sorted with a custom key, since plain string sorting would put 'SK_100' before 'SK_20'.
def splitter(x):
    # Sort by the numeric suffix so SK_20 comes before SK_100.
    return int(x.rsplit('_', maxsplit=1)[-1])

cols = df.columns
sk_cols = sorted(cols[cols.str.startswith('SK')], key=splitter)
succ_cols = sorted(cols[cols.str.startswith('Succes')], key=splitter)

# Increment each Succes_X column where the matching SK_X column equals 1;
# going through NumPy arrays sidesteps pandas' column alignment.
df[succ_cols] = df[succ_cols].to_numpy() + (df[sk_cols] == 1).to_numpy()
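If the goal is specifically the original one, writing 1 + i (the row position) into Succes_X wherever SK_X equals 1, here is a sketch that keeps the question's names (data_jeux, succ_len) and avoids both the row loop and chained indexing:
import numpy as np

row_numbers = np.arange(len(data_jeux)) + 1  # 1 + i for each row position

for ids in succ_len:
    mask = data_jeux['SK_%s' % ids].to_numpy() == 1
    # .loc writes back to the frame itself; chained data_jeux.iloc[i][...]
    # assignment modifies a temporary copy and is silently lost.
    data_jeux.loc[mask, 'Succes_%s' % ids] = row_numbers[mask]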

Calculating cumulative returns

I have calculated the daily returns from some market data and am trying to accumulate those returns into a running-total column (cum_rets). Here's a sample of the data:
timestamp dash_cap litecoin_cap dogecoin_cap nxt_cap Total_MC Dal_rets
0 2014-02-14 702537 410011811 80883283 61942277 553539908 NaN
1 2014-02-15 1054625 413848776 73684156 59127182 547714739 -1.052349
2 2014-02-17 1882753 407014106 70512004 59753481 539162344 -1.561469
3 2014-02-18 3766068 398556278 69890414 60219880 532432640 -1.248178
4 2014-02-19 3038521 404855950 71588924 59870181 539353576 1.299871
I don't understand why, when I use this code:
merged['cum_rets'] = 0
merged['cum_rets'] = merged['cum_rets'].shift(1) + merged.Dal_rets
merged
it returns
timestamp dash_cap litecoin_cap dogecoin_cap nxt_cap Total_MC Dal_rets cum_rets
0 2014-02-14 702537 410011811 80883283 61942277 553539908 NaN NaN
1 2014-02-15 1054625 413848776 73684156 59127182 547714739 -1.052349 -1.052349
2 2014-02-17 1882753 407014106 70512004 59753481 539162344 -1.561469 -1.561469
3 2014-02-18 3766068 398556278 69890414 60219880 532432640 -1.248178 -1.248178
merged['cum_rets'].shift(1) should retrieve the previous value for cum_rets and then add it to the current value of Dal_rets, thus incrementally summing all values in Dal_rets. I know I could use .cumprod(), but I'd like to learn why my method doesn't work.
"merged['cum_rets'].shift(1) should retrieve the previous value for cum_rets and then add it to the current value of Dal_rets"
is correct. However, the conclusion
"thus incrementally summing all values in Dal_rets"
is not.
The expression
a = b + c
treats a, b, and c as distinct arrays: it first adds b and c, then assigns the result to a. It doesn't matter whether a and b refer to the same thing. Since cum_rets was all zeros when the line ran, the shift contributed nothing but a leading NaN, so cum_rets simply ends up equal to Dal_rets.
Incidentally, you probably want cumsum here, not cumprod.
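A minimal sketch of the vectorized fix, using the merged frame from the question:
# cumsum carries the running total down the column in one pass; the NaN in
# the first row of Dal_rets simply stays NaN.
merged['cum_rets'] = merged['Dal_rets'].cumsum()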
