Numpy repeat to odd length index - python

I have a dataframe:
proj_length = 6
nb_date = pd.Period("2017-10-13")
rb_date = pd.Period("2017-11-13")
rev_length = proj_length * 30
end_date = rb_date + (rev_length * 2)
df_index = pd.period_range(start=nb_date, end=end_date)
df = pd.DataFrame(data=[[0, 0]] * len(df_index),
                  index=df_index,
                  columns=["in", "out"])
len(df) == 392
I'm trying to group by 30 days at a time, so my initial thought was to create a key array for a new column:
groups = (end_date - nb_date) // 30
gb_key = np.repeat(np.arange(groups), 30)
len(gb_key) == 390
This is good so far, but I cannot figure out a Pythonic way to assign the overflow (the remaining 392 - 390 = 2 rows) to group 13.
Non-numpy/pandas way:
arr = np.zeros(len(df))
for i, idx in enumerate(range(0, len(df), 30)):
    arr[idx:] = i
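One way to get the same key without a loop is to integer-divide each row position by the block size. The sketch below assumes the 392-row frame from the question; `n_rows` and `block` are names introduced here for illustration.

```python
import numpy as np

n_rows = 392   # len(df) in the question
block = 30     # days per group

# Integer-divide each row position by the block size:
# rows 0-29 -> 0, rows 30-59 -> 1, ..., rows 390-391 -> 13.
gb_key = np.arange(n_rows) // block

# The loop from the question, kept only to compare results.
arr = np.zeros(n_rows)
for i, idx in enumerate(range(0, n_rows, block)):
    arr[idx:] = i

print(gb_key[-3:], bool((gb_key == arr).all()))
```

With the real frame, the key could be passed straight to groupby, e.g. `df.groupby(np.arange(len(df)) // 30)`.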

Related

Search through a dataframe using Regex in a for loop to pull out a value associated with the Regex

I have a subset dataframe from a much larger dataframe. I need to create a for loop that searches through the dataframe and pulls out the data corresponding to the correct name.
import pandas as pd
import numpy as np
import re
data = {'Name': ['CH_1', 'CH_2', 'CH_3', 'FV_1', 'FV_2', 'FV_3'],
        'Value': [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data)
FL = [17.7, 60.0]
CH = [20, 81.4]
tol = 8
time1 = FL[0] + tol
time2 = FL[1] + tol
time3 = CH[0] + tol
time4 = CH[1] + tol
FH_mon = df['Values'] *5
workpercent = [.7, .92, .94]
mhpy = [2087, 2503, 3128.75]
list1 = list()
list2 = list()
for x in df['Name']:
    if x == [(re.search('FV_', s)) for s in df['Name'].values]:
        y = np.select([FH_mon < time1 , (FH_mon >= time1) and (FH_mon < time2), FH_mon > time2], [workpercent[0],workpercent[1],workpercent[2]])
        z = np.select([FH_mon < time1 , (FH_mon >= time1) and (FH_mon < time2), FH_mon > time2], [mhpy[0],mhpy[1],mhpy[2]])
    if x == [(re.search('CH_', s)) for s in df['Name'].values]:
        y = np.select([FH_mon < time3, (FH_mon >= time3) and (FH_mon < time4)], [workpercent[0],workpercent[1]])
        z = np.select([FH_mon < time3, (FH_mon >= time3) and (FH_mon < time4)], [mhpy[0],mhpy[1]])
    list1.append(y)
    list2.append(z)
I posted a simpler version earlier, where I was just adding a couple of numbers, and got really helpful answers, but here is the more complex version. I need to search through the Name column: any time a name contains FV, the if block should run using the data from the rows with FV, and the same for CH. The lists keep track of each value as the loop goes through the Name column. If there is a simpler way I would really appreciate seeing it. Right now this seems like the cleanest approach, but I am receiving errors or the loop does not behave as expected.
This should be what you want:
for index, row in df.iterrows():
    if re.search("FV_", row["Name"]):
        df.loc[index, "Value"] += 2
    elif re.search("CH_", row["Name"]):
        df.loc[index, "Value"] += 4
If the "Name" column only has values starting with "FV_" or "CH_", use where:
df["Value"] = df["Value"].add(2).where(df["Name"].str.startswith("FV_"), df["Value"].add(4))
If you might have other values in "Name", use numpy.select:
import numpy as np
df["Value"] = np.select([df["Name"].str.startswith("FV_"), df["Name"].str.startswith("CH_")], [df["Value"].add(2), df["Value"].add(4)])
Output:
>>> df
   Name  Value
0  CH_1      5
1  CH_2      6
2  CH_3      7
3  FV_1      6
4  FV_2      7
5  FV_3      8
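The same masks can probably be applied to the thresholds in the question. Note that `and` between two Series raises a ValueError, so element-wise conditions need `&`. The sketch below is only one possible reading of the intent: it assumes the third tier (`workpercent[2]`, `mhpy[2]`) applies to every row at or above the upper threshold, including the CH rows, which the question does not state. The names `low` and `high` are introduced here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["CH_1", "CH_2", "CH_3", "FV_1", "FV_2", "FV_3"],
                   "Value": [1, 2, 3, 4, 5, 6]})
FL = [17.7, 60.0]
CH = [20, 81.4]
tol = 8
workpercent = [0.7, 0.92, 0.94]
mhpy = [2087, 2503, 3128.75]

fh_mon = df["Value"] * 5
is_fv = df["Name"].str.startswith("FV_")

# Per-row thresholds: FV rows use the FL limits, all other rows the CH limits.
low = np.where(is_fv, FL[0] + tol, CH[0] + tol)
high = np.where(is_fv, FL[1] + tol, CH[1] + tol)

# Element-wise conditions must be combined with &, not `and`.
conditions = [fh_mon < low, (fh_mon >= low) & (fh_mon < high)]

# Assumption: anything at or above `high` falls into the third tier.
df["workpercent"] = np.select(conditions, workpercent[:2], default=workpercent[2])
df["mhpy"] = np.select(conditions, mhpy[:2], default=mhpy[2])
print(df)
```

This replaces both the loop and the two lists, since each result is stored as a column aligned with the Name column.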

Visualize Results of each iteration of While Loop into a Time Series Chart

from datetime import datetime
from dateutil.relativedelta import relativedelta
initial_month = datetime.strptime('01-2018', '%m-%Y')
final_month = datetime.strptime('12-2020', '%m-%Y')
ppl_initial_comms = ppc_initial_comms = 6174
initial_leadpcpm = 4
ppl_price = 40
ppc_price = 400
ppl_new_comms = ppc_new_comms = 0
growth_years = ['2018','2019','2020']
leadpcpm_rate_dict = {}
ppl_rate_dict = {}
ppc_rate_dict = {}
ppl_cumul_rev_dict = {}
ppc_cumul_rev_dict = {}
#For Loop to calculate yearly MoM rates
for year in growth_years:
    if year == '2018':
        leadpcpm_rate_dict[year] = 5
        ppl_rate_dict[year] = 6
        ppc_rate_dict[year] = 0.9 * ppl_rate_dict[year]
    elif year == '2019':
        leadpcpm_rate_dict[year] = 4
        ppl_rate_dict[year] = 4
        ppc_rate_dict[year] = 0.9 * ppl_rate_dict[year]
    elif year == '2020':
        leadpcpm_rate_dict[year] = 1
        ppl_rate_dict[year] = 2
        ppc_rate_dict[year] = 0.9 * ppl_rate_dict[year]
#While loop to calculate MoM revenue growth over 3 years
while(initial_month != final_month+relativedelta(months=1)):
    initial_year = str(initial_month.year)
    if initial_year in growth_years:
        ppl_new_leadpcpm = initial_leadpcpm + ((initial_leadpcpm*leadpcpm_rate_dict[initial_year]) / 100)
        initial_leadpcpm = ppl_new_leadpcpm
        ppl_new_comms = ppl_initial_comms + ((ppl_initial_comms*ppl_rate_dict[initial_year]) / 100)
        ppl_initial_comms = ppl_new_comms
        ppl_cumul_rev = ppl_new_comms * ppl_new_leadpcpm * ppl_price
        ppl_cumul_rev_dict[initial_month] = ppl_cumul_rev
        ppc_new_comms = ppc_initial_comms + ((ppc_initial_comms*ppc_rate_dict[initial_year]) / 100)
        ppc_initial_comms = ppc_new_comms
        ppc_cumul_rev = ppc_new_comms * ppc_price
        ppc_cumul_rev_dict[initial_month] = ppc_cumul_rev
    initial_month += relativedelta(months=1)
I am trying to visualize the running sum of revenue for both ppl and ppc over 36 months in a single line chart (time series) using Matplotlib, but I am not sure how to turn ppl_cumul_rev_dict and ppc_cumul_rev_dict into a DataFrame like this:
    Year      PPLRevenue  PPCRevenue
0   Jan 2018  1234        5678
1   Feb 2018  9112        10019
..  ..        ..          ..
35  Dec 2020  1000000     1500000
I've tried creating two dictionaries of ppl and ppc revenues, but I don't know how to combine them into a single DataFrame to feed into plt.plot.
It would probably be better to rewrite the way you construct your dictionaries, but, given the code that you gave us and the content of ppl_cumul_rev_dict and ppc_cumul_rev_dict:
df1 = pd.DataFrame(list(ppc_cumul_rev_dict.items()), columns=['Date', 'PPC']).set_index('Date')
df2 = pd.DataFrame(list(ppl_cumul_rev_dict.items()), columns=['Date', 'PPL']).set_index('Date')
df = pd.concat([df1,df2], axis=1)
df.plot()
You can simply transform the dict into pandas Series and then create a DataFrame.
import pandas as pd
ppc_series = pd.Series(ppc_cumul_rev_dict)
ppl_series = pd.Series(ppl_cumul_rev_dict)
df = pd.DataFrame(data={'ppc': ppc_series, 'ppl': ppl_series})
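To get from that DataFrame to the table layout asked for in the question, something like the following might work. The two dictionaries here are small stand-ins with made-up values; in the question they are filled by the while loop. The column names follow the desired output, and `table` is a name introduced for this sketch.

```python
from datetime import datetime

import pandas as pd

# Stand-in data (hypothetical values); the real dicts come from the while loop.
ppl_cumul_rev_dict = {datetime(2018, 1, 1): 1234.0, datetime(2018, 2, 1): 9112.0}
ppc_cumul_rev_dict = {datetime(2018, 1, 1): 5678.0, datetime(2018, 2, 1): 10019.0}

# Each dict becomes a Series indexed by month; the shared index aligns the columns.
df = pd.DataFrame({"PPLRevenue": pd.Series(ppl_cumul_rev_dict),
                   "PPCRevenue": pd.Series(ppc_cumul_rev_dict)})
df.index.name = "Year"

# Display copy with the month formatted as in the desired table, e.g. "Jan 2018".
table = df.reset_index()
table["Year"] = table["Year"].dt.strftime("%b %Y")
print(table)
```

For the chart itself, keeping the datetime index (as in `df`) and calling `df.plot()` should draw both revenue lines on one time axis, provided Matplotlib is installed.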

ValueError: Comparing labels between series and dataframe in pandas (identically labeled error)

I have a dataframe which contains an ID and a Value. I'm cycling through its timestamps, and I need to use that ID to retrieve the value corresponding to the same ID from another dataset. Here's some code:
patient_weight = float(weigth[(weigth['PatientID'] == otherDataframe['PatientID'])]['Value'])
I even tried the following:
PatientID=valuesDataset['PatientID']
patient_weight = float(weigth[(weigth['PatientID'] == PatientID)]['Value'])
Now, every time I run the code I get
ValueError: Can only compare identically-labeled Series objects
error right at this part. That may be because I'm comparing a dataframe column with a Series, but I don't know how to solve this. Is there a solution?
Full function code:
Dates = dataset['Date']
for Date in Dates:
    for index in range(1, len(valuesDataset)):
        if valuesDataset.iloc[index]['Date'] >= Date and valuesDataset.iloc[index - 1]['Date'] <= Date:
            found = 1
            # print(valuesDataset.iloc[index]['Date'],valuesDataset.iloc[index-1]['Date'])
            delta_minutes = float(((valuesDataset.iloc[index]['Date'] - valuesDataset.iloc[index - 1]['Date']).seconds) / 60)
            patient_weight = float(weigth[(weigth['PatientID'] == PatientID)]['Value'])
            break
    if found: vis = operations...
    found = 0
I fixed it by doing the merge operation like this:
found = 0
Dates = generalDataset['Date']
for Date in Dates:
    for index in range(1, len(valuesDataset)):
        if valuesDataset.iloc[index]['Date'] >= Date and adrData.iloc[index - 1]['Date'] <= Date:
            found = 1
            delta_minutes = float(((valuesDataset.iloc[index]['Date'] - valuesDataset.iloc[index - 1]['Date']).seconds) / 60)
            dose_mcg = (valuesDataset.iloc[index]['Value'] * 1000)
            res_weigth = pd.merge(generalDataset, valuesDataset, how='outer', left_on=['PatientID'], right_on=['PatientID'])
            # valuex = weigth, valuey = value
            patient_weight = (res_weigth[(res_weigth['PatientID'] == PatientID)]['Value_x'])
            dose_mcg_kg_min = (dose_mcg / (patient_weight * delta_minutes))
            break
    if found: vis = vis + dose_mcg_kg_min * 100
    for index in range(i, len(valuesDataset2)...
        same as before
    if found: vis = vis + dose_mcg_kg_min * 100
    for index in range(i, len(valuesDataset3)...
        same as before
    if found: vis = vis + dose_mcg_kg_min * 100
    for index in range(i, len(valuesDataset4)...
        same as before
    if found: vis = vis + dose_mcg_kg_min * 100
print(vis)
Now, this seems correct, but my values are extremely high. I would like them ordered by PatientID, with the values as floats between 0 and 7.0. My results are:
0 15733.202257
1 15733.202257
2 15733.202257
3 15733.202257
4 15733.202257
5 15733.202257
6 16362.530347
I can't understand why. Can you help me?
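For what it's worth, the original ValueError comes from comparing two Series with different indexes (`weigth['PatientID']` against the whole `PatientID` column); the comparison only works against a single scalar ID or an identically labelled Series. A minimal sketch of a lookup that avoids both the error and the merge inside the loop is below. The data is invented for illustration, and `weight_by_id` and the `Weight` column are names introduced here, so this is not a drop-in fix for the full function.

```python
import pandas as pd

# Invented sample data; the column names follow the question.
weigth = pd.DataFrame({"PatientID": [1, 2, 3], "Value": [70.0, 82.5, 64.0]})
valuesDataset = pd.DataFrame({"PatientID": [2, 2, 3, 1],
                              "Value": [0.5, 0.7, 0.2, 0.9]})

# Build the lookup once: PatientID -> weight.
weight_by_id = weigth.set_index("PatientID")["Value"]

# Attach each row's weight by mapping the ID column through the lookup.
valuesDataset["Weight"] = valuesDataset["PatientID"].map(weight_by_id)

# Inside a loop, look up one scalar ID at a time instead of comparing Series.
patient_id = valuesDataset["PatientID"].iloc[0]
patient_weight = float(weight_by_id.loc[patient_id])
print(valuesDataset)
print(patient_weight)
```

Because `patient_weight` is a single float here rather than a Series, anything computed from it stays a scalar per row, which may also be relevant to the repeated values in the printed result.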

writing function in pandas/python

I have just started to learn Python and don't have much of a development background. Here is the code I have written while learning.
I now want to write a function that does exactly what my for loop is doing, but calculates a different exp (exp, exp1, etc.) based on a different num (num, num1, etc.).
How can I do this?
import pandas as pd
index = [0,1]
s = pd.Series(['a','b'],index= index)
t = pd.Series([1,2],index= index)
t1 = pd.Series([3,4],index= index)
df = pd.DataFrame(s,columns = ["str"])
df["num"] =t
df['num1']=t1
print (df)
exp=[]
for index, row in df.iterrows():
    if(row['str'] == 'a'):
        row['mul'] = -1 * row['num']
        exp.append(row['mul'])
    else:
        row['mul'] = 1 * row['num']
        exp.append(row['mul'])
df['exp'] = exp
print (df)
This is what I was trying to do, which gives wrong results:
import pandas as pd
index = [0,1]
s = pd.Series(['a','b'],index= index)
t = pd.Series([1,2],index= index)
t1 = pd.Series([3,4],index= index)
df = pd.DataFrame(s,columns = ["str"])
df["num"] =t
df['num1']=t1
def f(x):
    exp=[]
    for index, row in df.iterrows():
        if(row['str'] == 'a'):
            row['mul'] = -1 * x
            exp.append(row['mul'])
        else:
            row['mul'] = 1 * x
            exp.append(row['mul'])
    return exp
df['exp'] = df['num'].apply(f)
df['exp1'] = df['num1'].apply(f)
df
Per suggestion below, I would do:
import numpy as np
df['exp'] = np.where(df['str'] == 'a', df['num'] * -1, df['num'] * 1)
df['exp1'] = np.where(df['str'] == 'a', df['num1'] * -1, df['num1'] * 1)
I think you are looking for np.where
df['exp'] = np.where(df['str'] == 'a', df['num'] * -1, df['num'] * 1)
df
Out[281]:
  str  num  num1  exp
0   a    1     3   -1
1   b    2     4    2
Row-wise DataFrame operation with apply:
df["exp"] = df.apply(lambda x: x["num"] * (-1 if x["str"]=="a" else 1), axis=1)
Arithmetic (vectorized) DataFrame operation:
df["exp"] = ((df["str"] != 'a')-0.5) * 2 * df["num"]
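Since the question asks for a function that can be reused for num, num1 and so on, the np.where approach can be wrapped so the column name is a parameter. This is just a sketch; the function name `signed` is made up here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"str": ["a", "b"], "num": [1, 2], "num1": [3, 4]})

def signed(frame, col):
    """Return frame[col] negated where frame['str'] == 'a', unchanged elsewhere."""
    return np.where(frame["str"] == "a", -frame[col], frame[col])

# One call per source column, no explicit row loop needed.
df["exp"] = signed(df, "num")
df["exp1"] = signed(df, "num1")
print(df)
```

The attempt in the question fails because `apply` on a single column passes one value at a time to `f`, which then loops over the whole frame again and returns a list for every row.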

Looping over lists Python, indexing (basic bootstrap)

Given the following two lists:
dates = [1,2,3,4,5]
rates = [0.0154, 0.0169, 0.0179, 0.0187, 0.0194]
I would like to generate a list
df = []
of the same length as dates and rates (indices 0 to 4, i.e. 5 elements) in 'pure' Python (without NumPy) as an exercise.
df[i] would be equal to:
df[0] = 1 / (1 + rates[0])
df[1] = (1 - df[0] * rates[1]) / (1 + rates[1])
...
df[4] = (1 - (df[0] + df[1] + ... + df[3]) * rates[4]) / (1 + rates[4])
I was trying:
df = []
df.append(1 + rates[0]) #create df[0]
for date in enumerate(dates, start = 1):
    running_sum_vec = 0
    for i in enumerate(rates, start = 1):
        running_sum_vec += df[i] * rates[i]
    df[i] = (1 - running_sum_vec) / (1+ rates[i])
return df
but I am getting a TypeError: list indices must be integers. Thank you.
enumerate yields two values for each item: the index and the value.
>>> x = ['a', 'b', 'a']
>>> for y_count, y in enumerate(x):
...     print('index: {}, value: {}'.format(y_count, y))
...
index: 0, value: a
index: 1, value: b
index: 2, value: a
It's because of for i in enumerate(rates, start = 1):. enumerate generates tuples of the index and the object in the list. You should do something like
for i, rate in enumerate(rates, start=1):
    running_sum_vec += df[i] * rate
You'll need to fix the other loop (for date in enumerate...) as well.
You also need to move df[i] = (1 - running_sum_vec) / (1+ rates[i]) back into the loop (currently it will only set the last value) (and change it to append since currently it will try to set at an index out of bounds).
Not sure if this is what you want:
df = []
total = 0  # running sum of the factors computed so far (avoids shadowing the built-in sum)
for ind, val in enumerate(dates):
    df.append((1 - total * rates[ind]) / (1 + rates[ind]))
    total += df[ind]
Enumerate returns both index and entry.
So assuming the lists contain ints, your code can be:
df = [1 / (1 + rates[0])]  # create df[0]
for i, rate in enumerate(rates[1:], start=1):
    running_sum = sum(df[:i])  # df[0] + ... + df[i-1]
    df.append((1 - running_sum * rate) / (1 + rate))
Although I'm almost positive there's a way with list comprehension. I'll have to think about it for a bit.
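Putting the pieces together, the recurrence from the question can be wrapped in a small pure-Python function. This is a sketch that assumes the formula is exactly as stated in the question; the function name `discount_factors` is invented here.

```python
def discount_factors(rates):
    """Bootstrap factors: df[i] = (1 - sum(df[:i]) * rates[i]) / (1 + rates[i])."""
    factors = []
    running_sum = 0.0  # sum of all factors computed so far
    for rate in rates:
        factor = (1 - running_sum * rate) / (1 + rate)
        factors.append(factor)
        running_sum += factor
    return factors

rates = [0.0154, 0.0169, 0.0179, 0.0187, 0.0194]
df = discount_factors(rates)
print(df)
```

Keeping a running sum avoids re-adding the earlier factors on every step, and the first element needs no special case because the sum starts at zero.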
