Automated way to get binary variables from a database - python

I am working with a database about dengue. Among its variables is "Cases", which gives the number of dengue cases in a given period. I want to fit a logistic regression model to these data, so the idea is to turn this integer variable into a binary one: 0 for places that had no dengue cases in that period, and 1 for places that had cases. Since there are 35,628 rows, I want to do this in an automated way rather than by hand. Does anyone have an idea of how to put this into practice? I'm new to programming and I'm trying to implement it in R; if you know of a package that does this, it would help a lot. Each neighborhood is coded as a number.
The data look like this:
neighborhood   Dates    Cases   precipitation   Temperature
0              Jan/14   10      149,6           33,25
1              Fev/14   0       254             30,1
2              Mar/14   6       150             25,4
3              Apr/14   0       244,1           32,5
4              May/14   3       44,3            33,2
I appreciate any help and thank you very much.

R
Pick from among:
dat$CasesBin1 <- (dat$Cases > 0)
dat$CasesBin2 <- +(dat$Cases > 0)
dat
# neighborhood Dates Cases precipitation Temperature CasesBin1 CasesBin2
# 1 0 Jan/14 10 149.6 33.25 TRUE 1
# 2 1 Fev/14 0 254.0 30.10 FALSE 0
# 3 2 Mar/14 6 150.0 25.40 TRUE 1
# 4 3 Apr/14 0 244.1 32.50 FALSE 0
# 5 4 May/14 3 44.3 33.20 TRUE 1
In R at least, most logistic regression tools I've used work fine with either integer (0/1) or logical, but you may need to verify with the tools you will use.
Data:
dat <- structure(list(neighborhood = 0:4, Dates = c("Jan/14", "Fev/14", "Mar/14", "Apr/14", "May/14"), Cases = c(10L, 0L, 6L, 0L, 3L), precipitation = c(149.6, 254, 150, 244.1, 44.3), Temperature = c(33.25, 30.1, 25.4, 32.5, 33.2)), class = "data.frame", row.names = c(NA, -5L))
python
In [13]: dat
Out[13]:
neighborhood Dates Cases precipitation Temperature
0 0 Jan/14 10 149.6 33.25
1 1 Fev/14 0 254.0 30.10
2 2 Mar/14 6 150.0 25.40
3 3 Apr/14 0 244.1 32.50
4 4 May/14 3 44.3 33.20
In [17]: dat['CasesBin1'] = dat['Cases'].apply(lambda x: (x > 0))
In [18]: dat['CasesBin2'] = dat['Cases'].apply(lambda x: int(x > 0))
In [19]: dat
Out[19]:
neighborhood Dates Cases ... Temperature CasesBin1 CasesBin2
0 0 Jan/14 10 ... 33.25 True 1
1 1 Fev/14 0 ... 30.10 False 0
2 2 Mar/14 6 ... 25.40 True 1
3 3 Apr/14 0 ... 32.50 False 0
4 4 May/14 3 ... 33.20 True 1
[5 rows x 7 columns]
Data:
In [11]: js
Out[11]: '[{"neighborhood":0,"Dates":"Jan/14","Cases":10,"precipitation":149.6,"Temperature":33.25},{"neighborhood":1,"Dates":"Fev/14","Cases":0,"precipitation":254,"Temperature":30.1},{"neighborhood":2,"Dates":"Mar/14","Cases":6,"precipitation":150,"Temperature":25.4},{"neighborhood":3,"Dates":"Apr/14","Cases":0,"precipitation":244.1,"Temperature":32.5},{"neighborhood":4,"Dates":"May/14","Cases":3,"precipitation":44.3,"Temperature":33.2}]'
In [12]: dat = pd.read_json(js)

Sorry, I didn't see that you would like to implement this in R; below is suggested code in Python.
Assuming that the table is in a DataFrame df, you can create a new column 'dengue_cases' with 0 when there are no cases and 1 when there are cases:
df['Cases'] = df['Cases'].astype('int') #to ensure the correct data type in column
df['dengue_cases'] = df['Cases'].apply(lambda x: 0 if x==0 else 1)
The lines above create a new column. If you want to replace the original column instead, use the line below:
df['Cases'] = df['Cases'].apply(lambda x: 0 if x==0 else 1)
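As a side note, the same binary column can be built without apply by comparing the whole column at once. A minimal sketch, assuming the same df and column names as above:
import numpy as np
# vectorized equivalent of the apply-based lines above: True/False cast to 1/0
df['dengue_cases'] = (df['Cases'] > 0).astype(int)
# np.where reads like an if/else applied to the whole column
df['dengue_cases'] = np.where(df['Cases'] == 0, 0, 1)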

Related

cumsum based on taking the first value from another column then creating a new calculation

I was hoping to get some help with a calculation I'm struggling a bit with. I'm working with some data (copied below) and I need to create a calculation that takes the first value > 0 from another column, computes a new series based on that value, and then aggregates the numbers to give a cumulative sum. My raw data looks like this:
import pandas as pd

d = {'Final Account': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
     'Date': ['Jun-21', 'Jul-21', 'Aug-21', 'Sep-21', 'Oct-21', 'Nov-21', 'Dec-21', 'Jan-22', 'Feb-22', 'Mar-22', 'Apr-22', 'May-22', 'Jun-22'],
     'Units': [0, 0, 0, 0, 10, 0, 20, 0, 0, 7, 12, 35, 0]}
df = pd.DataFrame(data=d)
Account Date Units
A Jun-21 0
A Jul-21 0
A Aug-21 0
A Sep-21 0
A Oct-21 10
A Nov-21 0
A Dec-21 20
A Jan-22 0
A Feb-22 0
A Mar-22 7
A Apr-22 12
A May-22 35
A Jun-22 0
To this table I apply an initial conversion of my data, which is:
df['Conv'] = df['Units'].apply(lambda x: x // 5)
This adds a new column to my table like this:
Account Date Units Conv
A Jun-21 0 0
A Jul-21 0 0
A Aug-21 0 0
A Sep-21 0 0
A Oct-21 10 2
A Nov-21 0 0
A Dec-21 20 4
A Jan-22 0 0
A Feb-22 0 0
A Mar-22 7 1
A Apr-22 12 2
A May-22 35 7
A Jun-22 0 0
The steps after this are where I begin to run into issues. I need to calculate a new field that takes the first value of the Conv field > 0, at the same index position, and begins a new calculation based on the previous rows' cumulative sum, then adds the result back into that cumulative sum. Outside of Python this is done by creating two columns. One calculates new units as:
(Units - (previous row's cumulative sum of existing units * 2)) / 5
Then existing units is just the cumulative sum of the values that have been worked out to be new units. The desired output should look something like this:
Account Date Units Conv New Units Existing Units (cumsum of new units)
A Jun-21 0 0 0 0
A Jul-21 0 0 0 0
A Aug-21 0 0 0 0
A Sep-21 0 0 0 0
A Oct-21 10 2 2 2
A Nov-21 0 0 0 2
A Dec-21 20 4 3 5
A Jan-22 0 0 0 5
A Feb-22 0 0 0 5
A Mar-22 7 1 0 5
A Apr-22 12 2 0 5
A May-22 35 7 5 10
A Jun-22 0 0 0 10
The main issue I'm struggling with is grabbing the first value > 0 from the "Conv" column and being able to create a new cumulative sum based on that initial value that can be applied to the "New Units" calculation. Any guidance is much appreciated; despite reading around a lot I've hit a bit of a brick wall! If you need me to explain better please do ask! :)
Much appreciated in advance!
I'm not sure that I completely understand what you are trying to achieve. Nevertheless, here's an attempt to reproduce your expected results. For your example frame this
groups = (df['Units'].eq(0) & df['Units'].shift().ne(0)).cumsum()
df['New Units'] = 0
last = 0
for _, group in df['Units'].groupby(groups):
    i, unit = group.index[-1], group.iloc[-1]
    if unit != 0:
        new_unit = (unit - last * 2) // 5
        last = df.at[i, 'New Units'] = new_unit
does result in
Final Account Date Units New Units
0 A Jun-21 0 0
1 A Jul-21 0 0
2 A Aug-21 0 0
3 A Sep-21 0 0
4 A Oct-21 10 2
5 A Nov-21 0 0
6 A Dec-21 20 3
7 A Jan-22 0 0
8 A Feb-22 0 0
9 A Mar-22 7 0
10 A Apr-22 12 0
11 A May-22 35 5
12 A Jun-22 0 0
The first step identifies the blocks in column Units whose last item is relevant for building the new units: a run of successive zeros followed by non-zeros, ending just before the next zero. This
groups = (df['Units'].eq(0) & df['Units'].shift().ne(0)).cumsum()
results in
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 3
8 3
9 3
10 3
11 3
12 4
Then group column Units along these blocks, grab the last item of each block if it is non-zero (zero can only happen in the last block), build the new unit (according to the given formula) and store it in the new column New Units.
(If you actually need the column Existing Units then just use .cumsum() on the column New Units.)
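For completeness, that would be:
df['Existing Units'] = df['New Units'].cumsum()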
If there are multiple accounts (indicated in the comments), then one way to apply the procedure to each account separately would be to pack it into a function (here new_units), .groupby() over the Final Account column, and .apply() the function to the groups:
def new_units(sdf):
    groups = (sdf['Units'].eq(0) & sdf['Units'].shift().ne(0)).cumsum()
    last = 0
    for _, group in sdf['Units'].groupby(groups):
        i, unit = group.index[-1], group.iloc[-1]
        if unit != 0:
            new_unit = (unit - last * 2) // 5
            last = sdf.at[i, 'New Units'] = new_unit
    return sdf

df['New Units'] = 0
df = df.groupby('Final Account').apply(new_units)
Try using a for loop that performs the sample calculation you provided:
# initialize new columns to zero
df['new u'] = 0
df['ext u'] = 0
# set the first row's existing units
df.loc[0, 'ext u'] = df.loc[0, 'Units'] // 5
# loop through the data frame to perform the calculations
for i in range(1, len(df)):
    # calculate new units
    df.loc[i, 'new u'] = (df.loc[i, 'Units'] - 2 * df.loc[i - 1, 'ext u']) // 5
    # calculate existing units
    df.loc[i, 'ext u'] = df.loc[i - 1, 'ext u'] + df.loc[i, 'new u']
I'm not certain that those are the exact expressions you are looking for, but hopefully this gets you on your way to a solution. Worth noting that this does not take care of the whole "first value > 0" thing because (feel free to correct me but) it seems like before that you will just be adding up zeros, which won't affect anything. Hope this helps!
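If you do need the position of the first value > 0 from the Conv column explicitly, a small sketch (assuming at least one such value exists):
first_idx = df['Conv'].gt(0).idxmax()   # index label of the first Conv > 0
first_val = df.loc[first_idx, 'Conv']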

How do I round values in a column based on integers in another column

I need to round prices in a column to different numbers of decimals in Python. I am using this code to create the dataframe, df_prices:
df_prices = pd.DataFrame({'InstrumentID':['001','002','003','004','005','006'], 'Price':[12.44,6.5673,23.999,56.88,4333.22,27.8901],'RequiredDecimals':[2,0,1,2,0,3]})
The data looks like this:
InstrumentID Price RequiredDecimals
1 12.444 2
2 6.5673 0
3 23.999 1
4 56.88 2
5 4333.22 0
6 27.8901 3
I often get this issue returned:
TypeError: cannot convert the series to <class 'int'>
Neither of these statements worked:
df_prices['PriceRnd'] = np.round(df_prices['Price'] , df_prices['RequiredDecimals'])
df_prices['PriceRnd'] = df_prices['Price'].round(decimals = df_prices['RequiredDecimals'] )
This is what the final output should look like:
Instrument# Price RequiredDecimals PriceRnd
1 12.444 2 12.44
2 6.5673 0 7
3 23.999 1 24.0
4 56.88 2 56.88
5 4333.22 0 4333
6 27.8901 3 27.890
Couldn't find a better solution, but this one seems to work:
df['Rnd'] = [np.around(x,y) for x,y in zip(df['Price'],df['RequiredDecimals'])]
Although not elegant, you can try this.
import pandas as pd
df_prices = pd.DataFrame({'InstrumentID':['001','002','003','004','005','006'], 'Price':[12.44,6.5673,23.999,56.88,4333.22,27.8901],'RequiredDecimals':[2,0,1,2,0,3]})
print(df_prices)
list1 = []
for i in df_prices.values:
    list1.append('{:.{}f}'.format(i[1], i[2]))
print(list1)
df_prices["Rounded Price"] =list1
print(df_prices)
InstrumentID Price RequiredDecimals Rounded Price
0 001 12.4400 2 12.44
1 002 6.5673 0 7
2 003 23.9990 1 24.0
3 004 56.8800 2 56.88
4 005 4333.2200 0 4333
5 006 27.8901 3 27.890
Or as a one-liner:
df_prices['Rnd'] = ['{:.{}f}'.format(x, y) for x, y in zip(df_prices['Price'], df_prices['RequiredDecimals'])]
An alternative way would be to adjust the number that you are trying to round with an appropriate factor and then use the fact that the .round()-function always rounds to the nearest integer.
df_prices['factor'] = 10**df_prices['RequiredDecimals']
df_prices['rounded'] = (df_prices['Price'] * df_prices['factor']).round() / df_prices['factor']
After rounding, the number is divided again by the factor.
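For a single value the idea works out like this (a quick sketch, not part of the code above):
price, decimals = 27.8901, 3
factor = 10 ** decimals                    # 1000
rounded = round(price * factor) / factor   # 27890 / 1000 -> 27.89
Note that the result is a plain float, so trailing zeros (27.890 in the desired output) are not preserved the way they are with the string-formatting answers above.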

Conditional length of a binary data series in Pandas

Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to control the length of "1"-series by some limiting value? Let's say the limit is 2, then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
One fully-vectorized solution is to use the shift-groupby-cumsum-cumcount combination[1] to indicate where consecutive runs are shorter than 2 (or whatever limiting value you like). Then, & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
.astype(int) # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
[1] See the pandas cookbook; the section on grouping, "Grouping like Python's itertools.groupby"
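For readers new to this idiom, the one-liner above can be unpacked into intermediate steps; a sketch of the same logic:
run_id = (df.A != df.A.shift()).cumsum()      # new id each time the value of A changes
pos_in_run = df.groupby(run_id).cumcount()    # 0, 1, 2, ... within each run
df['B'] = ((pos_in_run <= 1) & (df.A == 1)).astype(int)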
Another way (checking if previous two are 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = list(map(lambda x: df['A'][x] if x < limit else int(not all(y == 1 for y in df['A'][x - limit:x])), range(len(df))))
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object, it can just be a normal Numpy array)
import numpy  # needed below for numpy.convolve
a = df['A'].to_numpy(copy=True)
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
def trim_runs(array, cutoff):
    a = numpy.asarray(array)
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a
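Usage on the example column might look like this (a sketch; pass a copy, since the function modifies the array it is given in place):
df['B'] = trim_runs(df['A'].to_numpy(copy=True), 2)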

create a summary of movements between prices by date in a pandas dataframe

I have a dataframe which shows: 1) dates, 2) prices and 3) the difference between two prices by row.
dates | data | result | change
24-09 | 24   | 0      | none
25-09 | 26   | 2      | pos
26-09 | 27   | 1      | pos
27-09 | 28   | 1      | pos
28-09 | 26   | -2     | neg
I want to create a summary of the above data in a new dataframe. The summary would have 4 columns: 1) start date, 2) end date, 3) number of days, 4) run.
For example, using the above there was a positive run of +4 from 25-09 to 27-09, so I would want this in a row of a dataframe like so:
In the new dataframe there would be one new row for every change in the value of result from positive to negative. Where run = 0 this indicates no change from the previous day's price and would also need its own row in the dataframe.
start date | end date | num days | run
25-09      | 27-09    | 3        | 4
27-09      | 28-09    | 1        | -2
23-09      | 24-09    | 1        | 0
The first step I think would be to create a new column "change" based on the value of run which then shows either of: "positive","negative" or "no change". Then maybe I could groupby this column.
A couple of useful functions for this style of problem are diff() and cumsum().
I added some extra datapoints to your sample data to flesh out the functionality.
The ability to pick and choose different (and more than one) aggregation functions assigned to different columns is a super feature of pandas.
df = pd.DataFrame({'dates': ['24-09', '25-09', '26-09', '27-09', '28-09', '29-09', '30-09','01-10','02-10','03-10','04-10'],
'data': [24, 26, 27, 28, 26,25,30,30,30,28,25],
'result': [0,2,1,1,-2,0,5,0,0,-2,-3]})
def cat(x):
    return 1 if x > 0 else -1 if x < 0 else 0
df['cat'] = df['result'].map(lambda x : cat(x)) # probably there is a better way to do this
df['change'] = df['cat'].diff()
df['change_flag'] = df['change'].map(lambda x: 1 if x != 0 else x)
df['change_cum_sum'] = df['change_flag'].cumsum() # which gives us our groupings
foo = df.groupby(['change_cum_sum']).agg({'result' : np.sum,'dates' : [np.min,np.max,'count'] })
foo.reset_index(inplace=True)
foo.columns = ['id','start date','end date','num days','run' ]
print(foo)
which yields:
id start date end date num days run
0 1 24-09 24-09 1 0
1 2 25-09 27-09 3 4
2 3 28-09 28-09 1 -2
3 4 29-09 29-09 1 0
4 5 30-09 30-09 1 5
5 6 01-10 02-10 2 0
6 7 03-10 04-10 2 -5
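As an aside to the "probably there is a better way" comment in the code above, the -1/0/1 bucketing can also be done directly with NumPy (a sketch):
import numpy as np
df['cat'] = np.sign(df['result'])   # -1, 0 or 1, same as cat()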

pandas dataframe - increase values of a subset of a timeframe on a multi-index dataframe

The following code worked for me on pandas 0.12, but on pandas 0.13 it no longer works (processing time is now about 1 minute per record, whereas previously 200k records were processed in an hour or so).
I suspect there's a more elegant way of achieving the same result. Would be nice if someone could point me in the right direction.
I create the dataframe like so:
pubs = ['pub1','pub2','pub3','pub4','pub5']
panel = pd.Panel(np.random.randn(2,2200,5), items=['variableA','variableB'], major_axis=pd.date_range('20110101', periods=2200), minor_axis=pubs)
df_sub = panel.to_frame()
df_sub.ix[:] = 0
I increment values like this:
startDate = time.ctime(time.mktime(time.strptime(meh,"%d/%m/%Y %H:%M:%S")))
TempRng = pd.date_range(startDate, periods=75)
for eachDay in TempRng:
    df_sub.ix[eachDay, pubID]['variableA'] += 1
    df_sub.ix[eachDay, pubID]['variableB'] += 5
It's this last part which used to work fine a month ago, but now grinds to a halt. On a different machine which still has the older version of pandas, the processing speed is acceptable.
What is the correct way of making this increment?
Reverse what you are doing and iterate over the smaller number of pubs. This will be orders of magnitude faster. ix/loc is very fast when setting big ranges/slices; using it to change a small number of values many times is inefficient.
In [57]: df = df_sub.reset_index()
In [58]: mask = df.minor == 'pub1'
In [59]: df.loc[mask,'variableA'] = 1
In [60]: df.loc[mask,'variableB'] = 5
In [61]: df.loc[mask,'variableA'] = df.loc[mask,'variableA'].cumsum()
In [62]: df.loc[mask,'variableB'] = df.loc[mask,'variableB'].cumsum()
In [64]: df.set_index(['major','minor']).head(20)
Out[64]:
variableA variableB
major minor
2011-01-01 pub1 1 5
pub2 0 0
pub3 0 0
pub4 0 0
pub5 0 0
2011-01-02 pub1 2 10
pub2 0 0
pub3 0 0
pub4 0 0
pub5 0 0
2011-01-03 pub1 3 15
pub2 0 0
pub3 0 0
pub4 0 0
pub5 0 0
2011-01-04 pub1 4 20
pub2 0 0
pub3 0 0
pub4 0 0
pub5 0 0
[20 rows x 2 columns]
In 0.14 you will be able to do this to directly index (and set) the 2nd level:
idx = pd.IndexSlice
df_sub.loc[idx[:,'pub1'],:] = 1
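Under that newer indexing API, the original day-by-day increments could be written as one slice assignment per variable. A sketch, assuming df_sub keeps the (date, pub) MultiIndex built above, that all dates in the range exist in the index, and with an illustrative start date and pub name:
idx = pd.IndexSlice
TempRng = pd.date_range('2011-01-10', periods=75)
df_sub.loc[idx[TempRng, 'pub1'], 'variableA'] += 1
df_sub.loc[idx[TempRng, 'pub1'], 'variableB'] += 5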
