How to compare two columns using Python? - python

I want to compare two continuous home sale price columns and create a new column that stores a binary variable.
This is my process so far:
dataset['High'] = dataset['November'].map(lambda x: 1 if x>50000 else 0)
This allows me to work on only one column, but I want to compare both the November and December home price columns and create a new column that contains binary values.
I want this output
November - December - NewCol
-------------------------------
651200 - 626600 - 0
420900 - 423600 - 1
82300 - 83100 - 1
177000 - 169600 - 0
285500 - 206300 - 0
633900 - 640000 - 1
218900 - 222400 - 1
461700 - 403800 - 0
419100 - 421300 - 1
127600 - 128300 - 1
553400 - 547800 - 0
November and December are continuous variables, and I want to derive a binary variable from them. Similar to R's ifelse() function, I want to create a variable called "NewCol", which takes the value "1" if the ['November'] column is greater than ['December'], and "0" otherwise.

Similar to #3novak's answer, but with casting to int. One uses pandas for greater efficiency, but when you use something like map, which needs values expressed as (more expensive) Python objects, you may as well just use Python lists. Prefer pandas operations that apply to entire Series and DataFrames instead.
>>> import pandas as pd
>>> df = pd.read_csv('test.csv')
>>> df
November December
0 651200 626600
1 420900 423600
2 82300 83100
3 177000 169600
4 285500 206300
5 633900 640000
6 218900 222400
7 461700 403800
8 419100 421300
9 127600 128300
10 553400 547800
>>> df['Higher'] = df['December'].gt(df['November']).astype(int)
>>> df
November December Higher
0 651200 626600 0
1 420900 423600 1
2 82300 83100 1
3 177000 169600 0
4 285500 206300 0
5 633900 640000 1
6 218900 222400 1
7 461700 403800 0
8 419100 421300 1
9 127600 128300 1
10 553400 547800 0

Answer: This would do the trick.
dataset['deff'] = np.where(dataset['2016-11'] >= dataset['2016-12'], 0, 1)
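Note that this relies on numpy being imported as np, and on month columns named '2016-11' and '2016-12'. A minimal runnable sketch of the same idea, using the question's November/December column names (an assumption on my part):
import numpy as np
import pandas as pd

dataset = pd.DataFrame({'November': [651200, 420900],
                        'December': [626600, 423600]})
# 0 where November >= December, 1 otherwise
dataset['NewCol'] = np.where(dataset['November'] >= dataset['December'], 0, 1)
print(dataset['NewCol'].tolist())  # [0, 1]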

If I understand correctly, you can use the following to create a boolean column. We don't even need an if/else statement; instead we can use the vectorized nature of pandas DataFrames.
data['NewCol'] = data['November'] > data['December']
This returns a column of True and False values instead of 1 and 0, but they are functionally equivalent. You can sum, take means, etc. treating True as 1 and False as 0.
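If you do need literal 1/0 values rather than booleans, you can cast the comparison; a small sketch of that variant:
import pandas as pd

data = pd.DataFrame({'November': [651200, 420900],
                     'December': [626600, 423600]})
# cast the boolean comparison to integers
data['NewCol'] = (data['November'] > data['December']).astype(int)
print(data['NewCol'].tolist())  # [1, 0]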

Related

Python: using rolling + apply with a function that requires 2 columns as arguments in pandas

I have a dataframe (df) with 2 columns:
Out[2]:
0 1
0 1 2
1 4 5
2 3 6
3 10 12
4 1 2
5 4 5
6 3 6
7 10 12
I would like to calculate, for each element of df[0], a function of itself and the df[1] column:
def custom_fct_2(x, y):
    res = stats.percentileofscore(y.values, x.iloc[-1])
    return res
I get the following error: TypeError:
("'numpy.float64' object is not callable", u'occurred at index 0')
Here is the full code:
from __future__ import division
import pandas as pd
import sys
from scipy import stats

def custom_fct_2(x, y):
    res = stats.percentileofscore(y.values, x.iloc[-1])
    return res

df = pd.DataFrame([[1,2],[4,5],[3,6],[10,12],[1,2],[4,5],[3,6],[10,12]])
df['perc'] = df.rolling(3).apply(custom_fct_2(df[0], df[1]))
Can someone help me with that? (I am new to Python.)
Out[2]:
0 1
...
5 4 5
6 3 6
7 10 12
I want the percentile ranking of [10] in [12,6,5]
I want the percentile ranking of [3] in [6,5,2]
I want the percentile ranking of [4] in [5,2,12]
...
The problem here is that rolling().apply() cannot give you a segment of 3 rows across all the columns. Instead, it gives you a Series for column 0 first, then for column 1.
Maybe there are better solutions, but I will show mine, which at least works.
df = pd.DataFrame([[1,2],[4,5],[3,6],[10,12],[1,2],[4,5],[3,6],[10,12]])

def custom_fct_2(s):
    score = df[0][s.index.values[1]]  # use .values[-1] if you want the last element
    a = s.values
    return stats.percentileofscore(a, score)
I'm using the same data you provided, but I modified your custom_fct_2() function. Here, s is a Series of 3 rolling values from column 1. Fortunately, this Series keeps its index, so we can fetch the score from column 0 via the "middle" index of the Series. By the way, in Python [-1] means the last element of a collection, but from your explanation I believe you actually want the middle one.
Then, apply the function.
# remove the shift() function if you want the value aligned to the last value of the rolling scores
df['prec'] = df[1].rolling(3).apply(custom_fct_2).shift(periods=-1)
The shift() is optional. Whether you need it depends on whether prec should be aligned with column 0 (whose middle score is used) or with the rolling scores of column 1. I assume you need it.
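For reference, here is a self-contained sketch of the whole approach; raw=False is my assumption for newer pandas, where apply would otherwise pass a bare NumPy array without an index:
import pandas as pd
from scipy import stats

df = pd.DataFrame([[1, 2], [4, 5], [3, 6], [10, 12],
                   [1, 2], [4, 5], [3, 6], [10, 12]])

def custom_fct_2(s):
    # s is a window of 3 values from column 1; use the window's
    # middle index to fetch the matching score from column 0
    score = df[0][s.index.values[1]]
    return stats.percentileofscore(s.values, score)

# shift(-1) aligns each result with the middle row of its window
df['prec'] = df[1].rolling(3).apply(custom_fct_2, raw=False).shift(periods=-1)
print(df)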

Pandas Vectorization with Function on Parts of Column

So I have a dataframe that looks something like this:
df1 = pd.DataFrame([[1,2, 3], [5,7,8], [2,5,4]])
0 1 2
0 1 2 3
1 5 7 8
2 2 5 4
I then have a function called add5 that adds 5 to a number. I'm trying to create a new column in df1 that adds 5 to all the numbers in column 2 that are greater than 3. I want to use vectorization, not apply, since this will be expanded to a dataset with hundreds of thousands of entries and speed will be important. I can do it without the greater-than-3 constraint like this:
df1['3'] = add5(df1[2])
But my goal is to do something like this:
df1['3'] = add5(df1[2]) if df1[2] > 3
Hoping someone can point me in the right direction on this. Thanks!
With Pandas, a function applied explicitly to each row typically cannot be vectorised. Even implicit loops such as pd.Series.apply will likely be inefficient. Instead, you should use true vectorised operations, which lean heavily on NumPy in both functionality and syntax.
In this case, you can use numpy.where:
df1[3] = np.where(df1[2] > 3, df1[2] + 5, df1[2])
Alternatively, you can use pd.DataFrame.loc in a couple of steps:
df1[3] = df1[2]
df1.loc[df1[2] > 3, 3] = df1[2] + 5
In each case, the term df1[2] > 3 creates a Boolean series, which is then used to mask another series.
Result:
print(df1)
0 1 2 3
0 1 2 3 3
1 5 7 8 13
2 2 5 4 9
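As one more option (my addition, not part of the original answer), pandas' own Series.mask expresses the same conditional update without reaching for NumPy:
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [5, 7, 8], [2, 5, 4]])
# mask() replaces values where the condition is True
df1[3] = df1[2].mask(df1[2] > 3, df1[2] + 5)
print(df1[3].tolist())  # [3, 13, 9]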

How do you set a specific column with a specific value to a new value in a Pandas DF?

I imported a CSV file that has two columns, ID and bee_type. The bee_type column has two values in it, bumble_bee and honey_bee. I'm trying to convert them to numbers instead of names; i.e. instead of bumble_bee it says 1.
However, my code is setting everything to 1. How can I keep the ID column at its original value and only change the bee_type column?
# load the labels using pandas
labels = pd.read_csv("bees/train_labels.csv")

# set bumble_bee to one
for index in range(len(labels)):
    labels[labels['bee_type'] == 'bumble_bee'] = 1
I believe you need map by dictionary if only 2 possible values exist:
labels['bee_type'] = labels['bee_type'].map({'bumble_bee': 1, 'honey_bee': 2})
Another solution is to use numpy.where - set values by condition:
labels['bee_type'] = np.where(labels['bee_type'] == 'bumble_bee', 1, 2)
Your code almost works, but for correctness and performance, modify it a bit - remove the loop and give loc both a row condition and a column label (otherwise every column of the matching rows, including ID, is overwritten):
labels.loc[labels['bee_type'] == 'bumble_bee', 'bee_type'] = 1
print (labels)
   ID   bee_type
0   0          1
1   1  honey_bee
2   2          1
3   3  honey_bee
4   4          1
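To see why the column label matters, here is a small demonstration (my addition): without it, loc assigns to every column of the matching rows, which is exactly the "everything becomes 1" symptom:
import pandas as pd

labels = pd.DataFrame({'ID': [10, 11], 'bee_type': ['bumble_bee', 'honey_bee']})
# no column label: the whole matching row is assigned, clobbering ID too
labels.loc[labels['bee_type'] == 'bumble_bee'] = 1
print(labels)
#    ID   bee_type
# 0   1          1
# 1  11  honey_bee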
Sample:
labels = pd.DataFrame({
    'bee_type': ['bumble_bee','honey_bee','bumble_bee','honey_bee','bumble_bee'],
    'ID': list(range(5))
})
print (labels)
ID bee_type
0 0 bumble_bee
1 1 honey_bee
2 2 bumble_bee
3 3 honey_bee
4 4 bumble_bee
labels['bee_type'] = labels['bee_type'].map({'bumble_bee': 1, 'honey_bee': 2})
print (labels)
ID bee_type
0 0 1
1 1 2
2 2 1
3 3 2
4 4 1
As far as I can understand, you want to convert names to numbers. If that's the scenario, try LabelEncoder; detailed documentation can be found under sklearn's LabelEncoder.
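A minimal sketch of that suggestion (note that LabelEncoder numbers the classes from 0 in sorted order, rather than 1 and 2):
from sklearn.preprocessing import LabelEncoder
import pandas as pd

labels = pd.DataFrame({'bee_type': ['bumble_bee', 'honey_bee', 'bumble_bee']})
# fit_transform assigns an integer to each distinct label
labels['bee_type'] = LabelEncoder().fit_transform(labels['bee_type'])
print(labels['bee_type'].tolist())  # [0, 1, 0]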

Python dataframe: check if a value in a column is within a range of values reported in another dataframe

Apologies if the problem is trivial, but as a Python newbie I wasn't able to find the right solution.
I have two dataframes, and I need to add a column to the first dataframe that is True if a certain value of the first dataframe lies between two values of the second dataframe, and False otherwise.
For example:
first_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2':[10,22,15,15,7,130,2]})
second_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2_start':[5,20,11,11,5,110,220],'code2_end':[15,25,20,20,10,120,230]})
first_df
code1 code2
0 1 10
1 1 22
2 2 15
3 2 15
4 3 7
5 1 130
6 1 2
second_df
code1 code2_end code2_start
0 1 15 5
1 1 25 20
2 2 20 11
3 2 20 11
4 3 10 5
5 1 120 110
6 1 230 220
For each row of the first dataframe, I need to check whether the value in the code2 column falls within one of the ranges given by the rows of second_df. For example:
in row 1 of first_df, code1=1 and code2=22;
checking second_df, I have 4 rows with code1=1 (rows 0, 1, 5 and 6); the value code2=22 is in the interval given by code2_start=20 and code2_end=25, so the function should return True.
For an example where the function should return False:
in row 5 of first_df, code1=1 and code2=130,
but there is no interval containing 130 where code1=1.
I have tried to use this function
def check(first_df, second_df):
    for i in range(len(first_df)):
        return ((second_df.code2_start <= first_df.code2[i]) & (second_df.code2_end <= first_df.code2[i]) & (second_df.code1 == first_df.code1[i])).any()
and to vectorize it
first_df['output'] = np.vectorize(check)(first_df, second_df)
but obviously with no success.
I would be happy for any input you could provide.
thx.
A.
As a practical example:
first_df.code1[0] = 1
therefore I need to search on second_df all the istances where
second_df.code1 == first_df.code1[0]
0 True
1 True
2 False
3 False
4 False
5 True
6 True
for the instances 0,1,5,6 where the status is True I need to check if the value
first_df.code2[0]
10
is between one of the range identified by
second_df[second_df.code1 == first_df.code1[0]][['code2_start','code2_end']]
code2_start code2_end
0 5 15
1 20 25
5 110 120
6 220 230
since the value of first_df.code2[0] is 10, it is between 5 and 15 (the range identified by row 0), therefore my function should return True. In the case of first_df.code1[6] the value would still be 1, so the range table would be the same as above, but first_df.code2[6] is 2, and there is no interval containing 2, therefore the result should be False.
first_df['output'] = (second_df.code2_start <= first_df.code2) & (second_df.code2_end >= first_df.code2)
This works because when you do something like: second_df.code2_start <= first_df.code2
You get a boolean Series. If you then perform a logical AND on two of these boolean series, you get a Series which has value True where both Series were True and False otherwise.
Here's an example:
>>> import pandas as pd
>>> a = pd.DataFrame([{1:2,2:4,3:6},{1:3,2:6,3:9},{1:4,2:8,3:10}])
>>> a['output'] = (a[2] <= a[3]) & (a[2] >= a[1])
>>> a
1 2 3 output
0 2 4 6 True
1 3 6 9 True
2 4 8 10 True
EDIT:
So based on your updated question and my new interpretation of your problem, I would do something like this:
import pandas as pd

# Define some data to work with
df_1 = pd.DataFrame([{'c1':1,'c2':5},{'c1':1,'c2':10},{'c1':1,'c2':20},{'c1':2,'c2':8}])
df_2 = pd.DataFrame([{'c1':1,'start':3,'end':6},{'c1':1,'start':7,'end':15},{'c1':2,'start':5,'end':15}])

# Function checks if c2 value is within any range matching c1 value
def checkRange(x, code_range):
    idx = code_range.c1 == x.c1
    code_range = code_range.loc[idx]
    check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
    return check.any()

# Apply the checkRange function to each row of the DataFrame
df_1['output'] = df_1.apply(lambda x: checkRange(x, df_2), axis=1)
What I do here is define a function called checkRange, which takes as input x, a single row of df_1, and code_range, the entire df_2 DataFrame. It first finds the rows of code_range which have the same c1 value as the given row's x.c1, then discards the non-matching rows. This is done in the first two lines:
idx = code_range.c1 == x.c1
code_range = code_range.loc[idx]
Next, we get a boolean Series which tells us if x.c2 falls within any of the ranges given in the reduced code_range DataFrame:
check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
Finally, since we only care whether x.c2 falls within one of the ranges, we return check.any(). Calling any() on a boolean Series returns True if any of the values in the Series are True.
To call the checkRange function on each row of df_1, we can use apply(). I define a lambda expression in order to send the checkRange function the row as well as df_2. axis=1 means that the function will be called on each row (instead of each column) for the DataFrame.
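Running the sketch above on the sample data should give:
print(df_1)
#    c1  c2  output
# 0   1   5    True
# 1   1  10    True
# 2   1  20   False
# 3   2   8    True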

Is there any quick function to do look-back calculations in a pandas dataframe?

I want to implement a calculation like this simple scenario:
the value is computed as the sum of the daily data over the previous N days (set N = 3 in the following example).
Dataframe df: (df.index is 'date')
date value
20140718 1
20140721 2
20140722 3
20140723 4
20140724 5
20140725 6
20140728 7
......
to produce something like:
date value new
20140718 1 0
20140721 2 0
20140722 3 0
20140723 4 6 (3+2+1)
20140724 5 9 (4+3+2)
20140725 6 12 (5+4+3)
20140728 7 15 (6+5+4)
......
Now I have done this with a for loop:
N = 3
df['new'] = [0] * len(df)
for idx in df.index:
    loc = df.index.get_loc(idx)
    if (loc - N) >= 0:
        s = df['value'].iloc[loc - 3:loc].sum()
    else:
        s = 0
    df.loc[idx, 'new'] = s
But when the dataframe is long or N is big, this calculation becomes very slow. How can I implement it faster, using a built-in function or some other approach?
Also, what if the scenario is more complex? Thanks.
Since you want the sum of the previous three values excluding the current one, you can use a rolling apply over a window of four and sum all but the last value (rolling_apply in very old pandas; rolling().apply() in current versions):
new = df.rolling(4, min_periods=4).apply(lambda x: sum(x[:-1]))
This is the same as taking a rolling sum over a window of three and shifting afterwards:
new = df.rolling(3, min_periods=3).sum().shift()
Then
df["new"] = new["value"].fillna(0)
