If statement to add column to pandas dataframe gives the same values - python

I have a pandas dataframe called week5_233C. My Python version is 3.9.13.
I wrote an if-statement to add a new column, Spike, to my data set: if the value in Value [pV] is not equal to 0, I want to add a 1 to that row; if Value [pV] is 0, then I want the Spike column to be 0.
The data looks like this:
TimeStamp [µs] Value [pV]
0 1906200 0
1 1906300 0
2 1906400 0
3 1906500 -149012
4 1906600 -149012
And I want it to look like this:
TimeStamp [µs] Value [pV] Spike
0 1906200 0 0
1 1906300 0 0
2 1906400 0 0
3 1906500 -149012 1
4 1906600 -149012 1
I tried:
week5_233C.loc[week5_233C[' Value [pV]'] != 0, 'Spike'] = 1
week5_233C.loc[week5_233C[' Value [pV]'] == 0, 'Spike'] = 0
but all rows in column Spike get the same value.
I also tried:
week5_233C['Spike'] = week5_233C[' Value [pV]'].apply(lambda x: 0 if x == 0 else 1)
Again, it just adds only 0s or only 1s, but does not work with if and else. See example data:
TimeStamp [µs] Value [pV] Spike
0 1906200 0 1
1 1906300 0 1
2 1906400 0 1
3 1906500 -149012 1
4 1906600 -149012 1
Doing it like this:
for i in week5_233C[' Value [pV]']:
    if i != 0:
        week5_233C['Spike'] = 1
    elif i == 0:
        week5_233C['Spike'] = 0
does not work: it does not add the column correctly, gives no error, and eventually makes Python crash.
However, when I run this if-statement with just a print as such:
for i in week5_233C[' Value [pV]']:
    if i != 0:
        print(1)
    elif i == 0:
        print(0)
then it does print the exact values I want. I cannot figure out how to save these values in a new column.
This:
for i in week5_233C[' Value [pV]']:
    if i != 0:
        week5_233C.concat([1, df.iloc['Spike']])
    elif i == 0:
        week5_233C.concat([0, df.iloc['Spike']])
gives me an error: AttributeError: 'DataFrame' object has no attribute 'concat'
How can I make a new column Spike and add the values 0 and 1 based on the value in column Value [pV]?

I think you should check the dtype of the Value [pV] column. You probably have strings, which is why you get the same value everywhere. Try print(df['Value [pV]'].dtype). If it is object, try converting with astype(float) or pd.to_numeric(df['Value [pV]']).
You can also try:
df['spike'] = np.where(df['Value [pV]'] == '0', 0, 1)
Update
To show the bad rows and debug your dataframe, use the following code:
df.loc[pd.to_numeric(df['Value [pV]'], errors='coerce').isna(), 'Value [pV]']
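Putting the two suggestions together, a minimal sketch (assuming the column really is stored as strings, and keeping the leading space in ' Value [pV]' from the question's code):
import pandas as pd
import numpy as np

# coerce the column to numbers; unparseable cells become NaN
week5_233C[' Value [pV]'] = pd.to_numeric(week5_233C[' Value [pV]'], errors='coerce')
# note: NaN != 0 evaluates True, so any unparseable rows get flagged as 1 here
week5_233C['Spike'] = np.where(week5_233C[' Value [pV]'] != 0, 1, 0)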

import pandas as pd

df = pd.DataFrame({'TimeStamp [µs]': [1906200, 1906300, 1906400, 1906500, 1906600],
                   'Value [pV]': [0, 0, 0, -149012, -149012]})
df['Spike'] = df.agg({'Value [pV]': lambda v: int(bool(v))})
print(df)
TimeStamp [µs] Value [pV] Spike
0 1906200 0 0
1 1906300 0 0
2 1906400 0 0
3 1906500 -149012 1
4 1906600 -149012 1
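The agg trick above appears to rely on pandas falling back to element-wise application when the lambda cannot reduce the whole Series. A more direct sketch (using the question's original column name, with its leading space) casts the boolean comparison to int:
week5_233C['Spike'] = (week5_233C[' Value [pV]'] != 0).astype(int)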

Related

Better Way to do this in Pandas?

I'm just seeking some guidance on how to do this better. I was just doing some basic research to compare Monday's opening and low. The code returns two lists: one with the returns ((Monday's close - open)/Monday's open) and one that's just 1's and 0's to reflect whether the return was positive or negative.
Please take a look as I'm sure there's a better way to do it in pandas but I just don't know how.
#Monday only
m_list = []  #results list
h_list = []  #hit list (close-low > 0)
n = 0  #counter variable
for t in history.index:
    if datetime.datetime.weekday(t[1]) == 1:  #t[1] is the timestamp in the multi index (if timestamp is a Monday)
        x = history.ix[n]['open'] - history.ix[n]['low']
        m_list.append((history.ix[n]['open'] - history.ix[n]['low'])/history.ix[n]['open'])
        if x > 0:
            h_list.append(1)
        else:
            h_list.append(0)
        n += 1  #add to index counter
    else:
        n += 1  #add to index counter
print("Mean: ", mean(m_list), "Max: ", max(m_list), "Min: ",
      min(m_list), "Hit Rate: ", sum(h_list)/len(h_list))
You can do that straightforwardly with:
(history['open']-history['low'])>0
This will give you True for rows where open is greater and False where low is greater.
And if you want 1/0, you can multiply the above statement by 1.
((history['open']-history['low'])>0)*1
Example
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': np.random.random(10),
                   'b': np.random.random(10)})
Printing the data frame:
print(df)
a b
0 0.675916 0.796333
1 0.044582 0.352145
2 0.053654 0.784185
3 0.189674 0.036730
4 0.329166 0.021920
5 0.163660 0.331089
6 0.042633 0.517015
7 0.544534 0.770192
8 0.542793 0.379054
9 0.712132 0.712552
To make a new column compare, which is 1 if a is greater and 0 if b is greater:
df['compare'] = (df['a']-df['b']>0)*1
this will add new column compare:
a b compare
0 0.675916 0.796333 0
1 0.044582 0.352145 0
2 0.053654 0.784185 0
3 0.189674 0.036730 1
4 0.329166 0.021920 1
5 0.163660 0.331089 0
6 0.042633 0.517015 0
7 0.544534 0.770192 0
8 0.542793 0.379054 1
9 0.712132 0.712552 0
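Back to the Monday loop itself, a rough sketch of a loop-free version, assuming history has a MultiIndex whose second level holds the timestamps (as t[1] suggests), and remembering that weekday() returns 0 for Monday, not 1:
import pandas as pd

# keep only rows whose second index level falls on a Monday
timestamps = pd.DatetimeIndex(history.index.get_level_values(1))
mondays = history[timestamps.weekday == 0]

returns = (mondays['open'] - mondays['low']) / mondays['open']
hits = ((mondays['open'] - mondays['low']) > 0) * 1  # 1/0 hit list

print("Mean:", returns.mean(), "Max:", returns.max(),
      "Min:", returns.min(), "Hit Rate:", hits.mean())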

Cannot compare/transform list to float

I have a column "Employees" that contains the following data:
122.12 (Mark/Jen)
32.11 (John/Albert)
29.1 (Jo/Lian)
I need to count how many values match a specific condition (like x>31).
base = list()
count = 0
count2 = 0
for element in data['Employees']:
    base.append(element.split(' ')[0])
    if base > 31:
        count = count + 1
    else:
        count2 = count2 + 1
print(count)
print(count2)
The output should tell me that the count value is 2 and the count2 value is 1. The problem is that I cannot compare a float to a list. How can I make that if statement work?
You have a df with an Employees column that you need to split into number and text; keep the number, convert it to a float, then filter based on a value:
import pandas as pd
df = pd.DataFrame({'Employees': ["122.12 (Mark/Jen)", "32.11(John/Albert)",
                                 "29.1(Jo/Lian)"]})
print(df)
# split at (
df["value"] = df["Employees"].str.split("(")
# convert to float
df["value"] = pd.to_numeric(df["value"].str[0])
print(df)
# filter it into 2 series
smaller = df["value"] < 31
remainder = df["value"] > 30
print(smaller)
print(remainder)
# counts
smaller31 = sum(smaller) # True == 1 -> sum([True,False,False]) == 1
bigger30 = sum(remainder)
print(f"Smaller: {smaller31} bigger30: {bigger30}")
Output:
# df
Employees
0 122.12 (Mark/Jen)
1 32.11(John/Albert)
2 29.1(Jo/Lian)
# after split/to_numeric
Employees value
0 122.12 (Mark/Jen) 122.12
1 32.11(John/Albert) 32.11
2 29.1(Jo/Lian) 29.10
# smaller
0 False
1 False
2 True
Name: value, dtype: bool
# remainder
0 True
1 True
2 False
Name: value, dtype: bool
# counted
Smaller: 1 bigger30: 2
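If you only need the two counts, a shorter sketch (assuming every cell starts with the number) is to extract the leading number with a regular expression and sum the boolean masks directly:
import pandas as pd

df = pd.DataFrame({'Employees': ["122.12 (Mark/Jen)", "32.11(John/Albert)",
                                 "29.1(Jo/Lian)"]})
# pull the leading number out of each string and cast to float
values = df["Employees"].str.extract(r"^([\d.]+)")[0].astype(float)
count = (values > 31).sum()    # rows matching the condition -> 2
count2 = (values <= 31).sum()  # remaining rows -> 1
print(count, count2)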

Python Pandas - Get the First Value that Meets Criteria

We have this function:
def GetPricePerCustomAmt(CustomAmt):
    data = [{"Price":281.48,"USDamt":104.84},{"Price":281.44,"USDamt":5140.77},{"Price":281.42,"USDamt":10072.24},{"Price":281.39,"USDamt":15773.83},{"Price":281.33,"USDamt":19314.54},{"Price":281.27,"USDamt":22255.55},{"Price":281.2,"USDamt":23427.64},{"Price":281.13,"USDamt":23708.77},{"Price":281.1,"USDamt":23738.77},{"Price":281.08,"USDamt":24019.88},{"Price":281.01,"USDamt":25986.95},{"Price":281.0,"USDamt":26127.45}]
    df = pd.DataFrame(data)
    df["getfirst"] = np.where(df["USDamt"] > CustomAmt, 1, 0)
    wantedprice = "??"
    print(df)
    print()
    print("Wanted Price:", wantedprice)
    return wantedprice
Calling it using a custom USDamt like this:
GetPricePerCustomAmt(500)
gets this result:
Price USDamt getfirst
0 281.48 104.84 0
1 281.44 5140.77 1
2 281.42 10072.24 1
3 281.39 15773.83 1
4 281.33 19314.54 1
5 281.27 22255.55 1
6 281.20 23427.64 1
7 281.13 23708.77 1
8 281.10 23738.77 1
9 281.08 24019.88 1
10 281.01 25986.95 1
11 281.00 26127.45 1
Wanted Price: ??
We want to return the Price row of the first 1 appearing in the "getfirst" column.
Examples:
GetPricePerCustomAmt(500)
Wanted Price: 281.44
GetPricePerCustomAmt(15000)
Wanted Price: 281.39
GetPricePerCustomAmt(24000)
Wanted Price: 281.08
How do we do it?
(If you know a more efficient way to get the wanted price please do tell too)
Use next with iter to return a default value when nothing matches (i.e. the filter returns an empty Series); for the filtering, use boolean indexing:
def GetPricePerCustomAmt(CustomAmt):
    data = [{"Price":281.48,"USDamt":104.84},{"Price":281.44,"USDamt":5140.77},{"Price":281.42,"USDamt":10072.24},{"Price":281.39,"USDamt":15773.83},{"Price":281.33,"USDamt":19314.54},{"Price":281.27,"USDamt":22255.55},{"Price":281.2,"USDamt":23427.64},{"Price":281.13,"USDamt":23708.77},{"Price":281.1,"USDamt":23738.77},{"Price":281.08,"USDamt":24019.88},{"Price":281.01,"USDamt":25986.95},{"Price":281.0,"USDamt":26127.45}]
    df = pd.DataFrame(data)
    return next(iter(df.loc[df["USDamt"] > CustomAmt, 'Price']), 'no matched')
print(GetPricePerCustomAmt(500))
281.44
print(GetPricePerCustomAmt(15000))
281.39
print(GetPricePerCustomAmt(24000))
281.08
print(GetPricePerCustomAmt(100000))
no matched
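An alternative sketch of the same idea: slice the Price column with boolean indexing and take the first element, falling back to the default when the selection is empty.
matched = df.loc[df["USDamt"] > CustomAmt, "Price"]
wantedprice = matched.iloc[0] if not matched.empty else "no matched"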

How do you set a specific column with a specific value to a new value in a Pandas DF?

I imported a CSV file that has two columns ID and Bee_type. The bee_type has two types in it - bumblebee and honey bee. I'm trying to convert them to numbers instead of names; i.e. instead of bumblebee it says 1.
However, my code is setting everything to 1. How can I keep the ID column its original value and only change the bee_type column?
# load the labels using pandas
labels = pd.read_csv("bees/train_labels.csv")

# Set bumble_bee to one
for index in range(len(labels)):
    labels[labels['bee_type'] == 'bumble_bee'] = 1
I believe you need map with a dictionary if only 2 possible values exist:
labels['bee_type'] = labels['bee_type'].map({'bumble_bee': 1, 'honey_bee': 2})
Another solution is to use numpy.where - set values by condition:
labels['bee_type'] = np.where(labels['bee_type'] == 'bumble_bee', 1, 2)
Be careful with your original approach: without a column selector it assigns 1 to every column of the matching rows, not just bee_type. Removing the loop and using loc on whole rows has the same problem:
labels.loc[labels['bee_type'] == 'bumble_bee'] = 1
print (labels)
ID bee_type
0 1 1
1 1 honey_bee
2 1 1
3 3 honey_bee
4 1 1
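A minimal fix sketch: passing the column name as the second argument to loc restricts the assignment to bee_type, so ID keeps its original values:
labels.loc[labels['bee_type'] == 'bumble_bee', 'bee_type'] = 1
labels.loc[labels['bee_type'] == 'honey_bee', 'bee_type'] = 2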
Sample:
labels = pd.DataFrame({
    'bee_type': ['bumble_bee','honey_bee','bumble_bee','honey_bee','bumble_bee'],
    'ID': list(range(5))
})
print (labels)
ID bee_type
0 0 bumble_bee
1 1 honey_bee
2 2 bumble_bee
3 3 honey_bee
4 4 bumble_bee
labels['bee_type'] = labels['bee_type'].map({'bumble_bee': 1, 'honey_bee': 2})
print (labels)
ID bee_type
0 0 1
1 1 2
2 2 1
3 3 2
4 4 1
As far as I can understand, you want to convert names to numbers. If that's the scenario, please try LabelEncoder; detailed documentation can be found in the scikit-learn LabelEncoder docs.
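A minimal sketch of that approach, assuming scikit-learn is installed; note that LabelEncoder assigns integer codes in alphabetical order of the class names, so the mapping may differ from the 1/2 scheme above:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
labels['bee_type'] = le.fit_transform(labels['bee_type'])
print(le.classes_)  # shows which original name got which code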

Python dataframe check if a value in a column dataframe is within a range of values reported in another dataframe

Apologies if the problem is trivial, but as a Python newbie I wasn't able to find the right solution.
I have two dataframes, and I need to add a column to the first dataframe that is True if a certain value of the first dataframe is between two values of the second dataframe, and False otherwise.
for example:
first_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2':[10,22,15,15,7,130,2]})
second_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2_start':[5,20,11,11,5,110,220],'code2_end':[15,25,20,20,10,120,230]})
first_df
code1 code2
0 1 10
1 1 22
2 2 15
3 2 15
4 3 7
5 1 130
6 1 2
second_df
code1 code2_end code2_start
0 1 15 5
1 1 25 20
2 2 20 11
3 2 20 11
4 3 10 5
5 1 120 110
6 1 230 220
For each row in the first dataframe I should check whether the value reported in the code2 column falls inside one of the possible ranges identified by the rows of the second dataframe second_df. For example:
in row 1 of first_df, code1=1 and code2=22.
Checking second_df, I have 4 rows with code1=1 (rows 0, 1, 5 and 6); the value code2=22 is in the interval identified by code2_start=20 and code2_end=25, so the function should return True.
Considering an example where the function should return False:
in row 5 of first_df, code1=1 and code2=130,
but there is no interval containing 130 where code1=1.
I have tried to use this function
def check(first_df, second_df):
    for i in range(len(first_df)):
        return ((second_df.code2_start <= first_df.code2[i]) & (second_df.code2_end <= first_df.code2[i]) & (second_df.code1 == first_df.code1[i])).any()
and to vectorize it
first_df['output'] = np.vectorize(check)(first_df, second_df)
but obviously with no success.
I would be happy for any input you could provide.
thx.
A.
As a practical example:
first_df.code1[0] = 1
therefore I need to search in second_df all the instances where
second_df.code1 == first_df.code1[0]
0 True
1 True
2 False
3 False
4 False
5 True
6 True
for the instances 0,1,5,6 where the status is True I need to check if the value
first_df.code2[0]
10
is between one of the range identified by
second_df[second_df.code1 == first_df.code1[0]][['code2_start','code2_end']]
code2_start code2_end
0 5 15
1 20 25
5 110 120
6 220 230
since the value of first_df.code2[0] is 10, it is between 5 and 15, the range identified by row 0, therefore my function should return True. In the case of first_df.code1[6] the value would still be 1, so the range table would be the same as above, but first_df.code2[6] is 2 in this case, and there is no interval containing 2, therefore the result should be False.
first_df['output'] = (second_df.code2_start <= first_df.code2) & (second_df.code2_end >= first_df.code2)
This works because when you do something like: second_df.code2_start <= first_df.code2
You get a boolean Series. If you then perform a logical AND on two of these boolean series, you get a Series which has value True where both Series were True and False otherwise.
Here's an example:
>>> import pandas as pd
>>> a = pd.DataFrame([{1:2,2:4,3:6},{1:3,2:6,3:9},{1:4,2:8,3:10}])
>>> a['output'] = (a[2] <= a[3]) & (a[2] >= a[1])
>>> a
1 2 3 output
0 2 4 6 True
1 3 6 9 True
2 4 8 10 True
EDIT:
So based on your updated question and my new interpretation of your problem, I would do something like this:
import pandas as pd
# Define some data to work with
df_1 = pd.DataFrame([{'c1':1,'c2':5},{'c1':1,'c2':10},{'c1':1,'c2':20},{'c1':2,'c2':8}])
df_2 = pd.DataFrame([{'c1':1,'start':3,'end':6},{'c1':1,'start':7,'end':15},{'c1':2,'start':5,'end':15}])
# Function checks if c2 value is within any range matching c1 value
def checkRange(x, code_range):
    idx = code_range.c1 == x.c1
    code_range = code_range.loc[idx]
    check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
    return check.any()

# Apply the checkRange function to each row of the DataFrame
df_1['output'] = df_1.apply(lambda x: checkRange(x, df_2), axis=1)
What I do here is define a function called checkRange which takes as input x, a single row of df_1 and code_range, the entire df_2 DataFrame. It first finds the rows of code_range which have the same c1 value as the given row, x.c1. Then the non matching rows are discarded. This is done in the first 2 lines:
idx = code_range.c1 == x.c1
code_range = code_range.loc[idx]
Next, we get a boolean Series which tells us if x.c2 falls within any of the ranges given in the reduced code_range DataFrame:
check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
Finally, since we only care that the x.c2 falls within one of the ranges, we return the value of check.any(). When we call any() on a boolean Series, it will return True if any of the values in the Series are True.
To call the checkRange function on each row of df_1, we can use apply(). I define a lambda expression in order to send the checkRange function the row as well as df_2. axis=1 means that the function will be called on each row (instead of each column) for the DataFrame.
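A merge-based sketch of the same idea, in case apply is too slow on larger frames: join the two frames on c1, test each candidate interval with between, then collapse back to one boolean per original row with groupby(...).any(). Column names follow the df_1/df_2 example above.
# pair every df_1 row with all df_2 ranges sharing its c1 value
merged = df_1.reset_index().merge(df_2, on='c1', how='left')
# True where c2 falls inside the candidate range (NaN ranges give False)
merged['hit'] = merged['c2'].between(merged['start'], merged['end'])
# one True/False per original df_1 row
df_1['output'] = merged.groupby('index')['hit'].any()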
