Cannot compare/transform list to float - python

I have a column "Employees" that contains the following data:
122.12 (Mark/Jen)
32.11 (John/Albert)
29.1 (Jo/Lian)
I need to count how many values match a specific condition (like x>31).
base = list()
count = 0
count2 = 0
for element in data['Employees']:
    base.append(element.split(' ')[0])
    if base > 31:
        count = count + 1
    else:
        count2 = count2 + 1
print(count)
print(count2)
The output should tell me that count is 2 and count2 is 1. The problem is that I cannot compare a list to a float. How can I make that if statement work?

You have a df with an Employees column that you need to split into number and text; keep the number, convert it to float, then filter it against your threshold:
import pandas as pd

df = pd.DataFrame({'Employees': ["122.12 (Mark/Jen)", "32.11(John/Albert)",
                                 "29.1(Jo/Lian)"]})
print(df)
# split at (
df["value"] = df["Employees"].str.split("(")
# keep the part before the ( and convert to float
df["value"] = pd.to_numeric(df["value"].str[0])
print(df)
# filter it into 2 boolean series; the second is the complement of the first,
# so together the two counts partition the rows
smaller = df["value"] < 31
bigger = df["value"] >= 31
print(smaller)
print(bigger)
# counts
smaller31 = sum(smaller)  # True == 1 -> sum([True, False, False]) == 1
bigger31 = sum(bigger)
print(f"Smaller: {smaller31} bigger: {bigger31}")
Output:
# df
Employees
0 122.12 (Mark/Jen)
1 32.11(John/Albert)
2 29.1(Jo/Lian)
# after split/to_numeric
Employees value
0 122.12 (Mark/Jen) 122.12
1 32.11(John/Albert) 32.11
2 29.1(Jo/Lian) 29.10
# smaller
0 False
1 False
2 True
Name: value, dtype: bool
# bigger
0 True
1 True
2 False
Name: value, dtype: bool
# counted
Smaller: 1 bigger: 2
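To answer the literal question as well: the loop fails because base is a list, so base > 31 compares the whole list to a number. Convert the current element and compare that instead — a minimal sketch of the fixed loop, assuming data['Employees'] holds strings like those in the question:

count = 0
count2 = 0
for element in data['Employees']:
    value = float(element.split('(')[0])  # number before the parenthesis
    if value > 31:
        count += 1
    else:
        count2 += 1
print(count)   # 2
print(count2)  # 1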

Related

If statement to add column to pandas dataframe gives the same values

I have a pandas dataframe called week5_233C. My Python version is 3.9.13.
I wrote an if-statement to add a new column, Spike, to my data set. If the value in Value [pV] is not equal to 0, I want a 1 in that row of the Spike column; if Value [pV] is 0, I want a 0.
The data looks like this:
TimeStamp [µs] Value [pV]
0 1906200 0
1 1906300 0
2 1906400 0
3 1906500 -149012
4 1906600 -149012
And I want it to look like this:
TimeStamp [µs] Value [pV] Spike
0 1906200 0 0
1 1906300 0 0
2 1906400 0 0
3 1906500 -149012 1
4 1906600 -149012 1
I tried:
week5_233C.loc[week5_233C[' Value [pV]'] != 0, 'Spike'] = 1
week5_233C.loc[week5_233C[' Value [pV]'] == 0, 'Spike'] = 0
but all rows in column Spike get the same value.
I also tried:
week5_233C['Spike'] = week5_233C[' Value [pV]'].apply(lambda x: 0 if x == 0 else 1)
Again, it just adds only 0s or only 1s, but does not work with if and else. See example data:
TimeStamp [µs] Value [pV] Spike
0 1906200 0 1
1 1906300 0 1
2 1906400 0 1
3 1906500 -149012 1
4 1906600 -149012 1
Doing it like this:
for i in week5_233C[' Value [pV]']:
    if i != 0:
        week5_233C['Spike'] = 1
    elif i == 0:
        week5_233C['Spike'] = 0
does not add a column and does not give an error; it just makes Python hang and eventually crash.
However, when I run this if-statement with just a print as such:
for i in week5_233C[' Value [pV]']:
    if i != 0:
        print(1)
    elif i == 0:
        print(0)
then it does print the exact values I want. I cannot figure out how to save these values in a new column.
This:
for i in week5_233C[' Value [pV]']:
    if i != 0:
        week5_233C.concat([1, df.iloc['Spike']])
    elif i == 0:
        week5_233C.concat([0, df.iloc['Spike']])
gives me an error: AttributeError: 'DataFrame' object has no attribute 'concat'
How can I make a new column Spike and add the values 0 and 1 based on the value in column Value [pV]?
I think you should check the dtype of the Value [pV] column. You probably have strings, which is why every row gets the same value. Try print(df['Value [pV]'].dtype). If it says object, convert with astype(float) or pd.to_numeric(df['Value [pV]']).
You can also try the following (note the comparison against the string '0', which is what the column holds if its dtype is object):
df['spike'] = np.where(df['Value [pV]'] == '0', 0, 1)
Update
To show bad rows and debug your dataframe, use the following code:
df.loc[pd.to_numeric(df['Value [pV]'], errors='coerce').isna(), 'Value [pV]']
import pandas as pd

df = pd.DataFrame({'TimeStamp [µs]': [1906200, 1906300, 1906400, 1906500, 1906600],
                   'Value [pV]': [0, 0, 0, -149012, -149012]})
# map applies the lambda elementwise (df.agg with a dict relied on a deprecated
# fallback for this): any non-zero value becomes 1, zero stays 0
df['Spike'] = df['Value [pV]'].map(lambda v: int(bool(v)))
print(df)
TimeStamp [µs] Value [pV] Spike
0 1906200 0 0
1 1906300 0 0
2 1906400 0 0
3 1906500 -149012 1
4 1906600 -149012 1
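Putting the two pieces together, a minimal sketch of the dtype check plus the np.where fix, with made-up string data to mimic the suspected object column (note the asker's real column name has a leading space, ' Value [pV]'):

import numpy as np
import pandas as pd

# hypothetical frame whose values were read in as strings
week5_233C = pd.DataFrame({' Value [pV]': ['0', '0', '0', '-149012', '-149012']})
print(week5_233C[' Value [pV]'].dtype)  # object -> the values are strings

# convert to numbers, then flag the non-zero rows
week5_233C[' Value [pV]'] = pd.to_numeric(week5_233C[' Value [pV]'], errors='coerce')
week5_233C['Spike'] = np.where(week5_233C[' Value [pV]'] != 0, 1, 0)
print(week5_233C)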

How to convert a dataframe containing 1s and 0s and add a new column that represents the hex value of the entire row in Python

I have a dataframe of 51 rows and 464 columns; the columns contain 1s and 0s. I want an encoded hex value per row, as in the attached picture.
I was trying to use numpy to do the hex conversion but it fails:
df = pd.DataFrame(np.random.randint(0,2,size=(51, 464)))
#converting into numpy for easier shifting
a = df.values
b = a.dot(2**np.arange(a.size)[::-1])
I want every 4 columns grouped to produce one hexadecimal digit, and if the number of columns is not a multiple of 4 (e.g. 463 instead of 464), the trailing group should be right-padded with as many zeros as needed to make a full hex digit.
The code above only works up to 64 bits in length and then fails.
I was following this example
binary0|1 to hex string
any suggestions on how to do this?
Doesn't this do what you want?
df.apply(lambda row: hex(int(''.join(map(str, row)), base=2)), axis=1)
Convert every number in the row to a string
Join them to create one big binary number as a string
Convert it to an integer with base 2 (since the row is in binary format)
Convert it to hex
Edit: To convert each 4-bit group in the same manner:
def hexize(row):
    hexes = '0x'
    row = ''.join(map(str, row))
    for i in range(0, len(row), 4):
        value = row[i:i+4]
        value = value.ljust(4, '0')  # right-fill the last group with 0
        value = hex(int(value, base=2))
        hexes += value[2:]
    return hexes

df.apply(hexize, axis=1)
hexize('011101100')  # returns '0x760'
Given input data:
ECID,T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,T11,T12,T13,T14,T15,T16,T17,T18,T19,T20,T21,T22,T23,T24,T25,T26,T27,T28,T29,T30,T31,T32,T33,T34,T35,T36,T37,T38,T39,T40,T41,T42,T43,T44,T45,T46,T47,T48,T49,T50,T51
ABC123,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
XYZ345,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DEF789,1,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
434thECID,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
This adds an "Encoded" column similar to what was asked. The first row example in the original question seems to have the wrong number of Fs:
import pandas as pd

def encode(row):
    s = ''.join(str(x) for x in row[1:])  # create the binary string, skipping the ECID column
    s += '0' * ((4 - len(s) % 4) % 4)     # right-pad to a multiple of 4 bits (no-op if already aligned)
    i = int(s, 2)                         # convert to integer, base 2
    h = hex(i).rstrip('0')                # strip trailing zero digits
    return h if h != '0x' else '0x0'      # handle the special case of '0x0' stripping to '0x'

df = pd.read_csv('input.csv')
df['Encoded'] = df.apply(encode, axis=1)
print(df)
Output:
ECID T1 T2 T3 T4 T5 ... T47 T48 T49 T50 T51 Encoded
0 ABC123 1 1 1 1 1 ... 1 1 1 1 1 0xffffffffffffe
1 XYZ345 1 0 0 0 0 ... 0 0 0 0 0 0x8
2 DEF789 1 0 1 0 1 ... 0 0 0 0 0 0xaa
3 434thECID 0 0 0 0 0 ... 0 0 0 0 0 0x0
[4 rows x 53 columns]
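For a frame this wide, a vectorized sketch that avoids Python-level bit strings entirely, assuming the frame holds only 0/1 integers (the 51×464 shape is taken from the question; the digit-mapping comprehension at the end is plain Python but runs once per row):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, size=(51, 464)))

bits = df.to_numpy(dtype=np.uint8)
pad = (-bits.shape[1]) % 4                       # right-pad width to a multiple of 4
bits = np.pad(bits, ((0, 0), (0, pad)))
# collapse each group of 4 bits into one nibble value 0..15
nibbles = bits.reshape(len(bits), -1, 4).dot([8, 4, 2, 1])
df['Encoded'] = ['0x' + ''.join('0123456789abcdef'[n] for n in row)
                 for row in nibbles]
print(df['Encoded'].head())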

Auto Increment Index against Unique Column Values in a DataFrame

I have unique values in a column, but they all have strange codes, and I want to instead have a numeric counter to identify these values. Is there a better way to do this?
class umm:
    inc = 0
    last_val = ''

    @classmethod
    def create_new_index(cls, new_val):
        if new_val != cls.last_val:
            cls.inc += 1
            cls.last_val = new_val
        return cls.inc

df['Doc_ID_index'] = df['Doc_ID'].apply(lambda x: umm.create_new_index(x))
Here is the dataframe:
Doc_ID Sent_ID Doc_ID_index
0 PMC2774701 S1.1 1
1 PMC2774701 S1.2 1
2 PMC2774701 S1.3 1
3 PMC2774701 S1.4 1
4 PMC2774701 S1.5 1
... ... ... ...
46019 3469-0 3469-51 6279
46020 3528-0 3528-10 6280
46021 3942-0 3942-39 6281
46022 4384-0 4384-25 6282
46023 4622-0 4622-45 6283
Method 1
# take the unique Doc_IDs in the column
new_df = pd.DataFrame({'Doc_ID': df['Doc_ID'].unique()})
# assign a unique id
new_df['Doc_ID_index'] = new_df.index + 1
# combine with the original df to get the whole df
df = pd.merge(df, new_df, on='Doc_ID')
Method 2
df['Doc_ID_index'] = df.groupby('Doc_ID', sort=False).ngroup() + 1
(sort=False numbers the groups in order of first appearance, and the + 1 starts the counter at 1, matching the column shown above.)
I hope this helps!
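A third option, if the goal is simply a dense integer id in order of first appearance, is pd.factorize — a minimal sketch with made-up Doc_ID values:

import pandas as pd

df = pd.DataFrame({'Doc_ID': ['PMC2774701', 'PMC2774701', '3469-0', '3528-0']})
# factorize labels values 0..n-1 in order of first appearance
df['Doc_ID_index'] = pd.factorize(df['Doc_ID'])[0] + 1
print(df)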

Better Way to do this in Pandas?

I'm just seeking some guidance on how to do this better. I was doing some basic research to compare Monday's opening and low. The code returns two lists: one with the returns ((Monday's open - low)/Monday's open) and one that is just 1s and 0s to reflect whether the return was positive or negative.
Please take a look as I'm sure there's a better way to do it in pandas but I just don't know how.
# Monday only
m_list = []  # results list
h_list = []  # hit list (open - low > 0)
n = 0  # counter variable
for t in history.index:
    if datetime.datetime.weekday(t[1]) == 1:  # t[1] is the timestamp in the multi-index (if the timestamp is a Monday)
        x = history.ix[n]['open'] - history.ix[n]['low']
        m_list.append((history.ix[n]['open'] - history.ix[n]['low']) / history.ix[n]['open'])
        if x > 0:
            h_list.append(1)
        else:
            h_list.append(0)
        n += 1  # add to index counter
    else:
        n += 1  # add to index counter
print("Mean: ", mean(m_list), "Max: ", max(m_list), "Min: ",
      min(m_list), "Hit Rate: ", sum(h_list)/len(h_list))
You can do that straightforwardly:
(history['open'] - history['low']) > 0
This will give you True for rows where open is greater and False where low is greater.
And if you want 1/0, you can multiply the above statement by 1:
((history['open'] - history['low']) > 0)*1
Example
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.random(10),
                   'b': np.random.random(10)})
Printing the data frame:
print(df)
a b
0 0.675916 0.796333
1 0.044582 0.352145
2 0.053654 0.784185
3 0.189674 0.036730
4 0.329166 0.021920
5 0.163660 0.331089
6 0.042633 0.517015
7 0.544534 0.770192
8 0.542793 0.379054
9 0.712132 0.712552
To make a new column compare that is 1 if a is greater and 0 if b is greater:
df['compare'] = (df['a'] - df['b'] > 0)*1
This will add the new column compare:
a b compare
0 0.675916 0.796333 0
1 0.044582 0.352145 0
2 0.053654 0.784185 0
3 0.189674 0.036730 1
4 0.329166 0.021920 1
5 0.163660 0.331089 0
6 0.042633 0.517015 0
7 0.544534 0.770192 0
8 0.542793 0.379054 1
9 0.712132 0.712552 0
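Applied to the original question, a fully vectorized sketch (the OHLC frame here is made up; note also that datetime.weekday() returns 0 for Monday, so the original check == 1 actually selects Tuesdays):

import numpy as np
import pandas as pd

# hypothetical OHLC frame; the asker's data has a MultiIndex with the timestamp
# in level 1, in which case use history.index.get_level_values(1).weekday
idx = pd.date_range('2024-01-01', periods=30, freq='D')
history = pd.DataFrame({'open': np.random.uniform(90, 110, 30),
                        'low': np.random.uniform(80, 100, 30)}, index=idx)

mondays = history[history.index.weekday == 0]  # Monday is weekday 0
returns = (mondays['open'] - mondays['low']) / mondays['open']
hits = (returns > 0).astype(int)               # 1 if open > low, else 0
print("Mean:", returns.mean(), "Max:", returns.max(),
      "Min:", returns.min(), "Hit Rate:", hits.mean())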

Python pandas resampling

I have the following dataframe:
Timestamp S_time1 S_time2 End_Time_1 End_time_2 Sign_1 Sign_2
0 2413044 0 0 0 0 x x
1 2422476 0 0 0 0 x x
2 2431908 0 0 0 0 x x
3 2441341 0 0 0 0 x x
4 2541232 2526631 2528631 2520631 2530631 10 80
5 2560273 2544946 2546496 2546496 2548496 40 80
6 2577224 2564010 2566010 2566010 2568010 null null
7 2592905 2580959 2582959 2582959 2584959 null null
The table goes on like that. The first column is a timestamp in milliseconds. S_time1 and End_time_1 bound the duration during which a particular sign (number) appears. For example, if we take the 5th row, S_time1 is 2526631, End_time_1 is 2520631, and the corresponding Sign_1 is 10, which means from 2526631 to 2520631 the sign 10 will be displayed. The same goes for S_time2 and End_time_2: the corresponding values in Sign_2 appear in the duration from S_time2 to End_time_2.
I want to resample the index column (Timestamp) into 100-millisecond bins and check which bins the signs belong to. For instance, between each start time and end time there is a difference of 2000 milliseconds, so the corresponding sign number will appear in around 20 consecutive bins, because each bin is 100 milliseconds. In the end I need only two columns: one with the bin times and one with the signs, like the following table (the bin times are made up, just for example):
Bin_time signs
...100 0
...200 0
...300 10
...400 10
...500 10
...600 10
The sign 10 holds for the duration of the corresponding S_time1 to End_time_1; then the next sign, 80, continues for the duration of S_time2 to End_time_2. I am not sure whether this can be done in pandas, but I really need help, in pandas or by any other method.
Thanks in advance for your help and suggestions.
Input:
print(df)
Timestamp S_time1 S_time2 End_Time_1 End_time_2 Sign_1 Sign_2
0 2413044 0 0 0 0 x x
1 2422476 0 0 0 0 x x
2 2431908 0 0 0 0 x x
3 2441341 0 0 0 0 x x
4 2541232 2526631 2528631 2520631 2530631 10 80
5 2560273 2544946 2546496 2546496 2548496 40 80
6 2577224 2564010 2566010 2566010 2568010 null null
7 2592905 2580959 2582959 2582959 2584959 null null
2 approaches:
In [231]: %timeit s(df)
1 loops, best of 3: 2.78 s per loop
In [232]: %timeit m(df)
1 loops, best of 3: 690 ms per loop
import numpy as np
import pandas as pd

def m(df):
    # resample column Timestamp by 100 ms, later convert back to integers
    # (pd.to_timedelta replaces the old astype('timedelta64[ms]') idiom)
    df['Timestamp'] = pd.to_timedelta(df['Timestamp'], unit='ms')
    df['i'] = 1
    df = df.set_index('Timestamp')
    df1 = df[[]].resample('100ms').first().reset_index()
    df1['Timestamp'] = (df1['Timestamp'] / np.timedelta64(1, 'ms')).astype(int)
    # helper column i for merging
    df1['i'] = 1
    # print(df1)
    out = df1.merge(df, on='i', how='left')
    out1 = out[['Timestamp', 'Sign_1']][(out.Timestamp >= out.S_time1) & (out.Timestamp <= out.End_Time_1)]
    out2 = out[['Timestamp', 'Sign_2']][(out.Timestamp >= out.S_time2) & (out.Timestamp <= out.End_time_2)]
    out1 = out1.rename(columns={'Sign_1': 'signs'})
    out2 = out2.rename(columns={'Sign_2': 'signs'})
    df = pd.concat([out1, out2], ignore_index=True).drop_duplicates(subset='Timestamp')
    df1 = df1.set_index('Timestamp')
    df = df.set_index('Timestamp')
    df = df.reindex(df1.index).reset_index()
    return df.rename(columns={'Timestamp': 'Bin_time'})
def s(df):
    # resample column Timestamp by 100 ms, later convert back to integers
    df['Timestamp'] = pd.to_timedelta(df['Timestamp'], unit='ms')
    df = df.set_index('Timestamp')
    out = df[[]].resample('100ms').first()
    out = out.reset_index()
    out['Timestamp'] = (out['Timestamp'] / np.timedelta64(1, 'ms')).astype(int)
    # print(out.head(10))

    # search the start/end intervals for each bin
    def search(x):
        mask1 = (df.S_time1 <= x['Timestamp']) & (df.End_Time_1 >= x['Timestamp'])
        # if at least one True, return the first matching value of the series
        if mask1.any():
            return df.loc[mask1, 'Sign_1'].iloc[0]
        # otherwise check the second start and end time
        else:
            mask2 = (df.S_time2 <= x['Timestamp']) & (df.End_time_2 >= x['Timestamp'])
            if mask2.any():
                # if at least one True, return the first matching value
                return df.loc[mask2, 'Sign_2'].iloc[0]
            else:
                # if all False, return NaN
                return np.nan

    out['signs'] = out.apply(search, axis=1)
    return out.rename(columns={'Timestamp': 'Bin_time'})
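For current pandas, a compact alternative sketch of the same idea: build the 100 ms grid with np.arange and stamp each sign interval onto it. Column names are as in the question; treating a bin as covered when its left edge falls between start and end is an assumption, and since some rows list start/end reversed, the sketch sorts each pair:

import numpy as np
import pandas as pd

def bin_signs(df, step=100):
    # 100 ms grid spanning the Timestamp column
    grid = np.arange(df['Timestamp'].min() // step * step,
                     df['Timestamp'].max() + step, step)
    bins = pd.DataFrame({'Bin_time': grid})
    bins['signs'] = None  # object column so numeric or string signs both fit
    for _, r in df.iterrows():
        for start, end, sign in [(r['S_time1'], r['End_Time_1'], r['Sign_1']),
                                 (r['S_time2'], r['End_time_2'], r['Sign_2'])]:
            lo, hi = min(start, end), max(start, end)
            bins.loc[(bins['Bin_time'] >= lo) & (bins['Bin_time'] <= hi), 'signs'] = sign
    return bins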
