use pandas to check first digit of a column - python

Problem
I need to test the first digit of each number in a column against two conditions.
Conditions
If the first digit of checkVar is greater than 5,
or
the first digit of checkVar is less than 2,
then set newVar = 1.
Solution
One thought I had was to convert it to a string, strip the leading space, and then take [0], but I can't figure out the code.
Perhaps something like:
df.ix[df.checkVar.str[0:1].str.contains('1'),'newVar']=1
It isn't what I want, and for some reason I get this error:
invalid index to scalar variable.
Testing my original variable, I get values that should meet the condition:
df.checkVar.value_counts()
301    62
1      15
2       5
999     3
dtype: int64
Ideally it would look something like this:
    checkVar  newVar
1        NaN     NaN
2        NaN     NaN
3        NaN     NaN
4        NaN     NaN
5      301.0     NaN
6      301.0     NaN
7      301.0     NaN
8      301.0     NaN
9      301.0     NaN
10     301.0     NaN
11     301.0     NaN
12     301.0     NaN
13     301.0     NaN
14       1.0       1
15       1.0       1
UPDATE
My final solution, since the actual problem was more complex:
w = df.EligibilityStatusSP3.dropna().astype(str).str[0].astype(int)
v = df.EligibilityStatusSP2.dropna().astype(str).str[0].astype(int)
u = df.EligibilityStatusSP1.dropna().astype(str).str[0].astype(int)
t = df.EligibilityStatus.dropna().astype(str).str[0].astype(int) #get a series of the first digits of non-nan numbers
df['MCelig'] = ((t < 5)|(t == 9)|(u < 5)|(v < 5)|(w < 5)).astype(int)
df.MCelig = df.MCelig.fillna(0)

t = df.checkVar.dropna().astype(str).str[0].astype(int) #get a series of the first digits of non-nan numbers
df['newVar'] = ((t > 5) | (t < 2)).astype(int)
df.newVar = df.newVar.fillna(0)
This might be slightly better (I'm not sure), but here is another, very similar way to approach it:
t = df.checkVar.dropna().astype(str).str[0].astype(int)
df['newVar'] = 0
df.newVar.update(((t > 5) | (t < 2)).astype(int))
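For anyone unsure why the fillna(0) / update step is needed: the boolean result is computed only on the non-NaN rows, and assigning it back to the DataFrame aligns on the index, leaving NaN everywhere else. A small illustration with made-up data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'checkVar': [np.nan, 301.0, 1.0, 999.0]})
t = df.checkVar.dropna().astype(str).str[0].astype(int)  # index 1, 2, 3 only
df['newVar'] = ((t > 5) | (t < 2)).astype(int)  # aligns on index; row 0 becomes NaN
print(df)
#    checkVar  newVar
# 0       NaN     NaN
# 1     301.0     0.0
# 2       1.0     1.0
# 3     999.0     1.0
df.newVar = df.newVar.fillna(0)  # now row 0 holds 0 instead of NaN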

It's helpful to break the steps up a bit when you are uncertain how to proceed.
def checkvar(x):
    s = str(x)
    first_d = int(s[0])
    if first_d < 2 or first_d > 5:
        return 1
    else:
        return 0
Change the "else: return" value to whatever you want (e.g., "else: pass"). Also, if you want to create a new column:
*Update - I didn't notice the NaNs before. I see that you are still having problems even with the dropna(). Does the following work for you, like it does for me?
df = pd.DataFrame({'old_col': [None, None, None, 13, 75, 22, 51, 61, 31]})
df['new_col'] = df['old_col'].dropna().apply(checkvar)
df
If so, maybe the issue in your data is with the dtype of 'old_col'? Have you tried converting it to float first?
df['old_col'] = df['old_col'].astype('float')

Related

pandas round to neareast custom list

My question is very similar to here, except I would like to round to the closest value instead of always rounding up, so cut() doesn't seem to work.
import pandas as pd
import numpy as np
df = pd.Series([11,16,21, 125])
rounding_logic = pd.Series([15, 20, 100])
labels = rounding_logic.tolist()
rounding_logic = pd.Series([-np.inf]).append(rounding_logic) # add infinity as leftmost edge
pd.cut(df, rounding_logic, labels=labels).fillna(rounding_logic.iloc[-1])
The result is [15,20,100,100], but I'd like [15,15,20,100], since 16 is closest to 15 and 21 closest to 20.
You can try pandas.merge_asof with direction=nearest
out = pd.merge_asof(df.rename('1'), rounding_logic.rename('2'),
                    left_on='1',
                    right_on='2',
                    direction='nearest')
print(out)
     1    2
0   11   15
1   16   15
2   21   20
3  125  100
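A small usage note (my addition, not from the original answer): merge_asof requires both inputs to be sorted on their join keys, which they already are here, and the rounded values end up in column '2' of the result:
rounded = out['2']
print(rounded.tolist())  # [15, 15, 20, 100]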
Get the absolute difference between each value and rounding_logic, and take the candidate with the minimum difference:
>>> rounding_logic.reset_index(drop=True, inplace=True)
>>> df.apply(lambda x: rounding_logic[rounding_logic.sub(x).abs().idxmin()])
0 15.0
1 15.0
2 20.0
3 100.0
dtype: float64
PS: You need to reset the index on rounding_logic because you have a duplicate index after adding -inf to the start of the series.
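Another way to get round-to-nearest, sketched here as an alternative (my own suggestion, assuming the candidate values are sorted ascending), is numpy.searchsorted, which avoids both the merge and the row-wise apply:
import numpy as np
import pandas as pd

values = pd.Series([11, 16, 21, 125])
candidates = np.array([15, 20, 100])  # must be sorted ascending

pos = np.searchsorted(candidates, values)        # insertion point of each value
left = np.clip(pos - 1, 0, len(candidates) - 1)  # nearest candidate below (clipped)
right = np.clip(pos, 0, len(candidates) - 1)     # nearest candidate above (clipped)

# pick whichever neighbour is closer
nearest = np.where(np.abs(values - candidates[left]) <= np.abs(candidates[right] - values),
                   candidates[left], candidates[right])
print(nearest)  # [ 15  15  20 100]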

Performance issue: Flag items in a Pandas Series depending on future values

I have a Series of decimal values and a DateTime index, and I would like to flag each item by following this simple rule:
0 if value - 1 is reached before value + 1 in the future
1 for the other way around
Note that the two offsets can vary: they could be -1 and +2, for example.
Here is an example:
2018-01-04 12:00:00 3550.1
2018-01-04 12:01:00 3551.2
2018-01-04 12:02:00 3550.7
2018-01-04 12:03:00 3551.3
2018-01-04 12:04:00 3550.2
2018-01-04 12:05:00 3549.0
2018-01-04 12:06:00 3549.3
2018-01-04 12:07:00 3548.7
2018-01-04 12:08:00 3549.8
2018-01-04 12:09:00 3545.4
Freq: T, Name: close_1T, dtype: float64
For the first 3 rows, that would give:
1 : 3551.2 is reached on the next row
0 : 3550.2 is reached at 12:04:00
0 : 3550.2 is reached at 12:04:00
I tried this:
se_flag = se.apply(lambda x: 0 if len(se[se > x + 1]) == 0
                   else 1 if len(se[se < x - 1]) == 0
                   else 1 if se[se > x + 1].index[0] > se[se < x - 1].index[0]
                   else 0)
The first two members of the lambda are for handling cases when the value is at or near the highest/lowest in the Series.
It seems to do the trick but scales awfully on my real case, a 1M-item Series.
Can you give me some insights on how to make it more performant? Should I cast the Series to a list? Use a def function rather than a lambda?
Thanks a lot for your help.
So if I understand you correctly, you're looking for the direction of the first deviation that is larger than a in one direction or b in the other (you said they could be variable).
Since the values are 1 and 0, which are essentially the same as TRUE and FALSE, you're saying you want a series where TRUE means the value increases by a before it decreases by b, FALSE otherwise.
Note that when iterating over a Series like this, it is almost always faster to keep it as a Series than to cast it to a list. And lambdas and def functions are equivalent, apart from syntax. Your real speed-ups come either from vectorization or from being smarter about what you're doing.
In this case, you're doing much more work than you need to. You're calculating the minimum and maximum value multiple times, even though they'll probably stay the same. You're also not doing it with the optimized methods, but with a custom function. Just use se.min(), se.max() and compare x directly.
In addition, you're checking the index of the first element where something is true, but you're scanning the entire series instead of stopping as soon as you find the first match. You're essentially doing a lot of unnecessary extra work!
Finally, you're looking at the array from the start, instead of from x onwards, is that on purpose? I assume it isn't, because of the example you gave.
In any case, this answer pointed me towards np.argmax, which returns the index of the first element where something is true if the passed numpy array contains only booleans. The only problem is that it returns 0 if no such value exists.
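For illustration (my own snippet), this is the np.argmax behaviour the answer relies on:
import numpy as np

print(np.argmax(np.array([False, False, True, True])))  # 2 -> index of the first True
print(np.argmax(np.array([False, False, False])))       # 0 -> same result as if the first element were True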
If we know the index, you could do two things: calculate the index for the lower and upper bound and see which occurs first, or calculate the index in an OR loop and then access the array to see which one was right. Performance-wise, I think they should be similar, but it depends how far you want to go!
import numpy as np
import pandas as pd

lower = 1; upper = 1
se = pd.Series([3550.1, 3551.2, 3550.7, 3551.3, 3550.2, 3549.0, 3549.3, 3548.7, 3549.8, 3545.4])

# I put this in a function to prevent polluting the global scope with `counter`
def calculate(se):
    counter = 0
    se_values = se.values
    se_len = len(se)  # otherwise you get an error if `np.argmax` is applied to an empty series

    def goes_up_before_down(x):
        nonlocal counter
        counter += 1
        if counter == se_len:
            return False
        next_values = se_values[counter:]
        idx_first_larger_value = np.argmax(next_values >= x + upper)
        idx_first_smaller_value = np.argmax(next_values <= x - lower)
        if idx_first_larger_value == 0 and x + upper >= next_values.max():
            # There is no larger number in the rest of the series;
            # we can only go down from here
            return False
        if idx_first_smaller_value == 0 and x - lower <= next_values.min():
            # There is no smaller number in the rest of the series;
            # we can only go up from here
            return True
        return idx_first_larger_value < idx_first_smaller_value

    return se.apply(goes_up_before_down)
and this yields:
Out[2]:
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
9 False
dtype: bool
which can be cast into
calculate(se).astype(np.int64)
Out[3]:
0 1
1 0
2 0
3 0
4 0
5 0
6 0
7 1
8 0
9 0
dtype: int64

How many data points are plotted on my matplotlib graph?

So I want to count the number of data points plotted on my graph, to keep a running total of graphed data. The problem is that my data table has NaN values in some rows of one column that don't line up with the NaN values (if any) in another column. For example:
# I use num1 as my y-coordinate and num1-num2 for my x-coordinate.
num1 num2 num3
1 NaN 25
NaN 7 45
3 8 63
NaN NaN 23
5 10 42
NaN 4 44
# So in this case, there should be only 2 data points on the graph between num1 and num2. For num1 and num3, there should be 3. There should be 4 data points between num2 and num3.
I believe Matplotlib doesn't plot the rows where a column contains a NaN value, since it's null (please correct me if I'm wrong; I can only tell because no dots land on the 0 coordinate of the x and y axes). In the beginning, I thought I could get away with using .count() to find the smaller of the two columns and use that as my tracker, but realistically that won't work, as shown in my example above, because the real count can be even less than that when one column has a NaN value and the other has an actual value. Some examples of code I tried:
# both x and y are columns within the DataFrame and are used to "count" how many
# data points are being graphed.
def findAmountOfDataPoints(colA, colB):
    if colA.count() < colB.count():
        print(colA.count())  # Since it's the smaller value, print the number of values in colA.
    else:
        print(colB.count())  # Since it's the smaller value, print the number of values in colB.
Also, I thought about using .value_counts(), but I'm not sure if that's the exact function I'm looking for to accomplish what I want. Any suggestions?
Edit 1: Changed Data Frame names to make example clearer hopefully.
If I understood your problem correctly, and assuming that your table is a pandas DataFrame df, the following code should work:
sum((~np.isnan(df['num1']) & (~np.isnan(df['num2']))))
How it works:
np.isnan returns True if a cell is NaN. ~np.isnan is the inverse, hence it returns True when the cell is not NaN.
The code checks where both the column "num1" AND the column "num2" contain a non-NaN value; in other words, it returns True for the rows where both values exist.
Finally, those good rows are counted with sum, which takes into account only True values.
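An equivalent, pandas-native spelling of the same idea (my addition), which is easy to repeat for any pair of columns:
# count the rows where both columns are non-NaN
print(df[['num1', 'num2']].notna().all(axis=1).sum())  # 2
print(df[['num1', 'num3']].notna().all(axis=1).sum())  # 3
print(df[['num2', 'num3']].notna().all(axis=1).sum())  # 4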
The way I understood it is that you need the number of combinations of points that are not NaN. Using a function I found, I came up with this:
import pandas as pd
import numpy as np
def choose(n, k):
    """
    A fast way to calculate binomial coefficients by Andrew Dalke (contrib).
    https://stackoverflow.com/questions/3025162/statistics-combinations-in-python
    """
    if 0 <= k <= n:
        ntok = 1
        ktok = 1
        for t in range(1, min(k, n - k) + 1):
            ntok *= n
            ktok *= t
            n -= 1
        return ntok // ktok
    else:
        return 0

data = {'num1': [1, np.nan, 3, np.nan, 5, np.nan],
        'num2': [np.nan, 7, 8, np.nan, 10, 4],
        'num3': [25, 45, 63, 23, 42, 44]}
df = pd.DataFrame(data)
df['notnulls'] = df.notnull().sum(axis=1)
df['plotted'] = df.apply(lambda row: choose(int(row.notnulls), 2), axis=1)
print(df)
print("Total data points: ", df['plotted'].sum())
With this result:
num1 num2 num3 notnulls plotted
0 1.0 NaN 25 2 1
1 NaN 7.0 45 2 1
2 3.0 8.0 63 3 3
3 NaN NaN 23 1 0
4 5.0 10.0 42 3 3
5 NaN 4.0 44 2 1
Total data points: 9

Why does Python function return 1.0 (float) when `return 1` is specified?

I have a lot of strings, some consisting of one sentence and some of multiple sentences. My goal is to determine which one-sentence strings end with an exclamation mark '!'.
My code gives a strange result. Instead of returning '1' if found, it returns 1.0. I have tried return int(1), but that does not help. I am fairly new to coding and do not understand why this is and how I can get 1 as an integer.
'Sentences'
0 [This is a string., And a great one!]
1 [It's a wonderful sentence!]
2 [This is yet another string!]
3 [Strange strings have been written.]
4 etc. etc.
e = df['Sentences']

def Single(s):
    if len(s) == 1:             # Select the items with only one sentence
        count = 0
        for k in s:             # loop over every sentence
            if (k[-1] == '!'):  # check if sentence ends with '!'
                count = count + 1
        if count == 1:
            return 1
        else:
            return ''

df['Single'] = e.apply(Single)
This returns the correct result, except that it shows '1.0' where there should be '1'.
'Single'
0 NaN
1 1.0
2 1.0
3
4 etc. etc.
Why does this happen?
The reason is that np.nan is considered a float. This makes the series of type float. You cannot avoid this unless you want your column to be of type Object [i.e. anything]. That is inefficient and inadvisable, and I refuse to show you how to do it.
If there is an alternative value you can use instead of np.nan, e.g. 0, then there is a workaround. You can replace NaN values with 0 and then convert to int:
s = pd.Series([1, np.nan, 2, 3])
print(s)
# 0 1.0
# 1 NaN
# 2 2.0
# 3 3.0
# dtype: float64
s = s.fillna(0).astype(int)
print(s)
# 0 1
# 1 0
# 2 2
# 3 3
# dtype: int32
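If the missing values need to stay missing, newer pandas versions (0.24+) also have a nullable integer dtype that keeps integers while representing missing values as <NA> (my addition, not part of the original answer):
s = pd.Series([1, np.nan, 2, 3])
print(s.astype('Int64'))  # note the capital "I"
# 0       1
# 1    <NA>
# 2       2
# 3       3
# dtype: Int64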
Use astype(int)
Ex:
df['Single'] = e.apply(Single).astype(int)
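One caveat (my note, not from the answer): with the Single function above, the applied result still contains missing values and empty strings, so a bare astype(int) will raise. A hedged sketch that coerces everything else to 0 first:
df['Single'] = (pd.to_numeric(e.apply(Single), errors='coerce')  # '' and None become NaN
                  .fillna(0)
                  .astype(int))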

Insert Values into Pandas Dataframe backwards (High Index to low)

Found a solution using .fillna
As you can guess, my title is already confusing, and so am I!
I have a dataframe like this
Index Values
0 NaN
1 NaN
...................
230 350.21
231 350.71
...................
1605 922.24
Between 230 and 1605 I have values, but not for the entries before that (0 through 229). So I calculated the slope to approximate the missing data and stored it in slope.
Y1 = df['Values'].min()
X1ID = df['Values'].idxmin()
Y2 = df['Values'].max()
X2ID = df['Values'].idxmax()
slope = (Y2 - Y1)/(X2ID - X1ID)
In essence, I want to take the .min of Values, subtract the slope, and insert the new value at the index before the current .min. However, I am completely lost. I tried something like this:
df['Values2'] = df['Values'].min().apply(lambda x: x.min() - slope)
But that is obviously rubbish. I would greatly appreciate some advice.
EDIT
So after trying multiple ways I found a crude solution that at least works for me.
loopcounter = 0
missingValue = []
missingindex = []
missingindex.append(loopcounter)
missingValue.append(Y1)
for minValue in missingValue:
    minValue = minValue - slope
    missingValue.append(minValue)
    loopcounter += 1
    missingindex.append(loopcounter)
    if loopcounter == 230:
        break
del missingValue[0]
missingValue.reverse()
del missingindex[-1]
First I created two lists, one for the missing values and the other for the indexes.
Then I added my minimum value (Y1) to the value list and started my loop.
I wanted the loop to stop after 230 iterations (the number of missing values).
Each pass subtracts the slope from the last item in the list, starting with the minimum value, while also appending the counter to the missingindex list.
Deleting the first value and reversing the order puts the list into the correct order.
missValue = dict(zip(missingindex,missingValue))
I then combined the two lists into a dictionary
df['Values'] = df['Values'].fillna(missValue)
Afterwards I used the .fillna function to fill up my dataframe with the dictionary.
This worked for me; I know it's probably not the most elegant solution...
I would like to thank everyone that invested their time in trying to help me, thanks a lot.
Check this. However, I feel you would have to put this in a loop, as the insertion and the min calculation have to be redone each time.
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=('Values',), data=[
    np.nan,
    np.nan,
    350.21,
    350.71,
    922.24,
])
Y1 = df['Values'].min()
X1ID = df['Values'].idxmin()
Y2 = df['Values'].max()
X2ID = df['Values'].idxmax()
slope = (Y2 - Y1)/(X2ID - X1ID)
line = pd.DataFrame(data=[Y1-slope], columns=('Values',), index=[X1ID])
df2 = pd.concat([df.ix[:X1ID-1], line, df.ix[X1ID:]]).reset_index(drop=True)
print df2
The insert logic is provided here Is it possible to insert a row at an arbitrary position in a dataframe using pandas?
I think you can use loc with interpolate:
print df
Values
Index
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
229 NaN
230 350.21
231 350.71
1605 922.24
#add value 0 to index = 0
df.at[0, 'Values'] = 0
#add value Y1 - slope (349.793978) to max NaN value
df.at[X1ID-1, 'Values'] = Y1 - slope
print df
Values
Index
0 0.000000
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
229 349.793978
230 350.210000
231 350.710000
1605 922.240000
print df.loc[0:X1ID-1, 'Values']
Index
0 0.000000
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
229 349.793978
Name: Values, dtype: float64
#filter values by indexes and interpolate
df.loc[0:X1ID-1, 'Values'] = df.loc[0:X1ID-1, 'Values'].interpolate(method='linear')
print df
Values
Index
0 0.000000
1 49.970568
2 99.941137
3 149.911705
4 199.882273
5 249.852842
6 299.823410
229 349.793978
230 350.210000
231 350.710000
1605 922.240000
I will revise this a little bit:
df['Values2'] = df['Values']
df.ix[df.Values2.isnull(), 'Values2'] = (Y1 - slope)
EDIT
Or try to put this in a loop like below. This will recursively fill in values until it reaches the start of the series:
def fix_rec(series):
    Y1 = series.min()
    X1ID = series.idxmin()  ##; print(X1ID)
    Y2 = series.max()
    X2ID = series.idxmax()
    slope = (Y2 - Y1) / (X2ID - X1ID)
    if X1ID == 0:  ## termination condition
        return series
    series.loc[X1ID - 1] = Y1 - slope
    return fix_rec(series)
Call it like this:
df['Values2'] = df['Values']
fix_rec(df.Values2)
I hope that helps!
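For completeness, here is a loop-free sketch of the same back-extrapolation (my own addition, assuming the index is the integer row label as in the example): every leading NaN at index i gets Y1 - slope * (X1ID - i).
# fill the leading NaNs directly from the slope, without looping or recursing
mask = df['Values'].isna().to_numpy() & (df.index < X1ID)
missing = df.index[mask]
df.loc[missing, 'Values'] = Y1 - slope * (X1ID - missing)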
