  1    2    3    4  Combined     Series
0.5    5   10  NaN  0.5, 5, 10   Increasing
  1    2    3    4  1, 2, 3, 4   Increasing
  8    5    3   -1  8, 5, 3, -1  Decreasing
  4    8    5    3  4, 8, 5, 3   neither
I have a table with the above columns [1, 2, 3, 4, Combined].
How can I automate the series determination of the Combined column in Python?
def test(combine):
return "Increasing." if all(combine[i] < combine[i + 1] for i in range(len(combine) - 1))
else
"Decreasing." if all(combine[i + 1] < combine[i] for i in range(len(combine) - 1))
else
"neither!"
But this gives me an error with outcome '0'.
Your function works fine, provided that:
It is written with proper indentations (and line-continuations),
The column Combined contains lists of numbers (not e.g. strings such as '[0.5, 5, 10]', or lists of strings, etc).
First, let's make sure the column contains lists of floats and/or ints:
assert df['Combined'].apply(lambda x: isinstance(x, list) and all(isinstance(xi, (int, float)) for xi in x)).all()
If that is not the case, then correct it:
from pandas.api.types import is_numeric_dtype
# test that the first 4 columns are numeric
assert df.iloc[:, :4].apply(is_numeric_dtype).all()
df['Combined'] = df.iloc[:, :4].apply(lambda s: s.dropna().tolist(), axis=1)
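If instead Combined already holds string representations such as '[0.5, 5, 10]', one way to recover actual lists is to parse them (a sketch, assuming the strings are bracketed Python-style lists):
import ast

# hypothetical fix if Combined holds strings like '[0.5, 5, 10]':
# parse each string back into an actual list of numbers
df['Combined'] = df['Combined'].apply(ast.literal_eval)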
Then, make sure the syntax of your function is correct (indentation and line-continuation):
def test(combine):
    return "Increasing." if all(combine[i] < combine[i + 1] for i in range(len(combine) - 1)) \
        else "Decreasing." if all(combine[i + 1] < combine[i] for i in range(len(combine) - 1)) \
        else "neither!"
Then:
>>> df['Combined'].apply(test)
0 Increasing.
1 Increasing.
2 Decreasing.
3 neither!
You could make the function a bit more concise:
def test(x):
    return "Increasing." if all(a < b for a, b in zip(x, x[1:])) \
        else "Decreasing." if all(a > b for a, b in zip(x, x[1:])) \
        else "neither!"
Or you could use Pandas Series built-in monotonic properties:
def trend_comment(a):
    s = pd.Series(a)
    return 'increasing' if s.is_monotonic_increasing \
        else 'decreasing' if s.is_monotonic_decreasing \
        else 'neither'
>>> df['Combined'].apply(trend_comment)
0 increasing
1 increasing
2 decreasing
3 neither
(Note that the definition is slightly different from your tests: Pandas' is_monotonic_increasing really means "monotonic non-decreasing", i.e. consecutive values may be equal or increasing.)
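For example (illustrative values only), a series with a repeated element shows the difference between the two definitions:
import pandas as pd

s = pd.Series([1, 1, 2])
vals = s.tolist()

print(s.is_monotonic_increasing)                    # True: non-decreasing
print(all(a < b for a, b in zip(vals, vals[1:])))   # False: not strictly increasing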
Related
I have a data frame consisting of lists as elements. I want to find the closest matching values within a percentage of a given value.
My code:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [[1, 2], [4, 5, 6]]})
df
A
0 [1, 2]
1 [4, 5, 6]
# in each row, let's find the values and their index that match 5 within 20% tolerance
val = 5
tol = 0.2  # find values matching 5, or within 20% of 5 (i.e. 4 to 6)
df['Matching_index'] = (df['A'].map(np.array)-val).map(abs).map(np.argmin)
Present solution:
df
A Matching_index
0 [1, 2] 1 # 2 matches closely with 5 but this is wrong
1 [4, 5, 6] 1 # 5 matches with 5, correct.
Expected solution:
df
A Matching_index
0 [1, 2] NaN # No matching value, hence NaN
1 [4, 5, 6] 1 # 5 matches with 5, correct.
The idea is to get the absolute difference from val, replace values that do not match the tolerance with missing values, and finally use np.nanargmin, which raises an error if all values are missing, hence the added m.any() condition:
def f(x):
    a = np.abs(np.array(x) - val)
    m = a <= val * tol
    return np.nanargmin(np.where(m, a, np.nan)) if m.any() else np.nan
df['Matching_index'] = df['A'].map(f)
print (df)
A Matching_index
0 [1, 2] NaN
1 [4, 5, 6] 1.0
Pandas solution:
df1 = pd.DataFrame(df['A'].tolist(), index=df.index).sub(val).abs()
df['Matching_index'] = df1.where(df1 <= val * tol).dropna(how='all').idxmin(axis=1)
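To make the one-liner above easier to follow, this is roughly what the intermediate df1 looks like (a sketch with the same df, val and tol as in the question):
import pandas as pd

df = pd.DataFrame({'A': [[1, 2], [4, 5, 6]]})
val, tol = 5, 0.2

# expand the lists into columns, then take the absolute distance to val
df1 = pd.DataFrame(df['A'].tolist(), index=df.index).sub(val).abs()
# row 0 -> [4, 3, NaN], row 1 -> [1, 0, 1]; only row 1 has values <= val * tol,
# so idxmin(axis=1) returns column 1 for it and row 0 ends up as NaN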
I'm not sure if you want all the indexes or just a count.
Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[[1,2],[4,5,6,7,8]]})
val = 5
tol = 0.3
def closest(arr, val, tol):
    idxs = [idx for idx, el in enumerate(arr) if np.abs(el - val) < val * tol]
    result = len(idxs) if len(idxs) != 0 else np.nan
    return result
df['Matching_index'] = df['A'].apply(closest, args=(val,tol,))
df
If you want all the indexes, just return idxs instead of len(idxs).
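For reference, a small variant that returns the matching indexes themselves rather than their count might look like this (the helper name matching_indexes is just for illustration; same assumed val and tol):
import numpy as np
import pandas as pd

def matching_indexes(arr, val, tol):
    # collect every index whose value lies within the tolerance band around val
    idxs = [idx for idx, el in enumerate(arr) if np.abs(el - val) < val * tol]
    return idxs if idxs else np.nan  # NaN when nothing matches

df = pd.DataFrame({'A': [[1, 2], [4, 5, 6, 7, 8]]})
df['Matching_index'] = df['A'].apply(matching_indexes, args=(5, 0.3))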
Consider these series:
>>> a = pd.Series('abc a abc c'.split())
>>> b = pd.Series('a abc abc a'.split())
>>> pd.concat((a, b), axis=1)
0 1
0 abc a
1 a abc
2 abc abc
3 c a
>>> unknown_operation(a, b)
0 False
1 True
2 True
3 False
The desired logic is to determine if the string in the left column is a substring of the string in the right column. pd.Series.str.contains does not accept another Series, and pd.Series.isin checks if the value exists in the other series (not in the same row specifically). I'm interested to know if there's a vectorized solution (not using .apply or a loop), but it may be that there isn't one.
Let us try numpy's defchararray.find, which is vectorized:
from numpy.core.defchararray import find
find(df[1].values.astype(str), df[0].values.astype(str)) != -1
Out[740]: array([False, True, True, False])
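On recent NumPy versions the import from numpy.core may warn or fail; the same vectorized check can also be spelled through the public np.char namespace (as the benchmark answer below does), for example:
import numpy as np
import pandas as pd

a = pd.Series('abc a abc c'.split())
b = pd.Series('a abc abc a'.split())

# vectorized "is a[i] a substring of b[i]" check
mask = np.char.find(b.values.astype(str), a.values.astype(str)) != -1
result = pd.Series(mask, index=a.index)   # 0 False, 1 True, 2 True, 3 False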
IIUC,
df[1].str.split('', expand=True).eq(df[0], axis=0).any(axis=1) | df[1].eq(df[0])
Output:
0 False
1 True
2 True
3 False
dtype: bool
I tested various functions with a randomly generated DataFrame of 1,000,000 five-letter entries.
Running on my machine, the averages of 3 tests showed:
zip > v_find > to_list > any > apply
0.21s > 0.79s > 1s > 3.55s > 8.6s
Hence, I would recommend using zip:
[x[0] in x[1] for x in zip(df['A'], df['B'])]
or vectorized find (as proposed by BENY)
np.char.find(df['B'].values.astype(str), df['A'].values.astype(str)) != -1
My test setup:
import random
import string

import numpy as np
import pandas as pd

n = 1_000_000  # number of rows

def generate_string(length):
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

A = [generate_string(5) for x in range(n)]
B = [generate_string(5) for y in range(n)]
df = pd.DataFrame({"A": A, "B": B})
to_list = pd.Series([a in b for a, b in df[['A', 'B']].values.tolist()])
apply = df.apply(lambda s: s["A"] in s["B"], axis=1)
v_find = np.char.find(df['B'].values.astype(str), df['A'].values.astype(str)) != -1
any = df["B"].str.split('', expand=True).eq(df["A"], axis=0).any(axis=1) | df["B"].eq(df["A"])
zip = [x[0] in x[1] for x in zip(df['A'], df['B'])]
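For reference, a minimal self-contained sketch of how such timings might be collected with timeit (the row count n and five-letter strings match the benchmark description above; repeat=3 mirrors the average of 3 tests):
import random
import string
import timeit

import numpy as np
import pandas as pd

n = 1_000_000  # row count used in the benchmark above

def generate_string(length):
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

df = pd.DataFrame({"A": [generate_string(5) for _ in range(n)],
                   "B": [generate_string(5) for _ in range(n)]})

candidates = {
    "zip": lambda: [a in b for a, b in zip(df["A"], df["B"])],
    "v_find": lambda: np.char.find(df["B"].values.astype(str),
                                   df["A"].values.astype(str)) != -1,
    "apply": lambda: df.apply(lambda s: s["A"] in s["B"], axis=1),
}
for name, func in candidates.items():
    avg = np.mean(timeit.repeat(func, number=1, repeat=3))
    print(f"{name}: {avg:.2f}s")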
I have the following example and I cannot understand why it doesn't work.
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
def balh(a, b):
    z = a + b
    if z.any() > 1:
        return z + 1
    else:
        return z
df['col3'] = balh(df.col1, df.col2)
Output:
   col1  col2  col3
0     1     3     4
1     2     4     6
My expected output would be to see 5 and 7, not 4 and 6, in col3, since 4 and 6 are greater than 1 and my intention is to add 1 when a + b is greater than 1.
The any method evaluates whether any element of the pandas.Series or pandas.DataFrame is True. A non-zero integer is evaluated as True. So with if z.any() > 1 you are essentially comparing the True returned by the method with the integer 1.
You need to apply the condition directly to the pandas.Series, which returns a boolean pandas.Series on which you can safely call the any method.
This will be the same for the all method.
def balh(a, b):
    z = a + b
    if (z > 1).any():
        return z + 1
    else:
        return z
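Applying the corrected function to the frame from the question should then give the expected 5 and 7 (a quick check reusing the same d as above):
import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)

def balh(a, b):
    z = a + b
    if (z > 1).any():   # condition the Series first, then reduce with any()
        return z + 1
    else:
        return z

df['col3'] = balh(df.col1, df.col2)
print(df)
#    col1  col2  col3
# 0     1     3     5
# 1     2     4     7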
As @arhr clearly explained, the issue was the incorrect call to z.any(), which returns True when there is at least one non-zero element in z. This resulted in True > 1, which is a False expression.
A one line alternative to avoid the if statement and the custom function call would be the following:
df['col3'] = df.iloc[:, :2].sum(1).transform(lambda x: x + int(x > 1))
This takes the first two columns of the dataframe, sums the elements along each row, and transforms the new column according to the lambda function.
The iloc can also be omitted because the dataframe is instantiated with only two columns col1 and col2, thus the line can be refactored to:
df['col3'] = df.sum(1).transform(lambda x: x + int(x > 1))
Example output:
col1 col2 col3
0 1 3 5
1 2 4 7
Basically if a column of my pandas dataframe looks like this:
[1 1 1 2 2 2 3 3 3 1 1]
I'd like it to be turned into the following:
[1 2 3 1]
You can write a simple function that loops through the elements of your series, storing only the first element in each run.
As far as I know, there is no tool built in to pandas to do this. But it is not a lot of code to do it yourself.
import pandas
example_series = pandas.Series([1, 1, 1, 2, 2, 3])
def collapse(series):
    last = ""
    seen = []
    for element in series:
        if element != last:
            last = element
            seen.append(element)
    return seen
collapse(example_series)
In the code above, you will iterate through each element of a series and check if it is the same as the last element seen. If it is not, store it. If it is, ignore the value.
If you need to handle the return value as a series you can change the last line of the function to:
return pandas.Series(seen)
You could write a function that does the following:
x = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1])
y = x-x.shift(1)
y[0] = 1
result = x[y!=0]
You can use DataFrame's diff and indexing:
>>> df = pd.DataFrame([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df[0].diff()!=0]
0
0 1
2 2
6 3
10 1
>>> df[df[0].diff()!=0].values.ravel() # If you need an array
array([1, 2, 3, 1])
Same works for Series:
>>> df = pd.Series([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df.diff()!=0].values
array([1, 2, 3, 1])
You can use shift to create a boolean mask to compare the row against the previous row:
In [67]:
s = pd.Series([1,1,2,2,2,2,3,3,3,3,4,4,5])
s[s!=s.shift()]
Out[67]:
0 1
2 2
6 3
10 4
12 5
dtype: int64
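If you then want the collapsed values as a plain list, or with a fresh 0..n index instead of the original positions, a small extra step (not part of the answer above) would be:
import pandas as pd

s = pd.Series([1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1])
collapsed = s[s != s.shift()]

print(collapsed.tolist())                 # [1, 2, 3, 1]
print(collapsed.reset_index(drop=True))   # same values, re-indexed 0..3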
I have a dataframe df like
A B
1 2
3 4
I then want to create 2 new series
t = pd.Series()
r = pd.Series()
I was able to assign values to t using the condition cond as below
t = "1+" + df.A.astype(str) + '+' + df.B.astype(str)
cond = df['A']<df['B']
t[cond] = "1+" + df.loc[cond,'B'].astype(str) + '+' + df.loc[cond,'A'].astype(str)
But I'm having problems with r. I just want r to contain values of 2 when cond is satisfied and 1 otherwise.
If I just try
r = 1
r[cond] = 2
Then I get TypeError: 'int' object does not support item assignment
I figure I could just loop through df and check cond row by row, but I was wondering if Pandas offers a more efficient way instead?
You will laugh at how easy this is:
r = cond + 1
The reason is that cond is a boolean Series (True and False values), which evaluate to 1 and 0. Adding one to it coerces the booleans to ints, so True maps to 2 and False maps to 1.
df = pd.DataFrame({'A': [1, 3, 4],
'B': [2, 4, 3]})
cond = df['A'] < df['B']
>>> cond + 1
0 2
1 2
2 1
dtype: int64
When you assign 1 to r as in
r = 1
r now references the integer 1. So when you call r[cond] you're treating an integer like a series.
You first want to create a series of ones for r with the same size as cond. Something like:
r = pd.Series(np.ones(cond.shape))
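With r as an actual Series, the boolean-mask assignment from the question then works as intended (a quick sketch reusing the cond from the example above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 4], 'B': [2, 4, 3]})
cond = df['A'] < df['B']

r = pd.Series(np.ones(cond.shape))
r[cond] = 2   # item assignment now works because r is a Series, not an int
# r: [2.0, 2.0, 1.0]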