I have a dataframe:
A B C D
1 0 0 2
0 1 0 0
0 0 0 0
I need to select all values which are greater than 0 and put them in a list.
If a row doesn't contain any positive value, 0 should be written to the list.
So, the output for given dataframe should look like this:
[1,2,1,0]
How can this be resolved?
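For reference, a minimal setup reproducing the example frame (column names and values are taken from the question; only the pandas import is assumed):
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 0],
                   'B': [0, 1, 0],
                   'C': [0, 0, 0],
                   'D': [2, 0, 0]})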
Here is a simple loop you could use (looping through df.values gives us rows as arrays):
output = []
for ar in df.values:
    nonzeros = ar[ar > 0]
    # If nonzeros is not empty, extend the output
    if nonzeros.size:
        output.extend(nonzeros)
    # If not, add 0
    else:
        output.append(0)
print(output)
returns:
[1, 2, 1, 0]
We can make extensive use of pandas + numpy here:
Mask all values which are greater than 0:
m = df.gt(0)
A B C D
0 True False False True
1 False True False False
2 False False False False
Flag rows that don't contain any value above 0 (m.any(axis=1) is True where a row has at least one positive value, so s1 is 0 exactly for the all-zero rows):
s1 = m.any(axis=1).astype(int).values
Get all the values greater than 0 in an array:
s2 = df.values[m]
Finally, concatenate both arrays; the zeros contributed by the all-zero rows are appended at the end, which matches the expected order here because the all-zero row is last:
np.concatenate([s2, s1[s1==0]]).tolist()
Output
[1, 2, 1, 0]
In your case, first stack the df, then apply your condition to each row group: if the row contains non-zero values we select them; if it is all zeros, we keep a single zero.
df.stack().groupby(level=0).apply(lambda x : x.head(1) if all(x==0) else x[x!=0]).tolist()
[1, 2, 1, 0]
Or, without the row-wise lambda:
np.concatenate(df.mask(df==0).stack().groupby(level=0).apply(list).reindex(df.index,fill_value=[0]).values)
array([1., 2., 1., 0.])
Or, to shorten the process:
np.concatenate(list(map(lambda x : [x[0]] if all(x==0) else x[x!=0],df.values)))
array([1, 2, 1, 0])
You could apply a custom function which processes each row of the DataFrame and returns a list, then sum the returned lists to join them.
In [1]: import pandas as pd
In [2]: df = pd.read_clipboard()
In [3]: df
Out[3]:
A B C D
0 1 0 0 2
1 0 1 0 0
2 0 0 0 0
In [4]: def get_positive_values(row):
   ...:     # If all elements in a row are zeros
   ...:     # then return a list with a single zero
   ...:     if row.eq(0).all():
   ...:         return [0]
   ...:     # Else return a list with positive values only.
   ...:     return row[row.gt(0)].tolist()
   ...:
In [5]: df.apply(get_positive_values, axis=1).sum()
Out[5]: [1, 2, 1, 0]
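The final .sum() works because Series.sum() reduces with +, and adding lists concatenates them, e.g.:
pd.Series([[1, 2], [1], [0]]).sum()  # -> [1, 2, 1, 0]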
Related
I have a data frame consisting of lists as elements. I want to find the closest matching values within a percentage of a given value.
My code:
df = pd.DataFrame({'A':[[1,2],[4,5,6]]})
df
A
0 [1, 2]
1    [4, 5, 6]
# in each row, let's find the values and their index that match 5 with 20% tolerance
val = 5
tol = 0.2  # find values within 20% of 5, i.e. between 4 and 6
df['Matching_index'] = (df['A'].map(np.array)-val).map(abs).map(np.argmin)
Present solution:
df
A Matching_index
0 [1, 2] 1 # 2 matches closely with 5 but this is wrong
1 [4, 5, 6] 1 # 5 matches with 5, correct.
Expected solution:
df
A Matching_index
0 [1, 2] NaN # No matching value, hence NaN
1 [4, 5, 6] 1 # 5 matches with 5, correct.
The idea is to get the absolute difference from val, then replace values that fall outside the tolerance with missing values, and finally take np.nanargmin. Since np.nanargmin raises an error if all values are missing, a guard with .any() is added:
def f(x):
    a = np.abs(np.array(x) - val)
    m = a <= val * tol
    return np.nanargmin(np.where(m, a, np.nan)) if m.any() else np.nan
df['Matching_index'] = df['A'].map(f)
print (df)
A Matching_index
0 [1, 2] NaN
1 [4, 5, 6] 1.0
Pandas solution:
df1 = pd.DataFrame(df['A'].tolist(), index=df.index).sub(val).abs()
df['Matching_index'] = df1.where(df1 <= val * tol).dropna(how='all').idxmin(axis=1)
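For the sample frame this produces the same result as the function above; the row dropped by dropna(how='all') simply comes back as NaN on assignment, since assignment aligns on the index:
print(df)
           A  Matching_index
0     [1, 2]             NaN
1  [4, 5, 6]             1.0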
I'm not sure if you want all indexes or just a counter.
Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[[1,2],[4,5,6,7,8]]})
val = 5
tol = 0.3
def closest(arr, val, tol):
    idxs = [idx for idx, el in enumerate(arr) if np.abs(el - val) < val * tol]
    result = len(idxs) if len(idxs) != 0 else np.nan
    return result
df['Matching_index'] = df['A'].apply(closest, args=(val,tol,))
df
If you want all the indexes, just return idxs instead of len(idxs).
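A sketch of that variant (closest_indexes is a hypothetical name for this sketch; the logic is unchanged apart from the return value, keeping np.nan when nothing matches):
def closest_indexes(arr, val, tol):
    # collect the positions of all elements within the tolerance band
    idxs = [idx for idx, el in enumerate(arr) if np.abs(el - val) < val * tol]
    return idxs if idxs else np.nan

df['Matching_index'] = df['A'].apply(closest_indexes, args=(val, tol))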
I am currently trying to shuffle an array and am running into some problems.
What I have:
my_array=array([nan, 1, 1, nan, nan, 2, nan, ..., nan, nan, nan])
What I want to do:
I want to shuffle the dataset while keeping the numbers (e.g. the 1,1 in the array) together.
What I did first was convert every nan into a unique negative number.
my_array=array([-1, 1, 1, -2, -3, 2, -4, ..., -2158, -2159, -2160])
Afterward I split everything up with pandas:
df = pd.DataFrame(my_array)
df.rename(columns={0: 'sampleID'}, inplace=True)
groups = [df.iloc[:, 0] for _, df in df.groupby('sampleID')]
If I now shuffle my dataset, every group has an equal probability of appearing at a given place, but this neglects the number of elements in each group. If I have a group of several elements like [9,9,9,9,9,9], it should have a higher chance of appearing earlier than some random nan. Correct me on this one if I'm wrong.
One way to get around this problem is numpy's choice method.
For this I have to create a probability array:
probability_array = np.zeros(len(groups))
for index, item in enumerate(groups):
    # weight each group by its share of the total element count
    # so the probabilities sum to 1, as rng.choice requires
    probability_array[index] = len(item) / len(my_array)
All of this to finally call:
groups=np.array(groups,dtype=object)
rng = np.random.default_rng()
shuffled_indices = rng.choice(len(groups), len(groups), replace=False, p=probability_array)
shuffled_array = np.concatenate(groups[shuffled_indices]).ravel()
shuffled_array[shuffled_array < 1] = np.NaN
All of this is quite cumbersome and not very fast. Besides the fact that you can certainly code it better, I feel like I am missing some very simple solution to my problem.
Can somebody point me in the right direction?
One approach:
import numpy as np
from itertools import groupby
# toy data
my_array = np.array([np.nan, 1, 1, np.nan, np.nan, 2, 2, 2, np.nan, 3, 3, 3, np.nan, 4, 4, np.nan, np.nan])
# find groups
groups = np.array([[key, sum(1 for _ in group)] for key, group in groupby(my_array)])
# permute
keys, repetitions = zip(*np.random.permutation(groups))
# recreate new array
res = np.repeat(keys, np.array(repetitions, dtype=int))  # cast run lengths back to int
print(res)
Output (single run)
[ 3. 3. 3. nan nan nan nan 2. 2. 2. 1. 1. nan nan nan 4. 4.]
Note that nan != nan, so groupby puts each nan into its own length-1 group; the nan separators are therefore reshuffled individually rather than as runs.
I have solved your problem under some restrictions:
Instead of NaN, I have used zeros as separators.
I assumed that your array ALWAYS starts with a sequence of non-zero integers and ends with another sequence of non-zero integers.
With these provisions, I have essentially shuffled a representation of the sequences of integers, and later stitched everything back into place.
In [102]: import numpy as np
     ...: from itertools import groupby
     ...: a = np.array([int(_) for _ in '1110022220003044440005500000600777'])
     ...: print(a)
     ...: n, z = [], []
     ...: for i, g in groupby(a):
     ...:     if i:
     ...:         n.append((i, sum(1 for _ in g)))
     ...:     else:
     ...:         z.append(sum(1 for _ in g))
     ...: np.random.shuffle(n)
     ...: nn = n[0]
     ...: b = [*[nn[0]]*nn[1]]
     ...: for zz, nn in zip(z, n[1:]):
     ...:     b += [*[0]*zz, *[nn[0]]*nn[1]]
     ...: print(np.array(b))
[1 1 1 0 0 2 2 2 2 0 0 0 3 0 4 4 4 4 0 0 0 5 5 0 0 0 0 0 6 0 0 7 7 7]
[7 7 7 0 0 1 1 1 0 0 0 4 4 4 4 0 6 0 0 0 5 5 0 0 0 0 0 2 2 2 2 0 0 3]
Note
The lengths of the runs of separators in the shuffled array are exactly the same as in the original array, but shuffling the separators as well is easy. A more difficult problem would be to change the lengths arbitrarily while keeping the total array length unchanged.
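A minimal sketch of that easier extension, reusing the n (integer runs) and z (separator run lengths) lists built in the session above: shuffling z as well randomizes where the longer and shorter gaps fall.
np.random.shuffle(n)
np.random.shuffle(z)  # also shuffle the separator run lengths
nn = n[0]
b = [*[nn[0]] * nn[1]]
for zz, nn in zip(z, n[1:]):
    b += [*[0] * zz, *[nn[0]] * nn[1]]
print(np.array(b))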
I have 2 questions:
I have a dataset that contains some duplicate IDs, but some of them have different actions so they can't be removed. I want, for each ID, to do some math and store the final value to work with later. I already have the duplicate indices, but this code doesn't work properly and gives NaN.
How can I write this nested loop using pandas? It takes too much time to run as-is. I've already tried iterrows(), but it didn't work.
l_list = []
for i in range(len(idx)):
    for j in range(len(idx[i])):
        if df.at[j, 'action'] == 0:
            a = df.rank[idx[i]] * 50
            b = df.study_list[idx[i]].str.strip('[]').str.split(',').str.len()
            l_list.append(a + b)
Based on my understanding of what you've provided, see if this works:
In [15]: df
Out[15]:
ID rank action study_list
0 aaa 24 0 [a, b]
1 bbb 6 1 [1, 2, 3]
2 aaa 14 0 [1, 2, 3, 4]
In [16]: def do_thing(row):
    ...:     if row['ID'] == 'aaa' and row['action'] == 0:
    ...:         return row['rank'] * 50 + len(row['study_list'])
    ...:     else:
    ...:         return 100 * row['rank']
    ...:
In [17]: df['new_value'] = df.apply(do_thing, axis=1)
In [18]: df
Out[18]:
ID rank action study_list new_value
0 aaa 24 0 [a, b] 1202
1 bbb 6 1 [1, 2, 3] 600
2 aaa 14 0 [1, 2, 3, 4] 704
NOTE:
I have made many simplifications as your post doesn't enable a reproducible case. Read this thread to see how to best ask questions about Pandas.
I also can't guarantee speed as you have not provided the details regarding the size of the dataset.
Suppose we have a toy example like below.
np.random.seed(seed=1)
df = pd.DataFrame(np.random.randint(low=0,
high=2,
size=(5, 2)))
df
0 1
0 1 1
1 0 0
2 1 1
3 1 1
4 1 0
We want to return the indices of all rows like a certain row. Suppose I want the indices of all rows like row 0, which has a 1 in both column 0 and column 1.
I would want a data structure that has: (0, 2, 3).
I think you can do it like this:
df.index[df.eq(df.iloc[0]).all(1)].tolist()
[0, 2, 3]
One way may be to use a lambda:
df.index[df.apply(lambda row: all(row == df.iloc[0]), axis=1)].tolist()
Another way may be to use a mask:
df.index[df[df == df.iloc[0].values].notnull().all(axis=1)].tolist()
Result:
[0, 2, 3]
Basically if a column of my pandas dataframe looks like this:
[1 1 1 2 2 2 3 3 3 1 1]
I'd like it to be turned into the following:
[1 2 3 1]
You can write a simple function that loops through the elements of your series only storing the first element in a run.
As far as I know, there is no tool built in to pandas to do this. But it is not a lot of code to do it yourself.
import pandas

example_series = pandas.Series([1, 1, 1, 2, 2, 3])

def collapse(series):
    last = ""
    seen = []
    for element in series:
        if element != last:
            last = element
            seen.append(element)
    return seen

collapse(example_series)
In the code above, you will iterate through each element of a series and check if it is the same as the last element seen. If it is not, store it. If it is, ignore the value.
If you need to handle the return value as a series you can change the last line of the function to:
return pandas.Series(seen)
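Applied to the column from the question, the function gives the expected result (output shown as a comment):
collapse(pandas.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1]))  # [1, 2, 3, 1]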
You could write a function that does the following:
x = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1])
y = x - x.shift(1)  # nonzero (or NaN) where the value changes
y[0] = 1            # keep the first element
result = x[y != 0]
You can use DataFrame's diff and indexing:
>>> df = pd.DataFrame([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df[0].diff()!=0]
0
0 1
2 2
6 3
10 1
>>> df[df[0].diff()!=0].values.ravel() # If you need an array
array([1, 2, 3, 1])
Same works for Series:
>>> df = pd.Series([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df.diff()!=0].values
array([1, 2, 3, 1])
You can use shift to create a boolean mask to compare the row against the previous row:
In [67]:
s = pd.Series([1,1,2,2,2,2,3,3,3,3,4,4,5])
s[s!=s.shift()]
Out[67]:
0 1
2 2
6 3
10 4
12 5
dtype: int64
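And if, as in the question, you need a plain list rather than a Series, .tolist() finishes the job:
s[s != s.shift()].tolist()  # [1, 2, 3, 4, 5]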