I have 2 questions:
I have a dataset that contains some duplicate IDs, but some of them have different actions so they can't be removed. I want for each ID to do some math and store the final value to work with later. I already have duplicate indices, but in this code, it doesn't work properly and gives NaN.
How can I write nested loop using pandas? Cause it takes too much time to run. I've already used iterrows(), but didn't work.
l_list = []
for i in range(len(idx)):
for j in range(len(idx[i])):
if df.at[j,'action'] == 0:
a = df.rank[idx[i]]*50
b = df.study_list[idx[i]].str.strip('[]').str.split(',').str.len()
l_list.append(a + b)
Based on my understanding of what you've provided, see if this works:
In [15]: df
Out[15]:
ID rank action study_list
0 aaa 24 0 [a, b]
1 bbb 6 1 [1, 2, 3]
2 aaa 14 0 [1, 2, 3, 4]
In [16]: def do_thing(row):
...: if row['ID'] == 'aaa' and row['action'] == 0:
...: return row['rank'] * 50 + len(row['study_list'])
...: else:
...: return 100 * row['rank']
...:
In [17]: df['new_value'] = df.apply(do_thing, axis=1)
In [18]: df
Out[18]:
ID rank action study_list new_value
0 aaa 24 0 [a, b] 1202
1 bbb 6 1 [1, 2, 3] 600
2 aaa 14 0 [1, 2, 3, 4] 704
NOTE:
I have made many simplifications as your post doesn't enable a reproducible case. Read this thread to see how to best ask questions about Pandas.
I also can't guarantee speed as you have not provided the details regarding the size of the dataset.
i dont know what does the variable idx or anything. i think your code is wrong,
you have to try this code
l_list = []
for i in range(len(idx)):
for j in range(len(idx[i])):
if df.at[j,'action'] == 0:
a = df.rank[idx[i]]*50
b = df.study_list[idx[i]].str.strip('[]').str.split(',').str.len()
l_list.append(a + b)
Related
I have defined a function to create a dataframe, but I get two lists in each column, how could I get each element of the list as a separate row in the dataframe as shown below.
a = [1, 2, 3, 4]
def function():
result = []
for i in range(0, len(a)):
number = [i for i in a]
operation = [8*i for i in a]
result.append({'number': number, 'operation': operation})
df = pd.DataFrame(result, columns=['number','operation'])
return df
function()
Result:
number operation
0 [1, 2, 3, 4] [8, 16, 24, 32]
What I really want to:
number operation
0 1 8
1 2 16
2 3 24
3 4 34
Can anyone help me please? :)
Your problems are twofold, firstly you are pushing the entire list of values (instead of the "current" value) into the result array on each pass through your for loop, and secondly you are overwriting the dataframe each time as well. It would be simpler to use a list comprehension to generate the values for the dataframe:
import pandas as pd
a = [1, 2, 3, 4]
def function():
result = [{'number' : i, 'operation' : 8*i} for i in a]
df = pd.DataFrame(result)
return df
print(function())
Output:
number operation
0 1 8
1 2 16
2 3 24
3 4 32
import numpy as np
a = [1, 2, 3, 4]
def function():
for i in range(0, len(a)):
number = [i for i in a]
operation = [8*i for i in a]
v=np.rot90(np.array((number,operation)))
result=np.flipud(v)
df = pd.DataFrame(result, columns=['number','operation'])
return df
print (function())
number operation
0 1 8
1 2 16
2 3 24
3 4 32
You are almost there. Just replace number = [i for i in a] with number = a[i] and operation = [8*i for i in a] with operation = 8 * a[i]
(FYI: No need to create pandas dataframe inside loop. You can get same output with pandas dataframe creation outside loop)
Refer to the below code:
a = [1, 2, 3, 4]
def function():
result = []
for i in range(0, len(a)):
number = a[i]
operation = 8*a[i]
result.append({'number': number, 'operation': operation})
df = pd.DataFrame(res, columns=['number','operation'])
return df
function()
number operation
0 1 8
1 2 16
2 3 24
3 4 32
I have dataframe:
A B C D
1 0 0 2
0 1 0 0
0 0 0 0
I need to select all values which are greater then 0 and put them in a list.
if row doesnt contain any positive value 0 should be written to list.
So, the output for given dataframe should look like this:
[1,2,1,0]
How this can be resolved?
Here is a simple loop you could use (looping through df.values gives us rows as arrays):
output = []
for ar in df.values:
nonzeros = ar[ar > 0]
# If nonzeros is not empty proceed and extend the output
if nonzeros.size:
output.extend(nonzeros)
# If not add 0
else:
output.append(0)
print(output)
returns:
[1, 2, 1, 0]
We can make extensive use of pandas + numpy here:
Mask all values which are greater than 0
m = df.gt(0)
A B C D
0 True False False True
1 False True False False
2 False False False False
Mask rows which dont contain any values above 0:
s1 = m.any(axis=1).astype(int).values
Get all the values greater than 0 in an array:
s2 = df.values[m]
Finally concat both arrays with each other:
np.concatenate([s2, s1[s1==0]]).tolist()
Output
[1, 2, 1, 0]
In your case , first stack with your df, then we apply your condition , if the row contain the none 0 we select , if all 0 , then we keep it as zero
df.stack().groupby(level=0).apply(lambda x : x.head(1) if all(x==0) else x[x!=0]).tolist()
[1, 2, 1, 0]
Or without apply
np.concatenate(df.mask(df==0).stack().groupby(level=0).apply(list).reindex(df.index,fill_value=[0]).values)
array([1., 2., 1., 0.])
Shorten the process
np.concatenate(list(map(lambda x : [x[0]] if all(x==0) else x[x!=0],df.values)))
array([1, 2, 1, 0])
You could apply a custom function which will process each row of the DataFrame and return a list. Then to sum returned lists.
In [1]: import pandas as pd
In [2]: df = pd.read_clipboard()
In [3]: df
Out[3]:
A B C D
0 1 0 0 2
1 0 1 0 0
2 0 0 0 0
In [4]: def get_positive_values(row):
...: # If all elements in a row are zeros
...: # then return a list with a single zero
...: if row.eq(0).all():
...: return [0]
...: # Else return a list with positive values only.
...: return row[row.gt(0)].tolist()
...:
...:
In [5]: df.apply(get_positive_values, axis=1).sum()
Out[5]: [1, 2, 1, 0]
I have a time-series A holding several values. I need to obtain a series B that is defined algebraically as follows:
B[t] = a * A[t] + b * B[t-1]
where we can assume B[0] = 0, and a and b are real numbers.
Is there any way to do this type of recursive computation in Pandas? Or do I have no choice but to loop in Python as suggested in this answer?
As an example of input:
> A = pd.Series(np.random.randn(10,))
0 -0.310354
1 -0.739515
2 -0.065390
3 0.214966
4 -0.605490
5 1.293448
6 -3.068725
7 -0.208818
8 0.930881
9 1.669210
As I noted in a comment, you can use scipy.signal.lfilter. In this case (assuming A is a one-dimensional numpy array), all you need is:
B = lfilter([a], [1.0, -b], A)
Here's a complete script:
import numpy as np
from scipy.signal import lfilter
np.random.seed(123)
A = np.random.randn(10)
a = 2.0
b = 3.0
# Compute the recursion using lfilter.
# [a] and [1, -b] are the coefficients of the numerator and
# denominator, resp., of the filter's transfer function.
B = lfilter([a], [1, -b], A)
print B
# Compare to a simple loop.
B2 = np.empty(len(A))
for k in range(0, len(B2)):
if k == 0:
B2[k] = a*A[k]
else:
B2[k] = a*A[k] + b*B2[k-1]
print B2
print "max difference:", np.max(np.abs(B2 - B))
The output of the script is:
[ -2.17126121e+00 -4.51909273e+00 -1.29913212e+01 -4.19865530e+01
-1.27116859e+02 -3.78047705e+02 -1.13899647e+03 -3.41784725e+03
-1.02510099e+04 -3.07547631e+04]
[ -2.17126121e+00 -4.51909273e+00 -1.29913212e+01 -4.19865530e+01
-1.27116859e+02 -3.78047705e+02 -1.13899647e+03 -3.41784725e+03
-1.02510099e+04 -3.07547631e+04]
max difference: 0.0
Another example, in IPython, using a pandas DataFrame instead of a numpy array:
If you have
In [12]: df = pd.DataFrame([1, 7, 9, 5], columns=['A'])
In [13]: df
Out[13]:
A
0 1
1 7
2 9
3 5
and you want to create a new column, B, such that B[k] = A[k] + 2*B[k-1] (with B[k] == 0 for k < 0), you can write
In [14]: df['B'] = lfilter([1], [1, -2], df['A'].astype(float))
In [15]: df
Out[15]:
A B
0 1 1
1 7 9
2 9 27
3 5 59
I need to apply a function to a subset of columns in a dataframe. consider the following toy example:
pdf = pd.DataFrame({'a' : [1, 2, 3], 'b' : [2, 3, 4], 'c' : [5, 6, 7]})
arb_cols = ['a', 'b']
what I want to do is this:
[df[c] = df[c].apply(lambda x : 99 if x == 2 else x) for c in arb_cols]
But this is bad syntax. Is it possible to accomplish such a task without a for loop?
With mask
pdf.mask(pdf.loc[:,arb_cols]==2,99).assign(c=pdf.c)
Out[1190]:
a b c
0 1 99 5
1 99 3 6
2 3 4 7
Or with assign
pdf.assign(**pdf.loc[:,arb_cols].mask(pdf.loc[:,arb_cols]==2,99))
Out[1193]:
a b c
0 1 99 5
1 99 3 6
2 3 4 7
Do not use pd.Series.apply when you can use vectorised functions.
For example, the below should be efficient for larger dataframes even though there is an outer loop:
for col in arb_cols:
pdf.loc[pdf[col] == 2, col] = 99
Another option it to use pd.DataFrame.replace:
pdf[arb_cols] = pdf[arb_cols].replace(2, 99)
Yet another option is to use numpy.where:
import numpy as np
pdf[arb_cols] = np.where(pdf[arb_cols] == 2, 99, pdf[arb_cols])
For this case it would probably be better to use applymap if you need to apply a custom function
pdf[arb_cols] = pdf[arb_cols].applymap(lambda x : 99 if x == 2 else x)
There's an operation that is a little counter intuitive when using pandas apply() method. It took me a couple of hours of reading to solve, so here it is.
So here is what I was trying to accomplish.
I have a pandas dataframe like so:
test = pd.DataFrame({'one': [[2],['test']], 'two': [[5],[10]]})
one two
0 [2] [5]
1 [test] [10]
and I want to add the columns per row to create a resulting list of length = to the DataFrame's original length like so:
def combine(row):
result = row['one'] + row['two']
return(result)
When running it through the dataframe using the apply() method:
test.apply(lambda x: combine(x), axis=1)
one two
0 2 5
1 test 10
Which isn't quite what we wanted. What we want is:
result
0 [2, 5]
1 [test, 10]
EDIT
I know there are simpler solutions to this example. But this is an abstraction from a much more complex operation.Here's an example of a more complex one:
df_one:
org_id date status id
0 2 2015/02/01 True 3
1 10 2015/05/01 True 27
2 10 2015/06/01 True 18
3 10 2015/04/01 False 27
4 10 2015/03/01 True 40
df_two:
org_id date
0 12 2015/04/01
1 10 2015/02/01
2 2 2015/08/01
3 10 2015/08/01
Here's a more complex operation:
def operation(row, df_one):
sel = (df_one.date < pd.Timestamp(row['date'])) & \
(df_one['org_id'] == row['org_id'])
last_changes = df_one[sel].groupby(['org_id', 'id']).last()
id_list = last_changes[last_changes.status].reset_index().id.tolist()
return (id_list)
then finally run:
df_one.sort_values('date', inplace=True)
df_two['id_list'] = df_two.apply(
operation,
axis=1,
args=(df_one,)
)
This would be impossible with simpler solutions. Hence my proposed one below would be to re write operation to:
def operation(row, df_one):
sel = (df_one.date < pd.Timestamp(row['date'])) & \
(df_one['org_id'] == row['org_id'])
last_changes = df_one[sel].groupby(['org_id', 'id']).last()
id_list = last_changes[last_changes.status].reset_index().id.tolist()
return pd.Series({'id_list': id_list})
We'd expect the following result:
id_list
0 []
1 []
2 [3]
3 [27,18,40]
IIUC we can simply sum two columns:
In [93]: test.sum(axis=1).to_frame('result')
Out[93]:
result
0 [2, 5]
1 [test, 10]
because when we sum lists:
In [94]: [2] + [5]
Out[94]: [2, 5]
they are getting concatenated...
So the answer to this problem lies in how pandas.apply() method works.
When defining
def combine(row):
result = row['one'] + row['two']
return(result)
the function will be returning a list for each row that gets passed in. This is a problem if we use the function with the .apply() method because it will interpret the resulting lists as a Series where each element is a column of that same row.
To solve this we need to create a Series where we specify a new column name like so:
def combine(row):
result = row['one'] + row['two']
return pd.Series({'result': result})
And if we run this again:
test.apply(lambda x: combine(x), axis=1)
result
0 [2, 5]
1 [test, 10]
We'll get what we originally wanted! Again, this is because we are forcing pandas to interpret the entire result as a column.