I have a dataframe that looks like this:
data = [[1, 10, 100], [1.5, 15, 25], [7, 14, 70], [33, 44, 55]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
and is displayed like this:
      A   B    C
0   1.0  10  100
1   1.5  15   25
2   7.0  14   70
3  33.0  44   55
I have other data that is a random subset of rows from the dataframe, something like this:
set_of_rows = [[1,10,100], [33,44,55]]
I want to get the indices indicating the location of each row in set_of_rows inside df. So I need a function that does something like this:
indices = func(subset=set_of_rows, dataframe=df)
In [1]: indices
Out[1]: [0, 3]
What function can do this? Thanks!
Try the following:
[i for i in df.index if df.loc[i].to_list() in set_of_rows]
#[0, 3]
If you want it as a function:
def func(set_of_rows, df):
    return [i for i in df.index if df.loc[i].to_list() in set_of_rows]
You can check this thread out:
Python Pandas: Get index of rows which column matches certain value
As far as I know, there is no built-in pandas function for your task, so iteration is the way to go. If you are concerned about handling edge cases, you can add conditions inside the loop that take care of them.
def func(set_of_rows, df):
    indices = []
    for i in df.index:
        if df.loc[i].to_list() in set_of_rows:
            indices.append(i)
    return indices
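If the loop becomes a bottleneck on larger frames, a merge-based sketch (my own variant, not from the answers above, and assuming the subset rows match df's values exactly) avoids iterating row by row:
import pandas as pd

data = [[1, 10, 100], [1.5, 15, 25], [7, 14, 70], [33, 44, 55]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
set_of_rows = [[1, 10, 100], [33, 44, 55]]

# Keep df's index as a regular column, then inner-merge on all data
# columns; only the rows present in the subset survive the merge.
subset_df = pd.DataFrame(set_of_rows, columns=df.columns)
indices = df.reset_index().merge(subset_df, on=list(df.columns))['index'].tolist()
print(indices)  # [0, 3]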
Related
I am learning pandas and I am moving my Python code over to pandas. I want to compare every value with the following values using a function. So the first with the second, etc.; the second with the third, but not with the first, because I already did that comparison. In Python I use two nested loops over a list:
def match_values(a, b):
    ...  # do some stuff

l = ['a', 'b', 'c']
length = len(l)
for i in range(length):
    for j in range(i + 1, length):  # starts after i, not from the start!
        if match_values(l[i], l[j]):
            ...  # do some stuff
How do I use a similar technique in pandas when my list is a column in a dataframe? Do I simply reference every value like before, or is there a clever "vector-style" way to do this fast and efficiently?
Thanks in advance,
Jo
Can you please check this? For each row, it outputs a list comparing the row's value with every later value.
>>> import pandas as pd
>>> import numpy as np
>>> val = [16,19,15,19,15]
>>> df = pd.DataFrame({'val': val})
>>> df
val
0 16
1 19
2 15
3 19
4 15
>>> df['match'] = df.apply(lambda x: [ (1 if (x['val'] == df.loc[idx, 'val']) else 0) for idx in range(x.name+1, len(df)) ], axis=1)
>>> df
val match
0 16 [0, 0, 0, 0]
1 19 [0, 1, 0]
2 15 [0, 1]
3 19 [0]
4 15 []
Yes, you can use vectorized comparisons, since pandas is built on NumPy:
df['columnname'] > 5
This results in a Boolean array. If you also want to return the actual matching part of the dataframe:
df[df['columnname'] > 5]
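To address the "compare each value only with later ones" part directly, here is a sketch (my addition, not part of the original answer) using NumPy broadcasting; np.equal.outer compares every value against every other value in one vectorized step:
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': [16, 19, 15, 19, 15]})
vals = df['val'].to_numpy()

# Full pairwise comparison matrix; the strict upper triangle (k=1)
# keeps only pairs (i, j) with j > i, i.e. each value vs later values.
pairwise = np.equal.outer(vals, vals)
upper = np.triu(pairwise, k=1)
print(np.argwhere(upper))  # [[1 3] [2 4]] -> positions holding equal values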
The title may come across as confusing (honestly, not quite sure how to summarize it in a sentence), so here is a much better explanation:
I'm currently handling a DataFrame A containing different attributes, and I used a .groupby().count() on the age column to create a list of occurrences:
A_sub = A.groupby(['age'])['age'].count()
A_sub returns a Series similar to the following (the values are randomly modified):
age
1 316
2 249
3 221
4 219
5 262
...
59 1
61 2
65 1
70 1
80 1
Name: age, dtype: int64
I would like to plot a list of values from an element-wise division: each element's value divided by the sum of all the elements whose index is greater than or equal to that element's index. For example, for an age of 3, it should return
221/(221+219+262+...+1+2+1+1+1)
The same calculation applies to all the elements. Ideally, the outcome should be of a similar type/format so that it can be plotted.
Here is a quick example using numpy. A similar approach can be used with pandas. The for loop can most likely be replaced by something smarter and more efficient to compute the coefficients.
import numpy as np
ages = np.asarray([316, 249, 221, 219, 262])
coefficients = np.zeros(ages.shape)
for k, a in enumerate(ages):
    coefficients[k] = sum(ages[k:])
output = ages / coefficients
Output:
array([0.24940805, 0.26182965, 0.31481481, 0.45530146, 1. ])
EDIT: The coefficients initialization at 0 and the for loop can be replaced with:
coefficients = np.flip(np.cumsum(np.flip(ages)))
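For completeness, a quick self-contained check (using the sample counts from above) that the one-liner reproduces the loop's result:
import numpy as np

ages = np.asarray([316, 249, 221, 219, 262])
# Reverse, accumulate, reverse back: element k ends up holding sum(ages[k:]).
coefficients = np.flip(np.cumsum(np.flip(ages)))
print(ages / coefficients)
# [0.24940805 0.26182965 0.31481481 0.45530146 1.        ]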
You can use the function cumsum() in pandas to get accumulated sums:
A_sub = A['age'].value_counts().sort_index(ascending=False)
(A_sub / A_sub.cumsum()).iloc[::-1]
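Here is a self-contained sketch of the same idea, with a hand-built Series standing in for the output of value_counts() (my assumption, using the first five counts from the question):
import pandas as pd

A_sub = pd.Series([316, 249, 221, 219, 262], index=[1, 2, 3, 4, 5], name='age')

# Sorting by age descending makes cumsum() accumulate the counts of all
# ages >= the current one, which is exactly the divisor we need.
rev = A_sub.sort_index(ascending=False)
print((rev / rev.cumsum()).iloc[::-1])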
No reason to use numpy, pandas already includes everything we need.
A_sub seems to be a Series where age is the index. That's not ideal, but it should be fine. The code below therefore operates on a Series, but can easily be modified to work on DataFrames.
import numpy as np
import pandas as pd
s = pd.Series(data=np.random.randint(low=1, high=10, size=10), index=[0, 1, 3, 4, 5, 8, 9, 10, 11, 13], name="age")
print(s)
res = s / s[::-1].cumsum()[::-1]
res = res.rename("cumsum div")
I saw your comment about missing ages in the index. Here is how you would add the missing indexes in the range from min to max index, and then perform the division.
import numpy as np
import pandas as pd
s = pd.Series(data=np.random.randint(low=1, high=10, size=10), index=[0, 1, 3, 4, 5, 8, 9, 10, 11, 13], name="age")
s_all_idx = s.reindex(index=range(s.index.min(), s.index.max() + 1), fill_value=0)
print(s_all_idx)
res = s_all_idx / s_all_idx[::-1].cumsum()[::-1]
res = res.rename("all idx cumsum div")
Usually, when I work with a list, I simplify the timestamps by taking each value's difference from the first one, like this:
x = [1552154111, 1552154115, 1552154117, 1552154120, 1552154125,
     1552154127, 1552154134, 1552154137]
List_time = []
for i in x:
    List_time.append((i + 1) - x[0])
print(List_time)
[1, 5, 7, 10, 15, 17, 24, 27]
I need to have the same result by using the dataframe, which looks like this:
print(df['Timestamp'])
0 1552154111
1 1552154115
2 1552154117
3 1552154120
4 1552154125
5 1552154127
6 1552154134
7 1552154137
I need to replace the current Timestamp column with the expected differences, but I don't know how to do that. It is the first time I have used a DataFrame.
How could I do that, please?
A potential solution that does not involve a df.apply(lambda) loop:
df['Timestamp'] = df['Timestamp'] - df['Timestamp'].iloc[0] + 1
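For reference, a runnable sketch with the question's data, showing that the one-liner reproduces the list result:
import pandas as pd

df = pd.DataFrame({'Timestamp': [1552154111, 1552154115, 1552154117,
                                 1552154120, 1552154125, 1552154127,
                                 1552154134, 1552154137]})
# Vectorized: subtract the first timestamp from the whole column at once.
df['Timestamp'] = df['Timestamp'] - df['Timestamp'].iloc[0] + 1
print(df['Timestamp'].tolist())  # [1, 5, 7, 10, 15, 17, 24, 27]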
You can achieve it with:
first_value = df['Timestamp'].iloc[0]
df['Timestamp'] = df['Timestamp'].apply(lambda x: x + 1 - first_value)
first_value represents x[0] from your list example.
Usually, you can achieve element-wise operations on pandas Series with apply.
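As a small side-by-side sketch (my addition, with made-up values): apply and plain Series arithmetic give the same result here, but the vectorized form is usually much faster for numeric columns:
import pandas as pd

s = pd.Series([1552154111, 1552154115, 1552154117])
first = s.iloc[0]
# Element-wise via apply versus vectorized Series arithmetic.
via_apply = s.apply(lambda x: x + 1 - first)
vectorized = s + 1 - first
print(via_apply.equals(vectorized))  # True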
I'm using python pandas to organize some measurements values in a DataFrame.
One of the columns holds values that I want to convert into 2D vectors, so let's say the column contains values like these:
col1
25
12
14
21
I want to have the values of this column changed one by one (in a for loop):
for value in values:
    df['col1'][value] = convert2Vector(df['col1'][value])
So that the column col1 becomes:
col1
[-1. 21.]
[-1. -2.]
[-15. 54.]
[11. 2.]
The values are only examples and the function convert2Vector() converts the angle to a 2D-vector.
With the for-loop that I wrote, it doesn't work; I get the error:
ValueError: setting an array element with a sequence.
Which I can understand.
So the question is: How to do it?
That exception comes from the fact that you are trying to insert a list or array into a column (array) that stores ints. Arrays in pandas and NumPy can't have a "ragged shape", so you can't have 2 elements in one row and 1 element in all the others (except maybe with masking).
To make it work you need to store "general" objects. For example:
import pandas as pd
df = pd.DataFrame({'col1' : [25, 12, 14, 21]})
df.col1[0] = [1, 2]
# ValueError: setting an array element with a sequence.
But this works:
>>> df.col1 = df.col1.astype(object)
>>> df.col1[0] = [1, 2]
>>> df
col1
0 [1, 2]
1 12
2 14
3 21
Note: I wouldn't recommend doing that, because object columns are much slower than specifically typed columns. But since you're iterating over the column with a for loop, it seems you don't need the performance, so you can also use an object array.
What you should be doing, if you want it fast, is to vectorize the convert2Vector function and assign the result to two columns:
import pandas as pd
import numpy as np
def convert2Vector(angle):
    """I don't know what your function does, so this is just something
    that calculates the sin and cos of the input..."""
    ret = np.zeros((angle.size, 2), dtype=float)
    ret[:, 0] = np.sin(angle)
    ret[:, 1] = np.cos(angle)
    return ret
>>> df = pd.DataFrame({'col1' : [25, 12, 14, 21]})
>>> df['col2'] = [0]*len(df)
>>> df[['col1', 'col2']] = convert2Vector(df.col1)
>>> df
col1 col2
0 -0.132352 0.991203
1 -0.536573 0.843854
2 0.990607 0.136737
3 0.836656 -0.547729
You should call a method like df.apply or df.transform, which creates a new column that you then assign back:
In [1022]: df.col1.apply(lambda x: [x, x // 2])
Out[1022]:
0 [25, 12]
1 [12, 6]
2 [14, 7]
3 [21, 10]
Name: col1, dtype: object
In your case, you would do:
df['col1'] = df.col1.apply(convert2Vector)
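Putting it together with a stand-in for convert2Vector (the real function's body is unknown, so the sin/cos pair below is only an assumption):
import numpy as np
import pandas as pd

def convert2Vector(angle):
    # Hypothetical scalar version: one angle in, a 2-element vector out.
    return [np.sin(angle), np.cos(angle)]

df = pd.DataFrame({'col1': [25, 12, 14, 21]})
# Each returned list becomes one cell, so col1 ends up with object dtype.
df['col1'] = df.col1.apply(convert2Vector)
print(df)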
There's an operation that is a little counterintuitive when using the pandas apply() method. It took me a couple of hours of reading to solve, so here it is.
So here is what I was trying to accomplish.
I have a pandas dataframe like so:
test = pd.DataFrame({'one': [[2],['test']], 'two': [[5],[10]]})
one two
0 [2] [5]
1 [test] [10]
and I want to add the columns per row to create a resulting column of lists, equal in length to the DataFrame's original length, like so:
def combine(row):
    result = row['one'] + row['two']
    return result
When running it through the dataframe using the apply() method:
test.apply(lambda x: combine(x), axis=1)
one two
0 2 5
1 test 10
Which isn't quite what we wanted. What we want is:
result
0 [2, 5]
1 [test, 10]
EDIT
I know there are simpler solutions to this example. But this is an abstraction of a much more complex operation. Here's an example of a more complex one:
df_one:
org_id date status id
0 2 2015/02/01 True 3
1 10 2015/05/01 True 27
2 10 2015/06/01 True 18
3 10 2015/04/01 False 27
4 10 2015/03/01 True 40
df_two:
org_id date
0 12 2015/04/01
1 10 2015/02/01
2 2 2015/08/01
3 10 2015/08/01
Here's a more complex operation:
def operation(row, df_one):
    sel = (df_one.date < pd.Timestamp(row['date'])) & \
          (df_one['org_id'] == row['org_id'])
    last_changes = df_one[sel].groupby(['org_id', 'id']).last()
    id_list = last_changes[last_changes.status].reset_index().id.tolist()
    return id_list
then finally run:
df_one.sort_values('date', inplace=True)
df_two['id_list'] = df_two.apply(
    operation,
    axis=1,
    args=(df_one,)
)
This would be impossible with simpler solutions. Hence my proposal below is to rewrite operation as:
def operation(row, df_one):
    sel = (df_one.date < pd.Timestamp(row['date'])) & \
          (df_one['org_id'] == row['org_id'])
    last_changes = df_one[sel].groupby(['org_id', 'id']).last()
    id_list = last_changes[last_changes.status].reset_index().id.tolist()
    return pd.Series({'id_list': id_list})
We'd expect the following result:
id_list
0 []
1 []
2 [3]
3 [27, 18, 40]
IIUC we can simply sum two columns:
In [93]: test.sum(axis=1).to_frame('result')
Out[93]:
result
0 [2, 5]
1 [test, 10]
because when we sum lists:
In [94]: [2] + [5]
Out[94]: [2, 5]
they are getting concatenated...
So the answer to this problem lies in how the pandas apply() method works.
When defining
def combine(row):
    result = row['one'] + row['two']
    return result
the function will return a list for each row that gets passed in. This is a problem when used with the .apply() method, because apply will interpret each resulting list as a Series whose elements correspond to the columns of that same row.
To solve this we need to create a Series where we specify a new column name like so:
def combine(row):
    result = row['one'] + row['two']
    return pd.Series({'result': result})
And if we run this again:
test.apply(lambda x: combine(x), axis=1)
result
0 [2, 5]
1 [test, 10]
We'll get what we originally wanted! Again, this is because we are forcing pandas to interpret the entire result as a column.
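As a side note (my addition, assuming pandas 0.23 or newer): apply's result_type parameter can force the same "keep the list as one cell" behavior without wrapping the result in a Series:
import pandas as pd

test = pd.DataFrame({'one': [[2], ['test']], 'two': [[5], [10]]})

# result_type='reduce' tells pandas to treat each returned list as a
# single cell value instead of expanding it into columns.
result = test.apply(lambda row: row['one'] + row['two'],
                    axis=1, result_type='reduce')
print(result.to_frame('result'))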