Usually, when I use a list, I simplify the timestamps by finding the difference between each value and the first one, like this:
x = [1552154111, 1552154115, 1552154117, 1552154120, 1552154125,
     1552154127, 1552154134, 1552154137]
List_time = []
for i in x:
    List_time.append((i + 1) - x[0])
print(List_time)
[1, 5, 7, 10, 15, 17, 24, 27]
I need to have the same result by using the dataframe, which looks like this:
print(df['Timestamp'])
0 1552154111
1 1552154115
2 1552154117
3 1552154120
4 1552154125
5 1552154127
6 1552154134
7 1552154137
I need to replace the current timestamp column with the expected differences, but I don't know how to do that. This is the first time I have used a dataframe.
How could I do that please?
A potential solution that does not involve a df.apply(lambda) loop:
df['Timestamp'] = df['Timestamp'] - df['Timestamp'].iloc[0] + 1
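For example, reproducing the question's data (a minimal runnable sketch):
import pandas as pd

df = pd.DataFrame({'Timestamp': [1552154111, 1552154115, 1552154117, 1552154120,
                                 1552154125, 1552154127, 1552154134, 1552154137]})
# subtract the first timestamp and add 1, matching the list-based result
df['Timestamp'] = df['Timestamp'] - df['Timestamp'].iloc[0] + 1
print(df['Timestamp'].tolist())  # [1, 5, 7, 10, 15, 17, 24, 27]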
You can achieve it with apply:
first_value = df['Timestamp'].iloc[0]
df['Timestamp'] = df['Timestamp'].apply(lambda t: t + 1 - first_value)
first_value represents x[0] from your list version. In general, you can achieve element-wise operations on a pandas Series with apply.
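With the question's data this gives the same list as before:
>>> df['Timestamp'].tolist()
[1, 5, 7, 10, 15, 17, 24, 27]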
I have a simple question in pandas.
Let's say I have the following data:
d = {'a': [1, 3, 5, 2, 10, 3, 5, 4, 2]}
df = pd.DataFrame(data=d)
df
How do I count the number of rows that lie between the minimum and maximum values in column a? In this particular case, there are 3 rows between the 1 and the 10.
Thanks
You can use this:
diff = np.abs(df['a'].idxmin() - df['a'].idxmax()) - 1
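A quick check against the question's data (a minimal runnable sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 5, 2, 10, 3, 5, 4, 2]})
# the min is at index 0 and the max at index 4, so 3 rows lie strictly between them
diff = np.abs(df['a'].idxmin() - df['a'].idxmax()) - 1
print(diff)  # 3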
IIUC, you could get the index of the min and max, and subtract 2:
out = len(df.loc[df['a'].idxmin():df['a'].idxmax()])-2
output: 3
If there is a chance that the max comes before the min:
out = max(len(df.loc[df['a'].idxmin():df['a'].idxmax()]) - 2, 0)
Alternatively, if the order of the min/max does not matter:
import numpy as np
out = np.ptp(df.reset_index()['a'].agg(['idxmin', 'idxmax'])) - 1
Update: to get the index of the 2nd largest/smallest, note that nlargest(2) keeps the two largest values, so idxmin on that result is the index of the second largest (and symmetrically for nsmallest/idxmax):
# second largest
df['a'].nlargest(2).idxmin()
# 2
# second smallest
df['a'].nsmallest(2).idxmax()
# 3
Since they are numbers, we can drop down into numpy and compute the result:
import numpy as np

arr = df.a.to_numpy()
np.abs(np.argmax(arr) - np.argmin(arr)) - 1
3
A boolean-mask variant (note it returns an Index rather than a scalar, and assumes the min and max values are unique):
np.abs(df.loc[df.a == df.a.min()].index - df.loc[df.a == df.a.max()].index) - 1
I have a dataframe that looks like this:
data = [[1, 10, 100], [1.5, 15, 25], [7, 14, 70], [33, 44, 55]]
df = pd.DataFrame(data, columns = ['A', 'B','C'])
which looks like this:
      A   B    C
0   1.0  10  100
1   1.5  15   25
2   7.0  14   70
3  33.0  44   55
I have other data that is a random subset of rows from the dataframe, something like this:
set_of_rows = [[1,10,100], [33,44,55]]
I want to get the indices indicating the location of each row of set_of_rows inside df. So I need a function that does something like this:
indices = func(subset=set_of_rows, dataframe=df)
print(indices)
# [0, 3]
What function can do this? Thanks!
Try the following:
[i for i in df.index if df.loc[i].to_list() in set_of_rows]
#[0, 3]
If you want it as a function:
def func(set_of_rows, df):
    return [i for i in df.index if df.loc[i].to_list() in set_of_rows]
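For example, with the question's data:
import pandas as pd

df = pd.DataFrame([[1, 10, 100], [1.5, 15, 25], [7, 14, 70], [33, 44, 55]],
                  columns=['A', 'B', 'C'])
set_of_rows = [[1, 10, 100], [33, 44, 55]]
print(func(set_of_rows, df))  # [0, 3]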
You can check this thread out:
Python Pandas: Get index of rows which column matches certain value
As far as I know, there is no built-in pandas function for your task, so iteration is the way to go. If you are concerned about handling errors, you can add conditions to the loop that take care of them.
indices = []
for i in df.index:
    lst = df.loc[i].to_list()
    if lst in set_of_rows:
        indices.append(i)
print(indices)  # [0, 3]
I have a pandas dataframe that is used for a heatmap. I would like the minimal value of each column to lie along the diagonal.
I've sorted the columns using
data = data.loc[:, data.min().sort_values().index]
This works. Now I just need to sort the values such that the index of the min value in the first column is row 0, the min value of the second column is row 1, and so on.
Example
import seaborn as sns
import pandas as pd
data = [[5, 1, 9],
        [7, 8, 6],
        [5, 3, 2]]
data = pd.DataFrame(data)
#sns.heatmap(data)
data = data.loc[:, data.min().sort_values().index]
#sns.heatmap(data) # Gives result in step 1
# Step 1: columns sorted by min value (1, 2, 5)
data = [[1, 9, 5],
        [8, 6, 7],
        [3, 2, 5]]
data = pd.DataFrame(data)
#sns.heatmap(data)
# How do I perform step two, maintaining column order?
# Step 2: rows reordered so the diagonal holds 1, 2, 7
data = [[1, 9, 5],
        [3, 2, 5],
        [8, 6, 7]]
data = pd.DataFrame(data)
sns.heatmap(data)
Is this possible in pandas in a clever way?
Setup
data = pd.DataFrame([[5, 1, 9], [7, 8, 6], [5, 3, 2]])
You can accomplish this by using argsort of the diagonal elements of your sorted DataFrame, then indexing the DataFrame using these values.
Step 1
Use your initial sort:
data = data.loc[:, data.min().sort_values().index]
1 2 0
0 1 9 5
1 8 6 7
2 3 2 5
Step 2
Use np.argsort with np.diag:
data.iloc[np.argsort(np.diag(data))]
1 2 0
0 1 9 5
2 3 2 5
1 8 6 7
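Putting both steps together as a minimal runnable sketch:
import numpy as np
import pandas as pd

data = pd.DataFrame([[5, 1, 9], [7, 8, 6], [5, 3, 2]])
data = data.loc[:, data.min().sort_values().index]  # step 1: order columns by their min
data = data.iloc[np.argsort(np.diag(data))]         # step 2: order rows by the diagonal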
I'm not quite sure, but you've already done the following to sort the columns:
data = data.loc[:, data.min().sort_values().index]
The same trick can be applied to sort the rows:
data = data.loc[data.min(axis=1).sort_values().index, :]
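On the example's step-1 data this happens to give the desired ordering, since sorting rows by their minima matches the diagonal here:
>>> data.loc[data.min(axis=1).sort_values().index, :]
   1  2  0
0  1  9  5
2  3  2  5
1  8  6  7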
To move some values around so that the min value within each column is placed along the diagonal you could try something like this:
for i in range(len(data)):
    # idxmin returns an index label, so this assumes the default RangeIndex
    min_index = data.iloc[:, i].idxmin()
    if data.iloc[i, i] != data.iloc[min_index, i]:
        data.iloc[i, i], data.iloc[min_index, i] = data.iloc[min_index, i], data.iloc[i, i]
Basically just swap the min with the diagonal.
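On the step-1 data this leaves each column's minimum (1, 2, 5) on the diagonal, although the rows themselves are no longer intact:
   1  2  0
0  1  9  5
1  8  2  7
2  3  6  5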
I'm using python pandas to organize some measurement values in a DataFrame.
One of the columns holds values which I want to convert into 2D vectors, so let's say the column contains these values:
col1
25
12
14
21
I want to have the values of this column changed one by one (in a for loop):
for i in df.index:
    df['col1'][i] = convert2Vector(df['col1'][i])
So that the column col1 becomes:
col1
[-1. 21.]
[-1. -2.]
[-15. 54.]
[11. 2.]
The values are only examples and the function convert2Vector() converts the angle to a 2D-vector.
With the for loop that I wrote, it doesn't work. I get the error:
ValueError: setting an array element with a sequence.
Which I can understand.
So the question is: How to do it?
That exception comes from the fact that you want to insert a list or array into a column (an array) that stores ints. Arrays in pandas and NumPy can't have a "ragged" shape, so you can't have 2 elements in one row and 1 element in all the others (except maybe with masking).
To make it work you need to store "general" objects. For example:
import pandas as pd
df = pd.DataFrame({'col1' : [25, 12, 14, 21]})
df.col1[0] = [1, 2]
# ValueError: setting an array element with a sequence.
But this works:
>>> df.col1 = df.col1.astype(object)
>>> df.col1[0] = [1, 2]
>>> df
col1
0 [1, 2]
1 12
2 14
3 21
Note: I wouldn't recommend doing that, because object columns are much slower than specifically typed columns. But since you're iterating over the column with a for loop, it seems you don't need the performance, so you can also use an object array.
What you should be doing if you want it fast is to vectorize the convert2Vector function and assign the result to two columns:
import pandas as pd
import numpy as np
def convert2Vector(angle):
    """I don't know what your function does so this is just something that
    calculates the sin and cos of the input..."""
    ret = np.zeros((angle.size, 2), dtype=float)
    ret[:, 0] = np.sin(angle)
    ret[:, 1] = np.cos(angle)
    return ret
>>> df = pd.DataFrame({'col1' : [25, 12, 14, 21]})
>>> df['col2'] = [0]*len(df)
>>> df[['col1', 'col2']] = convert2Vector(df.col1)
>>> df
col1 col2
0 -0.132352 0.991203
1 -0.536573 0.843854
2 0.990607 0.136737
3 0.836656 -0.547729
You should call a higher-order function like df.apply or df.transform, which creates a new column that you then assign back:
In [1022]: df.col1.apply(lambda x: [x, x // 2])
Out[1022]:
0 [25, 12]
1 [12, 6]
2 [14, 7]
3 [21, 10]
Name: col1, dtype: object
In your case, you would do:
df['col1'] = df.col1.apply(convert2Vector)
Note that apply passes one scalar at a time here, so convert2Vector must accept a single angle; the result is an object column holding one vector per row.
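Since the real convert2Vector isn't shown, here is a hypothetical scalar version just to illustrate the shape of the call:
import numpy as np
import pandas as pd

def convert2Vector(angle):
    # hypothetical: one angle in, one 2D vector out
    return np.array([np.sin(angle), np.cos(angle)])

df = pd.DataFrame({'col1': [25, 12, 14, 21]})
df['col1'] = df.col1.apply(convert2Vector)  # object column of length-2 arrays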
There's an operation that is a little counter-intuitive when using the pandas apply() method. It took me a couple of hours of reading to solve, so here it is.
So here is what I was trying to accomplish.
I have a pandas dataframe like so:
test = pd.DataFrame({'one': [[2],['test']], 'two': [[5],[10]]})
one two
0 [2] [5]
1 [test] [10]
and I want to add the columns row by row to create a resulting column of lists, with length equal to the DataFrame's original length, like so:
def combine(row):
    result = row['one'] + row['two']
    return result
When running it through the dataframe using the apply() method:
test.apply(lambda x: combine(x), axis=1)
one two
0 2 5
1 test 10
Which isn't quite what we wanted. What we want is:
result
0 [2, 5]
1 [test, 10]
EDIT
I know there are simpler solutions to this example, but it is an abstraction of a much more complex operation. Here's an example of a more complex one:
df_one:
org_id date status id
0 2 2015/02/01 True 3
1 10 2015/05/01 True 27
2 10 2015/06/01 True 18
3 10 2015/04/01 False 27
4 10 2015/03/01 True 40
df_two:
org_id date
0 12 2015/04/01
1 10 2015/02/01
2 2 2015/08/01
3 10 2015/08/01
Here's a more complex operation:
def operation(row, df_one):
    # assumes df_one.date holds datetime values
    sel = (df_one.date < pd.Timestamp(row['date'])) & \
          (df_one['org_id'] == row['org_id'])
    last_changes = df_one[sel].groupby(['org_id', 'id']).last()
    id_list = last_changes[last_changes.status].reset_index().id.tolist()
    return id_list
then finally run:
df_one.sort_values('date', inplace=True)
df_two['id_list'] = df_two.apply(
operation,
axis=1,
args=(df_one,)
)
This would be impossible with the simpler solutions. Hence my proposal below is to rewrite operation as:
def operation(row, df_one):
    sel = (df_one.date < pd.Timestamp(row['date'])) & \
          (df_one['org_id'] == row['org_id'])
    last_changes = df_one[sel].groupby(['org_id', 'id']).last()
    id_list = last_changes[last_changes.status].reset_index().id.tolist()
    return pd.Series({'id_list': id_list})
We'd expect the following result:
id_list
0 []
1 []
2 [3]
3 [27, 18, 40]
IIUC we can simply sum two columns:
In [93]: test.sum(axis=1).to_frame('result')
Out[93]:
result
0 [2, 5]
1 [test, 10]
because when we sum lists:
In [94]: [2] + [5]
Out[94]: [2, 5]
they are getting concatenated...
So the answer to this problem lies in how the pandas apply() method works.
When defining
def combine(row):
    result = row['one'] + row['two']
    return result
the function will return a list for each row that gets passed in. This is a problem when we use the function with the .apply() method, because it will interpret each resulting list as a Series where each element becomes a column of that same row.
To solve this we need to create a Series where we specify a new column name like so:
def combine(row):
    result = row['one'] + row['two']
    return pd.Series({'result': result})
And if we run this again:
test.apply(lambda x: combine(x), axis=1)
result
0 [2, 5]
1 [test, 10]
We'll get what we originally wanted! Again, this is because we are forcing pandas to interpret the entire result as a column.
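As an aside, on newer pandas versions (0.23+) you can get the same Series of lists from the original combine (the one returning a plain list) by telling apply not to expand list-like results:
test.apply(combine, axis=1, result_type='reduce')
# 0        [2, 5]
# 1    [test, 10]
# dtype: object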