Operations on Pandas DataFrame Index

How can I easily perform an operation on a Pandas DataFrame Index? Let's say I create a DataFrame like so:
df = DataFrame(np.random.rand(5,3), index=[0, 1, 2, 4, 5])
and I want to find the mean sampling rate. The way I do this now doesn't seem quite right.
fs = 1./np.mean(np.diff(df.index.values.astype(float)))
I feel like there must be a better way to do this, but I can't figure it out.
Thanks for any help.

@BrenBarn is correct: it is better to make this a column in the frame, but you can also do this:
In [2]: df = DataFrame(np.random.rand(5,3), index=[0, 1, 2, 4, 5])
In [3]: df.index.to_series()
Out[3]:
0    0
1    1
2    2
4    4
5    5
dtype: int64
In [4]: s = df.index.to_series()
In [5]: 1./s.diff().mean()
Out[5]: 0.80000000000000004
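For reference, the same calculation can be written as a self-contained script. This is only a sketch, assuming the index holds numeric sample times; the name fs_endpoints is made up here for a cross-check.

```python
import numpy as np
import pandas as pd

# Same frame as above: samples at times 0, 1, 2, 4, 5 (no sample at 3).
df = pd.DataFrame(np.random.rand(5, 3), index=[0, 1, 2, 4, 5])

# Turn the index into a Series so that Series methods such as diff() apply.
s = df.index.to_series()

# Mean spacing between consecutive samples, inverted to get a rate.
fs = 1.0 / s.diff().mean()

# Cross-check (hypothetical name): for a sorted index, the mean spacing
# equals (last - first) / (number of intervals).
fs_endpoints = (len(df.index) - 1) / (df.index[-1] - df.index[0])

print(fs, fs_endpoints)
```

Both expressions give the same rate here because the index is sorted; for an unsorted index, sort it first.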

Related

How to get rows from only some columns based on columns value in Pandas?

I use Pandas to read data from Excel. From those tables, I often need to find one or more values in a single row, based on the value in a column.
I've read a lot about Pandas (docs and SO), and almost every time the question is like « how to SELECT * FROM df WHERE value = smthing ».
But what I'd like to do is more like:
SELECT Col1, Col2
FROM df
WHERE Col3.value = smthing
And I can't find any answer.
For example:
>>> dataFrame
   foo  bar  sm_else
0    0    3        6
1    1    4        7
2    2    5        8
I want to get the foo value and the sm_else value when bar == 4.
So:
foo  sm_else
  1        7
The result can be a DataFrame, a list, or a dict; I don't really care.
Thanks!
How can I achieve this?

df.loc can help you out:
import pandas as pd
df = pd.DataFrame(data={'col1': [1, 2, 3],
                        'col2': [4, 5, 6],
                        'col3': [7, 8, 9]})
# select the rows where col2 == 4 and keep only col1 and col2
print(df.loc[df['col2'] == 4, ['col1', 'col2']])

df.loc[df.bar == 4, ['foo', 'sm_else']]
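Applied to the frame from the question, the single .loc call might look like the sketch below. The variable names subset and records are only illustrative.

```python
import pandas as pd

# Frame from the question.
df = pd.DataFrame({'foo': [0, 1, 2], 'bar': [3, 4, 5], 'sm_else': [6, 7, 8]})

# Row selector: a boolean mask on 'bar'; column selector: a list of labels.
subset = df.loc[df['bar'] == 4, ['foo', 'sm_else']]
print(subset)

# If a list of dicts is more convenient than a DataFrame:
records = subset.to_dict('records')
print(records)
```

Passing both selectors to one .loc call avoids the intermediate frame created by chaining two indexing steps.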

Why does combine_first() display this behavior, when substituting values from one column into another column in the same DataFrame?

I am new to Stack Overflow.
I noticed this behavior of pandas combine_first() and would simply like to understand why.
When I have the following dataframe,
df = pd.DataFrame({'A':[6,'',7,''], 'B':[1, 3, 5, 3]})
df['A'].combine_first(df['B'])
Out[1]:
0 6
1
2 7
3
Name: A, dtype: object
Whereas initializing with np.nan instead of '' gives the expected behavior of combine_first():
df = pd.DataFrame({'A':[6,np.nan,7,np.nan], 'B':[1, 3, 5, 3]})
df['A'].combine_first(df['B'])
Out[2]:
0 6.0
1 3.0
2 7.0
3 3.0
Name: A, dtype: float64
Replacing the '' with np.nan and then applying combine_first() doesn't seem to work either:
df = pd.DataFrame({'A':[6,'',7,''], 'B':[1, 3, 5, 3]})
df.replace('', np.nan)
df['A'].combine_first(df['B'])
Out[3]:
0 6
1
2 7
3
Name: A, dtype: object
I would like to understand why this happens before using an alternate method for this purpose.

This seems to have been pretty obvious to people here, but thank you for posting the comments!
My mistake in the third DataFrame I posted, as pointed out by @W-B, was not assigning the result of replace() back to df:
df = pd.DataFrame({'A':[6,'',7,''], 'B':[1, 3, 5, 3]})
df = df.replace('', np.nan)
df['A'].combine_first(df['B'])
Also, as @ALollz pointed out, df['A'] contains empty strings (''), which are not null values, so combine_first() leaves them in place. It does sound simple in hindsight, but I couldn't figure it out earlier!
Thank you!
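A short check of both points, as a sketch: isna() shows that the empty strings are not treated as missing, and combine_first() fills the gaps once they have been replaced and the result assigned back. Only column A is replaced here, which is a small variation on the code above; the names unchanged and filled are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [6, '', 7, ''], 'B': [1, 3, 5, 3]})

# Empty strings are ordinary values, not missing ones.
print(df['A'].isna().sum())

# combine_first() only fills positions that are null, so nothing changes yet.
unchanged = df['A'].combine_first(df['B'])

# replace() returns a new object; assign it back so the NaNs are kept.
df['A'] = df['A'].replace('', np.nan)
filled = df['A'].combine_first(df['B'])
print(filled.tolist())
```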

Multiple filters Python Data.frame

I'm pretty new to Python. I'm trying to filter rows in a data frame as I do in R.
sub_df = df[df[main_id]==3]
works, but
df[df[main_id] in [3,7]]
gives me the error
"The truth value of a Series is ambiguous"
Can you please suggest the correct syntax for this kind of selection?

You can use the pandas isin function. It would look like this:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
df[df['A'].isin([2, 3])]
giving:
   A  B
1  2  b
2  3  f

df[df[main_id].apply(lambda x: x in [3, 7])]

Yet another solution:
In [60]: df = pd.DataFrame({'main_id': [0,1, 2, 3], 'x': list('ABCD')})
In [61]: df
Out[61]:
   main_id  x
0        0  A
1        1  B
2        2  C
3        3  D
In [62]: df.query("main_id in [0,3]")
Out[62]:
   main_id  x
0        0  A
3        3  D
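To see the three answers side by side, and why the original attempt fails, here is a sketch on a small made-up frame. Python's in operator has to reduce the element-wise comparison to a single True or False, and that reduction is what raises the error.

```python
import pandas as pd

# Made-up data; only the column name main_id comes from the question.
df = pd.DataFrame({'main_id': [1, 3, 5, 7], 'x': list('ABCD')})

# The original attempt: `in` needs one boolean, but the comparison
# against the Series is element-wise, so pandas raises ValueError.
try:
    df[df['main_id'] in [3, 7]]
    raised = False
except ValueError as err:
    raised = True
    print(err)

# The three suggested alternatives select the same rows.
by_isin = df[df['main_id'].isin([3, 7])]
by_apply = df[df['main_id'].apply(lambda v: v in [3, 7])]
by_query = df.query("main_id in [3, 7]")

print(by_isin)
```

Of the three, isin() is vectorized, while apply() calls the lambda once per row.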

Pandas: Difference between largest and smallest value within group

Given a data frame that looks like this
GROUP  VALUE
    1      5
    2      2
    1     10
    2     20
    1      7
I would like to compute the difference between the largest and smallest value within each group. That is, the result should be
GROUP  DIFF
    1     5
    2    18
What is an easy way to do this in Pandas?
What is a fast way to do this in Pandas for a data frame with about 2 million rows and 1 million groups?

Using @unutbu's df.
Per the timing, unutbu's solution is best over large data sets.
import pandas as pd
import numpy as np
df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})
df.groupby('GROUP')['VALUE'].agg(np.ptp)
GROUP
1 5
2 18
Name: VALUE, dtype: int64
np.ptp returns the range (maximum minus minimum) of an array.
Timing setups (the timing results themselves are not reproduced here):
small df: the frame above
large df:
df = pd.DataFrame(dict(GROUP=np.arange(1000000) % 100, VALUE=np.random.rand(1000000)))
large df, many groups:
df = pd.DataFrame(dict(GROUP=np.arange(1000000) % 10000, VALUE=np.random.rand(1000000)))

groupby/agg generally performs best when you take advantage of the built-in aggregators such as 'max' and 'min'. So to obtain the difference, first compute the max and min and then subtract:
import pandas as pd
df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})
result = df.groupby('GROUP')['VALUE'].agg(['max','min'])
result['diff'] = result['max']-result['min']
print(result[['diff']])
yields
       diff
GROUP
1         5
2        18
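A slightly more compact variant of the same idea, sketched here, calls the built-in max() and min() reductions directly on the grouped column and subtracts the results, which align on the group label. No timing claim is made for it; the names grouped and diff are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})

grouped = df.groupby('GROUP')['VALUE']

# Both results are Series indexed by GROUP, so the subtraction is per group.
diff = (grouped.max() - grouped.min()).rename('DIFF')
print(diff.reset_index())
```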

Note: this will get the job done, but @piRSquared's answer has faster methods.
You can use groupby(), min(), and max():
df.groupby('GROUP')['VALUE'].apply(lambda g: g.max() - g.min())

How to add column labels to a Pandas DataFrame

I can't understand how to add column names to a pandas DataFrame. An easy example will clarify my issue:
dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1]}
df = pd.DataFrame(dic)
Now if I type df then I get
   a  b  c
0  4  4  5
1  1  2  7
2  3  1  9
3  1  4  1
Say now that I generate another DataFrame just by summing the columns of the previous one:
a = df.sum()
If I type a then I get
a     9
b    11
c    22
That looks like a DataFrame with an index but without a name on its only column. So I wrote
a.columns = ['column']
or
a.columns = ['index', 'column']
and in both cases Python was happy, because it didn't give me any error message. But still, if I type a, I can't see the column names anywhere. What's wrong here?

The method DataFrame.sum() does an aggregation and therefore returns a Series, not a DataFrame. A Series has no columns, only an index. If you want to create a DataFrame out of your sum, you can replace a = df.sum() with:
a = pd.DataFrame(df.sum(), columns=['whatever_name_you_want'])
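As a sketch of the difference, the snippet below checks the type returned by df.sum() and builds a one-column DataFrame from it. It uses Series.to_frame(), an alternative to the constructor call above; the column name total is arbitrary.

```python
import pandas as pd

dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1]}
df = pd.DataFrame(dic)

sums = df.sum()
print(type(sums).__name__)  # the aggregation result is a Series

# Turn the Series into a DataFrame with a named column; the old column
# labels a, b, c become the row index.
a = sums.to_frame('total')
print(a)
```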
