I have a function that takes as one of its inputs a dataframe, which is indexed by date. How can I run the function only on a subset of the dataframe (say, from 2005-2010)? I don't think I can just drop the rest of the rows from the dataframe because part of the function keeps track of a rolling average, and thus the first few rows would depend on dates I am not considering.
You can subset the dataframe to just the rows that you need. Since your dataframe is indexed by date, you can slice it by label:
df.loc['2005':'2010']  ## Gives only the rows from 2005 through 2010
Then, you can run the function on that subset:
my_function(df.loc['2005':'2010'])
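Because the rolling average depends on rows before the window starts, one way is to compute the rolling statistic on the full frame first and slice afterwards, so the early rows of the subset still see their history. A minimal sketch, assuming daily data, a 30-day window, and made-up column names:

```python
import pandas as pd
import numpy as np

# Hypothetical daily data from 2003-2012; the column names are assumptions.
idx = pd.date_range("2003-01-01", "2012-12-31", freq="D")
df = pd.DataFrame({"value": np.arange(len(idx), dtype=float)}, index=idx)

# Compute the rolling average on the FULL frame first, so rows in
# early 2005 still use the late-2004 history they depend on ...
df["rolling"] = df["value"].rolling(window=30).mean()

# ... then slice the date-indexed result down to 2005-2010.
subset = df.loc["2005":"2010"]
```

The slice comes last, so no row in the subset has a NaN rolling value caused by missing history.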
I would like to ask how I can join the dataframe as shown in (existing dataframe) to group values based on date & time and take the means of the values. What I mean is that if column B has two values in the same minute, it will take the average of those values and do the same for the rest of the columns. What I want to achieve is to have one value each minute, as shown in (preprocessed dataframe).
Thank you
If your dataframe is called df, you can do the following:
df.groupby(['DataTime']).mean()
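If the timestamps differ by seconds, grouping on the raw column will not merge rows from the same minute; flooring the timestamps to the minute first does. A sketch with made-up data (the 'DataTime' and 'B' names are taken from the question, the values are assumptions):

```python
import pandas as pd

# Hypothetical sensor readings; two rows fall in the same minute.
df = pd.DataFrame({
    "DataTime": pd.to_datetime([
        "2021-01-01 10:00:10", "2021-01-01 10:00:40",  # same minute
        "2021-01-01 10:01:05",
    ]),
    "B": [2.0, 4.0, 10.0],
})

# Floor each timestamp to the minute, then average within each minute.
per_minute = df.groupby(df["DataTime"].dt.floor("min")).mean(numeric_only=True)
```

The result has exactly one row per minute, with each numeric column averaged over that minute's readings.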
I am confused why A Pandas Groupby function can be written both of the ways below and yield the same result. The specific code is not really the question, both give the same result. I would like someone to breakdown the syntax of both.
df.groupby(['gender'])['age'].mean()
df.groupby(['gender']).mean()['age']
In the first instance, it reads as if you are calling the .mean() function on the age column specifically. The second appears like you are calling .mean() on the whole groupby object and selecting the age column after. Are there runtime considerations?
It reads as if you are calling the .mean() function on the age column specifically. The second appears like you are calling .mean() on the whole groupby object and selecting the age column after?
This is exactly what's happening. df.groupby() returns a DataFrameGroupBy object, not a dataframe. When you call .mean() on it, the mean is applied column-wise, so the mean of each column is calculated for each group independently of the other columns, and the results come back as a DataFrame whose columns you can then index.
Reversing the order isolates the age column first, producing a single column (a Series) per group, and only then calculates the mean. If you know you only want the mean for a single column, it will be faster to isolate that column first rather than calculate the mean for every column (especially if you have a very large dataframe).
Think of groupby as a row-separation function. It groups all rows having the same attributes (as specified in the by parameter) into separate sub-frames.
After the groupby, you need an aggregate function to summarize the data in each subframe. You can do that in a number of ways:
# In each subframe, take the `age` column and summarize it
# using the `mean` function
df.groupby(['gender'])['age'].mean()
# In each subframe, apply the `mean` function to all numeric
# columns then extract the `age` column
df.groupby(['gender']).mean()['age']
The first method is more efficient since you are applying the aggregate function (mean) on a single column.
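A small sketch with made-up data, showing that both spellings return the same Series (the values here are illustrative assumptions; `numeric_only=True` is added so the example also runs on recent pandas versions):

```python
import pandas as pd

# Toy frame with the column names from the question.
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "age":    [30, 40, 20, 30],
    "height": [160, 165, 175, 180],
})

# Select the column first, then aggregate only that column.
a = df.groupby(["gender"])["age"].mean()

# Aggregate every numeric column, then pick 'age' out of the result.
b = df.groupby(["gender"]).mean(numeric_only=True)["age"]
```

Both produce one mean age per gender; the first form simply avoids computing the mean of `height` only to throw it away.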
I have a dataframe with several rows of values. I need to filter these rows based on the value of a column (in this case the index column), perform a series of calculations and then return the calculated values to a new table. At the end I need a consolidated table with all the calculated values.
Example:
I have the following dataframe:
First I need to filter all the rows with 1 in the column index
Perform some calculation with only these values
Store the calculated values into a new table
Repeat the process for the rows with 2 in the column index.
Any idea how I can do this?
I can only guess without the actual data and code, but it looks like you need groupby+apply. You can try:
from statsmodels.tsa.holtwinters import ExponentialSmoothing

df.groupby('Index')['Values'].apply(
    lambda s: ExponentialSmoothing(s, trend='mul', seasonal='mul',
                                   seasonal_periods=12).fit().forecast(steps=15)
)
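The forecasting call above needs statsmodels, but the groupby + apply pattern itself can be sketched without it. In the sketch below, a simple normalisation stands in for the forecasting step, and the data is a made-up assumption:

```python
import pandas as pd

# Illustrative data: 'Index' marks the group, 'Values' holds the numbers.
df = pd.DataFrame({
    "Index":  [1, 1, 2, 2, 2],
    "Values": [10.0, 20.0, 3.0, 6.0, 9.0],
})

# Same groupby + apply pattern, with a placeholder calculation
# (divide each value by its group's maximum) standing in for the forecast.
result = df.groupby("Index")["Values"].apply(lambda s: s / s.max())
```

Each group is handed to the lambda as its own Series, and pandas concatenates the per-group results back into one object, which gives you the consolidated table in a single step.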
Scenario. Assume a pd.DataFrame, loaded from an external source, where one row is a line from a sensor. The index is a DateTimeIndex, with some rows having df.index.duplicated() == True. This actually means there are lines with the same timestamp from different sensors.
Now applying some logic, like df.loc[df.A>0, 'my_col'] = 1, I ran into ValueError: cannot reindex from a duplicate axis. This can be solved by simply removing the duplicated rows using
df[~df.index.duplicated()]
But I wonder if it would be possible to actually apply a column-based function during the index de-duplication process, e.g. calculating the mean/max/min of columns A/B/C for the duplicated rows.
Is this possible? It's something like a groupby.aggregate on the df.index.duplicated() rows.
Check with describe, which computes mean, min, max and the other common summary statistics for every column, per timestamp:
df.groupby(level=0).describe()
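If you want specific aggregates per column rather than the full describe table, groupby(level=0).agg works too. A minimal sketch with made-up sensor data (the column names follow the question, the values and the choice of aggregates are assumptions):

```python
import pandas as pd

# Two sensors reporting at the same timestamp -> duplicated index.
idx = pd.to_datetime([
    "2021-01-01 00:00", "2021-01-01 00:00", "2021-01-01 00:01",
])
df = pd.DataFrame({"A": [1.0, 3.0, 5.0], "B": [10.0, 20.0, 30.0]}, index=idx)

# Collapse duplicated timestamps with a per-column aggregate.
dedup = df.groupby(level=0).agg({"A": "mean", "B": "max"})

# The index is now unique, so label-based assignment works again.
dedup.loc[dedup["A"] > 0, "my_col"] = 1
```

Unlike df[~df.index.duplicated()], this keeps information from all the duplicated rows instead of silently dropping every row after the first.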
I have a large data frame with over 20000 observations. I have a variable called "station" and I need to remove all rows that have only numbers as the station name.
The only code that has worked so far is:
df['station'][~df['station'].str.isnumeric()]
However, this only gives me a single column (a Series) rather than the whole dataframe.
You can use an extra column with .str.isnumeric() to be used later on as a filter:
df['filter'] = df['station'].str.isnumeric()
df_filtered = df[df['filter'] != False]  # .drop(columns=['filter'])
This should return all rows whose station is not only numbers. After that, you can remove the hash if you wish to drop the filter column to maintain your original structure.
You can do it like so:
df_filtered = df[df['station'].str.isnumeric()==False]
You wouldn't have to create an extra filter column on your dataframe if you use this.
The inner statement is ultimately a Boolean mask that is applied to the dataframe.
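A self-contained sketch of the boolean-mask approach with made-up data, showing that every column of the matching rows is kept (the station names and the 'reading' column are illustrative assumptions):

```python
import pandas as pd

# Hypothetical station column mixing names and purely numeric labels.
df = pd.DataFrame({
    "station": ["Alpha", "1234", "B7", "42"],
    "reading": [1.0, 2.0, 3.0, 4.0],
})

# ~ negates the boolean mask; indexing the frame (not one column) with it
# keeps ALL columns of the rows whose station is not purely numeric.
df_filtered = df[~df["station"].str.isnumeric()]
```

The difference from df['station'][mask] is only where the mask is applied: applying it to the whole frame preserves the full row, not just the station column.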