I have a pandas DataFrame whose index I have already made a DatetimeIndex. How do I create a new column that is a running count of the rows?
For example, the following code is reproducible locally:
import datetime
import numpy as np
import pandas as pd
dates = [
datetime.date(2019, 1, 13),
datetime.date(2020, 5, 11),
datetime.date(2018, 7, 24),
datetime.date(2019, 3, 23),
datetime.date(2020, 2, 16)
]
data = {
"a": [13.3,12.3,np.nan,10.3,np.nan],
"b": [1,0,0,1,1],
"c": ["no","yes","no","","yes"]
}
df = pd.DataFrame(index=dates, data=data)
Right now, I would like to add a new column as a count: something like 1, 2, 3, 4, 5 until the end of the data.
df['count'] = range(1, len(df) + 1)
len(df) returns the number of rows in the DataFrame, so you can call the built-in range function to create a range from 1 through the number of rows and assign it to a new column. When a range is assigned to a column, pandas automatically converts it to a Series.
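For instance, on the sample DataFrame built in the question (a quick check; df is assumed to be the frame defined above):

df['count'] = range(1, len(df) + 1)
df['count'].tolist()   # [1, 2, 3, 4, 5]
df['count'].dtype      # typically int64 - the range became a regular integer column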
You can build a Series using df.index and apply some processing to it before assigning it to a column of the DataFrame.
Here, we could use:
df['count'] = pd.Series(1, index=df.index).cumsum()
Here this is far less efficient (by more than an order of magnitude) than df['count'] = np.arange(1, 1 + len(df)), which directly builds a NumPy array with the expected values, but the pattern can be useful in more complex use cases.
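If you want to verify the gap yourself, a rough comparison in an IPython/Jupyter session could look like this (a sketch only; exact timings are machine-dependent and the gap grows with the number of rows):

import numpy as np
import pandas as pd
df = pd.DataFrame(index=pd.date_range("2020-01-01", periods=100_000))
%timeit pd.Series(1, index=df.index).cumsum()  # builds a Series, then a cumulative scan
%timeit np.arange(1, 1 + len(df))              # a single vectorized allocation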
I have a list with two values: one is datetime.date(2018, 6, 18) and the other is NaT. Both were extracted from the unique values of a datetime column. I just want the list to contain only the date.
# extracting date from datetime column
main_df['date'] = main_df.DateTime.dt.date
# getting only unique values from date column
agg_hos = main_df['date'].unique()
# output is
array([datetime.date(2018, 6, 18), NaT], dtype=object)
I want to remove the NaT from the array. I have tried different answers from this site, such as those for removing NaN values from a NumPy array, but they do not work here.
# desired output
array([datetime.date(2018, 6, 18)], dtype=object)
How to do it?
You can use a pd.isnull check instead (borrowing from this answer):
import datetime
import numpy as np
import pandas as pd
# np.isfinite(pd.NaT), np.isnan(pd.NaT)
# -> TypeError !
arr = np.array([datetime.date(2018, 6, 18), pd.NaT])
arr = arr[~pd.isnull(arr)]
# array([datetime.date(2018, 6, 18)], dtype=object)
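If you prefer to stay in pandas, an equivalent alternative (a small sketch under the same setup) is to round-trip through a Series, whose dropna understands NaT as well as NaN:

arr = np.array([datetime.date(2018, 6, 18), pd.NaT])
# Series.dropna removes NaT, then to_numpy restores the object array
arr = pd.Series(arr).dropna().to_numpy()
# array([datetime.date(2018, 6, 18)], dtype=object)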
I have the following dataset. This is a small sample; the actual dataset is much larger.
What is the fastest way to:
iterate through days = (1,2,3,4,5,6)
calculate [...rolling(day, min_periods=day).mean()]
add it as a column df[f'sma_{day}']
The method I have is casting it to a dict of {ticker: price_df} and looping through, shown below.
I have thought of methods like groupby and stack/unstack but got stuck, and need help with appending the columns because they are multi-indexed.
I am favouring the method with the fastest %%timeit.
import yfinance as yf
df = yf.download(['MSFT','AAPL','AMZN'], start="2022-09-13").loc[:,['Close']].stack().swaplevel().sort_index()
df.index.set_names(['Ticker','Date'], inplace=True)
df
Here is the dictionary method I currently have:
df = df.reset_index()
df = dict(tuple(df.groupby(['Ticker'])))

## Iterate through days and keys
days = (1, 2, 3, 4, 5, 6)
for key in df.keys():
    for day in days:
        df[key][f'sma_{day}'] = df[key].Close.sort_index(ascending=True).rolling(day, min_periods=day).mean()

## Flatten dictionary
pd.concat(df.values()).set_index(['Ticker','Date']).sort_index()
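For comparison, here is a groupby-based sketch that avoids the dict round-trip entirely: one rolling pass per window size on the grouped Close column. The data below is a synthetic stand-in for the yfinance frame (same ('Ticker', 'Date') MultiIndex shape), so you can %%timeit both approaches side by side:

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['AAPL', 'AMZN', 'MSFT'], pd.date_range('2022-09-13', periods=10)],
    names=['Ticker', 'Date'],
)
df = pd.DataFrame({'Close': np.random.default_rng(0).normal(100, 5, len(idx))}, index=idx)

days = (1, 2, 3, 4, 5, 6)
for day in days:
    df[f'sma_{day}'] = (
        df.groupby(level='Ticker')['Close']
          .rolling(day, min_periods=day)
          .mean()
          .droplevel(0)  # groupby prepends a duplicate Ticker level; drop it before assigning
    )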
Noob question, but I have a column in a pandas DataFrame that I want to aggregate (as a running sum) into a new column.
I'm also trying to create a column that takes n * (average value), and a column for the difference between the two.
How...?
I added a link to a picture of the dataset to illustrate. VERY new to Python/Jupyter Notebook!
Thanks in advance! :)
import pandas as pd

n = pd.Series([1, 2, 3, 4, 5])
a = pd.Series([1, 2, 4, 6, 11])
cumsum = a.cumsum()
average_n = n * a.mean()
diff = average_n - cumsum
df = pd.concat([n, a, cumsum, average_n, diff], axis=1)
df.columns = ["n", "data", "cumsum", "average_n", "diff"]
df
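For reference, this produces (the mean of the data column is 24 / 5 = 4.8):

   n  data  cumsum  average_n  diff
0  1     1       1        4.8   3.8
1  2     2       3        9.6   6.6
2  3     4       7       14.4   7.4
3  4     6      13       19.2   6.2
4  5    11      24       24.0   0.0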
I want to concatenate the multidimensional output of a NumPy computation, whose dimensions match the shape of the input (with regard to rows and the respective selected columns), back onto the DataFrame.
But it fails with: NotImplementedError: Can only union MultiIndex with MultiIndex or Index of tuples, try mi.to_flat_index().union(other) instead.
I do not want to flatten the indices first - so is there another way to get it to work?
import pandas as pd
from pandas import Timestamp
df = pd.DataFrame({('metrik_0', Timestamp('2020-01-01 00:00:00')): {(1, 1): 2.5393693602911447, (1, 5): 4.316896324314225, (1, 6): 4.271001191238499, (1, 9): 2.8712588011247377, (1, 11): 4.0458495954752545}, ('metrik_0', Timestamp('2020-01-01 01:00:00')): {(1, 1): 4.02779063729038, (1, 5): 3.3849606155101224, (1, 6): 4.284114856052976, (1, 9): 3.980919941298365, (1, 11): 5.042488191587525}, ('metrik_0', Timestamp('2020-01-01 02:00:00')): {(1, 1): 2.374592085569529, (1, 5): 3.3405503781564487, (1, 6): 3.4049690284720366, (1, 9): 3.892686173978996, (1, 11): 2.1876998087043127}})
def compute_return_columns_to_df(df, colums_to_process, axis=0):
    method = 'compute_result'
    renamed_base_levels = map(lambda x: f'{x}_{method}', colums_to_process.get_level_values(0).unique())
    renamed_columns = colums_to_process.set_levels(renamed_base_levels, level=0)

    #####
    # perform the calculation in NumPy here
    # (the actual computation is irrelevant, so it is omitted in this minimal example)
    result = df[colums_to_process].values
    #####

    result = pd.DataFrame(result, columns=renamed_columns)
    display(result)
    # fails with: NotImplementedError: Can only union MultiIndex with MultiIndex
    # or Index of tuples, try mi.to_flat_index().union(other) instead.
    # I do not want to flatten the indices first - so is there another way to get it to work?
    return pd.concat([df, result], axis=1)

compute_return_columns_to_df(df[df.columns[0:3]].head(), df.columns[0:2])
The reason why your code failed is in:
result = df[colums_to_process].values
result = pd.DataFrame(result, columns=renamed_columns)
Note that result has column names with the top index level renamed to metrik_0_compute_result (so far OK), but its row index is the default single-level index, composed of consecutive numbers.
Then, when you concatenate df and result, pandas attempts to align both source DataFrames on the row index, but they are incompatible (df has a MultiIndex, whereas result has an "ordinary" index).
Change this part of your code to:
result = df[colums_to_process]
result.columns = renamed_columns
This way result keeps the original index and concat raises no exception.
Another remark: your function has an axis parameter which is never used. Consider removing it.
Another possible approach
Since result has a default (single-level) index, you can leave the previous part of the code as is, but reset the index in df before joining:
return pd.concat([df.reset_index(drop=True), result], axis=1)
This way both DataFrames have the same indices and you can concatenate them as well.
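Pulling the first fix together, a minimal corrected sketch of the function (the unused axis parameter and the display call are dropped; the renaming logic is unchanged apart from using a list comprehension instead of map):

def compute_return_columns_to_df(df, colums_to_process):
    method = 'compute_result'
    renamed_base_levels = [f'{x}_{method}' for x in colums_to_process.get_level_values(0).unique()]
    renamed_columns = colums_to_process.set_levels(renamed_base_levels, level=0)

    # keep the slice as a DataFrame so the original MultiIndex row index survives;
    # .copy() makes the column reassignment safe
    result = df[colums_to_process].copy()
    result.columns = renamed_columns

    # both frames now share the same row index, so concat aligns without error
    return pd.concat([df, result], axis=1)

compute_return_columns_to_df(df[df.columns[0:3]].head(), df.columns[0:2])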
I have a column in a pandas DataFrame in timestamp format and want to extract the unique dates (no time) into a list. I tried the following ways, which don't really work:
1. dates = datetime.datetime(df['EventTime'].tolist()).date()
2. dates = pd.to_datetime(df['EventTime']).date().tolist()
3. dates = pd.to_datetime(df['EventTime']).tolist().date()
Can anyone help?
You can use the dt accessor to reach the datetime properties of a Series. Try this, shown on a small sample frame:
df = pd.DataFrame({"EventTime": ["2014-01-01", "2014-01-01", "2014-01-02 10:12:00", "2014-01-02 09:12:00"]})
pd.to_datetime(df['EventTime']).dt.date.unique().tolist()
# [datetime.date(2014, 1, 1), datetime.date(2014, 1, 2)]
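If EventTime can contain missing values, a hedged variant is to drop them first, since NaT would otherwise appear in the result:

# assumes missing values are possible in EventTime; dropna removes the NaT rows
pd.to_datetime(df['EventTime']).dropna().dt.date.unique().tolist()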