How do I get the values from a previously assigned index - python

I have two data frames; the first column of the new frame is formed by taking the index values from the other data frame. This is tested and successfully returns 5 entries.
The second assignment executes but puts NaN in every row of the "StartPrice" column:
df = pd.DataFrame()
df["StartBar"] = df_rs["HighTrendStart"].dropna().index # Works
df["StartPrice"] = df_rs["HighTrendStart"].loc[df["StartBar"]] # Assigns NaN to all rows
As pointed out by @YOBEN_S, the indexes do not match.
Date
2020-05-01 00:00:00 NaN
2020-05-01 00:15:00 NaN
2020-05-01 00:30:00 NaN
2020-05-01 00:45:00 NaN
2020-05-01 01:00:00 NaN
Freq: 15T, Name: HighTrendStart, dtype: float64
0 2020-05-01 02:30:00
1 2020-05-01 06:30:00
2 2020-05-01 13:45:00
3 2020-05-01 16:15:00
4 2020-05-01 20:00:00
Name: StartBar, dtype: datetime64[ns]

The indexes do not match when you assign values from a different dataframe, so convert to a NumPy array to assign by position instead:
df["StartPrice"] = df_rs["HighTrendStart"].loc[df["StartBar"]].to_numpy()
For example
df=pd.DataFrame({'a':[1,2,3,4,5,6]})
s=pd.Series([1,2,3,4,5,6],index=list('abcdef'))
df
Out[190]:
a
0 1
1 2
2 3
3 4
4 5
5 6
s
Out[191]:
a 1
b 2
c 3
d 4
e 5
f 6
dtype: int64
df['New']=s
df
Out[193]:
a New
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 NaN
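Assignment aligns on index labels, which is why every row above ends up NaN. Stripping the index with `.to_numpy()`, as in the answer, assigns by position instead; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6]})
s = pd.Series([1, 2, 3, 4, 5, 6], index=list('abcdef'))

# Plain assignment aligns on index labels: 0..5 and 'a'..'f' share none, so all NaN
df['aligned'] = s

# .to_numpy() drops the index, so values are assigned purely by position
df['positional'] = s.to_numpy()
```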

Related

Subtract one column by itself based on a condition set by another column

I have the following data frame, where time_stamp is already sorted in ascending order:
time_stamp indicator
0 2021-01-01 00:00:00 1
1 2021-01-01 00:02:00 1
2 2021-01-01 00:03:00 NaN
3 2021-01-01 00:04:00 NaN
4 2021-01-01 00:09:00 NaN
5 2021-01-01 00:14:00 NaN
6 2021-01-01 00:19:00 NaN
7 2021-01-01 00:24:00 NaN
8 2021-01-01 00:27:00 1
9 2021-01-01 00:29:00 NaN
10 2021-01-01 00:32:00 2
11 2021-01-01 00:34:00 NaN
12 2021-01-01 00:37:00 2
13 2021-01-01 00:38:00 NaN
14 2021-01-01 00:39:00 NaN
I want to create a new column in the above data frame that shows the time difference between each row's time_stamp value and the time_stamp of the nearest row above it where indicator is not NaN.
Below is how the output should look (time_diff is a timedelta value, but I'll show the subtraction by row indices to better illustrate; for example, ( 2 - 1 ) means df['time_stamp'][2] - df['time_stamp'][1]):
time_stamp indicator time_diff
0 2021-01-01 00:00:00 1 NaT # (or undefined)
1 2021-01-01 00:02:00 1 1 - 0
2 2021-01-01 00:03:00 NaN 2 - 1
3 2021-01-01 00:04:00 NaN 3 - 1
4 2021-01-01 00:09:00 NaN 4 - 1
5 2021-01-01 00:14:00 NaN 5 - 1
6 2021-01-01 00:19:00 NaN 6 - 1
7 2021-01-01 00:24:00 NaN 7 - 1
8 2021-01-01 00:27:00 1 8 - 1
9 2021-01-01 00:29:00 NaN 9 - 8
10 2021-01-01 00:32:00 1 10 - 8
11 2021-01-01 00:34:00 NaN 11 - 10
12 2021-01-01 00:37:00 1 12 - 10
13 2021-01-01 00:38:00 NaN 13 - 12
14 2021-01-01 00:39:00 NaN 14 - 12
We can use a for loop that keeps track of the last non-NaN entry, but I'm looking for a solution that does not use a for loop.
I've ended up doing this:
# intermediate column: the timestamp of the most recent non-NaN `indicator` value
df['tracking'] = df['time_stamp'].where(df['indicator'].notna()).ffill()
# subtract it from `time_stamp`
df['time_diff'] = df['time_stamp'] - df['tracking']
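Note that the sample output anchors each non-NaN row to the non-NaN row strictly above it (row 8 shows 8 - 1, not 8 - 8). Assuming that is the intended behavior, shifting the forward-filled column down by one handles it; a sketch on a small made-up frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "time_stamp": pd.to_datetime([
        "2021-01-01 00:00", "2021-01-01 00:02", "2021-01-01 00:03",
        "2021-01-01 00:04", "2021-01-01 00:27",
    ]),
    "indicator": [1, 1, np.nan, np.nan, 1],
})

# Timestamp of the nearest non-NaN indicator row strictly above each row:
# mask out NaN-indicator rows, forward-fill, then shift down by one.
anchor = df["time_stamp"].where(df["indicator"].notna()).ffill().shift(1)
df["time_diff"] = df["time_stamp"] - anchor
```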

Add rows in date gaps

I need to insert rows in my dataframe to fill date gaps:
This is my df:
I want this result, grouped by client; that is, I have to create the missing rows for every client present in my dataframe.
Try something like this:
df['month'] = pd.to_datetime(df.month, format='%d/%m/%Y', dayfirst=True, errors='coerce')
df.set_index(['month']).groupby(['client']).resample('M').asfreq().drop('client', axis=1).reset_index()
client month col1
0 1 2017-03-31 20.0
1 1 2017-04-30 NaN
2 1 2017-05-31 90.0
3 1 2017-06-30 NaN
4 1 2017-07-31 NaN
5 1 2017-08-31 NaN
6 1 2017-09-30 NaN
7 1 2017-10-31 NaN
8 1 2017-11-30 NaN
9 1 2017-12-31 100.0
10 2 2018-09-30 NaN
11 2 2018-10-31 7.0
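A self-contained sketch of the same approach, with made-up sample data (the question's frame isn't shown), assuming month-end dates:

```python
import pandas as pd

df = pd.DataFrame({
    'client': [1, 1, 2, 2],
    'month': pd.to_datetime(['2017-03-31', '2017-06-30',
                             '2018-09-30', '2018-11-30']),
    'col1': [20.0, 90.0, 5.0, 7.0],
})

# Per client, reindex to a full month-end range; the gap months get NaN
out = (df.set_index('month')
         .groupby('client')
         .resample('M')
         .asfreq()
         .drop('client', axis=1)
         .reset_index())
```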

How can I split this excel file into two data frames?

When I try to load this excel spreadsheet into a dataframe I get a lot of NaN values due to all the random white space in the file. I'd really like to split Class I and Class A from this excel file into two separate pandas dataframes.
In:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
excel_file = 'EXAMPLE.xlsx'
df = pd.read_excel(excel_file, header=8)
print(df)
sys.exit()
Out:
Class I Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Class A Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12
0 Date NaN column 1 NaN column 2 NaN NaN NaN Date NaN column 1 NaN column 2
1 2019-12-31 00:00:00 NaN 1 NaN A NaN NaN NaN 2019-12-31 00:00:00 NaN A NaN 1
2 2020-01-01 00:00:00 NaN 2 NaN B NaN NaN NaN 2020-01-01 00:00:00 NaN B NaN 2
3 2020-01-02 00:00:00 NaN 3 NaN C NaN NaN NaN 2020-01-02 00:00:00 NaN C NaN 3
4 2020-01-03 00:00:00 NaN 4 NaN D NaN NaN NaN 2020-01-03 00:00:00 NaN D NaN 4
5 2020-01-04 00:00:00 NaN 5 NaN E NaN NaN NaN 2020-01-04 00:00:00 NaN E NaN 5
6 2020-01-05 00:00:00 NaN 6 NaN F NaN NaN NaN 2020-01-05 00:00:00 NaN F NaN 6
7 2020-01-06 00:00:00 NaN 7 NaN G NaN NaN NaN 2020-01-06 00:00:00 NaN G NaN 7
8 2020-01-07 00:00:00 NaN 8 NaN H NaN NaN NaN 2020-01-07 00:00:00 NaN H NaN 8
Try to use the parameter usecols. From the documentation:
If list of int, then indicates list of column numbers to be parsed.
import pandas as pd
df1 = pd.read_excel(excel_file, usecols=[0, 2, 4], header=8)
df2 = pd.read_excel(excel_file, usecols=[8, 10, 12], header=8)
This should create two dataframes with the columns you want.
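Alternatively, if the sheet has already been loaded into one wide frame, the same split can be done positionally with `iloc`. A sketch using a stand-in frame, since the real file isn't available (column positions assumed from the output above):

```python
import pandas as pd
import numpy as np

# Stand-in for the loaded sheet: 13 columns, Class I data in positions 0/2/4,
# Class A data in positions 8/10/12, NaN padding everywhere else.
data = np.full((3, 13), np.nan, dtype=object)
data[:, 0] = ["2019-12-31", "2020-01-01", "2020-01-02"]
data[:, 2] = [1, 2, 3]
data[:, 4] = ["A", "B", "C"]
data[:, 8] = ["2019-12-31", "2020-01-01", "2020-01-02"]
data[:, 10] = ["A", "B", "C"]
data[:, 12] = [1, 2, 3]
df = pd.DataFrame(data)

# Select each class's columns by position, dropping the padding columns
df1 = df.iloc[:, [0, 2, 4]]    # Class I
df2 = df.iloc[:, [8, 10, 12]]  # Class A
```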

how to groupby and calculate the percentage of non missing values in each column in pandas?

I have the following dataframe:
var loyal_date
1 2017-01-17
1 2017-01-03
1 2017-01-11
1 NaT
1 NaT
2 2017-01-15
2 2017-01-07
2 NaT
2 NaT
2 NaT
I need to group by the var column and find the percentage of non-missing values in the loyal_date column for each group. Is there any way to do it using a lambda function?
try this:
In [59]: df
Out[59]:
var loyal_date
0 1 2017-01-17
1 1 2017-01-03
2 1 2017-01-11
3 1 NaT
4 1 NaT
5 2 2017-01-15
6 2 2017-01-07
7 2 NaT
8 2 NaT
9 2 NaT
In [60]: df.groupby('var')['loyal_date'].apply(lambda x: x.notnull().sum()/len(x)*100)
Out[60]:
var
1 60.0
2 40.0
Name: loyal_date, dtype: float64
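Since `notna()` yields booleans, their mean is already the non-missing fraction; an equivalent, slightly more compact sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "var": [1] * 5 + [2] * 5,
    "loyal_date": pd.to_datetime([
        "2017-01-17", "2017-01-03", "2017-01-11", None, None,
        "2017-01-15", "2017-01-07", None, None, None,
    ]),
})

# notna() -> boolean Series; its mean is the fraction of non-missing values
pct = df.groupby("var")["loyal_date"].apply(lambda x: x.notna().mean() * 100)
```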

pandas group by date, assign value to a column

I have a DataFrame with columns = ['date','id','value'], where id represents different products. Assume that we have n products. I am looking to create a new dataframe with columns = ['date', 'valueid1', ..., 'valueidn'], where a value is assigned to the corresponding date row if it exists, and NaN is assigned if it doesn't. Many thanks
assuming you have the following DF:
In [120]: df
Out[120]:
date id value
0 2001-01-01 1 10
1 2001-01-01 2 11
2 2001-01-01 3 12
3 2001-01-02 3 20
4 2001-01-03 1 20
5 2001-01-04 2 30
you can use pivot_table() method:
In [121]: df.pivot_table(index='date', columns='id', values='value')
Out[121]:
id 1 2 3
date
2001-01-01 10.0 11.0 12.0
2001-01-02 NaN NaN 20.0
2001-01-03 20.0 NaN NaN
2001-01-04 NaN 30.0 NaN
or
In [122]: df.pivot_table(index='date', columns='id', values='value', fill_value=0)
Out[122]:
id 1 2 3
date
2001-01-01 10 11 12
2001-01-02 0 0 20
2001-01-03 20 0 0
2001-01-04 0 30 0
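The question asks for columns literally named valueid1, ..., valueidn; after pivoting, the id columns can be renamed to that form, for example:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2001-01-01", "2001-01-01", "2001-01-02"],
    "id": [1, 2, 3],
    "value": [10, 11, 20],
})

wide = df.pivot_table(index="date", columns="id", values="value")
# Rename the pivoted id columns to the 'valueid<n>' form the question asks for
wide.columns = [f"valueid{c}" for c in wide.columns]
wide = wide.reset_index()
```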
I think you need pivot:
df = df.pivot(index='date', columns='id', values='value')
Sample:
df = pd.DataFrame({'date':pd.date_range('2017-01-01', periods=5),
'id':[4,5,6,4,5],
'value':[7,8,9,1,2]})
print (df)
date id value
0 2017-01-01 4 7
1 2017-01-02 5 8
2 2017-01-03 6 9
3 2017-01-04 4 1
4 2017-01-05 5 2
df = df.pivot(index='date', columns='id', values='value')
#alternative solution
#df = df.set_index(['date','id'])['value'].unstack()
print (df)
id 4 5 6
date
2017-01-01 7.0 NaN NaN
2017-01-02 NaN 8.0 NaN
2017-01-03 NaN NaN 9.0
2017-01-04 1.0 NaN NaN
2017-01-05 NaN 2.0 NaN
But if you get:
ValueError: Index contains duplicate entries, cannot reshape
it is necessary to use an aggregating function like mean or sum, with groupby or pivot_table:
df = pd.DataFrame({'date':['2017-01-01', '2017-01-02',
'2017-01-03','2017-01-05','2017-01-05'],
'id':[4,5,6,4,4],
'value':[7,8,9,1,2]})
df.date = pd.to_datetime(df.date)
print (df)
date id value
0 2017-01-01 4 7
1 2017-01-02 5 8
2 2017-01-03 6 9
3 2017-01-05 4 1 <- duplicity 2017-01-05 4
4 2017-01-05 4 2 <- duplicity 2017-01-05 4
df = df.groupby(['date', 'id'])['value'].mean().unstack()
#alternative solution (same result as the groupby, only slower on a big df)
#df = df.pivot_table(index='date', columns='id', values='value', aggfunc='mean')
print (df)
id 4 5 6
date
2017-01-01 7.0 NaN NaN
2017-01-02 NaN 8.0 NaN
2017-01-03 NaN NaN 9.0
2017-01-05 1.5 NaN NaN <- 1.5 is mean (1 + 2)/2