Subset pandas dataframe based on first non zero occurrence - python

Here is the sample dataframe:
Trade_signal
2007-07-31 0.0
2007-08-31 0.0
2007-09-28 0.0
2007-10-31 0.0
2007-11-30 0.0
2007-12-31 0.0
2008-01-31 0.0
2008-02-29 0.0
2008-03-31 0.0
2008-04-30 0.0
2008-05-30 0.0
2008-06-30 0.0
2008-07-31 -1.0
2008-08-29 0.0
2008-09-30 -1.0
2008-10-31 -1.0
2008-11-28 -1.0
2008-12-31 0.0
2009-01-30 -1.0
2009-02-27 -1.0
2009-03-31 0.0
2009-04-30 0.0
2009-05-29 1.0
2009-06-30 1.0
2009-07-31 1.0
2009-08-31 1.0
2009-09-30 1.0
2009-10-30 0.0
2009-11-30 1.0
2009-12-31 1.0
1 represents buy and -1 represents sell. I want to subset the dataframe so that the new dataframe starts at the first occurrence of 1. Expected output:
2009-05-29 1.0
2009-06-30 1.0
2009-07-31 1.0
2009-08-31 1.0
2009-09-30 1.0
2009-10-30 0.0
2009-11-30 1.0
2009-12-31 1.0
Please suggest the way forward. Apologies if this is a repeated question.

Find the index label of the first row where Trade_signal equals 1, then slice from that label onward with .loc:
new_df = df.loc[df[df["Trade_signal"] == 1].index[0]:]
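A minimal, self-contained sketch of this approach (a shortened version of the question's data is assumed):

```python
import pandas as pd

# Small sample mirroring the question's Trade_signal series
df = pd.DataFrame(
    {"Trade_signal": [0.0, -1.0, 0.0, 1.0, 0.0, 1.0]},
    index=pd.to_datetime(
        ["2009-01-30", "2009-02-27", "2009-03-31",
         "2009-05-29", "2009-10-30", "2009-11-30"]
    ),
)

# Label of the first row where the signal equals 1; slice from there onward
first_buy = df[df["Trade_signal"] == 1].index[0]
new_df = df.loc[first_buy:]
print(new_df)
```

Because the index holds date labels, the slice must use .loc (label-based), not .iloc (position-based).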

Conditionally Set Values Greater Than 0 To 1

I have a dataframe that looks like this, with many more date columns
AUTHOR 2022-07-01 2022-10-14 2022-10-15 .....
0 Kathrine 0.0 7.0 0.0
1 Catherine 0.0 13.0 17.0
2 Amanda Jane 0.0 0.0 0.0
3 Jaqueline 0.0 3.0 0.0
4 Christine 0.0 0.0 0.0
I would like to set values in each column after the AUTHOR to 1 when the value is greater than 0, so the resulting table would look like this:
AUTHOR 2022-07-01 2022-10-14 2022-10-15 .....
0 Kathrine 0.0 1.0 0.0
1 Catherine 0.0 1.0 1.0
2 Amanda Jane 0.0 0.0 0.0
3 Jaqueline 0.0 1.0 0.0
4 Christine 0.0 0.0 0.0
I tried the following line of code but got an error, which makes sense, as I need to figure out how to apply this code only to the date columns while keeping the AUTHOR column in my table.
Counts[Counts != 0] = 1
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value
You can select the date columns first, then mask on those columns:
cols = df.drop(columns='AUTHOR').columns
# or
cols = df.filter(regex=r'\d{4}-\d{2}-\d{2}').columns
# or
cols = df.select_dtypes(include='number').columns
df[cols] = df[cols].mask(df[cols] != 0, 1)
print(df)
AUTHOR 2022-07-01 2022-10-14 2022-10-15
0 Kathrine 0.0 1.0 0.0
1 Catherine 0.0 1.0 1.0
2 Amanda Jane 0.0 0.0 0.0
3 Jaqueline 0.0 1.0 0.0
4 Christine 0.0 0.0 0.0
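A runnable sketch of the mask approach, using a subset of the question's sample data (column names taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "AUTHOR": ["Kathrine", "Catherine", "Amanda Jane"],
    "2022-07-01": [0.0, 0.0, 0.0],
    "2022-10-14": [7.0, 13.0, 0.0],
    "2022-10-15": [0.0, 17.0, 0.0],
})

# Only the numeric (date) columns; the AUTHOR column is left untouched
cols = df.select_dtypes(include="number").columns
df[cols] = df[cols].mask(df[cols] != 0, 1)
print(df)
```

mask replaces values where the condition is True, so every nonzero entry becomes 1 while zeros pass through unchanged.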
Since you only want to exclude the first column, you can set it as the index, apply the mask, and reset the index at the end:
df = df.set_index('AUTHOR').pipe(lambda g: g.mask(g > 0, 1)).reset_index()
df
        AUTHOR  2022-07-01  2022-10-14  2022-10-15
0     Kathrine         0.0         1.0         0.0
1    Catherine         0.0         1.0         1.0
2  Amanda Jane         0.0         0.0         0.0
3    Jaqueline         0.0         1.0         0.0
4    Christine         0.0         0.0         0.0

Making columns from a range of lines in pandas

I have a csv file like the below:
A, B
1,2
3,4
5,6
C,D
7,8
9,10
11,12
E,F
13,14
15,16
As you can see (and imagine), when I import this data using pd.read_csv, pandas creates two columns (A, B) and a bunch of rows. That's correct given the shape. However, I want to create separate columns (A, B, C, D, ...). Fortunately, there is a blank line at the end of each "column" block, and I think this could be used to separate these lines in some way. However, I don't know how to proceed.
The data:
https://raw.githubusercontent.com/AlessandroMDO/Dinamica_de_Voo/master/data.csv
This is normal behavior for pandas.read_csv, but data is usually not stored in CSV files this way.
You can try to read the csv, strip extra spaces and split it by empty lines to parts first. Then read each part using pandas.read_csv and StringIO and concatenate them together using pandas.concat.
import pandas as pd
from io import StringIO
with open('test.csv', 'r') as f:
    parts = f.read().strip().split('\n\n')
df = pd.concat([pd.read_csv(StringIO(part)) for part in parts], axis=1)
I have tried this with your csv:
Alpha Cd Alpha CL Alpha ... Cnp Alpha Cnr Alpha Clr
0 -14.0 0.08941 -14.0 -0.19430 -14.0 ... 0.0 -14.0 0.0 -14.0 0.0
1 -12.0 0.07646 -12.0 -0.17150 -12.0 ... 0.0 -12.0 0.0 -12.0 0.0
2 -10.0 0.06509 -10.0 -0.14710 -10.0 ... 0.0 -10.0 0.0 -10.0 0.0
3 -8.0 0.05545 -8.0 -0.12150 -8.0 ... 0.0 -8.0 0.0 -8.0 0.0
4 -6.0 0.04766 -6.0 -0.09479 -6.0 ... 0.0 -6.0 0.0 -6.0 0.0
5 -4.0 0.04181 -4.0 -0.06722 -4.0 ... 0.0 -4.0 0.0 -4.0 0.0
6 -2.0 0.03797 -2.0 -0.03905 -2.0 ... 0.0 -2.0 0.0 -2.0 0.0
7 0.0 0.03620 0.0 -0.01054 0.0 ... 0.0 0.0 0.0 0.0 0.0
8 2.0 0.03651 2.0 0.01806 2.0 ... 0.0 2.0 0.0 2.0 0.0
9 4.0 0.03960 4.0 0.05879 4.0 ... 0.0 4.0 0.0 4.0 0.0
10 6.0 0.04814 6.0 0.12650 6.0 ... 0.0 6.0 0.0 6.0 0.0
11 8.0 0.06494 8.0 0.22050 8.0 ... 0.0 8.0 0.0 8.0 0.0
12 10.0 0.09268 10.0 0.33960 10.0 ... 0.0 10.0 0.0 10.0 0.0
13 12.0 0.13390 12.0 0.48240 12.0 ... 0.0 12.0 0.0 12.0 0.0
14 14.0 0.19110 14.0 0.64710 14.0 ... 0.0 14.0 0.0 14.0 0.0
[15 rows x 36 columns]
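The split-and-concat approach can be demonstrated end to end with an inline stand-in for the file (sample blocks invented here to keep the sketch self-contained):

```python
import pandas as pd
from io import StringIO

# Inline stand-in for the question's file: blocks separated by blank lines
raw = """A,B
1,2
3,4

C,D
7,8
9,10
"""

# Split on blank lines, parse each block separately, and concatenate side by side
parts = raw.strip().split("\n\n")
df = pd.concat([pd.read_csv(StringIO(part)) for part in parts], axis=1)
print(df)
```

Each block becomes its own small dataframe, and axis=1 lines them up as adjacent columns.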

Counting consonants and vowels in a split string

I read in a .csv file. I have the following dataframe, which counts vowels and consonants in the string column Description. This works great, but I want to split Description into 8 columns and count the consonants and vowels in each of those columns. The second part of my code splits Description into 8 columns. How can I count the vowels and consonants on all 8 columns?
import pandas as pd
import re
def anti_vowel(s):
    result = re.sub(r'[AEIOU]', '', s, flags=re.IGNORECASE)
    return result
data = pd.read_csv('http://core.secure.ehc.com/src/util/detail-price-list/TristarDivision_SummitMedicalCenter_CM.csv')
data.dropna(inplace = True)
data['Vowels'] = data['Description'].str.count(r'[aeiou]', flags=re.I)
data['Consonant'] = data['Description'].str.count(r'[bcdfghjklmnpqrstvwxzy]', flags=re.I)
print (data)
This is the code I'm using to split the column Description into 8 columns.
import pandas as pd
data = pd.read_csv('http://core.secure.ehc.com/src/util/detail-price-list/TristarDivision_SummitMedicalCenter_CM.csv')
data.dropna(inplace = True)
data = data["Description"].str.split(" ", n = 8, expand = True)
print (data)
Now how can I put it all together?
In order to read each of the 8 columns and count consonants, I know I can use the following, replacing the 0 with 0-7:
testconsonant = data[0].str.count(r'[bcdfghjklmnpqrstvwxzy]', flags=re.I)
testvowel = data[0].str.count(r'[aeiou]', flags=re.I)
Desired output would be:
Description[0], vowel count, consonant count; Description[1], vowel count, consonant count; Description[2], vowel count, consonant count; ... all the way to Description[7].
stack then unstack
stacked = data.stack()
pd.concat({
    'Vowels': stacked.str.count('[aeiou]', flags=re.I),
    'Consonant': stacked.str.count('[bcdfghjklmnpqrstvwxzy]', flags=re.I)
}, axis=1).unstack()
Consonant Vowels
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 3.0 5.0 5.0 1.0 2.0 NaN NaN NaN NaN 1.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
1 8.0 5.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
2 8.0 5.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
3 8.0 5.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
4 3.0 5.0 3.0 1.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 NaN
5 3.0 5.0 3.0 1.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 NaN
6 3.0 4.0 0.0 1.0 0.0 0.0 0.0 NaN NaN 3.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN NaN
7 3.0 3.0 0.0 1.0 0.0 0.0 0.0 NaN NaN 3.0 1.0 0.0 1.0 0.0 0.0 0.0 NaN NaN
8 3.0 3.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 3.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
9 3.0 3.0 0.0 1.0 0.0 0.0 0.0 NaN NaN 3.0 1.0 0.0 1.0 0.0 0.0 0.0 NaN NaN
10 3.0 3.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
11 3.0 3.0 0.0 2.0 2.0 NaN NaN NaN NaN 3.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
12 3.0 3.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
13 3.0 3.0 0.0 2.0 2.0 NaN NaN NaN NaN 3.0 1.0 0.0 0.0 0.0 NaN NaN NaN NaN
14 3.0 5.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
15 3.0 3.0 0.0 3.0 1.0 NaN NaN NaN NaN 3.0 0.0 0.0 0.0 1.0 NaN NaN NaN NaN
If you want to combine this with the data dataframe, you can do:
stacked = data.stack()
pd.concat({
    'Data': data,
    'Vowels': stacked.str.count('[aeiou]', flags=re.I),
    'Consonant': stacked.str.count('[bcdfghjklmnpqrstvwxzy]', flags=re.I)
}, axis=1).unstack()
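A small self-contained sketch of the stack/count/unstack idea, using two invented words per column in place of the split Description data:

```python
import re
import pandas as pd

# Stand-in for the split Description columns (the real data comes from the CSV)
data = pd.DataFrame({
    0: ["HC", "ER"],
    1: ["VISIT", "LEVEL"],
})

# Stack to a single Series (one element per cell), count per element,
# then unstack so each original column gets its own pair of count columns
stacked = data.stack()
counts = pd.concat({
    "Vowels": stacked.str.count(r"[aeiou]", flags=re.I),
    "Consonant": stacked.str.count(r"[bcdfghjklmnpqrstvwxzy]", flags=re.I),
}, axis=1).unstack()
print(counts)
```

The result has a two-level column index: the outer level is the count type, the inner level is the original column number 0-7.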

pandas calculating mean per month

I created the following dataframe:
availability = pd.DataFrame(propertyAvailableData).set_index("createdat")
monthly_availability = availability.fillna(value=0).groupby(pd.TimeGrouper(freq='M'))
This gives the following output
2015-08-18 2015-09-09 2015-09-10 2015-09-11 2015-09-12 \
createdat
2015-08-12 1.0 1.0 1.0 1.0 1.0
2015-08-17 0.0 0.0 0.0 0.0 0.0
2015-08-18 0.0 1.0 1.0 1.0 1.0
2015-08-18 0.0 0.0 0.0 0.0 0.0
2015-08-19 0.0 1.0 1.0 1.0 1.0
2015-09-03 0.0 1.0 1.0 1.0 1.0
2015-09-03 0.0 1.0 1.0 1.0 1.0
2015-09-07 0.0 0.0 0.0 0.0 0.0
2015-09-08 0.0 0.0 0.0 0.0 0.0
2015-09-11 0.0 0.0 0.0 0.0 0.0
I'm trying to get the averages per createdat month by doing:
monthly_availability_mean = monthly_availability.mean()
However, here I get the following output:
2015-08-18 2015-09-09 2015-09-10 2015-09-11 2015-09-12 \
createdat
2015-08-31 0.111111 0.444444 0.666667 0.777778 0.777778
2015-09-30 0.000000 0.222222 0.222222 0.222222 0.222222
2015-10-31 0.000000 0.000000 0.000000 0.000000 0.000000
And when I hand-check August I get:
(1.0 + 0 + 0 + 0 + 0) / 5 = 0.2
How do I get the correct mean per month?
availability.resample('M').mean()
I just encountered the same issue and solved it with the following code
# load daily data
df = pd.read_csv('./name.csv')
# set Date as index
df.Date = pd.to_datetime(df.Date)
df_date = df.set_index('Date', inplace=False)
# get monthly mean
df_month = df_date.resample('M').mean()
# group by month number
df_monthly_mean = df_month.groupby(df_month.index.month).mean()
Hope this was helpful!
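A minimal sketch of the resample('M').mean() answer, using a handful of dates assumed from the question's sample:

```python
import pandas as pd

# Daily values with a DatetimeIndex, as in the question
s = pd.Series(
    [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    index=pd.to_datetime(
        ["2015-08-12", "2015-08-17", "2015-08-18", "2015-08-19",
         "2015-09-03", "2015-09-07", "2015-09-08"]
    ),
)

# One mean per calendar month, labeled by month-end date
monthly = s.resample("M").mean()
print(monthly)
```

resample divides each month's sum by the number of observations in that month, which is exactly the hand-check the asker wanted.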

Set value based on day in month in pandas timeseries

I have a timeseries
date
2009-12-23 0.0
2009-12-28 0.0
2009-12-29 0.0
2009-12-30 0.0
2009-12-31 0.0
2010-01-04 0.0
2010-01-05 0.0
2010-01-06 0.0
2010-01-07 0.0
2010-01-08 0.0
2010-01-11 0.0
2010-01-12 0.0
2010-01-13 0.0
2010-01-14 0.0
2010-01-15 0.0
2010-01-18 0.0
2010-01-19 0.0
2010-01-20 0.0
2010-01-21 0.0
2010-01-22 0.0
2010-01-25 0.0
2010-01-26 0.0
2010-01-27 0.0
2010-01-28 0.0
2010-01-29 0.0
2010-02-01 0.0
2010-02-02 0.0
I would like to set the value to 1 based on the following rule:
If the constant is set 9 this means the 9th of each month. Due to
that that 2010-01-09 doesn't exist I would like to set the next date
that exists in the series to 1 which is 2010-01-11 above.
I have tried to create two series, one (series1) with day < 9 set to 1 and one (series2) with day > 9 set to 1, and then computed series1.shift(1) * series2.
It works in the middle of the month, but not if day is set to 1, because the last date in the previous month is set to 0 in series1.
Assume your timeseries is s with a DatetimeIndex.
First, create a monthly groupby of a boolean series marking the index values whose day-of-month is greater than or equal to 9.
g = s.index.to_series().dt.day.ge(9).groupby(pd.Grouper(freq='M'))
Then check that each month has at least one such day, grab the first among them, and assign those dates the value 1.
s.loc[g.idxmax()[g.any()]] = 1
s
date
2009-12-23 1.0
2009-12-28 0.0
2009-12-29 0.0
2009-12-30 0.0
2009-12-31 0.0
2010-01-04 0.0
2010-01-05 0.0
2010-01-06 0.0
2010-01-07 0.0
2010-01-08 0.0
2010-01-11 1.0
2010-01-12 0.0
2010-01-13 0.0
2010-01-14 0.0
2010-01-15 0.0
2010-01-18 0.0
2010-01-19 0.0
2010-01-20 0.0
2010-01-21 0.0
2010-01-22 0.0
2010-01-25 0.0
2010-01-26 0.0
2010-01-27 0.0
2010-01-28 0.0
2010-01-29 0.0
2010-02-01 0.0
2010-02-02 0.0
Name: val, dtype: float64
Note that 2009-12-23 was also assigned a 1, since it satisfies the requirement as well: it is the first date in its month on or after the 9th.
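A self-contained sketch of this answer, with a small invented index spanning the gap around the 9th (pd.Grouper used here in place of the older pd.TimeGrouper):

```python
import pandas as pd

# Business-day-like index that skips the 9th in January, as in the question
s = pd.Series(0.0, index=pd.to_datetime(
    ["2010-01-04", "2010-01-05", "2010-01-06", "2010-01-07",
     "2010-01-08", "2010-01-11", "2010-01-12",
     "2010-02-01", "2010-02-09", "2010-02-10"]
))

# Per month: the first index value whose day-of-month is >= 9 (if any) gets a 1
mask = s.index.to_series().dt.day.ge(9)
g = mask.groupby(pd.Grouper(freq="M"))
s.loc[g.idxmax()[g.any()]] = 1
print(s)
```

idxmax on a boolean group returns the label of the first True, and the g.any() filter skips months with no qualifying date.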
