Fetch previous rows based on a condition using the shift function - Python dataframe

I have data as shown below. I would like to select rows based on two conditions.
1) rows that start with digits (1, 2, 3, etc.)
2) the previous row of each record that satisfies the first condition
Please find below how the input data looks.
Please find below how I expect the output to be.
I tried using the shift(-1) function, but it seems to be throwing an error. I am sure I messed up the logic/syntax. Please find below the code that I tried:
# get the index of all records that start with a number
s = df1.loc[df1['VARIABLE'].str.contains(r'^\d')==True].index
# now I need to get the previous record of each group, but this is
# incorrect
df1.loc[((df1['VARIABLE'].shift(-1).str.contains(r'^\d')==False) &
         (df1['VARIABLE'].str.contains(r'^\d')==True))].index

Use:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'VARIABLE':['studyid',np.nan,'age_interview','Gender','1.Male',
                                '2.Female',np.nan, 'dob', 'eth',
                                'Ethnicity','1.Chinese','2.Indian','3.Malay']})
#first remove missing rows by column VARIABLE
df1 = df1.dropna(subset=['VARIABLE'])
#test starting numbers
s = df1['VARIABLE'].str.contains(r'^\d')
#chain shifted values by | for OR
mask = s | s.shift(-1)
#filtering by boolean indexing
df1 = df1[mask]
print (df1)
VARIABLE
3 Gender
4 1.Male
5 2.Female
9 Ethnicity
10 1.Chinese
11 2.Indian
12 3.Malay
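To make the mask logic explicit, here is a minimal sketch on a toy frame (data invented, same column name as above): a row is kept if it starts with a digit or if the row immediately after it does, which is how the label row before each numbered group is selected.
import pandas as pd

toy = pd.DataFrame({'VARIABLE': ['Gender', '1.Male', '2.Female', 'dob']})
s = toy['VARIABLE'].str.contains(r'^\d')   # True where the row starts with a digit
mask = s | s.shift(-1)                     # ...or where the next row starts with a digit
print (toy[mask])
   VARIABLE
0    Gender
1    1.Male
2  2.Female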

Related

filtering pandas dataframe when data contains two parts

I have a pandas dataframe and want to filter down to all the rows that contain a certain criteria in the “Title” column.
The rows I want to filter down to are all rows that contain the format “(Axx)” (Where xx are 2 numbers).
The data in the “Title” column doesn’t just consist of “(Axx)” data.
The data in the “Title” column looks like so:
“some_string (Axx)”
I've been playing around a bit with different methods but can't seem to get it.
I think the closest I've gotten is:
df.filter(regex=r'(D\d{2})', axis=0)
but it's not correct, as the entries aren't being filtered.
Use Series.str.contains with escaped ( and ) and $ for the end of the string, then filter by boolean indexing:
df = pd.DataFrame({'Title':['(D89)','aaa (D71)','(D5)','(D78) aa','D72']})
print (df)
       Title
0      (D89)
1  aaa (D71)
2       (D5)
3   (D78) aa
4        D72

df1 = df[df['Title'].str.contains(r'\(D\d{2}\)$')]
print (df1)
       Title
0      (D89)
1  aaa (D71)
If you need to match only (Dxx), use Series.str.match:
df2 = df[df['Title'].str.match(r'\(D\d{2}\)$')]
print (df2)
Title
0 (D89)
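If the whole string must be exactly (Dxx), Series.str.fullmatch is an alternative worth knowing (a small sketch, assuming pandas 1.1+; the trailing $ becomes unnecessary because fullmatch already anchors both ends):
import pandas as pd

df = pd.DataFrame({'Title':['(D89)','aaa (D71)','(D5)','(D78) aa','D72']})
df3 = df[df['Title'].str.fullmatch(r'\(D\d{2}\)')]
print (df3)
   Title
0  (D89)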

Get a subset of a dataframe based on the index of a label

I have a dataframe from yahoo finance
import pandas as pd
import yfinance
ticker = yfinance.Ticker("INFY.NS")
df = ticker.history(period = '1y')
print(df)
This gives me df as,
If I specify,
date = "2021-04-23"
I need a subset of df with:
the row having the index label "2021-04-23"
the rows of the 2 days before that date
the row of 1 day after that date
The important thing here is that we cannot calculate before & after using date strings, as df may be missing some dates; the rows have to be selected by position in the index (i.e. the 2 previous index rows and the next index row).
For example, in df there is no "2021-04-21", but there is "2021-04-20".
How can we implement this?
You can go for integer-based indexing. First find the integer location of the desired date and then take the desired subset with iloc:
import numpy as np

def get_subset(df, date):
    # get the integer index of the matching date(s)
    matching_dates_inds, = np.nonzero(df.index == date)
    # and take the first one (works in case of duplicates)
    first_matching_date_ind = matching_dates_inds[0]
    # take the 4-element subset (2 rows before, the row itself, 1 row after)
    desired_subset = df.iloc[first_matching_date_ind - 2: first_matching_date_ind + 2]
    return desired_subset
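A short usage sketch of the helper above on an invented frame with a DatetimeIndex (not the real yfinance data), just to show the shape of the result:
import numpy as np
import pandas as pd

# get_subset as defined above
df = pd.DataFrame({'Close': [10, 11, 12, 13, 14]},
                  index=pd.to_datetime(['2021-04-19', '2021-04-20', '2021-04-22',
                                        '2021-04-23', '2021-04-26']))
print (get_subset(df, "2021-04-23"))
            Close
2021-04-20     11
2021-04-22     12
2021-04-23     13
2021-04-26     14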
If you need the before and after values by position (assuming the date always exists in the DatetimeIndex), use DataFrame.iloc with the position from Index.get_loc, together with min and max so the selection does not fail when there are fewer than 2 values before or 1 value after, as in the sample data:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]},
                  index=pd.to_datetime(['2021-04-21','2021-04-23','2021-04-25']))
date = "2021-04-23"
pos = df.index.get_loc(date)
df = df.iloc[max(0, pos-2):min(len(df), pos+2)]
print (df)
            a
2021-04-21  1
2021-04-23  2
2021-04-25  3
Notice:
min and max are added so the selection does not fail if the date is the first row (no 2 values before it, or only 1 value before it if it is the second row) or the last row (no value after it).
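For illustration, picking the very first date still works because max(0, ...) clamps the lower bound (a small check on the same sample data):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]},
                  index=pd.to_datetime(['2021-04-21','2021-04-23','2021-04-25']))
pos = df.index.get_loc("2021-04-21")   # pos == 0, the very first row
print (df.iloc[max(0, pos-2):min(len(df), pos+2)])
            a
2021-04-21  1
2021-04-23  2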

How to add rows to a specific location in a pandas DataFrame?

I am trying to add rows where there is a gap between month_count values. For example, row 0 has month_count = 0 and row 1 has month_count = 7. How can I add 6 extra rows with month_count being 1, 2, 3, 4, 5, 6? Also, the same situation exists from row 3 to row 4: I would like to add 2 extra rows with month_count 10 and 11. What is the best way to go about this?
One way to do this would be to iterate over all of the rows and re-build the DataFrame with the missing rows inserted. Pandas does not support the direct insertion of rows at an index; however, you can hack together a solution using pd.concat():
import pandas as pd

def pandas_insert(df, idx, row_contents):
    # rows before and after the insertion point
    top = df.iloc[:idx]
    bot = df.iloc[idx:]
    # stitch the pieces together and renumber the index
    inserted = pd.concat([top, row_contents, bot], ignore_index=True)
    return inserted
Here row_contents should be a DataFrame with one (or more) rows. We use ignore_index=True to update the index of the new DataFrame to be labeled 0,1, …, n-2, n-1
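As a concrete sketch of the gap case from the question (the month_count column name is from the question; the value column is invented), the missing rows can be built as a small DataFrame and inserted at the position of the gap:
import pandas as pd

def pandas_insert(df, idx, row_contents):
    top = df.iloc[:idx]
    bot = df.iloc[idx:]
    return pd.concat([top, row_contents, bot], ignore_index=True)

df = pd.DataFrame({'month_count': [0, 7, 8, 9, 12], 'value': [5, 6, 7, 8, 9]})
# month_count 1..6 is missing between positions 0 and 1
missing = pd.DataFrame({'month_count': range(1, 7), 'value': [None] * 6})
df = pandas_insert(df, 1, missing)
print (df['month_count'].tolist())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 12]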

Finding rows with highest means in dataframe

I am trying to find the rows, in a very large dataframe, with the highest mean.
Reason: I scan something with laser trackers and used a "higher" point as a reference for where the scan starts. I am trying to find the placed object throughout my data.
I have calculated the mean of each row with:
base = df.mean(axis=1)
base.columns = ['index','Mean']
Here is an example of the mean for each row:
0 4.407498
1 4.463597
2 4.611886
3 4.710751
4 4.742491
5 4.580945
This seems to work fine, except that it adds an index column, and gives out columns with an index of type float64.
I then tried this to locate the rows with highest mean:
moy = base.loc[base.reset_index().groupby(['index'])['Mean'].idxmax()]
This gives out this:
index Mean
0 0 4.407498
1 1 4.463597
2 2 4.611886
3 3 4.710751
4 4 4.742491
5 5 4.580945
But it only re-indexes (I now have 3 columns instead of two) and does nothing else. It still shows all rows.
Here is one way without using groupby:
moy = base.sort_values('Mean').tail(1)
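If the goal is just the single row of the original frame with the highest row mean, a minimal sketch using idxmax on the row means (invented data, not the laser-tracker scan):
import pandas as pd

df = pd.DataFrame({'a': [4.1, 4.9, 4.5], 'b': [4.7, 4.6, 4.8]})
means = df.mean(axis=1)             # one mean per row
print (df.loc[[means.idxmax()]])    # row with the highest mean, kept as a DataFrame
     a    b
1  4.9  4.6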
It looks as though your data is a string or single column with a space in between your two numbers. Suggest splitting the column into two and/or using something similar to below to set the index to your specific column of interest.
import pandas as pd
df = pd.read_csv('testdata.txt', names=["Index", "Mean"], delimiter=r"\s+")
df = df.set_index("Index")
print(df)

Pandas: aggregate column based on values in a different column

Let's say I start with a dataframe that looks like this:
Group Val date
0 home first 2017-12-01
1 home second 2017-12-02
2 away first 2018-03-07
3 away second 2018-03-01
Data types are [string, string, datetime]. I would like to get a dataframe that for each group, shows me the value that was entered most recently:
Group Most recent Val Most recent date
0 home second 12-02-2017
1 away first 03-07-2018
(Data types are [string, string, datetime])
My initial thought is that I should be able to do something like this by grouping by 'group' and then aggregating the dates and vals. I know I can get the most recent datetime using the 'max' agg function, but I'm stuck on what function to use to get the corresponding val:
df.groupby('Group').agg({'val':lambda x: ____????____
'date':'max'})
Thanks,
In case I understood you right, you can do this:
df.iloc[df.groupby('Group').agg({'date': 'idxmax'}).date]
Or as a whole example:
import pandas as pd
import numpy as np
np.random.seed(42)
data = [(np.random.choice(['home', 'away'], size=1)[0],
         np.random.choice(['first', 'second'], size=1)[0],
         pd.Timestamp(np.random.rand()*1.9989e+18)) for i in range(10)]
df = pd.DataFrame.from_records(data)
df.columns = ['Group', 'Val', 'date']
df.iloc[df.groupby('Group').agg({'date': 'idxmax'}).date]
Which selects
Group Val date
5 away first 2031-06-09 06:26:43.486610432
0 home second 2030-03-22 04:07:07.082781440
from
Group Val date
0 home second 2030-03-22 04:07:07.082781440
1 home second 2007-12-03 05:07:24.061456384
2 home second 1979-11-18 23:57:26.700035456
3 home first 2024-11-12 08:18:17.789517824
4 away second 2014-11-07 13:17:55.756515328
5 away first 2031-06-09 06:26:43.486610432
6 away second 1983-06-14 13:17:28.334806208
7 away second 1981-08-14 03:21:14.746028864
8 away second 2003-03-29 11:00:31.189680256
9 away first 1988-06-12 16:58:48.341865984
First select the indices of the rows whose date value is the maximum within each group:
max_indeces = df.groupby(['Group'])['date'].idxmax()
and then select the corresponding rows in the original dataframe, perhaps keeping only the column you are actually interested in:
df.iloc[max_indeces]['Val']
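An alternative some prefer, assuming the same three columns, is to sort by date and keep the last row of each group (a sketch, not taken from the answers above):
import pandas as pd

df = pd.DataFrame({'Group': ['home', 'home', 'away', 'away'],
                   'Val': ['first', 'second', 'first', 'second'],
                   'date': pd.to_datetime(['2017-12-01', '2017-12-02',
                                           '2018-03-07', '2018-03-01'])})
most_recent = df.sort_values('date').groupby('Group').tail(1)
print (most_recent)
  Group     Val       date
1  home  second 2017-12-02
2  away   first 2018-03-07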
