Date Sub Value
10/24/2020 A 1
9/18/2020 A 2
9/21/2020 A 3
9/13/2020 A 4
9/20/2020 A 5
I want to extract the row with the latest date from the dataframe.
I was using the following code, but the output is different:
df = df.Date.max()
Output: 2020-10-24 00:00:00.
That returns only the scalar max date. The output I am looking for is:
Date Sub Value
10/24/2020 A 1
To get multiple rows matching the same max value, you can do this:
In [2679]: df[df.Date == df.Date.max()]
Out[2679]:
Date Sub Value
0 2020-10-24 A 1
Use Series.idxmax with DataFrame.loc and [[]] to get a one-row DataFrame - this returns only the first row with the maximal datetime:
df1 = df.loc[[df.Date.idxmax()]]
Or use boolean indexing, comparing against the max - this returns multiple rows if there is more than one maximal value:
df1 = df[df.Date.eq(df.Date.max())]
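One caveat worth noting: in the sample above, Date holds strings like "10/24/2020", and string comparison would rank "9/21/2020" above "10/24/2020", so convert to datetime before taking the max. A minimal runnable sketch, assuming the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['10/24/2020', '9/18/2020', '9/21/2020',
                            '9/13/2020', '9/20/2020'],
                   'Sub': ['A'] * 5,
                   'Value': [1, 2, 3, 4, 5]})

# Convert to real datetimes first; lexical string comparison would be wrong.
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')

# Keep every row that matches the latest date.
latest = df[df['Date'].eq(df['Date'].max())]
print(latest)
#         Date Sub  Value
# 0 2020-10-24   A      1
```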
Related
I need to filter rows from a dataframe that include matching pairs of strings. For example, when the instance below is filtered, only data for IDs 1 and 2 should be kept, as ID 3 does not have a corresponding '3 Month' row for its '0 Month' entry:
df = pd.DataFrame({'ID':[1,2,3,1,2,1], 'Period':['0 Month','0 Month','0 Month','3 Month','3 Month','6 Month']})
The OR operation can easily be used to filter for 2 strings, as below, but that does not drop the ID without the requisite pair.
df = df[(df["Period"].str.contains("0 Month")) | (df["Period"].str.contains("3 Month"))]
df
Therefore I'm attempting to use the AND operator to address this need, but that returns an empty dataframe (no single row contains both strings):
df = df[(df["Period"].str.contains("0 Month")) & (df["Period"].str.contains("3 Month"))]
df
You can group by "ID" and the condition, then use the transform('nunique') method to count the number of unique "Period" values and keep the rows whose group has more than 1 unique "Period" value:
out = df[df.groupby(['ID', (df["Period"].str.contains("0 Month") | df["Period"].str.contains("3 Month"))])['Period'].transform('nunique') > 1]
Note that, instead of | you can use isin:
out = df[df.groupby(['ID', df["Period"].isin(['0 Month', '3 Month'])])['Period'].transform('nunique') > 1]
or combine the strings to match inside str.contains:
out = df[df.groupby(['ID', df["Period"].str.contains('0|3')])['Period'].transform('nunique') > 1]
Output:
ID Period
0 1 0 Month
1 2 0 Month
3 1 3 Month
4 2 3 Month
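The approach above can be verified end-to-end on the sample frame; here is a minimal runnable sketch using the `isin` variant:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 1, 2, 1],
                   'Period': ['0 Month', '0 Month', '0 Month',
                              '3 Month', '3 Month', '6 Month']})

# Group by ID and by whether the row is one of the target periods;
# keep only groups containing more than one distinct Period value.
mask = (df.groupby(['ID', df['Period'].isin(['0 Month', '3 Month'])])['Period']
          .transform('nunique') > 1)
out = df[mask]
print(out)
#    ID   Period
# 0   1  0 Month
# 1   2  0 Month
# 3   1  3 Month
# 4   2  3 Month
```

ID 3 is dropped because its group has only one unique Period ('0 Month'), and ID 1's '6 Month' row is dropped because it falls in a separate (non-matching) group.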
I have a dataframe consisting of a date column and other columns.
As a sample, see below:
a = pd.DataFrame({'Date':['2021/2/21', '2021/2/20','2021/3/5','2021/5/30'],
'Number':[2,4,6,9]})
a
Date Number
0 2021/2/21 2
1 2021/2/20 4
2 2021/3/5 6
3 2021/5/30 9
a['Date'].dtypes
dtype('O')
Neither of the following got me the subset:
a = a[a['Date'] > '20/02/2021']
[x for x in a['Date'] if x > '20/02/2021' ]
how can I get the subset?
Use pd.to_datetime to standardize the date column:
a['Date'] = pd.to_datetime(a.Date)
Then compare using ge, i.e. greater than or equal:
a['Date'].ge('2021/02/21')
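Putting the two steps together as a runnable sketch (once the column is real datetimes, a year-first comparison string works regardless of the original format):

```python
import pandas as pd

a = pd.DataFrame({'Date': ['2021/2/21', '2021/2/20', '2021/3/5', '2021/5/30'],
                  'Number': [2, 4, 6, 9]})

# Standardize the string column to real datetimes.
a['Date'] = pd.to_datetime(a['Date'])

# Build a boolean mask, then subset.
subset = a[a['Date'].ge('2021-02-21')]
print(subset)
#         Date  Number
# 0 2021-02-21       2
# 2 2021-03-05       6
# 3 2021-05-30       9
```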
Here is my dataframe:
id - title - publish_up - date
1 - Exampl- 2019-12-1 - datetime
...
I created a date column by applying
df['date'] = pd.to_datetime(df['publish_up'], format='%Y-%m-%d')
I am new in python and I am trying to learn pandas.
What I would like to do is to create groups for each day of the year.
The dataframe contains data from one year span, so in theory, there should be 365 groups.
Then, I would need to get an array of ids for each group.
example:
[{date:'2019-12-1',ids:[1,2,3,4,5,6]},{date:'2019-12-2',ids:[7,8,9,10,11,12,13,14]},...]
Thank you
If you want the dates formatted as strings in the output list, then converting to datetimes is not necessary. Create a list per group with GroupBy.agg, convert it to a DataFrame with DataFrame.reset_index, and finally create the list of dicts with DataFrame.to_dict:
print (df)
id title publish_up date
0 1 Exampl 2019-12-2 datetime
1 2 Exampl 2019-12-2 datetime
2 2 Exampl 2019-12-1 datetime
#if necessary change format 2019-12-1 to 2019-12-01
#df['publish_up'] = pd.to_datetime(df['publish_up'], format='%Y-%m-%d').dt.strftime('%Y-%m-%d')
print (df.groupby('publish_up')['id'].agg(list).reset_index())
publish_up id
0 2019-12-1 [2]
1 2019-12-2 [1, 2]
a = df.groupby('publish_up')['id'].agg(list).reset_index().to_dict('records')
print (a)
[{'publish_up': '2019-12-1', 'id': [2]}, {'publish_up': '2019-12-2', 'id': [1, 2]}]
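If the output should use the exact keys from the question ('date' and 'ids'), a rename before to_dict('records') does it; a minimal sketch assuming the same sample frame:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 2],
                   'publish_up': ['2019-12-2', '2019-12-2', '2019-12-1']})

out = (df.groupby('publish_up')['id'].agg(list)
         .reset_index()
         .rename(columns={'publish_up': 'date', 'id': 'ids'})
         .to_dict('records'))
print(out)
# [{'date': '2019-12-1', 'ids': [2]}, {'date': '2019-12-2', 'ids': [1, 2]}]
```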
How do you get the last (or "nth") column in a DataFrame?
I tried several different articles such as 1 and 2.
df = pd.read_csv(csv_file)
col=df.iloc[:,0] #returns Index([], dtype='object')
col2=df.iloc[:,-1] #returns the whole dataframe
col3=df.columns[df.columns.str.startswith('c')] #returns Index([], dtype='object')
The comments after each line show what I am getting after a print. Most of the time I am getting things like "Index([], dtype='object')".
Here is what df prints:
date open high low close
0 0 2019-07-09 09:20:10 296.235 296.245 296...
1 1 2019-07-09 09:20:15 296.245 296.245 296...
2 2 2019-07-09 09:20:20 296.235 296.245 296...
3 3 2019-07-09 09:20:25 296.235 296.275 296...
df.iloc can refer to both rows and columns. If you pass only one integer, it refers to a row. You can mix indexer types for the index and columns; use : to select an entire axis.
df.iloc[:,-1:] returns the final column as a one-column DataFrame.
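The difference between the slice forms is easiest to see on a small frame: `df.iloc[:, -1]` returns a Series, while `df.iloc[:, -1:]` keeps a one-column DataFrame. A minimal sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'date': ['2019-07-09', '2019-07-10'],
                   'open': [296.235, 296.245],
                   'close': [296.300, 296.400]})

last_series = df.iloc[:, -1]   # Series: the 'close' column
last_frame = df.iloc[:, -1:]   # DataFrame containing just 'close'
nth = df.iloc[:, 1]            # Series: the second column ('open')
print(last_series.name, list(last_frame.columns), nth.name)
# close ['close'] open
```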
I am running a for loop over each of the 12 months. For each month I get a bunch of dates, in random order, spread over various years of history, along with the corresponding temperature data on those dates. E.g. when the loop is on January, all dates and temperatures I get from history are for January only.
I want to start with an empty pandas dataframe with two columns, 'Dates' and 'Temperature'. As the loop progresses, I want to add each month's dates and the corresponding data to the 'Temperature' column.
After my dataframe is ready, I want to use the 'Dates' column as the index to sort the available 'Temperature' history, so that I have the historical dates in order with their temperatures.
I have thought about using numpy, storing dates and data in two separate arrays, sorting the dates, and then sorting the temperatures with some kind of index, but I believe this will be better implemented in pandas.
@Zanam Please refer to this pattern; I think your question is similar to this answer:
from random import randint
import pandas as pd

df = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
for i in range(5):
    df.loc[i] = [randint(-1, 1) for n in range(3)]
print(df)
lib qty1 qty2
0 0 0 -1
1 -1 -1 1
2 1 -1 1
3 0 0 0
4 1 -1 -1
[5 rows x 3 columns]
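Growing a DataFrame row-by-row with df.loc[i] works, but it is usually faster to collect the rows in a list, build the frame once, and then sort by the date column - which also handles the final "sort by 'Dates'" step. A minimal sketch with hypothetical temperature data:

```python
import pandas as pd

rows = []
# Hypothetical loop; in practice the dates/temperatures come from your history source.
for date, temp in [('2001-01-15', -3.0), ('1999-01-02', 1.5), ('2000-01-20', 0.4)]:
    rows.append({'Dates': date, 'Temperature': temp})

# Build once, convert the dates, then sort and index by them.
df = pd.DataFrame(rows)
df['Dates'] = pd.to_datetime(df['Dates'])
df = df.sort_values('Dates').set_index('Dates')
print(df)
```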