Using Groupby within a for loop - python

I have the following DataFrame
If df['Time'] and df['OrderID'] are the same, and df['MessageType'] is 'D' followed by 'A', then remove the row that contains 'D' and rename the value 'A' to 'AMEND'. Here's my code:
import pandas as pd
import numpy as np
Instrument = df['Symbol']
Date = df['Date']
Time = df['Time']
RecordType = df['MessageType']
Price = df['Price']
Volume = df['Quantity']
Qualifiers = df['ExchangeOrderType']
OrderID = df['OrderID']
MatchID = df['MatchID']
Side = df['Side']
for i in range(len(Time)-1):
    if (Time[i] == Time[i+1]) & (RecordType[i] == "D") & (RecordType[i+1] == "A"):
        del Instrument[i]
        del Date[i]
        del Time[i]
        del RecordType[i]
        del Price[i]
        del Volume[i]
        del Qualifiers[i]
        del OrderID[i]
        del Side[i]
        RecordType[i+1] = "AMEND"  # rename the message type
# creating a new dataframe with updated lists
new_df = pd.DataFrame({'Instrument':Instrument, 'Date':Date, 'Time':Time, 'RecordType':RecordType, 'Price':Price, 'Volume':Volume, 'Qualifiers':Qualifiers, 'OrderID':OrderID, 'MatchID':MatchID, 'Side':Side}).reset_index(drop=True)
new_df['RecordType']=np.where(new_df['RecordType'] =='O', 'CONTROL', new_df['RecordType'])
new_df['RecordType']=np.where(new_df['RecordType'] =='A', 'ENTER', new_df['RecordType'])
new_df['RecordType']=np.where(new_df['RecordType'] =='D', 'DELETE', new_df['RecordType'])
However, I have many different Symbol and Date values and wish to use groupby in the for loop. I tried
grouped = df.groupby(['Symbol', 'Date']) and replaced df with grouped, but it didn't work. I also realize that my code is index-sensitive, i.e., it must start at index zero for the for loop to work, and I'm not sure whether groupby will cause index problems.
Please help.
Thank you.

A good solution is to use np.where() for the conditions you have mentioned and .shift(-1) to compare to the next row. You can add more conditions (e.g. a condition for the df['Symbol'] column).
import pandas as pd, numpy as np
df = pd.DataFrame({'Symbol': ['A2M', 'A2M', 'A2M'],
                   'Time': ['14:00:02 678544300', '07:00:02 678544300', '07:00:02 678544300'],
                   'MessageType': ['D', 'D', 'A'],
                   'OrderID': ['72222771064878939976', '72222771064878939976', '72222771064878939976'],
                   'Date': ['2020-01-02', '2020-01-02', '2020-01-02']})

df['MessageType'] = np.where((df['MessageType'] == 'D') & (df['MessageType'].shift(-1) == 'A') &
                             (df['Date'] == df['Date'].shift(-1)) & (df['Time'] == df['Time'].shift(-1)) &
                             (df['Symbol'] == df['Symbol'].shift(-1)) &
                             (df['OrderID'] == df['OrderID'].shift(-1)), 'AMEND', df['MessageType'])
df
Output:
Symbol Time MessageType OrderID Date
0 A2M 14:00:02 678544300 D 72222771064878939976 2020-01-02
1 A2M 07:00:02 678544300 AMEND 72222771064878939976 2020-01-02
2 A2M 07:00:02 678544300 A 72222771064878939976 2020-01-02
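Note that the question as written asks to drop the 'D' row and rename the following 'A' row, rather than renaming the 'D' row itself. A hedged sketch of that variant, using the same sample data: flag each 'D' row whose successor is a matching 'A', rename the successor with a forward shift, then drop the flagged rows.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Symbol': ['A2M', 'A2M', 'A2M'],
                   'Time': ['14:00:02 678544300', '07:00:02 678544300', '07:00:02 678544300'],
                   'MessageType': ['D', 'D', 'A'],
                   'OrderID': ['72222771064878939976'] * 3,
                   'Date': ['2020-01-02'] * 3})

# Flag 'D' rows whose following row is an 'A' with the same Time and OrderID
same_key = (df['Time'] == df['Time'].shift(-1)) & (df['OrderID'] == df['OrderID'].shift(-1))
is_d_before_a = (df['MessageType'] == 'D') & (df['MessageType'].shift(-1) == 'A') & same_key

# Rename the 'A' row that follows a flagged 'D', then drop the flagged 'D' rows
df.loc[is_d_before_a.shift(1, fill_value=False), 'MessageType'] = 'AMEND'
df = df[~is_d_before_a].reset_index(drop=True)
```

Because the Symbol/Date/OrderID comparisons are part of the mask, no groupby is needed here either.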
For all your future posts, please consider this post: How to make good reproducible pandas examples
You should not include an image. As you can see, I was forced to create a sample dataframe. You can simply copy and paste the data into your question (and should do that) and then format it, or you can do df.to_dict() and copy and paste that into your Stack Overflow question. See the link.

Related

Pandas include single row in df after filtering with .loc

So, in this function:
def filter_by_freq(df, frequency):
    filtered_df = df.copy()
    if frequency.upper() == 'DAY':
        pass
    else:
        date_obj = filtered_df['Date'].values[0]
        target_day = pd.to_datetime(date_obj).day
        target_month = pd.to_datetime(date_obj).month
        final_date_obj = filtered_df['Date'].values[-1]
        if frequency.upper() == 'MONTH':
            filtered_df = filtered_df.loc[filtered_df['Date'].dt.day.eq(target_day)]
        elif frequency.upper() == 'YEAR':
            filtered_df = filtered_df.loc[filtered_df['Date'].dt.day.eq(target_day)]
            filtered_df = filtered_df.loc[filtered_df['Date'].dt.month.eq(target_month)]
    return filtered_df
How can I also include in the .loc the very last row from the original df? Tried doing (for month frequency): filtered_df = filtered_df.loc[(filtered_df['Date'].dt.day.eq(target_day)) | (filtered_df['Date'].dt.date.eq(final_date_obj))] but didn't work.
Thanks for your time!
Here's one way you could do it. In this example I have a df and want to filter out all rows with c1 > 0.5, but keep the last row no matter what. I create a boolean Series called lte_half for the first condition, and another boolean array called end_ind which is True only for the last row (a Series, list, or array all work here). The filtered table keeps every row that passes either condition, combined with |:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({'c1':np.random.rand(20)})
lte_half = df['c1'].le(0.5)
end_ind = df.index == df.index[-1]
filt_df = df[lte_half | end_ind]
print(filt_df)
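Plugged into the month-frequency branch from the question, the same pattern might look like this (the sample dates are made up; inside the real function the frame would be filtered_df and the cutoff final_date_obj):

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(
    ['2021-01-15', '2021-02-15', '2021-02-28', '2021-03-15', '2021-03-31'])})

target_day = df['Date'].iloc[0].day           # day-of-month to match, as in the question
day_match = df['Date'].dt.day.eq(target_day)  # rows on the target day
last_row = df.index == df.index[-1]           # True only for the final row

filt_df = df[day_match | last_row]            # keep matching rows plus the last row
```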

Applying a conditional statement on one column to achieve a result in another column

I am trying to write a script that will add 4 months or 8 months to the column titled "Date" depending on the column titled "Quarterly_Call__c". For instance, if the value in Quarterly_Call__c = 2 then add 4 months to the "Date" column and if the value is 3, add 8 months. Finally, I want the output in the column titled "New Date".
So far I am able to add the number of months I want using this piece of code:
from datetime import date
from dateutil.relativedelta import relativedelta

new_date = []
df['Date'] = df['Date'].dt.normalize()
for value in df['Date']:
    new_date.append(value + relativedelta(months=+4))
df['New Date'] = new_date
However, as I mentioned, I would like this to work depending on the value in Quarterly_Call__c, so I tried writing this code:
for i in df['Quarterly_Call__c'].astype(int).to_list():
    if i == 2:
        for value in df['Date']:
            new_date.append(value + relativedelta(months=+4))
    elif i == 3:
        for value in df['Date']:
            new_date.append(value + relativedelta(months=+8))
Unfortunately, this does not work. Could you please recommend a solution? Thanks!
Applying a lambda expression to each of the rows of your DataFrame seems to be the most convenient approach:
from dateutil.relativedelta import relativedelta

def date_calc(q, d):
    if q == 2:
        return d + relativedelta(months=+4)
    else:
        return d + relativedelta(months=+8)

df['New Date'] = df.apply(lambda x: date_calc(x['Quarterly_Call__c'], x['Date']), axis=1)
The date_calc function holds the same logic you posted in your question while taking the inputs as arguments, and the apply method of the DataFrame is used to calculate the 'New Date' column for each row where the variable x of the lambda expression represents a row of the DataFrame.
Keep in mind that setting the axis argument to 1 is what makes sure the function is applied to each row of the DataFrame rather than each column. More info about the apply method can be found here.
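If the frame is large, the same logic can be vectorized with np.where and pd.DateOffset, avoiding the per-row apply. This sketch assumes Quarterly_Call__c only takes the values 2 and 3, as stated in the question:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Quarterly_Call__c': [2, 3, 2, 3],
    'Date': pd.to_datetime(['2021-02-25', '2021-03-25', '2021-04-25', '2021-05-25'])
})

# Pick the offset per row: +4 months when the flag is 2, otherwise +8
df['New Date'] = np.where(df['Quarterly_Call__c'].eq(2),
                          df['Date'] + pd.DateOffset(months=4),
                          df['Date'] + pd.DateOffset(months=8))
```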
You could iterate through the dataframe row by row to access each row's data and calculate the new date.
import pandas as pd
from dateutil.relativedelta import relativedelta

df = pd.DataFrame({
    'Quarterly_Call__c': [2, 3, 2, 3],
    'Date': ['2021-02-25', '2021-03-25', '2021-04-25', '2021-05-25']
})
df['Date'] = pd.to_datetime(df['Date'])
df['New Date'] = ''  # new empty column
for i in range(len(df)):
    if df.loc[i, 'Quarterly_Call__c'] == 2:
        df.loc[i, 'New Date'] = df.loc[i, 'Date'] + relativedelta(months=+4)
    if df.loc[i, 'Quarterly_Call__c'] == 3:
        df.loc[i, 'New Date'] = df.loc[i, 'Date'] + relativedelta(months=+8)
# the loop leaves Timestamps in an object column, so convert before using .dt
df['New Date'] = pd.to_datetime(df['New Date']).dt.normalize()
Output
Quarterly_Call__c Date New Date
0 2 2021-02-25 2021-06-25
1 3 2021-03-25 2021-11-25
2 2 2021-04-25 2021-08-25
3 3 2021-05-25 2022-01-25
You can try lambda functions on your dataframe. For example:
import pandas as pd
numbers = {'set_of_numbers': [1,2,3,4,5,6,7,8,9,10]}
df = pd.DataFrame(numbers,columns=['set_of_numbers'])
df['equal_or_lower_than_4?'] = df['set_of_numbers'].apply(lambda x: 'True' if x <= 4 else 'False')
print (df)
You can check this link [1] for more information on how to apply if conditions on Pandas DataFrame.

how can I select rows in a Dataframe according to specified conditions in several columns?

I'm new to pandas and am trying to get rows that match conditions on two columns.
Here is my code:
import pandas as pd
df = pd.read_csv('sp500.csv')
full_list = []
symbol = df['Symbol']
full_list.append(symbol)
name = df['Name']
full_list.append(name)
sector = df['Sector']
full_list.append(sector)
price = df['Price']
full_list.append(price)
book_value = df['Book Value']
full_list.append(book_value)
low = df['52 week low']
full_list.append(low)
high = df['52 week high']
full_list.append(high)
df = pd.DataFrame(full_list)
df = df.T
print(df.loc[df['Sector'].isin(['Financials','Energy']) and (df['52 week low'] < 80)])
I can't find the correct command in the documentation, and the problem is in the last line of code. Please help me to understand how it works
You're quite close. You need to use bitwise operators and take care with the unintuitive operator precedence:
df.loc[
    df['Sector'].isin(['Financials','Energy']) &  # not "and"
    (df['52 week low'] < 80)  # these parentheses are crucial
]
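To see why the parentheses matter, a small toy frame (made-up values) reproduces the precedence issue:

```python
import pandas as pd

df = pd.DataFrame({'Sector': ['Financials', 'Energy', 'Tech'],
                   '52 week low': [50, 90, 40]})

# & binds tighter than <, so without parentheses
# df['Sector'].isin([...]) & df['52 week low'] < 80 would be evaluated as
# (mask & df['52 week low']) < 80 -- not what you want.
mask = df['Sector'].isin(['Financials', 'Energy']) & (df['52 week low'] < 80)
print(df.loc[mask])  # only the Financials row passes both conditions
```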
Side note: without seeing the text file you're working from, I can't help but think you'd be better off selecting your columns directly instead of rebuilding your dataframe:
import pandas as pd

cols_to_keep = ['Symbol', 'Name', 'Sector', 'Price', 'Book Value', '52 week low', '52 week high']
rows_to_keep = lambda df: df['Sector'].isin(['Financials','Energy']) & (df['52 week low'] < 80)

df = (
    pd.read_csv('sp500.csv')
    .loc[rows_to_keep, cols_to_keep]
)
So close!
df.loc[df['Sector'].isin(['Financials','Energy']) & (df['52 week low'] < 80)]

Python Pandas filtering dataframe on date

I am trying to manipulate a CSV file based on a certain date in a certain column.
I am using pandas (total noob) for that and was pretty successful until I got to dates.
The CSV looks something like this (with more columns and rows of course).
These are the columns:
Circuit
Status
Effective Date
These are the values:
XXXX001
Operational
31-DEC-2007
I tried DataFrame.query (which I use for everything else) without success.
I tried DataFrame.loc (which worked for everything else) without success.
How can I get all rows that are older or newer than a given date? If I have other conditions to filter the dataframe, how do I combine them with the date filter?
Here's my "raw" code:
import pandas as pd
# parse_dates = ['Effective Date']
# dtypes = {'Effective Date': 'str'}
df = pd.read_csv("example.csv", dtype=object)
# , parse_dates=parse_dates, infer_datetime_format=True
# tried lot of suggestions found on SO
cols = df.columns
cols = cols.map(lambda x: x.replace(' ', '_'))
df.columns = cols
status1 = 'Suppressed'
status2 = 'Order Aborted'
pool = '2'
region = 'EU'
date1 = '31-DEC-2017'
filt_df = df.query('Status != #status1 and Status != #status2 and Pool == #pool and Region_A == #region')
filt_df.reset_index(drop=True, inplace=True)
filt_df.to_csv('filtered.csv')
# this is working pretty well
supp_df = df.query('Status == #status1 and Effective_Date < #date1')
supp_df.reset_index(drop=True, inplace=True)
supp_df.to_csv('supp.csv')
# this is what is not working at all
I tried many approaches, but I was not able to put it together. This is just one of the many approaches I tried, so I know it is perhaps completely wrong, as no date parsing is used.
supp.csv is saved, but the dates present are all over the place, so there's no match with the "logic" in this code.
Thanks for any help!
Make sure you convert your date column to datetime, and then filter on it.
df['Effective Date'] = pd.to_datetime(df['Effective Date'])
df[df['Effective Date'] < '2017-12-31']
# This returns all the rows with dates before the 31st of December, 2017.
# You can also use query.
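Since the values look like 31-DEC-2007, passing an explicit format to pd.to_datetime is safer than relying on inference. A minimal sketch with made-up rows, combining the date cutoff with a status condition as in the question:

```python
import pandas as pd

df = pd.DataFrame({'Circuit': ['XXXX001', 'XXXX002'],
                   'Status': ['Operational', 'Suppressed'],
                   'Effective Date': ['31-DEC-2007', '15-JAN-2017']})

# An explicit format avoids misparsing the DD-MON-YYYY strings
df['Effective Date'] = pd.to_datetime(df['Effective Date'], format='%d-%b-%Y')

# Combine a status condition with the date cutoff
date1 = pd.Timestamp('2017-12-31')
supp_df = df[(df['Status'] == 'Suppressed') & (df['Effective Date'] < date1)]
```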

How to update dataframe value

I have a project where for each row in a table I need to iterate through rows from another table and update values in both. The changes need to stick for the next iteration. What is the best way to do that?
for invoice_line in invoices.itertuples():
    qty = invoice_line.SHIP_QTY
    for receipt_line in receipts[receipts.SKU == invoice_line.SKU].itertuples():
        if qty > receipt_line.REC_QTY:
            receipts.set_value(receipt_line.index, 'REC_QTY', 0)
            qty = qty - receipt_line.REC_QTY
        else:
            receipts.set_value(receipt_line.index, 'REC_QTY', receipt_line.REC_QTY - qty)
            qty = 0
        recd = receipt_line.REC_DATE
        if qty < 1: break
    invoices.set_value(invoice_line.index, 'REC_DATE', recd)
set_value does not seem to work.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(50, 4), columns=list('ABCD'))
for row in df.itertuples():
    df.set_value(row.index, 'test', row.D)
print(df.head())
I think what you want is a capitalized Index (row.index is the tuple method, not the row label):
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(50, 4), columns=list('ABCD'))
for row in df.itertuples():
    df.set_value(row.Index, 'test', row.D)
print(df.head())
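Note that set_value was deprecated in pandas 0.21 and removed in 1.0, so on current pandas the same row-wise write uses .at (or .loc) with the capitalized Index attribute:

```python
import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame(np.random.randn(50, 4), columns=list('ABCD'))

for row in df.itertuples():
    # row.Index is the row label; .at is the modern scalar setter
    df.at[row.Index, 'test'] = row.D

print(df.head())
```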
Not 100% sure if this is what you want, but I think you're trying to loop through a list and update the value of a cell in a dataframe. The syntax for that is:
for ix in df.index:
    df.loc[ix, 'Test'] = 'My New Value'
where ix is the row label and 'Test' is the column name that you want to update. If you need to add more logic, you could try something like:
for ix in df.index:
    row = df.loc[ix]
    if row.myVariable < 100:
        df.loc[ix, 'SomeColumn'] = 'Less than a hundred'
    else:
        df.loc[ix, 'SomeColumn'] = 'a hundred or more'
