I have a project where for each row in a table I need to iterate through rows from another table and update values in both. The changes need to stick for the next iteration. What is the best way to do that?
for invoice_line in invoices.itertuples():
    qty = invoice_line.SHIP_QTY
    for receipt_line in receipts[receipts.SKU == invoice_line.SKU].itertuples():
        if qty > receipt_line.REC_QTY:
            receipts.set_value(receipt_line.index,'REC_QTY',0)
            qty = qty - receipt_line.REC_QTY
        else:
            receipts.set_value(receipt_line.index,'REC_QTY', receipt_line.REC_QTY - qty)
            qty = 0
        recd = receipt_line.REC_DATE
        if qty < 1:break
    invoices.set_value(invoice_line.index,'REC_DATE',recd)
set_value does not seem to work.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(50, 4), columns=list('ABCD'))
for row in df.itertuples():
    df.set_value(row.index,'test',row.D)
print(df.head())
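I think what you want is a capitalized Index: itertuples() exposes the row label as row.Index, whereas row.index is the tuple's built-in index() method. Note that set_value was deprecated in pandas 0.21 and removed in 1.0, so on current versions use df.at or df.loc instead.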
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(50, 4), columns=list('ABCD'))
for row in df.itertuples():
    # row.Index is the row label; .loc stands in for set_value, which was removed in pandas 1.0
    df.loc[row.Index, 'test'] = row.D
print(df.head())
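Applied to the original invoice/receipt loop, the same fix might look like the sketch below. The sample data is made up for illustration, and skipping receipts that are already used up is an assumption about the intent. The inner loop walks index labels and re-reads each quantity from receipts, so updates made for one invoice are visible to the next.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
invoices = pd.DataFrame({"SKU": ["A", "B"], "SHIP_QTY": [5, 2]})
receipts = pd.DataFrame({
    "SKU": ["A", "A", "B"],
    "REC_QTY": [3, 4, 10],
    "REC_DATE": ["2020-01-01", "2020-01-05", "2020-01-03"],
})
invoices["REC_DATE"] = None

for invoice_line in invoices.itertuples():
    qty = invoice_line.SHIP_QTY
    recd = None
    # Iterate over labels and read the live value, not a stale tuple snapshot.
    for idx in receipts.index[receipts["SKU"] == invoice_line.SKU]:
        available = receipts.at[idx, "REC_QTY"]
        if available == 0:
            continue  # assumption: receipts that are used up are skipped
        used = min(qty, available)
        receipts.at[idx, "REC_QTY"] = available - used
        qty -= used
        recd = receipts.at[idx, "REC_DATE"]
        if qty < 1:
            break
    invoices.at[invoice_line.Index, "REC_DATE"] = recd

print(receipts)
print(invoices)
```

Writing through .at on the original frame (not on the filtered copy) is what makes the changes stick for the next iteration.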
Not 100% sure if this is what you want, but I think you're trying to loop through the rows and update the value of a cell in a dataframe. The syntax for that is:
for ix in df.index:
    df.loc[ix, 'Test'] = 'My New Value'
where ix is the row label (an index value, not a position) and 'Test' is the name of the column you want to update. If you need more logic, you could try something like:
for ix in df.index:
    row = df.loc[ix]
    if row.myVariable < 100:
        df.loc[ix, 'SomeColumn'] = 'Less than a hundred'
    else:
        df.loc[ix, 'SomeColumn'] = 'A hundred or more'
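When the new value depends only on the row itself, as in the second loop above, the same result can usually be had without a loop. A minimal sketch with np.where, using made-up sample values for myVariable:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names follow the loop above.
df = pd.DataFrame({"myVariable": [5, 150, 99, 100]})

# One vectorized pass: pick a label per row based on the condition.
df["SomeColumn"] = np.where(
    df["myVariable"] < 100, "Less than a hundred", "A hundred or more"
)
print(df)
```

This does not help when each step depends on changes made in an earlier step, as in the invoice/receipt case, where an explicit loop is still needed.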
So, in this function:
def filter_by_freq(df, frequency):
    filtered_df = df.copy()
    if frequency.upper() == 'DAY':
        pass
    else:
        date_obj = filtered_df['Date'].values[0]
        target_day = pd.to_datetime(date_obj).day
        target_month = pd.to_datetime(date_obj).month
        final_date_obj = filtered_df['Date'].values[-1]
        if frequency.upper() == 'MONTH':
            filtered_df = filtered_df.loc[filtered_df['Date'].dt.day.eq(target_day)]
        elif frequency.upper() == 'YEAR':
            filtered_df = filtered_df.loc[filtered_df['Date'].dt.day.eq(target_day)]
            filtered_df = filtered_df.loc[filtered_df['Date'].dt.month.eq(target_month)]
    return filtered_df
How can I also include in the .loc the very last row from the original df? I tried this for the month frequency: filtered_df = filtered_df.loc[(filtered_df['Date'].dt.day.eq(target_day)) | (filtered_df['Date'].dt.date.eq(final_date_obj))] but it didn't work.
Thanks for your time!
Here's one way you could do it. In this example I have a df and I want to filter out all rows that have c1 > 0.5, but keep the last row no matter what. I create a boolean Series called lte_half to track the first condition, and a boolean array called end_ind that is True only for the last row (a Series, list, or array all work as a mask here). The filtered table is built by keeping every row that passes either condition, combined with the | operator.
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({'c1':np.random.rand(20)})
lte_half = df['c1'].le(0.5)
end_ind = df.index == df.index[-1]
filt_df = df[lte_half | end_ind]
print(filt_df)
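Carried over to filter_by_freq from the question, the idea might look like the sketch below. It assumes the Date column already has a datetime dtype and uses made-up daily data. The attempt in the question probably failed because .dt.date yields Python date objects while .values[-1] is a numpy.datetime64, so the equality never matched. Comparing index labels avoids that.

```python
import pandas as pd

def filter_by_freq(df, frequency):
    """Keep rows matching the first row's day (and month), plus the last row."""
    freq = frequency.upper()
    if freq not in ("MONTH", "YEAR"):
        return df.copy()  # 'DAY' and anything else: no filtering, as in the question
    first = df["Date"].iloc[0]
    keep = df["Date"].dt.day.eq(first.day)
    if freq == "YEAR":
        keep &= df["Date"].dt.month.eq(first.month)
    is_last = df.index == df.index[-1]  # True only for the final row
    return df[keep | is_last].copy()

# Hypothetical sample: one row per day from 15 Jan to 20 Apr 2021.
dates = pd.date_range("2021-01-15", "2021-04-20", freq="D")
df = pd.DataFrame({"Date": dates, "value": range(len(dates))})

monthly = filter_by_freq(df, "month")
yearly = filter_by_freq(df, "year")
print(monthly)
```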
How can I reproduce in Python this calculation that I did in Excel?
I want to take the "Acumulado" column, multiply it by the value in the next row of the 'Selic por dia' column, store the result in that row, and repeat the same step for every following row.
import pandas as pd
# Creating the dataframe
df = pd.DataFrame({"Data": ['06/03/2006','07/03/2006','08/03/2006','09/03/2006','10/03/2006','13/03/2006','14/03/2006','15/03/2006','16/03/2006','17/03/2006'],
                   "Taxa SELIC": [17.29,17.29,17.29,16.54,16.54,16.54,16.54,16.54,16.54,16.54]})
df['Taxa Selic %'] = df['Taxa SELIC'] / 100
# Daily factor from the annual rate (252 business days), using the fractional rate
df['Selic por dia'] = (1 + df['Taxa Selic %'])**(1/252)
[Images omitted: the example DataFrame, the Excel calculation, and the desired result.]
Not an efficient method, but you can try this:
import numpy as np

selic_per_dia = list(df['Selic por dia'].values)
# First accumulated value: the initial amount times the first daily factor
accumulado = [1000000*selic_per_dia[0]]
for i, value in enumerate(selic_per_dia):
    if i == 0:
        continue
    else:
        accumulado.append(accumulado[i-1]*value)
df['Acumulado'] = accumulado
# Prepend a row holding the initial amount, then restore a 0-based index
df.loc[-1] = [np.nan, np.nan, np.nan, np.nan, 1000000]
df.index = df.index + 1
df = df.sort_index()
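Since every accumulated value is just the previous one times the current daily factor, the loop can probably be replaced by a cumulative product. A minimal sketch, assuming the daily factor is computed from the fractional rate and the starting amount is 1,000,000 as in the answer above (the shortened sample is for illustration only):

```python
import pandas as pd

# Shortened sample of the rates from the question.
df = pd.DataFrame({"Taxa SELIC": [17.29, 17.29, 17.29, 16.54, 16.54]})
df["Taxa Selic %"] = df["Taxa SELIC"] / 100
df["Selic por dia"] = (1 + df["Taxa Selic %"]) ** (1 / 252)

initial = 1_000_000  # starting amount, as used in the answer above
# cumprod() multiplies the daily factors together row by row.
df["Acumulado"] = initial * df["Selic por dia"].cumprod()
print(df)
```

The extra first row holding the initial amount can still be prepended afterwards, as in the loop version.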
Assuming I have the following multiindex DF
import pandas as pd
import numpy as np
from random import randint
input_id = np.array(['12345'])
docType = np.array(['pre','pub','app','dw'])
docId = np.array(['34455667'])
sec_type = np.array(['bib','abs','cl','de'])
sec_ids = np.array(['x-y','z-k'])
index = pd.MultiIndex.from_product([input_id,docType,docId,sec_type,sec_ids])
content= [str(randint(1,10))+ '##' + str(randint(1,10)) for i in range(len(index))]
df = pd.DataFrame(content, index=index, columns=['content'])
df.rename_axis(index=['input_id','docType','docId','secType','sec_ids'], inplace=True)
df
I know that I can query a multiindex DF as follows:
# querying a multiindex DF
idx = pd.IndexSlice
df.loc[idx[:,['pub','pre'],:,'de',:]]
Basically, with the help of pd.IndexSlice I can pass the values I want for each index level. In the above case I want the resulting DF where the second level is 'pub' OR 'pre' and the fourth is 'de'.
I am looking for a way to pass a range of values to the query, something like the third level being between 34567 and 45657. Assume those are integers.
pseudocode: df.loc[idx[:,['pub','pre'],XXXXX,'de',:]]
XXXX = ?
EDIT 1:
The docId index level is of text type, so it is probably necessary to convert it to int first.
Turns out query is very powerful:
df.query('docType in ["pub","pre"] and ("34455667" <= docId <= "3445568") and (secType=="de")')
Output:
                                          content
input_id docType docId    secType sec_ids
12345    pre     34455667 de      x-y        2##9
                                  z-k        6##1
         pub     34455667 de      x-y        6##5
                                  z-k        9##8
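Note that the query above compares docId as text, so the bounds are ordered lexicographically. If a numeric range is wanted, one option is to build a boolean mask from the index levels and convert docId to int on the fly. The sketch below is one way to do that. The second docId, the row contents, and the bounds are made up for illustration.

```python
import pandas as pd

# Same index layout as the question, with a second (hypothetical) docId.
index = pd.MultiIndex.from_product(
    [["12345"], ["pre", "pub", "app", "dw"], ["34455667", "50000000"],
     ["bib", "abs", "cl", "de"], ["x-y", "z-k"]],
    names=["input_id", "docType", "docId", "secType", "sec_ids"],
)
df = pd.DataFrame({"content": [f"row{i}" for i in range(len(index))]}, index=index)

# Convert the text level to integers only for the comparison.
doc_id = df.index.get_level_values("docId").astype(int)
low, high = 34_000_000, 35_000_000  # hypothetical bounds

mask = (
    df.index.get_level_values("docType").isin(["pub", "pre"])
    & (doc_id >= low)
    & (doc_id <= high)
    & (df.index.get_level_values("secType") == "de")
)
selected = df[mask]
print(selected)
```

The index itself stays untouched, so the docId level remains text in the result.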
I have the following DataFrame
If df['Time'] and df['OrderID'] are the same, and df['MessageType'] is 'D' followed by 'A', then remove the row that contains 'D' and rename the value 'A' to 'AMEND'. Here's my code:
import numpy as np
import pandas as pd

Instrument = df['Symbol']
Date = df['Date']
Time = df['Time']
RecordType = df['MessageType']
Price = df['Price']
Volume = df['Quantity']
Qualifiers = df['ExchangeOrderType']
OrderID = df['OrderID']
MatchID = df['MatchID']
Side = df['Side']
for i in range(len(Time)-1):
    if((Time[i] == Time[i+1]) & (RecordType[i] == "D") & (RecordType[i+1] == "A")):
        del Instrument[i]
        del Date[i]
        del Time[i]
        del RecordType[i]
        del Price[i]
        del Volume[i]
        del Qualifiers[i]
        del OrderID[i]
        del MatchID[i]
        del Side[i]
        RecordType[i+1] = "AMEND"  # rename the message type
# creating a new dataframe with updated lists
new_df = pd.DataFrame({'Instrument':Instrument, 'Date':Date, 'Time':Time, 'RecordType':RecordType, 'Price':Price, 'Volume':Volume, 'Qualifiers':Qualifiers, 'OrderID':OrderID, 'MatchID':MatchID, 'Side':Side}).reset_index(drop=True)
new_df['RecordType']=np.where(new_df['RecordType'] =='O', 'CONTROL', new_df['RecordType'])
new_df['RecordType']=np.where(new_df['RecordType'] =='A', 'ENTER', new_df['RecordType'])
new_df['RecordType']=np.where(new_df['RecordType'] =='D', 'DELETE', new_df['RecordType'])
However, I have many different Symbol and Date and wish to use groupby in the for loop. I tried
grouped = df.groupby(['Symbol', 'Date']) and replaced df with grouped but it didn't work. Also, I realize that my code is index sensitive, i.e., it must start with index zero for the for loop to work. I'm not sure if groupby will cause index problem or not.
Please help.
Thank you.
A good solution is to use np.where() for the conditions you have mentioned and .shift(-1) to compare to the next row. You can add more conditions (e.g. a condition for the df['Symbol'] column).
import pandas as pd, numpy as np
df = pd.DataFrame({'Symbol': ['A2M', 'A2M', 'A2M'],
                   'Time' : ['14:00:02 678544300', '07:00:02 678544300', '07:00:02 678544300'],
                   'MessageType' : ['D', 'D', 'A'],
                   'OrderID' : ['72222771064878939976', '72222771064878939976', '72222771064878939976'],
                   'Date' : ['2020-01-02', '2020-01-02', '2020-01-02']})
df['MessageType'] = np.where((df['MessageType'] == 'D') & (df['MessageType'].shift(-1) == 'A') &
                             (df['Date'] == df['Date'].shift(-1)) & (df['Time'] == df['Time'].shift(-1)) &
                             (df['Symbol'] == df['Symbol'].shift(-1)) &
                             (df['OrderID'] == df['OrderID'].shift(-1)), 'AMEND', df['MessageType'])
df
Output:
  Symbol                Time MessageType               OrderID        Date
0    A2M  14:00:02 678544300           D  72222771064878939976  2020-01-02
1    A2M  07:00:02 678544300       AMEND  72222771064878939976  2020-01-02
2    A2M  07:00:02 678544300           A  72222771064878939976  2020-01-02
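The output above marks the 'D' row as AMEND and leaves the 'A' row in place. To get what the question describes (drop the 'D' row and rename the following 'A'), and to keep the comparison inside each Symbol/Date group, something like the sketch below may work. The helper pair_mask and the extra BHP rows are hypothetical and only there for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Symbol": ["A2M", "A2M", "A2M", "BHP", "BHP"],
    "Time": ["14:00:02 678544300", "07:00:02 678544300", "07:00:02 678544300",
             "09:00:00 000000001", "09:00:00 000000001"],
    "MessageType": ["D", "D", "A", "D", "A"],
    "OrderID": ["72222771064878939976"] * 3 + ["1", "2"],
    "Date": ["2020-01-02"] * 5,
})
keys = ["Symbol", "Date"]

def pair_mask(frame, this_type, other_type, periods):
    """True where a row has this_type and the row `periods` away in the
    same group has other_type with the same Time and OrderID."""
    grp = frame.groupby(keys, sort=False)
    return (
        frame["MessageType"].eq(this_type)
        & grp["MessageType"].shift(periods).eq(other_type)
        & frame["Time"].eq(grp["Time"].shift(periods))
        & frame["OrderID"].eq(grp["OrderID"].shift(periods))
    )

drop_d = pair_mask(df, "D", "A", -1)   # 'D' immediately followed by a matching 'A'
rename_a = pair_mask(df, "A", "D", 1)  # 'A' immediately preceded by a matching 'D'

result = df.copy()
result.loc[rename_a, "MessageType"] = "AMEND"
result = result.loc[~drop_d].reset_index(drop=True)
# Same relabelling as the np.where calls in the question.
result["MessageType"] = result["MessageType"].replace(
    {"O": "CONTROL", "A": "ENTER", "D": "DELETE"}
)
print(result)
```

Because the shifts are taken per group, the first row of one Symbol/Date group is never compared with the last row of another, and the original index does not need to start at zero.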
For all your future posts, please consider this post: How to make good reproducible pandas examples
You should not include an image. As you can see, I was forced to create a sample dataframe. You can simply copy and paste the data into your question (and should do that) and then format it, or you can run df.to_dict() and copy and paste the result into your Stack Overflow question. See the link.
I have the following example of my dataframe:
df = pd.DataFrame({'first_date': ['01-07-2017', '01-07-2017', '01-08-2017'],
'end_date': ['01-08-2017', '01-08-2017', '15-08-2017'],
'second_date': ['01-09-2017', '01-08-2017', '15-07-2017'],
'cust_num': [1, 2, 1],
'Title': ['philips', 'samsung', 'philips']})
A row should be flagged when another row in the dataframe satisfies all of the following:
The cust_num is equal in both rows
The Title is equal in both rows
The second_date of the row is <= the end_date of the other row
If all these requirements are met, the value True should be written to a new column in the original row.
Because I'm working with a big dataset I'm looking for an efficient way to do this.
In this case only the first record should get a true value.
I have looked at apply with a lambda and at groupby, but couldn't find a way to make them work.
Try this (spontaneously I cannot come up with a faster method):
import pandas as pd
import numpy as np

df["second_date"] = pd.to_datetime(df["second_date"], format='%d-%m-%Y')
df["end_date"] = pd.to_datetime(df["end_date"], format='%d-%m-%Y')
df["new col"] = False
for cust in set(df["cust_num"]):
    indices = df.index[df["cust_num"] == cust].tolist()
    if len(indices) > 1:
        sub_df = df.loc[indices]
        for title in set(sub_df["Title"]):
            indices_title = sub_df.index[sub_df["Title"] == title]
            if len(indices_title) > 1:
                for i in indices_title:
                    # Note: this compares the two dates within the same row
                    if sub_df.loc[i, "second_date"] <= sub_df.loc[i, "end_date"]:
                        # Flag only the first row that meets the date condition
                        df.loc[i, "new col"] = True
                        break
First you need to make the date columns comparable with each other by casting them to datetime. Then create the additional column you want.
Now build a set of all unique customer numbers and iterate through it. For each customer number, get a list of all row indices with that number. If the list is longer than 1, several rows share the customer number, so create a sub df containing just those rows. Then iterate through the set of titles in the sub df. For each title, check whether the same title appears more than once in the sub df (len > 1). If it does, iterate through those rows and write True in the additional column for the first row where the date condition is met.
This should work. Also while reading comments, I am assuming that all cust_num is unique.
import pandas as pd

df = pd.DataFrame({'first_date': ['01-07-2017', '01-07-2017', '01-08-2017'],
                   'end_date': ['01-08-2017', '01-08-2017', '15-08-2017'],
                   'second_date': ['01-09-2017', '01-08-2017', '15-07-2017'],
                   'cust_num': [1, 2, 1],
                   'Title': ['philips', 'samsung', 'philips']})
# The dates are day-first, so pass the format explicitly
df["second_date"] = pd.to_datetime(df["second_date"], format='%d-%m-%Y')
df["end_date"] = pd.to_datetime(df["end_date"], format='%d-%m-%Y')
df['Value'] = False
# Compare every row i with every other row j
for i in range(len(df)):
    for j in range(len(df)):
        if i != j:
            if df.loc[j, 'end_date'] >= df.loc[i, 'second_date']:
                if df.loc[i, 'cust_num'] == df.loc[j, 'cust_num']:
                    if df.loc[i, 'Title'] == df.loc[j, 'Title']:
                        df.loc[i, 'Value'] = True
Let me know whether this code works, and if you run into any errors.
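For a large dataset the nested loops above grow quadratically with the number of rows. A possible alternative is a self-merge on cust_num and Title, which only pairs rows that already share both keys. This is a sketch under the reading that a row is flagged when its second_date is <= the end_date of some other matching row. The merge can still get large if many rows share the same keys.

```python
import pandas as pd

df = pd.DataFrame({'first_date': ['01-07-2017', '01-07-2017', '01-08-2017'],
                   'end_date': ['01-08-2017', '01-08-2017', '15-08-2017'],
                   'second_date': ['01-09-2017', '01-08-2017', '15-07-2017'],
                   'cust_num': [1, 2, 1],
                   'Title': ['philips', 'samsung', 'philips']})
for col in ("end_date", "second_date"):
    df[col] = pd.to_datetime(df[col], format="%d-%m-%Y")

# Keep the original row label as a column so pairs can be traced back.
base = df.reset_index()
left = base[["index", "cust_num", "Title", "second_date"]]
right = base[["index", "cust_num", "Title", "end_date"]]

# Every pairing of rows that share cust_num and Title.
pairs = left.merge(right, on=["cust_num", "Title"], suffixes=("", "_other"))
hits = pairs[
    (pairs["index"] != pairs["index_other"])         # a different row
    & (pairs["second_date"] <= pairs["end_date"])    # date condition
]
df["Value"] = df.index.isin(hits["index"])
print(df)
```

On the sample data this flags the same row as the nested-loop version.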