DataFrame change doesn't save when iterating - python

I am trying to read a DataFrame from a file and add two more columns to it, containing, say, the year and the week derived from other columns in the DataFrame. When I apply the code to generate a single new column, everything works. But when there are several columns to be created, the change does not apply. Specifically, the new columns are created, but their values are not what they are supposed to be.
I know this happens because I first set all new values to a certain initial string and then change some of them, but I don't understand why it works for a single column and is "nulled" for multiple columns, leaving only the last column changed... Help please?
import numpy as np
import pandas as pd

tbl = pd.read_csv(file).fillna('No Fill')
date_cols = ['Col1', 'Col2']
for i in range(len(date_cols)):
    tmp_col_name = date_cols[i] + '_WEEK'
    tbl[tmp_col_name] = 'No Week'
    # indices of rows where the date is missing
    bad_ind = list(np.where(tbl[date_cols[i]] == 'No Fill')[0])
    tbl_ind = list(range(len(tbl)))
    for j in range(len(bad_ind)):
        tbl_ind.remove(bad_ind[j])
    tmp = pd.to_datetime(tbl[date_cols[i]][tbl_ind])
    tbl[tmp_col_name][tbl_ind] = tmp.apply(lambda x: str(x.isocalendar()[0]) + '+' + str(x.isocalendar()[1]))
If I try the following lines, disregarding possible "empty data values", everything works...
tbl = pd.read_csv(file).fillna('No Fill')
date_cols = ['Col1', 'Col2']
for i in range(len(date_cols)):
    tmp_col_name = date_cols[i] + '_WEEK'
    tbl[tmp_col_name] = 'No Week'
    tmp = pd.to_datetime(tbl[date_cols[i]])
    tbl[tmp_col_name] = tmp.apply(lambda x: str(x.isocalendar()[0]) + '+' + str(x.isocalendar()[1]))
It has to do with not changing all of the data values, but I don't understand why the change does not apply. After all, before the second iteration begins, the DF seems to be updated, and then tbl[tmp_col_name] = 'No Week' for the second iteration "deletes" the changes made in the first iteration, but only partially: it leaves the new column created, yet filled with 'No Week' values...

Many thanks to @EdChum! Performing chained indexing may or may not work. When creating multiple new columns and then filling in only some of their values, it doesn't work. More precisely, it does work, but only on the last updated column. Using the loc, iloc or ix accessors to set the data works. In the case of the above code, to make it work, one needs to cast tbl_ind into an np.array, e.g. tbl[col_name[j]].iloc[np.array(tbl_ind)] = tmp.apply(lambda x: x.year)
Many thanks and credit for the answer to @EdChum.
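For illustration only, here is one way the original loop could be rewritten with a single positional .iloc assignment instead of chained indexing. This is a sketch that reuses the same tbl, date_cols, file and placeholder strings as above, not the exact code from the answer:
import numpy as np
import pandas as pd

tbl = pd.read_csv(file).fillna('No Fill')
date_cols = ['Col1', 'Col2']
for col in date_cols:
    week_col = col + '_WEEK'
    tbl[week_col] = 'No Week'
    # positions of the rows that actually contain a date
    good_pos = np.where(tbl[col] != 'No Fill')[0]
    weeks = pd.to_datetime(tbl[col].iloc[good_pos]).apply(
        lambda x: str(x.isocalendar()[0]) + '+' + str(x.isocalendar()[1]))
    # one .iloc call sets the values directly on tbl, so nothing is lost on later iterations
    tbl.iloc[good_pos, tbl.columns.get_loc(week_col)] = weeks.values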

Related

How to divide a row by the previous one and store the result in a new column?

My problem is that I have a time series, and I need to create a new column that gives me the natural logarithm of today's price divided by yesterday's price, with those values stored in a new column.
#Accion1['Rentabilidad1'] = Accion1.apply(lambda row: np.log(Promedio), axis=1)
Accion1['Rentabilidad'] = np.log(Accion1['Promedio'])
Accion1
I was thinking of creating new variables and dividing iloc[0]/iloc[1], but that doesn't work either. Please help 👨🏽‍💻
I would grab all of the previous dates and save them to a separate list:
fechas_anteriores = [x.strftime("%d-%m-%Y") for x in Accion1["Fecha anterior"]]
Then I would create a new column called Promedio anterior using .loc, making reference to fechas_anteriores:
Accion1["Promedio anterior"] = [fecha[0] if len(fecha) > 0 else None \
for fecha in [Accion1.loc[Accion1["Fecha de cotizacion"] == dt.strptime(fecha_anterior, "%d-%m-%Y")]["Promedio"].to_list() \
for fecha_anterior in fechas_anteriores]]
Finally I would execute the division:
Accion1["Division"] = Accion1["Rentabilidad"]/Accion1["Promedio anterior"]
You could do all of this in one line of course, though it would be less readable.
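As a side note, a much shorter route to the log return described in the question would be pandas' shift, which lines each row up with the previous one. A minimal sketch, assuming Promedio is the price column and the rows are already sorted by date (the first row comes out as NaN because it has no previous price):
import numpy as np

# log of today's price divided by yesterday's price (log return)
Accion1["Rentabilidad"] = np.log(Accion1["Promedio"] / Accion1["Promedio"].shift(1))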

Python - How to optimize code to run faster? (lots of for loops in DataFrame)

I have code that works quite extensively with an excel file (SAP download), with data transformation and calculation steps.
I need to loop through all the lines (a couple thousand rows) a few times. I had previously written code that added the DataFrame columns separately, so I could do everything in one for loop, which was of course quite quick; however, I had to change the data source, which meant a change in the raw data structure.
The raw data structure has the first 3 rows empty, then a title row with the column names, then 2 more empty rows, and the 1st column is also empty. I decided to wipe these and assign the column names as headers (steps below); however, since then, separately adding column names and later calculating everything in one for statement does not fill data into any of these specific columns.
How could I optimize this code?
I have deleted some calculation steps since they are quite long and would make the code even less readable.
# This function adds new columns to the dataframe
def NewColdfConverter(*args):
    for i in args:
        dfConverter[i] = ''  # previously used dfConverter[i] = NaN

# This function creates a dataframe from an excel file
def DataFrameCreator(path, sheetname):
    excelFile = pd.ExcelFile(path)
    global readExcel
    readExcel = pd.read_excel(excelFile, sheet_name=sheetname)

# calling my function to create the dataframe
DataFrameCreator(filePath, sheetName)
dfConverter = pd.DataFrame(readExcel)

# dropping NA values from the Orders column (right now called Unnamed)
dfConverter.dropna(subset=['Unnamed: 1'], inplace=True)

# dropping rows and deleting other unnecessary columns
dfConverter.drop(dfConverter.head(1).index, inplace=True)
dfConverter.drop(dfConverter.columns[[0,11,12,13,17,22,23,48]], axis=1, inplace=True)

# renaming columns from 'Unnamed: 1' etc. to proper names
dfConverter = dfConverter.rename(columns={'Unnamed: 1': 'propername1', 'Unnamed: 2': 'propername2'})  # etc.

# calling the new column function -> this Day column appears in the 1st for loop
NewColdfConverter("Day")
# example for loop that worked prior, but not working since new dataset and new header/column steps added:
for i in range(len(dfConverter)):
    # Day column -> floor Entry Date -1, if time is less than 5:00:00
    if dfConverter['Time'][i] <= time(hour=5, minute=0, second=0):
        dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i]) - timedelta(days=1)
    else:
        dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])
The problem is that there are many columns that build on one another, so I cannot compute them in one for loop. For instance, in the example below I need to calculate reqsWoSetUpValue so I can calculate requirementsValue, so I can calculate otherReqsValue, but I'm not able to do this within one for loop by assigning the values to dataframecolumn[i] row by row, because the value will just be missing, as if nothing happened.
(dfSorted is the same as dfConverter, but a sorted version of it)
# example code of getting reqsWoSetUpValue
for i in range(len(dfSorted)):
    reqsWoSetUpValue[i] = #calculationsteps...

# inserting column with value
dfSorted.insert(49, 'Reqs wo SetUp', reqsWoSetUpValue)

# getting requirements value with previously calculated Reqs wo SetUp column
for i in range(len(dfSorted)):
    requirementsValue[i] = #calc
dfSorted.insert(50, 'Requirements', requirementsValue)

# Calculating Other Reqs value with previously calculated Requirements column.
for i in range(len(dfSorted)):
    otherReqsValue[i] = #calc
dfSorted.insert(51, 'Other Reqs', otherReqsValue)
Does anyone have a clue why I can no longer do this in one for loop, by first adding all the columns with the function, like:
NewColdfConverter('Reqs wo setup','Requirements','Other reqs')
# then in 1 for loop:
for i in range(len(dfSorted)):
    dfSorted['Reqs wo setup'] = #calculationsteps
    dfSorted['Requirements'] = #calculationsteps
    dfSorted['Other reqs'] = #calculationsteps
Thank you
General comment: How to identify bottlenecks
To get started, you should try to identify which parts of the code are slow.
Method 1: time code sections using the time package
Wrap blocks of code in statements like this:
import time
t = time.time()
# do something
print("time elapsed: {:.1f} seconds".format(time.time() - t))
Method 2: use a profiler
E.g. Spyder has a built-in profiler. This allows you to check which operations are most time consuming.
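If you are not using an IDE with a profiler, the standard library's cProfile gives the same kind of breakdown. A minimal sketch, where process_file is a hypothetical stand-in for your transformation code:
import cProfile
import pstats

def process_file():
    # hypothetical placeholder for the dataframe transformation steps
    ...

cProfile.run('process_file()', 'profile_stats')
# show the 10 entries with the largest cumulative time
pstats.Stats('profile_stats').sort_stats('cumulative').print_stats(10)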
Vectorize your operations
Your code will be orders of magnitude faster if you vectorize your operations. It looks like your loops are all avoidable.
For example, rather than calling pd.to_datetime on every row separately, you should call it on the entire column at once:
# slow (don't do this):
for i in range(len(dfConverter)):
    dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])

# fast (do this instead):
dfConverter['Day'] = pd.to_datetime(dfConverter['Entry Date'])
If you want to perform an operation on a subset of rows, you can also do this in a vectorized operation by using loc:
mask = dfConverter['Time'] <= time(hour=5,minute=0,second=0)
dfConverter.loc[mask,'Day'] = pd.to_datetime(dfConverter.loc[mask,'Entry Date']) - timedelta(days=1)
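The same idea carries over to the columns that depend on each other: compute each one as a whole-column expression, in order, and the previous result is available for the next step. A rough sketch with made-up column names and formulas, since the real calculation steps were removed from the question:
# hypothetical formulas -- replace with the real calculation steps
dfSorted['Reqs wo SetUp'] = dfSorted['Base Qty'] * dfSorted['Factor']
dfSorted['Requirements'] = dfSorted['Reqs wo SetUp'] + dfSorted['SetUp Qty']
dfSorted['Other Reqs'] = dfSorted['Total Qty'] - dfSorted['Requirements']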
Not sure this would improve performance, but you could calculate the dependent columns at the same time, row by row, with DataFrame.iterrows():
for index, data in dfSorted.iterrows():
    # .loc is used so the assignment writes into the original frame rather than a copy
    dfSorted.loc[index, 'Reqs wo setup'] = #calculationsteps
    dfSorted.loc[index, 'Requirements'] = #calculationsteps
    dfSorted.loc[index, 'Other reqs'] = #calculationsteps

Correct way of testing Pandas dataframe values and modifying them

I need to modify some values of a Pandas dataframe based on a test, and leave the other values intact. I also need to leave the order of the rows intact.
I have working code based on iterating over the dataframe's rows, but it's horrendously slow. Is there a quicker way to get it done?
Here are two examples of this very slow code:
for index, row in df.iterrows():
    if df.number[index].is_integer():
        df.number[index] = int(df.number[index])

for index, row in df.iterrows():
    if df.string[index] == "XXX":
        df.string[index] = df.other_colum[index].split("\")[0] + df.other_colum[index].split("\")[1]
    else:
        df.string[index] = df.other_colum[index].split("\")[1] + df.other_colum[index].split("\")[0]
Thanks
Generally you want to avoid iterating through the rows of a pandas dataframe, as it is slower than the methods pandas provides for accomplishing the same thing. One way of getting around this is using apply. You would redefine the number column:
df["number"] = df["number"].apply(lambda x: int(x) if x.is_integer() else x)
And (re)define the string column:
df["string"] = df["other column"].apply(lambda x: x.split("\\")[0] + x.split("\\")[1] if x == r"XX\X" else x.split("\\")[1] + x.split("\\")[0])
I made some assumptions based on the data you removed from the problem setup: .split("\") is incorrect syntax, and "other column" above necessarily has to contain a backslash in order for your code (and mine) to work, otherwise .split("\\")[1] will raise an IndexError.
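The string column can also be handled without apply at all, using vectorized string methods. A sketch that reuses the question's column names and, like the answer above, assumes every other_colum value contains a backslash:
import numpy as np

# split each value once, at the first backslash, into two columns
parts = df["other_colum"].str.split("\\", n=1, expand=True)
# swap the two halves depending on the test against the string column
df["string"] = np.where(df["string"] == "XXX", parts[0] + parts[1], parts[1] + parts[0])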

Extract and match items dealing with multiple data-frames using Python

I have two dataframes which can be created using the code shown below
df1 = pd.DataFrame({'home': [1, np.nan, 2, np.nan, 3, 4],
                    'PERSONAL INFORMATION': ['Study Number', 'Study ID',
                                             'Age when interview done',
                                             'Derived using date of birth',
                                             'Gender', 'ethnicity'],
                    'VARIABLE': ['studyid', 'dummy', 'age_interview', 'dummy', 'gender', 'Chinese'],
                    'Remarks': [2000000001, 20005000001, 4265453, 0, 4135376, 2345678]})

df2 = pd.DataFrame({'level_0': ['studyid', 'age_interview', 'gender', 'dobyear', 'ethderived', 'smoke', 'alcohol'],
                    '0': ['tmp001', 56, 'Female', 1950, 'Chinese', 'No', 'Yes']})
Aim
1) My objective is to take the values from the 'level_0' column of df2 and look for them in the 'VARIABLE' column of df1 to fetch their 'Remarks' column value, provided the condition below is satisfied:
a) the 'home' column of df1 should contain digits as part of its value (e.g. 1, 2, 3, 4, B1.5, C1.9, D1.2 are all valid values for the 'home' column)
2) My objective is the same as above, but here I would like to take the values from the '0' column of df2 and look for them in the 'PERSONAL INFORMATION' column of df1 to fetch their 'Remarks' value, provided the condition below is satisfied:
a) the 'VARIABLE' column of df1 should contain 'dummy' as a value
For the above two scenarios, I have written the code below, but I feel that it is quite lengthy and inefficient. There should be an easier way to do this.
Scenario - 1
qconc_id = []
missed_items = []
col_list = []
for i in df7.index:
    ques = df7['level_0'][i]
    col_list.append(ques)
    try:
        qindex = int(df[df['VARIABLE'] == ques].index[0]), df.columns.get_loc('VARIABLE')
        pos_qindex = qindex[0]
        ques_value = df['home '][pos_qindex]
        result = re.match(r"[A-Z]?[\d]?[\.]?[\d]+", ques_value)
        while result is None:
            pos_qindex = pos_qindex - 1
            ques_value = df['home '][pos_qindex]
            result = re.match(r"[A-Z]?[\d]?[\.]?[\d]+", ques_value)
        qconc_id.append(df['Remarks'][pos_qindex])
    except:
        missed_items.append(ques)
Scenario - 2
aconc_id = []
missed_items = []
ans_list = []
for i in df7.index:
    ans = df7[0][i]
    print("ans is ", ans)
    ans_list.append(ans)
    idx = 0
    try:
        aindex = df[df['PERSONAL INFORMATION'].str.contains(ans, case=False, regex=False)].index
        print(aindex)
        pos_aindex = aindex[idx]
        while (df['VARIABLE'][pos_aindex] != 'dummy') and (df['PERSONAL INFORMATION'].str.contains('Yes|No', regex=True)[pos_aindex] == False):
            pos_aindex = aindex[idx + 1]
        print("The value is ", df['Remarks'][pos_aindex])
        aconc_id.append(df['Remarks'][pos_aindex])
    except:
        print("Goes to Exception")
        aconc_id.append('0')
        missed_items.append(ans)
Please note these two things:
a) I have used a while loop because the values might repeat. For example, we might have a matching value 'No', but df1['VARIABLE'] may not be 'dummy'. So I increase the index value in both scenarios to check whether the next occurrence of 'No' has the 'dummy' value in the VARIABLE column. The same applies to scenario 1 as well.
b) How can I handle scenarios where "No" finds a match in "Notes" or "Nocase"? As you can see from my code, I am using regex, but I am still encountering errors here.
As you can see, I am making some modifications to the code and writing it twice. How can I make it elegant and efficient? I am sure there must be a very easy and simple way to do this.
Any suggestions or ideas on alternative approaches, such as changing the format of the source data or using a merge/join approach, are also welcome.
I expect the output, the 'Remarks' values, to be stored in a list.
You should avoid explicit loops in pandas as much as possible, because they are not vectorized (optimized, in pandas and numpy terms). Here you could merge your dataframes:
Scenario 1:
# extract values where df2.level_0 == df1.VARIABLE
tmp = pd.merge(pd.DataFrame(df2.level_0), df1.loc[:, ['home', 'VARIABLE', 'Remarks']],
               left_on=['level_0'], right_on=['VARIABLE'])

# drop lines where home does not contain a digit
tmp.drop(tmp.loc[~tmp.home.astype(np.str_).str.contains(r'\d')].index, inplace=True)

# extract the Remarks column into a list
lst = tmp.Remarks.tolist()
With your example data I get [2000000001, 4265453, 4135376]
Scenario 2:
tmp = pd.merge(pd.DataFrame(df2['0']), df1.loc[:, ['PERSONAL INFORMATION', 'VARIABLE', 'Remarks']],
               left_on=['0'], right_on=['PERSONAL INFORMATION'])
# keep only the rows where VARIABLE is 'dummy'
tmp.drop(tmp.loc[tmp['VARIABLE'] != 'dummy'].index, inplace=True)
lst.extend(tmp.Remarks.tolist())
With your example data I get no additional values, because tmp is already an empty dataframe after the first step (the merge).

Pandas For Loop, If String Is Present In ColumnA Then ColumnB Value = X

I'm pulling Json data from the Binance REST API, after formatting I'm left with the following...
I have a dataframe called Assets with 3 columns [Asset,Amount,Location],
['Asset'] holds ticker names for crypto assets e.g.(ETH,LTC,BNB).
However when all or part of that asset has been moved to 'Binance Earn' the strings are returned like this e.g.(LDETH,LDLTC,LDBNB).
['Amount'] can be ignored for now.
['Location'] is initially empty.
I'm trying to set the value of ['Location'] to 'Earn' if the string in ['Asset'] includes 'LD'.
This is how far I got, but I can't remember how to apply the change to only the current item; it's been ages since I've used Pandas or for loops.
I'm only able to apply it to the entire column rather than to the row in the current iteration.
for Row in Assets['Asset']:
    if Row.find('LD') == 0:
        print('Earn')
        Assets['Location'] = 'Earn'  # <---- How to apply this to the current row only?
    else:
        print('???')
        Assets['Location'] = '???'  # <---- How to apply this to the current row only?
The print statements work correctly, but currently the whole column gets populated with the same value (whichever was last) as you might expect.
So (LDETH,HOT,LDBTC) returns ('Earn','Earn','Earn') rather than the desired ('Earn','???','Earn')
Any help would be appreciated...
np.where() fits here. If the Asset starts with LD, then return Earn, else return ???:
Assets['Location'] = np.where(Assets['Asset'].str.startswith('LD'), 'Earn', '???')
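A quick sanity check with the example tickers from the question (LDETH, HOT, LDBTC), using placeholder Amount values, just to illustrate the expected result:
import numpy as np
import pandas as pd

Assets = pd.DataFrame({'Asset': ['LDETH', 'HOT', 'LDBTC'],
                       'Amount': [1.0, 2.0, 3.0],   # placeholder amounts
                       'Location': ''})
Assets['Location'] = np.where(Assets['Asset'].str.startswith('LD'), 'Earn', '???')
print(Assets['Location'].tolist())  # ['Earn', '???', 'Earn']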
You could run a lambda in df.apply to check whether 'LD' is in df['Asset']:
df['Location'] = df['Asset'].apply(lambda x: 'Earn' if 'LD' in x else None)
One possible solution:
def get_loc(row):
    asset = row['Asset']
    if asset.find('LD') == 0:
        print('Earn')
        return 'Earn'
    print('???')
    return '???'

Assets['Location'] = Assets.apply(get_loc, axis=1)
Note, you should almost never iterate over a pandas dataframe or series.
