How to loop through a DataFrame containing 80k+ rows - python

This question might have other answers, but I could not figure out how to apply them to my current code.
I have to iterate through the DataFrame and modify certain column values as shown below:
NOTE: All of the columns are strings. The ones ending in _Length contain the integer length of the corresponding string columns.
for col in range(0, 200):
    if df['Partial_Input_Length'][col] < 50:
        df['Full_Input'][col] = df['Partial_Input'][col] + " " + df['Input5'][col] + " " + df['Input6'][col]
    else:
        df['Full_Input'][col] = df['Partial_Input'][col]
This worked when I was testing on a DataFrame containing only 200 rows. If I use for col in range(0, 80000): on the 80k-row DataFrame, it takes a huge amount of time to finish.
I also tried out with itertuples() in this way:
for col in df.itertuples():
    if col.Partial_Input_Length < 50:
        col.Full_Input = col.Partial_Input + " " + col.Input5 + " " + col.Input6
    else:
        col.Full_Input = col.Partial_Input
But after running it, I get the following error.
File "", line 23, in
col.Full_Input = col.Partial_Input + " " + col.Input5 + " " + col.Input6
AttributeError: can't set attribute
Moreover, I tried with iterrows() like this:
for index, col in df.iterrows():
    if df['Partial_Input_Length'][index] < 50:
        df['Full_Input'][index] = df['Partial_Input'][index] + " " + df['Input5'][index] + " " + df['Input6'][index]
    else:
        df['Full_Input'][index] = df['Partial_Input'][index]
But the code above is taking huge amounts of time, as well.
Is it normal that these iterations take a lot of time on a big DataFrame, or am I doing something wrong?
I am quite a newbie when it comes to iterating in Python. What method should I use for the quickest iteration time, and which fits what I am trying to do?

You can do it without looping:
df['Full_Input'] = df['Partial_Input'].str.cat(df['Input5'], sep=" ").str.cat(df['Input6'], sep=" ")
df['Full_Input'] = np.where(df['Partial_Input_Length'].astype(int) >= 50, df['Partial_Input'], df['Full_Input'])
The length column is cast to int because you say every column is stored as a string, and >= 50 mirrors the else branch of your loop.
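For reference, here is a minimal, self-contained sketch of that vectorized approach; the sample rows below are made up:
import numpy as np
import pandas as pd

# hypothetical sample data mirroring the question's columns
df = pd.DataFrame({
    'Partial_Input': ['short text', 'x' * 60],
    'Partial_Input_Length': ['10', '60'],  # lengths stored as strings, per the question
    'Input5': ['foo', 'foo'],
    'Input6': ['bar', 'bar'],
})

# concatenate once for every row, then keep Partial_Input alone for the long rows
df['Full_Input'] = df['Partial_Input'].str.cat(df['Input5'], sep=" ").str.cat(df['Input6'], sep=" ")
df['Full_Input'] = np.where(df['Partial_Input_Length'].astype(int) >= 50,
                            df['Partial_Input'], df['Full_Input'])
print(df['Full_Input'])  # row 0 is concatenated, row 1 keeps Partial_Input unchanged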

First of all, you should not modify the elements you are iterating over: almost all iter* functions in pandas return read-only items, so setting anything on them will not work (hence your AttributeError).
To do what you want, use apply, or run a loop that calls a function returning a dict with the changes you want, and then either remake the entire DataFrame or do a merge.
Something like:
# if your modification is simple, a plain apply will also work
df['new_col'] = df.apply(lambda x: f'{x.startDate.year}-{x.startDate.week}', axis=1)

# if you want to do something more complex with all the items in the row
def foo(row):
    def modification_code(item):
        return modified_item  # placeholder: your actual transformation of item
    return {
        'primary_key': row.primary_key,
        'modified_data': modification_code(row.item)
    }

modified_data = [foo(row) for row in df.itertuples()]
# sometimes this may be sufficient,
new_df = pd.DataFrame(modified_data)
# alternatively, you can do a merge with the original data
new_df = pd.merge(df, new_df, how='left', on='primary_key')

Related

How do I loop column names in a pandas dataframe?

I am new to Python and have never really used Pandas, so forgive me if this doesn't make sense. I am trying to create a df based on frontend data I am sending to a flask route. The data is looped through and appended for each row. My only problem is that I don't know how to get the df columns to reflect that. Here is my code to build the rows and the current output:
claims = csv_data["claims"]
setups = csv_data["setups"]
for setup in setups:
setup = setups[0]
offerings = setup["currentOfferings"]
considered = setup["considerationSet"]
reach_dict = setup["reach"]
favorite_dict = setup["favorite"]
summary_dict = setup["summaryMetrics"]
rows = []
for i, claim in enumerate(claims):
row = []
row.append(i + 1)
row.append(claim)
for setup in setups:
setup = setups[0]
row.append("X") if claim in setup["currentOfferings"] else row.append(float('nan'))
row.append("X") if claim in setup["considerationSet"] else row.append(float('nan'))
if claim in setup["currentOfferings"]:
reach_score = reach_dict[claim]
reach_percentage = "{:.0%}".format(reach_score)
row.append(reach_percentage)
else:
row.append(float('nan'))
if claim in setup["currentOfferings"]:
favorite_score = favorite_dict[claim]
fav_percentage = "{:.0%}".format(favorite_score)
row.append(fav_percentage)
else:
row.append(float('nan'))
rows.append(row)
I know that I can put columns = ["#", "Claims", "Setups", etc...] in the df, but that doesn't work because the rows loop through multiple setups, and the number of setups can change. If I don't specify the column names (how it is in the image), then I just have numbers as column names. Ideally it should loop through the data it receives in the route, and would start with "#" "Claims" as columns, and then for each setup "Setup 1", "Consideration Set 1", "Reach", "Favorite", "Setup 2", "Consideration Set 2", and so on... etc.
I tried to create a similar type of loop for the columns:
my_columns = []
for i, row in enumerate(rows):
    col = []
    if row[0] != None:
        col.append("#")
    else:
        pass
    if row[1] != None:
        col.append("Claims")
    else:
        pass
    if row[2] != None:
        col.append("Setup")
    else:
        pass
    if row[3] != None:
        col.append("Consideration Set")
    else:
        pass
    if row[4] != None:
        col.append("Reach")
    else:
        pass
    if row[5] != None:
        col.append("Favorite")
    else:
        pass
    my_columns.append(col)

df = pd.DataFrame(
    rows,
    columns = my_columns
)
But this didn't work because I have the same issue: there is no loop over the setups, so I pass 6 column names while 10 data columns exist. I'm not sure if I am just not looping over the columns properly, or if I am making everything more complicated than it needs to be.
This is what I am trying to accomplish without having to explicitly name the columns because this is just sample data. There could end up being 3, 4, however many setups in the actual app.
What I would like the output to look like:
I don't know if this is the most efficient way of doing something like this but I think this is what you want to achieve.
def create_columns(df):
    new_cols = []
    for i in range(len(df.columns)):
        repeated_cols = 6  # here is the number of columns you need to repeat for every setup
        idx = 1 + i // repeated_cols
        basic = ['#', 'Claims', f'Setup_{idx}', f'Consideration_Set_{idx}', 'Reach', 'Favorite']
        new_cols.append(basic[i % len(basic)])
    return new_cols

df.columns = create_columns(df)
If your data comes as csv then try pd.read_csv() to create dataframe.
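A quick way to sanity-check the helper; the 12-column frame below is a made-up stand-in for two setups (note the helper repeats '#' and 'Claims' for each setup block):
df = pd.DataFrame([range(12)])  # twelve unnamed columns, i.e. two setups of six
df.columns = create_columns(df)
print(list(df.columns))
# ['#', 'Claims', 'Setup_1', 'Consideration_Set_1', 'Reach', 'Favorite',
#  '#', 'Claims', 'Setup_2', 'Consideration_Set_2', 'Reach', 'Favorite']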

Create pandas DataFrames in a function

How can I build a function that creates these dataframes?:
buy_orders_1h = pd.DataFrame(
    {'Date_buy': buy_orders_date_1h,
     'Name_buy': buy_orders_name_1h
    })
sell_orders_1h = pd.DataFrame(
    {'Date_sell': sell_orders_date_1h,
     'Name_sell': sell_orders_name_1h
    })
I have 10 dataframes like this that I create very manually, and every time I want to add a new column I have to do it in all of them, which is time consuming. If I can build a function, I would only have to do it once.
The only difference between the two blocks above is, of course, that one is for buy signals and the other is for sell signals.
I guess the inputs to the function should be:
_buy/_sell - for the Column name
buy_ / sell_ - for the Column input
I'm thinking input to the function could be something like:
def create_dfs(col, col_input, hour):
    df = pd.DataFrame(
        {'Date' + col: col_input + "_orders_date_" + hour,
         'Name' + col: col_input + "_orders_name_" + hour
        })
    return df
buy_orders_1h = create_dfs("_buy", "buy_", "1h")
sell_orders_1h = create_dfs("_sell", "sell_", "1h")
A DataFrame needs an index, so either pass an index manually or enter your row values in list form:
def create_dfs(col, col_input, hour):
    df = pd.DataFrame(
        {'Date' + col: [col_input + "_orders_date_" + hour],
         'Name' + col: [col_input + "_orders_name_" + hour]})
    return df
buy_orders_1h = create_dfs("_buy", "buy_", "1h")
sell_orders_1h = create_dfs("_sell", "sell_", "1h")
Edit: Updated due to new information:
To call a global variable using a string, enter globals() before the string in the following manner:
'Date' + col: globals()[col_input + "_orders_date_" + hour]
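A minimal sketch of that globals() lookup; the sample lists are hypothetical, and note that I pass "buy" without the trailing underscore here so the assembled name lines up with the variable:
import pandas as pd

buy_orders_date_1h = ['2021-01-01', '2021-01-02']  # hypothetical module-level lists
buy_orders_name_1h = ['AAPL', 'MSFT']

def create_dfs(col, col_input, hour):
    # globals() maps variable names (strings) to the module-level objects themselves
    return pd.DataFrame(
        {'Date' + col: globals()[col_input + "_orders_date_" + hour],
         'Name' + col: globals()[col_input + "_orders_name_" + hour]})

buy_orders_1h = create_dfs("_buy", "buy", "1h")  # looks up buy_orders_date_1h etc.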
Please check the output to see if this is what you want. You first create two dictionaries; then, depending on the buy=True condition, the helper appends either to buying_df or to selling_df. I created two sample lists of dates and column names and appended them iteratively to the desired dictionaries. The pandas.DataFrame is created only after the dicts are filled: you do not need to build it iteratively, just once at the end, when your dates and names have been collected.
from collections import defaultdict
import pandas as pd

buying_df = defaultdict(list)
selling_df = defaultdict(list)

def add_column_to_df(date, name, buy=True):
    if buy:
        buying_df["Date_buy"].append(date)
        buying_df["Name_buy"].append(name)
    else:
        selling_df["Date_sell"].append(date)
        selling_df["Name_sell"].append(name)

dates = ["1900", "2000", "2010"]
names = ["Col_name1", "Col_name2", "Col_name3"]
for date, name in zip(dates, names):
    add_column_to_df(date, name)

#print(buying_df)
df = pd.DataFrame(buying_df)
print(df)
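For the sell side, the same helper can be reused with buy=False; a quick usage sketch with made-up values:
sell_dates = ["1950", "1960"]
sell_names = ["Col_name4", "Col_name5"]
for date, name in zip(sell_dates, sell_names):
    add_column_to_df(date, name, buy=False)

sell_df = pd.DataFrame(selling_df)
print(sell_df)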

How to open an Excel file created from pandas faster?

The Excel file created from Python is extremely slow to open, even though the file size is only about 50 MB.
I have tried both pandas and openpyxl.
def to_file(list_report, list_sheet, strip_columns, Name):
    i = 0
    wb = ExcelWriter(path_output + '\\' + Name + dateformat + '.xlsx')
    while i <= len(list_report) - 1:
        try:
            df = pd.DataFrame(pd.read_csv(path_input + '\\' + list_report[i] + reportdate + '.csv'))
            for column in strip_column:
                try:
                    df[column] = df[column].str.strip('=("")')
                except:
                    pass
            df = adjust_report(df, list_report[i])
            df = df.apply(pd.to_numeric, errors='ignore', downcast='integer')
            df.to_excel(wb, sheet_name=list_sheet[i], index=False)
        except:
            print('Missing report: ' + list_report[i])
        i += 1
    wb.save()
Is there anyway to speed it up?
idiom
Let us rename list_report to reports.
Then your while loop is usually expressed as simply: for i in range(len(reports)):
You access the i-th element several times. The loop could bind that for you, with: for i, report in enumerate(reports):.
But it turns out you never even need i. So most folks would write this as: for report in reports:
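One wrinkle: list_sheet is indexed in parallel with list_report, so once the index goes away, zip keeps the two paired. A sketch of the resulting loop skeleton, under that assumption:
for report, sheet in zip(reports, list_sheet):
    try:
        df = pd.read_csv(path_input + '\\' + report + reportdate + '.csv')
        # ... strip, adjust, downcast as before ...
        df.to_excel(wb, sheet_name=sheet, index=False)
    except Exception:
        print('Missing report: ' + report)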
code organization
This bit of code is very nice:
for column in strip_column:
    try:
        df[column] = df[column].str.strip('=("")')
    except:
        pass
I recommend you bury it in a helper function, e.g. def strip_punctuation.
(The list should be plural, I think? strip_columns?)
Then you would have a simple sequence of df assignments.
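A sketch of that helper, same behavior as the snippet above, just named and reusable (the broad except is kept since some columns may not be strings):
def strip_punctuation(df, strip_columns):
    # strip the ='("")' wrapper from each listed column, skipping any that fail
    for column in strip_columns:
        try:
            df[column] = df[column].str.strip('=("")')
        except Exception:
            pass
    return df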
timing
Profile elapsed time with time.time(). Surround each df assignment with code like this:
from time import time

t0 = time()
df = ...
print(time() - t0)
That will show you which part of your processing pipeline takes the longest and therefore should receive the most effort for speeding it up.
I suspect adjust_report() uses the bulk of the time,
but without seeing it that's hard to say.

Modify DataFrame column values based on condition

I am trying to modify the formatting of the strings of a DataFrame column according to a condition.
Here is an example of the file (screenshot of the DataFrame omitted).
Now, as you might see, the object column values either start with http or a capital letter: I want to make it so that:
if the string starts with http, I put it between <>
if the string starts with a capital letter, I wrap it in double quotes and append '#en', i.e. "string"#en
However, I can't seem to be able to do so: I tried a simple if condition with .startswith('h') or contains('http'), but it doesn't work, because I understand that it actually returns a Series of booleans instead of a single condition.
Maybe it is very simple, but I cannot solve it; any help is appreciated.
Here is my code
import numpy as np
import pandas as pd
import re

ont1 = pd.read_csv('1.tsv', sep='\t', names=['subject', 'predicate', 'object'])
ont1['subject'] = '<' + ont1['subject'] + '>'
ont1['predicate'] = '<' + ont1['predicate'] + '>'
So it looks like you have many of the right pieces here. You mentioned boolean indexing, which is exactly what you can use to select and update certain rows. For example, on a dummy DataFrame:
df = pd.DataFrame({"a":["http://akjsdhka", "Helloall", "http://asdffa", "Bignames", "nonetodohere"]})
First we can find rows starting with "http":
mask = df["a"].str.startswith("http")
df.loc[mask, "a"] = "<" + df["a"] + ">"
Then we update the rows where that mask is true, and the same for the other condition:
mask2 = df["a"].str[0].str.isupper()
df.loc[mask2, "a"] = "\"" + df["a"] + "\"#en"
Final result:
a
0 <http://akjsdhka>
1 "Helloall"#en
2 <http://asdffa>
3 "Bignames"#en
4 nonetodohere
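Applied to the question's own frame, the same two-mask pattern would look something like this (assuming the ont1 loaded above):
mask_http = ont1['object'].str.startswith('http')
ont1.loc[mask_http, 'object'] = '<' + ont1['object'] + '>'

mask_upper = ont1['object'].str[0].str.isupper()
ont1.loc[mask_upper, 'object'] = '"' + ont1['object'] + '"#en'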
Try:
ont1.loc[ont1['subject'].str.startswith("http"), 'subject'] = "<" + ont1['subject'] + ">"
Ref to read:
https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/

Conditional sum in Python between multiple columns

I have the following script, from a larger analysis of securities data:
returns_columns = []
df_merged[ticker + '_returns'] = df_merged[ticker + '_close'].pct_change(periods=1)
returns_columns.append(ticker + '_returns')
df_merged['applicable_returns_sum'] = (df_merged[returns_columns] > df_merged['return_threshold']).sum(axis=1)
'return_threshold' is a complete series of float numbers.
I've been able to successfully sum each row in the returns_columns array, but cannot figure out how to conditionally sum only the numbers in the returns_columns that are greater than the 'return_threshold' in that row.
This seems like a problem similar to the one shown here, Python Pandas counting and summing specific conditions, but I'm trying to sum based on the changing condition in the returns_columns.
Any help would be much appreciated, thanks as always!
EDIT: ANOTHER APPROACH
This is another approach I tried. The script below has an error associated with the ticker input, even though I think the argument is necessary, and then produces an error:
def compute_applicable_returns(row, ticker):
    if row[ticker + '_returns'] >= row['top_return']:
        return row[ticker + '_returns']
    else:
        return 0

df_merged['applicable_top_returns'] = df_merged[returns_columns].apply(compute_applicable_returns, axis=1)
The [] operator for a dataframe should allow you to filter by an expression df > threshold and return a dataframe. You can then call .sum() on this df.
df[df > threshold].sum()
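Because the threshold here varies per row, one hedged vectorized sketch compares column-wise against the Series with .gt(..., axis=0) and then sums only the qualifying returns (names as in the question; the result column name is my own):
returns = df_merged[returns_columns]
above = returns.gt(df_merged['return_threshold'], axis=0)  # row-aligned comparison
df_merged['applicable_returns_total'] = returns.where(above, 0).sum(axis=1)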
The question was answered like this:
def compute_applicable_returns(row, ticker):
    if row[ticker + '_returns'] >= row['return_threshold']:
        return row[ticker + '_returns']
    else:
        return 0

for ticker in tickers:
    df_merged[ticker + '_applicable_returns'] = df_merged.apply(compute_applicable_returns, args=(ticker,), axis=1)
