I have 5000 JSON data points that I am iterating over and collecting into a DataFrame. I first add the data to a list of Series and then add that list to the DataFrame, using the code below:
1. (5000 times) pd.Series([trading_symbol, instrument_token], index=stock_instrument_token_df.columns)
then:
2. (once) stock_instrument_token_df.append(listOfSeries, ignore_index=True)
Step 1 takes around 700-800 ms to execute and step 2 around 200-300 ms, so overall this process takes about 1 second.
Before this, I iterate through another set of 50,000 JSON data points and add them to a Python dict. That takes around 300 ms.
Is there a faster way to insert into the DataFrame? Is there something wrong with the way I am looping through the data or inserting into the DataFrame?
Complete code as requested, if it helps
stock_instrument_token_df = pd.DataFrame(columns=['first', 'second'])
listOfSeries = []
for data in api_response:
    trading_symbol = data[Constants.tradingsymbol]
    instrument_token = data[Constants.instrument_token]
    listOfSeries.append(
        pd.Series([trading_symbol, instrument_token], index=stock_instrument_token_df.columns))
stock_instrument_token_df = stock_instrument_token_df.append(listOfSeries, ignore_index=True)
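For reference, a pattern that is usually much faster (a sketch, assuming api_response and Constants are as in the code above) is to skip the per-row pd.Series objects entirely, collect plain tuples in the loop, and construct the DataFrame once at the end:
import pandas as pd

# collect plain Python tuples in the loop; build the DataFrame once at the end
rows = []
for data in api_response:
    rows.append((data[Constants.tradingsymbol], data[Constants.instrument_token]))

stock_instrument_token_df = pd.DataFrame(rows, columns=['first', 'second'])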
An (already defined) function takes ISINs (a unique identifier in finance) as input and returns the corresponding RICs (another identifier) by looking them up in an internal web app where this data is available in tabular form. The key limitation of this website is that it can't take more than 500 input IDs at a time. So when 500 or fewer ISINs are entered as input, it returns a DataFrame containing those input ISINs and their corresponding RIC codes from the website.
The task is to take a CSV containing 30k ISINs as input, batch them into groups of 500 IDs so they can be passed through the function, and then store the produced output (a DataFrame), looping over the input and appending the output incrementally.
Can someone please suggest how to break these 30k IDs into batches of 500, loop them through the function and store all the results? Many thanks in advance!
.iloc is the method you want to use.
df.iloc[firstrow:lastrow , firstcol:lastcol]
if you put it in a for loop such as
for x in range(0, 30000, 500):
    a = x          # first row of the batch
    b = x + 500    # one past the last row of the batch
    thisDF = bigdf.iloc[a:b, firstcol:lastcol]
Try it and implement it in your code. You should include some code you have tried when asking a question, so you can get better help.
Assuming you read in the .csv file as a pandas Series (e.g. with something like pd.read_csv('ISINs.csv', header=None)[0], since pd.Series.from_csv has been removed from recent pandas versions) or that you have a list, you could split these up like this:
import pandas as pd
import numpy as np

# mock series of ISINs
isins = pd.Series(np.arange(0, 30002, 1))

data = pd.DataFrame()
for i in range(0, len(isins)//500):
    isins_for_function = isins.iloc[i*500: i*500+500]
    # if you have a list instead of a Series, split it like this instead
    # isins_for_function = isins[i*500: i*500+500]
    df = func(isins_for_function)
    data = pd.concat([data, df])

# handle the leftover rows after the last full chunk of 500
# (only needed if len(isins) is not a multiple of 500)
# for a Series
df = func(isins.iloc[-(len(isins) % 500):])
# for a list
# df = func(isins[-(len(isins) % 500):])
data = pd.concat([data, df])
This will concatenate your DataFrames together into data. isins is your Series/list of ISINs. You need the last bit after the for loop for any values that come after the last full chunk of 500 (the mock Series above has 30002 rows, so the last two are not included in the chunks of 500 and still need to be passed to the function func).
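A more compact variant is a sketch along these lines, assuming isins is a plain Python list (for a Series, slice with .iloc as in the code above) and that func returns a DataFrame:
import pandas as pd

# step through the data 500 items at a time; range handles the remainder for us
pieces = []
for start in range(0, len(isins), 500):
    chunk = isins[start:start + 500]   # the final chunk may be shorter than 500
    pieces.append(func(chunk))
data = pd.concat(pieces, ignore_index=True)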
I want to change the target value in the data set within a certain interval. With 500 rows it takes about 1.5 seconds, but I have around 100,000 rows. Most of the execution time is spent in this process, and I want to speed it up.
What is the fastest and most efficient way to append rows to a DataFrame?
I tried the solution from this link (creating a dictionary), but I couldn't get it to work.
Here is the code, which takes around 1.5 seconds for 500 rows:
def add_new(df, base, interval):
    df_appended = pd.DataFrame()
    np.random.seed(5)
    s = np.random.normal(base, interval/3, 4)
    s = np.append(s, base)
    for i in range(0, 5):
        df_new = df            # note: this is a reference to df, not a copy
        df_new["DeltaG"] = s[i]
        df_appended = df_appended.append(df_new)
    return df_appended
DataFrames in pandas are contiguous pieces of memory, so appending or concatenating DataFrames is very inefficient: these operations create new DataFrames and copy all the data from the old ones.
Basic Python structures such as lists and dicts are not; when you append a new element, Python just stores a pointer to the new element of the structure.
So my advice is to do all your data processing on lists or dicts and convert them to DataFrames at the end.
Another option is to create a preallocated DataFrame of the final size and just change values in it using .iloc, but that only works if you know the final size of your resulting DataFrame.
Good examples with code: Add one row to pandas DataFrame
If you need more code examples, let me know.
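For example, one way to apply this advice to the add_new function from the question (a sketch: the five intermediate frames live in a plain Python list and a single pd.concat replaces the repeated append calls) could look like this:
import numpy as np
import pandas as pd

def add_new(df, base, interval):
    np.random.seed(5)
    s = np.random.normal(base, interval / 3, 4)
    s = np.append(s, base)
    # build the five variants in a plain Python list, then concatenate once
    pieces = [df.assign(DeltaG=value) for value in s]
    return pd.concat(pieces)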
def add_new(df1, base, interval, has_interval):
    dictionary = {}
    if has_interval == 0:
        for i in range(0, 5):
            dictionary[i] = df1.copy()
    elif has_interval == 1:
        np.random.seed(5)
        s = np.random.normal(base, interval/3, 4)
        s = np.append(s, base)
        for i in range(0, 5):
            df_new = df1
            df_new[4] = s[i]
            dictionary[i] = df_new.copy()
    return dictionary
It works. It takes around 10 seconds for the whole data set. Thanks for your answers.
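If you eventually need everything in a single DataFrame again, the dictionary returned above can be combined in one step at the very end, in line with the advice from the answer (a small sketch using the same variable names as the question):
import pandas as pd

frames = add_new(df1, base, interval, has_interval)
combined = pd.concat(frames.values(), ignore_index=True)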
I have some code that works quite extensively with an Excel file (an SAP download), doing data transformation and calculation steps.
I need to loop through all the lines (a couple of thousand rows) a few times. I previously wrote code that added the DataFrame columns separately, so I could do everything in one for loop, which was of course quite quick. However, I had to change the data source, which meant a change in the raw data structure.
In the raw data the first 3 rows are empty, then a title row with the column names follows, then 2 more empty rows, and the first column is also empty. I decided to wipe these, assign column names and make them headers (steps below). However, since then, separately adding column names and later calculating everything in one for statement does not fill data into any of these specific columns.
How could I optimize this code?
I have deleted some calculation steps since they are quite long and would make the code even less readable.
# This function adds new columns to the dataframe
def NewColdfConverter(*args):
    for i in args:
        dfConverter[i] = ''  # previously used dfConverter[i] = NaN

# This function creates a dataframe from an excel file
def DataFrameCreator(path, sheetname):
    excelFile = pd.ExcelFile(path)
    global readExcel
    readExcel = pd.read_excel(excelFile, sheet_name=sheetname)

# calling my function to create the dataframe
DataFrameCreator(filePath, sheetName)
dfConverter = pd.DataFrame(readExcel)

# dropping NA values from Orders column (right now called Unnamed)
dfConverter.dropna(subset=['Unnamed: 1'], inplace=True)

# dropping rows and deleting other unnecessary columns
dfConverter.drop(dfConverter.head(1).index, inplace=True)
dfConverter.drop(dfConverter.columns[[0, 11, 12, 13, 17, 22, 23, 48]], axis=1, inplace=True)

# renaming columns from 'Unnamed: 1' etc. to proper names
dfConverter = dfConverter.rename(columns={'Unnamed: 1': 'propername1', 'Unnamed: 2': 'propername2'})  # etc.

# calling the new column function -> this Day column appears in the 1st for loop
NewColdfConverter("Day")

# example for loop that worked prior, but not working since new dataset and new header/column steps added:
for i in range(len(dfConverter)):
    # Day column -> floor Entry Date -1, if time is less than 5:00:00
    if dfConverter['Time'][i] <= time(hour=5, minute=0, second=0):
        dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i]) - timedelta(days=1)
    else:
        dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])
The problem is that there are many columns that build on one another, so I cannot compute them in one for loop. For instance, in the example below I need to calculate reqsWoSetUpValue so I can calculate requirementsValue, so I can calculate otherReqsValue, but I'm not able to do this within one for loop by assigning the values to the dataframe column's [i] row, because the value will just be missing, as if nothing had happened.
(dfSorted is the same as dfConverter, but a sorted version of it)
# example code of getting reqsWoSetUpValue
for i in range(len(dfSorted)):
    reqsWoSetUpValue[i] = # calculation steps...
# inserting column with values
dfSorted.insert(49, 'Reqs wo SetUp', reqsWoSetUpValue)

# getting Requirements value with the previously calculated Reqs wo SetUp column
for i in range(len(dfSorted)):
    requirementsValue[i] = # calc
dfSorted.insert(50, 'Requirements', requirementsValue)

# calculating Other Reqs value with the previously calculated Requirements column
for i in range(len(dfSorted)):
    otherReqsValue[i] = # calc
dfSorted.insert(51, 'Other Reqs', otherReqsValue)
Does anyone have a clue why I cannot do this in one for loop anymore by first adding all the columns with the function, like this:
NewColdfConverter('Reqs wo setup', 'Requirements', 'Other reqs')

# then in 1 for loop:
for i in range(len(dfSorted)):
    dfSorted['Reqs wo setup'][i] = # calculation steps
    dfSorted['Requirements'][i] = # calculation steps
    dfSorted['Other reqs'][i] = # calculation steps
Thank you
General comment: How to identify bottlenecks
To get started, you should try to identify which parts of the code are slow.
Method 1: time code sections using the time module
Wrap blocks of code in statements like this:
import time
t = time.time()
# do something
print("time elapsed: {:.1f} seconds".format(time.time() - t))
Method 2: use a profiler
E.g. Spyder has a built-in profiler. This allows you to check which operations are most time consuming.
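If you are not working in Spyder, the standard library's cProfile module does the same job from any environment (a minimal sketch; main() here is a placeholder for whatever entry point you want to profile):
import cProfile
import pstats

cProfile.run("main()", "profile_stats")          # replace main() with your own entry point
stats = pstats.Stats("profile_stats")
stats.sort_stats("cumulative").print_stats(20)   # show the 20 most expensive calls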
Vectorize your operations
Your code will be orders of magnitude faster if you vectorize your operations. It looks like your loops are all avoidable.
For example, rather than calling pd.to_datetime on every row separately, you should call it on the entire column at once
# slow (don't do this):
for i in range(len(dfConverter)):
    dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])

# fast (do this instead):
dfConverter['Day'] = pd.to_datetime(dfConverter['Entry Date'])
If you want to perform an operation on a subset of rows, you can also do this in a vectorized operation by using loc:
mask = dfConverter['Time'] <= time(hour=5,minute=0,second=0)
dfConverter.loc[mask,'Day'] = pd.to_datetime(dfConverter.loc[mask,'Entry Date']) - timedelta(days=1)
Not sure this would improve performance, but you could calculate the dependent columns at the same time row by row with DataFrame.iterrows()
for index, data in dfSorted.iterrows():
    dfSorted.loc[index, 'Reqs wo setup'] = # calculation steps
    dfSorted.loc[index, 'Requirements'] = # calculation steps
    dfSorted.loc[index, 'Other reqs'] = # calculation steps
I have several data sets consisting of 2 columns: ID and ActivityDate. Within a dataset, an ID + ActivityDate pair is unique. Each dataset is about 35 million records long, and there are 60+ datasets.
My desired output is basically ID, FirstActivityDate and LastActivityDate. This is essentially the reduce part of a map/reduce job.
My first try was basically to read the first dataset to establish the baseline, and then, as I read each subsequent dataset, do a foreach that compares and updates the LastActivityDate. Although the memory use was very acceptable (spiking at 2 GB but consistently under 1.25 GB), this took too long. By my calculation the result set should be around 1.5 GB, so it is manageable in local memory.
for x in files:
    parsedData = parseFile(x)
    dt = parsedData[0]
    cards = parsedData[1]
    for card in cards:
        #num = int(card[:16])
        if card in result:
            result[card].lastRecharged = dt
        else:
            result[card] = CreditCard(dt)
Commenting out the line num = int(card[:16]) made the loop execution drop to 30 seconds per file (originally around 150 seconds), but now the memory is out of control. The file parsing is basically a file read, which takes less than 1 second.
My second try was using pandas, but I couldn't merge the datasets the way I wanted. I must say I'm not proficient in pandas.
Is there a third option?
I ended up getting pretty close to my objective.
First I made the reading and parsing into memory concurrent and batched, using a thread pool (multiprocessing.pool.ThreadPool), with each result pushed into a queue. Each pool handles 3 consecutive files, cycling across 4 pools. In a 5th pool I pre-merge the dictionaries, dropping non-recurrent keys (cards). Then in a 6th pool I do the final merge.
The whole process takes about 75 seconds on my local machine. It ends up consuming over 4 GB of RAM, which is not ideal, but manageable.
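A much simplified sketch of that idea (assuming parseFile returns the (dt, cards) tuple used in the loop above and that the dates are comparable; the queue and the pre-merge pool are omitted here):
from multiprocessing.pool import ThreadPool

def process_file(path):
    # map step: parse one file into a partial result, card -> (first_seen, last_seen)
    dt, cards = parseFile(path)
    return {card: (dt, dt) for card in cards}

with ThreadPool(4) as pool:               # pool size is a guess; tune for your machine
    partials = pool.map(process_file, files)

# reduce step: keep the earliest first-seen and latest last-seen date per card
result = {}
for partial in partials:
    for card, (first, last) in partial.items():
        if card in result:
            old_first, old_last = result[card]
            result[card] = (min(old_first, first), max(old_last, last))
        else:
            result[card] = (first, last)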
IIUC you are interested in the first and the last ActivityDate for each ID.
If this is the case you could use dask. Let's assume all your files are CSVs stored inside a folder called data.
import dask.dataframe as dd
import pandas as pd

df = dd.read_csv("data/*.csv")

# convert ActivityDate to datetime
df["ActivityDate"] = df["ActivityDate"].astype("M8[us]")

# use aggregate
out = df.groupby("ID")\
        .agg({"ActivityDate": ["min", "max"]})\
        .compute()
out.columns = ["_".join(col) for col in out.columns]
out = out.reset_index()
out = out.rename(columns={'ActivityDate_min': "FirstActivityDate",
                          'ActivityDate_max': "LastActivityDate"})
I have an empty data frame with about 120 columns that I want to fill using data I have in a file.
I'm iterating over a file that has about 1.8 million lines.
(The lines are unstructured, I can't load them to a dataframe directly)
For each line in the file I do the following:
Extract the data I need from the current line
Copy the last row of the data frame and append it to the end: df = df.append(df.iloc[-1]). The copy is critical; most of the data in the previous row won't be changed.
Change several values in the last row according to the data I've extracted: df.iloc[-1, df.columns.get_loc('column_name')] = some_extracted_value
This is very slow; I assume the fault lies in the append.
What is the correct approach to speed things up? Preallocating the dataframe?
EDIT:
After reading the answers I did the following:
I preallocated the dataframe (saved like 10% of the time)
I replaced df = df.append(df.iloc[-1]) with df.iloc[i] = df.iloc[i-1], where i is the current iteration of the loop (saved another ~10% of the time).
I did some profiling; even though I removed the append, the main issue is copying the previous line, i.e. df.iloc[i] = df.iloc[i-1] takes about 95% of the time.
You may need plenty of memory, whichever option you choose.
However, what you should certainly avoid is using pd.DataFrame.append within a loop. This is expensive versus list.append.
Instead, aggregate to a list of lists, then feed into a dataframe. Since you haven't provided an example, here's some pseudo-code:
# initialize empty list
L = []

for line in my_binary_file:
    # extract components required from each line to a list of Python types
    line_vars = [line['var1'], line['var2'], line['var3']]
    # append to list of results
    L.append(line_vars)

# create dataframe from list of lists
df = pd.DataFrame(L, columns=['var1', 'var2', 'var3'])
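Since most of the data is carried over from the previous row, the same trick works on the plain list as well, still without touching the DataFrame inside the loop (a sketch reusing the hypothetical var1/var2/var3 names from above):
L = []
previous = [None, None, None]            # defaults for the very first row
for line in my_binary_file:
    row = previous.copy()                # start from a copy of the previous row
    row[0] = line['var1']                # overwrite only the fields this line changes
    L.append(row)
    previous = row

df = pd.DataFrame(L, columns=['var1', 'var2', 'var3'])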
The fastest way would be to load the dataframe directly via pd.read_csv().
Try separating out the logic that cleans the unstructured data into structured data, and then use pd.read_csv to load the dataframe.
If you share a sample unstructured line and the logic used to extract the structured data, that might offer some more insight.
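A minimal sketch of that two-step idea (extract_fields and raw_data.txt are placeholders for your own parsing logic and input file, since the line format was not shared):
import io
import pandas as pd

structured_lines = []
with open('raw_data.txt') as f:                # placeholder file name
    for line in f:
        fields = extract_fields(line)          # placeholder: your own line-parsing logic
        structured_lines.append(','.join(fields))

# one single load instead of row-by-row appends
df = pd.read_csv(io.StringIO('\n'.join(structured_lines)), header=None)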
Where you use append you end up copying the dataframe, which is inefficient. Try this whole thing again, but avoid this line:
df = df.append(df.iloc[-1])
You could do something like this to copy the last row to a new row (only do this if the last row contains information that you want in the new row):
df.iloc[...calculate the next available index...] = df.iloc[-1]
Then edit the last row accordingly as you have done
df.iloc[-1, df.columns.get_loc('column_name')] = some_extracted_value
You could try some multiprocessing to speed things up
import pandas
from multiprocessing.dummy import Pool as ThreadPool

def YourCleaningFunction(line):
    # for each line do the following
    # blablabla
    return  # your formatted line, comma-separated (or use the kind of function jpp just provided)

pool = ThreadPool(8)  # your number of cores
lines = open('your_big_csv.csv').read().split('\n')  # your csv as a list of lines
df = pool.map(YourCleaningFunction, lines)
df = pandas.DataFrame(df)
pool.close()
pool.join()