What is the most efficient (fastest) way to create a DataFrame? - python

I am working on a project that reads a couple of thousand text documents, creates a dataframe from them, and then trains a model on the dataframe. The most time-consuming aspect of the code is the creation of the dataframe.
Here is how I create the dataframe:
I first create 4-5 lists, then build a dictionary with the column names as keys and those lists as values, and pass the dictionary to pd.DataFrame. I have added print updates after each step, and the dataframe creation step takes the most time.
Method I am using:
line_of_interest = []
line_no = []
file_name = []
for file in file_names:
    with open(file) as txt:
        for i, line in enumerate(txt):
            if 'word of interest' in line:
                line_of_interest.append(line)
                line_no.append(i)
                file_name.append(file)
rows = {'Line_no': line_no, 'Line': line_of_interest, 'File': file_name}
df = pd.DataFrame(data=rows)
I was wondering if there is a more efficient and less time-consuming way to create the dataframe. I tried looking for similar questions and the only thing I could find was "Most Efficient Way to Create Pandas DataFrame from Web Scraped Data".
Let me know if there is a similar question with a good answer. The only other method of creating a dataframe I know is appending all the values row by row as I discover them, and I don't know a way to check whether that is quicker. Thanks!
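For reference, a minimal timing sketch (with invented stand-in data rather than the actual text files) comparing the two approaches mentioned above: accumulating into plain lists and converting once, versus growing a DataFrame row by row (newer pandas versions drop DataFrame.append, so the row-by-row version uses pd.concat). The list approach is usually dramatically faster:

import time
import pandas as pd

# Invented stand-in data; the real code would collect these from the text files.
matches = [(i, 'a line with the word of interest', f'file_{i % 10}.txt') for i in range(2000)]

# Approach 1: accumulate plain lists, build the DataFrame once at the end.
start = time.perf_counter()
line_no, lines, files = [], [], []
for no, line, fname in matches:
    line_no.append(no)
    lines.append(line)
    files.append(fname)
df_lists = pd.DataFrame({'Line_no': line_no, 'Line': lines, 'File': files})
print('lists, then one DataFrame:', time.perf_counter() - start)

# Approach 2: grow the DataFrame one row at a time (copies all the data on every step).
start = time.perf_counter()
df_rows = pd.DataFrame(columns=['Line_no', 'Line', 'File'])
for no, line, fname in matches:
    new_row = pd.DataFrame([{'Line_no': no, 'Line': line, 'File': fname}])
    df_rows = pd.concat([df_rows, new_row], ignore_index=True)
print('row-by-row concat:', time.perf_counter() - start)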

Related

Fastest way to generate new rows and append them to DataFrame

I want to change the target value in the data set within a certain interval. With 500 rows it takes about 1.5 seconds, but I have around 100,000 rows. Most of the execution time is spent in this process, and I want to speed it up.
What is the fastest and most efficient way to append rows to a DataFrame?
I tried the solution in this link and tried to create a dictionary, but I couldn't get it to work.
Here is the code, which takes around 1.5 seconds for 500 rows.
def add_new(df, base, interval):
    df_appended = pd.DataFrame()
    np.random.seed(5)
    s = np.random.normal(base, interval/3, 4)
    s = np.append(s, base)
    for i in range(0, 5):
        df_new = df
        df_new["DeltaG"] = s[i]
        df_appended = df_appended.append(df_new)
    return df_appended
DataFrames in pandas are contiguous pieces of memory, so appending or concatenating DataFrames is very inefficient - these operations create a new DataFrame and copy all the data from the old ones.
Basic Python structures such as lists and dicts are not: when you append a new element, Python just stores a pointer to it.
So my advice: do all your data processing on lists or dicts and convert them to a DataFrame at the end.
Another option is to create a preallocated DataFrame of the final size and just change values in it using .iloc. This only works if you know the final size of your resulting DataFrame in advance.
Good examples with code: Add one row to pandas DataFrame
If you need more code examples - let me know.
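To illustrate the preallocation idea, here is a minimal sketch (the column names and size are invented, not taken from the question) that creates the DataFrame at its final size up front and fills rows in place with .iloc:

import numpy as np
import pandas as pd

n_rows = 2500  # assumed: the final number of rows is known in advance
df = pd.DataFrame(index=range(n_rows), columns=['DeltaG', 'value'])

np.random.seed(5)
for i in range(n_rows):
    # write directly into row i instead of appending a new row
    df.iloc[i] = [np.random.normal(), i]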
def add_new(df1, base, interval, has_interval):
    dictionary = {}
    if has_interval == 0:
        for i in range(0, 5):
            dictionary[i] = df1.copy()
    elif has_interval == 1:
        np.random.seed(5)
        s = np.random.normal(base, interval/3, 4)
        s = np.append(s, base)
        for i in range(0, 5):
            df_new = df1
            df_new[4] = s[i]
            dictionary[i] = df_new.copy()
    return dictionary
It works. It takes around 10 seconds for the whole data set. Thanks for your answers.

How to create one large dataframe of specific excel information using python pandas from a list of excel paths

Perhaps an easy fix.
I am looking to extract specific information from many workbooks of the same style within a directory and concatenate that information into one workbook (while changing the format). I have completed every part of this task except for successfully creating one big dataframe of n columns from the different workbooks (proportional to the number of xlsx files read). Each of the read workbooks has only one sheet ['Sheet1']. Does this sound like the right approach? I am currently using a for loop to gather this data.
After much research online (GitHub, YouTube, Stack Overflow), others say to make one big dataframe and then concatenate. I have tried to use a for loop to create this dataframe; however, I have not seen users "piece together" bits of data to form a dataframe the way I have. I don't believe this should hinder the operation. I realize I am not appending or concatenating, I'm just not sure where to go with it.
for i in filepaths:              # filepaths is a list of n filepaths
    df = pd.read_excel(i)        # read the excel sheets
    info = otherslices           # condensed form of added slices from df
    Final = pd.DataFrame(info)   # expected big dataframe
The expected result is one column from each Excel workbook, directly next to each other:
Excel1    Excel2    ...    Excel(n)
info1a    info1b
info2a    info2b
info3a    info3b
...       ...
What I currently get when using print(Final) inside the loop is each workbook's column printed separately, one after the other:
Excel1
info1a
info2a
info3a
...
Excel2
info1b
info2b
info3b
...
(and so on, down to Excel(n))
However, the dataframe I get from this loop (when I type Final) contains only the very last Excel workbook's data.
I would create a list of dataframes, append to it in each loop iteration, and then after the loop concatenate the list into a single dataframe. So something like this:
Final = []
for i in filepaths:           # filepaths is a list of n filepaths
    df = pd.read_excel(i)     # read the excel sheets
    info = otherslices        # condensed form of added slices from df
    Final.append(info)        # collect the pieces
Final = pd.concat(Final)
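Since the expected output above is one column per workbook side by side, a variant of the same idea (a sketch only; otherslices is still the placeholder for your slicing logic) concatenates along the columns instead:

frames = []
for i in filepaths:
    df = pd.read_excel(i)
    info = otherslices                 # placeholder: the condensed slices from df
    frames.append(pd.DataFrame(info))
Final = pd.concat(frames, axis=1)      # one column (or block of columns) per workbook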
I discovered my own solution to this problem.
Final = pd.DataFrame(index=range(95))  # 95 is the number of rows I have for each column
n = 0
for i in filepaths:           # filepaths is a list of n filepaths
    df = pd.read_excel(i)     # read the excel sheets
    info = otherslices        # condensed form of added slices from df
    Final[n] = pd.DataFrame(info)
    n += 1
Final = Final.append(Final)   # big dataframe of n columns
Final

How to populate arrays with values read in from csv via pandas?

I have created a DataFrame using pandas by reading a csv file. What I want to do is iterate down the rows and put the values from column 1 into one array, and do the same for the values in column 2 into a different array. This seems like it would normally be a fairly easy thing to do, so I think I am missing something; however, I can't find much online that doesn't get too complicated and doesn't seem to do what I want. Stack questions like this one appear to be asking the same thing, but the answers are long and complicated. Is there no way to do this in a few lines of code? Here is what I have set up:
import pandas as pd
#available possible players
playerNames = []
df = pd.read_csv('Fantasy Week 1.csv')
What I anticipate I should be able to do would be something like:
for row in df.columns[1]:
    playerNames.append(row)
This however does not return the desired result.
Essentially, if df =
[1,2,3
4,5,6
7,8,9], I would want my array to be [1,4,7]
Do:
for row in df[df.columns[1]]:
    playerNames.append(row)
Or even better:
print(df[df.columns[1]].tolist())
In this case you want the 1st column's values so do:
for row in df[df.columns[0]]:
    playerNames.append(row)
Or even better:
print(df[df.columns[0]].tolist())
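For completeness, a couple of equivalent one-liners (a sketch, assuming the same df) that avoid the explicit loop entirely:

playerNames = df[df.columns[0]].tolist()   # values of the first column as a plain list
playerNames = df.iloc[:, 0].to_numpy()     # or as a NumPy array, if an array is needed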

Is there a more efficient tool than iterrows() in this situation?

Okay so, here's the thing. I'm working with a lot of pandas data frames and arrays. Often times, I need to pair up a value from one frame with a value from another, ideally combining the information into one frame in the end.
Say I'm looking at image files. There's a set of information specific to each file. Sometimes there's certain types of image files that share the same kind of information. Simple example:
FILEPATH,   TYPE, COLOR,   VALUE_I
/img2.jpg,  A,    'green', 0.6294
/img45.jpg, B,    'green', 0.1846
/img87.jpg, A,    'blue',  34.78
Often, this information is indexed out by type/color/value etc and fed into some other function that gives me another important output, let's say VALUE_II. But I can't concatenate it directly onto the original dataframe because the indices won't match, either because of the nature of the output or because I only fed part of the frame.
Or another situation: I learn that images of a certain TYPE have a specific value attached to them, so I make a dictionary of types and their value. Again, this column doesn't exist, so in this case I would use iterrows() to march down the frame, see if the type matches a specific key, and if it does append it to an array. Then in the end, I convert that array to a dataframe and concatenate it onto the original.
Here's the worst offender. With up to 1800 rows in each frame, it takes FOREVER:
newColumn = []
for index, row in originalDataframe.iterrows():
    for indx, rw in otherDataframe.iterrows():
        if row['filename'] in rw['filepath']:
            newColumn.append([rw['VALUE_I'], rw['VALUE_II'], rw['VALUE_III']])
newColumn = pd.DataFrame(newColumn, columns=['VALUE_I', 'VALUE_II', 'VALUE_III'])
originalDataframe = pd.concat([originalDataframe, newColumn], axis=1)
Solutions would be appreciated!
If you can split the filename out of otherDataframe["filepath"], you can then compare for equality with originalDataframe's filename, without needing to check with in. After that you can simplify the calculation with pandas.DataFrame.join, which for each filename in originalDataframe will find the same filename in otherDataframe and add all of its other columns.
import os
otherDataframe["filename"] = otherDataframe["filepath"].map(os.path.basename)
joinedDataframe = originalDataframe.join(otherDataframe.set_index("filename"), on="filename")
If there are columns with the same name in originalDataframe and otherDataframe you should set lsuffix or rsuffix.
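For example, a minimal sketch (the suffixes here are invented) of how the overlapping-column case would look:

joinedDataframe = originalDataframe.join(
    otherDataframe.set_index("filename"),
    on="filename",
    lsuffix="_orig",      # applied to overlapping columns coming from originalDataframe
    rsuffix="_other",     # applied to overlapping columns coming from otherDataframe
)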
Focusing on the second half of your question, as that's what you provided code for: your program checks every row of originalDataframe against every row of otherDataframe, yielding potentially 1800 * 1800 = 3,240,000 combinations. If there is only one possible match for each row, then adding a break will help some, but it is not ideal:
            newColumn.append([rw['VALUE_I'], rw['VALUE_II'], rw['VALUE_III']])
            break
If the structure of your data allows it, I would try something like:
ref = {}
for i, path in enumerate(otherDataframe['filepath']):
    *_, file = path.split('\\')
    ref[file] = i

originalDataframe['VALUE_I'] = None
originalDataframe['VALUE_II'] = None
originalDataframe['VALUE_III'] = None

for i, file in enumerate(originalDataframe['filename']):
    try:
        j = ref[file]
        originalDataframe.loc[i, 'VALUE_I'] = otherDataframe.loc[j, 'VALUE_I']
        originalDataframe.loc[i, 'VALUE_II'] = otherDataframe.loc[j, 'VALUE_II']
        originalDataframe.loc[i, 'VALUE_III'] = otherDataframe.loc[j, 'VALUE_III']
    except KeyError:
        pass
Here we iterate through the paths in otherDataframe (I assume they follow a pattern like C:\asdf\asdf\file), split each path on \ to pull out the file name, and then construct a dictionary mapping file names to row numbers. Next we initialize the 3 columns in originalDataframe that you want to write to.
Lastly we iterate through the files in originalDataframe, check whether that file exists in our dictionary of files from otherDataframe (done inside a try to catch missing keys), and pull the row number out of the dictionary, which we then use to write the values from other to original.
Side note: you describe your paths as being in the vein of 'C:/asd/fdg/img2.jpg', in which case you should use:
*_, file = path.split('/')
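A more pandas-idiomatic sketch of the same idea (not the code above, and assuming pandas is imported as pd and file names are unique) lets reindex do the per-row lookup instead of the Python loop:

import os

# index otherDataframe by bare file name so rows can be looked up by it directly
other_by_file = otherDataframe.set_index(
    otherDataframe['filepath'].map(os.path.basename)
)

# align the three value columns to originalDataframe's filenames (NaN where no match)
looked_up = other_by_file.reindex(originalDataframe['filename'])
originalDataframe[['VALUE_I', 'VALUE_II', 'VALUE_III']] = (
    looked_up[['VALUE_I', 'VALUE_II', 'VALUE_III']].to_numpy()
)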

How to add rows to pandas dataframe with reasonable performance

I have an empty data frame with about 120 columns, I want to fill it using data I have in a file.
I'm iterating over a file that has about 1.8 million lines.
(The lines are unstructured, I can't load them to a dataframe directly)
For each line in the file I do the following:
Extract the data I need from the current line
Copy the last row in the data frame and append it to the end: df = df.append(df.iloc[-1]). The copy is critical; most of the data in the previous row won't be changed.
Change several values in the last row according to the data I've extracted: df.iloc[-1, df.columns.get_loc('column_name')] = some_extracted_value
This is very slow, and I assume the fault is in the append.
What is the correct approach to speed things up? Preallocate the dataframe?
EDIT:
After reading the answers I did the following:
I preallocated the dataframe (saved about 10% of the time)
I replaced df = df.append(df.iloc[-1]) with df.iloc[i] = df.iloc[i-1], where i is the current iteration of the loop (saved about another 10% of the time).
I did profiling; even with the append removed, the main issue is copying the previous line, i.e. df.iloc[i] = df.iloc[i-1] takes about 95% of the time.
You may need plenty of memory, whichever option you choose.
However, what you should certainly avoid is using pd.DataFrame.append within a loop. This is expensive versus list.append.
Instead, aggregate to a list of lists, then feed into a dataframe. Since you haven't provided an example, here's some pseudo-code:
# initialize empty list
L = []

for line in my_binary_file:
    # extract components required from each line to a list of Python types
    line_vars = [line['var1'], line['var2'], line['var3']]
    # append to list of results
    L.append(line_vars)

# create dataframe from list of lists
df = pd.DataFrame(L, columns=['var1', 'var2', 'var3'])
The fastest way would be to load the dataframe directly via pd.read_csv().
Try separating out the logic that cleans the unstructured data into structured data, and then use pd.read_csv to load the dataframe.
If you can share a sample unstructured line and the logic that extracts the structured data, that might give some more insight.
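As a rough sketch of that idea (the file name, line format, and extraction rule here are invented), one way is to clean each raw line into CSV text and feed it to pd.read_csv through an in-memory buffer:

import io
import pandas as pd

def extract_fields(raw_line):
    # placeholder extraction logic; assumes three whitespace-separated fields
    parts = raw_line.split()
    return parts[0], parts[1], parts[2]

clean_rows = []
with open('unstructured.txt') as fh:       # hypothetical input file
    for raw_line in fh:
        a, b, c = extract_fields(raw_line)
        clean_rows.append(f'{a},{b},{c}')

df = pd.read_csv(io.StringIO('\n'.join(clean_rows)), names=['var1', 'var2', 'var3'])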
Where you use append you end up copying the dataframe, which is inefficient. Try the whole thing again, but avoid this line:
df = df.append(df.iloc[-1])
You could do something like this to copy the last row to a new row (only do this if the last row contains information that you want in the new row):
df.iloc[...calculate the next available index...] = df.iloc[-1]
Then edit the last row accordingly as you have done
df.iloc[-1, df.columns.get_loc('column_name')] = some_extracted_value
You could try some multiprocessing to speed things up
import pandas
from multiprocessing.dummy import Pool as ThreadPool

def YourCleaningFunction(line):
    # do the per-line cleaning here
    # ...
    return cleaned_line  # your formatted line, with commas (or use the kind of function jpp just provided)

pool = ThreadPool(8)  # your number of cores
lines = open('your_big_csv.csv').read().split('\n')  # your csv as a list of lines
df = pool.map(YourCleaningFunction, lines)
df = pandas.DataFrame(df)
pool.close()
pool.join()
