I have 100 XLS files that I would like to combine into a single CSV file. Is there a way to improve the speed of combining them all together?
The issue with using concat is that it lacks the arguments that to_csv affords me:
import glob
import pandas as pd

listOfFiles = glob.glob(file_location)
frame = pd.DataFrame()
for idx, a_file in enumerate(listOfFiles):
    print(a_file)
    data = pd.read_excel(a_file, sheet_name=0, skiprows=range(1, 2), header=1)
    frame = frame.append(data)

# Save to CSV..
frame.info()
frame.to_csv(output_dir, index=False, encoding='utf-8', date_format="%Y-%m-%d")
Using multiprocessing, you could read them in parallel using something like:
import multiprocessing
import pandas as pd
dfs = multiprocessing.Pool().map(pd.read_excel, f_names)
and then concatenate them into a single DataFrame:
df = pd.concat(dfs)
You probably should check if the first part is at all faster than
dfs = map(pd.read_excel, f_names)
YMMV - it depends on the files, the disks, etc.
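A minimal sketch of how you might time the two variants yourself; the glob pattern below is only a placeholder for wherever your .xls files live:
import glob
import multiprocessing
import time

import pandas as pd

if __name__ == "__main__":
    f_names = glob.glob("data/*.xls")  # placeholder pattern -- adjust to your files

    start = time.perf_counter()
    with multiprocessing.Pool() as pool:
        dfs_parallel = pool.map(pd.read_excel, f_names)  # parallel read
    print("parallel read:", time.perf_counter() - start, "s")

    start = time.perf_counter()
    dfs_serial = [pd.read_excel(f) for f in f_names]     # plain serial read
    print("serial read:  ", time.perf_counter() - start, "s")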
It'd be more performant to read them into a list and then call concat:
merged = pd.concat(df_list)
so something like
df_list = []
for f in xl_list:
    df_list.append(pd.read_csv(f))  # or pd.read_excel
merged = pd.concat(df_list)
The problem with repeatedly appending to a dataframe is that on each append, memory has to be allocated for the new size and the existing contents copied; you really only want to do that once.
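Applied to the XLS-to-CSV task in the question, a minimal sketch might look like this (file_location and output_dir mirror the names in the question and are placeholders here, and the skiprows/header arguments are assumed to match your files):
import glob
import pandas as pd

file_location = "path/to/xls/*.xls"  # placeholder -- same glob pattern as in the question
output_dir = "combined.csv"          # placeholder output path

# Read every workbook into a list, then concatenate once at the end.
df_list = [
    pd.read_excel(a_file, sheet_name=0, skiprows=range(1, 2), header=1)
    for a_file in glob.glob(file_location)
]
frame = pd.concat(df_list, ignore_index=True)
frame.to_csv(output_dir, index=False, encoding="utf-8", date_format="%Y-%m-%d")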
Related
Suppose I have a folder with around 1,000 .csv files stored inside it. Now I have to create a data frame based on 50 of these files, so instead of loading them one by one, is there any fast approach available?
I also want the file name to be the name of each data frame.
I tried the method below but it is not working.
# List of files that I want to load, out of 1,000
path = "..."
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
for i in range(0, len(file_names)):
    col_names[i] = pd.read_csv(path + col_name[i])
But when I try to read the variable names, nothing is displayed.
Is there any way I can achieve the desired result?
I have checked various articles, but in each of them all the data ends up concatenated at the end, whereas I want each file to be loaded individually.
IIUC, what you want to do is create multiple dfs and not concatenate them.
In that case you can read each file with read_csv and stack the returned df objects in a list:
your_paths = [...]  # paths to all of your wanted csvs
l = [pd.read_csv(i) for i in your_paths] # This will give you a list of your dfs
l[0] # One of your dfs
If you want them named, you can build a dict with the file names as keys.
You can then access each frame individually, by index or by key, depending on the data structure you use.
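For instance, a minimal sketch of the dict variant, using the file names from the question (the folder path below is a placeholder):
import os
import pandas as pd

path = "your/folder/"  # placeholder for the folder from the question
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']

# One DataFrame per file, keyed by the file name without its .csv extension.
dfs = {os.path.splitext(f)[0]: pd.read_csv(path + f) for f in file_names}

dfs['a'].head()  # the frame loaded from a.csv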
I would not recommend this, though, as it is counter-intuitive, and multiple separate df objects use a bit more memory than a single one.
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
data_frames = {}
for file_name in file_names:
    df = pd.read_csv(file_name)
    data_frames[file_name.split('.')[0]] = df
Now you can reach any data frame from the data_frames dictionary, e.g. data_frames['a'] to access the data loaded from a.csv.
Try:
import glob
import pandas as pd

p = glob.glob('folder_path_where_csv_files_stored/*.csv') #1. will return a list of all csv files in this folder, no need to type them one by one
d = [pd.read_csv(i) for i in p] #2. will create a list of dataframes: one dataframe from each csv file
df = pd.concat(d, axis=0, ignore_index=True) #3. will create one dataframe `df` from those dataframes in the list `d`
I'm making a dataframe and I need to add to it line by line. I created the df with
df = pd.DataFrame(columns=('date', 'daily_high', 'daily_low'))
then I'm reading data from an API, so I run
for api in api_list:
    with urllib.request.urlopen(api) as url:
        data = json.loads(url.read().decode())
and I need to put different attributes from data in to the dataframe.
I tried to put
df = df.append({'date': datetime.fromtimestamp(data["currently"]["time"]).strftime("20%y%m%d"),
                'daily_high': data["daily"]["data"][0]["temperatureHigh"],
                'daily_low': data["daily"]["data"][0]["temperatureLow"]},
               ignore_index=True)
in the for loop, but it was taking a long time and I'm not sure if this is good practice. Is there a better way to do this? Maybe I could create three separate series and join them together?
pandas.DataFrame.append is inefficient for iterative approaches.
From the documentation:
Iteratively appending rows to a DataFrame can be more computationally
intensive than a single concatenate. A better solution is to append
those rows to a list and then concatenate the list with the original
DataFrame all at once.
As mentioned, concatenating results will be more efficient, but in your case using pandas.DataFrame.from_dict would be even more convenient.
Also, I would use the requests library for requesting the URLs.
import requests
import pandas as pd
from datetime import datetime

d = {}
d['date'] = []
d['daily_high'] = []
d['daily_low'] = []
for api_url in api_list:
    data = requests.get(api_url).json()
    d['date'].append(datetime.fromtimestamp(data["currently"]["time"]).strftime("20%y%m%d"))
    d['daily_high'].append(data["daily"]["data"][0]["temperatureHigh"])
    d['daily_low'].append(data["daily"]["data"][0]["temperatureLow"])
df = pd.DataFrame.from_dict(d)
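For comparison, a sketch of the "collect rows in a list, build the DataFrame once" pattern from the documentation quote above; api_list and the JSON field names are taken from the question:
import requests
import pandas as pd
from datetime import datetime

rows = []
for api_url in api_list:  # api_list as defined in the question
    data = requests.get(api_url).json()
    rows.append({'date': datetime.fromtimestamp(data["currently"]["time"]).strftime("20%y%m%d"),
                 'daily_high': data["daily"]["data"][0]["temperatureHigh"],
                 'daily_low': data["daily"]["data"][0]["temperatureLow"]})

df = pd.DataFrame(rows)  # one DataFrame construction at the end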
I have around 20+ xlsx files, and each xlsx file might contain a different number of worksheets. Thankfully, the columns are the same in all worksheets and all xlsx files. By referring to here, I got some ideas. I have been trying a few ways to import and append all Excel files (all worksheets) into a single dataframe (around 4 million rows of records).
Note: I did check here as well, but it only covers the file level; mine goes from the file level down to the worksheet level.
I have tried the code below:
# import all necessary packages
import pandas as pd
from pathlib import Path
import glob
import sys

# set source path
source_dataset_path = "C:/Users/aaa/Desktop/Sample_dataset/"
source_dataset_list = glob.iglob(source_dataset_path + "Sales transaction *")

out_df = pd.DataFrame()  ## create empty output dataframe (outside the loop so it collects every file)
for file in source_dataset_list:
    sys.stdout.write(str(file))
    sys.stdout.flush()
    xls = pd.ExcelFile(file)
    for sheet in xls.sheet_names:  ## view the excel file's sheet names
        sys.stdout.write(str(sheet))
        sys.stdout.flush()
        df = pd.read_excel(file, sheet_name=sheet)
        out_df = out_df.append(df)  ## this appends rows of one dataframe to another (just like the expected output)
Question:
My approach is to first read every single Excel file and get a list of its sheets, then load each sheet and append them all. The looping seems not very efficient, especially since the data size grows with every append.
Is there a more efficient way to import and append all sheets from multiple Excel files?
Use sheet_name=None in read_excel to return a dict of DataFrames, one per sheet name (an OrderedDict in older pandas), then join them together with concat, and finally append the result to the output DataFrame:
out_df = pd.DataFrame()
for f in source_dataset_list:
    df = pd.read_excel(f, sheet_name=None)
    cdf = pd.concat(df.values())
    out_df = out_df.append(cdf, ignore_index=True)
Another solution:
cdf = [pd.read_excel(excel_names, sheet_name=None).values()
for excel_names in source_dataset_list]
out_df = pd.concat([pd.concat(x) for x in cdf], ignore_index=True)
If I understand your problem correctly, setting sheet_name=None in pd.read_excel does the trick.
import os
import pandas as pd
path = "C:/Users/aaa/Desktop/Sample_dataset/"
dfs = [
pd.concat(pd.read_excel(path + x, sheet_name=None))
for x in os.listdir(path)
if x.endswith(".xlsx") or x.endswith(".xls")
]
df = pd.concat(dfs)
I have a pretty straightforward solution if you want to read all the sheets.
import pandas as pd
df = pd.concat(pd.read_excel(path+file_name, sheet_name=None),
ignore_index=True)
I am having some issues with the code below. Its purpose is to take a list of lists, where each inner list carries a series of csv files. I want to loop through each of these lists (one at a time) and pull in only the csv files found in the respective list.
My current code is accumulating all the data instead of starting from scratch on each loop. On the first loop it should use all the csv files at the 0th index, on the second loop all the csv files at the 1st index, and so on, but without accumulating.
path = "C:/DataFolder/"
allFiles = glob.glob(path + "/*.csv")
fileChunks = [['2003.csv','2004.csv','2005.csv'],['2006.csv','2007.csv','2008.csv']]
for i in range(len(fileChunks)):
    """move empty dataframe here"""
    df = pd.DataFrame()
    for file_ in fileChunks[i]:
        df_temp = pd.read_csv(file_, index_col=None, names=names, parse_dates=True)
        df = df.append(df_temp)
Note: fileChunks is derived from a function, and it spits out a list of lists like the example above.
Any pointers to documentation or to my error would be great - I want to learn from this. Thank you.
EDIT
It seems that moving the empty dataframe to within the first for loop works.
This should unnest your files and read each separately using a list comprehension, and then join them all using concat. This is much more efficient than appending each read to a growing dataframe.
df = pd.concat([pd.read_csv(file_, index_col=None, names=names, parse_dates=True)
for chunk in fileChunks for file_ in chunk],
ignore_index=True)
>>> [file_ for chunk in fileChunks for file_ in chunk]
['2003.csv', '2004.csv', '2005.csv', '2006.csv', '2007.csv', '2008.csv']
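If you instead want one DataFrame per inner list (as in the edited question) rather than a single combined frame, a similar comprehension works per chunk; a sketch using the fileChunks example from the question, with the names argument left out:
import pandas as pd

fileChunks = [['2003.csv', '2004.csv', '2005.csv'],
              ['2006.csv', '2007.csv', '2008.csv']]

# One concatenated DataFrame per inner list; nothing carries over between chunks.
chunk_dfs = [
    pd.concat((pd.read_csv(file_, index_col=None, parse_dates=True) for file_ in chunk),
              ignore_index=True)
    for chunk in fileChunks
]

chunk_dfs[0]  # rows from 2003.csv-2005.csv only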
I read a census ACS file into an IPython Notebook in chunks using:
pusb = pd.read_csv('ss14pusb.csv', low_memory=False, chunksize = 25000)
Then I selected some columns I want to keep and use for analysis. Now I want to export pusb to a txt or csv file, but pusb.to_csv(...) didn't work. How do I do this? Is there a way to concatenate the chunks once I read them so that they're one data frame?
Thanks in advance!
You can try the concat function:
pusb = pd.read_csv('ss14pusb.csv', low_memory=False, chunksize=25000)
print(pusb)
#<pandas.io.parsers.TextFileReader object at 0x00000000150E0048>
df = pd.concat(pusb, ignore_index=True)
I think it is necessary to add the ignore_index parameter to concat, to avoid duplicate index values.
Let me try to explain it better:
pusb = pd.read_csv('ss14pusb.csv', low_memory=False, chunksize = 25000)
This reads the csv in chunks (see the read_csv docs), and its output is a TextFileReader, not a DataFrame.
You can check this iterable object by:
for chunk in pusb:
    print(chunk)
And then you need to concatenate the chunks into one big DataFrame - use concat; see the "Concatenating objects" section of the pandas docs.
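Putting it together for the original question - a minimal sketch, where the column names and the output file name are placeholders for the ones you actually use:
import pandas as pd

cols_to_keep = ['col_a', 'col_b']  # placeholder names for the columns you selected

# Read in chunks, keep only the wanted columns per chunk, then concatenate once.
reader = pd.read_csv('ss14pusb.csv', low_memory=False, chunksize=25000)
pusb = pd.concat((chunk[cols_to_keep] for chunk in reader), ignore_index=True)

pusb.to_csv('pusb_selected.csv', index=False)  # now a regular DataFrame, so to_csv works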