I need to extract some data from 37,000 xls files, which are stored in 2,100 folders (activity/year/month/day). I already wrote the script, but when given a small sample of a thousand files, it takes 5 minutes to run. Each individual file can include up to ten thousand entries I need to extract. I haven't tried running it on the entire folder yet; I'm looking for suggestions on how to make it more efficient.
I would also like some help on how to export the dictionary to a new Excel file with two columns, or how to skip the dictionary entirely and save directly to xls, and how to point the entire script at a shared-drive folder instead of Python's root.
import fnmatch
import os
import pandas as pd

docid = []
CoCo = []

for root, dirs, files in os.walk('Z_Option'):
    for filename in files:
        if fnmatch.fnmatch(filename, 'Z_*.xls'):
            df = pd.read_excel(os.path.join(root, filename), sheet_name='Sheet0')
            for i in df['BLDAT']:
                if isinstance(i, int):
                    docid.append(i)
                    CoCo.append(df['BUKRS'].iloc[1])

data = dict(zip(docid, CoCo))
print(data)
This walkthrough was very helpful for me when I was beginning with pandas. What is likely taking so long is the for i in df['BLDAT'] line.
Using something like an apply function can offer a speed boost:
def check_if_int(row):  # row is effectively a pd.Series representing one row
    if isinstance(row['BLDAT'], int):
        docid.append(row['BLDAT'])
        CoCo.append(row.name)  # name is the row's index label

df.apply(check_if_int, axis=1)  # axis=1 applies the function row-wise
It's unclear what exactly this script is trying to do, but if it's as simple as filtering the dataframe to only include rows where the 'BLDAT' column is an integer, using a mask would be much faster:
df_filtered = df.loc[df['BLDAT'].apply(lambda x: isinstance(x, int))]  # boolean mask built with isinstance
Another advantage of filtering the dataframe rather than building lists is that you can use df_filtered.to_csv() to write the result to a CSV file that Excel can open (or df_filtered.to_excel() for a real .xlsx).
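For example, here is a minimal sketch putting the mask and the export together (the input path is a placeholder, and the BLDAT/BUKRS column names are taken from the question):

import pandas as pd

df = pd.read_excel('some_file.xls', sheet_name='Sheet0')   # placeholder path

# boolean mask: keep only the rows whose BLDAT value is an integer
mask = df['BLDAT'].apply(lambda x: isinstance(x, int))
df_filtered = df.loc[mask, ['BLDAT', 'BUKRS']]

# write the two columns out; Excel opens CSV files, or use to_excel() for a real .xlsx
df_filtered.to_csv('output.csv', index=False)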
Eventually I gave up due to time constraints (yay last minute "I need this tomorrow" reports), and came up with this. Dropping empty rows helped by some margin, and for the next quarter I'll try to do this entirely with pandas.
# Shared drive
import fnmatch
import os
import pandas as pd
import time

start_time = time.time()
docid = []
CoCo = []
errors = []

os.chdir(r"X:\Shared activities")  # raw string so the backslashes aren't treated as escapes
for root, dirs, files in os.walk("folder"):
    for filename in files:
        if fnmatch.fnmatch(filename, 'Z_*.xls'):
            try:
                df = pd.read_excel(os.path.join(root, filename), sheet_name='Sheet0')
                df.dropna(subset=['BLDAT'], inplace=True)  # dropping empty rows speeds up the loop
                for i in df['BLDAT']:
                    if isinstance(i, int):
                        docid.append(i)
                        CoCo.append(df['BUKRS'].iloc[1])
            except Exception:
                errors.append(os.path.join(root, filename))

data = dict(zip(docid, CoCo))
os.chdir(r"C:\project\reports")
pd.DataFrame.from_dict(data, orient="index").to_csv('test.csv')

with open('errors.csv', 'w') as f:
    for item in errors:
        f.write("%s\n" % item)

print("--- %s seconds ---" % (time.time() - start_time))
Related
I want to create an algorithm to extract data from CSV files in different folders/subfolders. Each folder will have about 9,000 CSVs, and there will be 12 of them: 12 * 9,000, over 100,000 files.
If the files have consistent structure (column names and column order), then dask can create a large lazy representation of the data:
from dask.dataframe import read_csv
ddf = read_csv('my_path/*/file_*.csv')
# do something
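Nothing is actually read until a result is requested, so the "do something" step could, for example, materialize the frame or write it back out; a small sketch (the output path is made up, and the Parquet step needs pyarrow or fastparquet installed):

# bring the combined result into memory as a regular pandas DataFrame
df = ddf.compute()

# or keep it out of memory and write the combined data to partitioned Parquet
ddf.to_parquet('combined_output/')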
This is a working solution for over 100,000 files.
Credits : Abhishek Thakur - https://twitter.com/abhi1thakur/status/1358794466283388934
import pandas as pd
import glob
import time

start = time.time()

path = 'csv_test/data/'
all_files = glob.glob(path + "/*.csv")

l = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    l.append(df)

frame = pd.concat(l, axis=0, ignore_index=True)
frame.to_csv('output.csv', index=False)

end = time.time()
print(end - start)
I'm not sure whether it can handle data of around 200 GB; feedback on this would be appreciated.
You can read CSV files using pandas and store them space-efficiently on disk:
import pandas as pd
file = "your_file.csv"
data = pd.read_csv(file)
data = data.astype({"column1": int})
data.to_hdf("new_filename.hdf", "key")
Depending on the contents of your file, you can make adjustments to read_csv as described here:
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Make sure that after you've read your data in as a dataframe, the column dtypes match the values they actually hold. This way you can save a lot of memory, and also disk space later when writing these dataframes out. You can use astype to make these adjustments.
After you've done that, store your dataframe to disk with to_hdf.
If your data has a compatible schema across CSV files, you can also append the dataframes to each other to build one larger dataframe.
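As a rough sketch of that combined flow (the folder, column name, dtype, and store key below are made up for illustration, and to_hdf needs the optional PyTables dependency):

import glob
import pandas as pd

frames = []
for path in glob.glob("data/*.csv"):          # hypothetical input folder
    df = pd.read_csv(path)
    df = df.astype({"column1": int})          # adjust dtypes to what the column really holds
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
combined.to_hdf("combined.hdf", key="data")   # filename and key are arbitrary examples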
I have the following function:
def json_to_pickle(json_path=REVIEWS_JSON_DIR,
                   pickle_path=REVIEWS_PICKLE_DIR,
                   force_update=False):
    '''Create a pickled dataframe from selected JSON files.'''
    current_date = datetime.today().strftime("%Y%m%d")
    filename_safe_path = json_path.replace("/", "_")

    # Get a list of all JSON files from a given directory
    in_paths = Path(json_path).glob("**/*.json")
    in_paths_list = []
    for path in in_paths:  # Convert generator to a simple list
        in_paths_list.append(path)

    out_path = (f"{pickle_path}"
                f"/{current_date}"
                f"_{filename_safe_path}"
                f"_reviews.pkl")

    if exists(out_path) == False or force_update == True:
        pprint("Creating a new pickle file from raw JSON files")
        df = pd.DataFrame()
        for path in tqdm(in_paths_list):
            with open(path, "r") as file:
                normalized_json_df = pd.json_normalize(json.load(file))
                df = pd.concat([df, normalized_json_df])
        df.to_pickle(out_path, compression="infer")
    else:
        pprint(f"Using existing pickle file ({out_path})")

    return out_path
Unless a pickle file already exists, it finds all JSON files in a given directory (including all subdirectories), normalizes them, concatenates them into a dataframe, and saves the dataframe to disk as a pickle. It takes 54 minutes to process 240,255 JSON files.
According to tqdm, the for-loop averages 73.51 iterations per second (running on an M1 MacBook Pro with 10 cores), but it seems to get slower over time, presumably as the dataframe grows larger; it starts at around 1668.44 iterations per second.
All the JSON files have an identical structure, but a couple of fields may be missing. The size varies between 500 bytes and 2 KB. Here is the JSON spec from Google's documentation.
What can I do to speed up this for-loop?
Edit:
This is how I changed the for-loop following the selected answer:
data_frames = []

for path in tqdm(in_paths_list):
    with open(path, "r") as file:
        data_frames.append(pd.json_normalize(json.load(file)))

pd.concat(data_frames).to_pickle(out_path, compression="infer")
Now it finishes in 1 minute and 29 seconds. Quite the improvement!
Instead of loading each file and appending it to an ever bigger temporary dataframe, load all files in dataframes and concatenate them in a single operation.
The current code loads the N dataframes that correspond to the files and creates N-1 ever bigger dataframes with the exact same data.
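In code, the difference between the two patterns looks roughly like this (load_one and paths are hypothetical placeholders for whatever reads a single file):

import pandas as pd

def load_one(path):
    # placeholder for whatever reads a single file into a DataFrame
    return pd.json_normalize({"path": str(path)})

paths = ["a.json", "b.json", "c.json"]   # hypothetical file list

# anti-pattern: every iteration copies all rows accumulated so far
slow = pd.DataFrame()
for path in paths:
    slow = pd.concat([slow, load_one(path)])

# better: build all the frames first, then concatenate once
fast = pd.concat([load_one(path) for path in paths])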
I use this code to load all Excel files in a folder into a single dataframe:
root = rf"C:\path\to\folder"
matches = Path(root).glob("22*.xls*")
files = tqdm((pd.read_excel(file, skipfooter=1)
              for file in matches
              if not file.name.startswith("~")))
df = pd.concat(files)
In the question's case you could use:
def load_json(path):
    with open(path, "r") as file:
        df = pd.json_normalize(json.load(file))
    return df

in_paths = Path(json_path).glob("**/*.json")
files = tqdm((load_json(file) for file in in_paths))
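Then, just as in the Excel example above, concatenate everything in a single operation:

df = pd.concat(files)   # one concatenation instead of N-1 incremental ones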
Problem: Efficiently reading in multiple .dta files at once without crashing.
Progress: I can currently read in one large .dta file, pickle it, and merge it without exceeding memory capacity. When I try to loop this by setting a dynamic variable n and calling it from a list or dictionary, I do not get 3 separate DataFrame objects; instead I get a list in which only one of the values is a DataFrame object.
Current Code:
import pandas as pd
import pickle
import glob
import os

n = 0  # Change variable value when reading in new DF

# Name of variables, lists, and path
path = r"directory\directory\\"
fname = ["survey_2000_ucode.dta", "2010_survey_ucode.dta", "2020_survey.dta"]
chunks = ["chunks_2000", "chunks_2010", "chunks_2020"]
input_path = [path + fname[0], path + fname[1], path + fname[2]]
output_path = [path + chunks[0], path + chunks[1], path + chunks[2]]

# Create folders and directory if it does not exist
for chunk in chunks:
    if not os.path.exists(os.path.join(path, chunk)):
        os.mkdir(os.path.join(path, chunk))

CHUNK_SIZE = 100000  # Input size of chunks ~ around 14MB

# Read in .dta files in chunks and output pickle files
reader = pd.read_stata(input_path[n], chunksize=CHUNK_SIZE, convert_categoricals=False)
for i, chunk in enumerate(reader):
    output_file = output_path[n] + "/chunk_{}.pkl".format(i + 1)
    with open(output_file, "wb") as f:
        pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)

# Read in pickle files and append one by one into DataFrame
pickle_files = []
for name in glob.glob(output_path[n] + "/chunk_*.pkl"):
    pickle_files.append(name)

# Create a list/dictionary of dataframes and append data
dfs = ["2000", "2010", "2020"]
dfs[n] = pd.DataFrame([])
for i in range(len(pickle_files)):
    dfs[n] = dfs[n].append(pd.read_pickle(pickle_files[i]), ignore_index=True)
Current Output: No df2000, df2010, or df2020 DataFrames are output. Instead, the DataFrame with my data is the first object in the dfs list. Basically, in the dfs list:
index 0 is a DataFrame with 2,442,717 rows and 34 columns;
index 1 is a string value of 2010 and;
index 2 is a string value of 2020.
Desired Output:
Read in multiple large data files efficiently and create multiple separate DataFrames at once.
Advice/suggestions on interacting with (i.e. cleaning, wrangling, manipulating, etc.) the multiple large DataFrames once they are read in, without crashing or long waits when running a single line of code.
All help and input is greatly appreciated. Thank you for your time and consideration. I apologize for not being able to share pictures of my results and datasets, as I am accessing them through a secured connection and have no internet access.
I am relatively new to Python (about a week's experience) and I can't seem to find the answer to my problem.
I am trying to merge hundreds of CSV files in my folder Data into a single CSV file, matching on column name.
The solutions I have found require me to type out either each file name or the column headers, which would take days.
I used this code to create one CSV file, but the column names move around, so the data does not end up in the same columns across the whole DataFrame:
import pandas as pd
import glob
import os

def concatenate(indir=r"C:\\Users\ge\Documents\d\de",
                outfile=r"C:\Users\ge\Documents\d"):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    dfList = []
    for filename in fileList:
        print(filename)
        df = pd.read_csv(filename, header=None)
        dfList.append(df)
    concatDf = pd.concat(dfList, axis=0)
    concatDf.to_csv(outfile, index=None)
Is there a quick-fire method to do this, as I have less than a week to run statistics on the dataset?
Any help would be appreciated.
Here is one memory-efficient way to do that.
from pathlib import Path
import csv

indir = Path(r'C:\\Users\gerardchurch\Documents\Data\dev_en')
outfile = Path(r"C:\\Users\gerardchurch\Documents\Data\output.csv")

def find_header_from_all_files(indir):
    columns = set()
    print("Looking for column names in", indir)
    for f in indir.glob('*.csv'):
        with f.open() as sample_csv:
            sample_reader = csv.DictReader(sample_csv)
            try:
                first_row = next(sample_reader)
            except StopIteration:
                print("File {} doesn't contain any data. Double check this".format(f))
                continue
            else:
                columns.update(first_row.keys())
    return columns

columns = find_header_from_all_files(indir)
print("The columns are:", sorted(columns))

with outfile.open('w') as outf:
    wr = csv.DictWriter(outf, fieldnames=list(columns))
    wr.writeheader()
    for inpath in indir.glob('*.csv'):
        print("Parsing", inpath)
        with inpath.open() as infile:
            reader = csv.DictReader(infile)
            wr.writerows(reader)

print("Done, find the output at", outfile)
This should handle the case when one of the input CSVs doesn't contain all of the columns.
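That works because csv.DictWriter fills fields that are missing from a given row with its restval, which defaults to an empty string; a tiny self-contained illustration:

import csv, io

buf = io.StringIO()
wr = csv.DictWriter(buf, fieldnames=['a', 'b', 'c'])
wr.writeheader()
wr.writerow({'a': 1, 'c': 3})   # this row has no 'b', so that field is written as empty
print(buf.getvalue())
# a,b,c
# 1,,3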
I am not sure if I understand your problem correctly, but this is one of the ways that you can merge your files without giving any column names:
import pandas as pd
import glob
import os

def concatenate(indir):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    output_file = pd.concat([pd.read_csv(filename) for filename in fileList])
    output_file.to_csv("_output.csv", index=False)

concatenate(indir=r"C:\\Users\gerardchurch\Documents\Data\dev_en")
I am puzzled by the following problem. I have a set of CSV files, which I parse iteratively. Before collecting the dataframes in a list, I apply some function (as simple as tmp_df*2) to each tmp_df. It all worked perfectly fine at first glance, until I realized I had inconsistencies in the results from run to run.
For example, when I apply df.std() I might receive for a first run:
In[2]: df1.std()
Out[2]:
some_int 15281.99
some_float 5.302293
and for a second run after:
In[3]: df2.std()
Out[3]:
some_int 15281.99
some_float 6.691013
Strangely, I do not observe inconsistencies like this when I don't manipulate the parsed data (i.e. when I simply comment out tmp_df = tmp_df*2). I also noticed that for columns with int datatypes the results are consistent from run to run, which does not hold for floats. I suspect it has to do with floating-point precision. I also cannot establish a pattern in how they vary; I might get the same results for two or three consecutive runs. Maybe someone has an idea whether I am missing something here. I am working on a replication example and will edit it in as soon as possible, as I cannot share the underlying data. Maybe someone can shed some light on this in the meantime. I am using Windows 8.1, pandas 0.17.1, Python 3.4.3.
Code example:
import pandas as pd
import numpy as np

data_list = list()
csv_files = ['a.csv', 'b.csv', 'c.csv']

for csv_file in csv_files:
    # load csv_file
    tmp_df = pd.read_csv(csv_file, index_col='ID', dtype=np.float64)
    # replace infs by na
    tmp_df.replace([np.inf, -np.inf], np.nan, inplace=True)
    # manipulate tmp_df
    tmp_df = tmp_df*2
    data_list.append(tmp_df)

df = pd.concat(data_list, ignore_index=True)
df.reset_index(inplace=True)
Update:
Running the same code and data on a Unix system works perfectly fine.
Edit:
I have managed to re-create the problem; the example below should run on both Windows and Unix. I've tested on Windows 8.1 and see the same problem when with_function=True (typically after 1-5 runs); on Unix it runs without problems. With with_function=False there are no differences on either Windows or Unix. I can also reject the hypothesis that it is an int-vs-float issue, as the simulated int columns also differ...
Here is the code:
import pandas as pd
import numpy as np
from pathlib import Path
from tempfile import gettempdir


def simulate_csv_data(tmp_dir, num_files=5):
    """ simulate a csv files
    :param tmp_dir: Path, csv files are saved to
    :param num_files: int, how many csv files to simulate
    :return:
    """
    rows = 20000
    columns = 5
    np.random.seed(1282)
    for file_num in range(num_files):
        file_path = tmp_dir.joinpath(''.join(['df_', str(file_num), '.csv']))
        simulated_df = pd.DataFrame(np.random.standard_normal((rows, columns)))
        simulated_df['some_int'] = np.random.randint(0, 100)
        simulated_df.to_csv(str(file_path))


def get_csv_data(tmp_dir, num_files=5, with_function=True):
    """ Collect various csv files and return a concatenated dfs
    :param tmp_dir: Path, csv files are saved to
    :param num_files: int, how many csv files to simulate
    :param with_function: Bool, apply function to tmp_dataframe
    :return:
    """
    data_list = list()
    for file_num in range(num_files):
        # current file path
        file_path = tmp_dir.joinpath(''.join(['df_', str(file_num), '.csv']))
        # load csv_file
        tmp_df = pd.read_csv(str(file_path), dtype=np.float64)
        # replace infs by na
        tmp_df.replace([np.inf, -np.inf], np.nan, inplace=True)
        # apply function to tmp_dataframe
        if with_function:
            tmp_df = tmp_df*2
        data_list.append(tmp_df)
    df = pd.concat(data_list, ignore_index=True)
    df.reset_index(inplace=True)
    return df


def main():
    # INPUT ----------------------------------------------
    num_files = 5
    with_function = True
    max_comparisons = 50
    # ----------------------------------------------------
    tmp_dir = gettempdir()
    # use temporary "non_existing" dir for new file
    tmp_csv_folder = Path(tmp_dir).joinpath('csv_files_sdfs2eqqf')
    # if exists already don't simulate data/files again
    if tmp_csv_folder.exists() is False:
        tmp_csv_folder.mkdir()
        print('Simulating temp files...')
        simulate_csv_data(tmp_csv_folder, num_files)
    print('Getting benchmark data frame...')
    df1 = get_csv_data(tmp_csv_folder, num_files, with_function)
    df_is_same = True
    count_runs = 0
    # Run until different df is found or max runs exceeded
    print('Comparing data frames...')
    while df_is_same:
        # get another data frame
        df2 = get_csv_data(tmp_csv_folder, num_files, with_function)
        count_runs += 1
        # compare data frames
        if df1.equals(df2) is False:
            df_is_same = False
            print('Found unequal df after {} runs'.format(count_runs))
            # print out a standard deviations (arbitrary example)
            print('Std Run1: \n {}'.format(df1.std()))
            print('Std Run2: \n {}'.format(df2.std()))
        if count_runs > max_comparisons:
            df_is_same = False
            print('No unequal df found after {} runs'.format(count_runs))
    print('Delete the following folder if no longer needed: "{}"'.format(
        str(tmp_csv_folder)))


if __name__ == '__main__':
    main()
Your variations are caused by something else, such as the input data changing between executions, or source-code changes.
Float precision never gives different results between executions.
By the way, clean up your examples and you will find the bug: at the moment you say something about an int but display a decimal value instead.
Update numexpr to 2.4.6 (or later); numexpr 2.4.4 had some bugs on Windows. After running the update, the example works consistently for me.
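To check which numexpr version is actually in use before and after upgrading, a quick sketch:

import numexpr
import pandas as pd

print(numexpr.__version__)   # anything below 2.4.6 is suspect here

# pandas also lists numexpr among its (optional) dependency versions
pd.show_versions()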