Problem: Efficiently reading in multiple .dta files at once without crashing.
Progress: I can currently read in one large .dta file in chunks, pickle the chunks, and reassemble them without exceeding memory capacity. When I try to loop the process by setting a dynamic variable n and indexing into a list or dictionary, I do not get 3 separate DataFrame objects; instead I get a list in which one of the values is a DataFrame object.
Current Code:
import pandas as pd
import pickle
import glob
import os
n = 0 # Change variable value when reading in new DF
# Name of variables, lists, and path
path = r"directory\directory\\"
fname = ["survey_2000_ucode.dta", "2010_survey_ucode.dta", "2020_survey.dta"]
chunks = ["chunks_2000", "chunks_2010", "chunks_2020"]
input_path = [path + fname[0], path + fname[1], path + fname[2]]
output_path = [path + chunks[0], path + chunks[1], path + chunks[2]]
# Create folders and directory if it does not exist
for chunk in chunks:
    if not os.path.exists(os.path.join(path, chunk)):
        os.mkdir(os.path.join(path, chunk))
CHUNK_SIZE = 100000 # Input size of chunks ~ around 14MB
# Read in .dta files in chunks and output pickle files
reader = pd.read_stata(input_path[n], chunksize=CHUNK_SIZE, convert_categoricals=False)
for i, chunk in enumerate(reader):
    output_file = output_path[n] + "/chunk_{}.pkl".format(i + 1)
    with open(output_file, "wb") as f:
        pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)
# Read in pickle files and append one by one into DataFrame
pickle_files = []
for name in glob.glob(output_path[n] + "/chunk_*.pkl"):
    pickle_files.append(name)
# Create a list/dictionary of dataframes and append data
dfs = ["2000", "2010", "2020"]
dfs[n] = pd.DataFrame([])
for i in range(len(pickle_files)):
    dfs[n] = dfs[n].append(pd.read_pickle(pickle_files[i]), ignore_index=True)
Current Output: No df2000, df2010, or df2020 DataFrames are produced. Instead, the DataFrame holding my data ends up as the first object in the dfs list. In other words, in the dfs list:
index 0 is a DataFrame with 2,442,717 rows and 34 columns;
index 1 is a string value of 2010 and;
index 2 is a string value of 2020.
Desired Output:
Read in multiple large data files efficiently and create separate multiple DataFrames at once.
Advice/suggestions on interacting with (e.g. cleaning, wrangling, manipulating) the multiple large DataFrames once they are read in, without crashing or waiting a long time every time a line of code runs.
All help and input is greatly appreciated. Thank you for your time and consideration. I apologize for not being able to share pictures of my results and datasets, as I am accessing them through a secured connection with no internet access.
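For reference, a minimal sketch of how the loop could produce three separately named DataFrames is to key a dictionary by year instead of reusing a single index n. This reuses the paths and CHUNK_SIZE defined above and swaps the deprecated append for a single pd.concat; it is a sketch, not a tested solution:
years = ["2000", "2010", "2020"]
dfs = {}  # year -> DataFrame
for n, year in enumerate(years):
    # write the chunks for this survey file
    reader = pd.read_stata(input_path[n], chunksize=CHUNK_SIZE, convert_categoricals=False)
    for i, chunk in enumerate(reader):
        with open(output_path[n] + "/chunk_{}.pkl".format(i + 1), "wb") as f:
            pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)
    # read the chunks back and concatenate them once
    pickle_files = glob.glob(output_path[n] + "/chunk_*.pkl")
    dfs[year] = pd.concat((pd.read_pickle(p) for p in pickle_files), ignore_index=True)
# each survey is then available as dfs["2000"], dfs["2010"], dfs["2020"]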
Related
I want to create an algorithm to extract data from csv files in different folders/subfolders. Each folder will have 9,000 csvs, and there will be 12 of them: 12 * 9,000, over 100,000 files.
If the files have consistent structure (column names and column order), then dask can create a large lazy representation of the data:
from dask.dataframe import read_csv
ddf = read_csv('my_path/*/file_*.csv')
# do something
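As a hedged illustration of working with the lazy dataframe (the column name "value" below is hypothetical), nothing is actually read until .compute() is called:
print(ddf.npartitions)                       # task graph only, no data loaded yet
summary = ddf["value"].describe().compute()  # reduce first, then materialise the small result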
This is a working solution for over 100,000 files.
Credits : Abhishek Thakur - https://twitter.com/abhi1thakur/status/1358794466283388934
import pandas as pd
import glob
import time
start = time.time()
path = 'csv_test/data/'
all_files = glob.glob(path + "/*.csv")
l = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    l.append(df)
frame = pd.concat(l, axis = 0, ignore_index = True)
frame.to_csv('output.csv', index = False)
end = time.time()
print(end - start)
Not sure if it can handle data of size 200 GB; feedback on this would be appreciated.
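If 200 GB will not fit in memory at once, a hedged variation (same file list as above; the chunk size is arbitrary) is to stream each file into the output in pieces instead of collecting every dataframe in a list first:
first = True
for filename in all_files:
    for chunk in pd.read_csv(filename, chunksize=500_000):
        # write the header only once, then keep appending
        chunk.to_csv('output.csv', mode='a', header=first, index=False)
        first = False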
You can read CSV-files using pandas and store them space efficiently on disk:
import pandas as pd
file = "your_file.csv"
data = pd.read_csv(file)
data = data.astype({"column1": int})
data.to_hdf("new_filename.hdf", "key")
Depending on the contents of your file, you can make adjustments to read_csv as described here:
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Make sure that, after you've read your data in as a dataframe, the column dtypes match the values they actually hold. This way you can save a lot of memory, and also storage later when saving these dataframes to disk. You can use astype to make these adjustments.
After you've done that, store your dataframe to disk with to_hdf.
If your data is compatible across csv-files, you can append the dataframes onto each other into a larger dataframe.
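A hedged sketch of that workflow (the file pattern, column names, and dtypes are placeholders, and to_hdf requires the PyTables package):
import glob
import pandas as pd

frames = []
for file in glob.glob("data/*.csv"):
    df = pd.read_csv(file)
    # give each column an appropriate dtype to shrink memory usage
    df = df.astype({"column1": "int32", "category_column": "category"})
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
combined.to_hdf("combined.hdf", key="data", mode="w")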
If I have for example 3 txt files that looks as follows:
file1.txt:
a 10
b 20
c 30
file2.txt:
d 40
e 50
f 60
file3.txt:
g 70
h 80
i 90
I would like to read this data from the files and create a single excel file that will look like this:
Specifically in my case I have 100+ txt files that I read using glob and loop.
Thank you
There's a bit of logic involved in getting the output you need.
First, process the input files into separate lists. You might need to adjust this logic depending on the actual contents of the files; you need to be able to get the columns for each file. For the samples provided, my logic works.
I added a safety check to see if the input files have the same number of rows. If they don't it will seriously mess up the resulting excel file. You'll need to add some logic if a length mismatch happens.
For the writing to the excel file, it's very easy using pandas in combination with openpyxl. There are likely more elegant solutions, but I'll leave it to you.
I'm referencing some SO answers in the code for further reading.
requirements.txt
pandas
openpyxl
main.py
# we use pandas for easy saving as XLSX
import pandas as pd
filelist = ["file01.txt", "file02.txt", "file03.txt"]
def load_file(filename: str) -> list:
    result = []
    with open(filename) as infile:
        # the split below is OS agnostic and removes EOL characters
        for line in infile.read().splitlines():
            # the split below splits on space character by default
            result.append(line.split())
    return result
loaded_files = []
for filename in filelist:
    loaded_files.append(load_file(filename))
# you will want to check if the files have the same number of rows
# it will break stuff if they don't, you could fix it by appending empty rows
# stolen from:
# https://stackoverflow.com/a/10825126/9267296
len_first = len(loaded_files[0]) if loaded_files else None
if not all(len(i) == len_first for i in loaded_files):
    print("length mismatch")
    exit(419)
# generate empty list of lists so we don't get index error below
# stolen from:
# https://stackoverflow.com/a/33990699/9267296
result = [ [] for _ in range(len(loaded_files[0])) ]
for f in loaded_files:
    for index, row in enumerate(f):
        result[index].extend(row)
        result[index].append('')
# trim the last empty column
result = [line[:-1] for line in result]
# write as excel file
# stolen from:
# https://stackoverflow.com/a/55511313/9267296
# note that there are some other options on this SO question, but this one
# is easily readable
df = pd.DataFrame(result)
with pd.ExcelWriter("output.xlsx") as writer:
    df.to_excel(writer, sheet_name="sheet_name_goes_here", index=False)
result:
I have the following function:
def json_to_pickle(json_path=REVIEWS_JSON_DIR,
                   pickle_path=REVIEWS_PICKLE_DIR,
                   force_update=False):
    '''Create a pickled dataframe from selected JSON files.'''
    current_date = datetime.today().strftime("%Y%m%d")
    filename_safe_path = json_path.replace("/", "_")
    # Get a list of all JSON files from a given directory
    in_paths = Path(json_path).glob("**/*.json")
    in_paths_list = []
    for path in in_paths:  # Convert generator to a simple list
        in_paths_list.append(path)
    out_path = (f"{pickle_path}"
                f"/{current_date}"
                f"_{filename_safe_path}"
                f"_reviews.pkl")
    if exists(out_path) == False or force_update == True:
        pprint("Creating a new pickle file from raw JSON files")
        df = pd.DataFrame()
        for path in tqdm(in_paths_list):
            with open(path, "r") as file:
                normalized_json_df = pd.json_normalize(json.load(file))
                df = pd.concat([df, normalized_json_df])
        df.to_pickle(out_path, compression="infer")
    else:
        pprint(f"Using existing pickle file ({out_path})")
    return out_path
Unless a pickle file already exists, it finds all JSON files in a given directory (including all subdirectories), normalizes them, concatenates them into a dataframe, and saves the dataframe to disk as a pickle. It takes 54 minutes to process 240,255 JSON files.
According to tqdm, the for-loop averages 73.51 iterations per second (running on an M1 MacBook Pro with 10 cores), but it seems to get slower over time, presumably as the dataframe grows larger; it starts at around 1,668 iterations per second.
All the JSON files have an identical structure, but a couple of fields may be missing. The size varies between 500 bytes and 2 KB. Here is the JSON spec from Google's documentation.
What can I do to speed up this for-loop?
Edit:
This is how I changed the for-loop following the selected answer:
data_frames = []
for path in tqdm(in_paths_list):
    with open(path, "r") as file:
        data_frames.append(pd.json_normalize(json.load(file)))
pd.concat(data_frames).to_pickle(out_path, compression="infer")
Now it finishes in 1 minute and 29 seconds. Quite the improvement!
Instead of loading each file and appending it to an ever bigger temporary dataframe, load all files in dataframes and concatenate them in a single operation.
The current code loads the N dataframes that correspond to the files and creates N-1 ever bigger dataframes with the exact same data.
I use this code to load all Excel files in a folder into a single dataframe:
root = rf"C:\path\to\folder"
matches=Path(root).glob("22*.xls*")
files = tqdm((pd.read_excel(file, skipfooter=1)
              for file in matches
              if not file.name.startswith("~")))
df = pd.concat(files)
In the question's case you could use:
def load_json(path):
    with open(path, "r") as file:
        df = pd.json_normalize(json.load(file))
    return df

in_paths = Path(json_path).glob("**/*.json")
files = tqdm((load_json(file) for file in in_paths))
df = pd.concat(files)
I have a directory with hundreds of csv files that represent the pixels of a thermal camera (288x383). I want to get the center value of each file (e.g. 144 x 191) and, with each of those values collected, add them to a dataframe that also lists the name of each file.
Follow my code, where I created the dataframe with the lists of several csv files:
import os
import glob
import numpy as np
import pandas as pd
os.chdir("/Programming/Proj1/Code/Image_Data")
!ls
Out:
2021-09-13_13-42-16.csv
2021-09-13_13-42-22.csv
2021-09-13_13-42-29.csv
2021-09-13_13-42-35.csv
2021-09-13_13-42-47.csv
2021-09-13_13-42-53.csv
...
file_extension = '.csv'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
files = glob.glob('*.csv')
all_df = pd.DataFrame(all_filenames, columns = ['Full_name'])
all_df.head()
Full_name
0 2021-09-13_13-42-16.csv
1 2021-09-13_13-42-22.csv
2 2021-09-13_13-42-29.csv
3 2021-09-13_13-42-35.csv
4 2021-09-13_13-42-47.csv
5 2021-09-13_13-42-53.csv
6 2021-09-13_13-43-00.csv
You can loop through your files one by one, reading each one in as a dataframe and taking the center value that you want. Then save this value along with the file name. This list of results can then be read into a new dataframe ready for you to use.
result = []
for file in files:
    # read in the file, you may need to specify some extra parameters
    # check the pandas docs for read_csv
    df = pd.read_csv(file)
    # now select the value you want
    # this will vary depending on what your indexes look like (if any)
    # and also your column names
    value = df.loc[row, col]
    # append to the list
    result.append((file, value))
# you should now have a list in the format:
# [('2021-09-13_13-42-16.csv', 100), ('2021-09-13_13-42-22.csv', 255), ...
# load the list of tuples as a dataframe for further processing or analysis...
result_df = pd.DataFrame(result)
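As a hedged specialization for the 288x383 grids in the question (assuming each csv holds only raw pixel values with no header row, and using the 144/191 indices mentioned above):
result = []
for file in files:
    pixels = pd.read_csv(file, header=None)      # 288 x 383 pixel values
    result.append((file, pixels.iat[144, 191]))  # centre pixel from the question
result_df = pd.DataFrame(result, columns=["Full_name", "Center_value"])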
I need to extract some data from 37,000 xls files, which are stored in 2,100 folders (activity/year/month/day). I already wrote the script, but when given a small sample of a thousand files it takes 5 minutes to run, and each individual file can include up to ten thousand entries I need to extract. I haven't tried running it on the entire folder yet; I'm looking for suggestions on how to make it more efficient.
I would also like some help on how to export the dictionary to a new excel file with two columns (or how to skip the dictionary entirely and save directly to xls), and how to point the entire script at a shared drive folder instead of Python's working directory.
import fnmatch
import os
import pandas as pd
docid = []
CoCo = []
for root, dirs, files in os.walk('Z_Option'):
    for filename in files:
        if fnmatch.fnmatch(filename, 'Z_*.xls'):
            df = pd.read_excel(os.path.join(root, filename), sheet_name='Sheet0')
            for i in df['BLDAT']:
                if isinstance(i, int):
                    docid.append(i)
                    CoCo.append(df['BUKRS'].iloc[1])
data = dict(zip(docid, CoCo))
print(data)
This walkthrough was very helpful for me when I was beginning with pandas. What is likely taking so long is the for i in df['BLDAT'] line.
Using something like an apply function can offer a speed boost:
def check_if_int(row):  # row is effectively a pd.Series for one row of the dataframe
    if isinstance(row['BLDAT'], int):
        docid.append(row['BLDAT'])
        CoCo.append(row.name)  # name is the index label
df.apply(check_if_int, axis=1)  # axis=1 applies the function row-wise
It's unclear what exactly this script is trying to do, but if it's as simple as filtering the dataframe to only include rows where the 'BLDAT' column is an integer, using a mask would be much faster:
df_filtered = df.loc[df['BLDAT'].apply(lambda x: isinstance(x, int))]
Another advantage of filtering the dataframe, as opposed to building lists, is that you can then use df_filtered.to_csv() (or df_filtered.to_excel()) to write the result straight to a file that Excel can open.
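On the export question, a hedged sketch of writing the {docid: CoCo} dictionary as a two-column Excel file (the output filename is a placeholder; to_excel needs openpyxl installed):
out = pd.DataFrame(list(data.items()), columns=['BLDAT', 'BUKRS'])
out.to_excel('extract.xlsx', index=False)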
Eventually I gave up due to time constraints (yay last minute "I need this tomorrow" reports), and came up with this. Dropping empty rows helped by some margin, and for the next quarter I'll try to do this entirely with pandas.
#Shared drive
import fnmatch
import os
import pandas as pd
import time
start_time = time.time()
docid = []
CoCo = []
errors = []
os.chdir(r"X:\Shared activities")
for root, dirs, files in os.walk("folder"):
    for filename in files:
        if fnmatch.fnmatch(filename, 'Z_*.xls'):
            try:
                df = pd.read_excel(os.path.join(root, filename), sheet_name='Sheet0')
                df.dropna(subset=['BLDAT'], inplace=True)
                for i in df['BLDAT']:
                    if isinstance(i, int):
                        docid.append(i)
                        CoCo.append(df['BUKRS'].iloc[1])
            except:
                errors.append(os.path.join(root, filename))
data = dict(zip(docid, CoCo))
os.chdir(r"C:\project\reports")
pd.DataFrame.from_dict(data, orient="index").to_csv('test.csv')
with open('errors.csv', 'w') as f:
    for item in errors:
        f.write("%s\n" % item)
print("--- %s seconds ---" % (time.time() - start_time))