Dask merge and export csv - python

I have several big CSV files (more than 5 GB each) that need to be merged. My RAM is only 8 GB.
Currently I am using Dask to merge all of the files together and then trying to export the resulting dataframe to CSV, but I cannot export it due to low memory.
import dask.dataframe as dd
file_loc_1=r"..."
file_loc_2=r"..."
data_1=dd.read_csv(file_loc_1,dtype="object",encoding='cp1252')
data_2=dd.read_csv(file_loc_2,dtype="object",encoding='cp1252')
final_1=dd.merge(data_1,data_2,left_on="A",right_on="A",how="left")
final_loc=r"..."
dd.to_csv(final_1,final_loc,index=False,low_memory=False)
If Dask is not the good way to process the data, please feel free to suggest new methods!
Thanks!

You can read the CSV files with pandas.read_csv: setting the chunksize parameter makes the method return an iterator over chunks. Afterwards you can write a single CSV in append mode.
Code example (not tested):
import pandas as pd
import os

src = ['file1.csv', 'file2.csv']
dst = 'file.csv'

for f in src:
    for df in pd.read_csv(f, chunksize=200000):
        if not os.path.isfile(dst):
            df.to_csv(dst)
        else:
            df.to_csv(dst, mode='a', header=False)
Useful links:
http://acepor.github.io/2017/08/03/using-chunksize/
Panda's Write CSV - Append vs. Write
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

Related

How do I assemble a bunch of Excel files into one or more using Python?

There are around 10k .csv files named data0, data1, and so on in sequence. I want to combine them into a master sheet in one file, or at least a couple of sheets, using Python, because I think there is a limit of around 1,070,000 rows in one Excel file.
import pandas as pd
import os

master_df = pd.DataFrame()
for file in os.listdir(os.getcwd()):
    if file.endswith('.csv'):
        master_df = master_df.append(pd.read_csv(file))
master_df.to_csv('master file.CSV', index=False)
A few things to note:
Please check your CSV file content first. Columns can easily get mismatched when reading a CSV whose text fields contain the separator (maybe a ; in the content). Or you can try changing the separator, encoding, or CSV engine:
df = pd.read_csv(csvfilename, sep=';', encoding='utf-8', engine='python')
If you want to combine everything into one sheet, you can concat into one dataframe first, then call to_excel:
df = pd.concat([df, sh_tmp], axis=0, sort=False)
Note: concat or append is a straightforward way to combine data. However, 10k files turn this into a performance concern: if you run into performance issues, collect the frames in a list and concatenate once instead of calling pd.concat repeatedly (see the sketch at the end of this answer).
Excel has a maximum row limit per sheet (1,048,576 rows), which 10k files would easily exceed. You might change the output to a CSV file or split it into multiple .xlsx files or sheets.
---- update on the 3rd point ----
You can try grouping the data first (1,000,000 rows per group), then write the groups to sheets one by one:
row_limit = 1000000
master_df['group'] = master_df.index // row_limit
writer = pd.ExcelWriter(path_out)
for gr in range(0, master_df['group'].max() + 1):
    master_df.loc[master_df['group'] == gr].to_excel(writer, sheet_name='Sheet' + str(gr), index=False)
writer.save()
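Regarding the note above about using a list instead of repeated concat/append, here is a minimal, untested sketch; the file discovery and output name simply mirror the code above:
import os
import pandas as pd

# collect every CSV into a list and concatenate once at the end,
# which avoids the quadratic cost of growing a dataframe in a loop
frames = []
for file in os.listdir(os.getcwd()):
    if file.endswith('.csv'):
        frames.append(pd.read_csv(file))
master_df = pd.concat(frames, ignore_index=True, sort=False)
master_df.to_csv('master file.CSV', index=False)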

What is most efficient approach to read multiple JSON files between Pandas and Pyspark?

I have a cloud bucket with many (around 1000) small JSON files (a few KB each). I have to read them, select some fields, and store the result in a single parquet file. Since the JSON files are very small, the resulting dataframe (around 100 MB) fits in memory.
I tried two ways. The first is using Pandas with a for loop:
import os
import fnmatch
import json
import pandas as pd

path = ...
df = pd.DataFrame()
for root, _, filenames in os.walk(path):
    for filename in fnmatch.filter(filenames, '*.json'):
        with open(os.path.join(root, filename), 'r') as f:
            json_file = json.loads(f.read())
        df = df.append(pd.DataFrame(json_file), ignore_index=True)
The second option would be using Pyspark:
from pyspark.sql import SparkSession

path = ...
spark = SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
df = spark.read.json(path)
Which of the two approaches is the most efficient way to read multiple JSON files? And how do the solutions scale if the number of files to read is larger (more than 100K)?
If you are not running Spark in a cluster, it will not change much.
A Pandas dataframe is not distributed: when performing transformations on a pandas dataset, the data will not be spread across the cluster, so all the processing will be concentrated in one node.
Working with Spark datasets, as in the second option, Spark will send chunks of data to the available workers in your cluster so the data is processed in parallel, making the process much faster. Depending on the size and shape of your data, you can play with how this data is "sliced" to increase performance even further.
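As an illustration of how you can play with the slicing, here is a minimal sketch; the bucket paths, field names, and the use of coalesce are assumptions for the example, not taken from the question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Spark reads every JSON file under the path and spreads the work
# across the available executors
df = spark.read.json("s3://my-bucket/json-input/")

# select only the fields you need, then control how the data is
# "sliced" before writing; coalesce(1) yields a single parquet file
df.select("field_a", "field_b").coalesce(1).write.mode("overwrite").parquet("s3://my-bucket/json-output/")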

Split a parquet file in smaller chunks using dask

I am trying to split a parquet file using Dask with the following piece of code:
import dask.dataframe as pd
df = pd.read_parquet(dataset_path, chunksize="100MB")
df.repartition(partition_size="100MB")
pd.to_parquet(df,output_path)
I have only one physical file in input, i.e. file.parquet
The output of this script is as well only one file, i.e. part.0.parquet.
Based on the partition_size and chunksize parameters, I would expect multiple files in the output.
Any help would be appreciated
df.repartition(partition_size="100MB") returns a Dask Dataframe.
You have to write :
df = df.repartition(partition_size="100MB")
You can check the number of partitions created by looking at df.npartitions
Also, you can use the following to write your parquet files :
df.to_parquet(output_path)
Because Parquet files are meant to deal with large files, you should also consider using the argument compression= when writing your parquet files.
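For example, a minimal sketch (snappy is just one possible codec):
# write the repartitioned dataframe with snappy-compressed parquet files
df.to_parquet(output_path, compression="snappy")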
You should get what you expect.
NB: writing import dask.dataframe as pd is misleading, because import dask.dataframe as dd is the common convention.

Splitting very large csv files into smaller files

Is Dask proper to read large csv files in parallel and split them into multiple smaller files?
Yes, Dask can read large CSV files. It will split them into chunks:
df = dd.read_csv("/path/to/myfile.csv")
Then, when saving, Dask always saves CSV data to multiple files:
df.to_csv("/output/path/*.csv")
See the read_csv and to_csv docstrings for much more information about this.
dd.read_csv
dd.DataFrame.to_csv
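If you also want control over the individual output file names, to_csv accepts a name_function argument that maps each partition index to the text substituted for the *; the zero-padded scheme below is just an example:
# each partition becomes its own file; the * in the path is replaced
# by the string returned from name_function for that partition index
df.to_csv("/output/path/part-*.csv", name_function=lambda i: f"{i:04d}")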
Hi Nutsa Nazgaide and welcome to SO. First of all I'd suggest you read about how-to-ask and mcve. Your question is good enough, but it would be great to provide a sample of your original dataframe. I'm going to generate a basic dataframe, but the logic shouldn't be too different in your case, as you just need to take the location into account.
Generate dataframe
import dask.dataframe as dd
import numpy as np
import pandas as pd
import string
letters = list(string.ascii_lowercase)
N = int(1e6)
df = pd.DataFrame({"member":np.random.choice(letters, N),
"values":np.random.rand(N)})
df.to_csv("file.csv", index=False)
One parquet file (folder) per member
If you're happy to have the output as parquet, you can just use the option partition_on:
df = dd.read_csv("file.csv")
df.to_parquet("output", partition_on="member")
If you then really need CSV, you can convert it back to that format, but I strongly suggest you move your data to parquet.
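A minimal sketch of that conversion, reading the partitioned parquet back and writing one set of CSV files per member (the csv_output folder name is just an example, and filtering per member re-reads the data, so this is simple rather than optimal):
import os
import dask.dataframe as dd

# read the partitioned parquet back; "member" comes back as a column
df = dd.read_parquet("output")

# write one folder of CSV files per member value
for member in df["member"].unique().compute():
    out_dir = f"csv_output/member_{member}"
    os.makedirs(out_dir, exist_ok=True)
    df[df["member"] == member].to_csv(f"{out_dir}/part-*.csv", index=False)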

Merge csv files using dask

I am new to Python. I am using Dask to read 5 large (>1 GB) CSV files and merge (SQL-like) them into a Dask dataframe. Now I am trying to write the merged result into a single CSV. I used compute() on the Dask dataframe to collect the data into a single pandas df and then called to_csv. However, compute() is slow at reading data across all partitions. I tried calling to_csv directly on the Dask df, and it created multiple .part files (I didn't try merging those .part files into a single CSV). Is there any alternative to get the Dask df into a single CSV, or any parameter for compute() to gather the data? I am using 6 GB RAM with an HDD and an i5 processor.
Thanks
Dask.dataframe will not write to a single CSV file. As you mention it will write to multiple CSV files, one file per partition. Your solution of calling .compute().to_csv(...) would work, but calling .compute() converts the full dask.dataframe into a Pandas dataframe, which might fill up memory.
One option is to just avoid Pandas and Dask altogether and just read in bytes from the multiple files and dump them to another file:
with open(out_filename, 'w') as outfile:
    for in_filename in filenames:
        with open(in_filename, 'r') as infile:
            # if your csv files have headers then you might want to burn a line here with `next(infile)`
            for line in infile:
                outfile.write(line)  # lines already end with a newline
If you don't need to do anything except merge your CSV files into a larger one then I would just do this and not touch pandas/dask at all. They'll try to read the CSV data into in-memory dataframes, which will take a while and which you don't need. If on the other hand you need to do some processing with pandas/dask then I would use dask.dataframe to read and process the data, write to many csv files, and then use the trick above to merge them afterwards.
You might also consider writing to a datastore other than CSV. Formats like HDF5 and Parquet can be much faster. http://dask.pydata.org/en/latest/dataframe-create.html
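For example, a minimal sketch of writing the merged result to Parquet instead of CSV; the file names and the join column "A" are placeholders, not taken from the question:
import dask.dataframe as dd

data_1 = dd.read_csv("file1.csv")
data_2 = dd.read_csv("file2.csv")
merged = dd.merge(data_1, data_2, on="A", how="left")

# Parquet keeps the data partitioned, compressed and typed, so it is
# usually much faster to write and to read back than a single huge CSV
merged.to_parquet("merged_parquet", write_index=False)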
As of Dask 2.4.0 you may now specify single_file=True when calling to_csv. Example: dask_df.to_csv('path/to/csv.csv', single_file=True)
Like @mrocklin said, I recommend using other file formats.
