Splitting very large csv files into smaller files - python

Is Dask proper to read large csv files in parallel and split them into multiple smaller files?

Yes, Dask can read large CSV files in parallel; it splits them into chunks (partitions) as it reads:
df = dd.read_csv("/path/to/myfile.csv")
Then, when saving to a globbed path like the one below, Dask writes the CSV data out as multiple files, one per partition:
df.to_csv("/output/path/*.csv")
See the read_csv and to_csv docstrings for much more information about this.
dd.read_csv
dd.DataFrame.to_csv
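For example, a minimal end-to-end sketch (the blocksize value is only an illustrative choice; Dask picks a sensible default if you omit it):
import dask.dataframe as dd
# Read the large CSV in roughly 64 MB partitions
df = dd.read_csv("/path/to/myfile.csv", blocksize="64MB")
# The * in the output path is replaced by the partition number: 0.csv, 1.csv, ...
df.to_csv("/output/path/*.csv", index=False)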

Hi Nutsa Nazgaide and welcome to SO. First of all, I'd suggest you read about how-to-ask and mcve. Your question is good enough, but it would be great to include a sample of your original dataframe. I'm going to generate a basic dataframe, but the logic shouldn't be too different in your case, as you just need to consider location.
Generate dataframe
import dask.dataframe as dd
import numpy as np
import pandas as pd
import string
letters = list(string.ascii_lowercase)
N = int(1e6)
df = pd.DataFrame({"member":np.random.choice(letters, N),
"values":np.random.rand(N)})
df.to_csv("file.csv", index=False)
One parquet file (folder) per member
If you're happy to have the output as parquet, you can just use the option partition_on:
df = dd.read_csv("file.csv")
df.to_parquet("output", partition_on="member")
If you then really need csv, you can convert it back to that format. I strongly suggest you move your data to parquet.
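For instance, here is a rough sketch of that conversion back to CSV, reusing the letters list and the "output" folder from above; the csv_by_member folder name is just a placeholder:
import os
import dask.dataframe as dd
ddf = dd.read_parquet("output")
os.makedirs("csv_by_member", exist_ok=True)
# Write one set of CSV part files per member value (some parts may be empty)
for letter in letters:
    ddf[ddf["member"] == letter].to_csv(f"csv_by_member/{letter}_*.csv", index=False)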

Related

Split a parquet file in smaller chunks using dask

I am trying to split a parquet file using Dask with the following piece of code:
import dask.dataframe as pd
df = pd.read_parquet(dataset_path, chunksize="100MB")
df.repartition(partition_size="100MB")
pd.to_parquet(df,output_path)
I have only one physical input file, i.e. file.parquet.
The output of this script is likewise a single file, i.e. part.0.parquet.
Based on the partition_size and chunksize parameters, I expected multiple output files.
Any help would be appreciated.
df.repartition(partition_size="100MB") returns a Dask Dataframe.
You have to write :
df = df.repartition(partition_size="100MB")
You can check the number of partitions created by looking at df.npartitions.
Also, you can use the following to write your parquet files :
df.to_parquet(output_path)
Because Parquet is designed for large datasets, you should also consider using the compression= argument when writing your parquet files.
You should get what you expect.
NB: Writing import dask.dataframe as pd is misleading, because import dask.dataframe as dd is the commonly used alias (pd is usually reserved for pandas).
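Putting those points together, a minimal corrected version of the script might look like this (paths as in the question; the compression choice is only an example):
import dask.dataframe as dd

df = dd.read_parquet(dataset_path)
df = df.repartition(partition_size="100MB")   # reassign: repartition returns a new Dask DataFrame
print(df.npartitions)                         # number of output part files to expect
df.to_parquet(output_path, compression="snappy")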

Python 33gb csv file Dataset to Pandas DataFrame

I'm kind of new to Python and data science.
I have a 33 GB CSV dataset, and I want to parse it into a DataFrame to do some stuff with it.
I tried to do it the 'casual' way with pandas.read_csv and it's taking ages to parse.
I searched on the internet and found this article.
It says that the most efficient way to read a large csv file is to use csv.DictReader.
So I tried this:
import pandas as pd
import csv
df = pd.DataFrame(csv.DictReader(open("MyFilePath")))
Even with this solution it's taking ages to do the job.
Can you please tell me what the most efficient way is to parse a large dataset into pandas?
There is no way to read such a big file in a short time. Anyway, there are some strategies for dealing with large data; these are some of them, which let you implement your code without leaving the comfort of pandas:
Sampling
Chunking
Optimising Pandas dtypes
Parallelising Pandas with Dask.
The simplest option is sampling your dataset (this may be helpful for you). Sometimes a random part of a large dataset already contains enough information for the next calculations. If you don't actually need to process your entire dataset, this is an excellent technique to use.
Sample code:
import pandas
import random

filename = "data.csv"
m = 10  # assumed split factor: keep roughly 1/m of the rows
n = sum(1 for line in open(filename)) - 1  # number of data lines in the file (excluding the header)
s = n // m  # desired sample size
skip = sorted(random.sample(range(1, n + 1), n - s))  # row numbers to skip
df = pandas.read_csv(filename, skiprows=skip)
This is the link for Chunking large data.
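For completeness, a rough chunking sketch (the chunk size and the per-chunk work are placeholders for whatever processing you actually need):
import pandas as pd

total = 0
# Process the file in pieces that fit in memory instead of loading all 33 GB at once
for chunk in pd.read_csv("data.csv", chunksize=1_000_000):
    total += len(chunk)  # replace with your real per-chunk work
print(total)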

Dask merge and export csv

I have several big CSV files, each more than 5 GB, which need to be merged. My RAM is only 8 GB.
Currently, I am using Dask to merge all of the files together and tried to export the data frame to CSV. I cannot export it due to low memory.
import dask.dataframe as dd
file_loc_1=r"..."
file_loc_2=r"..."
data_1 = dd.read_csv(file_loc_1, dtype="object", encoding="cp1252")
data_2 = dd.read_csv(file_loc_2, dtype="object", encoding="cp1252")
final_1 = dd.merge(data_1, data_2, left_on="A", right_on="A", how="left")
final_loc = r"..."
dd.to_csv(final_1, final_loc, index=False, low_memory=False)
If Dask is not the good way to process the data, please feel free to suggest new methods!
Thanks!
You can read the csv files with pandas.read_csv: setting the chunksize parameter makes the method return an iterator over chunks. Afterwards you can write a single csv in append mode.
Code example (not tested):
import pandas as pd
import os

src = ['file1.csv', 'file2.csv']
dst = 'file.csv'

for f in src:
    for df in pd.read_csv(f, chunksize=200000):
        if not os.path.isfile(dst):
            df.to_csv(dst, index=False)  # first chunk: write with header
        else:
            df.to_csv(dst, mode='a', header=False, index=False)  # append without header
Useful links:
http://acepor.github.io/2017/08/03/using-chunksize/
Panda's Write CSV - Append vs. Write
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

way to generate a specified number dataframe of new csv file from existing csv file in python using pandas

I have a large dataframe in a CSV file, sample1. From it I have to generate a new CSV file containing only the first 100 rows. I have written code for it, but I am getting a KeyError: the label [100] is not in the index.
This is what I have tried; any help would be appreciated:
import pandas as pd
data_frame = pd.read_csv("C:/users/raju/sample1.csv")
data_frame1 = data_frame[:100]
data_frame.to_csv("C:/users/raju/sample.csv")`
`
The correct syntax is with iloc:
data_frame.iloc[:100]
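Applied to the snippet from the question, that would be something like this (paths as in the question; note it saves the sliced frame data_frame1 rather than the original):
import pandas as pd

data_frame = pd.read_csv("C:/users/raju/sample1.csv")
data_frame1 = data_frame.iloc[:100]   # first 100 rows
data_frame1.to_csv("C:/users/raju/sample.csv", index=False)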
A more efficient way is to use the nrows argument, whose purpose is exactly to read only a portion of the file. This way you avoid wasting resources and time parsing rows you don't need:
import pandas as pd
data_frame = pd.read_csv("C:/users/raju/sample1.csv", nrows=101) # 100+1 for header
data_frame.to_csv("C:/users/raju/sample.csv")

Split csv files based on columns

I have a csv file that I am trying to split based on the number of columns. The original file has about 24000 columns, and I want to split it into files each having a fixed number of columns (say 1000), so that I can run feature selection in Weka on the individual files. I have the following code in Python.
import pandas as pd
import numpy as np
i=0
df=pd.read_csv("glio.csv")
#row_split=int(input("Enter the Row Split: "))
row_split=6000
name ="temp_file_"
ext=".csv"
rows, columns = df.shape
df_temp=df.iloc[:,:row_split]
df_temp.to_csv(name+str(i)+ext)
i=i+1
while(row_split<columns):
df_temp=df.iloc[:,row_split+1:row_split+100]
df_temp.to_csv(name+str(i)+ext)
i=i+1
row_split+=1000
It generates the individual files as expected, but after splitting I am not able to load the individual files in Weka; I am getting an error.
I am new to this and have no idea why this occurs, and I cannot find answers online. It would be really helpful if someone could explain why this is happening and how to correct it.
First of all add index=False to the to_csv call:
df_temp.to_csv(name+str(i)+ext, index=False)
Also please upload a screenshot of the csv file when you open it in some csv viewer application (e.g. Excel).
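For reference, here is a rough cleaned-up version of the splitting loop with index=False applied throughout; it assumes you want uniform 1000-column slices, which the original loop (with its +100 slice and +=1000 step) does not quite produce:
import pandas as pd

df = pd.read_csv("glio.csv")
rows, columns = df.shape
col_split = 1000  # assumed fixed number of columns per output file
for i, start in enumerate(range(0, columns, col_split)):
    df_temp = df.iloc[:, start:start + col_split]
    df_temp.to_csv(f"temp_file_{i}.csv", index=False)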
