How to get the last chunk of a csv file using Pandas? - python

Suppose I have a csv file containing 5 rows.
Now I iterate over this file using a chunksize of 2.
data = pd.read_csv(data_name, header=None, iterator=True, chunksize=2)
Suppose I am doing some magic on this data chunk and appending it to another csv file.
processed_data.to_csv(fname, index=None, mode="a")
Problem: The last row is not written.
I do not know how to solve this problem. Can someone help?
I need to use chunks because I don't have enough RAM.
I cannot use chunksize=1 because opening/closing the file that often is too time-consuming.

If you are running out of memory I would use blaze for this type of data.
https://blaze.readthedocs.io/en/latest/ooc.html
Then you don't have to mess with the chunksize.
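If you stay with pandas chunks, here is a minimal sketch of the read-process-append loop for reference (process_chunk and the file names are placeholders, not your actual code); pd.read_csv does yield the final, smaller chunk, so the append step, for example how headers are written, is worth checking:
import pandas as pd

def process_chunk(chunk):
    return chunk  # placeholder for the actual per-chunk "magic"

# Assumes output.csv starts out empty, since mode="a" appends to whatever is there.
for chunk in pd.read_csv("input.csv", header=None, chunksize=2):
    processed = process_chunk(chunk)
    # Append every chunk, including the last partial one, without index or header rows.
    processed.to_csv("output.csv", mode="a", header=False, index=False)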

Related

How to remove part of a pandas dataframe without loading the full file?

I have a very large dataframe with many millions of rows, and it is normally not feasible to load the entire file into memory to work with. Recently some bad data has gotten in and I need to remove it from the database. So far what I've done is:
file = '/path to database'
rf = pd.read_csv(f'{file}.csv', chunksize = 3000000, index_col=False)
res = pd.concat([chunk[chunk['timestamp'] < 1.6636434764745E+018] for chunk in rf])
res.to_csv(f'{file}.csv', index=False)
Basically it opens the database and saves the part I want, overwriting the original file.
However the data has gotten so large that this is failing to fit in memory. Is there a better way to truncate a part of the dataframe based on a simple query?
The truncated part would usually be very small compared to the rest, say 100k rows and always at the end.
I would avoid using pandas in this case and just directly edit the csv file itself. For example:
import csv
with open("test_big.csv", "r") as f_in, open("test_out.csv", "w") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    for row in reader:
        if int(row[-1]) > 9900:  # your condition here
            writer.writerow(row)
For context, test_big.csv looks like this
1,2,3,4,5891
1,2,3,4,7286
1,2,3,4,7917
1,2,3,4,937
...
And is 400,000 records long. Execution took 0.2s.
Edit: Ran with 40,000,000 records and took 15s.
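If you would rather stay in pandas, a chunk-by-chunk filter that appends to a new file also keeps memory bounded; a sketch, with the threshold taken from the question and the file names assumed:
import pandas as pd

threshold = 1.6636434764745E+018  # timestamp cutoff from the question
first = True
for chunk in pd.read_csv('database.csv', chunksize=3000000):
    kept = chunk[chunk['timestamp'] < threshold]
    # Append each filtered chunk; write the header only for the first one.
    kept.to_csv('database_filtered.csv', mode='a', header=first, index=False)
    first = False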

How do I assemble a bunch of excel files into one or more using python?

There are around 10k .csv files named data0, data1 and so on in sequence. I want to combine them and have a master sheet in one file, or at least a couple of sheets, using Python, because I think there is a limit of around 1,070,000 rows in one Excel sheet.
import pandas as pd
import os
master_df = pd.DataFrame()
for file in os.listdir(os.getcwd()):
    if file.endswith('.csv'):
        master_df = master_df.append(pd.read_csv(file))
master_df.to_csv('master file.CSV', index=False)
A few things to note:
Please check your csv file content first. Columns can easily get mismatched when reading csv files that contain text (for example a ; inside the content). You can also try changing the csv engine:
df = pd.read_csv(csvfilename, sep=';', encoding='utf-8', engine='python')
If you want to combine everything into one sheet, you can concat into one dataframe first, then call to_excel:
df = pd.concat([df, sh_tmp], axis=0, sort=False)
Note: concat or append is a straightforward way to combine data, but with 10k files performance becomes an issue. If you run into that, build a list of DataFrames and concatenate once instead of concatenating inside the loop (see the sketch after these notes).
Excel has a maximum row limit (1,048,576). 10k files would easily exceed it, so you might write the output to a csv file or split it across multiple .xlsx files.
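A sketch of that list-and-concat pattern, assuming the files follow the data*.csv naming from the question:
import glob
import pandas as pd

# Read each data*.csv once, then concatenate a single time instead of appending in a loop.
# Note: sorted() here is lexicographic (data0, data1, data10, ...); sort numerically if order matters.
frames = [pd.read_csv(path) for path in sorted(glob.glob('data*.csv'))]
master_df = pd.concat(frames, ignore_index=True, sort=False)
master_df.to_csv('master file.CSV', index=False)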
---- update on the 3rd point ----
You can try grouping the data first (1,000,000 rows each), then write the groups to sheets one by one:
row_limit = 1000000
master_df = master_df.reset_index(drop=True)  # make sure the index runs 0..n-1 before grouping
master_df['group'] = master_df.index // row_limit
writer = pd.ExcelWriter(path_out)  # path_out: target .xlsx path
for gr in range(0, master_df['group'].max() + 1):
    master_df.loc[master_df['group'] == gr].to_excel(writer, sheet_name='Sheet' + str(gr), index=False)
writer.save()

Problem with python memory, flush, csv size

After sorting a dataset, I have a problem at this point in my code:
with open(fns_land[xx]) as infile:
    lines = infile.readlines()
    for line in lines:
        result_station.append(line.split(',')[0])
        result_date.append(line.split(',')[1])
        result_metar.append(line.split(',')[-1])
The problem is in the readlines() line: the data is sometimes too huge and the process gets killed.
Is there a short/nice way to rewrite this point?
Use readline instead; it reads one line at a time without loading the entire file into memory.
with open(fns_land[xx]) as infile:
    while True:
        line = infile.readline()
        if not line:
            break
        result_station.append(line.split(',')[0])
        result_date.append(line.split(',')[1])
        result_metar.append(line.split(',')[-1])
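Equivalently, you can iterate over the file object directly, which also streams one line at a time and is a bit more idiomatic (same variables as above):
with open(fns_land[xx]) as infile:
    for line in infile:
        parts = line.split(',')
        result_station.append(parts[0])
        result_date.append(parts[1])
        result_metar.append(parts[-1])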
If you are dealing with a dataset, I would suggest that you have a look at pandas, which is great for data wrangling.
If your problem is a large dataset, you could load the data in chunks.
import pandas as pd
tfr = pd.read_csv('fns_land{0}.csv'.format(xx), iterator=True, chunksize=1000)
The first line imports the pandas module; the second reads data from your csv file in chunks of 1000 lines.
This will be of type pandas.io.parsers.TextFileReader. To load the entire csv file, you follow up with:
df = pd.concat(tfr, ignore_index=True)
The parameter ignore_index=True is added to avoid duplicity of indexes.
You now have all your data loaded into a dataframe. Then do your data manipulation on the columns as vectors, which also is faster than regular line by line.
Have a look at this question, which dealt with something similar.
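If even the concatenated dataframe does not fit in memory, you can instead process each chunk as it arrives; a sketch that mirrors the station/date/metar extraction above, assuming the file has no header row:
import pandas as pd

result_station, result_date, result_metar = [], [], []
for chunk in pd.read_csv('fns_land{0}.csv'.format(xx), header=None, chunksize=1000):
    # Column-wise extraction per chunk; only the result lists grow in memory.
    result_station.extend(chunk.iloc[:, 0].tolist())
    result_date.extend(chunk.iloc[:, 1].tolist())
    result_metar.extend(chunk.iloc[:, -1].tolist())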

How to read only a slice of data stored in a big csv file in python

I can't read the data from a CSV file into memory because it is too large, i.e. calling pandas.read_csv on the whole file won't work.
I only want to get data out based on some column values which should fit into memory. Using a pandas dataframe df that could hypothetically contain the full data from the CSV, I would do
df.loc[df['column_name'] == 1]
The CSV file does contain a header, and it is ordered so I don't really need to use column_name but the order of that column if I have to.
How can I achieve this? I read a bit about pyspark, but I don't know whether it would be useful here.
You can read the CSV file chunk by chunk and retain only the rows you want:
iter_csv = pd.read_csv('sample.csv', iterator=True, chunksize=10000, error_bad_lines=False)
data = pd.concat([chunk.loc[chunk['Column_name'] == 1] for chunk in iter_csv])
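If you only need a few columns, usecols can also cut the memory used per chunk; a sketch where 'other_column' is a hypothetical column name:
import pandas as pd

# Only the listed columns are parsed, which reduces per-chunk memory.
iter_csv = pd.read_csv('sample.csv', usecols=['Column_name', 'other_column'], chunksize=10000)
data = pd.concat([chunk.loc[chunk['Column_name'] == 1] for chunk in iter_csv])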

Merge csv files using dask

I am new to python. I am using dask to read 5 large (>1 GB) csv files and merge (SQL-like) them into a dask dataframe. Now I am trying to write the merged result into a single csv. I used compute() on the dask dataframe to collect the data into a single df and then called to_csv. However, compute() is slow in reading data across all partitions. I tried calling to_csv directly on the dask df and it created multiple .part files (I didn't try merging those .part files into a csv). Is there any alternative to get the dask df into a single csv, or any parameter to compute() to gather the data? I am using 6GB RAM with an HDD and an i5 processor.
Thanks
Dask.dataframe will not write to a single CSV file. As you mention it will write to multiple CSV files, one file per partition. Your solution of calling .compute().to_csv(...) would work, but calling .compute() converts the full dask.dataframe into a Pandas dataframe, which might fill up memory.
One option is to just avoid Pandas and Dask altogether and simply read in bytes from the multiple files and dump them to another file:
with open(out_filename, 'w') as outfile:
    for in_filename in filenames:
        with open(in_filename, 'r') as infile:
            # if your csv files have headers then you might want to burn a line here with `next(infile)`
            for line in infile:
                outfile.write(line)  # lines already include their trailing '\n'
If you don't need to do anything except merge your CSV files into a larger one, then I would just do this and not touch pandas/dask at all. They'll try to parse the CSV data into in-memory dataframes, which takes a while and which you don't need. If on the other hand you need to do some processing with pandas/dask, then I would use dask.dataframe to read and process the data, write to many csv files, and then use the trick above to merge them afterwards.
You might also consider writing to a datastore other than CSV. Formats like HDF5 and Parquet can be much faster. http://dask.pydata.org/en/latest/dataframe-create.html
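For example, a minimal sketch of writing the merged result to Parquet instead of CSV (the input pattern and output directory are placeholders, and to_parquet needs pyarrow or fastparquet installed):
import dask.dataframe as dd

# Read the part files lazily and write them back out as Parquet, one file per partition.
ddf = dd.read_csv('merged_parts_*.csv')
ddf.to_parquet('merged_parquet/')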
As of Dask 2.4.0 you may now specify single_file=True when calling to_csv. Example: dask_df.to_csv('path/to/csv.csv', single_file=True)
Like @mrocklin said, I recommend using other file formats.
