write / append to very large csv with pandas' to_csv [closed] - python

I am opening one very large CSV in chunks using pandas read_csv with a chunksize set, because the CSV is too large to fit into memory. I am performing transformations on each chunk. I then want to append each transformed chunk to the end of another existing (and very large) CSV.
I have been running into out-of-memory errors, though. Does pandas to_csv(mode='a', header=False) open the target CSV in order to append the new chunk? In other words, is to_csv() what is causing my memory errors?

I had this same issue several times. What you might try is to export your data chunks to several CSVs (without headers) and then concatenate them with a non-pandas function (e.g. writing the lines read from your different CSVs to a single text file).
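A minimal sketch of that approach, assuming a placeholder transform() function and hypothetical file names:

import pandas as pd

def transform(chunk):
    # Placeholder for your per-chunk transformation.
    return chunk

# 1) Write each transformed chunk to its own small CSV, without headers.
chunk_paths = []
for i, chunk in enumerate(pd.read_csv("huge_input.csv", chunksize=100_000)):
    path = f"chunk_{i}.csv"
    transform(chunk).to_csv(path, index=False, header=False)
    chunk_paths.append(path)

# 2) Append the pieces to the existing large CSV with plain file I/O,
#    streaming line by line so nothing large is ever held in memory.
with open("existing_big.csv", "a") as out:
    for path in chunk_paths:
        with open(path) as part:
            for line in part:
                out.write(line)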

Related

Why don't the changes made in the pandas DataFrame reflect in the original file? [closed]

I was creating a program to replace all the NaN values in a data file (an Excel sheet) using pandas. For that I first read the Excel file and then used the replace method.
My code:
import pandas as pd
import numpy as np

A = 'Book2.xlsx'
dataa = pd.read_excel(A)
dataa[:].replace(np.nan, 1, inplace=True)
print(dataa.iloc[3, 1])
print(dataa)
But the changes were reflected only in the variable the file was read into (dataa), not in the original file.
(Screenshot: the original Excel data file after code execution, with no changes.)
Please tell me if I've done something wrong, and what to do to resolve this.
That is because you are not changing the file, you are just reading it.
Try this:
import pandas as pd
import numpy as np

A = 'Book2.xlsx'
dataa = pd.read_excel(A)
dataa.replace(np.nan, 1, inplace=True)   # replace the NaN values in the DataFrame
dataa.to_excel("book2_modified.xlsx")    # write the modified data to a new Excel file
The "to_excel" method documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html
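If you want the changes to land in the original file rather than a new one, a small variation (a sketch; it assumes the workbook is not open in Excel, and note that pandas rewrites the whole sheet, so any formatting in the original workbook is lost) is to write back to the same path:

import pandas as pd
import numpy as np

A = 'Book2.xlsx'
dataa = pd.read_excel(A)
dataa.replace(np.nan, 1, inplace=True)
dataa.to_excel(A, index=False)  # overwrite the original workbook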

Handling large binary files in Python [closed]

I have a binary file (>1GB in size) which contains single precision data, created in Matlab.
I am new to Python and would like to read the same file structure in Python.
Any help would be much appreciated.
From Matlab, I can load the file as follows:
fid = fopen('file.dat','r');
my_data = fread(fid,[117276,1794],'single');
Many thanks
InP
Using numpy is easiest, with fromfile (https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html):
np.fromfile('file.dat', dtype=np.dtype('single')).reshape((117276, 1794))
where np.dtype('single') is the same as np.dtype('float32').
Note that the result may be transposed from what you want, since MATLAB fills matrices in column order while numpy reshapes in row order.
Also, I'm assuming that using numpy is OK: since you are coming from MATLAB, you will probably end up using it anyway if you want MATLAB-like functions rather than dealing with pure Python, as in these answers: Reading binary file and looping over each byte
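As a hedged sketch of how to get the same orientation as the MATLAB matrix (file name and dimensions taken from the question):

import numpy as np

# Read the raw single-precision values written by MATLAB.
flat = np.fromfile('file.dat', dtype=np.float32)

# MATLAB's fread fills the [117276 x 1794] matrix column by column, so
# reshape in Fortran (column-major) order to match it.
my_data = flat.reshape((117276, 1794), order='F')

# Equivalent alternative: reshape row-major with the dimensions swapped,
# then transpose.
# my_data = flat.reshape((1794, 117276)).T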

Sampling from a 6GB csv file without loading in Python [closed]

I have a training data-set in CSV format, 6 GB in size, which I need to analyze and run machine learning on. My system RAM is 6 GB, so it is not possible for me to load the file into memory. I need to perform random sampling and load only the sampled rows from the data-set. The number of samples may vary according to requirements. How do I do this?
Something to start with:
with open('dataset.csv') as f:
    for line in f:
        sample_foo(line.split(","))
This will load only one line at a time in memory and not the whole file.
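If the sample needs to be random rather than just the first rows, a hedged sketch using reservoir sampling keeps exactly k uniformly random lines in memory in a single pass (the file name, k, and the presence of a header row are assumptions):

import random

def reservoir_sample(path, k, seed=None):
    # Return the header plus k uniformly random data lines,
    # without ever loading the whole file into memory.
    rng = random.Random(seed)
    sample = []
    with open(path) as f:
        header = next(f)  # keep the header row out of the sample
        for i, line in enumerate(f):
            if i < k:
                sample.append(line)
            else:
                j = rng.randint(0, i)
                if j < k:
                    sample[j] = line
    return header, sample

header, rows = reservoir_sample('dataset.csv', k=10000, seed=42)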

Easiest way to validate between two CSV files using python [closed]

I have two CSV files, and I would like to validate (find the differences and similarities between) the data in these two files.
I am retrieving this data from Vertica, and because the data is so large I would like to do the validation at the CSV level.
csvdiff allows you to compare the semantic contents of two CSV files, ignoring things like row and column ordering in order to get to what’s actually changed. This is useful if you’re comparing the output of an automatic system from one day to the next, so that you can look at just what’s changed.
I don't think you can directly compare sheets using openpyxl without manually looping over each row and writing your own validation code.
Whether that is acceptable depends on your performance requirements; if speed is not a concern it can work, but it will take some additional effort.
Instead, I would use pandas DataFrames for any CSV validation needs. If you can add this dependency, comparing files becomes much easier while keeping great performance.
Here is a link to complete example:
http://pbpython.com/excel-diff-pandas.html
However, use read_csv() instead of read_excel() to read data from your files.
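As a hedged sketch of a full-row comparison with pandas (file names are placeholders, and it assumes the files fit in memory and contain no duplicate rows):

import pandas as pd

old = pd.read_csv('export_day1.csv')
new = pd.read_csv('export_day2.csv')

# Outer-merge on all columns; the indicator column records where each row came from,
# so row order is ignored.
diff = old.merge(new, how='outer', indicator=True)

only_in_old = diff[diff['_merge'] == 'left_only'].drop(columns='_merge')
only_in_new = diff[diff['_merge'] == 'right_only'].drop(columns='_merge')
common = diff[diff['_merge'] == 'both'].drop(columns='_merge')

print(len(only_in_old), 'rows only in the first file')
print(len(only_in_new), 'rows only in the second file')
print(len(common), 'rows in both files')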

pandas dataframe to R using pyRserve [closed]

A large data frame (a couple of million rows, a few thousand columns) is created with pandas in Python. This data frame is to be passed to R using pyRserve. This has to be quick, a few seconds at most.
There is a to_json function in pandas. Is converting to and from JSON the only way for such large objects? Is it even OK for objects this large?
I can always write it to disk and read it back (fast using fread, and that is what I have done), but what is the best way to do this?
Without having tried it out, to_json seems to be a very bad idea, getting worse with larger dataframes, as it has a lot of overhead both in writing and in reading the data.
I'd recommend using rpy2 (which is supported directly by pandas) or, if you want to write something to disk (maybe because the dataframe is only generated once), you can use HDF5 (see this thread for more information on interfacing pandas and R using this format).
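A minimal sketch of the disk-based HDF5 route on the Python side (it assumes the optional PyTables dependency is installed; reading the file back in R is covered in the linked thread):

import numpy as np
import pandas as pd

# A small frame standing in for the real multi-million-row data.
df = pd.DataFrame(np.random.rand(100000, 50),
                  columns=['col%d' % i for i in range(50)])

# format='table' stores the data as a single queryable PyTables table
# instead of the default fixed format.
df.to_hdf('frame.h5', key='df', mode='w', format='table')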
