I wrote code for point generation that generates a dataframe every second, and it keeps on generating. Each dataframe has 1000 rows and 7 columns. It is implemented with a while loop, so every iteration generates one dataframe that must be appended to a file. Which file format should I use to manage memory efficiently? Which file format takes the least space? Can anyone give me a suggestion? Is it okay to use CSV? If so, which datatype should I prefer? Currently my dataframe has int16 values. Should I append those as-is, or should I convert them to a binary or byte format?
numpy arrays can be stored in binary format. Since you have a single int16 data type, you can create a numpy array and write that. You would have 2 bytes per int16 value, which is fairly good for size. The trick is that you need to know the dimensions of the stored data when you read it later. In this example it's hard-coded. This is a bit fragile: if you change your mind and start using different dimensions later, old data would have to be converted.
Assuming you want to read back a series of 1000x7 dataframes later, you could do something like the example below. The writer keeps appending 1000x7 blocks of int16 and the reader chunks them back into dataframes. If you don't use anything specific to pandas itself, you would be better off just sticking with numpy for all of your operations and skipping the demonstrated conversions.
import os

import numpy as np
import pandas as pd


def write_df(filename, df):
    with open(filename, "ab") as fp:
        np.array(df, dtype="int16").tofile(fp)


def read_dfs(filename, dim=(1000, 7)):
    """Sequentially reads dataframes from a file formatted as raw int16
    with dimension 1000x7"""
    size = dim[0] * dim[1]
    with open(filename, "rb") as fp:
        while True:
            arr = np.fromfile(fp, dtype="int16", count=size)
            if not len(arr):
                break
            yield pd.DataFrame(arr.reshape(*dim))


# ready for test
test_filename = "test123"
if os.path.exists(test_filename):
    os.remove(test_filename)
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# write test file
for _ in range(5):
    write_df(test_filename, df)

# read and verify test file
return_data = [df for df in read_dfs(test_filename, dim=(3, 2))]
assert len(return_data) == 5
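If you go the numpy-only route, reading everything back is a one-liner; here is a small sketch against the test file written above (use reshape(-1, 7) for the real 1000x7 data):

arr = np.fromfile(test_filename, dtype="int16").reshape(-1, 2)
assert arr.shape == (15, 2)  # five appended 3x2 test frames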
I have a file with 50 GB of data. I know how to use Pandas for my data analysis.
I am only in need of the last 1000 lines or rows, not the complete 50 GB.
Hence, I thought of using the nrows option in read_csv().
I have written the code like this:
import pandas as pd
df = pd.read_csv("Analysis_of_50GB.csv",encoding="utf-16",nrows=1000,index_col=0)
But it has taken the top 1000 rows. I am in need of the last 1000 rows. So I did this and received an error:
df = pd.read_csv("Analysis_of_50GB.csv",encoding="utf-16",nrows=-1000,index_col=0)
ValueError: 'nrows' must be an integer >=0
I have even tried using chunksize in read_csv(). But it still loads the complete file, and even then the output was not a DataFrame but an iterable.
Hence, please let me know what I can do in this scenario.
Please NOTE THAT I DO NOT WANT TO OPEN THE COMPLETE FILE...
A pure pandas method:
import pandas as pd

line = 0
chksz = 1000
for chunk in pd.read_csv("Analysis_of_50GB.csv", encoding="utf-16", chunksize=chksz, index_col=0, usecols=[0]):
    line += chunk.shape[0]
So this just counts the number of rows; we read only the first column for performance reasons.
Once we have the total number of rows, we subtract the 1000 rows we want from the end and skip that many data rows, keeping the header on the first line:
df = pd.read_csv("Analysis_of_50GB.csv", encoding="utf-16", skiprows=range(1, line - 999), index_col=0)
The normal way would be to read the whole file and keep the last 1000 lines in a deque, as suggested in the accepted answer to Efficiently Read last 'n' rows of CSV into DataFrame. But it may be suboptimal for a really huge file of 50 GB.
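That approach, sketched (assuming a utf-16 CSV with a single header line):

import collections
import io

import pandas as pd

with open("Analysis_of_50GB.csv", "r", encoding="utf-16") as fd:
    header = fd.readline()                           # keep the header line
    last_lines = collections.deque(fd, maxlen=1000)  # stream the rest, keep only the last 1000 lines

df = pd.read_csv(io.StringIO(header + "".join(last_lines)), index_col=0)

The whole 50 GB still has to be decoded line by line, which is why it can be slow.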
In that case I would try a simple pre-processing:
open the file
read and discard 1000 lines
use tell() to get an approximation of how many bytes have been read so far
seek back that many bytes from the end of the file and read the end of the file into a large buffer (if you have enough memory)
store the positions of the '\n' characters in the buffer in a deque of size 1001 (the file probably has a terminal '\n'); let us call it deq
ensure that you have 1001 newlines, else iterate with a larger offset
load the dataframe with the 1000 lines contained in the buffer:
df = pd.read_csv(io.StringIO(buffer[deq[0]+1:]))
Code could be (beware: untested; note also that Python text-mode files do not allow non-zero seeks relative to the end, so in practice you may have to open the file in binary mode and decode the buffer yourself):
with open("Analysis_of_50GB.csv", "r", encoding="utf-16") as fd:
for i in itertools.islice(fd, 1250): # read a bit more...
pass
offset = fd.tell()
while(True):
fd.seek(-offset, os.SEEK_END)
deq = collection.deque(maxlen = 1001)
buffer = fd.read()
for i,c in enumerate(buffer):
if c == '\n':
deq.append(i)
if len(deq) == 1001:
break
offset = offset * 1250 // len(deq)
df = pd.read_csv(io.StringIO(buffer[d[0]+1:]))
You should consider using dask, which does the chunking under the hood and allows you to work with very large data frames. It has a workflow very similar to pandas, and the most important functions are already implemented.
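A minimal sketch (assuming the file is readable by pandas' CSV parser; note that a multi-byte encoding such as utf-16 may not split cleanly across dask's partitions):

import dask.dataframe as dd

ddf = dd.read_csv("Analysis_of_50GB.csv")  # lazy: the file is scanned, not loaded
last_1000 = ddf.tail(1000)                 # materialises rows from the last partition only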
I think you need to use skiprows and nrows together. Assuming that your file has 1000 rows, then,
df =pd.read_csv('"Analysis_of_50GB.csv", encoding="utf16",skiprows = lambda x: 0<x<=900, nrows=1000-900,index_col=0)
reads all the rows from 901 to 1000.
I have a data frame with 13000 rows and 3 columns:
('time', 'rowScore', 'label')
I want to read subset by subset:
[[1..360], [360..712], ..., [12640..13000]]
I used list too but it's not working:
import pandas as pd
import math
import datetime

result = "data.csv"
dataSet = pd.read_csv(result)
TP = 0
count = 0
x = 0
df = pd.DataFrame(dataSet, columns=['rawScore', 'label'])
for i, row in df.iterrows():
    data = row.to_dict()
    ScoreX = data['rawScore']
    labelX = data['label']
    for i in range(1, 13000, 360):
        x = x + 1
        for j in range(i, 360 * x, 1):
            if ((ScoreX > 0.3) and (labelX == 0)):
                count = count + 1
print("count=", count)
You can also use the parameters nrows or skiprows to break the file up into chunks as you read it. I would recommend against using iterrows, since that is typically very slow; if you form the chunks while reading in the values and save them separately, you can skip the iterrows section entirely. This applies to the file reading, if you want to split it up into chunks (which seems to be an intermediate step in what you're trying to do).
Another way is to subset using generators by seeing if the values belong to each set:
[[1..360], [360..712], ..., [12640..13000]]
So write a function that steps through the indices in increments of 360 and, for each range, selects the rows whose indices fall within it.
I just wrote these approaches down as alternative ideas you might want to play around with, since in some cases you may only want a subset and not all of the chunks for calculation purposes.
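For instance, a minimal sketch of the chunked-read idea (the file name, column names and the 0.3/0 test are taken from the question's code; the 360-row chunk size is illustrative):

import pandas as pd

count = 0
# read data.csv in 360-row pieces; each chunk is an ordinary DataFrame
for chunk in pd.read_csv("data.csv", usecols=['rawScore', 'label'], chunksize=360):
    # vectorised test instead of iterrows(): rows with rawScore > 0.3 and label == 0
    count += ((chunk['rawScore'] > 0.3) & (chunk['label'] == 0)).sum()
print("count =", count)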
I have a very large data file (a 1000 by 1400000 array) that contains integers of 0, 1, 2 and 4. It takes a very long time to load this big data into a numpy array using h5py, because my memory (4 GB) cannot hold that much and the program uses swap space. Since there are only 4 distinct numbers in the data, I want to use an 8-bit integer array. Currently I load the data and convert it to an 8-bit int array afterwards.
with h5py.File("largedata", 'r') as f:
variables = f.items()
# extract all data
for name, data in variables:
# If DataSet pull the associated Data
if type(data) is h5py.Dataset:
value = data.value
if(name == 'foo'):
# convert to 8 bit int
nparray = np.array(value, dtype=np.int8)
Is it possible to load the data directly into a 8bit int array to save memory while loading?
From the dataset docs page
astype(dtype)
Return a context manager allowing you to read data as a particular type.
Conversion is handled by HDF5 directly, on the fly:
>>> dset = f.create_dataset("bigint", (1000,), dtype='int64')
>>> with dset.astype('int16'):
...     out = dset[:]
>>> out.dtype
dtype('int16')
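Applied to the question's code, a minimal sketch (same file name "largedata" and dataset name 'foo' as above; untested): HDF5 converts to int8 on the fly while reading, so the full-precision array never has to be held in memory.

import h5py
import numpy as np

with h5py.File("largedata", 'r') as f:
    dset = f['foo']
    with dset.astype(np.int8):
        nparray = dset[:]  # read straight into an 8-bit integer array

In newer h5py versions the same read can also be written without the context manager, as nparray = dset.astype(np.int8)[:].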
I know using == for float is generally not safe. But does it work for the below scenario?
Read from csv file A.csv, save first half of the data to csv file B.csv without doing anything.
Read from both A.csv and B.csv. Use == to check if data match everywhere in the first half.
These are all done with Pandas. The columns in A.csv have types datetime, string, and float. Obviously == works for datetime and string, so if == works for float as well in this case, it saves a lot of work.
It seems to be working for all my tests, but can I assume it will work all the time?
The same string representation will become the same float representation when put through the same parse routine. The float inaccuracy issue occurs when mathematical operations are performed on the values or when high-precision representations are used; equality of low-precision values parsed from the same text is no reason to worry.
No, you cannot assume that this will work all the time.
For this to work, you need to know that the text value written out by Pandas when it's writing to a CSV file recovers the exact same value when read back in (again using Pandas). But by default, the Pandas read_csv function sacrifices accuracy for speed, and so the parsing operation does not automatically recover the same float.
To demonstrate this, try the following: we'll create some random values, write them out to a CSV file, and read them back in, all using Pandas. First the necessary imports:
>>> import pandas as pd
>>> import numpy as np
Now create some random values, and put them into a Pandas Series object:
>>> test_values = np.random.rand(10000)
>>> s = pd.Series(test_values, name='test_values')
Now we use the to_csv method to write these values out to a file, and then read the contents of that file back into a DataFrame:
>>> s.to_csv('test.csv', header=True)
>>> df = pd.read_csv('test.csv')
Finally, let's extract the values from the relevant column of df and compare. We'll sum the result of the == operation to find out how many of the 10000 input values were recovered exactly.
>>> sum(test_values == df['test_values'])
7808
So approximately 78% of the values were recovered correctly; the others were not.
This behaviour is considered a feature of Pandas, rather than a bug. However, there's a workaround: Pandas 0.15 added a new float_precision argument to the CSV reader. By supplying float_precision='round_trip' to the read_csv operation, Pandas uses a slower but more accurate parser. Trying that on the example above, we get the values recovered perfectly:
>>> df = pd.read_csv('test.csv', float_precision='round_trip')
>>> sum(test_values == df['test_values'])
10000
Here's a second example, going in the other direction. The previous example showed that writing and then reading doesn't give back the same data. This example shows that reading and then writing doesn't preserve the data, either. The setup closely matches the one you describe in the question. First we'll create A.csv, this time using regularly-spaced values instead of random ones:
>>> import pandas as pd, numpy as np
>>> s = pd.Series(np.arange(10**4) / 1e3, name='test_values')
>>> s.to_csv('A.csv', header=True)
Now we read A.csv, and write the first half of the data back out again to B.csv, as in your Step 1.
>>> recovered_s = pd.read_csv('A.csv').test_values
>>> recovered_s[:5000].to_csv('B.csv', header=True)
Then we read in both A.csv and B.csv, and compare the first half of A with B, as in your Step 2.
>>> a = pd.read_csv('A.csv').test_values
>>> b = pd.read_csv('B.csv').test_values
>>> (a[:5000] == b).all()
False
>>> (a[:5000] == b).sum()
4251
So again, several of the values don't compare correctly. Opening up the files, A.csv looks pretty much as I expect. Here are the first 16 entries in A.csv:
,test_values
0,0.0
1,0.001
2,0.002
3,0.003
4,0.004
5,0.005
6,0.006
7,0.007
8,0.008
9,0.009
10,0.01
11,0.011
12,0.012
13,0.013
14,0.014
15,0.015
And here are the corresponding entries in B.csv:
,test_values
0,0.0
1,0.001
2,0.002
3,0.003
4,0.004
5,0.005
6,0.006
7,0.006999999999999999
8,0.008
9,0.009000000000000001
10,0.01
11,0.011000000000000001
12,0.012
13,0.013000000000000001
14,0.013999999999999999
15,0.015
See this bug report for more information on the introduction of the float_precision keyword to read_csv.
I have a large SPSS-file (containing a little over 1 million records, with a little under 150 columns) that I want to convert to a Pandas DataFrame.
It takes a few minutes to convert the file to a list, then another couple of minutes to convert it to a dataframe, then another few minutes to set the column headers.
Are there any optimizations possible, that I'm missing?
import pandas as pd
import numpy as np
import savReaderWriter as spss
raw_data = spss.SavReader('largefile.sav', returnHeader = True) # This is fast
raw_data_list = list(raw_data) # this is slow
data = pd.DataFrame(raw_data_list) # this is slow
data = data.rename(columns=data.loc[0]).iloc[1:] # setting columnheaders, this is slow too.
You can use rawMode=True to speed up things a bit, as in:
raw_data = spss.SavReader('largefile.sav', returnHeader=True, rawMode=True)
This way, datetime variables (if any) won't be converted to ISO strings, SPSS $sysmis values won't be converted to None, and a few other conversions are skipped.
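Putting it together with the question's code (a sketch, untested against a real .sav file): set the column headers directly from the first returned row instead of renaming afterwards.

import pandas as pd
import savReaderWriter as spss

raw_data = spss.SavReader('largefile.sav', returnHeader=True, rawMode=True)
raw_data_list = list(raw_data)                                    # still the slow part, but faster with rawMode=True
data = pd.DataFrame(raw_data_list[1:], columns=raw_data_list[0])  # first row is the header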