I have a file with 50 GB of data. I know how to use pandas for my data analysis.
I only need the last 1000 lines or rows, not the complete 50 GB.
Hence, I thought of using the nrows option in the read_csv().
I have written the code like this:
import pandas as pd
df = pd.read_csv("Analysis_of_50GB.csv",encoding="utf-16",nrows=1000,index_col=0)
But it returned the top 1000 rows. I need the last 1000 rows, so I tried this and received an error:
df = pd.read_csv("Analysis_of_50GB.csv",encoding="utf-16",nrows=-1000,index_col=0)
ValueError: 'nrows' must be an integer >=0
I have even tried using chunksize in read_csv(), but it still loads the complete file, and the output was not a DataFrame but an iterator.
Hence, please let me know what I can do in this scenario.
Please NOTE THAT I DO NOT WANT TO OPEN THE COMPLETE FILE...
A pure pandas method:
import pandas as pd
line = 0
chksz = 1000
for chunk in pd.read_csv("Analysis_of_50GB.csv", encoding="utf-16", chunksize=chksz, index_col=0, usecols=[0]):
    line += chunk.shape[0]
So this just counts the number of rows; we read just the first column for performance reasons.
Once we have the total number of rows we just subtract from this the number of rows we want from the end:
df = pd.read_csv("Analysis_of_50GB.csv", encoding="utf-16", skiprows=range(1, line - 1000 + 1), index_col=0)
Passing a range that starts at 1 keeps the header line while skipping everything up to the last 1000 data rows.
The normal way would be to read the whole file and keep 1000 lines in a deque as suggested in the accepted answer to Efficiently Read last 'n' rows of CSV into DataFrame. But that may be suboptimal for a really huge file of 50 GB.
In that case I would try a simple pre-processing:
open the file
read and discard 1000 lines
use tell() to get an approximation of how much has been read so far
seek that far back from the end of the file and read the end of the file into a large buffer (if you have enough memory)
store the positions of the '\n' characters in the buffer in a deque of size 1001 (the file probably has a terminal '\n'); let us call it deq
ensure that you have 1001 newlines, else iterate with a larger offset
load the dataframe with the 1000 lines contained in the buffer:
df = pd.read_csv(io.StringIO(buffer[deq[0]+1:]))
Code could be (beware: untested; also note that Python 3 text streams reject non-zero end-relative seeks, so you may need to open the file in binary mode and decode the buffer yourself):
import collections
import io
import itertools
import os
import pandas as pd

with open("Analysis_of_50GB.csv", "r", encoding="utf-16") as fd:
    for i in itertools.islice(fd, 1250):  # read and discard a bit more than 1000 lines...
        pass
    offset = fd.tell()  # ...to estimate how much of the file they occupy
    while True:
        fd.seek(-offset, os.SEEK_END)  # jump that far back from the end of the file
        deq = collections.deque(maxlen=1001)
        buffer = fd.read()
        for i, c in enumerate(buffer):  # remember the positions of the last 1001 newlines
            if c == '\n':
                deq.append(i)
        if len(deq) == 1001:  # enough newlines: the last 1000 lines are in the buffer
            break
        offset = offset * 1250 // len(deq)  # otherwise retry with a proportionally larger offset

df = pd.read_csv(io.StringIO(buffer[deq[0] + 1:]))
You should consider using dask, which does chunking under the hood and allows you to work with very large data frames. It has a workflow very similar to pandas, and the most important functions are already implemented.
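For illustration, a minimal sketch of that dask route (the blocksize value is arbitrary, and since dask may not be able to split a utf-16 encoded file into byte blocks, this assumes the data has been re-encoded as utf-8):

import dask.dataframe as dd

# Build a lazy, partitioned view of the CSV; nothing is loaded into memory yet.
ddf = dd.read_csv("Analysis_of_50GB.csv", blocksize="64MB")

# tail() only needs to materialise the last partition(s) to return the final rows.
last_1000 = ddf.tail(1000)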
I think you need to use skiprows and nrows together. Assuming that your file has 1000 rows, then,
df = pd.read_csv("Analysis_of_50GB.csv", encoding="utf-16", skiprows=lambda x: 0 < x <= 900, nrows=1000-900, index_col=0)
reads all the rows from 901 to 1000.
Using pandas.read_csv with the on_bad_lines='warn' option for lines with too many column delimiters works well: the bad lines are not loaded and stderr catches the bad line numbers:
import pandas as pd
from io import StringIO
data = StringIO("""
nom,f,nb
bat,F,52
cat,M,66,
caw,F,15
dog,M,66,,
fly,F,61
ant,F,21""")
df = pd.read_csv(data, sep=',', on_bad_lines='warn')
# b'Skipping line 4: expected 3 fields, saw 4\nSkipping line 6: expected 3 fields, saw 5\n'
df.head(10)
# nom f nb
# 0 bat F 52
# 1 caw F 15
# 2 fly F 61
# 3 ant F 21
But when a line has fewer delimiters (here sep=',') than the header, the line is loaded anyway, with NaN filling the missing fields:
import pandas as pd
from io import StringIO
data = StringIO("""
nom,f,nb
bat,F,52
catM66,
caw,F,15
dog,M66
fly,F,61
ant,F,21""")
df = pd.read_csv(data, sep=',', on_bad_lines='warn', dtype=str)
df.head(10)
# nom f nb
# 0 bat F 52
# 1 catM66 NaN NaN <==
# 2 caw F 15
# 3 dog M66 NaN <==
# 4 fly F 61
# 5 ant F 21
Is there a way to make read_csv not load lines that have fewer column delimiters than the header line?
Note: I'm in the context of loading really big data files (e.g. hundreds of millions of lines), so the idea is not to propose any upfront grep/sed/awk processing but to benefit from the fast read_csv bulk load.
pd.read_csv() is a very nice function that performs a well-defined
computation, but you desire a slightly different computation.
You wish to filter out all rows containing fewer than K fields.
the idea is not to propose any upfront grep / sed / awk processing
You have rather constrained the solution space.
Apparently speed (elapsed time) or power efficiency (watts dissipated)
are motivating concerns.
You correctly observe that grep is quite fast and would be
a natural pre-processing stage.
One could store its filtered output to a temp file
which we feed to .read_csv(), potentially costing extra disk I/O.
A better solution would be to pipe its output
using the subprocess library.
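For instance, a sketch of that piping approach, assuming comma-separated data where a good row has exactly 3 fields; the file name and the grep pattern are illustrative:

import subprocess
import pandas as pd

# Keep only lines with exactly 3 comma-separated fields (i.e. exactly 2 commas).
proc = subprocess.Popen(
    ["grep", "-E", r"^([^,]*,){2}[^,]*$", "big_file.csv"],
    stdout=subprocess.PIPE,
)
# read_csv consumes the pipe as a file-like object; no temp file, no extra disk I/O.
df = pd.read_csv(proc.stdout)
proc.stdout.close()
proc.wait()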
The original post mentions no grep timing results,
so it is unclear if overhead due to an extra child
process has been shown to be "too slow".
There's no throughput specification of N rows / second,
so it's unclear how this or any competing proposal
should be evaluated.
Note that .read_csv() accepts a file-like object, which could be a thin wrapper around a Python generator that inspects each row and only yields suitable rows.
Given that you're gung ho on calling .read_csv(), a function
which doesn't quite compute what you want, it seems there's
little for it but to post-process its output and hope for the best.
Filtering out all NaNs might do, but that's a little on the drastic side.
There is some buggy generating process that produces "short" rows
with fewer than K fields.
If you know the minimum number of fields it's guaranteed to produce,
you could at least do appropriate column-wise filtering
to discard short rows.
Then you get to preserve true NaNs in the first several columns.
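A minimal post-processing sketch along those lines, assuming K=3 as in your sample and that a well-formed row always fills the last column ('nb'); the file name is illustrative:

import pandas as pd

# Short rows load with trailing NaNs; drop rows whose last field is missing,
# which preserves genuine NaNs in the earlier columns.
df = pd.read_csv("big_file.csv", sep=",", dtype=str, on_bad_lines="warn")
df = df.dropna(subset=["nb"])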
Good luck!
I wrote code for point generation which generates a dataframe every second, and it keeps on generating. Each dataframe has 1000 rows and 7 columns. It was implemented using a while loop, so for every iteration one dataframe is generated and must be appended to a file. Which file format should I use to manage memory efficiently? Which file format takes the least memory? Can anyone give me a suggestion? Is it okay to use csv? If so, what datatype should I prefer to use? Currently my dataframe has int16 values. Should I append them as they are, or should I convert them to a binary or byte format?
numpy arrays can be stored in binary format. Since you have a single int16 data type, you can create a numpy array and write that. You would have 2 bytes per int16 value, which is fairly good for size. The trick is that you need to know the dimensions of the stored data when you read it back later. In this example it's hard coded. This is a bit fragile: if you change your mind and start using different dimensions later, old data would have to be converted.
Assuming you want to read a bunch of 1000x7 dataframes later, you could do something like the example below. The writer keeps appending 1000x7 int16s and the reader chunks them back into dataframes. If you don't use anything specific to pandas itself, you would be better off just sticking with numpy for all of your operations and skip the demonstrated conversions.
import os
import numpy as np
import pandas as pd


def write_df(filename, df):
    with open(filename, "ab") as fp:
        np.array(df, dtype="int16").tofile(fp)


def read_dfs(filename, dim=(1000, 7)):
    """Sequentially reads dataframes from a file formatted as raw int16
    with dimension 1000x7"""
    size = dim[0] * dim[1]
    with open(filename, "rb") as fp:
        while True:
            arr = np.fromfile(fp, dtype="int16", count=size)
            if not len(arr):
                break
            yield pd.DataFrame(arr.reshape(*dim))


# ready for test
test_filename = "test123"
if os.path.exists(test_filename):
    os.remove(test_filename)

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# write test file
for _ in range(5):
    write_df(test_filename, df)

# read and verify test file
return_data = [df for df in read_dfs(test_filename, dim=(3, 2))]
assert len(return_data) == 5
I'm trying to import large files (.tab/.txt, 300+ columns and 1,000,000+ rows) in Python. The files are tab separated. The columns are filled with integer values. One of my goals is to make a sum of each column. However, the files are too large to import with pandas.read_csv() as it consumes too much RAM.
sample data:
Therefore I wrote the following code to import 1 column, perform the sum of that column, store the result in a dataframe (= summed_cols), delete the column, and go on with the next column of the file:
import pandas as pd

x = 10  # columns I'm interested in start at col 11
# empty dataframe to fill
summed_cols = pd.DataFrame(columns=["sample", "read sum"])
while x < 352:
    x = x + 1
    sample_col = pd.read_csv("file.txt", sep="\t", usecols=[x])
    summed_cols = summed_cols.append(pd.DataFrame({"sample": [sample_col.columns[0]], "read sum": sum(sample_col[sample_col.columns[0]])}))
    del sample_col
Each column represents a sample and the "read sum" is the sum of that column. So the output of this code is a dataframe with 2 columns: the first column has one sample per row, and the second column holds the corresponding read sum.
This code does exactly what I want, but it is not efficient. For this large file it takes about 1-2 hours to complete the calculations. Especially the loading of just 1 column takes quite a long time.
My question: is there a faster way to import just one column of this large tab file and perform the same calculations as I do with the code above?
You can try something like this:
samples = []
sums = []

with open('file.txt', 'r') as f:
    for i, line in enumerate(f):
        columns = line.strip().split('\t')[10:]  # from column 10 onward
        if i == 0:  # supposing the sample_name is the first row of each column
            samples = columns  # save sample names
            sums = [0 for s in samples]  # init the sums to 0
        else:
            for n, v in enumerate(columns):
                sums[n] += float(v)

result = dict(zip(samples, sums))  # {sample_name: sum, ...}
I am not sure this will work since I don't know the content of your input file but it describes the general procedure. You open the file only once, you iterate over each line, split to get the columns, and store the data you need.
Mind that this code does not deal with missing values.
The else block can be improved using numpy:
import numpy as np
...
        else:
            sums = np.add(sums, list(map(float, columns)))
Reading html tables in pandas is fine for small sizes, but big files, in the range of 10 MB or around 10000 rows/records in the html table, make me wait for 10 minutes with still no progress, whereas the same data in csv is parsed quickly.
Kindly help me speed up the html table read in pandas, or help with getting it converted to csv.
file='testfile.html'
dfdefault = pd.read_html(file, header = 0, match='Client Inventory Details')
#print(dfdefault)
df = dfdefault[0]
An html dataset is still a dataset. In order to read large datasets faster in Pandas you can choose different strategies, and they apply to read_html as well:
1. Sampling
2. Chunking
3. Optimising Pandas dtypes
Sampling. The simplest option is sampling your dataset.
import pandas
import random
filename = "data.csv"
n = sum(1 for line in open(filename))-1 # Calculate number of rows in file
s = n//10 # sample size of 10%
skip = sorted(random.sample(range(1, n+1), n-s)) # n+1 to compensate for header
df = pandas.read_csv(filename, skiprows=skip)
Chunks / Iteration
If you do need to process all data, you can choose to split the data into a number of chunks (which in itself do fit in memory) and perform your data cleaning and feature engineering on each individual chunk
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

datafile = "data.csv"
chunksize = 100000
models = []
for chunk in pd.read_csv(datafile, chunksize=chunksize):
    # A function to clean my data and create my features
    chunk = pre_process_and_feature_engineer(chunk)

    model = LogisticRegression()
    model.fit(chunk[features], chunk['label'])
    models.append(model)

df = pd.read_csv("data_to_score.csv")
df = pre_process_and_feature_engineer(df)
predictions = np.mean([model.predict(df[features]) for model in models], axis=0)
Optimise data types
When loading data from a file, Pandas automatically infers the datatypes. Very convenient of course; however, these datatypes are often not optimal and take up more memory than needed. The three most common datatypes used by Pandas are int, float and object, and their memory footprint can usually be decreased, as shown in the example below.
Another way to drastically reduce the size of your Pandas Dataframe is to transform columns of dtype Object to category.
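A minimal sketch of strategy 3 applied to the frame returned by read_html above; the 50% uniqueness threshold is just an illustrative heuristic:

import numpy as np
import pandas as pd

file = 'testfile.html'
df = pd.read_html(file, header=0, match='Client Inventory Details')[0]

# Downcast numeric columns to the smallest dtype that still holds their values.
for col in df.select_dtypes(include=np.number).columns:
    if pd.api.types.is_integer_dtype(df[col]):
        df[col] = pd.to_numeric(df[col], downcast="integer")
    else:
        df[col] = pd.to_numeric(df[col], downcast="float")

# Object (string) columns with few distinct values shrink a lot as 'category'.
for col in df.select_dtypes(include="object").columns:
    if df[col].nunique() < 0.5 * len(df):
        df[col] = df[col].astype("category")

print(df.memory_usage(deep=True).sum())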
I have a data frame with 13000 rows and 3 columns:
('time', 'rawScore', 'label')
I want to read it subset by subset:
[[1..360], [360..712], ..., [12640..13000]]
I tried using a list too, but it's not working:
import pandas as pd
import math
import datetime

result = "data.csv"
dataSet = pd.read_csv(result)
TP = 0
count = 0
x = 0

df = pd.DataFrame(dataSet, columns=['rawScore', 'label'])

for i, row in df.iterrows():
    data = row.to_dict()
    ScoreX = data['rawScore']
    labelX = data['label']
    for i in range(1, 13000, 360):
        x = x + 1
        for j in range(i, 360 * x, 1):
            if (ScoreX > 0.3) and (labelX == 0):
                count = count + 1
print("count=", count)
You can also use the parameters nrows or skiprows to break it up into chunks, as in the sketch below. I would recommend against using iterrows since that is typically very slow. If you do this when reading in the values and save these chunks separately, then it skips the iterrows section. This is for the file reading, if you want to split it up into chunks (which seems to be an intermediate step in what you're trying to do).
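A rough sketch of that file-reading idea, pulling one 360-row block at a time (the block boundaries follow the ranges in your question):

import pandas as pd

chunks = []
for start in range(0, 13000, 360):
    # skiprows as a range keeps the header row and skips the data rows before this block
    block = pd.read_csv("data.csv", skiprows=range(1, start + 1), nrows=360)
    chunks.append(block)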
Another way is to subset using generators by seeing if the values belong to each set:
[[1..360], [360..712], ..., [12640..13000]]
So write a function that takes the chunks with indices divisible by 360 and if the indices are in that range, then choose that particular subset.
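A minimal sketch of that generator idea, assuming the frame df from your code above is already loaded and reusing the rawScore/label condition from your snippet:

def iter_subsets(df, step=360):
    # Yield consecutive blocks of `step` rows each.
    for start in range(0, len(df), step):
        yield df.iloc[start:start + step]

for subset in iter_subsets(df, step=360):
    count = ((subset['rawScore'] > 0.3) & (subset['label'] == 0)).sum()
    print("rows:", len(subset), "count:", count)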
I just wrote these approaches down as alternative ideas you might want to play around with, since in some cases you may only want a subset and not all of the chunks for calculation purposes.