Pandas .DAT file import error with skip rows - python

I am trying to break a huge data file into smaller parts. I am using the following script:
df = pd.read_csv(file_name, header=None, encoding='latin1', sep='\t', nrows=100000, skiprows=100000)
but I see that the skiprows argument skips around 200,000 rows instead of 100,000. Can anyone tell me why this is happening?

Thanks to @EdChum I was able to solve the problem using chunksize with the following code:
i = 0
tp = pd.read_csv(filename, header=None, encoding='latin1', sep='\t', iterator=True, chunksize=1000000)
for c in tp:
    ca = pd.DataFrame(c)
    ca.to_csv(file_destination + str(i) + 'test.csv', index=False, header=False)
    i = i + 1
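For what it's worth, the same split can be written a little more compactly with enumerate, since each chunk yielded by read_csv is already a DataFrame (a minimal sketch using the same names as above):
import pandas as pd

# each chunk is already a DataFrame, so it can be written out directly
tp = pd.read_csv(filename, header=None, encoding='latin1', sep='\t', chunksize=1000000)
for i, chunk in enumerate(tp):
    chunk.to_csv(file_destination + str(i) + 'test.csv', index=False, header=False)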

Related

Python Pandas: read_csv with chunksize and concat still throws MemoryError

I am trying to extract certain rows from a 10 GB, ~35 million row csv file into a new csv based on a condition (the value of the Geography column equals 'Ontario'). It runs for a few minutes, I can see my free hard drive space getting drained from 14 GB to basically zero, and then I get the MemoryError. I thought chunksize would help here, but it did not :( Please advise.
import pandas as pd
df = pd.read_csv("Data.csv", chunksize = 10000)
result = pd.concat(df)
output=result[result['Geography']=='Ontario']
rowcount=len(output)
print(output)
print(rowcount)
output.to_csv('data2.csv')
pd.concat(df) materializes every chunk in memory at once, which defeats the purpose of chunksize. You can try filtering and writing in chunks instead. Roughly:
df = pd.read_csv("Data.csv", chunksize = 10000)
header = True
for chunk in df:
chunk=chunk[chunk['Geography']=='Ontario']
chunk.to_csv(outfilename, header=header, mode='a')
header = False
Idea from here.
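One practical wrinkle with mode='a': if data2.csv is left over from an earlier run, the new rows get appended to the stale ones. A minimal sketch that guards against that (file and column names taken from the question):
import os
import pandas as pd

outfilename = 'data2.csv'
if os.path.exists(outfilename):
    os.remove(outfilename)  # mode='a' below would otherwise append to old output

header = True
for chunk in pd.read_csv("Data.csv", chunksize=10000):
    chunk = chunk[chunk['Geography'] == 'Ontario']
    chunk.to_csv(outfilename, header=header, mode='a', index=False)
    header = False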

How to read a large csv file without opening it to have a sum of numbers for each row

I'm working with a large csv file (>500,000 columns x 4,033 rows) and my goal is to have the sum of all numbers per row, with the exception of the first three cells of each row, which are only descriptive of my samples. I'd like to use the pandas package.
The dataset is something like this:
label Group numOtus Otu0000001 Otu0000002 Otu0000003 Otu0000004 ... Otu0518246 sum
00.03 1.118234 518246 0 62 275 0 ... 5 ?
I've tried a couple of different things, but none of them worked.
I can't simply use read_csv from pandas and then work with the DataFrame, because the file is too large (4 GB). So I tried a for loop, opening one row at a time, but I'm not getting what I was expecting. The final output should be a column with the sum per line.
Any ideas?
lst = []
for line in range(4033):
    l = pd.read_csv("doc.csv", sep="\t", nrows=1, low_memory=False)
    l = l.drop(columns=['label', 'Group', "numOtus"])
    x = l[list(l.columns)].sum(axis=1, numeric_only=float)
    lst.append(x)
One other solution besides dask is to use the chunksize parameter in pd.read_csv, then pd.concat your chunks.
A quick example:
chunksize = 1000
l = pd.read_csv('doc.csv', chunksize=chunksize, iterator=True)
df = pd.concat(l, ignore_index=True)
Addition:
To do something with the chunks one by one you can use:
chunksize = 1000
for chunk in pd.read_csv('doc.csv', chunksize=chunksize, iterator=True):
    ...  # do something with each individual chunk here
To see the progress you can consider using tqdm.
from tqdm import tqdm
chunksize = 1000
for chunk in tqdm(pd.read_csv('doc.csv', chunksize=chunksize, iterator=True)):
    ...  # do something with each individual chunk here
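By default tqdm can only show a running count here, because it does not know how many chunks are coming. If you want a percentage bar, you can pass an approximate total, for example from a quick line count (a sketch that assumes one record per line plus a header):
import pandas as pd
from tqdm import tqdm

chunksize = 1000
with open('doc.csv') as f:
    n_rows = sum(1 for _ in f) - 1                 # subtract the header line
n_chunks = (n_rows + chunksize - 1) // chunksize   # ceiling division

for chunk in tqdm(pd.read_csv('doc.csv', chunksize=chunksize, iterator=True), total=n_chunks):
    ...  # do something with each individual chunk here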
You could use dask, which is built specifically for this kind of out-of-core work.
import dask.dataframe as dd
dd.read_csv("doc.csv", sep = "\t").sum().compute()
You can use pandas.Series.append and pandas.DataFrame.sum along with pandas.DataFrame.iloc while reading the data in chunks:
row_sum = pd.Series([])
for chunk in pd.read_csv('doc.csv', sep="\t", chunksize=50000):
    row_sum = row_sum.append(chunk.iloc[:, 3:].sum(axis=1, skipna=True))
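Note that Series.append was deprecated in pandas 1.4 and removed in 2.0; with newer pandas the same idea can be written by collecting the per-chunk sums in a list and concatenating once at the end (a minimal sketch):
import pandas as pd

parts = []
for chunk in pd.read_csv('doc.csv', sep="\t", chunksize=50000):
    # sum each row, skipping the first three descriptive columns
    parts.append(chunk.iloc[:, 3:].sum(axis=1, skipna=True))
row_sum = pd.concat(parts, ignore_index=True)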

what is an efficient way to load and aggregate a large .bz2 file into pandas?

I'm trying to load a large bz2 file in chunks and aggregate it into a pandas DataFrame, but Python keeps crashing. The methodology I'm using is below, which I've had success with on smaller datasets. What is a more efficient way to aggregate larger-than-memory files into pandas?
Data is line delimited json compressed to bz2, taken from https://files.pushshift.io/reddit/comments/ (all publicly available reddit comments).
import pandas as pd
reader = pd.read_json('RC_2017-09.bz2', compression='bz2', lines=True, chunksize=100000)
df = pd.DataFrame()
for chunk in reader:
    # Count of comments in each subreddit
    count = chunk.groupby('subreddit').size()
    df = pd.concat([df, count], axis=0)
df = df.groupby(df.index).sum()
reader.close()
EDIT: Python crashed when I used chunksize 1e5. The script worked when I increased the chunksize to 1e6.
I used this iterator method, which worked for me without a memory error. You can try it:
chunksize = 10 ** 6
cols = ['a', 'b', 'c', 'd']
iter_csv = pd.read_csv('filename.bz2', compression='bz2', delimiter='\t', usecols=cols, low_memory=False, iterator=True, chunksize=chunksize, encoding="utf-8")
# do your groupby work here in place of the filter below
df = pd.concat([chunk[chunk['b'] == 1012] for chunk in iter_csv])
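Applied to the original groupby goal, the same iterator pattern might look roughly like this (a sketch; the file and column names are taken from the question, and only a running total of counts is kept in memory):
import pandas as pd

reader = pd.read_json('RC_2017-09.bz2', compression='bz2', lines=True, chunksize=10**6)
counts = pd.Series(dtype='int64')
for chunk in reader:
    # add this chunk's per-subreddit counts into the running total
    counts = counts.add(chunk.groupby('subreddit').size(), fill_value=0)
counts = counts.astype('int64').sort_values(ascending=False)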

Reading random rows of a large csv file, python, pandas

Could you help me? I am facing a problem reading random rows from a large csv file using pandas 0.18.1 and Python 2.7.10 on Windows (8 GB RAM).
In Read a small random sample from a big CSV file into a Python data frame
I saw an approach; however, it turned out to be very memory-consuming on my PC. Part of the code:
n = 100
s = 10
skip = sorted(rnd.sample(xrange(1, n), n - s))  # skip n-s random rows from *.csv
data = pd.read_csv(path, usecols=['Col1', 'Col2'],
                   dtype={'Col1': 'int32', 'Col2': 'int32'}, skiprows=skip)
So if I want to take some random rows from the file considering not just 100 rows but 100,000, it becomes hard. However, taking non-random rows from the file works almost fine:
skip = xrange(100000)
data = pd.read_csv(path, usecols=['Col1', 'Col2'],
                   dtype={'Col1': 'int32', 'Col2': 'int32'}, skiprows=skip, nrows=10000)
So the question is: how can I deal with reading a large number of random rows from a large csv file with pandas? Since I can't read the entire csv file, even in chunks, I'm interested in exactly the random rows.
Thanks
If memory is the biggest issue, a possible solution might be to use chunks and randomly select from the chunks:
import random

n = 100
s = 10
factor = 1  # should be integer
chunksize = int(s / factor)
reader = pd.read_csv(path, usecols=['Col1', 'Col2'], dtype={'Col1': 'int32', 'Col2': 'int32'}, chunksize=chunksize)
out = []
tot = 0
for df in reader:
    nsample = random.randint(factor, chunksize)
    tot += nsample
    if tot > s:
        nsample = s - (tot - nsample)
    out.append(df.sample(nsample))
    if tot >= s:
        break
data = pd.concat(out)
And you can use factor to control the sizes of the chunks.
I think this is faster than the other methods shown here and may be worth trying.
Say we have already chosen the rows to be skipped in a list skipped. First, I convert it to a lookup bool table.
# Some preparation:
import numpy as np

skipped = np.asarray(skipped)
# MAX >= number of rows in the file
bool_skipped = np.zeros((MAX,), dtype=bool)
bool_skipped[skipped] = True
Main stuff:
from io import StringIO
# in Python 2 use
# from StringIO import StringIO

def load_with_buffer(filename, bool_skipped, **kwargs):
    s_buf = StringIO()
    with open(filename) as file:
        count = -1
        for line in file:
            count += 1
            if bool_skipped[count]:
                continue
            s_buf.write(line)
    s_buf.seek(0)
    df = pd.read_csv(s_buf, **kwargs)
    return df
I tested it as follows:
df = pd.DataFrame(np.random.rand(100000, 100))
df.to_csv('test.csv')
df1 = load_with_buffer('test.csv', bool_skipped, index_col=0)
with 90% of rows skipped. It performs comparably to
pd.read_csv('test.csv', skiprows=skipped, index_col=0)
and is about 3-4 times faster than using dask or reading in chunks.
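For completeness, one way the skipped/bool_skipped pair for the 90% test above could be built (a sketch; it assumes line 0 is the header written by to_csv, which must never be skipped):
import numpy as np

n_lines = 100001                      # header + 100000 data rows
rng = np.random.RandomState(0)        # arbitrary seed, for repeatability
# choose 90% of the data lines to skip, never touching line 0 (the header)
skipped = rng.choice(np.arange(1, n_lines), size=int(0.9 * 100000), replace=False)

bool_skipped = np.zeros(n_lines, dtype=bool)
bool_skipped[skipped] = True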

difference in csv.reader and pandas - python

I am importing a csv file using csv.reader and pandas. However, the number of rows reported for the same file is different.
import csv

reviews = []
openfile = open("reviews.csv", 'rb')
r = csv.reader(openfile)
for i in r:
    reviews.append(i)
openfile.close()
print len(reviews)
The result is 10,000 (which is the correct value). However, pandas returns a different value.
df = pd.read_csv("reviews.csv", header=None)
df.info()
This returns 9,985.
Does anyone know why there is a difference between the two methods of importing data?
I just tried this:
reviews_df = pd.DataFrame(reviews)
reviews_df.info()
This returns 10,000.
Refer to pandas.read_csv: there is an argument named skip_blank_lines and its default value is True, hence unless you set it to False it will not read the blank lines.
Consider the following example, where there are two blank rows:
A,B,C,D
0.07,-0.71,1.42,-0.37
0.08,0.36,0.99,0.11

1.06,1.55,-0.93,-0.90
-0.33,0.13,-0.11,0.89
1.91,-0.74,0.69,0.83

-0.28,0.14,1.28,-0.40
0.35,1.75,-1.10,1.23
-0.09,0.32,0.91,-0.08
Read it with skip_blank_lines=False:
df = pd.read_csv('test_data.csv', skip_blank_lines=False)
len(df)
10
Read it with skip_blank_lines=True:
df = pd.read_csv('test_data.csv', skip_blank_lines=True)
len(df)
8
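If you then want to see which rows came from the blank lines, the all-NaN rows give them away (a small sketch):
import pandas as pd

df = pd.read_csv('test_data.csv', skip_blank_lines=False)
blank = df[df.isna().all(axis=1)]
print(blank.index.tolist())  # positions of the rows that were blank lines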
