pandas split chunks missing rows when chunksize is small - python

I have an 8 GB file that I split into hundreds of chunks, each chunk with n rows:
def find_contract():
    n = 100000
    global df
    for i, df in enumerate(pd.read_csv(file, chunksize=n,
                                       iterator=True, low_memory=False)):
        target_row = df.loc[(df.contract == 'C123') & (df.RB == '63')]
        print(target_row)
find_contract()
If n = 100000, 50000, or 40000, it finds target_row, but once n <= 30000 the program can no longer find the target row, and some other rows also seem to be missing.
This looks weird; can anyone help?
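One thing worth checking (an assumption on my side, not something confirmed in the question): with chunked reads pandas infers dtypes per chunk, so a column like RB can come back numeric in one chunk and as strings in another, and the string comparison then silently matches nothing in some chunks. A minimal sketch that pins the compared columns to strings and collects matches across all chunks:
import pandas as pd

file = 'big_file.csv'  # placeholder path; use the real file from the question

matches = []
# dtype=str keeps the comparison behaviour identical for every chunksize
for chunk in pd.read_csv(file, chunksize=30000, dtype={'contract': str, 'RB': str}):
    hit = chunk.loc[(chunk.contract == 'C123') & (chunk.RB == '63')]
    if not hit.empty:
        matches.append(hit)

if matches:
    print(pd.concat(matches))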

Does anyone have a better way of efficiently creating a DataFrame from 60000 txt files with keys in one column and values in the second?

Disclaimer: this is my first post ever, so sorry if I don't meet certain standards of the community.
I use Python 3, Jupyter Notebooks, and pandas.
I used the KMC k-mer counter to count the k-mers of 60,000 DNA sequences in a reasonable amount of time. I want to use these k-mer counts as input to ML algorithms as part of a Bag of Words model.
A file containing k-mer counts looks like the following, and I have 60K such files:
AAAAAC     2
AAAAAG     6
AAAAAT     2
AAAACC     4
AAAACG     2
AAAACT     3
AAAAGA     5
I want to create a single DataFrame from all the 60K files, with one row per DNA sequence holding its k-mer counts, which would have this form:
[Image: the target DataFrame shape (one row per sequence, one column per k-mer)]
A first approach was successful: I managed to import 100 sequences (100 txt files) in 58 seconds, using this code:
import time
import pandas as pd

countsPath = r'D:\DataSet\MULTI\bow\6mer'
k = 6                  # k-mer length (inferred from the '6mer' folder name)
df = pd.DataFrame()    # accumulator for all samples

start = time.time()
for i in range(0, 60000):
    sample = pd.read_fwf(countsPath + r'\kmers-' + str(k) + '-seqNb-' + str(i) + '.txt',
                         sep=" ", header=None).T
    new_header = sample.iloc[0]    # grab the first row for the header
    sample = sample[1:]            # take the data less the header row
    sample.columns = new_header    # set the header row as the df header
    df = df.append(sample, ignore_index=True)  # append the sample row to the full DataFrame
end = time.time()

# total time taken
print(f"Runtime of the program is {end - start} secs")
# display(sample)
display(df)
However, this is very slow: about 59 seconds for 100 files, so the full 60,000-file dataset would take roughly 600 times as long.
I tried a Dask DataFrame/Bag to accelerate the process, because it reads dictionary-like data, but I couldn't append each file as a row. The resulting Dask DataFrame looks like this:
0          AAAAA   18
1          AAAAC   16
2          AAAAG   13
...
1023   TTTTT   14
0          AAAAA   5
1          AAAAC   4
...
1023   TTTTT   9
0          AAAAA   18
1          AAAAC   16
2          AAAAG   13
3          AAAAT   12
4          AAACA   11
So the files are all being stacked into a single column. Does anyone have a better way of efficiently creating a DataFrame from 60K txt files?
Love the disclaimer. I have a similar one - this is the first time I'm trying to answer a question. But I'm pretty certain I got this...and so will you:
dict_name = dict(zip(df['column_name'],df['the_other_column_name']))
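To make that one-liner concrete, here is a minimal sketch of how it could be applied to the 60K files (the file naming and the k value are taken or inferred from the question, and fillna(0) is my assumption for k-mers absent from a file): build one kmer-to-count dict per file, collect them in a list, and construct the DataFrame once at the end instead of appending row by row.
import pandas as pd

countsPath = r'D:\DataSet\MULTI\bow\6mer'
k = 6  # assumed k-mer length, inferred from the '6mer' folder name

rows = []
for i in range(0, 60000):
    path = countsPath + r'\kmers-' + str(k) + '-seqNb-' + str(i) + '.txt'
    counts = pd.read_csv(path, sep=r'\s+', header=None, names=['kmer', 'count'])
    rows.append(dict(zip(counts['kmer'], counts['count'])))  # kmer -> count for this sequence

# constructing the frame once avoids the quadratic cost of repeated df.append calls
df = pd.DataFrame(rows).fillna(0)
Building the dicts is cheap; it is the repeated df.append reallocation that dominates the original loop.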

Python divide dataframe into chunks

I have a one-column DataFrame with 37365 rows. I need to separate it into chunks like the following:
df[0:2499]
df[2500:4999]
df[5000:7499]
...
df[32500:34999]
df[35000:37364]
The idea would be to use this in a loop like the one below (process_operation does not work for DataFrames larger than 2500 rows):
while chunk < len(df):
    process_operation(df[lower:upper])
EDIT:
I will be having different DataFrames as inputs, some of them smaller than 2500 rows. What would be the best approach to also handle these? E.g. df[0:1234], because 1234 < 2500.
The range function is enough here:
for start in range(0, len(df), 2500):
    process_operation(df[start:start+2500])
Do you mean something like this?
lower = 0
upper = 2500
while lower < len(df):
    process_operation(df[lower:upper])
    lower += 2500
    upper += 2500
I would use np.array_split:
import numpy as np
import math

chunk_max_size = 2500
chunks = int(math.ceil(len(df) / chunk_max_size))
for df_chunk in np.array_split(df, chunks):
    process_operation(df_chunk)  # here len(df_chunk) <= 2500
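A quick self-contained check (toy data of my own, not from the question) that both of the answers above also cope with a DataFrame shorter than 2500 rows, which was the concern in the edit:
import math
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(1234)})   # toy frame, shorter than one chunk

def process_operation(part):            # stand-in for the asker's function
    print(len(part))

# range-based chunking: the single slice df[0:2500] simply returns all 1234 rows
for start in range(0, len(df), 2500):
    process_operation(df[start:start + 2500])

# array_split-based chunking: one chunk, again all 1234 rows
for df_chunk in np.array_split(df, max(1, math.ceil(len(df) / 2500))):
    process_operation(df_chunk)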

compare values in different chunks using pandas

Say I have a large file that I load with chunksize in pandas. I have to compare every value with the ones adjacent to it. My problem is that I can't select the extreme values (the first and last positions) of two different chunks at the same time.
Example:
print(df)
a
0 102
1 101
2 104
3 110
4 104
5 105
count = 0
for i in range(len(df)-1):
    if df.iloc[i+1]['a'] > df.iloc[i]['a']:
        count += 1
count would be equal to 3 in this example. But say I have loaded df from a .csv with chunksize=1, how would I achieve a similar result, considering that values will be in different chunks? In practice chunksize is 10000 and so the problem would be limited to the first and last value for each chunk.
EDIT:
Here is an example where I store last_chunk_value and use it to update the first row of the next chunk in the following loop iteration.
I've tested a 'brute force' method to compare against the 'chunk script'; the results are the same with both methods.
By the way, I've simplified the 'brute force' method.
import pandas as pd
import numpy as np
import random

# 'data' generation as a csv file
file = open("data.csv", 'w')
file.write('rand_int' + '\n')
for i in range(0, 10000):
    file.write(str(random.randint(80, 120)) + '\n')
file.close()

# "brute force" method
df = pd.read_csv("data.csv")
length = int((df.shift(-1) - df > 0).sum())
print('number=', length)

# chunksize method
chunksize = 33
length = 0
last_chunk_value = np.nan
for chunk in pd.read_csv("data.csv", chunksize=chunksize):
    chunk['shift'] = chunk['rand_int'].shift(1)      # previous value within the chunk
    chunk.iloc[0, 1] = last_chunk_value              # first row compares against the previous chunk
    length += (chunk['rand_int'] - chunk['shift'] > 0).sum()
    last_chunk_value = chunk.iloc[-1, 0]             # carry the last value into the next chunk
print('number=', length)
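For reference, a slightly leaner variant of the same carry-over idea (my own sketch, reusing the data.csv generated above): keep only the last value of each chunk and compare it against the first value of the next one, so no extra 'shift' column is needed.
import numpy as np
import pandas as pd

count = 0
last_value = np.nan
for chunk in pd.read_csv("data.csv", chunksize=33):
    values = chunk['rand_int'].to_numpy()
    count += int((np.diff(values) > 0).sum())        # comparisons inside the chunk
    if not np.isnan(last_value) and values[0] > last_value:
        count += 1                                   # comparison across the chunk boundary
    last_value = values[-1]
print('number=', count)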

Repartition Dask DataFrame to get even partitions

I have a Dask DataFrame whose index (client_id) is not unique. Repartitioning and resetting the index ends up with very uneven partitions: some contain only a few rows, others hundreds of thousands. For instance, the following code:
for p in range(ddd.npartitions):
    print(len(ddd.get_partition(p)))
prints out something like this:
55
17
5
41
51
1144
4391
75153
138970
197105
409466
415925
486076
306377
543998
395974
530056
374293
237
12
104
52
28
My DataFrame is one-hot encoded and has over 500 columns, and the larger partitions don't fit in memory. I want to repartition the DataFrame so that the partitions are even in size. Is there an efficient way to do this?
EDIT 1
A simple reproduction:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': np.arange(0, 10000), 'y': np.arange(0, 10000)})
df2 = pd.DataFrame({'x': np.append(np.arange(0, 4995), np.arange(5000, 10000, 1000)),
                    'y2': np.arange(0, 10000, 2)})
dd_df = dd.from_pandas(df, npartitions=10).set_index('x')
dd_df2 = dd.from_pandas(df2, npartitions=5).set_index('x')
new_ddf = dd_df.merge(dd_df2, how='right')
# new_ddf = new_ddf.reset_index().set_index('x')
# new_ddf = new_ddf.repartition(npartitions=2)
new_ddf.divisions
for p in range(new_ddf.npartitions):
    print(len(new_ddf.get_partition(p)))
Note the last partitions (a single element each):
1000
1000
1000
1000
995
1
1
1
1
1
Even when the commented lines are uncommented, the partitions remain uneven in size.
Edit II: Workaround
A simple workaround can be achieved with the following code.
Is there a more elegant way to do this (more in a Dask way)?
def repartition(ddf, npartitions=None):
    MAX_PART_SIZE = 100*1024
    if npartitions is None:
        npartitions = ddf.npartitions
    one_row_size = sum([dt.itemsize for dt in ddf.dtypes])
    length = len(ddf)
    requested_part_size = length/npartitions*one_row_size
    if requested_part_size <= MAX_PART_SIZE:
        np = npartitions
    else:
        np = length*one_row_size/MAX_PART_SIZE
    chunksize = int(length/np)

    vc = ddf.index.value_counts().to_frame(name='count').compute().sort_index()
    vsum = 0
    divisions = [ddf.divisions[0]]
    for i, v in vc.iterrows():
        vsum += v['count']
        if vsum > chunksize:
            divisions.append(i)
            vsum = 0
    divisions.append(ddf.divisions[-1])
    return ddf.repartition(divisions=divisions, force=True)
You're correct that .repartition won't do the trick since it doesn't handle any of the logic for computing divisions and just tries to combine the existing partitions wherever possible. Here's a solution I came up with for the same problem:
import numpy as np
import dask.dataframe as dd

def _rebalance_ddf(ddf):
    """Repartition dask dataframe to ensure that partitions are roughly equal size.

    Assumes `ddf.index` is already sorted.
    """
    if not ddf.known_divisions:  # e.g. for read_parquet(..., infer_divisions=False)
        ddf = ddf.reset_index().set_index(ddf.index.name, sorted=True)
    index_counts = ddf.map_partitions(lambda _df: _df.index.value_counts().sort_index()).compute()
    index = np.repeat(index_counts.index, index_counts.values)
    divisions, _ = dd.io.io.sorted_division_locations(index, npartitions=ddf.npartitions)
    return ddf.repartition(divisions=divisions)
The internal function sorted_division_locations does what you want already, but it only works on an actual list-like, not a lazy dask.dataframe.Index. This avoids pulling the full index in case there are many duplicates and instead just gets the counts and reconstructs locally from that.
If your dataframe is so large that even the index won't fit in memory then you'd need to do something even more clever.
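As a usage sketch (assuming the reproduction from the question above has been run, so new_ddf exists):
# rebalance the merged frame from the reproduction and inspect the result
balanced = _rebalance_ddf(new_ddf)
print(balanced.divisions)
for p in range(balanced.npartitions):
    print(len(balanced.get_partition(p)))  # partition sizes should now be roughly even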

Pandas GroupBy Mean of Large DataSet in CSV

A common SQLism is "Select A, mean(X) from table group by A" and I would like to replicate this in pandas. Suppose that the data is stored in something like a CSV file and is too big to be loaded into memory.
If the CSV could fit in memory a simple two-liner would suffice:
data=pandas.read_csv("report.csv")
mean=data.groupby(data.A).mean()
When the CSV cannot be read into memory one might try:
chunks = pandas.read_csv("report.csv", chunksize=whatever)
cmeans = pandas.concat([chunk.groupby(chunk.A).mean().reset_index() for chunk in chunks])
badMeans = cmeans.groupby(cmeans.A).mean()
Except that the resulting cmeans table contains repeated entries for each distinct value of A, one for each appearance of that value of A in distinct chunks (since read_csv's chunksize knows nothing about the grouping fields). As a result the final badMeans table has the wrong answer: it would need to compute a weighted mean instead.
So a working approach seems to be something like:
final = pandas.DataFrame({"A": [], "tot": [], "cnt": []})
for chunk in chunks:
    t = chunk.groupby(chunk.A).X.sum()
    c = chunk.groupby(chunk.A).X.count()
    cmean = pandas.DataFrame({"tot": t, "cnt": c}).reset_index()
    joined = pandas.concat([final, cmean])
    final = joined.groupby(joined.A).sum().reset_index()
mean = final.tot / final.cnt
Am I missing something? This seems insanely complicated... I would rather write a for loop that processes a CSV line by line than deal with this. There has to be a better way.
I think you could do something like the following which seems a bit simpler to me. I made the following data:
id,val
A,2
A,5
B,4
A,2
C,9
A,7
B,6
B,1
B,2
C,4
C,4
A,6
A,9
A,10
A,11
C,12
A,4
A,4
B,6
B,5
C,7
C,8
B,9
B,10
B,11
A,20
I'll do chunks of 5:
chunks = pd.read_csv("foo.csv", chunksize=5)
pieces = [x.groupby('id')['val'].agg(['sum', 'count']) for x in chunks]
agg = pd.concat(pieces).groupby(level=0).sum()
print(agg['sum'] / agg['count'])
id
A 7.272727
B 6.000000
C 7.333333
Compared to the non-chunked version:
df = pd.read_csv('foo.csv')
print(df.groupby('id')['val'].mean())
id
A 7.272727
B 6.000000
C 7.333333
