Python pandas: how does chunksize works?

I have the following code:
from numpy import dtype
import pandas as pd
import os
import sys
inputFile='data.json'
chunks = pd.read_json(inputFile, lines=True, chunksize = 1000)
original_stdout = sys.stdout
i = 1
for c in chunks:
    location = c.location.str.split(',')
    for b in range(1000):
        print(location[b])
        if not type(location[b]) == float:
            # get the country name
            country = location[b][-1]
        else:
            country = 'unknown'
I'm extracting the location field from a large file including json objects. Because the file is so large, I've divided it into 1000-line chunks. I cycle through each chunk and retrieve the information I require:
for c in chunks:
    location = c.a.str.split(',')
    for b in range(1000):
        print(location[b])
All goes smoothly during the first iteration. At the second iteration the line:
print(location[b])
gives the error:
ValueError: 0 is not in range
How do I cycle through the chunks following the first?
Thank you for your help

The problem is that location[b] accesses the location Series by index label, not by position (i.e., you are asking for the row whose index value is b). The chunks keep the index of the whole file, so the first chunk's index starts at 0, the second's at 1000, and so on. Index 0 is therefore only contained in the first chunk.
So, instead, you need to iterate over the rows without using the index:
for row in location:
    # Do something.
In fact, if you look at the full traceback of the error you will probably also see a KeyError below the ValueError.
To iterate over the Series and also get the index, you can use Series.items() (Series.iteritems() in older pandas; it was removed in pandas 2.0):
for idx, row in location.items():
    # Do something...
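As a rough illustration of the point about chunk indexes, here is a minimal sketch that reads a tiny in-memory JSON Lines sample in chunks of two. The sample records and the names raw and seen are made up for the example; only the location field comes from the question.

```python
from io import StringIO

import pandas as pd

# Hypothetical four-record JSON Lines input standing in for data.json
raw = StringIO(
    '{"location": "Paris,France"}\n'
    '{"location": null}\n'
    '{"location": "Rome,Italy"}\n'
    '{"location": "Berlin,Germany"}\n'
)

seen = []
for chunk in pd.read_json(raw, lines=True, chunksize=2):
    # Take the last comma-separated part; missing locations become "unknown"
    country = chunk["location"].str.split(",").str[-1].fillna("unknown")
    # items() yields (index label, value) pairs, so no positional lookup is needed
    for idx, name in country.items():
        seen.append((idx, name))

# The labels keep counting across chunks instead of restarting at 0
print(seen)
```

Iterating with items() avoids the label lookup entirely, which is why it keeps working on the second and later chunks.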

Related

How to get around a NumPy error "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"

The code below is being used to analyse a csv file, and at the moment I'm trying to remove the columns of the array which are not in my check_list. It only checks the first row, and if the first row of a particular column doesn't belong to check_list, it removes the entire column. But this error keeps getting thrown and I'm not sure how to avoid it.
import numpy as np
def load_metrics(filename):
    """opens a csv file and returns stuff"""
    check_list = ["created_at","tweet_ID","valence_intensity","anger_intensity","fear_intensity","sadness_intensity","joy_intensity","sentiment_category","emotion_category"]
    file=open(filename)
    data = []
    for lin in file:
        lin = lin.strip()
        lin = lin.split(",")
        data.append(lin)
    for col in range(len(data[0])):
        if np.any(data[0][col] not in check_list) == True:
            data[0]= np.delete(np.array(data), col, 1)
            print(col)
    return np.array(data)
The below test is being used on the code too:
data = load_metrics("covid_sentiment_metrics.csv")
print(data[0])
Test results:
['created_at' 'tweet_ID' 'valence_intensity' 'anger_intensity'
'fear_intensity' 'sadness_intensity' 'joy_intensity' 'sentiment_category'
'emotion_category']

Change your load_metrics function to:
def load_metrics(filename):
    check_list = ["created_at","tweet_ID","valence_intensity","anger_intensity",
        "fear_intensity","sadness_intensity","joy_intensity","sentiment_category",
        "emotion_category"]
    data = []
    with open(filename, 'r') as file:
        for lin in file:
            lin = lin.strip()
            lin = lin.split(",")
            data.append(lin)
    arr = np.array(data)
    colFilter = []
    for col in arr[0]:
        colFilter.append(col in check_list)
    return arr[:, colFilter]
I introduced the following corrections:
Use with to automatically close the input file (your code fails to close it).
Create a "full" Numpy array (all columns) after the data has been read.
Compute colFilter list - which columns are in check_list.
Return only filtered columns.

Read columns by checklist
This code does not include checks related to reading a file or a broken data structure, so that the main idea is more or less clear. So, here I assume that a csv-file exists and has at least 2 lines:
import numpy as np
def load_metrics(filename, check_list):
    """open a csv file and return data as numpy.array
    with columns from a check list"""
    data = []
    with open(filename) as file:
        headers = file.readline().rstrip("\n").split(",")
        for line in file:
            data.append(line.rstrip("\n").split(","))
    col_to_remove = []
    for col in reversed(range(len(headers))):
        if headers[col] not in check_list:
            col_to_remove.append(col)
            headers.pop(col)
    data = np.delete(np.array(data), col_to_remove, 1)
    return data, headers
Quick testing:
test_data = """\
hello,some,other,world
1,2,3,4
5,6,7,8
"""
with open("test.csv",'w') as f:
    f.write(test_data)
check_list = ["hello","world"]
d, h = load_metrics("test.csv", check_list)
print(d, h)
Expected output:
[['1' '4']
['5' '8']] ['hello', 'world']
Some details:
Instead of np.any(data[0][col] not in check_list) == True, the plain expression data[0][col] not in check_list is enough.
Stripping with default parameters is risky because it can delete meaningful whitespace.
Do not delete anything while looping forward. It can be done (with some reservations) while looping backward.
check_list is better as a parameter.
Separate data and headers, as they may have different types.
In your case it is better to use pandas.read_csv.
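The pandas.read_csv suggestion could look roughly like the sketch below. It reuses the small test table from this answer as an in-memory string (csv_text is a name introduced here) and relies on the usecols parameter, which accepts a callable that is evaluated against each column name.

```python
from io import StringIO

import pandas as pd

check_list = ["hello", "world"]

# Same toy table as in the quick test above, kept in memory for the example
csv_text = "hello,some,other,world\n1,2,3,4\n5,6,7,8\n"

# usecols keeps only the columns for which the callable returns True
df = pd.read_csv(StringIO(csv_text), usecols=lambda name: name in check_list)

print(df.columns.tolist())
print(df.to_numpy())
```

Unlike the NumPy version, the values are parsed as numbers rather than kept as strings, and df.to_numpy() gives the array if one is needed.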

how to iterate over files in python and export several output files

I have some code that I want to put in a for loop. I want to feed data stored in files into my code and, based on each input, generate outputs automatically. At the moment, my code only works for one input file and consequently gives one output. My input file is named model000.msh, but I actually have a series of these input files named model000.msh, model001.msh, and so on. In the code I do some calculations on the imported file and finally compare it to a numpy array (my_data) that is generated from another numpy array (ID) having one column and thousands of rows. The ID array is the second variable I want to iterate over. ID is turned into my_data through np.concatenate, and I want to use each column of ID to make my_data (my_data=np.concatenate((ID[:,iterator], gr), axis =1)). So, I want to iterate over several files, extract arrays from each file (extracted), then generate my_data from each column of ID, do calculations on my_data and extracted, and finally export the results of each iteration with a dynamic naming scheme (changed_000, changed_001 and so on). This is my code for one single input and one single my_data array (made from an ID that has only one column), but I want to iterate over several input files and several my_data arrays and produce several outputs:
from itertools import islice
with open('model000.msh') as lines:
    nodes = np.genfromtxt(islice(lines, 0, 1000))
with open('model000.msh', "r") as f:
    saved_lines = np.array([line.split() for line in f if len(line.split()) == 9])
saved_lines[saved_lines == ''] = 0.0
elem = saved_lines.astype(np.int)
# following lines extract some data from my file
extracted=np.c_[elem[:,:-4], nodes[elem[:,-4]-1, 1:], nodes[elem[:,-3]-1, 1:],nodes[elem[:,-2]-1, 1:], nodes[elem[:,-1]-1, 1:]]
…
extracted =np.concatenate((extracted, avs), axis =1) # each input file ('model000.msh') will make this numpy array
# another data set, stored as a numpy array is compared to the data extracted from the file
ID= np.array [[… ..., …, …]] # now, it is has one column, but it should have several columns and each iteration, one column will make a my_data array
my_data=np.concatenate((ID, gr), axis =1) # I think it should be something like my_data=np.concatenate((ID[:,iterator], gr), axis =1)
from scipy.spatial import distance
distances=distance.cdist(extracted [:,17:20],my_data[:,1:4])
ind_min_dis=np.argmin(distances, axis=1).reshape(-1,1)
z=np.array([])
for i in ind_min_dis:
    u=my_data[i,0]
    z=np.array([np.append(z,u)]).reshape(-1,1)
final_merged=np.concatenate((extracted,z), axis =1)
new_vol=final_merged[:,-1].reshape(-1,1)
new_elements=np.concatenate((elements,new_vol), axis =1)
new_elements[:,[4,-1]] = new_elements[:,[-1,4]]
# The next block is output block
chunk_size = 3
buffer = ""
i = 0
relavent_line = 0
with open('changed_00', 'a') as fout:
    with open('model000.msh', 'r') as fin:
        for line in fin:
            if len(line.split()) == 9:
                aux_string = ' '.join([str(num) for num in new_elements[relavent_line]])
                buffer += '%s\n' % aux_string
                relavent_line += 1
            else:
                buffer += line
            i+=1
            if i == chunk_size:
                fout.write(buffer)
                i=0
                buffer = ""
        if buffer:
            fout.write(buffer)
            i=0
            buffer = ""
I appreciate any help in advance.

I'm not very sure about your question. But it seems like you are asking for something like:
for idx in range(10):
    with open('changed_{:0>2d}'.format(idx), 'a') as fout:
        with open('model0{:0>2d}.msh'.format(idx), 'r') as fin:
            #read something from fin...
            #calculate something...
            #write something to fout...
If so, you could search for str.format() for more details.
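To make the idea concrete, here is a minimal sketch of such a loop using f-string formatting with three-digit numbering. The function name process_all, the temporary folder, and the dummy file contents are assumptions for the demonstration, and the per-file calculation is replaced by a plain copy.

```python
import tempfile
from pathlib import Path


def process_all(folder, n_files):
    """Read model000.msh, model001.msh, ... and write changed_000, changed_001, ..."""
    folder = Path(folder)
    written = []
    for idx in range(n_files):
        source = folder / f"model{idx:03d}.msh"
        target = folder / f"changed_{idx:03d}"
        with open(source) as fin, open(target, "w") as fout:
            for line in fin:
                # Placeholder for the real calculation: copy each line unchanged.
                # The same idx could also select a column, e.g. ID[:, idx].
                fout.write(line)
        written.append(target.name)
    return written


# Small demonstration with two dummy input files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for idx in range(2):
        Path(tmp, f"model{idx:03d}.msh").write_text(f"dummy content {idx}\n")
    outputs = process_all(tmp, 2)

print(outputs)
```

Opening the output with "w" instead of "a" means a rerun overwrites earlier results rather than appending to them; which one is wanted depends on the workflow.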

iterating through a file Python

I am looking to tally up all of the Baseline numbers in the Isc, Voc Imp, Vmp, FF and Pmp columns individually and take the average for each column. Below is the file that I am reading in to my program (test_results.csv).
Here is my code.
from MyClasses import TestResult
def main():
    test = "test_results.csv"
    inputFile = open(test, 'r')
    user = TestResult()
    counter = 0.0
    hold = 0.0
    for i in range (4,10):
        for l in inputFile.readlines()[1:]:
            split = l.split(",")
            if user.getTestSeq(split[1]) == "Baseline":
                num = float(user.getIsc(split[i]))
                hold += num
                counter += 1
        print counter
        print hold
        total = hold/counter
        print total
main()
I used the line
num = float(user.getIsc(split[i]))
with the hope that I could iterate through with the i, totaling one column, taking the average and moving to the next column. But I am not able to move to the next column. I just print out the same Isc column multiple times. Any ideas as to why? I am also looking to put the Test Sequences items in a list that I could iterate through in the same way for line
if user.getTestSeq(split[1]) == "Baseline":
so that I can tally up all the columns for Baseline, then move to tally up all columns for TC200, Hotspot and so on. Is this a good approach? wanted to solve the first iteration issue first before moving on to this one.
Thank you

You should use either DictReader from the csv module or read_csv from the pandas module.
I recommend pandas, since you also perform operations on your data.
import pandas as pd
df = pd.read_csv("test_results.csv")
df will contain your CSV table as is; numeric columns are parsed as numbers, so there is no need to convert them to floats or integers yourself.
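For the DictReader route, a sketch under assumed column names might look like this. The header names ("Test Sequence", "Isc", "Voc") and the sample rows are hypothetical, since the real test_results.csv is not shown. It also avoids the likely cause of the repeated output in the question: a file object is exhausted after the first readlines() call, so later passes over it see no rows.

```python
import csv
from io import StringIO
from statistics import mean

# Hypothetical sample; the real header names in test_results.csv may differ
sample = StringIO(
    "Module,Test Sequence,Isc,Voc\n"
    "A,Baseline,9.0,38.0\n"
    "A,TC200,8.0,37.0\n"
    "B,Baseline,9.5,38.5\n"
)


def column_averages(handle, sequence, columns):
    """Average each requested column over the rows of one test sequence."""
    # Read the matching rows once into a list so every column can reuse them
    rows = [row for row in csv.DictReader(handle) if row["Test Sequence"] == sequence]
    return {col: mean(float(row[col]) for row in rows) for col in columns}


averages = column_averages(sample, "Baseline", ["Isc", "Voc"])
print(averages)
```

Looping over a list of sequence names (Baseline, TC200, Hotspot, ...) and calling the function on a freshly opened file each time would extend this to the other test sequences.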

Read a small random sample from a big CSV file into a Python data frame

The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?

Assuming no header in the CSV file:
import pandas
import random
n = 1000000 #number of records in file
s = 10000 #desired sample size
filename = "data.txt"
skip = sorted(random.sample(range(n),n-s))
df = pandas.read_csv(filename, skiprows=skip, header=None) # header=None because the file has no header row
It would be better if read_csv had a keeprows option, or if skiprows took a callback function instead of a list.
With header and unknown file length:
import pandas
import random
filename = "data.txt"
n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
s = 10000 #desired sample size
skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)

@dlm's answer is great, but since v0.20.0 skiprows does accept a callable. The callable receives the row number as its argument.
Note also that their answer for unknown file length relies on iterating through the file twice -- once to get the length, and then another time to read the csv. I have three solutions here which only rely on iterating through the file once, though they all have tradeoffs.
Solution 1: Approximate Percentage
If you can specify what percent of lines you want, rather than how many lines, you don't even need to get the file size and you just need to read through the file once. Assuming a header on the first row:
import pandas as pd
import random
p = 0.01 # 1% of the lines
# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df = pd.read_csv(
    filename,
    header=0,
    skiprows=lambda i: i>0 and random.random() > p
)
As pointed out in the comments, this only gives approximately the right number of lines, but I think it satisfies the desired usecase.
Solution 2: Every Nth line
This isn't actually a random sample, but depending on how your input is sorted and what you're trying to achieve, this may meet your needs.
n = 100 # every 100th line = 1% of the lines
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
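A small, self-contained check of how the callable is applied may help here. The sketch below builds a ten-row, single-column CSV in memory (csv_text and the column name value are made up) and keeps every fifth line; row 0 is the header, which survives because 0 % n == 0.

```python
from io import StringIO

import pandas as pd

# Header line plus the values 1..10, one per line
csv_text = "value\n" + "\n".join(str(i) for i in range(1, 11)) + "\n"

n = 5  # keep every 5th line of the file
df = pd.read_csv(StringIO(csv_text), header=0, skiprows=lambda i: i % n != 0)

print(df["value"].tolist())
```

The callable receives the line number within the file, counted from 0 and including the header line.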
Solution 3: Reservoir Sampling
(Added July 2021)
Reservoir sampling is an elegant algorithm for selecting k items randomly from a stream whose length is unknown, but that you only see once.
The big advantage is that you can use this without having the full dataset on disk, and that it gives you an exactly-sized sample without knowing the full dataset size. The disadvantage is that I don't see a way to implement it in pure pandas, I think you need to drop into python to read the file and then construct the dataframe afterwards. So you may lose some functionality from read_csv or need to reimplement it, since we're not using pandas to actually read the file.
Taking an implementation of the algorithm from Oscar Benjamin here:
from math import exp, log, floor
from random import random, randrange
from itertools import islice
from io import StringIO
import pandas as pd
def reservoir_sample(iterable, k=1):
    """Select k items uniformly from iterable.
    Returns the whole population if there are k or fewer items
    from https://bugs.python.org/issue41311#msg373733
    """
    iterator = iter(iterable)
    values = list(islice(iterator, k))
    W = exp(log(random())/k)
    while True:
        # skip is geometrically distributed
        skip = floor( log(random())/log(1-W) )
        selection = list(islice(iterator, skip, skip+1))
        if selection:
            values[randrange(k)] = selection[0]
            W *= exp(log(random())/k)
        else:
            return values
def sample_file(filepath, k):
    with open(filepath, 'r') as f:
        header = next(f)
        # sample the data rows only; the header is put back in front
        result = [header] + reservoir_sample(f, k)
    df = pd.read_csv(StringIO(''.join(result)))
    return df
The reservoir_sample function returns a list of strings, each of which is a single row, so we just need to turn it into a dataframe at the end. This assumes there is exactly one header row, I haven't thought about how to extend it to other situations.
I tested this locally and it is much faster than the other two solutions. Using a 550 MB csv (January 2020 "Yellow Taxi Trip Records" from the NYC TLC), solution 3 runs in about 1 second, while the other two take ~3-4 seconds.
In my test this is even slightly (~10-20%) faster than @Bar's answer using shuf, which surprises me.

This is not in Pandas, but it achieves the same result much faster through bash, while not reading the entire file into memory:
shuf -n 100000 data/original.tsv > data/sample.tsv
The shuf command shuffles the input, and the -n argument indicates how many lines we want in the output.
Relevant question: https://unix.stackexchange.com/q/108581
Benchmark on a 7M lines csv available here (2008):
Top answer:
import random
import pandas
def pd_read():
    filename = "2008.csv"
    n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
    s = 100000 #desired sample size
    skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
    df = pandas.read_csv(filename, skiprows=skip)
    df.to_csv("temp.csv")
Timing for pandas:
%time pd_read()
CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s
Wall time: 18.9 s
While using shuf:
time shuf -n 100000 2008.csv > temp.csv
real 0m1.583s
user 0m1.445s
sys 0m0.136s
So shuf is about 12x faster and importantly does not read the whole file into memory.

Here is an algorithm that doesn't require counting the number of lines in the file beforehand, so you only need to read the file once.
Say you want m samples. First, the algorithm keeps the first m samples. When it sees the i-th sample (i > m), with probability m/i, the algorithm uses the sample to randomly replace an already selected sample.
By doing so, for any i > m, we always have a subset of m samples randomly selected from the first i samples.
See code below:
import random
n_samples = 10
samples = []
for i, line in enumerate(f):
    if i < n_samples:
        samples.append(line)
    elif random.random() < n_samples * 1. / (i+1):
        samples[random.randint(0, n_samples-1)] = line
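The replacement rule described above can be wrapped into a reusable function. The sketch below is a common variant of the same idea (often called Algorithm R) that draws one random index per item instead of two random numbers; the function name, the rng parameter, and the generated test rows are assumptions made for the example.

```python
import random


def reservoir_sample(lines, k, rng=random):
    """Return k items chosen uniformly from an iterable of unknown length."""
    samples = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items
            samples.append(line)
        else:
            # Pick a slot in 0..i; the item is kept only if the slot is inside
            # the reservoir, which happens with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                samples[j] = line
    return samples


random.seed(0)  # fixed seed only to make the demonstration repeatable
picked = reservoir_sample((f"row {n}" for n in range(1000)), 10)
print(len(picked))
```

Passing an open file object as lines samples whole lines from it in a single pass; if the input has fewer than k items, all of them are returned.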

The following code reads first the header, and then a random sample on the other lines:
import pandas as pd
import numpy as np
filename = 'hugedatafile.csv'
nlinesfile = 10000000
nlinesrandomsample = 10000
lines2skip = np.random.choice(np.arange(1,nlinesfile+1), (nlinesfile-nlinesrandomsample), replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)

from random import randint
class magic_checker:
    def __init__(self,target_count):
        self.target = target_count
        self.count = 0
    def __eq__(self,x):
        # counts comparisons; reports "equal" once enough lines have been read
        self.count += 1
        return self.count >= self.target
min_target=100000
max_target = min_target*2
nlines = randint(100,1000)
seek_target = randint(min_target,max_target)
with open("big.csv") as f:
    f.seek(seek_target)
    f.readline() #discard this line
    rand_lines = list(iter(lambda:f.readline(),magic_checker(nlines)))
#do something to process the lines you got returned .. perhaps just a split
print(rand_lines)
print(rand_lines[0].split(","))
Something like that should work, I think.

No pandas!
import random
from os import fstat
from sys import exit
# Binary mode: Python 3 does not allow relative seeks on text-mode files
f = open('/usr/share/dict/words', 'rb')
# Number of lines to be read
lines_to_read = 100
# Minimum and maximum bytes that will be randomly skipped
min_bytes_to_skip = 10000
max_bytes_to_skip = 1000000
def is_EOF():
    return f.tell() >= fstat(f.fileno()).st_size
# To accumulate the read lines (as bytes, because of binary mode)
sampled_lines = []
for n in range(lines_to_read):
    bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
    f.seek(bytes_to_skip, 1)
    # After skipping "bytes_to_skip" bytes, we can stop in the middle of a line
    # Skip current entire line
    f.readline()
    if not is_EOF():
        sampled_lines.append(f.readline())
    else:
        # Go to the beginning of the file ...
        f.seek(0, 0)
        # ... and skip lines again
        f.seek(bytes_to_skip, 1)
        # If it has reached the EOF again
        if is_EOF():
            print("You have skipped more lines than your file has")
            print("Reduce the values of:")
            print(" min_bytes_to_skip")
            print(" max_bytes_to_skip")
            exit(1)
        else:
            f.readline()
            sampled_lines.append(f.readline())
print(sampled_lines)
You'll end up with a sampled_lines list. What kind of statistics do you mean?

Use subsample:
pip install subsample
subsample -n 1000 file.csv > file_1000_sample.csv

You can also create a sample with the 10000 records before bringing it into the Python environment.
Using Git Bash (Windows 10) I just ran the following command to produce the sample
shuf -n 10000 BIGFILE.csv > SAMPLEFILE.csv
To note: If your CSV has headers this is not the best solution.

TL;DR
If you know the size of the sample you want, but not the size of the input file, you can efficiently load a random sample out of it with the following pandas code:
import pandas as pd
import numpy as np
filename = "data.csv"
sample_size = 10000
batch_size = 200
rng = np.random.default_rng()
sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)
sample = sample_reader.get_chunk(sample_size)
for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))
    sample.loc[chunk.index] = chunk
Explanation
It's not always trivial to know the size of the input CSV file.
If there are embedded line breaks, tools like wc or shuf will give you the wrong answer or just make a mess out of your data.
So, based on desktable's answer, we can treat the first sample_size lines of the file as the initial sample and then, for each subsequent line in the file, randomly replace a line in the initial sample.
To do that efficiently, we load the CSV file using a TextFileReader by passing the chunksize= parameter:
sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)
First, we get the initial sample:
sample = sample_reader.get_chunk(sample_size)
Then, we iterate over the remaining chunks of the file, replacing the index of each chunk with a sequence of random integers as long as size of the chunk, but where each integer is in the range of the index of the initial sample (which happens to be the same as range(sample_size)):
for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))
And use this reindexed chunk to replace (some of the) lines in the sample:
sample.loc[chunk.index] = chunk
After the for loop, you'll have a dataframe at most sample_size rows long, but with random lines selected from the big CSV file.
To make the loop more efficient, you can make batch_size as large as your memory allows (and yes, even larger than sample_size if you can).
Notice that, while creating the new chunk index with np.random.default_rng().integers(), we use len(chunk) as the new chunk index size instead of simply batch_size because the last chunk in the loop could be smaller.
On the other hand, we use sample_size instead of len(sample) as the "range" of the random integers, even though there could be fewer lines in the file than sample_size. This is because there won't be any chunks left to loop over in this case, so that will never be a problem.

Read the data file (this loads the whole file, so it only works when the file fits in memory):
import pandas as pd
df = pd.read_csv('data.csv')
First check the shape of df:
df.shape
Create a small sample of 1000 rows from df:
sample_data = df.sample(n=1000, replace=False)
Check the shape of sample_data:
sample_data.shape

For example, if you have loan.csv, you can use this script to load a given number of random rows.
data = pd.read_csv('loan.csv').sample(10000, random_state=44)

Let's say that you want to load a 20% sample of the dataset:
import pandas as pd
df = pd.read_csv(filepath).sample(frac = 0.20)

Using CSV arrays as an input to Python

I have been presented with a csv file that is full of 100+ arrays that I need to run through my data analysis code but I am not sure how to read these arrays in Python. Each array is preceded with a line that includes only an integer that gives the number of rows in the array and ends with the line '1234567890' to be used as a line separator.
Here is a snippet of the csv file:
7,,,,,,,
1,-199.117,-105.4,-4.525,227.5415,225.2925647,-0.0198891,-2.6547518
2,133.0423,55.4573,-48.4174,155.16,144.1380093,-0.322813,0.3949385
3,129.8405,-16.9527,-303.3192,331.0847,130.9425427,-1.5644458,-0.1298311
4,-73.6373,71.4677,151.517,183.9712,102.616198,1.1678785,2.3711453
5,41.2654,10.4196,30.3773,54.0915,42.5605604,0.6351541,0.2473322
6,-20.3159,-32.4484,62.4574,74.8581,38.2836056,1.2022641,-2.1301853
7,-13.2904,22.029,-28.2895,38.5096,25.7276422,-0.9386666,2.1136489
1234567890,,,,,,,
5,,,,,,,
1,-136.0755,-204.2787,-48.2127,259.2592,245.4512762,-0.1881526,-2.158425
2,220.5184,46.9388,-113.6448,265.1745,225.4586784,-0.4581388,0.2097266
3,-45.3132,169.6283,-49.2729,188.9506,175.576326,-0.2669358,1.8318334
4,-40.7141,34.7414,25.5414,60.9535,53.5219844,0.4465159,2.4351851
5,15.3863,-49.6703,17.1692,56.7635,51.9988166,0.312235,-1.2704018
1234567890,,,,,,,
6,,,,,,,
1,-19.3083,295.4128,191.8666,360.3712,296.0431267,0.5935079,1.6360639
2,-169.8708,-128.3904,-1.0052,215.4187,212.9323449,-0.0046663,-2.4943822
3,15.4505,-209.6656,-178.0715,279.4077,210.2341118,-0.7536439,-1.4972381
4,172.4142,13.0485,-63.7912,192.2842,172.9072576,-0.3447988,0.0755371
5,16.7456,24.8768,-46.5025,55.9188,29.9878358,-1.1933262,0.9783247
6,-8.911,4.1138,12.7751,17.7283,9.8147477,0.9089022,2.7090895
1234567890,,,,,,,
I am certain I could import the array if the csv was just one big array but I am stumped when it comes to picking one array out of many. The data analysis needs to be run on the temporary array before it is replaced with the next array in the csv file.

You could use itertools.groupby to parse the rows into separate arrays:
import csv
import itertools
with open('errors','w') as err: pass
with open('data','r') as f:
    for key, group in itertools.groupby(
            csv.reader(f),
            lambda row: row[0].startswith('1234567890')):
        if key: continue # key is True means we've reached the end of an array
        group=list(group) # group is an iterator; we turn it into a list
        array=group[1:] # everything but the first row is data
        arr_length=int(group[0][0]) # first row contains the length
        if arr_length != len(array): # sanity check
            with open('errors','a') as err:
                err.write('''\
Data file claims arr_length = {l}
{a}
{h}
'''.format(l=arr_length,a=str(list(array)),h='-'*80))
        print(array)
itertools.groupby returns an iterator. It loops through the rows in csv.reader(f), and applies the lambda function to each row. The lambda function returns True when the row starts with '1234567890'. The return value (e.g. True or False) is called the key. The important point is that itertools.groupby collects together all contiguous rows that return the same key.
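To see the grouping in isolation, here is a minimal sketch on a shortened in-memory excerpt in the same layout as the question's file. The excerpt, the helper is_separator, and the conversion to floats are additions for the example; the real analysis step would replace the arrays.append call.

```python
import csv
import itertools
from io import StringIO

# Shortened, hypothetical excerpt in the same layout as the question's file
sample = StringIO(
    "2,,,\n"
    "1,-199.117,-105.4,-4.525\n"
    "2,133.0423,55.4573,-48.4174\n"
    "1234567890,,,\n"
    "1,,,\n"
    "1,-136.0755,-204.2787,-48.2127\n"
    "1234567890,,,\n"
)


def is_separator(row):
    """True for the marker rows that end each array."""
    return row[0] == "1234567890"


arrays = []
for key, group in itertools.groupby(csv.reader(sample), is_separator):
    if key:
        continue  # separator rows carry no data
    rows = list(group)
    declared = int(rows[0][0])  # first row of each block holds the row count
    data = [[float(x) for x in row] for row in rows[1:]]
    if declared != len(data):
        raise ValueError(f"expected {declared} rows, found {len(data)}")
    arrays.append(data)

print(len(arrays), [len(a) for a in arrays])
```

Because groupby yields one block at a time, each array can be analysed and discarded before the next one is read, which matches the "temporary array" requirement in the question.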

This should give you a nicely formatted variable called "data" to work with.
import csv
rows = csv.reader(open('your_file.csv'))
data = []
temp = []
for row in rows:
    if '1234567890' in row:
        data.append(temp)
        temp = []
        continue
    else:
        temp.append(row)
if temp != []:
    data.append(temp)
