I have the following code:
from numpy import dtype
import pandas as pd
import os
import sys
inputFile='data.json'
chunks = pd.read_json(inputFile, lines=True, chunksize = 1000)
original_stdout = sys.stdout
i = 1
for c in chunks:
location = c.location.str.split(',')
for b in range(1000):
print(location[b])
if not type(location[b]) == float:
# get the country name
country = location[b][-1]
else:
country = 'unknown'
I'm extracting the location field from a large file including json objects. Because the file is so large, I've divided it into 1000-line chunks. I cycle through each chunk and retrieve the information I require:
for c in chunks:
location = c.a.str.split(',')
for b in range(1000):
print(location[b])
All goes smoothly during the first iteration. At the second iteration the line:
print(location[b])
gives the error:
ValueError: 0 is not in range
How do I cycle trough the chuncks following the first?
Thank you for your help
The problem is that by doing location[b] you are accessing the location frame by index (i.e., here you are asking for the row with the index value b). The chunks will follow the index correctly, which means the first chunk will have the index starting by 0, the second by 1000, and so on. This means, index 0 will only be contained in the first chunk.
So, instead, you need to iterate the rows without the index:
for row in location:
# Do something.
In fact, probably if you look at the full trace of the error you will also see a KeyError below the ValueError.
To iterate the Series and have the index you can use Series.iteritems():
for idx, row in a.iteritems():
# Do something...
I have a script that removes "bad elements" from a master list of elements, then returns a csv with the updated elements and their associated values.
My question, is whether there is a more efficient way to perform the same operation in the for loop?
Master=pd.read_csv('some.csv', sep=',',header=0,error_bad_lines=False)
MasterList = Master['Elem'].tolist()
MasterListStrain1 = Master['Max_Principal_Strain'].tolist()
#this file should contain elements that are slated for deletion
BadElem=pd.read_csv('delete_me_elements_column.csv', sep=',',header=None, error_bad_lines=False)
BadElemList = BadElem[0].tolist()
NewMasterList = (list(set(MasterList) - set(BadElemList)))
filename = 'NewOutput.csv'
outfile = open(filename,'w')
#pdb.set_trace()
for i,j in enumerate(NewMasterList):
#pdb.set_trace()
Elem_Loc = MasterList.index(j)
line ='\n%s,%.25f'%(j,MasterListStrain1[Elem_Loc])
outfile.write(line)
print ("\n The new output file will be named: " + filename)
outfile.close()
Stage 1
If you necessarily want to iterate in the for loop then besides using pd.to_csv which likely to improve performance you can do the following:
...
SetBadElem = set(BadElemList)
...
for i,Elem_Loc in enumerate(MasterList):
if Elem_Loc not in SetBadElem:
line ='\n%s,%.25f'%(j,MasterListStrain1[Elem_Loc])
outfile.write(line)
Jumping around the index is never efficient whereas iteration with skipping will give you much better performance (checking presence in a set is log n operation so it is relatively quick).
Stage 2 Using Pandas properly
...
SetBadElem = set(BadElemList)
...
for Elem in Master:
if Elem not in SetBadElem:
line ='\n%s,%.25f'%(Elem['elem'], Elem['Max_Principal_Strain'])
outfile.write(line)
There is no need to create lists out of pandas dataframe columns. Using the whole dataframe (and indexing into it) is a much better approach.
Stage 3 Removing messy iterated formatting operations
We can add a column ('Formatted') that will contain formatted data. For that we will create a lambda function:
formatter = lambda row: '\n%s,%.25f'%(row['elem'], row['Max_Principal_Strain'])
Master['Formatted'] = Master.apply(formatter)
Stage 4 Pandas-way filtering and output
We can format the dataframe in two ways. My preference is to reuse the formatting function:
import numpy as np
formatter = lambda row: '\n%s,%.25f'%(row['elem'], row['Max_Principal_Strain']) if row not in SetBadElem else np.nan
Now we can use the built-in dropna which drops all rows that have any NaN values
Master.dropna()
Master.to_csv(filename)
My csv file looks like this:
Test Number,Score
1,100 2,40 3,80 4,90.
I have been trying to figure out how to write a code that ignores the header + first column and focuses on scores because the assignment was to find the averages of the test scores and print out a float(for those particular numbers the output should be 77.5). I've looked online and found pieces that I think would work but I'm getting errors every time. Were learning about read, realines, split, rstrip and \n if that helps! I'm sure the answer is so simple, but I'm new to coding and I have no idea what I'm doing. Thank you!
def calculateTestAverage(fileName):
myFile = open(fileName, "r")
column = myFile.readline().rstrip("\n")
for column in myFile:
scoreColumn = column.split(",")
(scoreColumn[1])
This is my code so far my professor wanted us to define a function and go from there using the stuff we learned in lecture. I'm stuck because it's printing out all the scores I need on separate returned lines, yet I am not able to sum those without getting an error. Thanks for all your help, I don't think I would be able to use any of the suggestions because we never went over them. If anyone has an idea of how to take those test scores that printed out vertically as a column and sum them that would help me a ton!
You can use csv library. This code should do the work:
import csv
reader = csv.reader(open('csvfile.txt','r'), delimiter=' ')
reader.next() # this line lets you skip the header line
for row_number, row in enumerate(reader):
total_score = 0
for element in row:
test_number, score = element.split(',')
total_score += score
average_score = total_score/float(len(row))
print("Average score for row #%d is: %.1f" % (row_number, average_score))
The output should look like this:
Average score for row #1 is: 77.5
I always approach this with a pandas data frame. Specifically the read_csv() function. You don’t need to ignore the header, just state that it is in row 0 (for example) and then also the same with the row labels.
So for example:
import pandas as pd
import numpy as np
df=read_csv(“filename”,header=0,index_col=0)
scores=df.values
print(np.average(scores))
I will break it down for you.
Since you're dealing with .csv files, I recommend using the csv library. You can import it with:
import csv
Now we need to open() the file. One common way is to use with:
with open('test.csv') as file:
Which is a context manager that avoids having to close the file at the end. The other option is to open and close normally:
file = open('test.csv')
# Do your stuff here
file.close()
Now you need to wrap the opened file with csv.reader(), which allows you to read .csv files and do things with them:
csv_reader = csv.reader(file)
To skip the headers, you can use next():
next(csv_reader)
Now for the average calculation part. One simple way is to have two variables, score_sum and total. The aim is to increment the scores and totals to these two variables respectively. Here is an example snippet :
score_sum = 0
total = 0
for number, score in csv_reader:
score_sum += int(score)
total += 1
Here's how to do it with indexing also:
score_sum = 0
total = 0
for line in csv_reader:
score_sum += int(line[1])
total += 1
Now that we have our score and totals calculated, getting the average is simply:
score_sum / total
The above code combined will then result in an average of 77.5.
Off course, this all assumes that your .csv file is actually in this format:
Test Number,Score
1,100
2,40
3,80
4,90
I am processing a CSV file in python thats delimited by a comma (,).
Each column is a sampled parameter, for instance column 0 is time, sampled at once a second, column 1 is altitude sampled at 4 times a second, etc.
So columns will look like as below:
Column 0 -> ["Time", 0, " "," "," ",1]
Column 1 -> ["Altitude", 100, 200, 300, 400]
I am trying to create a list for each column that captures its name and all its data. That way I can do calculations and organize my data into a new file automatically (the sampled data I am working with has substantial number of rows)
I want to do this for any file not just one, so the number of columns can vary.
Normally if every file was consistent I would do something like:
import csv
time =[]
alt = []
dct = {}
with open('test.csv',"r") as csvfile:
csv_f = csv.reader(csvfile)
for row in csv_f:
header.append(row[0])
alt.append(row[1]) #etc for all columns
I am pretty new in python. Is this a good way to tackle this, if not what is better methodology?
Thanks for your time
Pandas will probably work best for you. If you use csv_read from pandas, it will create a DataFrame based on the column. It's roughly a dictionary of lists.
You can also use the .tolist() functionality of pandas to convert it to a list if you want a list specifically.
import pandas as pd
data = pd.read_csv("soqn.csv")
dict_of_lists = {}
for column_name in data.columns:
temp_list = data[column_name].tolist()
dict_of_lists[column_name] = temp_list
print dict_of_lists
EDIT:
dict_of_lists={column_name: data[column_name].tolist() for column_name in data.columns}
#This list comprehension might work faster.
I think I made my problem more simpler and just focused on one column.
What I ultimately wanted to do was to interpolate to the highest sampling rate. So here is what I came up with... Please let me know if I can do anything more efficient. I used A LOT of searching on this site to help build this. Again I am new at Python (about 2-3 weeks but some former programming experience)
import csv
header = []
#initialize variables
loc_int = 0
loc_fin = 0
temp_i = 0
temp_f = 0
with open('test2.csv',"r") as csvfile: # open csv file
csv_f = csv.reader(csvfile)
for row in csv_f:
header.append(row[0]) #make a list that consists of all content in column A
for x in range(0,len(header)-1): #go through entire column
if header[x].isdigit() and header[x+1]=="": # find lower bound of sample to be interpolated
loc_int = x
temp_i = int(header[x])
elif header[x+1].isdigit() and header[x]=="": # find upper bound of sample to be interpolated
loc_fin = x
temp_f = int(header[x+1])
if temp_f>temp_i: #calculate interpolated values
f_min_i = temp_f - temp_i
interp = f_min_i/float((loc_fin+1)-loc_int)
for y in range(loc_int, loc_fin+1):
header[y] = temp_i + interp*(y-loc_int)
print header
with open("output.csv", 'wb') as g: #write to new file
writer = csv.writer(g)
for item in header:
writer.writerow([item])
I couldn't figure out how to write my new list "header" with its interpolated values and replace it with column A of my old file , test2.csv.
Anywho thank you very much for looking...
The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?
Assuming no header in the CSV file:
import pandas
import random
n = 1000000 #number of records in file
s = 10000 #desired sample size
filename = "data.txt"
skip = sorted(random.sample(range(n),n-s))
df = pandas.read_csv(filename, skiprows=skip)
would be better if read_csv had a keeprows, or if skiprows took a callback func instead of a list.
With header and unknown file length:
import pandas
import random
filename = "data.txt"
n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
s = 10000 #desired sample size
skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)
#dlm's answer is great but since v0.20.0, skiprows does accept a callable. The callable receives as an argument the row number.
Note also that their answer for unknown file length relies on iterating through the file twice -- once to get the length, and then another time to read the csv. I have three solutions here which only rely on iterating through the file once, though they all have tradeoffs.
Solution 1: Approximate Percentage
If you can specify what percent of lines you want, rather than how many lines, you don't even need to get the file size and you just need to read through the file once. Assuming a header on the first row:
import pandas as pd
import random
p = 0.01 # 1% of the lines
# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df = pd.read_csv(
filename,
header=0,
skiprows=lambda i: i>0 and random.random() > p
)
As pointed out in the comments, this only gives approximately the right number of lines, but I think it satisfies the desired usecase.
Solution 2: Every Nth line
This isn't actually a random sample, but depending on how your input is sorted and what you're trying to achieve, this may meet your needs.
n = 100 # every 100th line = 1% of the lines
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
Solution 3: Reservoir Sampling
(Added July 2021)
Reservoir sampling is an elegant algorithm for selecting k items randomly from a stream whose length is unknown, but that you only see once.
The big advantage is that you can use this without having the full dataset on disk, and that it gives you an exactly-sized sample without knowing the full dataset size. The disadvantage is that I don't see a way to implement it in pure pandas, I think you need to drop into python to read the file and then construct the dataframe afterwards. So you may lose some functionality from read_csv or need to reimplement it, since we're not using pandas to actually read the file.
Taking an implementation of the algorithm from Oscar Benjamin here:
from math import exp, log, floor
from random import random, randrange
from itertools import islice
from io import StringIO
def reservoir_sample(iterable, k=1):
"""Select k items uniformly from iterable.
Returns the whole population if there are k or fewer items
from https://bugs.python.org/issue41311#msg373733
"""
iterator = iter(iterable)
values = list(islice(iterator, k))
W = exp(log(random())/k)
while True:
# skip is geometrically distributed
skip = floor( log(random())/log(1-W) )
selection = list(islice(iterator, skip, skip+1))
if selection:
values[randrange(k)] = selection[0]
W *= exp(log(random())/k)
else:
return values
def sample_file(filepath, k):
with open(filepath, 'r') as f:
header = next(f)
result = [header] + sample_iter(f, k)
df = pd.read_csv(StringIO(''.join(result)))
The reservoir_sample function returns a list of strings, each of which is a single row, so we just need to turn it into a dataframe at the end. This assumes there is exactly one header row, I haven't thought about how to extend it to other situations.
I tested this locally and it is much faster than the other two solutions. Using a 550 MB csv (January 2020 "Yellow Taxi Trip Records" from the NYC TLC), solution 3 runs in about 1 second, while the other two take ~3-4 seconds.
In my test this is even slightly (~10-20%) faster than #Bar's answer using shuf, which surprises me.
This is not in Pandas, but it achieves the same result much faster through bash, while not reading the entire file into memory:
shuf -n 100000 data/original.tsv > data/sample.tsv
The shuf command will shuffle the input and the and the -n argument indicates how many lines we want in the output.
Relevant question: https://unix.stackexchange.com/q/108581
Benchmark on a 7M lines csv available here (2008):
Top answer:
def pd_read():
filename = "2008.csv"
n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
s = 100000 #desired sample size
skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)
df.to_csv("temp.csv")
Timing for pandas:
%time pd_read()
CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s
Wall time: 18.9 s
While using shuf:
time shuf -n 100000 2008.csv > temp.csv
real 0m1.583s
user 0m1.445s
sys 0m0.136s
So shuf is about 12x faster and importantly does not read the whole file into memory.
Here is an algorithm that doesn't require counting the number of lines in the file beforehand, so you only need to read the file once.
Say you want m samples. First, the algorithm keeps the first m samples. When it sees the i-th sample (i > m), with probability m/i, the algorithm uses the sample to randomly replace an already selected sample.
By doing so, for any i > m, we always have a subset of m samples randomly selected from the first i samples.
See code below:
import random
n_samples = 10
samples = []
for i, line in enumerate(f):
if i < n_samples:
samples.append(line)
elif random.random() < n_samples * 1. / (i+1):
samples[random.randint(0, n_samples-1)] = line
The following code reads first the header, and then a random sample on the other lines:
import pandas as pd
import numpy as np
filename = 'hugedatafile.csv'
nlinesfile = 10000000
nlinesrandomsample = 10000
lines2skip = np.random.choice(np.arange(1,nlinesfile+1), (nlinesfile-nlinesrandomsample), replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)
class magic_checker:
def __init__(self,target_count):
self.target = target_count
self.count = 0
def __eq__(self,x):
self.count += 1
return self.count >= self.target
min_target=100000
max_target = min_target*2
nlines = randint(100,1000)
seek_target = randint(min_target,max_target)
with open("big.csv") as f:
f.seek(seek_target)
f.readline() #discard this line
rand_lines = list(iter(lambda:f.readline(),magic_checker(nlines)))
#do something to process the lines you got returned .. perhaps just a split
print rand_lines
print rand_lines[0].split(",")
something like that should work I think
No pandas!
import random
from os import fstat
from sys import exit
f = open('/usr/share/dict/words')
# Number of lines to be read
lines_to_read = 100
# Minimum and maximum bytes that will be randomly skipped
min_bytes_to_skip = 10000
max_bytes_to_skip = 1000000
def is_EOF():
return f.tell() >= fstat(f.fileno()).st_size
# To accumulate the read lines
sampled_lines = []
for n in xrange(lines_to_read):
bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
f.seek(bytes_to_skip, 1)
# After skipping "bytes_to_skip" bytes, we can stop in the middle of a line
# Skip current entire line
f.readline()
if not is_EOF():
sampled_lines.append(f.readline())
else:
# Go to the begginig of the file ...
f.seek(0, 0)
# ... and skip lines again
f.seek(bytes_to_skip, 1)
# If it has reached the EOF again
if is_EOF():
print "You have skipped more lines than your file has"
print "Reduce the values of:"
print " min_bytes_to_skip"
print " max_bytes_to_skip"
exit(1)
else:
f.readline()
sampled_lines.append(f.readline())
print sampled_lines
You'll end up with a sampled_lines list. What kind of statistics do you mean?
use subsample
pip install subsample
subsample -n 1000 file.csv > file_1000_sample.csv
You can also create a sample with the 10000 records before bringing it into the Python environment.
Using Git Bash (Windows 10) I just ran the following command to produce the sample
shuf -n 10000 BIGFILE.csv > SAMPLEFILE.csv
To note: If your CSV has headers this is not the best solution.
TL;DR
If you know the size of the sample you want, but not the size of the input file, you can efficiently load a random sample out of it with the following pandas code:
import pandas as pd
import numpy as np
filename = "data.csv"
sample_size = 10000
batch_size = 200
rng = np.random.default_rng()
sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)
sample = sample_reader.get_chunk(sample_size)
for chunk in sample_reader:
chunk.index = rng.integers(sample_size, size=len(chunk))
sample.loc[chunk.index] = chunk
Explanation
It's not always trivial to know the size of the input CSV file.
If there are embedded line breaks, tools like wc or shuf will give you the wrong answer or just make a mess out of your data.
So, based on desktable's answer, we can treat the first sample_size lines of the file as the initial sample and then, for each subsequent line in the file, randomly replace a line in the initial sample.
To do that efficiently, we load the CSV file using a TextFileReader by passing the chunksize= parameter:
sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)
First, we get the initial sample:
sample = sample_reader.get_chunk(sample_size)
Then, we iterate over the remaining chunks of the file, replacing the index of each chunk with a sequence of random integers as long as size of the chunk, but where each integer is in the range of the index of the initial sample (which happens to be the same as range(sample_size)):
for chunk in sample_reader:
chunk.index = rng.integers(sample_size, size=len(chunk))
And use this reindexed chunk to replace (some of the) lines in the sample:
sample.loc[chunk.index] = chunk
After the for loop, you'll have a dataframe at most sample_size rows long, but with random lines selected from the big CSV file.
To make the loop more efficient, you can make batch_size as large as your memory allows (and yes, even larger than sample_size if you can).
Notice that, while creating the new chunk index with np.random.default_rng().integers(), we use len(chunk) as the new chunk index size instead of simply batch_size because the last chunk in the loop could be smaller.
On the other hand, we use sample_size instead of len(sample) as the "range" of the random integers, even though there could be less lines in the file than sample_size. This is because there won't be any chunks left to loop over in this case so that will never be a problem.
read the data file
import pandas as pd
df = pd.read_csv('data.csv', 'r')
First check the shape of df
df.shape()
create the small sample of 1000 raws from df
sample_data = df.sample(n=1000, replace='False')
#check the shape of sample_data
sample_data.shape()
For example, you have the loan.csv, you can use this script to easily load the specified number of random items.
data = pd.read_csv('loan.csv').sample(10000, random_state=44)
Let's say that you want to load a 20% sample of the dataset:
import pandas as pd
df = pd.read_csv(filepath).sample(frac = 0.20)