Write 5M data values to a CSV file using file write operations - Python

I'm writing some data into a single column of a CSV file using a file write operation, but I am only able to write values into 1048576 rows. I have 5 million integer values and I want them all saved in a single CSV file. Below is my code:
with open(path, 'w') as fp:
    for i in range(0, len(values)):
        fp.write(values[i] + '\n')
fp.close()
Is it possible to continue writing values after 1048576 rows into the 3rd/4th column of the CSV file? Or is it possible to write the values sequentially so that I can have all of them in a single file?

You can use itertools.izip_longest (Python 2) to "chunk" the values into "columns", then use the csv module to write those rows to the file, e.g.:
import csv
from itertools import izip_longest

N = 5  # adapt as needed
values = range(1, 23)  # use real values here

with open(path, 'wb') as fout:
    csvout = csv.writer(fout)
    rows = izip_longest(*[iter(values)] * N, fillvalue='')
    csvout.writerows(rows)
This will give you the following output:
1,2,3,4,5
6,7,8,9,10
11,12,13,14,15
16,17,18,19,20
21,22,,,
You can also "transpose" the data so that it runs the other way round, e.g.:
import csv
from itertools import izip_longest, izip

N = 5  # adapt as needed
values = range(1, 23)  # use real values here

with open(path, 'wb') as fout:
    csvout = csv.writer(fout)
    rows = izip_longest(*[iter(values)] * N, fillvalue='')
    transposed = izip(*rows)
    csvout.writerows(transposed)
This will give you:
1,6,11,16,21
2,7,12,17,22
3,8,13,18,
4,9,14,19,
5,10,15,20,
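On Python 3, izip_longest and izip are spelled itertools.zip_longest and the built-in zip, and the file should be opened in text mode with newline='' rather than 'wb'. A minimal sketch of the same chunking idea (the output path is just a placeholder):

import csv
from itertools import zip_longest

N = 5                  # columns per row, adapt as needed
values = range(1, 23)  # use real values here

with open('output.csv', 'w', newline='') as fout:
    csvout = csv.writer(fout)
    # group the flat iterable into rows of N items, padding the last row with ''
    rows = zip_longest(*[iter(values)] * N, fillvalue='')
    csvout.writerows(rows)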

As an alternative, you can use islice to give you the required number of columns per row as follows:
from itertools import islice
import csv

path = 'output.txt'
values = range(105)  # create sample 'values' data
columns = 10
ivalues = iter(values)

with open(path, 'wb') as fp:
    csv_output = csv.writer(fp)
    for row in iter(lambda: list(islice(ivalues, columns)), []):
        csv_output.writerow(row)
Giving you the following:
0,1,2,3,4,5,6,7,8,9
10,11,12,13,14,15,16,17,18,19
20,21,22,23,24,25,26,27,28,29
30,31,32,33,34,35,36,37,38,39
40,41,42,43,44,45,46,47,48,49
50,51,52,53,54,55,56,57,58,59
60,61,62,63,64,65,66,67,68,69
70,71,72,73,74,75,76,77,78,79
80,81,82,83,84,85,86,87,88,89
90,91,92,93,94,95,96,97,98,99
100,101,102,103,104
Note that in your example you should use xrange instead of range, to avoid Python creating a huge list of numbers to iterate over.
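On Python 3 there is no xrange (range is already lazy), and the file should be opened in text mode with newline='' instead of 'wb'. A rough Python 3 equivalent of the islice approach above:

from itertools import islice
import csv

path = 'output.txt'
values = range(105)  # create sample 'values' data
columns = 10
ivalues = iter(values)

with open(path, 'w', newline='') as fp:
    csv_output = csv.writer(fp)
    # pull 'columns' items at a time until the iterator is exhausted
    for row in iter(lambda: list(islice(ivalues, columns)), []):
        csv_output.writerow(row)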

Related

How to get specific columns in a certain range from a csv file without using pandas

For some reason the pandas module does not work for me, so I have to find another way to read a (large) CSV file and output specific columns within a certain range (e.g. the first 1000 lines). I have code that reads the entire CSV file, but I haven't found a way to display just specific columns.
Any help is much appreciated!
import csv

fileObj = open('apartment-data-all-4-xaver.2018.csv')
csvReader = csv.reader(fileObj)
for row in csvReader:
    print row
fileObj.close()
I created a small csv file with the following contents:
first,second,third
11,12,13
21,22,23
31,32,33
41,42,43
You can use the following helper function, which uses namedtuple from the collections module and yields objects that allow you to access your columns like attributes:
import csv
from collections import namedtuple

def get_first_n_lines(file_name, n):
    with open(file_name) as file_obj:
        csv_reader = csv.reader(file_obj)
        header = next(csv_reader)
        Tuple = namedtuple('Tuple', header)
        for i, row in enumerate(csv_reader, start=1):
            yield Tuple(*row)
            if i >= n:
                break
If you want to print the first and third columns for n=3 lines, you use the function like this (Python 3.6+):
for line in get_first_n_lines(file_name='csv_file.csv', n=3):
    print(f'{line.first}, {line.third}')
Or like this (Python 3.0 - 3.5):
for line in get_first_n_lines(file_name='csv_file.csv', n=3):
    print('{}, {}'.format(line.first, line.third))
Outputs:
11, 13
21, 23
31, 33
Use csv.DictReader and then filter out the specific rows and columns:
import csv

data = []
with open('names.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        data.append(row)

colnames = ['col1', 'col2']
for i in range(1000):
    print(data[i][colnames[0]], data[i][colnames[1]])
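If the file is large, building the whole data list first is wasteful. A minimal sketch combining DictReader with itertools.islice to stream only the first 1000 rows (the column names are placeholders):

import csv
from itertools import islice

with open('names.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    # islice stops after 1000 rows without reading the rest of the file
    for row in islice(reader, 1000):
        print(row['col1'], row['col2'])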

Create single CSV from two CSV files with X and Y values in python

I have two CSV files: one containing X (longitude) values and the other Y (latitude) values (they are of 'float' data type).
I am trying to create a single CSV with all possible combinations (e.g. X1,Y1; X1,Y2; X1,Y3; X2,Y1; X2,Y2; X2,Y3; etc.).
I have written the following, which partly works. However, the CSV file created has blank lines in between the values, and I also get the values stored with their list brackets, like ['20.7599'] ['135.9028']. What I need is 20.7599, 135.9028.
import csv

inLatCSV = r"C:\data\Lat.csv"
inLongCSV = r"C:\data\Long.csv"
outCSV = r"C:\data\LatLong.csv"

with open(inLatCSV, 'r') as f:
    reader = csv.reader(f)
    list_Lat = list(reader)

with open(inLongCSV, 'r') as f:
    reader = csv.reader(f)
    list_Long = list(reader)

with open(outCSV, 'w') as myfile:
    for y in list_Lat:
        for x in list_Long:
            combVal = (y, x)
            #print(combVal)
            wr = csv.writer(myfile)
            wr.writerow(combVal)
Adding an argument to the open function made the difference:
with open(my_csv, 'w', newline="") as myfile:
    combinations = [[y, x] for y in list_Lat for x in list_Long]
    wr = csv.writer(myfile)
    wr.writerows(combinations)
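Note that csv.reader returns each row as a list, which is why the values show up wrapped in brackets like ['20.7599']. If each input file holds one value per row, one possible way to get plain 20.7599,135.9028 pairs is to unpack the first element of each row (a sketch reusing list_Lat, list_Long and outCSV from the question):

import csv

with open(outCSV, 'w', newline='') as myfile:
    wr = csv.writer(myfile)
    # y and x are one-element lists such as ['20.7599'], so take y[0] and x[0]
    combinations = [[y[0], x[0]] for y in list_Lat for x in list_Long]
    wr.writerows(combinations)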
Any time you're doing something with CSV files, pandas is a great tool:
import pandas as pd

lats = pd.read_csv(r"C:\data\Lat.csv", header=None)
lons = pd.read_csv(r"C:\data\Long.csv", header=None)

# add a constant key to both frames and merge on it to get the cartesian product
lats['_tmp'] = 1
lons['_tmp'] = 1
df = pd.merge(lats, lons, on='_tmp').drop('_tmp', axis=1)
df.to_csv(r"C:\data\LatLong.csv", header=False, index=False)
We create a dataframe for each file, and merge them on a temporary column, which produces the cartesian product. https://pandas.pydata.org/pandas-docs/version/0.20/merging.html
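Newer pandas versions (1.2+) also support how='cross' directly, which avoids the temporary key column. A sketch of that variant, assuming the same file layout:

import pandas as pd

lats = pd.read_csv(r"C:\data\Lat.csv", header=None)
lons = pd.read_csv(r"C:\data\Long.csv", header=None)

# cross merge produces the cartesian product of the two frames (pandas >= 1.2)
df = lats.merge(lons, how='cross')
df.to_csv(r"C:\data\LatLong.csv", header=False, index=False)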

Replacing content of column 'x' from file A with column 'x' in a very large file B

I have two files: "A", which is not too large (2GB), and "B", which is rather large at 60GB. I have some primitive code as follows:
import csv  # imports module csv

filea = "A.csv"
fileb = "B.csv"
output = "Python_modified.csv"

# open csv readers
source1 = csv.reader(open(filea, "r"), delimiter='\t')
source2 = csv.reader(open(fileb, "r"), delimiter='\t')

# prepare changes from file B
source2_dict = {}
for row in source2:
    source2_dict[row[2]] = row[2]

# write new changed rows
with open(output, "w") as fout:
    csvwriter = csv.writer(fout, delimiter='\t')
    for row in source1:
        # needs to check whether there are any changes prepared
        if row[3] in source2_dict:
            # change the item
            row[3] = source2_dict[row[3]]
        csvwriter.writerow(row)
This should read through column 3 in both files and replace column 4 in file A with the contents of column 4 in file B if there's a match. However, since it's reading in the large files, it's very slow. Is there any way to optimize this?
You could try reading file_a into memory in large blocks and then processing each block. That way you are doing groups of reads followed by groups of writes, which should help to reduce disk thrashing. You will need to decide which block_size to use; obviously it should be something that fits comfortably in memory.
from itertools import islice
import csv  # imports module csv

file_a = "A.csv"
file_b = "B.csv"
output = "Python_modified.csv"
block_size = 10000

# prepare changes from file B
source2_dict = {}
with open(file_b, 'rb') as f_source2:
    for row in csv.reader(f_source2, delimiter='\t'):
        source2_dict[row[3]] = row[4]  # just store the replacement value

# write new changed rows
with open(file_a, 'rb') as f_source1, open(output, "wb") as f_output:
    csv_source1 = csv.reader(f_source1, delimiter='\t')
    csv_output = csv.writer(f_output, delimiter='\t')

    # read input file_a in large groups
    for block in iter(lambda: list(islice(csv_source1, block_size)), []):
        for row in block:
            try:
                row[4] = source2_dict[row[3]]
            except KeyError:
                pass
            csv_output.writerow(row)
Secondly, to reduce memory usage, if you are just replacing one value, then just store that one value in your dictionary.
Tested using Python 2.x. If you are using Python 3.x, you will need to change your file opens, e.g.:
with open(file_b, 'r', newline='') as f_source2:
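For reference, a rough Python 3 sketch of the same blocked approach, opening the files in text mode with newline='' (same column indices as above):

from itertools import islice
import csv

file_a = "A.csv"
file_b = "B.csv"
output = "Python_modified.csv"
block_size = 10000

# prepare changes from file B: map the key column to its replacement value
source2_dict = {}
with open(file_b, newline='') as f_source2:
    for row in csv.reader(f_source2, delimiter='\t'):
        source2_dict[row[3]] = row[4]

with open(file_a, newline='') as f_source1, open(output, 'w', newline='') as f_output:
    csv_source1 = csv.reader(f_source1, delimiter='\t')
    csv_output = csv.writer(f_output, delimiter='\t')
    # read file_a in large blocks, replacing values where a match exists
    for block in iter(lambda: list(islice(csv_source1, block_size)), []):
        for row in block:
            row[4] = source2_dict.get(row[3], row[4])
            csv_output.writerow(row)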

Python - merging of csv files with one axis in common

I need to merge two CSV files, A.csv and B.csv, with one axis in common. An extract:
9.358,3.0
9.388,2.0
and
8.551,2.0
8.638,2.0
I want the final file C.csv to have the following pattern:
8.551,0.0,2.0
8.638,0.0,2.0
9.358,3.0,0.0
9.388,2.0,0.0
How do you suggest doing it? Should I go for a for loop?
Just read from each file, writing out to the output file and adding in the 'missing' column:
import csv

with open('c.csv', 'wb') as outcsv:
    # Python 3: use open('c.csv', 'w', newline='') instead
    writer = csv.writer(outcsv)

    # copy a.csv across, adding a 3rd column
    with open('a.csv', 'rb') as incsv:
        # Python 3: use open('a.csv', newline='') instead
        reader = csv.reader(incsv)
        writer.writerows(row + [0.0] for row in reader)

    # copy b.csv across, inserting a 2nd column
    with open('b.csv', 'rb') as incsv:
        # Python 3: use open('b.csv', newline='') instead
        reader = csv.reader(incsv)
        writer.writerows(row[:1] + [0.0] + row[1:] for row in reader)
The writer.writerows() lines do all the work; a generator expression loops over the rows in each reader, either appending a column or inserting a column in the middle.
This works with whatever size of input CSVs you have, as only some read and write buffers are held in memory. Rows are processed in iterative fashion without ever needing to hold all of the input or output files in memory.
import numpy as np

# load both files as 2-column float arrays
dat1 = np.genfromtxt('dat1.txt', delimiter=',')
dat2 = np.genfromtxt('dat2.txt', delimiter=',')

# insert a zero column: at position 2 for dat1, at position 1 for dat2
dat1 = np.insert(dat1, 2, 0, axis=1)
dat2 = np.insert(dat2, 1, 0, axis=1)

# stack the two arrays vertically and write the combined result
dat = np.vstack((dat1, dat2))
np.savetxt('dat.txt', dat, delimiter=',', fmt='%.3f')
Here's a simple solution using a dictionary, which will work for any number of files:
from __future__ import print_function

def process(*filenames):
    lines = {}
    index = 0
    for filename in filenames:
        with open(filename, 'rU') as f:
            for line in f:
                v1, v2 = line.rstrip('\n').split(',')
                lines.setdefault(v1, {})[index] = v2
        index += 1
    for line in sorted(lines):
        print(line, end=',')
        for i in range(index):
            print(lines[line].get(i, 0.0), end=',' if i < index - 1 else '\n')
process('A.csv','B.csv')
prints
8.551,0.0,2.0
8.638,0.0,2.0
9.358,3.0,0.0
9.388,2.0,0.0

Python- Import Multiple Files to a single .csv file

I have 125 data files containing two columns and 21 rows of data and I'd like to import them into a single .csv file (as 125 pairs of columns and only 21 rows).
I am fairly new to Python, but I have come up with the following code:
import glob

Results = glob.glob('./*.data')
fout = 'c:/Results/res.csv'
fout = open("res.csv", 'w')
for file in Results:
    g = open(file, "r")
    fout.write(g.read())
    g.close()
fout.close()
The problem with the above code is that all the data are copied into only two columns with 125*21 rows.
Any help is very much appreciated!
This should work:
import glob
files = [open(f) for f in glob.glob('./*.data')] #Make list of open files
fout = open("res.csv", 'w')
for row in range(21):
for f in files:
fout.write( f.readline().strip() ) # strip removes trailing newline
fout.write(',')
fout.write('\n')
fout.close()
Note that this method will probably fail if you try a very large number of files; I believe the default limit on simultaneously open files is around 256 on some systems.
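If the .data files are themselves comma-separated, a similar side-by-side merge can be sketched (Python 3) with csv and zip, which reads one row from each file per output row and avoids the manual comma handling:

import csv
import glob

files = [open(f, newline='') for f in glob.glob('./*.data')]
readers = [csv.reader(f) for f in files]

with open('res.csv', 'w', newline='') as fout:
    writer = csv.writer(fout)
    # each iteration of zip yields one row from every reader; the rows are
    # concatenated side by side into a single output row
    for rows in zip(*readers):
        writer.writerow([field for row in rows for field in row])

for f in files:
    f.close()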
You may want to try the Python csv module (http://docs.python.org/library/csv.html), which provides very useful methods for reading and writing CSV files. Since you stated that you want only 21 rows with 250 columns of data, I would suggest creating 21 Python lists as your rows and then appending data to each row as you loop through your files.
Something like:
import csv

rows = []
for i in range(0, 21):
    row = []
    rows.append(row)

# Not sure of the structure of your input files or how they are delimited,
# but for each one, as you have it open and iterate through its rows, you
# would want to append the values in each row to the end of the corresponding
# list contained within the rows list.

# Then, write each row to the new csv:
writer = csv.writer(open('output.csv', 'wb'), delimiter=',')
for row in rows:
    writer.writerow(row)
(Sorry, I cannot add comments yet.)
[Edited later: the following statement is wrong!] "davesnitty's loop for generating the rows can be replaced by rows = [[]] * 21." It is wrong because it would create a list whose 21 elements are all the same single shared empty list, not 21 independent empty lists.
My +1 to using the standard csv module. But the files should always be closed, especially when you open that many of them. Also, the solution above is incomplete: it only shows writing the result, while the part that fills the rows is missing. Basically, each row read from a file should be appended to the sublist for its line number, and the line number should be obtained via enumerate(reader), where reader is csv.reader(fin, ...).
[Added later] Try the following code; fix the paths for your purpose:
import csv
import glob
import os

datapath = './data'
resultpath = './result'
if not os.path.isdir(resultpath):
    os.makedirs(resultpath)

# Initialize the empty rows. It does not check how many rows are
# in the file.
rows = []

# Read data from the files into the above matrix.
for fname in glob.glob(os.path.join(datapath, '*.data')):
    with open(fname, 'rb') as f:
        reader = csv.reader(f)
        for n, row in enumerate(reader):
            if len(rows) < n + 1:
                rows.append([])   # add another row
            rows[n].extend(row)   # append the elements from the file

# Write the data from memory to the result file.
fname = os.path.join(resultpath, 'result.csv')
with open(fname, 'wb') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
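If pandas is available, the same side-by-side layout can also be built with a column-wise concat. A rough sketch, assuming comma-separated .data files with no header row and the same directory layout as above:

import glob
import os
import pandas as pd

frames = [pd.read_csv(fname, header=None)
          for fname in sorted(glob.glob('./data/*.data'))]

# axis=1 places each file's two columns next to the previous file's columns
result = pd.concat(frames, axis=1)
result.to_csv(os.path.join('./result', 'result.csv'), header=False, index=False)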
