I have a CSV file with, let's say, 16000 rows. I need to split it up in two separate files, but also need an overlap in the files of about 360 rows, so row 1-8360 in one file and row 8000-16000 in the other. Or 1-8000 and 7640-16000.
The CSV file looks like this:
Value X Y Z
4.5234 -46.29753186 -440.4915915 -6291.285393
4.5261 -30.89639381 -441.8390165 -6291.285393
4.5289 -15.45761327 -442.6481287 -6291.285393
4.5318 0 -442.9179423 -6291.285393
I have used this code in Python 3 to split the file, but I'm unable to get the overlap I want:
with open('myfile.csv', 'r') as f:
    csvfile = f.readlines()

linesPerFile = 8000
filename = 1
for i in range(0, len(csvfile), linesPerFile):
    with open(str(filename) + '.csv', 'w+') as f:
        if filename > 1:  # this is the second or later file, we need to write the
            f.write(csvfile[0])  # header again if 2nd.... file
        f.writelines(csvfile[i:i+linesPerFile])
    filename += 1
And tried to modify it like this:
for i in range(0,len(csvfile),linesPerFile+360):
and
f.writelines(csvfile[360-i:i+linesPerFile])
but I haven't been able to make it work.
It's very easy with pandas read_csv and iloc.
import pandas as pd
import numpy as np

# df = pd.read_csv('source_file.csv')
df = pd.DataFrame(data=np.random.randn(16000, 5))  # 16000 rows of dummy data
df.iloc[:8360].to_csv('file_1.csv')
df.iloc[8000:].to_csv('file_2.csv')
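If you are reading the real file instead of generating random data, a minimal sketch could look like the following. It assumes the source is whitespace-separated, as the sample above suggests, and uses index=False so pandas doesn't add an extra index column to the outputs:
import pandas as pd

# Assumption: myfile.csv is whitespace-separated, as in the sample data above.
df = pd.read_csv('myfile.csv', sep=r'\s+')

# First file: rows 1-8360; second file: rows 8001-16000 (roughly a 360-row overlap).
df.iloc[:8360].to_csv('file_1.csv', index=False)
df.iloc[8000:].to_csv('file_2.csv', index=False)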
Hopefully you have got a more elegant answer using pandas. You could consider the approach below if you don't want to install extra modules.
def write_files(input_file, file1, file2, file1_end_line_no, file2_end_line_no):
    # Open all 3 file handles
    with open(input_file) as csv_in, open(file1, 'w') as ff, open(file2, 'w') as sf:
        # Process headers
        header = next(csv_in)
        header = ','.join(header.split())
        ff.write(header + '\n')
        sf.write(header + '\n')
        for index, line in enumerate(csv_in):
            line_content = ','.join(line.split())  # 4.5234 -46.29753186 -440.4915915 -6291.285393 => 4.5234,-46.29753186,-440.4915915,-6291.285393
            if index <= file1_end_line_no:  # write to the first file up to and including its last index
                ff.write(line_content + '\n')
            if index >= file2_end_line_no:  # write to the second file from its starting index onwards
                sf.write(line_content + '\n')
Sample Run:
if __name__ == '__main__':
    in_file = 'csvfile.csv'
    write_files(
        in_file,
        '1.txt',
        '2.txt',
        2,
        2
    )
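For the 16,000-row file from the first question, the call would presumably look something like this. The two numbers are hypothetical 0-based indexes over data rows (header excluded), derived from the 1-8360 / 8000-16000 split described above:
# Rows 1-8360 go to file_1.csv (indexes 0..8359),
# rows 8000-16000 go to file_2.csv (indexes 7999 and up).
write_files('myfile.csv', 'file_1.csv', 'file_2.csv', 8359, 7999)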
What about this?
for i in range(0, len(csvfile), linesPerFile):
    init = i
    with open(str(filename) + '.csv', 'w+') as f:
        if filename > 1:  # this is the second or later file, we need to write the
            f.write(csvfile[0])  # header again if 2nd.... file
            init = i - 360
        f.writelines(csvfile[init:i+linesPerFile+1])
    filename += 1
Is this what you are looking for? Please upload a test file if it doesn't so we can provide a better answer :-)
so here's my problem:
I have two CSV files with each files having around 500 000 lines.
File 1 looks like this:
ID|NAME|OTHER INFO
353253453|LAURENT|STUFF 1
563636345|MARK|OTHERS
786970908|GEORGES|THINGS
File 2 looks like this:
LOCATION;ID_PERSON;PHONE
CA;786970908;555555
NY;353253453;555666
So what I have to do is look for the lines where the IDs match, and append the line from file 2 to the end of the corresponding line from file 1 in a new file; if there's no corresponding ID, add empty columns, like this:
ID;NAME;OTHER INFO;LOCATION;ID_PERSON;PHONE
353253453;LAURENT;STUFF 1;NY;353253453;555666
563636345;MARK;OTHERS;;;
786970908;GEORGES;THINGS;CA;786970908;555555
File 1 is the primary one if that makes sense.
The thing is I have found a solution, but it takes way too long, since for each line of file 1 I loop through file 2.
Here's my code:
input1 = open(filename1, 'r', errors='ignore')
input2 = open(filename2, 'r', errors='ignore')
output = open('result.csv', 'w', newline='')

file2 = input2.readlines()  # read file 2 into a list so it can be scanned repeatedly

for line1 in input1:
    line_splitted = line1.split("|")
    id_1 = line_splitted[0]
    index = 0
    find = False
    for line2 in file2:
        file2_splitted = line2.split(";")
        if id_1 in file2_splitted[1]:
            output.write((";").join(line1.split("|")) + line2)
            find = True
            file2.remove(line2)
            break
        index += 1
    if index == len(file2) and find == False:  # no match found: write the empty columns
        output.write((";").join(line1.split("|")))
        for j in range(nbr_col_2):  # nbr_col_2: number of columns in file 2, defined earlier
            output.write(";")
        output.write("\n")
So I was wondering if there is a faster way to do that, or if I just have to be patient, because right now after 20 minutes, only 20000 lines have been written...
First read file2 line by line in order to build the lookup dict.
Then read file1 line by line, lookup in the dict for the ID key, build the output line and write to the output file.
Should be quite efficient with complexity = O(n).
Only the dict consumes a little bit of memory.
All the file processing is "stream-based".
with open("file2.txt") as f:
lookup = {}
f.readline() # skip header
while True:
line = f.readline().rstrip()
if not line:
break
fields = line.split(";")
lookup[fields[1]] = line
with open("file1.txt") as f, open("output.txt", "w") as out:
f.readline() # skip header
out.write("ID;NAME;OTHER INFO;LOCATION;ID_PERSON;PHONE\n")
while True:
line_in = f.readline().rstrip()
if not line_in:
break
fields = line_in.split("|")
line_out = ";".join(fields)
if found := lookup.get(fields[0]):
line_out += ";" + found
else:
line_out += ";;;"
out.write(line_out + "\n")
As Alex pointed out in his comment, you can merge both files using pandas.
import pandas as pd
# Load files
file_1 = pd.read_csv("file_1.csv", index_col=0, delimiter="|")
file_2 = pd.read_csv("file_2.csv", index_col=1, delimiter=";")
# Rename PERSON_ID as ID
file_2.index.name = "ID"
# Merge files
file_3 = file_1.merge(file_2, how="left", on="ID")
file_3.to_csv("file_3.csv")
Using your examples file_3.csv looks like this:
ID,NAME,OTHER INFO,LOCATION,PHONE
353253453,LAURENT,STUFF 1,NY,555666.0
563636345,MARK,OTHERS,,
786970908,GEORGES,THINGS,CA,555555.0
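If the expected output in the question matters exactly (with the ID_PERSON column repeated), a small variation of the same idea, sketched below with the same files, keeps file_2's ID_PERSON as a regular column by joining on the two differently named ID columns instead of the index:
import pandas as pd

# Delimiters match the samples in the question.
file_1 = pd.read_csv("file_1.csv", delimiter="|")
file_2 = pd.read_csv("file_2.csv", delimiter=";")

# Left join on the two differently named ID columns; rows without a match
# simply get empty cells for LOCATION, ID_PERSON and PHONE in the output.
file_3 = file_1.merge(file_2, how="left", left_on="ID", right_on="ID_PERSON")
file_3.to_csv("file_3.csv", index=False, sep=";")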
Extra
By the way, if you are not familiar with pandas, this is a great introductory course: Learn Pandas Tutorials
You can create indexing to prevent iterating over file2 each time.
Do this by creating a dictionary from file2 and retrieving each related item of it by calling its index.
file1 = open(filename1, 'r', errors='ignore')
file2 = open(filename2, 'r', errors='ignore')
output = open('result.csv', 'w', newline='')

indexed_data = {}
for line2 in file2.readlines()[1:]:
    data2 = line2.rstrip('\n').split(";")
    indexed_data[data2[1]] = {
        'Location': data2[0],
        'Phone': data2[2],
    }

output.write('ID;NAME;OTHER INFO;LOCATION;ID_PERSON;PHONE\n')
for line1 in file1.readlines()[1:]:
    data1 = line1.rstrip('\n').split("|")
    if data1[0] in indexed_data:
        output.write(f'{data1[0]};{data1[1]};{data1[2]};{indexed_data[data1[0]]["Location"]};{data1[0]};{indexed_data[data1[0]]["Phone"]}\n')
    else:
        output.write(f'{data1[0]};{data1[1]};{data1[2]};;;\n')
I have a txt file with 2 columns and many rows with integers and strings (without IDs), where I need to remove rows longer than 50 characters, for example:
4:33333333:3333333: -:aaaaaeeeeeeeffffffffhhhhhhhh
I guess pandas drop function is not suitable in this case (from description: Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names)
Does Python have any other options?
Thank you!
This will rewrite the file according to that rule. In this case the newline character \n is not counted.
with open('file.txt', 'r') as file:
    rows = [line for line in file.readlines() if len(line.strip()) <= 50]

with open('out.txt', 'w') as file:
    for row in rows:
        file.write(row)
If you need to use the rows for something else, do:
rows = [line.strip() for line in file.readlines() if len(line.strip()) <= 50]
to clean the strings.
I think that you could process a large text file by processing chunks (specifying the number of rows in a chunk) using pandas as follows:
import pandas as pd

if __name__ == '__main__':
    n_rows_per_process = 10 ** 6  # 1,000,000 rows in a chunk
    allowed_column_length = 50
    columns = ["feat1", "feat2"]
    input_path = "input.txt"
    output_path = "results.txt"

    with open(output_path, "a") as fout:
        for chunk_df in pd.read_csv(input_path, chunksize=n_rows_per_process, sep=r"\s+", names=columns):
            tmp_df = chunk_df[chunk_df["feat2"].str.len() < allowed_column_length]
            tmp_df.to_csv(fout, mode="a", header=False, sep="\t")
You will get the expected result in the results.txt.
You may read your file line by line and write to another file only the lines which fulfill your condition:
MAXLEN = 50

with open("input.txt") as inp, open("output.txt", "w") as outp:
    for line in inp:
        if len(line) <= MAXLEN + 1:
            outp.write(line)
Every read line includes the (invisible) ending \n symbol, too, so we compare its length with MAXLEN + 1.
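An equivalent check, sketched below, strips the trailing newline before measuring instead of adjusting the limit:
MAXLEN = 50

with open("input.txt") as inp, open("output.txt", "w") as outp:
    for line in inp:
        # Strip only the trailing newline so the visible length is compared.
        if len(line.rstrip("\n")) <= MAXLEN:
            outp.write(line)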
I have a script that goes through all the folders within a given path ('C:\Users\Razvi\Desktop\italia'), reads the number of lines from each file found with the name "cnt.csv", and then writes into another file named "counters.csv" the current date + the name of the folder + the number of lines found in "cnt.csv".
Now the output file("counters.csv") looks like this:
30/9/2017
8dg5 5
8dg7 7
01/10/2017
8dg5 8
8dg7 10
Here the names 8dg5 and 8dg7 are the folders where the script found the file "cnt.csv", and the numbers 5, 7, 8, 10 are the line counts found in the CSV files each time I run the script; the date on which I run the script is obviously appended every time as well.
Now what I want is to write those entries to the CSV appended as columns, not rows, something like this:
30/9/2017 01/10/2017
8dg5 5 8dg5 8
8dg7 7 8dg7 10
Here is the code:
import re
import os
from datetime import datetime

#italia =r'C:\Users\timraion\Desktop\italia'
italia = r'C:\Users\Razvi\Desktop\italia'
file = open(r'C:\Users\Razvi\Desktop\counters.csv', 'a')
now = datetime.now()
dd = str(now.day)
mm = str(now.month)
yyyy = str(now.year)
date = (dd + "/" + mm + "/" + yyyy)
file.write(date + '\n')
for filename in os.listdir(italia):
    infilename = os.path.join(italia, filename)
    for files in os.listdir(infilename):
        if files.endswith("cnt.csv"):
            result = os.path.join(infilename, "cnt.csv")
            print(result)
            infilename2 = os.path.join(infilename, files)
            lines = 0
            for line in open(infilename2):
                lines += 1
            file = open(r'C:\Users\Razvi\Desktop\counters.csv', 'a')
            file.write(str(filename) + "," + str(lines) + "\n")
            file.close()
Thanks!
If you want to add the new entries as columns instead of rows further down, you'll have to read in the counters.csv file each time, append your columns, and rewrite it.
import os
from datetime import datetime
import csv

italia = r'C:\Users\Razvi\Desktop\italia'

counts = []
for filename in os.listdir(italia):
    infilename = os.path.join(italia, filename)
    for files in os.listdir(infilename):
        if files.endswith("cnt.csv"):
            result = os.path.join(infilename, "cnt.csv")
            print(result)
            infilename2 = os.path.join(infilename, files)
            lines = 0
            for line in open(infilename2):
                lines += 1
            counts.append((filename, lines))  # save the filename and count for later

if os.path.exists(r'C:\Users\Razvi\Desktop\counters.csv'):
    with open(r'C:\Users\Razvi\Desktop\counters.csv', 'r') as csvfile:
        counters = [row for row in csv.reader(csvfile)]  # read the csv file
else:
    counters = [[]]

now = datetime.now()
# Add the date and a blank cell to the first row
counters[0].append('{}/{}/{}'.format(now.day, now.month, now.year))
counters[0].append('')

for i, count in enumerate(counts):
    if i + 1 == len(counters):  # check if we need to add a row
        counters.append([''] * (len(counters[0]) - 2))
    while len(counters[i+1]) < len(counters[0]) - 2:
        counters[i+1].append('')  # ensure there are enough columns in the row we're on
    counters[i+1].extend(count)  # Add the count info as two new cells

with open(r'C:\Users\Razvi\Desktop\counters.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)  # write the new csv file
    for row in counters:
        writer.writerow(row)
Another note: file is a builtin name in Python 2, so you shouldn't use it as a variable name; shadowing builtins can cause unexpected behaviour further down in your code.
I'm using the following code to break up a large CSV file and I want the original CSV header to be written to each smaller CSV file. The problem I am having, though, is that the current code seems to skip a line of data for each smaller file. So in the example below Line 51 wouldn't be written to the smaller file (code modified from http://code.activestate.com/recipes/578045-split-up-text-file-by-line-count/). It seems to skip that line or perhaps it's being overwritten by the header:
import os

filepath = 'test.csv'
lines_per_file = 50
lpf = lines_per_file
path, filename = os.path.split(filepath)
with open(filepath, 'r') as r:
    name, ext = os.path.splitext(filename)
    try:
        w = open(os.path.join(path, '{}_{}{}'.format(name, 0, ext)), 'w')
        header = r.readline()
        for i, line in enumerate(r):
            if not i % lpf:
                # possible enhancement: don't check modulo lpf on each pass
                # keep a counter variable, and reset on each checkpoint lpf.
                w.close()
                filename = os.path.join(path,
                                        '{}_{}{}'.format(name, i, ext))
                w = open(filename, 'w')
                w.write(header)
            w.write(line)
    finally:
        w.close()
Consider using pandas to split the large csv file:
Let's create a CSV file with 500 rows and four columns using pandas:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(500,4), columns=['a','b','c','d'])
df.to_csv('large_data.csv', index=False)
Let's split large_data.csv into multiple CSV files of 50 rows each:
import pandas as pd

df = pd.read_csv('large_data.csv', chunksize=50)

i = 1
for chunk in df:
    chunk.to_csv('split_data_' + str(i) + '.csv', index=False)
    i = i + 1
This produces ten resulting files, split_data_1.csv through split_data_10.csv, each containing 50 rows plus the header.
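Applied to the test.csv from the question, the same chunked pattern would presumably look like the sketch below; to_csv writes the header to every piece by default, which was the original requirement:
import pandas as pd

# Assumption: test.csv has a header row, as in the question's code.
for i, chunk in enumerate(pd.read_csv('test.csv', chunksize=50)):
    # Each piece keeps the original header because header=True is the default.
    chunk.to_csv('test_{}.csv'.format(i), index=False)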
Could someone please advise me on how I could improve my code? I have 4 big CSV files. The first is a reference file against which the 3 other files (file1, file2 and file3) are compared. In the files, there are three columns. Each row is a unit (e.g. ABC, DEF, GHI are 3 separate units).
col_1 col_2 col_3
A B C
D E F
G H I
I would like to compare file1, file2 and file3 to the reference file. If the unit in a row of the reference file is present in all 3 files, I would like to write it to file A. If the unit is present in at least 1 of the 3 files, it should be written to file B. If the unit is not present in any of the 3 files, I would like to write it to file C. My current strategy is to append the files as 4 separate lists and to compare them. I realize that this approach is memory intensive. In addition, my script has been running for a long time without final output. As such, I was wondering if there is a more efficient approach to this problem?
Below is my code:
import csv
reference_1 = open ('reference.csv', 'rt', newline = '')
reader = csv.reader(reference_1, delimiter = ',')
file1 = open ('file1.csv','rt', newline = '')
reader1 = csv.reader(file1, delimiter = ',')
file2 = open ('file2.csv', 'rt',newline = '')
reader2 = csv.reader(file2, delimiter = ',')
file3 = open ('file3.csv', 'rt',newline = '')
reader3 = csv.reader(file3, delimiter = ',')
Common = open ('Common.csv', 'w',newline = '')
writer1 = csv.writer(Common, delimiter = ',')
Partial = open ('Partial.csv', 'w',newline = '')
writer2 = csv.writer(Partial, delimiter = ',')
Absent = open ('Absent.csv', 'w',newline = '')
writer3 = csv.writer(Absent, delimiter = ',')
reference = []
fileA = []
fileB = []
fileC = []
for row in reader:
    reference.append(row)
for row in reader1:
    fileA.append(row)
for row in reader2:
    fileB.append(row)
for row in reader3:
    fileC.append(row)

for row in reference:
    if row in fileA and row in fileB and row in fileC:
        writer1.writerow(row)
        continue
    elif row in fileA or row in fileB or row in fileC:
        writer2.writerow(row)
        continue
    else:
        writer3.writerow(row)
reference_1.close()
file1.close()
file2.close()
file3.close()
Common.close()
Partial.close()
Absent.close()
Assuming the order of the rows is not important and that there aren't duplicate rows in the reference file, here is an option using set.
def file_to_set(filename):
    """Opens a file and returns a set containing each line of the file."""
    with open(filename) as f:
        return set(f.read().splitlines(True))

def set_to_file(s, filename):
    """Writes a set to file."""
    with open(filename, 'w') as f:
        f.writelines(s)

def compare_files(ref_filename, *files):
    """Compares a reference file to two or more files."""
    if len(files) < 2:
        raise TypeError("compare_files expected at least 2 files, got %s" %
                        len(files))
    ref = file_to_set(ref_filename)
    file_data = [file_to_set(f) for f in files]
    all = file_data[0].union(*file_data[1:])
    common = ref.intersection(*file_data)
    partial = ref.intersection(all).difference(common)
    absent = ref.difference(all)
    set_to_file(common, 'common.csv')
    set_to_file(partial, 'partial.csv')
    set_to_file(absent, 'absent.csv')

compare_files('reference.csv', 'file1.csv', 'file2.csv', 'file3.csv')
The idea is:
Create sets containing each line of a file.
Make a set (all) that contains every line in every file (except the reference file).
Make a set (common) that contains only the lines that are in every file, including the reference file.
Make a set (partial) that contains the lines in the reference file that also appear in at least one but not all of the other files.
Make a set (absent) that contains the lines only present in the reference file.
Write common, partial, and absent to files.
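A tiny worked illustration of the set arithmetic, with hypothetical one-letter strings standing in for whole CSV lines:
ref = {"A", "D", "G", "J"}                  # lines in the reference file
f1, f2, f3 = {"A", "D"}, {"A", "G"}, {"A"}  # lines in the three other files

union_of_files = f1 | f2 | f3               # {"A", "D", "G"}
common = ref & f1 & f2 & f3                 # {"A"}        -> common.csv
partial = (ref & union_of_files) - common   # {"D", "G"}   -> partial.csv
absent = ref - union_of_files               # {"J"}        -> absent.csv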