I have information distributed over multiple large CSV files.
I want to combine all the files into one new file, so that the first row of the first file is joined with the first row of each of the other files, and so on.
file1.csv
A,B
A,C
A,D
file2.csv
F,G
H,I
J,K
expected result:
output.csv
A,B,F,G
A,C,H,I
A,D,J,K
So, given an array ['file1.csv', 'file2.csv', ...], how do I go from here?
I tried loading each file into memory and combining them with np.column_stack, but my files are too large to fit in memory.
Not pretty code, but this should work.
I'm not using with open(filename, 'r') as myfile for the inputs; that could get a bit messy with 50 files, so these are opened and closed explicitly.
It opens each file and places the handle in a list. The first handle is taken as the master file, then we iterate through it line by line, each time reading one line from all the other open files, joining them with ',', and writing the result to the output file.
Note that if the other files have more lines, the extra lines won't be included. If any have fewer lines, readline() will simply return empty strings once they are exhausted, leaving empty fields. I'll leave it to you to deal with these situations gracefully.
Note also that you can use glob to create filelist if the names follow a logical pattern (thanks to N. Wouda, below), as sketched just after this note.
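A minimal sketch of the glob approach (the 'book*.csv' pattern is an assumption matching the file names below):

import glob

# Assumed pattern; adjust to your actual file names.
filelist = sorted(glob.glob('book*.csv'))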
filelist = ['book1.csv', 'book2.csv', 'book3.csv', 'book4.csv']
openfiles = []
for filename in filelist:
    openfiles.append(open(filename, 'r'))

# Use the first file in the list as the master.
# All other files should have at least as many lines.
masterfile = openfiles.pop(0)
with open('output.csv', 'w') as outputfile:
    for line in masterfile:
        outputlist = [line.strip()]
        for openfile in openfiles:
            # readline() returns '' at EOF, so shorter files yield empty fields
            outputlist.append(openfile.readline().strip())
        outputfile.write(','.join(outputlist) + '\n')
masterfile.close()
for openfile in openfiles:
    openfile.close()
Input files (four copies of the same file):
a b c d e f
1 2 3 4 5 6
7 8 9 10 11 12
13 14 15 16 17 18
Output
a b c d e f,a b c d e f,a b c d e f,a b c d e f
1 2 3 4 5 6,1 2 3 4 5 6,1 2 3 4 5 6,1 2 3 4 5 6
7 8 9 10 11 12,7 8 9 10 11 12,7 8 9 10 11 12,7 8 9 10 11 12
13 14 15 16 17 18,13 14 15 16 17 18,13 14 15 16 17 18,13 14 15 16 17 18
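On Python 3, the open/close bookkeeping above can also be delegated to contextlib.ExitStack, which closes every handle automatically. A minimal sketch of the same approach:

from contextlib import ExitStack

filelist = ['book1.csv', 'book2.csv', 'book3.csv', 'book4.csv']

with ExitStack() as stack, open('output.csv', 'w') as outputfile:
    files = [stack.enter_context(open(name)) for name in filelist]
    # zip() is lazy and stops at the shortest file, so no file is read into memory whole
    for rows in zip(*files):
        outputfile.write(','.join(row.strip() for row in rows) + '\n')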
Instead of reading the files completely into memory, you can iterate over them line by line.

from itertools import izip  # like zip, but gives us an iterator (Python 2)

with open('file1.csv') as f1, open('file2.csv') as f2, open('output.csv', 'w') as out:
    for f1line, f2line in izip(f1, f2):
        # f2line keeps its trailing newline, so none is added here
        out.write('{},{}'.format(f1line.strip(), f2line))
Demo:
$ cat file1.csv
A,B
A,C
A,D
$ cat file2.csv
F,G
H,I
J,K
$ python2.7 merge.py
$ cat output.csv
A,B,F,G
A,C,H,I
A,D,J,K
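Note that on Python 3, itertools.izip no longer exists; the built-in zip is already lazy, so you can simply drop the import:

with open('file1.csv') as f1, open('file2.csv') as f2, open('output.csv', 'w') as out:
    for f1line, f2line in zip(f1, f2):
        out.write('{},{}'.format(f1line.strip(), f2line))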
Related
I have a file "testread.txt" having below data.
A
1
2
3
4
BA
5
6
7
8
CB
9
10
11
D
12
13
14
15
I want to read and extract the data section by section and write each section to a different file. E.g.:
1
2
3
4
Write it to File "a.txt"
5
6
7
8
Write it to File "b.txt"
9
10
11
Write it to File "c.txt"
and so on...
A (rough) solution can be obtained using:
collections.defaultdict to divide and store items;
numpy.savetxt to save them into files.
import numpy as np
from collections import defaultdict

with open('testread.txt', 'r') as f:
    content = f.readlines()

d = defaultdict(list)
i = 0
for line in content:
    if line == '\n':
        i += 1          # blank line: start a new section
    else:
        d[i].append(line.strip())

for k, v in d.items():
    # v[1:] drops the section header (A, BA, CB, D)
    np.savetxt('file{}.txt'.format(k), v[1:], delimiter=",", fmt='%s')
and you get:
file0.txt
1
2
3
4
file1.txt:
5
6
7
8
file2.txt:
9
10
11
file3.txt
12
13
14
15
The idea is to switch to the next output file whenever an empty line is encountered. The code below should do the trick.
files_list = ['a.txt', 'b.txt', 'c.txt']

with open('input.txt') as fpr:
    for f in files_list:
        with open(f, 'w') as fpw:
            for i, line in enumerate(fpr):
                if i == 0:  # skip the section header line
                    continue
                if line.strip():
                    fpw.write(line)
                else:
                    break   # blank line: move on to the next output file
I want to merge 2 files. First file (co60.txt) contains only integer values and the second file (bins.txt) contains float numbers.
co60.txt:
11
14
12
14
18
15
18
9
bins.txt:
0.00017777777777777795
0.0003555555555555559
0.0005333333333333338
0.0007111111111111118
0.0008888888888888898
0.0010666666666666676
0.0012444444444444456
0.0014222222222222236
When I merge those two files with this code:

import re

with open("co60.txt", 'r') as a:
    a1 = [re.findall(r"[\w']+", line) for line in a]
with open("bins.txt", 'r') as b:
    b1 = [re.findall(r"[\w']+", line) for line in b]
with open("spectrum.txt", 'w') as c:
    for x, y in zip(a1, b1):
        c.write("{} {}\n".format(" ".join(x), y[0]))
I get:
11 0
14 0
12 0
14 0
18 0
15 0
18 0
9 0
It appears that when I merge these two files, this code keeps only the integer part of the values from bins.txt.
How do I get the files to merge like this:
11 0.00017777777777777795
14 0.0003555555555555559
12 0.0005333333333333338
14 0.0007111111111111118
18 0.0008888888888888898
15 0.0010666666666666676
18 0.0012444444444444456
9 0.0014222222222222236
You can do it without regex:

with open("co60.txt") as a, open("bins.txt") as b, \
        open("spectrum.txt", 'w') as c:
    for x, y in zip(a, b):
        c.write("{} {}\n".format(x.strip(), y.strip()))
Content of spectrum.txt:
11 0.00017777777777777795
14 0.0003555555555555559
12 0.0005333333333333338
14 0.0007111111111111118
18 0.0008888888888888898
15 0.0010666666666666676
18 0.0012444444444444456
9 0.0014222222222222236
As mentioned by @immortal, if you want to use regex then use:

b1 = [re.findall(r"[0-9\.]+", line) for line in b]
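For reference, the original pattern fails because \w matches letters, digits, and underscore, but not the decimal point, so each float splits at the '.' and y[0] picks up only the leading 0. A quick demonstration:

import re

line = "0.00017777777777777795"
print(re.findall(r"[\w']+", line))    # ['0', '00017777777777777795'] -- split at '.'
print(re.findall(r"[0-9\.]+", line))  # ['0.00017777777777777795']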
I am working with a big dataset and thus I only want to use the items that are most frequent.
Simple example of a dataset:
1 2 3 4 5 6 7
1 2
3 4 5
4 5
4
8 9 10 11 12 13 14
15 16 17 18 19 20
4 has 4 occurrences,
1 has 2 occurrences,
2 has 2 occurrences,
5 has 2 occurrences,
I want to be able to generate a new dataset just with the most frequent items, in this case the 4 most common:
The wanted result:
1 2 3 4 5
1 2
3 4 5
4 5
4
I can find the first 50 most common items, but I am failing to write them out correctly (my output ends up the same as the input dataset).
Here is my code:
from collections import Counter

with open('dataset.dat', 'r') as f:
    lines = []
    for line in f:
        lines.append(line.split())

c = Counter(sum(lines, []))
p = c.most_common(50)

with open('dataset-mostcommon.txt', 'w') as output:
    ..............
Can someone please help me on how I can achieve it?
You have to iterate over the dataset again and, for each line, keep only the items that are in the most-common set.
If the input lines are sorted, you may just do a set intersection and print the results in sorted order. If not, iterate over your line data and check each item:

for line in dataset:
    for element in line.split():
        if element in most_common_elements:
            print(element, end=' ')
    print()
PS: For Python 2, add from __future__ import print_function at the top of your script.
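Putting both steps together, a minimal sketch (file names taken from the question; most_common_elements is a set for fast membership tests):

from collections import Counter

with open('dataset.dat') as f:
    lines = [line.split() for line in f]

# Count every item across all lines and keep the 50 most common.
counts = Counter(item for line in lines for item in line)
most_common_elements = {item for item, _ in counts.most_common(50)}

with open('dataset-mostcommon.txt', 'w') as output:
    for line in lines:
        kept = [item for item in line if item in most_common_elements]
        if kept:
            output.write(' '.join(kept) + '\n')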
According to the documentation, c.most_common(n) returns a list of tuples, so you can get the desired output as follows (note that the items are strings, hence %s):

with open('dataset-mostcommon.txt', 'w') as output:
    for item, occurrence in p:
        output.write("%s has %d occurrences,\n" % (item, occurrence))
What is the most efficient way to convert this file:
10 3
10 5
12 6
12 19
19 12
19 14
19 10
to this:
10 3 5
12 6 19
19 12 14 10
First column of the input file is numerically sorted in increasing order.
Any solutions using Python, AWK, etc. are welcome.
from itertools import groupby

lines = [line.split() for line in open("In.txt")]
op_file = open("Out.txt", "w")
for key, grp in groupby(lines, key=lambda x: x[0]):
    # join the second column of all rows sharing the same first column
    print >> op_file, "{} {}".format(key, " ".join([i[1] for i in grp]))
op_file.close()
Output
10 3 5
12 6 19
19 12 14 10
Since you mentioned awk:
$ awk '{a[$1]=a[$1]" "$2}END{for (i in a){print i a[i]}}' input
19 12 14 10
10 3 5
12 6 19
pipe it to sort to have it, well, sorted:
$ awk '...' input | sort
10 3 5
12 6 19
19 12 14 10
In Python 2:
import itertools, operator

with open(infilename) as infile:
    input = (line.split() for line in infile)
    output = itertools.groupby(input, operator.itemgetter(0))
    # the output file must be written while infile is still open,
    # since groupby consumes the generator lazily
    with open(outfilename, 'w') as outfile:
        for key, line in output:
            print >>outfile, key, ' '.join(val[1] for val in line)
This assumes that the input and output files are different: you could just write the output to standard out and leave it as the user's problem to save it.
Try out this code:

fp = open('/tmp/test.txt')
list_dict = {}
for line in fp.readlines():
    split_values = line.split()
    if split_values[0] in list_dict:
        # later occurrences of the key: append only the values
        list_dict[split_values[0]].extend(split_values[1:])
    else:
        # first occurrence: store the key together with its values
        list_dict[split_values[0]] = split_values
fp.close()
for val in list_dict.values():
    print " ".join(val)
I am trying to read in a csv file with numpy.genfromtxt but some of the fields are strings which contain commas. The strings are in quotes, but numpy is not recognizing the quotes as defining a single string. For example, with the data in 't.csv':
2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0
the code
np.genfromtxt('t.csv', delimiter=',')
produces the error:
ValueError: Some errors were detected !
Line #2 (got 4 columns instead of 3)
The data structure I am looking for is:
array([['2012', 'Louisville KY', '3.5'],
['2011', 'Lexington, KY', '4.0']],
dtype='|S13')
Looking over the documentation, I don't see any options to deal with this. Is there a way to do it with numpy, or do I just need to read in the data with the csv module and then convert it to a numpy array?
You can use pandas (fast becoming the default library for working with dataframes, i.e. heterogeneous data, in scientific Python) for this. Its read_csv can handle this. From the docs:
quotechar : string
The character to used to denote the start and end of a quoted item. Quoted items
can include the delimiter and it will be ignored.
The default value is ". An example:
In [1]: import pandas as pd
In [2]: from StringIO import StringIO
In [3]: s="""year, city, value
...: 2012, "Louisville KY", 3.5
...: 2011, "Lexington, KY", 4.0"""
In [4]: pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)
Out[4]:
year city value
0 2012 Louisville KY 3.5
1 2011 Lexington, KY 4.0
The trick here is that you also have to use skipinitialspace=True to deal with the spaces after the comma-delimiter.
Apart from its powerful csv reader, I can also strongly advise using pandas with the heterogeneous data you have (the example output in numpy you give is all strings, although you could use structured arrays).
The problem is the additional comma inside the quoted field; np.genfromtxt does not deal with that.
One simple solution is to read the file with csv.reader() from Python's csv module into a list and then dump it into a numpy array if you like.
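A minimal sketch of that approach (file name taken from the question):

import csv
import numpy as np

with open('t.csv', newline='') as f:
    rows = list(csv.reader(f, skipinitialspace=True))

# dtype is a string type; quoted commas survive intact
data = np.array(rows)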
If you really want to use np.genfromtxt, note that it can take iterators instead of files, e.g. np.genfromtxt(my_iterator, ...). So, you can wrap a csv.reader in an iterator and give it to np.genfromtxt.
That would go something like this:
import csv
import numpy as np
np.genfromtxt(("\t".join(i) for i in csv.reader(open('myfile.csv'))), delimiter="\t")
This essentially replaces, on the fly, only the separating commas with tabs.
If you are using numpy, you probably want to work with a numpy.ndarray. This will give you one:

import pandas
data = pandas.read_csv('file.csv').as_matrix()

Pandas will handle the "Lexington, KY" case correctly. (Note: as_matrix() was removed in pandas 1.0; on current versions use .to_numpy() instead.)
Make a better function that combines the power of the standard csv module and Numpy's recfromcsv. For instance, the csv module has good control over and customization of dialects, quotes, escape characters, etc., which you can add to the example below.
The recfromcsv_mod function below reads a complicated CSV file similar to what Microsoft Excel produces, which may contain commas within quoted fields. Internally, the function uses a generator that rewrites each row with tab delimiters.
import csv
import numpy as np

def recfromcsv_mod(fname, **kwargs):
    def rewrite_csv_as_tab(fname):
        with open(fname, newline='') as fp:
            # sniff the dialect from the first 1 KB, then rewind
            dialect = csv.Sniffer().sniff(fp.read(1024))
            fp.seek(0)
            for row in csv.reader(fp, dialect):
                yield "\t".join(row)
    return np.recfromcsv(
        rewrite_csv_as_tab(fname), delimiter="\t", encoding=None, **kwargs)
# Use it to read a CSV file into a record array
x = recfromcsv_mod("t.csv", case_sensitive=True)
You can try this code. It reads the .csv file with the np.genfromtxt() method.
Code:

import numpy as np

myfile = np.genfromtxt('MyData.csv', delimiter=',')
myfile = myfile.astype('int64')
print(myfile)
Output:
[[ 1 1 1 1 1 1 1 1 1 1 1]
[ 3 3 3 3 3 3 3 3 3 3 3]
[ 3 3 3 3 3 3 3 3 3 3 3]
[ 4 4 4 4 4 4 4 4 4 4 4]
[ 5 5 5 5 5 5 5 5 5 5 5]
[ 6 6 6 6 6 6 6 6 6 6 6]
[ 7 7 7 7 7 7 7 7 7 7 7]
[ 8 8 8 8 8 8 8 8 8 8 8]
[ 9 9 9 9 9 9 9 9 9 9 9]
[10 10 10 10 10 10 10 10 10 10 10]
[11 11 11 11 11 11 11 11 11 11 11]
[12 12 12 12 12 12 12 12 12 12 12]
[13 13 13 13 13 13 13 13 13 13 13]
[14 14 14 14 14 14 14 14 14 14 14]
[15 15 15 15 15 15 15 15 15 15 15]
[16 17 18 19 20 21 22 23 24 25 26]]
Input file: "MyData.csv" (comma-separated integers matching the output above).