Change values in CSV or text style file - python

I'm having some problems with the following file.
Each line has the following content:
foobar 1234.569 7890.125 12356.789 -236.4569 236.9874 -569.9844
What I want to do with this file is flip the sign of the last three numbers on each line, whether they are positive or negative.
The output should be:
foobar 1234.569 7890.125 12356.789 236.4569 -236.9874 569.9844
Or even better:
foobar,1234.569,7890.125,12356.789,236.4569,-236.9874,569.9844
What is the easiest pythonic way to accomplish this?
At first I used csv.reader, but then I found out the file isn't tab-separated; the fields are separated by a varying number (3-5) of spaces.
I've read up on the csv module and some examples / similar questions here, but my knowledge of Python isn't that good, and the csv module seems pretty awkward when you want to edit a value in a row.
I can import and edit this in Excel with no problem, but I want to use it in a Python script, since I have hundreds of these files. VBA in Excel is not an option.
Would it be better to just regex each line?
If so, can someone point me in a direction with an example?

You can use str.split() to split your white-space-separated lines into a row:
row = line.split()
then use csv.writer() to create your new file.
str.split() with no arguments, or None as the first argument, splits on arbitrary-width whitespace and ignores leading and trailing whitespace on the line:
>>> 'foobar 1234.569 7890.125 12356.789 -236.4569 236.9874 -569.9844\n'.split()
['foobar', '1234.569', '7890.125', '12356.789', '-236.4569', '236.9874', '-569.9844']
As a complete script:
import csv
with open(inputfilename, 'r') as infile, open(outputcsv, 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for line in infile:
        row = line.split()
        inverted_nums = [-float(val) for val in row[-3:]]
        writer.writerow(row[:-3] + inverted_nums)

from operator import neg
with open('file.txt') as f:
    for line in f:
        line = line.rstrip().split()
        last3 = list(map(str, map(neg, map(float, line[-3:]))))
        print("{0},{1}".format(line[0], ','.join(line[1:-3] + last3)))
Produces:
foobar,1234.569,7890.125,12356.789,236.4569,-236.9874,569.9844
CSV outputting version:
import csv

with open('file.txt') as f, open('ofile.txt', 'w+', newline='') as o:
    writer = csv.writer(o)
    for line in f:
        line = line.rstrip().split()
        last3 = list(map(neg, map(float, line[-3:])))
        writer.writerow(line[:-3] + last3)

You could use genfromtxt:
import numpy as np
a = np.genfromtxt('foo.csv', dtype=None)  # dtype=None infers a mixed string/float record type
with open('foo.csv', 'w') as f:
    # a[()] unpacks the zero-dimensional result of a single-line file into one record
    for el in a[()]:
        f.write(str(el) + ',')
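Note that the snippet above only rewrites the row with commas; it doesn't flip any signs. Below is a minimal sketch of how the sign change could also be done with numpy, assuming every line has the layout from the question (one text label followed by six floats) and writing to a hypothetical foo_out.csv:
import numpy as np

# Read the label column and the six numeric columns separately.
labels = np.atleast_1d(np.genfromtxt('foo.csv', usecols=0, dtype=str))
data = np.atleast_2d(np.genfromtxt('foo.csv', usecols=range(1, 7)))

# Flip the sign of the last three numeric columns.
data[:, -3:] *= -1

with open('foo_out.csv', 'w') as f:
    for label, row in zip(labels, data):
        f.write(label + ',' + ','.join(str(v) for v in row) + '\n')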


Using CSV columns to Search and Replace in a text file

Background
I have a two column CSV file like this:
Find        Replace
is          was
A           one
b           two
etc.
First column is text to find and second is text to replace.
I have second file with some text like this:
"This is A paragraph in a text file." (Please note the case sensitivity)
My requirement:
I want to use that CSV file to search and replace in the text file, with three conditions:
whole-word replacement,
case-sensitive replacement,
replace all instances of each entry in the CSV.
Script tried:
import csv
import re

with open('CSV_file.csv', mode='r') as infile:
    reader = csv.reader(infile)
    mydict = {(r'\b' + rows[0] + r'\b'): (r'\b' + rows[1] + r'\b') for rows in reader}  # <-- Requires attention
    print(mydict)

with open('find.txt') as infile, open(r'resul_out.txt', 'w') as outfile:
    for line in infile:
        for src, target in mydict.items():
            line = re.sub(src, target, line)  # <-- Requires attention
            # line = line.replace(src, target)
        outfile.write(line)
Description of script
I have loaded my csv into a python dictionary and use regex to find whole words.
Problems
I used r'\b' to add word boundaries for whole-word replacement, but the printed dictionary shows '\\b' instead of '\b'??
Using the commented-out replace() version instead gives output like:
"Thwas was one paragraph in a text file."
Secondly, I don't know how to make the replacement case-sensitive in the regex pattern.
Does anyone know a better solution than this script, or how to improve it?
Thanks for any help.
Here's a more cumbersome approach (more code), but one which is easier to read and does not rely on regular expressions. In fact, given the very simple nature of your CSV control file, I wouldn't normally bother using the csv module at all:
import csv
with open('temp.csv', newline='') as c:
    reader = csv.DictReader(c, delimiter=' ')
    D = {}
    for row in reader:
        D[row['Find']] = row['Replace']

with open('input.txt', newline='') as infile:
    with open('output.txt', 'w') as outfile:
        for line in infile:
            tokens = line.split()
            for i, t in enumerate(tokens):
                if t in D:
                    tokens[i] = D[t]
            outfile.write(' '.join(tokens) + '\n')
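Since the control file really is that simple, here is a sketch of the same dictionary built without the csv module, assuming one space-separated Find/Replace pair per line with a header line to skip:
D = {}
with open('temp.csv') as c:
    next(c)  # skip the "Find Replace" header line
    for row in c:
        parts = row.split()
        if len(parts) == 2:
            find, replace = parts
            D[find] = replace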
I'd just put pure strings into mydict so it looks like
{'is': 'was', 'A': 'one', ...}
and replace this line:
# line = re.sub(src, target, line) # old
line = re.sub(r'\b' + src + r'\b', target, line) # new
Note that you don't need \b in the replacement pattern. Regarding your other questions,
regular expressions are case-sensitive by default,
changing '\b' to '\\b' is exactly what the r'' prefix does. You can omit the r and write '\\b' yourself, but that quickly gets ugly with more complex regexes.
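Putting those points together, a minimal sketch of the corrected script (assuming the CSV holds comma-separated find/replace pairs with no header row, as the original csv.reader call implies; re.escape is added here in case a search term contains regex metacharacters):
import csv
import re

# Plain find -> replace strings; no \b is stored in the dictionary.
with open('CSV_file.csv', newline='') as infile:
    mydict = {rows[0]: rows[1] for rows in csv.reader(infile)}

with open('find.txt') as infile, open('resul_out.txt', 'w') as outfile:
    for line in infile:
        for src, target in mydict.items():
            # \b is added at substitution time; matching is case-sensitive by default.
            line = re.sub(r'\b' + re.escape(src) + r'\b', target, line)
        outfile.write(line)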

Convert from space to comma and reorder the values Python

I have a .csv file with many lines and with the structure:
YYY-MM-DD HH first_name quantity_number second_name first_number second_number third_number
I have a script in Python to convert the separator from space to comma, and that is working fine.
import csv
with open('file.csv') as infile, open('newfile.dat', 'w') as outfile:
    for line in infile:
        outfile.write(" ".join(line.split()).replace(' ', ','))
I need to change, in newfile.dat, the position of each value; for example, put the HH value in position 6, the second_name value in position 2, etc.
Thanks in advance for your help.
If you're importing csv, you might as well use it:
import csv
with open('file.csv', newline='') as infile, open('newfile.dat', 'w+', newline='') as outfile:
    read = csv.reader(infile, delimiter=' ')
    write = csv.writer(outfile)  # defaults to excel format, i.e. commas
    for line in read:
        write.writerow(line)
Use newline='' when opening csv files, otherwise you get double spaced files.
This just writes the line as it is in the input. If you want to change it before writing, do it in the for line in read: loop. line is a list of strings, which you can change the order of in any number of ways.
One way to reorder the values is to use operator.itemgetter:
from operator import itemgetter
getter = itemgetter(5,4,3,2,1,0)  # This will reverse a six-element list
for line in read:
    write.writerow(getter(line))
To reorder the items, a basic way could be as follows:
split_line = line.split(" ")
column_mapping = [9,6,3,7,3,2,1]
reordered = [split_line[c] for c in column_mapping]
joined = ",".join(reordered)
outfile.write(joined)
This splits up the string, reorders it according to column_mapping and then combines it back into one string (comma separated)
(in your code don't include column_mapping in the loop to avoid reinitialising it)
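Tying that fragment into the original loop, a minimal sketch (the column_mapping values here are placeholders; replace them with the order you actually need for your eight fields):
# Hypothetical mapping: output position i takes the input field column_mapping[i].
column_mapping = [0, 4, 2, 3, 1, 5, 6, 7]

with open('file.csv') as infile, open('newfile.dat', 'w') as outfile:
    for line in infile:
        split_line = line.split()
        reordered = [split_line[c] for c in column_mapping]
        outfile.write(','.join(reordered) + '\n')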

Sort a file by first (or second, or else) column in python

This seems a very basic question, but I am new to python, and after spending a long time trying to find a solution on my own, I thought it's time to ask some more advanced people!
So, I have a file (sample):
ENSMUSG00000098737 95734911 95734973 3 miRNA
ENSMUSG00000077677 101186764 101186867 4 snRNA
ENSMUSG00000092727 68990574 68990678 11 miRNA
ENSMUSG00000088009 83405631 83405764 14 snoRNA
ENSMUSG00000028255 145003817 145032776 3 protein_coding
ENSMUSG00000028255 145003817 145032776 3 processed_transcript
ENSMUSG00000028255 145003817 145032776 3 processed_transcript
ENSMUSG00000098481 38086202 38086317 13 miRNA
ENSMUSG00000097075 126971720 126976098 7 lincRNA
ENSMUSG00000097075 126971720 126976098 7 lincRNA
and I need to write a new file with all the same information, but sorted by the first column.
What I use so far is :
lines = open(my_file, 'r').readlines()
output = open("intermediate_alphabetical_order.txt", 'w')
for line in sorted(lines, key=itemgetter(0)):
    output.write(line)
output.close()
It doesn't return me any error, but just writes the output file exactly as the input file.
I know it is certainly a very basic mistake, but it would be amazing if some of you could tell me what I'm doing wrong!
Thanks a lot!
Edit
I am having trouble with the way I open the file, so the answers concerning already opened arrays don't really help.
The problem you're having is that you're not turning each line into a list. When you read in the file, you're just getting the whole line as a string. You're then sorting by the first character of each line, and this is always the same character in your input, 'E'.
To just sort by the first column, you need to split the first block off and just read that section. So your key should be this:
for line in sorted(lines, key=lambda line: line.split()[0]):
split will turn your line into a list, and then the first column is taken from that list.
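Applied to the original script, a minimal corrected sketch using the same file names as in the question:
with open(my_file, 'r') as infile:
    lines = infile.readlines()

with open("intermediate_alphabetical_order.txt", 'w') as output:
    # Sort on the first whitespace-separated field of each line.
    for line in sorted(lines, key=lambda line: line.split()[0]):
        output.write(line)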
If your input file is tab-separated, you can also use the csv module.
import csv
from operator import itemgetter
reader = csv.reader(open("t.txt"), delimiter="\t")
for line in sorted(reader, key=itemgetter(0)):
    print(line)
sorts by first column.
Change the number in
key=itemgetter(0)
for sorting by a different column.
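Keep in mind that csv.reader yields strings, so sorting on one of the numeric columns with itemgetter compares text rather than numbers. A small sketch converting the key to an int for a numeric sort (here on the second column):
# Sort by the second column as a number rather than as text.
for line in sorted(reader, key=lambda row: int(row[1])):
    print(line)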
Same idea as SuperBiasedMan, but I prefer this approach: if you want another way of sorting (for example: if the first column matches, sort by the second, then the third, etc.) it is more easily implemented:
with open(my_file) as f:
    lines = [line.split(' ') for line in f]

output = open("result.txt", 'w')
for line in sorted(lines):
    output.write(' '.join(line))
output.close()
You can write a function that takes a filename, delimiter and column to sort by using csv.reader to parse the file:
from operator import itemgetter
import csv
def sort_by(fle, col, delim):
    with open(fle) as f:
        r = csv.reader(f, delimiter=delim)
        for row in sorted(r, key=itemgetter(col)):
            yield row

for row in sort_by("your_file", 2, "\t"):
    print(row)
You can do this quickly with pandas as follows, with the data file set up exactly as you show it (i.e., with variable spaces as separators):
import pandas as pd
df = pd.read_csv('csvdata.csv', sep=' ', skipinitialspace=True, header=None)
df.sort_values(by=[0], inplace=True)
df.to_csv('sorted_csvdata.csv', header=None, index=None)
Just to check the result:
with open('sorted_csvdata.csv', 'r') as f:
    print(f.read())
ENSMUSG00000028255,145003817,145032776,3,protein_coding
ENSMUSG00000028255,145003817,145032776,3,processed_transcript
ENSMUSG00000028255,145003817,145032776,3,processed_transcript
ENSMUSG00000077677,101186764,101186867,4,snRNA
ENSMUSG00000088009,83405631,83405764,14,snoRNA
ENSMUSG00000092727,68990574,68990678,11,miRNA
ENSMUSG00000097075,126971720,126976098,7,lincRNA
ENSMUSG00000097075,126971720,126976098,7,lincRNA
ENSMUSG00000098481,38086202,38086317,13,miRNA
ENSMUSG00000098737,95734911,95734973,3,miRNA
You can do multi-column sorting by adding additional columns to the list in the by=[...] keyword argument.
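For example, a short sketch sorting first by column 0 and breaking ties with column 3, writing to a hypothetical sorted_multi.csv (assuming the same read_csv call as above):
# Sort by the first column, then by the fourth column for ties.
df = df.sort_values(by=[0, 3])
df.to_csv('sorted_multi.csv', header=False, index=False)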
Here is another option. Similar to some of the ideas above. Basically, mysort is a function that will do the custom sorting for you, which here is based on the first column of each line:
def mysort(line):
    return line.split()[0]

with open("records.txt", "r") as f:
    text = f.readlines()

for line in sorted(text, key=mysort):
    print(line)

re.sub for a csv file

I am receiving an error with this code: "TypeError: expected string or buffer". I looked around and found out that the error is because I am passing re.sub a list, and it does not take lists. However, I wasn't able to figure out how to change my line from the csv file into something that it would read.
I am trying to change all the periods in a csv file into commas. Here is my code:
import csv
import re
in_file = open("/test.csv", "rb")
reader = csv.reader(in_file)
out_file = open("/out.csv", "wb")
writer = csv.writer(out_file)
for row in reader:
    newrow = re.sub(r"(\.)+", ",", row)
    writer.writerow(newrow)
in_file.close()
out_file.close()
I'm sorry if this has already been answered somewhere. There were certainly a lot of answers regarding this error, but I couldn't make any of them work with my csv file. Also, as a side note, this was originally an .xlsb Excel file that I converted into csv in order to be able to work with it. Was that necessary?
You could use a list comprehension to apply your substitution to each item in the row:
for row in reader:
    newrow = [re.sub(r"(\.)+", ",", item) for item in row]
    writer.writerow(newrow)
for row in reader does not return a single string to parse; rather, it returns a list of the elements in that row, so you have to unpack that list and process each item individually, just like @Trii showed you:
[re.sub(r'(\.)+', ',', s) for s in row]
In this case, we are using glob to access all the csv files in the directory.
The code below overwrites the source csv file, so there is no need to create an output file.
NOTE:
If you want to get a second file with the parameters provided to re.sub, replace write = open(i, 'w') with write = open('secondFile.csv', 'w').
import re
import glob
for i in glob.glob("*.csv"):
    read = open(i, 'r')
    reader = read.read()
    csvRe = re.sub(r"(\.)+", ",", reader)
    write = open(i, 'w')
    write.write(csvRe)
    read.close()
    write.close()

Python slicing and splitting data from a text file

def read_slice():
    openfile = open('slicefile1.txt', 'r')
    #savetofile = open('slicefile2.txt', 'w')
    lines = openfile.readlines()
    for line in lines:
        line.slice(',')[16].split('\t')
        print(line)
The file that I want to read and parse (is that the correct word?) is in the format
['05,21,34,37,38,01,06', '09,16,26,36,39,02,06', '03,10,18,31,37,02,04'].
I want to return ( actually write to another file) in the format
05,21,34,37,38 tab 01,06
09,16,26,36,39 tab 02,06
I am obviously stupid because I still cannot handle lists and strings. The function gives an error saying the slice method is not available on strings. Please help if you would.
Slicing in python strings refers to using the [start:stop:step] notation. In your case, you can simply use:
'\t'.join((line[:14], line[15:]))
to split the line and rejoin it with a tab:
>>> line = '05,21,34,37,38,01,06'
>>> line[:14]
'05,21,34,37,38'
>>> line[15:]
'01,06'
>>> '\t'.join((line[:14], line[15:]))
'05,21,34,37,38\t01,06'
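Applied over the whole file, a minimal sketch that writes the result to the second file mentioned in the question (assuming one comma-separated entry per line of slicefile1.txt):
with open('slicefile1.txt') as infile, open('slicefile2.txt', 'w') as outfile:
    for line in infile:
        line = line.strip()
        if line:
            outfile.write('\t'.join((line[:14], line[15:])) + '\n')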
Slightly more complicated than Martijn's answer, but perhaps more flexible if changes to formatting or delimiters etc... occur:
import csv
with open('yourfile') as fin, open('output.csv', 'w', newline='') as fout:
    csvin = csv.reader(fin)
    csvout = csv.writer(fout, delimiter='\t')
    for row in csvin:
        csvout.writerow([','.join(row[:5]), ','.join(row[5:])])
