How can I use csv tools for a zipped text file? - python

Update: my file.txt.gz is tab delimited and looks kind of like this:
file.txt.gz
I want to split the first column by : _ /
original post:
I have a very large zipped tab-delimited file.
I want to open it, scan it one row at a time, split some of the columns, and write the result to a new file.
I got various errors (every time I fix one, another pops up).
This is my code:
import csv
import re
import gzip

f = gzip.open('file.txt.gz')
original = f.readlines()
f.close()

original_l = csv.reader(original)
for row in original_l:
    file_l = re.split('_|:|/', row)
    with open('newfile.gz', 'w', newline='') as final:
        finalfile = csv.writer(final, delimiter=' ')
        finalfile.writerow(file_l)
Thanks!
For this code I got the error:
for row in original_l:
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
So, based on what I found here, I added this after f.close():
original = original.decode('utf8')
and then got the error:
original = original.decode('utf8')
AttributeError: 'list' object has no attribute 'decode'
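For reference, readlines() on a file opened in binary mode returns a list of bytes objects, so decode() has to be applied to each element, not to the list itself. A minimal self-contained sketch (the sample file and its contents are made up for the demonstration; opening the gzip file in text mode avoids the issue entirely):

```python
import gzip
import os

# Write a tiny gzipped sample so the sketch is self-contained.
with gzip.open('sample.txt.gz', 'wt') as f:
    f.write('a_1:b/c\td\n')

f = gzip.open('sample.txt.gz')  # default mode 'rb': lines come back as bytes
raw = f.readlines()
f.close()

# decode() belongs on each bytes element, not on the list
original = [line.decode('utf8') for line in raw]

os.remove('sample.txt.gz')
```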

Update 2
This code should produce the output that you're after.
import csv
import gzip
import re

with gzip.open('file.txt.gz', mode='rt') as f, \
        gzip.open('newfile.gz', mode='wt') as final:  # gzip.open so the output really is compressed
    writer = csv.writer(final, delimiter=' ')
    reader = csv.reader(f, delimiter='\t')
    _ = next(reader)  # skip header row
    for row in reader:
        writer.writerow(re.split(r'_|:|/', row[0]))
Update
Open the gzip file in text mode because str objects are required by the CSV module in Python 3.
f = gzip.open('file.txt.gz', 'rt')
Also specify the delimiter when creating the csv.reader.
original_l = csv.reader(original, delimiter='\t')
This will get you past the first hurdle.
Now you need to explain what the data is, which columns you wish to extract, and what the output should look like.
Original answer follows...
One obvious problem is that the output file is constantly being overwritten by the next row of input. This is because the output file is opened in (over)write mode ('w') once per row.
It would be better to open the output file once outside of the loop.
Also, the CSV file delimiter is not specified when creating the reader. You said that the file is tab delimited so specify that:
original_l = csv.reader(original, delimiter='\t')
Your code then attempts to split each row using other delimiters; however, the rows coming from the csv.reader are lists, not the strings that re.split() requires.
Another problem is that the output file is not zipped as the name suggests.
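To make the output actually gzipped, the output file can be opened with gzip.open() in text mode, just like the input. A sketch with made-up filenames and data:

```python
import csv
import gzip
import os

# Made-up input file so the sketch is self-contained.
with gzip.open('in_demo.txt.gz', 'wt') as f:
    f.write('a_1:b\tx\n')

# gzip.open(..., 'wt') compresses on the writing side the same way
# 'rt' decompresses on the reading side.
with gzip.open('in_demo.txt.gz', 'rt') as f, \
        gzip.open('out_demo.txt.gz', 'wt', newline='') as final:
    reader = csv.reader(f, delimiter='\t')
    writer = csv.writer(final, delimiter=' ')
    for row in reader:
        writer.writerow(row)

# Reading it back through gzip confirms the output is really compressed.
with gzip.open('out_demo.txt.gz', 'rt') as check:
    result = check.read()

os.remove('in_demo.txt.gz')
os.remove('out_demo.txt.gz')
```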

Related

If the csv file is more than 3 lines, edit it to delete the first line (Not save as new file)

I'm trying to write code that, if the csv file is more than 3 lines, edits it to delete the first line.
I want to delete the first line from the existing file instead of saving the result as a new file.
To do that, I deleted the existing file and created a file with the same name, but
only one line was saved and the commas disappeared.
I'm using a Pandas data frame, but if it doesn't matter, I'd rather not use it.
The function name might be weird because I'm a beginner.
Thanks.
file = open("./csv/selllist.csv", encoding="ANSI")
reader = csv.reader(file)
lines = len(list(reader))

if lines > 3:
    df = pd.read_csv('./csv/selllist.csv', 'r+', encoding="ANSI")
    dfa = df.iloc[1:]
    print(dfa)
    with open("./csv/selllist.csv", 'r+', encoding="ANSI") as x:
        x.truncate(0)
    with open('./csv/selllist.csv', 'a', encoding="ANSI", newline='') as fo:
        # Pass the CSV file object to the writer() function
        wo = writer(fo)
        # Result - a writer object
        # Pass the data in the list as an argument into the writerow() function
        wo.writerow(dfa)
        # Close the file object
        fo.close()
print()
This is the type of csv file I deal with
string, string, string, string, string
string, string, string, string, string
string, string, string, string, string
string, string, string, string, string
Take a 2-step approach.
Open the file for reading and count the number of lines. If there are more than 3 lines, re-open the file (for writing) and update it.
For example:
lines = []
with open('./csv/selllist.csv') as csv:
    lines = csv.readlines()

if len(lines) > 3:
    with open('./csv/selllist.csv', 'w') as csv:
        for line in lines[1:]:  # skip first line
            csv.write(line)
With pandas, you can just specify header=None while reading and writing:
import pandas as pd

if lines > 3:
    df = pd.read_csv("data.csv", header=None)
    df.iloc[1:].to_csv("data.csv", header=None, index=None)
With the csv module:
import csv

with open("data.csv") as infile:
    reader = csv.reader(infile)
    lines = list(reader)

if len(lines) > 3:
    with open("data.csv", "w", newline="") as outfile:
        writer = csv.writer(outfile, delimiter=",")
        writer.writerows(lines[1:])
With one open call and using seek and truncate.
Setup
out = """\
Look at me
I'm a file
for sure
4th line, woot!"""
with open('filepath.csv', 'w') as fh:
    fh.write(out)
Solution
I'm aiming to minimize the stuff I'm doing. I'll only open one file and only one time. I'll only split one time.
with open('filepath.csv', 'r+') as csv:
    top, rest = csv.read().split('\n', 1)  # only necessary to pop off first line
    if rest.count('\n') > 1:  # if 4 or more lines, there will be at
                              # least two more newline characters
        csv.seek(0)           # once we're done reading, we need to
                              # go back to the beginning of the file
        csv.truncate()        # truncate to reduce size of file as well
        csv.write(rest)
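The seek/truncate approach can be verified end to end against a throwaway copy of the sample (the filename here is made up for the check):

```python
# Recreate the four-line sample in a throwaway file.
with open('check.csv', 'w') as fh:
    fh.write('Look at me\nI am a file\nfor sure\n4th line, woot!')

# Same technique: one open call, one split, seek back, truncate, rewrite.
with open('check.csv', 'r+') as csv_fh:
    top, rest = csv_fh.read().split('\n', 1)
    if rest.count('\n') > 1:   # 4+ lines leave 2+ newlines in `rest`
        csv_fh.seek(0)
        csv_fh.truncate()
        csv_fh.write(rest)

with open('check.csv') as fh:
    remaining = fh.read().splitlines()
```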

Read CSV with comma as linebreak

I have a file saved as .csv
"400":0.1,"401":0.2,"402":0.3
Ultimately I want to save the data in a proper format in a csv file for further processing. The problem is that there are no line breaks in the file.
pathname = r"C:\pathtofile\file.csv"
with open(pathname, newline='') as file:
    reader = file.read().replace(',', '\n')
    print(reader)

with open(r"C:\pathtofile\filenew.csv", 'w') as new_file:
    csv_writer = csv.writer(new_file)
    csv_writer.writerow(reader)
The print reader output looks exactly how I want (or at least it's a format I can further process).
"400":0.1
"401":0.2
"402":0.3
And now I want to save that to a new csv file. However the output looks like
"""",4,0,0,"""",:,0,.,1,"
","""",4,0,1,"""",:,0,.,2,"
","""",4,0,2,"""",:,0,.,3
I'm sure it would be intelligent to convert the format to
400,0.1
401,0.2
402,0.3
at this stage instead of doing later with another script.
The main problem is that my current code
with open(pathname, newline='') as file:
    reader = file.read().replace(',', '\n')
    reader = csv.reader(reader, delimiter=':')
    x = []
    y = []
    print(reader)
    for row in reader:
        x.append(float(row[0]))
        y.append(float(row[1]))
print(x)
print(y)
works fine for the type of csv files I currently have, but doesn't work for these mentioned above:
y.append( float(row[1]) )
IndexError: list index out of range
So I'm trying to find a way to work with them too. I think I'm missing something obvious, as I imagine it can't be too hard to properly define a file's linebreak character and delimiter.
with open(pathname, newline=',') as file:
yields
ValueError: illegal newline value: ,
The right way with the csv module, without replace() and without casting to float:
import csv

with open('file.csv', 'r') as f, open('filenew.csv', 'w', newline='') as out:
    reader = csv.reader(f)
    writer = csv.writer(out, quotechar=None)
    for r in reader:
        for i in r:
            writer.writerow(i.split(':'))
The resulting filenew.csv contents (according to your "intelligent" condition):
400,0.1
401,0.2
402,0.3
Nuances:
csv.reader and csv.writer objects treat the comma , as the default delimiter (so there is no need for file.read().replace(',', '\n'))
quotechar=None is specified for the csv.writer object to eliminate double quotes around the values being saved
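The effect of quotechar=None can be seen in isolation with an in-memory buffer (a small demonstration, not part of the original answer):

```python
import csv
import io

row = ['"400"', '0.1']  # a value whose embedded quotes would otherwise be escaped

# Default writer: the embedded quotes are doubled and the field is wrapped.
buf_default = io.StringIO()
csv.writer(buf_default).writerow(row)
default_line = buf_default.getvalue()

# quotechar=None: fields are written verbatim, with no quoting at all.
buf_none = io.StringIO()
csv.writer(buf_none, quotechar=None).writerow(row)
bare_line = buf_none.getvalue()
```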
You need to split the values to form a list to represent a row. Presently the code is splitting the string into individual characters to represent the row.
pathname = r"C:\pathtofile\file.csv"

with open(pathname) as old_file:
    with open(r"C:\pathtofile\filenew.csv", 'w') as new_file:
        csv_writer = csv.writer(new_file, delimiter=',')
        text_rows = old_file.read().split(",")
        for row in text_rows:
            items = row.split(":")
            # strip the quotes before converting to int
            csv_writer.writerow([int(items[0].strip('"')), items[1]])
If you look at the documentation for writerow(), it says:
Write the row parameter to the writer’s file
object, formatted according to the current dialect.
But, you are writing an entire string in your code
csv_writer.writerow(reader)
because reader is a string at this point.
Now, the format you want to use in your CSV file is not clearly mentioned in the question. But as you said, if you can do some preprocessing to create a list of lists and pass each sublist to writerow(), you should be able to produce the required file format.
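That preprocessing step could look something like this (a sketch, assuming the single-line "key":value input shown in the question):

```python
import csv
import io

line = '"400":0.1,"401":0.2,"402":0.3'

# Build a list of lists: one [key, value] pair per output row.
rows = [pair.split(':') for pair in line.split(',')]
rows = [[k.strip('"'), v] for k, v in rows]  # drop the quotes around the keys

# io.StringIO stands in for the output file here.
out = io.StringIO()
csv.writer(out).writerows(rows)
result = out.getvalue()
```

Each sublist becomes its own CSV row, which is exactly what writerow()/writerows() expect.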

Writing output to csv file [in correct format]

I realize this question has been asked a million times and there is a lot of documentation on it. However, I am unable to output the results in the correct format.
The below code was adopted from: Replacing empty csv column values with a zero
# Save the script below as RepEmptyCells.py
# Add #!/usr/bin/python to the script
# Make it executable with chmod +x prior to running it on the desired .csv file
# The code below will look through your .csv file and replace empty cells with 0s
# This can be particularly useful for genetic distance matrices
import csv
import sys

reader = csv.reader(open(sys.argv[1], "rb"))
for row in reader:
    for i, x in enumerate(row):
        if len(x) < 1:
            x = row[i] = 0
    print(','.join(str(x) for x in row))  # join() needs strings, not ints
Currently, to get the correct output .csv file [i.e. in correct format] one can run the following command in bash:
#After making the script executable
./RepEmptyCells.py input.csv > output.csv # this produces the correct output
I've tried to use csv.writer function to produce the correctly formatted output.csv file (similar to ./RepEmptyCells.py input.csv > output.csv) without much luck.
I'd like to learn how to add this last part to the code to automate the process without having to do it in bash.
What I have tried:
import csv
import sys

f = open('output2.csv', 'w')
reader = csv.reader(open(sys.argv[1], "rb"))
for row in reader:
    for i, x in enumerate(row):
        if len(x) < 1:
            x = row[i] = 0
    f.write(','.join(str(x) for x in row))
f.close()
When looking at the raw files from this code and the one before, they look the same.
However, when I open them in either excel or iNumbers the latter (i.e. output2.csv) shows only a single row of the data.
It's important that both output.csv and output2.csv can be opened in Excel.
2 options:
Just do a f.write('\n') after your current f.write statement.
Use csv.writer. You mention it but it isn't in your code.
writer = csv.writer(f)
...
writer.writerow([int(x) for x in row]) # Note difference in parameter format
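Put together, option 2 might look like this (a sketch; io.StringIO stands in for the input and output files):

```python
import csv
import io

# In-memory buffers stand in for open(sys.argv[1]) and the output file.
infile = io.StringIO('1,2,,4\n,5,6,\n')
outfile = io.StringIO()

reader = csv.reader(infile)
writer = csv.writer(outfile)
for row in reader:
    for i, x in enumerate(row):
        if len(x) < 1:
            row[i] = 0          # replace empty cells with 0
    writer.writerow([int(x) for x in row])  # writer adds the row terminator itself

result = outfile.getvalue()
```

Because csv.writer terminates each row itself, the single-row problem from f.write() disappears.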
A humble proposition
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv
import sys

# Use the with statement to properly close files.
# newline='' is the right option for Python 3.x.
with open(sys.argv[1], 'r', newline='') as fin, open(sys.argv[2], 'w', newline='') as fout:
    reader = csv.reader(fin)
    # You may need to redefine the dialect for some versions of Excel that
    # split cells on semicolons (for _Comma_ Separated Values, yes...)
    writer = csv.writer(fout, dialect="excel")
    for row in reader:
        # Write while reading; let the OS handle the caching alone.
        # Process the data as it comes, in a generator, checking every cell
        # in a row. If a cell is empty, the `or` returns "0".
        # Keep strings all the time: if a cell is not an int, int() would fail;
        # converting to int would force the writer to convert it back to str
        # anyway, and Excel makes no difference when loading.
        writer.writerow(cell or "0" for cell in row)
Sample in.csv
1,2,3,,4,5,6,
7,,8,,9,,10
Output out.csv
1,2,3,0,4,5,6,0
7,0,8,0,9,0,10
import csv
import sys

with open(sys.argv[1], newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(','.join(cell or '0' for cell in row))  # empty cells become '0'
and I don't understand your need for using the shell and redirecting.
a csv writer is just:
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)
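For completeness, here is that writerows() pattern exercised on some made-up rows, with an in-memory buffer standing in for the file:

```python
import csv
import io

rows = [['1', '2', '0'], ['7', '0', '8']]  # made-up data for the sketch

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows(rows)  # one call writes every row, terminators included
written = buf.getvalue()
```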

re.sub for a csv file

I am receiving a error on this code. It is "TypeError: expected string or buffer". I looked around, and found out that the error is because I am passing re.sub a list, and it does not take lists. However, I wasn't able to figure out how to change my line from the csv file into something that it would read.
I am trying to change all the periods in a csv file into commas. Here is my code:
import csv
import re

in_file = open("/test.csv", "rb")
reader = csv.reader(in_file)
out_file = open("/out.csv", "wb")
writer = csv.writer(out_file)

for row in reader:
    newrow = re.sub(r"(\.)+", ",", row)
    writer.writerow(newrow)

in_file.close()
out_file.close()
I'm sorry if this has already been answered somewhere. There were certainly a lot of answers regarding this error, but I couldn't make any of them work with my csv file. Also, as a side note, this was originally an .xlsb Excel file that I converted into csv in order to be able to work with it. Was that necessary?
You could use a list comprehension to apply your substitution to each item in row:
for row in reader:
    newrow = [re.sub(r"(\.)+", ",", item) for item in row]
    writer.writerow(newrow)
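The full round trip might look like this (a sketch; in-memory buffers stand in for test.csv and out.csv):

```python
import csv
import io
import re

# Buffers stand in for the input and output files.
in_file = io.StringIO('a.b,c..d\ne,f.g\n')
out_file = io.StringIO()

reader = csv.reader(in_file)
writer = csv.writer(out_file)
for row in reader:
    # re.sub works on each string item, not on the row list itself
    newrow = [re.sub(r'(\.)+', ',', item) for item in row]
    writer.writerow(newrow)

result = out_file.getvalue()
```

Note that fields which now contain commas get quoted by the writer, which keeps the output valid CSV.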
for row in reader does not return a single element to parse; rather, it returns a list of the elements in that row, so you have to unpack that list and process each item individually, just as @Trii showed you:
[re.sub(r'(\.)+', ',', s) for s in row]
In this case, we are using glob to access all the csv files in the directory.
The code below overwrites the source csv file, so there is no need to create an output file.
NOTE:
If you want to get a second file with the parameters provided to re.sub, replace write = open(i, 'w') with write = open('secondFile.csv', 'w')
import re
import glob

for i in glob.glob("*.csv"):
    read = open(i, 'r')
    reader = read.read()
    csvRe = re.sub(r"(\.)+", ",", reader)
    write = open(i, 'w')
    write.write(csvRe)
    read.close()
    write.close()

Removing last row in csv

I'm trying to remove the last row in a csv but I getting an error: _csv.Error: string with NUL byte
This is what I have so far:
dcsv = open('PnL.csv', 'a+r+b')
cWriter = csv.writer(dcsv, delimiter=' ')
cReader = csv.reader(dcsv)

for row in cReader:
    cWriter.writerow(row[:-1])
I can't figure out why I keep getting errors.
I would just read in the whole file with readlines(), pop out the last row, and then write that with csv module
import csv

f = open("summary.csv", "r+")
lines = f.readlines()
lines = lines[:-1]
f.seek(0)      # go back to the start before rewriting
f.truncate()   # drop the old contents
cWriter = csv.writer(f, delimiter=',')
for line in lines:
    # each line is a string; split it back into fields for writerow()
    cWriter.writerow(line.strip().split(','))
f.close()
This should work
f = open('Pnl.csv', "r+")
lines = f.readlines()
f.close()

lines.pop()

f = open('Pnl.csv', "w+")
f.writelines(lines)
f.close()
I'm not sure what you're doing with the 'a+r+b' file mode and reading and writing to the same file, so won't provide a complete code snippet, but here's a simple method to skip any lines that contains a NUL byte in them in a file you're reading, whether it's the last, first, or one in the middle being read.
The trick is to realize that the docs say the csvfile argument to csv.reader() "can be any object which supports the iterator protocol and returns a string each time its next() method is called." This means that you can replace the file argument in the call with a simple filter iterator function defined this way:
def filter_nul_byte_lines(a_file):
    for line in a_file:
        if '\x00' not in line:
            yield line
and use it in a way similar to this:
dcsv = open('Pnl.csv', newline='')
cReader = csv.reader(filter_nul_byte_lines(dcsv))
for row in cReader:
    print(row)
This will cause any lines with a NUL byte in them to be ignored while reading the file. Also this technique works on-the-fly as each line is read, so it does not require reading the entire file into memory at once or preprocessing it ahead of time.
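Because csv.reader only needs an iterable of strings, the same filter works on any line source, which makes it easy to check in isolation (a list of lines stands in for the open file handle here):

```python
import csv

def filter_nul_byte_lines(a_file):
    # Skip any line that contains a NUL byte.
    for line in a_file:
        if '\x00' not in line:
            yield line

# Made-up lines standing in for a file object; the middle one is corrupted.
lines = ['a,b\n', 'bad\x00row\n', 'c,d\n']
rows = list(csv.reader(filter_nul_byte_lines(lines)))
```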
