I have a folder with several CSV files. These files all contain the box drawing double vertical and horizontal as the delimiter. I am trying to import all these files into python, change that delimiter to a pipe, and then save the new files into another location. The code I currently have runs without any errors but doesn't actually do anything. Any suggestions?
import os
import pandas as pd
directory = 'Y:/Data'
dirlist = os.listdir(directory)
file_dict = {}
x = 0
for filename in dirlist:
if filename.endswith('.csv'):
file_dict[x] = pd.read_csv(filename)
column = file_dict[x].columns[0]
file_dict[x] = file_dict[x][column].str.replace('╬', '|')
file_dict[x].to_csv("python/file{}.csv".format(x))
x += 1
Here is a picture of sample data:
Instead of directly replacing occurrences with the new character (which may replace escaped occurrences of the character as well), we can just use built-in functionality in the csv library to read the file for us, and then write it again
import csv
with open('myfile.csv', newline='') as infile, open('outfile.csv', 'w', newline='') as outfile:
reader = csv.reader(infile, delimiter='╬')
writer = csv.writer(outfile, delimiter='|')
for row in reader:
writer.writerow(row)
Adapted from the docs
with i as open(filename):
with o as open(filename+'.new', 'w+):
for line in i.readlines():
o.write(line.replace('╬', '|'))
or, skip the python, and use sed from your terminal:
$ sed -i 's/╬/|/g' *.csv
Assuming the original delimiter doesn't appear in any escaped strings, this should be slightly faster than using the regular csv module. Panada seems to do some filesystem voodoo when reading CSVs, so I wouldn't be too surprised if it is just as fast. sed will almost certainly beat them both by far.
Related
Instead of manually convert csv file to text tab delimited file using excel software
I would like to automate this process using Python.
However, using the following code
with open('endnote_csv.csv', 'r') as fin:
with open('endnote_deliminated.txt', 'w', newline='') as fout:
reader = csv.DictReader(fin, delimiter=',')
writer = csv.DictWriter(fout, reader.fieldnames, delimiter='|')
writer.writeheader()
writer.writerows(reader)
Return an error of
ValueError: dict contains fields not in fieldnames: None
May I know where did I do wrong,
The csv file is accessible via the following link
Thanks in advance for any insight.
You can use the Python package called pandas to do this:
import pandas as pd
fname = 'endnote_csv'
pd.read_csv(f'{fname}.csv').to_csv(f'{fname}.tsv', sep='\t', index=False)
Here's how it works:
pd.read_csv(fname) - reads a CSV file and stores it as a pd.DataFrame object (not important for this example)
.to_csv(fname) - writes a pd.DataFrame to a CSV file given by fname
sep='\t' - replaces the ',' used in CSVs with a tab character
index=False - use this to remove the row numbers
If you want to be a bit more advanced and use the command line only, you can do this:
# csv-to-tsv.py
import sys
import pandas as pd
fnames = sys.argv[1:]
for fname in fnames:
main_name = '.'.join(fname.split('.')[:-1])
pd.read_csv(f'{main_name}.csv').to_csv(f'{main_name}.tsv', sep='\t', index=False)
This will allow you to run a command like this from the command line and change all .csv files to .tsv files in one go:
python csv-to-tsv.py *.csv
It is erroring out on comma seperated author names. It appears that columns in the underline rows exceeds number of headers.
I have a bunch of folders and sub-folders with CSVs that have quotation marks that I need to get rid of, so I'm trying to build a script that iterates through and performs the operation on all CSVs.
Below is the code I have.
It correctly identifies what is and is not a CSV. And it re-writes them all -- but it's writing blank data in -- and not the row data without the quotation marks.
I know that this is happening around lines 14-19 but I don't know know what to do.
import csv
import os
rootDir = '.'
for dirName, subDirList, fileList in os.walk(rootDir):
print('Found directory: %s' % dirName)
for fname in fileList:
# Check if it's a .csv first
if fname.endswith('.csv'):
input = csv.reader(open(fname, 'r'))
output = open(fname, 'w')
with output:
writer = csv.writer(output)
for row in input:
writer.writerow(row)
# Skip if not a .csv
else:
print 'Not a .csv!!'
The problem is here:
input = csv.reader(open(fname, 'r'))
output = open(fname, 'w')
As soon as you do that second open in 'w' mode, it erases the file. So, your input is looping over an empty file.
One way to fix this is to you read the whole file into memory, and only then erase the whole file and rewrite it:
input = csv.reader(open(fname, 'r'))
contents = list(input)
output = open(fname, 'w')
with output:
writer = csv.writer(output)
for row in contents:
writer.writerow(row)
You can simplify this quite a bit:
with open(fname, 'r') as infile:
contents = list(csv.reader(infile))
with open(fname, 'w') as outfile:
csv.writer(outfile).writerows(contents)
Alternatively, you can write to a temporary file as you go, and then move the temporary file on top of the original file. This is a bit more complicated, but it has a major advantage—if you have an error (or someone turns off the computer) in the middle of writing, you still have the old file and can start over, instead of having 43% of the new file and all your data is lost:
dname = os.path.dirname(fname)
with open(fname, 'r') as infile, tempfile.NamedTemporaryFile('w', dir=dname, delete=False) as outfile:
writer = csv.writer(outfile)
for row in csv.reader(infile):
writer.writerow(row)
os.replace(outfile.name, fname)
If you're not using Python 3.3+, you don't have os.replace. On Unix, you can just use os.rename instead, but on Windows… it's a pain to get this right, and you probably want to look for a third-party library on PyPI. (I haven't used any of then, buy if you're using Windows XP/2003 or later and Python 2.6/3.2 or later, pyosreplace looks promising.)
I have a large CSV file (~250000 rows) and before I work on fully parsing and sorting it I was trying to display only a part of it by writing it to a text file.
csvfile = open(file_path, "rb")
rows = csvfile.readlines()
text_file = open("output.txt", "w")
row_num = 0
while row_num < 20:
text_file.write(", ".join(row[row_num]))
row_num += 1
text_file.close()
I want to iterate through the CSV file and write only a small section of it to a text file so I can look at how it does this and see if it would be of any use to me. Currently the text file ends up empty.
A way I thought might do this would be to iterate through the file with a for loop that exits after a certain number of iteration but I could be wrong and I'm not sure how to do this, any ideas?
There's nothing specifically wrong with what you're doing, but it's not particularly Pythonic. In particular reading the whole file into memory with readlines() at the start seems pointless if you're only using 20 lines.
Instead you could use a for loop with enumerate and break when necessary.
csvfile = open(file_path, "rb")
text_file = open("output.txt", "w")
for i, row in enumerate(csvfile):
text_file.write(row)
if row_num >= 20:
break
text_file.close()
You could further improve this by using with blocks to open the files, rather than closing them explicitly. For example:
with open(file_path, "rb") as csvfile:
#your code here involving csvfile
#now the csvfile is closed!
Also note that Python might not be the best tool for this - you could do it directly from Bash, for example, with just head -n20 csvfile.csv > output.txt.
A simple solution would be to just do :
#!/usr/bin/python
# -*- encoding: utf-8 -*-
file_path = './test.csv'
with open(file_path, 'rb') as csvfile:
with open('output.txt', 'wb') as textfile:
for i, row in enumerate(csvfile):
textfile.write(row)
if i >= 20:
break
Explanation :
with open(file_path, 'rb') as csvfile:
with open('output.txt', 'wb') as textfile:
Instead of using open and close, it is recommended to use this line instead. Just write the lines that you want to execute when your file is opened into a new level of indentation.
'rb' and 'wb' are the keywords you need to open a file in respectively 'reading' and 'writing' in 'binary mode'
for i, row in enumerate(csvfile):
This line allows you to read line by line your CSV file, and using a tuple (i, row) gives you both the content of the row and its index. That's one of the awesome built-in functions from Python : check out here for more about it.
Hope this helps !
EDIT : Note that Python has a CSV package that can do that without enumerate :
# -*- encoding: utf-8 -*-
import csv
file_path = './test.csv'
with open(file_path, 'rb') as csvfile:
reader = csv.reader(csvfile)
with open('output.txt', 'wb') as textfile:
writer = csv.writer(textfile)
i = 0
while i<20:
row = next(reader)
writer.writerow(row)
i += 1
All we need to use is its reader and writer. They have functions next (that reads one line) and writerow (that writes one). Note that here, the variable row is not a string, but a list of strings, because the function does the split job by itself. It might be faster than the previous solution.
Also, this has the major advantage of allowing you to look anywhere you want in the file, no necessarily from the beginning (just change the bounds for i)
I am currently trying to write a csv file in python. The format is as following:
1; 2.51; 12
123; 2.414; 142
EDIT: I already get the above format in my CSV, so the python code seems ok. It appears to be an excel issue which is olved by changing the settigs as #chucksmash mentioned.
However, when I try to open the generated csv file with excel, it doesn't recognize decimal separators. 2.414 is treated as 2414 in excel.
csvfile = open('C:/Users/SUUSER/JRITraffic/Data/data.csv', 'wb')
writer = csv.writer(csvfile, delimiter=";")
writer.writerow(some_array_with_floats)
Did you check that the csv file is generated correctly as you want? Also, try to specify the delimeter character that your using for the csv file when you import/open your file. In this case, it is a semicolon.
For python 3, I think your above code will also run into a TypeError, which may be part of the problem.
I just made a modification with your open method to be 'w' instead of 'wb' since the array has float and not binary data. This seemed to generate the result that you were looking for.
csvfile = open('C:/Users/SUUSER/JRITraffic/Data/data.csv', 'w')
An ugly solution, if you really want to use ; as the separator:
import csv
import os
with open('a.csv', 'wb') as csvfile:
csvfile.write('sep=;'+ os.linesep) # new line
writer = csv.writer(csvfile, delimiter=";")
writer.writerow([1, 2.51, 12])
writer.writerow([123, 2.414, 142])
This will produce:
sep=;
1;2.51;12
123;2.414;142
which is recognized fine by Excel.
I personally would go with , as the separator in which case you do not need the first line, so you can basically:
import csv
with open('a.csv', 'wb') as csvfile:
writer = csv.writer(csvfile) # default delimiter is `,`
writer.writerow([1, 2.51, 12])
writer.writerow([123, 2.414, 142])
And excel will recognize what is going on.
A way to do this is to specify dialect=csv.excel in the writer. For example:
a = [[1, 2.51, 12],[123, 2.414, 142]]
csvfile = open('data.csv', 'wb')
writer = csv.writer(csvfile, delimiter=";", dialect=csv.excel)
writer.writerows(a)
csvfile.close()
Unless Excel is already configured to use semicolon as its default delimiter, it will be necessary to import data.csv using Data/FromText and specify semicolon as the delimiter in the Text Import Wizard step 2 screen.
Very little documentation is provided for the Dialect class at csv.Dialect. More information about it is at Dialects in the PyMOTW's "csv – Comma-separated value files" article on the Python csv module. More information about csv.writer() is available at https://docs.python.org/2/library/csv.html#csv.writer.
I am trying to automate a process where in a specific folder, there are multiple text files following the same data format/structure. In the text files, the data is separated by a comma. I want to be able to output all of these text files into one cumulative csv file. This is what I currently have, and seem to be stuck where I am because of my lack of python knowledge.
from collections import defaultdict
import glob
def get_site_files():
sites = defaultdict(list)
for fname in glob.glob('*.txt'):
csv_out = csv.writer(open('out.csv', 'w'), delimiter=',')
f = open('myfile.txt')
for line in f:
vals = line.split(',')
csv_out.writerow()
f.close()
EDIT: bringing up comments: I want to make sure that all of the text files are read, not just only myfile.txt.
Also, if I could combine them all into one large .txt file and then I could make those into a csv that would be great too, I just am not sure the exact way to do this.
Just a little bit of reordering of your code.
import csv
import glob
def get_site_files():
with open('out.csv', 'w') as out_file:
csv_out = csv.writer(out_file, delimiter=',')
for fname in glob.glob('*.txt'):
with open(fname) as f:
for line in f:
vals = line.split(',')
csv_out.writerow(vals)
get_site_files()
But since they are all in the same format you can just concatenate them:
import glob
with ('out.csv', 'w') as fout:
for fname in glob.glob('*.txt'):
with open(fname, 'r') as fin:
fout.write(fin.read())
You could also try a different way:
I used os.listdir() once. That gives you a List of all the files in your directory. In combination with os.path.join you can manage all *.csv files in a certain directory.
Some additional Information can be found in the reference: os and os.path
So I would just loop through all the files in the directory (searching for them ending on ".csv"), for each of them, store each line in a list as a string, separate the strings by the colums delimiter, make "," to "." in the left strings and concatenate the strings again. Afterwards push each line of the list to the outputfile you wish to use
I highly recommend the python standard library for information about the total functionality of python to newbies ;)
Hope that helps ;)
I adapted the code above to covert text files to csv and get working code to convert all csv files in a folder to one text files appending all csv files. Works great.
import glob
import csv
def get_site_files():
with open('out.txt', 'w') as out_file:
csv_out = csv.writer(out_file, delimiter=',')
for fname in glob.glob('*.csv'):
with open(fname) as f:
for line in f:
vals = line.split(',')
csv_out.writerow(vals)enter code here