I've got a csv file with a matrix of data. When the initial csv file is opened in Notepad it looks like this:
"AAA,15.0"
"BBB,45.0"
"CCC,60.0"
I then want to process this data, adding another column to get something formatted as follows:
"AAA,15.0,50.0"
"BBB,45.0,30.0"
"CCC,60.0,20.0"
So...
I open the original file into Python using:
with open(FilePath + "/XXX.csv", 'rt') as csvfile:
    NewData = list(csv.reader(csvfile, delimiter=';'))
print(NewData)
The first time I do this the code produces a list of strings (which I'm actually happy about - I want this format)...
['AAA,15.0,50.0', 'BBB,45.0,30.0', 'CCC,60.0,20.0']
But then next time I try to add a column I end up with....
[['AAA,15.0,50.0'], ['BBB,45.0,30.0'], ['CCC,60.0,20.0']]
So, each time my code runs it adds an additional layer of 'listing'.
What do I need to do to keep the initial formatting of a list of strings? I imagine it's because I'm opening the file with the list() command. What should I be using?
Further details, as requested. Distilling this further, my code is:
import csv

FilePathSB = "C:/Users/"
with open(FilePathSB + "/Master.csv", 'rt') as csvfile:
    xMatrix = list(csv.reader(csvfile, delimiter=';'))

#### Do something to the data, like add another column of numbers

# Save as the same file
with open(FilePathSB + "/Master.csv", "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in xMatrix:
        writer.writerow([val])
Note that there is some data manipulation that occurs while the file is open but this doesn't impact the problem I have so I've left the code out.
Opening the file and then saving it adds a layer of 'listing' each time the code runs. I'd like the formatting to remain unchanged (i.e. despite being opened and resaved, the data format should stay the same as the initial matrix shown below).
So the first time the code runs it opens the initial csv data of:
"AAA,24:17"
"BBB,21:18"
"CCC,16:40"
and changes the format, saving it as:
"['AAA,24:17']"
"['BBB,21:18']"
"['CCC,16:40']"
If I run the code again it takes this data and changes it to:
"[""['AAA,24:17']""]"
"[""['BBB,21:18']""]"
"[""['CCC,16:40']""]"
And if I run it again I end up with:
"['[""[\'AAA,24:17\']""]']"
"['[""[\'BBB,21:18\']""]']"
"['[""[\'CCC,16:40\']""]']"
The csv reader parses your file row by row and returns a list for each one.
If we had a file like:
header1|header2
1| A
2| B
when we parsed this csv file using the "|" character as a delimiter, we'd get:
[['header1', 'header2'], ['1', 'A'], ['2', 'B']]
This is exactly what we're supposed to expect in this case. However, if we parsed it with some other character as the delimiter, we'd instead get:
[['header1|header2'], ['1| A'], ['2| B']]
This is what you're doing, because your csv reader is primed to expect a delimiter of ";", while your actual csv has (apparently) a delimiter of ",".
After reading your csv in using reader, you'll have a list of lists, where each inner list represents a row. Think of it as looking like this:
[
row1,
row2,
row3
]
where each row looks like:
[cell1, cell2, cell3]
If you want to add a new column to each row, you'll have to iterate over all the rows and use the list's .append() method to add the new column:

for current_row in rows:
    # use current_row here
    current_row.append('new_value')
Finally, you can use csv.writer to write your rows to another file; see its writerows() method.
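Putting those steps together, a minimal sketch might look like this (the file name and the appended values are invented for illustration):

```python
import csv

# Build a small sample input file (a stand-in for the real Master.csv)
with open('sample.csv', 'w', newline='') as f:
    f.write('AAA,15.0\nBBB,45.0\nCCC,60.0\n')

# Read with the delimiter the file actually uses: ","
with open('sample.csv', newline='') as f:
    rows = list(csv.reader(f, delimiter=','))

# Each row is now a list of cells; append a new column to each row
new_values = ['50.0', '30.0', '20.0']
for row, value in zip(rows, new_values):
    row.append(value)

# Write the rows back: writerows(rows), not writerow([row]),
# so each row stays a flat list of cells
with open('sample.csv', 'w', newline='') as f:
    csv.writer(f, lineterminator='\n').writerows(rows)
```

Because each row is written as a flat list, rereading the file yields the same list-of-lists shape, with no extra layer of nesting.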
Related
At the moment, I know how to write a list to one row of a csv, but when there are multiple rows, they are still written on one row. What I would like is for the first list to go on the first row of the csv, the second list on the second row, and so on.
Code:
for i in range(10):
    final = [i*1, i*2, i*3]
    with open('0514test.csv', 'a') as file:
        file.write(','.join(str(x) for x in final))
You may want to add a line break at the end of every row. You can do so with, for example:

file.write('\n')
The csv module from standard library provides objects to read and write CSV files. In your case, you could do:
import csv

for i in range(10):
    final = [i*1, i*2, i*3]
    with open("0514test.csv", "a", newline="") as file:
        writer = csv.writer(file)
        writer.writerow(final)
Using this module is often safer in real-life situations because it takes care of all the CSV machinery (quoting with " or ', handling cases in which your separator is also present in your data, etc.).
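For instance, a field that contains the separator is quoted by the writer on the way out and restored by the reader on the way back in:

```python
import csv
import io

# Write a row where the second field contains both a comma and quotes
buf = io.StringIO()
csv.writer(buf).writerow(['name', 'says "hi", loudly'])

# Read it back: the csv module undoes the quoting automatically
buf.seek(0)
row = next(csv.reader(buf))
```

A naive `','.join(...)` followed by `split(',')` would mangle that field, which is exactly the machinery the csv module handles for you.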
The Problem
I have a CSV file that contains a large number of items.
The first column can contain either an IP address or random garbage. The only other column I care about is the fourth one.
I have written the below snippet of code in an attempt to check if the first column is an IP address and, if so, write that and the contents of the fourth column to another CSV file side by side.
with open('results.csv', 'r') as csvresults:
    filecontent = csv.reader(csvresults)
    output = open('formatted_results.csv', 'w')
    processedcontent = csv.writer(output)
    for row in filecontent:
        first = str(row[0])
        fourth = str(row[3])
        if re.match(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', first) is not None:
            processedcontent.writerow(["{},{}".format(first, fourth)])
output.close()
This works to an extent. However, when viewing in Excel, both items are placed in a single cell rather than two adjacent ones. If I open it in notepad I can see that each line is wrapped in quotation marks. If these are removed Excel will display the columns properly.
Example Input
1.2.3.4,rubbish1,rubbish2,reallyimportantdata
Desired Output
1.2.3.4 reallyimportantdata - two separate columns
Actual Output
"1.2.3.4,reallyimportantdata" - single column
The Question
Is there any way to fudge the format part to not write out with quotations? Alternatively, what would be the best way to achieve what I'm trying to do?
I've tried writing out to another file and stripping the lines but, despite not throwing any errors, the result was the same...
writerow() takes a list of elements and writes each of those into a column. Since you are feeding a list with only one element, it is being placed into one column.
Instead, feed writerow() a list with one element per column:
processedcontent.writerow([first,fourth])
Have you considered using Pandas?
import re
import pandas as pd

df = pd.read_csv("myFile.csv", header=0, low_memory=False, index_col=None)
fid = open("outputp.csv", "w")
for index, row in df.iterrows():
    aa = re.match(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$", row['IP'])
    if aa:
        tline = '{0},{1}\n'.format(row['IP'], row['fourth column'])
        fid.write(tline)
fid.close()
There may be an error or two, and I got the regex from another answer.
This assumes the first row of the csv has titles which can be referenced. If it does not, you can use header=None and reference the columns with iloc.
Come to think of it, you could probably run the regex on the DataFrame itself, copy the first and fourth columns to a new DataFrame, and use pandas' to_csv method.
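That vectorized approach might look something like this sketch (the DataFrame and its column names are invented for illustration):

```python
import pandas as pd

# Hypothetical frame standing in for the parsed CSV
df = pd.DataFrame({
    'IP': ['1.2.3.4', 'garbage'],
    'second': ['r1', 'r1'],
    'third': ['r2', 'r2'],
    'fourth column': ['important', 'ignored'],
})

# Vectorized regex match on the first column, no explicit loop needed
mask = df['IP'].str.match(r'^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$')

# Keep only the matching rows and the two wanted columns, then write out
df.loc[mask, ['IP', 'fourth column']].to_csv('outputp.csv', index=False, header=False)
```

For a large file this avoids the per-row Python overhead of iterrows.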
I'm trying to use Python to manipulate data in large txt files.
I have a txt-file with more than 2000 columns, and about a third of these have a title which contains the word 'Net'. I want to extract only these columns and write them to a new txt file. Any suggestion on how I can do that?
I have searched around a bit but haven't been able to find something that helps me. Apologies if similar questions have been asked and solved before.
EDIT 1: Thank you all! At the moment of writing 3 users have suggested solutions and they all work really well. I honestly didn't think people would answer so I didn't check for a day or two, and was happily surprised by this. I'm very impressed.
EDIT 2: I've added a picture that shows what a part of the original txt-file can look like, in case it will help anyone in the future:
One way of doing this, without the installation of third-party modules like numpy/pandas, is as follows. Given an input file, called "input.csv" like this:
a,b,c_net,d,e_net
0,0,1,0,1
0,0,1,0,1
The following code does what you want.
import csv
input_filename = 'input.csv'
output_filename = 'output.csv'
# Instantiate a CSV reader; check that you have the appropriate delimiter
reader = csv.reader(open(input_filename, newline=''), delimiter=',')
# Get the first row (assuming this row contains the header)
input_header = next(reader)
# Record the indices of the columns you want to keep
columns_to_keep = []
for i, name in enumerate(input_header):
    if 'net' in name.lower():  # case-insensitive, so 'Net' matches too
        columns_to_keep.append(i)
# Create a CSV writer to store the columns you want to keep
writer = csv.writer(open(output_filename, 'w', newline=''), delimiter=',')
# Construct the header of the output file
output_header = []
for column_index in columns_to_keep:
    output_header.append(input_header[column_index])
# Write the header to the output file
writer.writerow(output_header)
# Iterate over the remainder of the input file, construct a row with
# only the columns you want to keep, and write it to the output file
for row in reader:
    new_row = []
    for column_index in columns_to_keep:
        new_row.append(row[column_index])
    writer.writerow(new_row)
Note that there is no error handling. There are at least two cases that should be handled. The first is checking for the existence of the input file (hint: look at the functionality provided by the os and os.path modules). The second is handling blank lines or lines with an inconsistent number of columns.
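Both checks might be sketched like this (the file name and sample data are placeholders):

```python
import csv
import os.path

input_filename = 'input.csv'

# Sample data standing in for the real file (note the blank and short lines)
with open(input_filename, 'w', newline='') as f:
    f.write('a,b,c_net\n1,2,3\n\n4,5\n6,7,8\n')

# First: check that the input file exists before reading it
assert os.path.exists(input_filename)

# Second: skip blank lines and rows with an inconsistent column count
with open(input_filename, newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = [row for row in reader if len(row) == len(header)]
```

The blank line parses as an empty list and the short line has two cells, so both are filtered out by the length check against the header.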
This could be done for instance with Pandas,
import pandas as pd
df = pd.read_csv('path_to_file.txt', sep=r'\s+')
print(df.columns) # check that the columns are parsed correctly
selected_columns = [col for col in df.columns if "net" in col]
df_filtered = df[selected_columns]
df_filtered.to_csv('new_file.txt')
Of course, since we don't have the structure of your text file, you will have to adapt the arguments of read_csv to make this work in your case (see the corresponding documentation).
This will load all the file in memory and then filter out the unnecessary columns. If your file is so large that it cannot be loaded in RAM at once, there is a way to load only specific columns with the usecols argument.
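One way to use usecols here might be to read only the header row first, then load just the matching columns (the file name and data below are invented):

```python
import pandas as pd

# Hypothetical whitespace-separated input standing in for the large txt file
with open('data.txt', 'w') as f:
    f.write('a b c_net d e_net\n0 0 1 0 1\n0 0 2 0 2\n')

# Read only the header row (nrows=0) to discover the wanted column names
header = pd.read_csv('data.txt', sep=r'\s+', nrows=0).columns
wanted = [col for col in header if 'net' in col]

# Then load just those columns; the other ~2000 never enter memory
df = pd.read_csv('data.txt', sep=r'\s+', usecols=wanted)
```

This keeps the memory footprint proportional to the kept columns rather than the full file width.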
You can use pandas' filter function to select columns whose names match a regex:
data_filtered = data.filter(regex='net')
I have a quite large csv file, about 400,000 lines, like:
54.10,14.20,34.11
52.10,22.20,22.11
49.20,17.30,29.11
48.40,22.50,58.11
51.30,19.40,13.11
and a second one of about 250,000 lines with updated data for the third column; the first and second columns are the reference for the update:
52.10,22.20,22.15
49.20,17.30,29.15
48.40,22.50,58.15
I would like to build third file like:
54.10,14.20,34.11
52.10,22.20,22.15
49.20,17.30,29.15
48.40,22.50,58.15
51.30,19.40,13.11
It has to contain all the data from the first file, except that rows which also appear in the second file take their third-column value from there.
Suggest you look at pandas' merge functions. You should be able to do what you want; it will also handle reading the data from CSV (creating the DataFrames that you will merge).
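A hedged sketch of that merge-based approach, using invented column names and the sample values from the question:

```python
import pandas as pd

cols = ['x', 'y', 'z']
base = pd.DataFrame([[54.10, 14.20, 34.11],
                     [52.10, 22.20, 22.11],
                     [51.30, 19.40, 13.11]], columns=cols)
updates = pd.DataFrame([[52.10, 22.20, 22.15]], columns=cols)

# Left-merge on the first two columns; rows without a match get NaN in z_new
merged = base.merge(updates, on=['x', 'y'], how='left', suffixes=('', '_new'))

# Prefer the updated z where one exists, otherwise keep the original
merged['z'] = merged['z_new'].fillna(merged['z'])
result = merged[cols]
```

In practice base and updates would come from pd.read_csv on the two files, and result would go to to_csv.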
A stdlib solution with just the csv module; the second file is read into memory (into a dictionary):
import csv

with open('file2.csv', newline='') as updates_fh:
    updates = {tuple(r[:2]): r for r in csv.reader(updates_fh)}

with open('file1.csv', newline='') as infh, open('output.csv', 'w', newline='') as outfh:
    writer = csv.writer(outfh)
    writer.writerows(updates.get(tuple(r[:2]), r) for r in csv.reader(infh))
The first with statement opens the second file and builds a dictionary keyed on the first two columns. It is assumed that these are unique in the file.
The second block then opens the first file for reading, the output file for writing, and writes each row from the inputfile to the output file, replacing any row present in the updates dictionary with the updated version.
I have a massive CSV file on which I'd like to do some calculations on several fields and output the result to another CSV file.
Let's imagine that I have 12 fields on my file1.csv.
Here is my sample code :
import csv

file1 = csv.reader(open('file1.csv', newline=''), delimiter=';')  # traffic
for record in file1:
    print(record[0], int(record[1]) * int(record[4]))
Now I would like to save these rows to a new csv file, but I got stuck there. The writerow() method only accepts a whole row, not an expression like the one in my for loop. Any suggestions?
writerow takes an iterable, so you can easily compose a new row by building a list:

import csv

# Open the output file for writing and wrap it in a writer
# (the output file name here is a placeholder)
with open('new_file.csv', 'w', newline='') as out_f:
    new_csv_writer = csv.writer(out_f, delimiter=';')
    for record in file1:
        new_csv_writer.writerow([record[0], int(record[1]) * int(record[4])])

The above will write rows with 2 columns each.