This is maybe a very basic question, but let's suppose one has a csv file which looks as follows:
a,a,a,a
b,b,b,b
c,c,c,c
d,d,d,d
e,e,e,e
And I am interested in deleting row[1] and row[3] and writing a new file that does not contain those rows. What would be the best way to do this? Since the csv module is already loaded in my code, I'd like to know how to do it within that scheme. I'd be glad if somebody could help me with this.
Since each row is on a separate line (assuming there are no newlines within the data items of the rows themselves), you can do this by simply copying the file line by line and skipping any rows you don't want kept. Since I'm unsure whether you number rows starting from zero or one, I've added a symbolic constant at the beginning to control it. You could, of course, hardcode it, as well as ROWS_TO_DELETE, directly into the code.
Regardless, this approach will be faster than using, for example, the csv module, because it avoids all the unnecessary parsing and reformatting of the data that the module has to do.
FIRST_ROW_NUM = 1  # or 0
ROWS_TO_DELETE = {1, 3}

with open('infile.csv', 'rt') as infile, open('outfile.csv', 'wt') as outfile:
    outfile.writelines(row for row_num, row in enumerate(infile, FIRST_ROW_NUM)
                           if row_num not in ROWS_TO_DELETE)
Resulting output file's contents:
b,b,b,b
d,d,d,d
e,e,e,e
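If you would rather stay within the csv module, as the question mentions, an equivalent sketch (slower, but convenient if the rows are being parsed anyway) could look like this:

import csv

FIRST_ROW_NUM = 1  # or 0
ROWS_TO_DELETE = {1, 3}

with open('infile.csv', newline='') as infile, \
     open('outfile.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    # Number the parsed rows the same way as above and skip the unwanted ones
    for row_num, row in enumerate(csv.reader(infile), FIRST_ROW_NUM):
        if row_num not in ROWS_TO_DELETE:
            writer.writerow(row)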
I have a csv file that doesn't delimit (see the screenshot of the csv file). This means that all the data stays in row[0] and does not divide into 6 columns. Does anybody know how to solve this issue?
import csv

n=1048576
id=[]*n
a=[]*n
date=[]*n
b=[]*n
c=[]*n

with open('C:\\Users\\andsc\\data_1.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        id[line_count] = row[0]
        a[line_count] = row[1]
        date[line_count] = row[2]
        b[line_count] = row[3]
        c[line_count] = row[4]
        line_count += 1
You appear to be using a non-US version of Excel. In locales where the comma is used as a decimal separator, Excel expects the semicolon as the column delimiter:
csv_reader = csv.reader(csv_file, delimiter=';')
Firstly, don't do this:
id=[]*n
a=[]*n
...etc...
What you are trying to do is emulate a fixed-length array. That won't work. As you will see if you do this at the command prompt:
>>> [] * 9
[]
This is because the * really is a multiply, and just as [1] * 3 gives [1, 1, 1] (three repetitions of the list [1]), doing [] * 9 gives 9 repetitions of the empty list, which is just as empty as one repetition.
Instead, create empty lists:
id=[]
a=[]
...etc...
Then, in your loop, do not index into these lists; append() new values to them instead:
id.append(row[0])
a.append(row[1])
...etc...
That means you don't need to keep track of line_count, and even if you do need it, use the reader's provided line_num attribute (csv_reader.line_num) instead.
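Putting both fixes together, a corrected sketch of the question's loop might look like this. I've kept the comma delimiter; if a text editor shows semicolons instead, pass delimiter=';' as in the accepted answer. I've also renamed id to ids so it does not shadow the built-in id():

import csv

ids = []
a = []
date = []
b = []
c = []

with open('C:\\Users\\andsc\\data_1.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        # append() grows each list one value per row
        ids.append(row[0])
        a.append(row[1])
        date.append(row[2])
        b.append(row[3])
        c.append(row[4])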
Using Excel screenshots to look at a CSV is often misleading. It is clear that your version of Excel expects the delimiter of the CSV to be a semicolon, not a comma, which is why the data all ends up in one column. To be 100% sure of what is in the file, open it in a text editor like Notepad or Notepad++. That avoids Excel's aggressive type coercion, which changes anything that looks like a date, or a hexadecimal string, into a number. And above all, do not save the CSV back from Excel and assume the file is still as expected.
It is clear that the code you presented will not run. It will get an IndexError the first time through the loop. You have to fix the code before it will run, and when you do that you will see that Python really does respect the comma as delimiter.
But opening the input file in Excel has given you a mistaken idea of where the problem is. You are quite right to say that comma is clearly the intended delimiter in the file. But when you open a CSV in Excel, Excel uses your system decimal and delimiter settings, which for European installations of Windows and macOS are usually , and ;.
Excel is not bright enough to figure out on its own that those settings are inappropriate for a given file; it needs help from you. You can change Excel's File | Open behaviour by altering your system settings, but if you change the delimiter to , you will have to change the decimal point to . (for every single application, not just Excel) and it is unlikely you would want to do that.
The workaround is to set it manually for a particular file, by importing the CSV instead of simply opening it. On the Data tab select From Text/CSV and Excel will then try to guess the settings from the first 2000 rows. If it guesses wrong you have the opportunity to fix it.
But getting Excel to display the file as you expect has nothing to do with the way Python is reading it.
I need to manipulate a csv file so that the script goes into the file, looks for blank fields between c0 and c5 in my example csv file, and wherever there is a blank, replaces it with any verbiage I want, like "not found".
The only code I have so far drops a column I do not need, but for the manipulation I need I really cannot find anything. Maybe it is not possible?
Also, I am wondering how to change a column name. Thanks.
#!/bin/env python
import pandas

data = pandas.read_csv('report.csv')
data = data.drop(['date'], axis=1)
data.to_csv('final_report.csv')
Alternatively, and taking your comment question into account (if you do not necessarily want to use pandas as in n1colas.m's answer), use string replacements and simply loop over your file:
with open("modified_file.csv","w") as of:
with open("report.csv", "r") as inf:
for line in inf:
if "#" not in line: # in the case your csv file has a comment marker somewhere and it is called #, the line is skipped, which means you get a clean comma separated value file as the outfile- if you do want to keep such lines simply remove the if condition
mystring=line.replace(", ,","not_found").replace("data","input") # in case it is not only one blank space you can also use the regex for n times blank space here
print(mystring, file=of, end=""); # prints the replaced line to outfile and writes no newline
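If the blank is not always exactly one space, the regex mentioned in the comment could look like this. This is a sketch using lookarounds; it only covers blank fields between two commas, not at the start or end of a line:

import re

# Drop-in replacement for the first .replace() in the loop above: any field
# that is only whitespace between two commas becomes 'not_found'. The
# zero-width lookarounds keep adjacent blank fields from swallowing each other.
mystring = re.sub(r"(?<=,)\s*(?=,)", "not_found", line).replace("data", "input")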
I know this is not the most efficient way to do it, but it is probably the one where you can most easily understand what you are doing and modify it to your heart's desire.
For any reasonably sized csv file it should still work nearly instantaneously.
Also, for testing purposes, always use a separate file (of) for such replacements instead of writing to your infile as your question seems to suggest. Check that it did what you wanted. ONLY THEN overwrite your infile. This may seem unnecessary at first, but mistakes happen...
You have to use this line:
data['data'] = data['data'].fillna("not found")
Here is the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
And here is an example:
import pandas
data = pandas.read_csv('final_report.csv')
data.info()
data['data'] = data['data'].fillna("Something")
print(data)
I would suggest changing the data variable name to something different, because your column has the same name and that can be confusing.
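To rename the column itself (the other part of the question), pandas provides DataFrame.rename. A minimal sketch, where 'new_name' and the output file name are placeholders for whatever you want:

import pandas

data = pandas.read_csv('final_report.csv')
data['data'] = data['data'].fillna("not found")
# Rename the 'data' column; 'new_name' is a placeholder header
data = data.rename(columns={'data': 'new_name'})
data.to_csv('final_report_clean.csv', index=False)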
I am a beginner with Python. I am now trying to figure out why the second 'for' loop doesn't work in the following script. I mean that I only get the result of the first 'for' loop, but nothing from the second one. I copied and pasted my script and the data csv below.
It would be helpful if you could tell me why it behaves this way and how to make the second 'for' loop work as well.
My SCRIPT:
import csv

file = "data.csv"
fh = open(file, 'rb')
read = csv.DictReader(fh)

for e in read:
    print(e['a'])

for e in read:
    print(e['b'])
"data.csv":
a,b,c
tree,bough,trunk
animal,leg,trunk
fish,fin,body
The csv reader is an iterator over the file. Once you have gone through it once, you have read to the end of the file, so there is no more to read. If you need to go through it again, you can seek to the beginning of the file:
fh.seek(0)
This will reset the file to the beginning so you can read it again. Depending on the code, it may also be necessary to skip the field name header:
next(fh)
This is necessary for your code, since the DictReader consumed that line the first time around to determine the field names, and it's not going to do that again. It may not be necessary for other uses of csv.
If the file isn't too big and you need to do several things with the data, you could also just read the whole thing into a list:
data = list(read)
Then you can do what you want with data.
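Put together, a minimal sketch of the rewind approach applied to the script from the question (using Python 3 text mode rather than the question's 'rb'):

import csv

fh = open("data.csv", newline='')
read = csv.DictReader(fh)

for e in read:
    print(e['a'])

fh.seek(0)   # rewind to the start of the file
next(fh)     # skip the header line DictReader already consumed

for e in read:
    print(e['b'])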
I have created a small function which takes the path of a csv file, reads it, and returns a list of dicts all at once; then you can loop through that list very easily:
import csv

def read_csv_data(path):
    """
    Read the CSV at the given path and return a list of dicts
    mapping column names to values.
    """
    data = csv.reader(open(path))
    # Read the column names from the first line of the file
    fields = next(data)
    data_lines = []
    for row in data:
        items = dict(zip(fields, row))
        data_lines.append(items)
    return data_lines
I have a csv containing various columns (full_log.csv). One of the columns is labeled "HASH" and contains the hash value of the file shown in that row. For Example, my columns would have the following headers:
Filename - Hash - Hostname - Date
I need my python script to take another CSV (hashes.csv) containing only 1 column of multiple hash values, and compare those hash values against the HASH column in my full_log.csv.
Anytime it finds a match I want it to output the entire row that contains the hash to an additional CSV (output.csv). So my output.csv will contain only the rows of full_log.csv that contain any of the hash values found in hashes.csv, if that makes sense.
So far I have the following. It works for the hash value that I manually enter in the script, but now I need it to look at hashes.csv to compare instead of manually putting the hash in the script, and instead of printing the results I need to export them to output.csv.
import csv

with open('full_log.csv', 'rb') as input_file1:
    reader = csv.DictReader(input_file1)
    rows = [row for row in reader if row['HASH'] == 'FB7D9605D1A38E38AA4C14C6F3622E5C3C832683']
    for row in rows:
        print row
I would generate a set from the hashes.csv file. Using membership in that set as a filter, I would iterate over the full_log.csv file, outputting only those lines that match.
import csv

with open('hashes.csv') as hashes:
    hashes = csv.reader(hashes)
    hashes = set(row[0] for row in hashes)

with open('full_log.csv') as input_file:
    reader = csv.DictReader(input_file)
    with open('output.csv', 'w') as output_file:
        writer = csv.DictWriter(output_file, reader.fieldnames)
        writer.writeheader()
        writer.writerows(row for row in reader if row['HASH'] in hashes)
Look at the pandas lib for Python:
http://pandas.pydata.org/pandas-docs/stable/
It has various functions helpful for your question; it can easily read, transform, and write csv files.
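A minimal sketch of what that could look like, assuming hashes.csv has no header row and the log column is named HASH as in the question:

import pandas as pd

# hashes.csv: a single column of hash values, assumed to have no header
hashes = pd.read_csv('hashes.csv', header=None)[0]

log = pd.read_csv('full_log.csv')
# Keep only the rows whose HASH value appears in hashes.csv
log[log['HASH'].isin(hashes)].to_csv('output.csv', index=False)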
Iterate through the rows of the log file and use a filter with any() to return matches against the collection of hashes:
import csv

with open('full_log.csv', 'rb') as file1, open('hashes.csv', 'rb') as file2:
    reader = csv.DictReader(file1)
    # Materialize the hash rows in a list first: a DictReader is a one-pass
    # iterator, so any() would exhaust it on the first log row otherwise.
    hash_rows = list(csv.DictReader(file2))
    matching_rows = [row for row in reader
                     if any(row['HASH'] == r['HASH'] for r in hash_rows)]
    fieldnames = reader.fieldnames

with open('output.csv', 'wb') as f:
    writer = csv.DictWriter(f, fieldnames)
    writer.writeheader()
    writer.writerows(matching_rows)
I am a bit unclear as to exactly how much help that you require in solving this. I will assume that you do not need a full solution, but rather, simply tips on how to craft your solution.
First question: which file is larger? If you know that hashes.csv is not too large, meaning it will fit in memory with no problem, then I would simply read that file in one line at a time and store each hash entry in a set variable. I won't provide full code, but the general structure is as follows:
hashes = set()
for each line in the hashes.csv file
    hashes.add(hash from the line)
Now, I believe you already know how to read a CSV file, since you have an example above. What you want to do is iterate through each row in the full log CSV file. For each of those rows, do not check whether the hash is a specific value; instead, check whether that value is contained in the hashes variable. If it is, use the CSV writer to write that single line to a file.
The biggest gotcha, I think, is knowing if the hashes will always be in a particular case so that you can perform the compare. For example, if one file uses uppercase for the HASH and the other uses lowercase, then you need to be sure to convert to use the same case.
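As one hedged illustration of that gotcha, the hash set and the comparison can both be normalized to lowercase (assuming hashes.csv has no header row and the log column is named HASH as in the question):

import csv

# Build the lookup set, lowercasing each hash as it is read
hashes = set()
with open('hashes.csv') as f:
    for line in f:
        hashes.add(line.strip().lower())

with open('full_log.csv') as input_file, open('output.csv', 'w') as output_file:
    reader = csv.DictReader(input_file)
    writer = csv.DictWriter(output_file, reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Lowercase the log's hash too, so the compare is case-insensitive
        if row['HASH'].lower() in hashes:
            writer.writerow(row)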
So, currently I have a csv file with 6 rows and 1 column (a header and 5 numbers). I want to be able to do a conversion, say from centimeters to inches, and save the result in a new csv with a new header.
I am new to coding in general, but so far I have only been able to import the csv, read it, and print it (using print row). I was wondering how I could do the conversion: since the numbers are saved in the csv, would I have to convert them to float and then somehow write them to a new csv? I only have 5 numbers while I figure out the correct code, but I want to be able to use it for a lot of numbers, not just 5. Therefore, I could not write print row[1] or something like that, as that would take too long.
I wasn't sure where the computation would be placed either. Help please! Also, this isn't homework or the like. I'm just doing this for fun.
This is the code I currently have:
import csv

with open('test.csv', 'rb') as f:
    reader = csv.reader(f)
    next(reader, None)  # I did this to skip the header I labelled Centimeters
    with open('test1.csv', 'wb') as o:
        writer = csv.writer(o)
        for row in reader:
            pass  # this is as far as I have got
I guess I don't know how to convert the numbers in the rows to float and then output the values. I just want to be able to multiply the number in each row by 0.393701, so that in the new csv the header is labelled Inches with the converted values beneath it in the rows.
If the numbers are one per row, it's not really a CSV file; it's just a text file, in which case you don't need to use the csv reading system from Python's libraries to read it (although that library will read the file just fine as well).
Basically, your program will look like this (this isn't real Python code; it's your job to come up with that!):
with either the CSV module or regular file operations
    open your file
with either the CSV module or regular file operations
    open an output file with a different name

for each line in the input file
    convert the value you read to float()
    transform the value
    write the value to the output file
Is that enough to get you started? (This is actually more lines than the final Python program will need, since you can easily combine some of them into one.)
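In case you get stuck after trying it yourself, here is a minimal sketch of how those steps could translate into Python 3 (file names taken from the question; 0.393701 is the centimeters-to-inches factor):

import csv

with open('test.csv', newline='') as infile, \
     open('test1.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    next(reader, None)           # skip the Centimeters header
    writer.writerow(['Inches'])  # write the new header
    for row in reader:
        # convert the text to float, transform it, and write one value per row
        writer.writerow([float(row[0]) * 0.393701])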