I have a csv containing various columns (full_log.csv). One of the columns is labeled "HASH" and contains the hash value of the file shown in that row. For Example, my columns would have the following headers:
Filename - HASH - Hostname - Date
I need my python script to take another CSV (hashes.csv) containing only 1 column of multiple hash values, and compare the hash values against the HASH column in my full_log.csv.
Anytime it finds a match I want it to output the entire row that contains the hash to an additional CSV (output.csv). So my output.csv will contain only the rows of full_log.csv that contain any of the hash values found in hashes.csv, if that makes sense.
So far I have the following. It works for the hash value that I manually enter in the script, but now I need it to look at hashes.csv to compare instead of manually putting the hash in the script, and instead of printing the results I need to export them to output.csv.
import csv
with open('full_log.csv', newline='') as input_file1:
    reader = csv.DictReader(input_file1)
    rows = [row for row in reader if row['HASH'] == 'FB7D9605D1A38E38AA4C14C6F3622E5C3C832683']

for row in rows:
    print(row)
I would generate a set from the hashes.csv file. Using membership in that set as a filter, I would iterate over the full_log.csv file, outputting only those lines that match.
import csv

with open('hashes.csv', newline='') as f:
    hashes = set(row[0] for row in csv.reader(f) if row)

with open('full_log.csv', newline='') as input_file:
    reader = csv.DictReader(input_file)
    with open('output.csv', 'w', newline='') as output_file:
        writer = csv.DictWriter(output_file, reader.fieldnames)
        writer.writeheader()
        writer.writerows(row for row in reader if row['HASH'] in hashes)
Take a look at the pandas library for Python:
http://pandas.pydata.org/pandas-docs/stable/
It has many helpful functions for this task; it makes it easy to read, transform, and write CSV files.
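As a sketch of that approach, a pandas version could use Series.isin for the membership test. The filenames and the "HASH" column come from the question; the two-row DataFrame here is made-up demo data standing in for the real files:

```python
import pandas as pd

# Demo data standing in for the real files described in the question.
pd.DataFrame({'Filename': ['a.exe', 'b.dll'],
              'HASH': ['AAA111', 'BBB222'],
              'Hostname': ['host1', 'host2'],
              'Date': ['2023-01-01', '2023-01-02']}).to_csv('full_log.csv', index=False)
pd.DataFrame(['BBB222']).to_csv('hashes.csv', index=False, header=False)

log = pd.read_csv('full_log.csv')
# hashes.csv is a single column with no header, so read it with header=None.
wanted = pd.read_csv('hashes.csv', header=None)[0]

# Keep only the rows whose HASH value appears in hashes.csv.
matches = log[log['HASH'].isin(wanted)]
matches.to_csv('output.csv', index=False)
```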
Read the hash values into a list up front (a csv.DictReader is an iterator, so it would be exhausted after the first comparison otherwise), then filter the log rows with any() against that collection:
import csv

with open('hashes.csv', newline='') as file2:
    # assumes hashes.csv has a "Hash" header row
    hashes = [r['Hash'] for r in csv.DictReader(file2)]

with open('full_log.csv', newline='') as file1:
    reader = csv.DictReader(file1)
    fieldnames = reader.fieldnames
    matching_rows = [row for row in reader
                     if any(row['HASH'] == h for h in hashes)]

with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(matching_rows)
I am a bit unclear as to exactly how much help that you require in solving this. I will assume that you do not need a full solution, but rather, simply tips on how to craft your solution.
First question: which file is larger? If you know that hashes.csv is not too large, meaning it will fit in memory with no problem, then I would simply read that file in one line at a time and store each hash entry in a set. I won't provide full code, but the general structure is as follows:
hashes = set()
for each line in the hashes.csv file:
    hashes.add(hash from the line)
Now, I believe you already know how to read a CSV file, since you have an example above. What you want to do is iterate through each row in the full log CSV file. For each of those rows, do not check whether the hash is a specific value; instead, check whether that value is contained in the hashes variable. If it is, then use the CSV writer to write that single row to a file.
The biggest gotcha, I think, is knowing whether the hashes will always be in a particular case so that you can perform the comparison. For example, if one file uses uppercase for the HASH and the other uses lowercase, then you need to be sure to convert both to the same case.
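Putting those tips together, one possible sketch looks like this. The filenames and the "HASH" header come from the question; the sample rows written at the top are made up purely so the example is self-contained:

```python
import csv

# Demo input files standing in for the real ones from the question.
with open('hashes.csv', 'w', newline='') as f:
    f.write('aaa111\n')
with open('full_log.csv', 'w', newline='') as f:
    f.write('Filename,HASH,Hostname,Date\n')
    f.write('a.exe,AAA111,host1,2023-01-01\n')
    f.write('b.dll,BBB222,host2,2023-01-02\n')

# Normalise every hash to lower case when building the set...
with open('hashes.csv', newline='') as f:
    hashes = {row[0].lower() for row in csv.reader(f) if row}

# ...and again when testing membership, so a case difference cannot hide a match.
with open('full_log.csv', newline='') as infile, \
     open('output.csv', 'w', newline='') as outfile:
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    matched = [row for row in reader if row['HASH'].lower() in hashes]
    writer.writerows(matched)
```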
I want to generate a log file in which I have to print two lists for about 50 input files. So, there are approximately 100 lists reported in the log file. I tried using pickle.dump, but it adds some strange characters in the beginning of each value. Also, it writes each value in a different line and the enclosing brackets are also not shown.
Here is a sample output from a test code.
import pickle

x = [1, 2, 3, 4]
fp = open('log.csv', 'w')
pickle.dump(x, fp)
fp.close()
output:
I want my log file to report:
list 1 is: [1,2,3,4]
If you want your log file to be readable, you are approaching it the wrong way by using pickle, which "implements binary protocols", i.e. its output is not human-readable.
To get what you want, replace the line
pickle.dump(x,fp)
with
fp.write('list 1 is: ')
fp.write(str(x))
This requires minimal change in the rest of your code. However, good practice would change your code to a better style.
pickle is for storing objects in a form which you could use to recreate the original object. If all you want to do is create a log message, the builtin __str__ method is sufficient.
x = [1, 2, 3, 4]
with open('log.csv', 'w') as fp:
    print('list 1 is: {}'.format(x), file=fp)
Python's pickle is used to serialize objects, which is basically a way that an object and its hierarchy can be stored on your computer for use later.
If your goal is to write data to a CSV file and then read back what is inside it, read on below.
Writing to a CSV file (see here for a great tutorial if you need more info):
import csv

my_list = [1, 2, 3, 4]
with open('yourFile.csv', 'w', newline='') as myFile:
    writer = csv.writer(myFile)
    writer.writerow(my_list)
the function writerow() will write each element of an iterable (each element of the list in your case) to a column. You can run through each one of your lists and write it to its own row in this way. If you want to write multiple rows at once, check out the method writerows()
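As a small illustration of writerows(), each inner list becomes one row of the file. The two lists here are hypothetical stand-ins for the lists mentioned in the question:

```python
import csv

# Hypothetical data: two of the lists mentioned in the question.
lists = [[1, 2, 3, 4], [5, 6, 7, 8]]

# writerows() writes one row per inner list, so each list lands on its own line.
with open('yourFile.csv', 'w', newline='') as f:
    csv.writer(f).writerows(lists)

# Reading the file back shows the two rows (fields come back as strings).
with open('yourFile.csv', newline='') as f:
    rows = list(csv.reader(f))
```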
Your file is saved automatically once it is closed.
Reading A CSV File
import csv
with open('example.csv', newline='') as File:
    reader = csv.reader(File)
    for row in reader:
        print(row)
This will run through all the rows in your csv file and will print it to the console.
I am new to python and I am trying to reduce the csv file records by matching specific strings. I want to write the rows of the matching one to a new csv file.
Here is an example dataset:
What I am trying to do is search by going through all of the rows for specific matching keywords (e.g. only write the rows containing WARRANT ARREST as can be seen on the image) to a new csv file.
Here is my code so far:
import csv

with open('test.csv', 'a') as myfile:
    with open('train3.csv', 'r') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=',')
        for r in spamreader:
            for field in r:
                if field == "OTHER OFFENSES":
                    myfile.write(r)
test.csv is empty and train3 contains all the records.
You can often learn a lot about what's going on by simply adding some else statements. For instance, after if field == "OTHER OFFENSES":, you could write else: print(field) or else: print(r). It might become obvious why your comparison fails once you see the actual data.
There might also be a newline character after each row that's messing up the comparison (that was the cause of the problem the last time someone asked about this and I answered). Perhaps Python sees OTHER OFFENSES\n, which does not equal OTHER OFFENSES. To match these, use a less strict comparison or strip() the field.
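A minimal illustration of that newline problem, with a made-up field value: the raw field may carry a trailing newline, so an exact comparison fails until the field is stripped.

```python
# What the code might actually be comparing (hypothetical raw field).
field = "OTHER OFFENSES\n"

exact_match = (field == "OTHER OFFENSES")          # fails: trailing newline
stripped_match = (field.strip() == "OTHER OFFENSES")  # succeeds after strip()
```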
Try replacing if field == "OTHER OFFENSES" with if "OTHER OFFENSES" in field:. When you do == you're asking for an exact match whereas something in something_else will search the whole line of text for something.
Try the following approach, it is a bit difficult to test as your data cannot be copy/pasted:
import csv

with open('test.csv', 'a', newline='') as f_outputcsv, open('train3.csv', 'r') as f_inputcsv:
    csv_spamreader = csv.reader(f_inputcsv)
    csv_writer = csv.writer(f_outputcsv)
    for row in csv_spamreader:
        for field in row:
            if field == "WARRANT ARREST":
                csv_writer.writerow(row)
                break
This uses a csv.writer instance to write whole rows back to your output file.
I have a csv file which has 1000 entries (it is delimited by a tab). I've only listed the first few.
Unique ID Name
0 60ff3ads Keith
1 C6LSI545 Shawn
2 O87SI523 Baoru
3 OM022SSI Naomi
4 3LLS34SI Alex
5 Z7423dSI blahblah
I want to remove some of these entries by their index number from this csv file and save the result into another csv file.
I've not started writing any code for this yet because I'm not sure how I should go about doing it. Please kindly advise.
A one-liner to solve your problem:
import pandas as pd

indexes_to_drop = [1, 7, ...]
pd.read_csv('original_file.csv', sep='\t').drop(indexes_to_drop, axis=0).to_csv('new_file.csv', sep='\t', index=False)
Check the read_csv doc to accommodate your particular CSV flavour if needed.
The sample data suggests a tab delimitered file. You could open the input file with a csv.reader, and open an output file with csv.writer. It will be slightly simpler, however, if you simply use split() to grab the first field (index) and compare it with those indices that you want to filter out.
indices_to_delete = ['0', '3', '5']

with open('input.csv') as infile, open('output.csv', 'w') as outfile:
    for line in infile:
        if line.split()[0] not in indices_to_delete:
            outfile.write(line)
This could be reduced to this:
with open('input.csv') as infile, open('output.csv', 'w') as outfile:
    outfile.writelines(line for line in infile
                       if line.split()[0] not in indices_to_delete)
And that should do the trick in this case for the sort of data that you posted. If you find that you need to compare values in other fields containing whitespace, you should consider the csv module.
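For completeness, the same filter written with the csv module, which is the safer route when other fields can themselves contain whitespace. The tab delimiter and the indices-in-the-first-column layout follow the question; the three rows written at the top are demo data:

```python
import csv

indices_to_delete = {'0', '3', '5'}

# Demo input mirroring the tab-delimited sample from the question.
with open('input.csv', 'w', newline='') as f:
    csv.writer(f, delimiter='\t').writerows(
        [['0', '60ff3ads', 'Keith'],
         ['1', 'C6LSI545', 'Shawn'],
         ['3', 'OM022SSI', 'Naomi']])

with open('input.csv', newline='') as infile, \
     open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    # Only the first field (the index) decides whether a row is kept.
    kept = [row for row in reader if row[0] not in indices_to_delete]
    writer.writerows(kept)
```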
I don't think it is possible to remove lines in place. However, you could write to a new file instead: go over each row of the original csv and save each row you want to keep to csv-A, and each row you don't to csv-B. That way you end up with two separate csv files.
More info here: How to Delete Rows CSV in python
I am extremely new to python 3 and I am learning as I go here. I figured someone could help me with a basic question: how to store text from a CSV file as a variable to be used later in the code. So the idea here would be to import a CSV file into the python interpreter:
import csv
with open('some.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        ...
and then extract the text from that file and store it as a variable (i.e. w = ["csv file text"]) to then be used later in the code to create permutations:
print (list(itertools.permutations(["w"], 2)))
If someone could please help and explain the process, it would be very much appreciated as I am really trying to learn. Please let me know if any more explanation is needed!
itertools.permutations() wants an iterable (e.g. a list) and a length as its arguments, so your data structure needs to reflect that, but you also need to define what you are trying to achieve here. For example, if you wanted to read a CSV file and produce permutations on every individual CSV field you could try this:
import csv
import itertools

with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    w = []
    for row in reader:
        w.extend(row)

print(list(itertools.permutations(w, 2)))
The key thing here is to create a flat list that can be passed to itertools.permutations() - this is done by initialising w to an empty list, and then extending it with the elements/fields from each row of the CSV file.
Note: As pointed out by @martineau, for the reasons explained here, the file should be opened with newline='' when used with the Python 3 csv module.
If you want to use Python 3 (as you state in the question) and to process the CSV file using the standard csv module, you should be careful about how to open the file. So far, your code and the answers use the Python 2 way of opening the CSV file. Things have changed in Python 3.
As shengy wrote, the CSV file is just a text file, and the csv module gets the elements as strings. Strings in Python 3 are unicode strings. Because of that, you should open the file in the text mode, and you should supply the encoding. Because of the nature of CSV file processing, you should also use the newline='' when opening the file.
Now extending the explanation of Burhan Khalid... When reading the CSV file, you get the rows as lists of strings. If you want to read all content of the CSV file into memory and store it in a variable, you probably want to use the list of rows (i.e. list of lists where the nested lists are the rows). The for loop iterates through the rows. The same way the list() function iterates through the sequence (here through the sequence of rows) and build the list of the items. To combine that with the wish to store everything in the content variable, you can write:
import csv

with open('some.csv', newline='', encoding='utf_8') as f:
    reader = csv.reader(f)
    content = list(reader)
Now you can do your permutations as you wish; itertools is the correct way to do them.
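For instance, flattening the row-of-lists content into one list of fields and permuting pairs could look like this (the two-row content here is made up for demonstration):

```python
import itertools

# Hypothetical content, shaped like list(reader) above: a list of rows.
content = [['a', 'b'], ['c', 'd']]

# Flatten the rows into one list of fields, then permute pairs of fields.
flat = [field for row in content for field in row]
pairs = list(itertools.permutations(flat, 2))
```

With 4 fields, permutations of length 2 yields 4 × 3 = 12 ordered pairs.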
import csv

data = csv.DictReader(open('FileName.csv', 'r'))
print(data.fieldnames)

output = []
for each_row in data:
    row = {}
    try:
        p = dict((k.strip(), v) for k, v in each_row.items()
                 if v.lower() != 'null')
    except AttributeError as e:
        print(e)
        print(each_row)
        raise
    # based on the number of columns
    if p.get('col1'):
        row['col1'] = p['col1']
    if p.get('col2'):
        row['col2'] = p['col2']
    output.append(row)
Finally, all the data is stored in the output variable.
Is this what you need?
import csv
import csv

with open('some.csv', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    rows = list(reader)

print('The csv file had {} rows'.format(len(rows)))
for row in rows:
    do_stuff(row)
do_stuff_to_all_rows(rows)
The interesting line is rows = list(reader), which reads every row of the csv file (each row is itself a list) into the list rows, in effect giving you a list of lists.
If you had a csv file with three rows, rows would be a list with three elements, each element a row representing each line in the original csv file.
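To make that shape concrete, here is a made-up stand-in for rows = list(reader) on a three-line file, showing how you index first by row and then by field:

```python
# Made-up stand-in for rows = list(reader) on a three-line CSV file.
rows = [['a1', 'a2'], ['b1', 'b2'], ['c1', 'c2']]

second_row = rows[1]       # a whole line of the file, as a list of fields
first_field = rows[1][0]   # one field, indexed by row and then column
```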
If all you care about is to read the raw text in the file (csv or not) then:
with open('some.csv') as f:
    w = f.read()
is a simple solution that leaves you with w = "csv, file, text\nwithout, caring, about columns\n".
You should try pandas, which works with both Python 2.7 and Python 3.2+:
import pandas as pd
csv = pd.read_csv("your_file.csv")
Then you can handle your data easily.
More fun here
First, a csv file is a text file too, so everything you can do with a file, you can do with a csv file. That means f.read(), f.readline(), and f.readlines() can all be used; see detailed information about these functions here.
But, as your file is a csv file, you can utilize the csv module.
# input.csv
# 1,david,enterprise
# 2,jeff,personal
import csv

with open('input.csv') as f:
    reader = csv.reader(f)
    for serial, name, version in reader:
        # The csv module already extracts the fields for you
        print(serial, name, version)
More details about the csv module are here.
This is maybe a very basic question, but let's suppose one has a csv file which looks as follows:
a,a,a,a
b,b,b,b
c,c,c,c
d,d,d,d
e,e,e,e
And I am interested in deleting row[1] and row[3], and rewriting a new file that does not contain such rows. What would be the best way to do this? As the csv module is already loaded in my code, I'd like to know how to do it within that scheme. I'd be glad if somebody could help me with this.
Since each row is on a separate line (assuming there are no newlines within the data items of the rows themselves), you can do this by simply copying the file line-by-line and skipping any you don't want kept. Since I'm unsure whether you number rows starting from zero or one, I've added a symbolic constant at the beginning to control it. You could, of course, hardcode it, as well as ROWS_TO_DELETE, directly into the code.
Regardless, this approach would be faster than using, for example, the csv module, because it avoids all the unnecessary parsing and reformatting of the data being processed that the module has to do.
FIRST_ROW_NUM = 1  # or 0
ROWS_TO_DELETE = {1, 3}

with open('infile.csv', 'rt') as infile, open('outfile.csv', 'wt') as outfile:
    outfile.writelines(row for row_num, row in enumerate(infile, FIRST_ROW_NUM)
                       if row_num not in ROWS_TO_DELETE)
Resulting output file's contents:
b,b,b,b
d,d,d,d
e,e,e,e