I am new to Python and I am trying to filter the records of a CSV file by matching specific strings, writing the matching rows to a new CSV file.
Here is an example dataset:
What I am trying to do is search through all of the rows for specific matching keywords (e.g. only write the rows containing WARRANT ARREST, as in the example dataset) to a new csv file.
Here is my code so far:
import csv

with open('test.csv', 'a') as myfile:
    with open('train3.csv', 'rb') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=',')
        for r in spamreader:
            for field in r:
                if field == "OTHER OFFENSES":
                    myfile.write(r)
After running it, test.csv is still empty; train3.csv contains all the records.
You can often learn a lot about what's going on by simply adding some else statements. For instance, after if field == "OTHER OFFENSES":, you could write else: print(field) or else: print(r). It might become obvious why your comparison fails once you see the actual data.
There might also be a newline character after each row that's messing up the comparison (that was the cause of the problem the last time someone asked about this and I answered). Perhaps Python sees OTHER OFFENSES\n, which does not equal OTHER OFFENSES. To match these, use a less strict comparison or strip() the field.
Try replacing if field == "OTHER OFFENSES" with if "OTHER OFFENSES" in field:. When you do == you're asking for an exact match whereas something in something_else will search the whole line of text for something.
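A tiny illustration of the difference (the trailing newline here is a hypothetical stand-in for whatever extra characters the real data might carry):

```python
field = "WARRANT ARREST\n"

# Exact comparison fails because of the trailing newline.
print(field == "WARRANT ARREST")          # False

# A substring test succeeds, and so does comparing the stripped field.
print("WARRANT ARREST" in field)          # True
print(field.strip() == "WARRANT ARREST")  # True
```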
Try the following approach; it is a bit difficult to test, as your data cannot be copied and pasted:
import csv

with open('test.csv', 'a', newline='') as f_outputcsv, open('train3.csv', 'r') as f_inputcsv:
    csv_spamreader = csv.reader(f_inputcsv)
    csv_writer = csv.writer(f_outputcsv)
    for row in csv_spamreader:
        for field in row:
            if field == "WARRANT ARREST":
                csv_writer.writerow(row)
                break
This uses a csv.writer instance to write whole rows back to your output file.
Every day we get a CSV file from a vendor, and we need to parse it and insert it into a database. We use a single Python 3 program for all the tasks.
The problem happens with multiline CSV files, where the content on the subsequent lines is skipped.
48.11363;11.53402;81369;München;"";1.0;1962;I would need
help from
Stackoverflow;"";"";"";289500.0;true;""
Here the field "I would need help from Stackoverflow" is spread in 3 lines.
The problem is that Python 3 only considers "I would need" as a record and skips the rest of the field.
At present I am using the options below to read the file:
with open(file_path, newline='', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        {MY LOGIC}
Is there any way to treat a multiline CSV record as a single record?
I understand that PySpark has an option("multiline", True) setting, but we don't want to use PySpark in the first place.
Looking for options.
Thanks in Advance
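For what it's worth, Python's csv.reader already treats a newline inside a quoted field as part of a single record, provided the file is opened with newline='' and the field really is quoted. In the sample above the multiline field is not quoted, which is why each physical line is parsed separately. A minimal sketch with the field quoted, using in-memory data standing in for the real file:

```python
import csv
import io

# The record from the question, but with the multiline field quoted.
data = ('48.11363;11.53402;81369;München;"";1.0;1962;"I would need\n'
        'help from\n'
        'Stackoverflow";"";"";"";289500.0;true;""\n')

reader = csv.reader(io.StringIO(data), delimiter=';', quotechar='"')
rows = list(reader)
print(len(rows))     # the three physical lines parse as one logical record
print(rows[0][7])    # the whole multiline field, newlines included
```

If the vendor cannot quote the field, the file is not well-formed CSV and would need a custom pre-processing step instead.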
I'm trying to scrape comments from a certain submission on Reddit and output them to a CSV file.
import praw
import csv

reddit = praw.Reddit(client_id='ClientID', client_secret='ClientSecret', user_agent='UserAgent')
submission = reddit.submission(id="SubmissionID")

with open('Reddit.csv', 'w') as csvfile:
    for comment in submission.comments:
        csvfile.write(comment.body)
The problem is that for each cell the comments seem to be randomly split up. I want each comment in its own cell. Any ideas on how to achieve this?
You are importing the csv library but you are not actually using it. Use it and your problem may go away.
https://docs.python.org/3/library/csv.html#csv.DictWriter
import csv

# ...
comment = "this string was created from your code"
# ...

with open('names.csv', 'w', newline='') as csvfile:
    fieldnames = ['comment']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({'comment': comment})
To write a CSV file in Python, use the csv module, specifically csv.writer(). You import this module at the top of your code, but you never use it.
Applied to your code, this looks like:
with open('Reddit.csv', 'w') as csvfile:
    comment_writer = csv.writer(csvfile)
    for comment in submission.comments:
        comment_writer.writerow([comment.body])
Here, we use csv.writer() to create a CSV writer from the file that we've opened, and we call it comment_writer. Then, for each comment, we write another row to the CSV file. The row is represented as a list. Since we only have one piece of information to write on each row, the list contains just one item. The row is [comment.body].
The csv module takes care of making sure that values with new lines, commas, or other special characters are properly formatted as CSV values.
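To see that escaping in action, here is a small self-contained example writing a value that contains both a comma and an embedded newline:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)

# A comment with a comma and an embedded newline, plus a plain one.
writer.writerow(["first, tricky\ncomment"])
writer.writerow(["second comment"])

# The tricky value comes out quoted, so it stays in a single cell.
print(buf.getvalue())
```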
Note that, for some submissions with many comments, your PRAW code might raise an exception along the lines of 'MoreComments' object has no attribute 'body'. The PRAW docs discuss this, and I encourage you to read that to learn more, but know that we can avoid this happening in code by further modifying our loop:
from praw.models import Comment

# ...

with open('Reddit.csv', 'w') as csvfile:
    comment_writer = csv.writer(csvfile)
    for comment in submission.comments:
        if isinstance(comment, Comment):
            comment_writer.writerow([comment.body])
Also, your code only gets the top level comments of a submission. If you're interested in more, see this question, which is about how to get more than just top-level comments from a submission.
I'm guessing that the cells are not being split randomly, but being split at a comma, space, or semicolon. You can choose what character you want the cells to be split at by using the delimiter parameter.
import csv

with open('Reddit.csv', 'w') as csvfile:
    comments = ['comment one', 'comment two', 'comment three']
    csv_writer = csv.writer(csvfile, delimiter='-')
    csv_writer.writerow(comments)
I have the following CSV file:
id;name;duration;predecessors;
10;A;7;;
20;B;10;10;
25;B2;3;10;
30;C;5;10;
40;D;5;20,30, 25;
That is, in the last row the fourth column contains three elements (20, 30, 25) separated by commas.
I have the following code:
csv_file = open(path_to_csv, 'r')
csv_file_reader = csv.reader(csv_file, delimiter=',')
first_row = True
for row in csv_file_reader :
if not first_row:
print(row)
else :
first_row = False
but I get a weird output:
['10;A;7;;']
['20;B;10;10;']
['25;B2;3;10;']
['30;C;5;10;']
['40;D;5;20', '30', ' 25;']
Any ideas?
Thanks in advance
You have specified CSV in your description, which stands for Comma Separated Values. However, your data uses semicolons.
Consider specifying the delimiter as ; for the CSV library:
with open(path_to_csv, 'r') as csv_file:
    csv_file_reader = csv.reader(csv_file, delimiter=';')
    ...
And while we're here, note the change to using the with statement to open the file. The with statement allows you to open the file in a language-robust manner. No matter what happens (exception, quit, etc.), Python guarantees that the file will be closed and all resources accounted for. You don't need to close the file, just exit the block (unindent). It's "Pythonic" and a good habit to get into.
#Antonio, I appreciate the above answer. As we know, CSV is a file of comma-separated values, and Python's csv module works on that basis by default.
No problem, you can still read the file without using the csv module.
Based on the input provided in your question, I have written another simple solution that reads the CSV without any Python module (it's fine for simple tasks).
Please read and try it, and comment if you are not satisfied with the code or if it fails for some of your test cases; I will modify it to make it work.
Data.csv:
id;name;duration;predecessors;
10;A;7;;
20;B;10;10;
25;B2;3;10;
30;C;5;10;
40;D;5;20,30, 25;
Now, have a look at the below code (that finds and prints all the lines with 4th column having more than one elements):
with open("Data.csv") as csv_file:
    for line in csv_file.readlines()[1:]:
        arr = line.strip().split(";")
        if len(arr[3].split(",")) > 1:
            print(line)  # 40;D;5;20,30, 25;
I have a csv containing various columns (full_log.csv). One of the columns is labeled "HASH" and contains the hash value of the file shown in that row. For Example, my columns would have the following headers:
Filename - Hash - Hostname - Date
I need my Python script to take another CSV (hashes.csv) containing only one column of multiple hash values, and compare the hash values against the HASH column in my full_log.csv.
Anytime it finds a match I want it to output the entire row that contains the hash to an additional CSV (output.csv). So my output.csv will contain only the rows of full_log.csv that contain any of the hash values found in hashes.csv, if that makes sense.
So far I have the following. It works for the hash value that I manually enter in the script, but now I need it to look at hashes.csv to compare instead of manually putting the hash in the script, and instead of printing the results I need to export them to output.csv.
import csv

with open('full_log.csv', 'rb') as input_file1:
    reader = csv.DictReader(input_file1)
    rows = [row for row in reader if row['HASH'] == 'FB7D9605D1A38E38AA4C14C6F3622E5C3C832683']
    for row in rows:
        print row
I would generate a set from the hashes.csv file. Using membership in that set as a filter, I would iterate over the full_log.csv file, outputting only those lines that match.
import csv

with open('hashes.csv') as hashes:
    hashes = csv.reader(hashes)
    hashes = set(row[0] for row in hashes)

with open('full_log.csv') as input_file:
    reader = csv.DictReader(input_file)
    with open('output.csv', 'w', newline='') as output_file:
        writer = csv.DictWriter(output_file, reader.fieldnames)
        writer.writeheader()
        writer.writerows(row for row in reader if row['HASH'] in hashes)
Look at the pandas library for Python:
http://pandas.pydata.org/pandas-docs/stable/
It has various helpful functions for your problem; it can easily read, transform, and write CSV files.
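For illustration, a pandas sketch of this particular task (the column names and sample data here are assumptions based on the question; in real use you would call pd.read_csv on full_log.csv and hashes.csv directly and write the result with to_csv('output.csv', index=False)):

```python
import io
import pandas as pd

# In-memory stand-ins for full_log.csv and hashes.csv.
full_log_csv = io.StringIO(
    "Filename,HASH,Hostname,Date\n"
    "a.exe,FB7D9605D1A38E38AA4C14C6F3622E5C3C832683,host1,2017-01-01\n"
    "b.exe,0000000000000000000000000000000000000000,host2,2017-01-02\n")
hashes_csv = io.StringIO("FB7D9605D1A38E38AA4C14C6F3622E5C3C832683\n")

full_log = pd.read_csv(full_log_csv)
wanted = pd.read_csv(hashes_csv, header=None)[0]

# Keep only the rows whose HASH appears in hashes.csv.
matches = full_log[full_log['HASH'].isin(wanted)]
print(matches.to_csv(index=False))
```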
Iterating through the rows of the log and of the hashes file, using a filter with any() to return matches from the collection of hashes. Note that the hash rows must be materialised in a list first, since a DictReader can only be iterated once (this assumes hashes.csv has a HASH header column):

import csv

with open('full_log.csv') as file1, open('hashes.csv') as file2:
    reader = csv.DictReader(file1)
    # A plain generator would be exhausted after the first comparison.
    hash_rows = list(csv.DictReader(file2))
    fieldnames = reader.fieldnames
    matching_rows = [row for row in reader
                     if any(row['HASH'] == r['HASH'] for r in hash_rows)]

with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(matching_rows)
I am a bit unclear as to exactly how much help that you require in solving this. I will assume that you do not need a full solution, but rather, simply tips on how to craft your solution.
First question, which file is larger? If you know that hashes.csv is not too large, meaning it will fit in memory with no problem, then I would simply suck that file in one line at a time and store each hash entry in a Set variable. I won't provide full code, but the general structure is as follows:
hashes = set()
for each line in the hashes.csv file:
    hashes.add(hash from the line)
Now, I believe you already know how to read a CSV file, since you have an example above; what you want to do is iterate through each row in the full log CSV file. For each of those rows, do not check whether the hash is a specific value; instead, check whether that value is contained in the hashes variable. If it is, use the CSV writer to write that single row to a file.
The biggest gotcha, I think, is knowing if the hashes will always be in a particular case so that you can perform the compare. For example, if one file uses uppercase for the HASH and the other uses lowercase, then you need to be sure to convert to use the same case.
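A runnable sketch of that plan, using in-memory files so the example is self-contained (in real use, replace the StringIO objects with open() calls on hashes.csv, full_log.csv, and output.csv; the HASH column name is taken from the question):

```python
import csv
import io

# Stand-ins for the real files; note the deliberate case difference.
hashes_file = io.StringIO(
    "fb7d9605d1a38e38aa4c14c6f3622e5c3c832683\n")
full_log_file = io.StringIO(
    "Filename,HASH,Hostname,Date\n"
    "a.exe,FB7D9605D1A38E38AA4C14C6F3622E5C3C832683,host1,2017-01-01\n"
    "b.exe,1111111111111111111111111111111111111111,host2,2017-01-02\n")
output_file = io.StringIO()

# Step 1: read every hash into a set, normalising case up front.
hashes = {line.strip().upper() for line in hashes_file if line.strip()}

# Step 2: stream the log, writing only rows whose hash is in the set.
reader = csv.DictReader(full_log_file)
writer = csv.DictWriter(output_file, fieldnames=reader.fieldnames)
writer.writeheader()
for row in reader:
    if row['HASH'].upper() in hashes:
        writer.writerow(row)

print(output_file.getvalue())
```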
I'm trying to read in a CSV file with many rows and columns; I would like to print one row, in a particular format, to a text file, and do some hashing on the values. So far, I have been able to read in the file, parse through it using DictReader, find the row I want using an if statement, and then print the keys and values. I cannot figure out how to format it the way I want in the end (Key = Value \n), and I cannot figure out how to write to a file (much less in the format I want) using the value of row obtained below. I've been trying for days and have made a little progress, but cannot get it to work. Here is what I got to work (with much detail left out of the results):
import csv

with open("C:\path_to_script\filename_Brief.csv") as infh:
    reader = csv.DictReader(infh)
    for row in reader:
        if row['ALIAS'] == 'Y4K':
            print(row)
result-output
{'Full_Name': 'Jack Flash', 'PHONE_NO': '555 555-1212', 'ALIAS': 'Y4K'}
I'd like to ask the user to input the alias and then use that to determine which row to print. I've done a ton of research but am new-ish to Python, so I'm asking for help! I've used pyexcel and xlrd/xlwt, and even thought I'd try pandas, but that was too much to learn. I also got it to format the way I wanted in one test, but then could not get the row selection to work; in other words, it prints all the records rather than the row I want. I have 30 Firefox tabs open trying to find an answer! Thanks in advance!
The following may at least be close to what you want (I think):
import csv

with open(r'C:\path_to_script\filename_Brief.csv') as infh, \
        open('new_file.txt', 'wt') as outfh:
    reader = csv.DictReader(infh)
    for row in reader:
        if row['ALIAS'] == 'Y4K':
            outfh.write('Full_Name = {Full_Name}\n'
                        'PHONE_NO = {PHONE_NO}\n'
                        'ALIAS = {ALIAS}\n'.format(**row))
This would write three lines, formatted like this, into the output file for every matching row:
Full_Name = Jack Flash
PHONE_NO = 555 555-1212
ALIAS = Y4K
BTW, the **row notation means basically "take all the entries in the specified dictionary and turn them into keyword arguments for this function call". The {keyword} syntax in the format string refers to the keyword arguments passed to the str.format() method.
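Since the question also asks about prompting the user for the alias, here is a sketch that takes the alias as a parameter (pass input('Alias? ') in real use) and writes every column as Key = Value, so it is not tied to the three hard-coded field names. The helper name and the in-memory demo data are illustrative, mirroring the row from the question:

```python
import csv
import io

def write_matching(infh, outfh, alias):
    """Write Key = Value lines for every row whose ALIAS matches."""
    reader = csv.DictReader(infh)
    for row in reader:
        if row['ALIAS'] == alias:
            for key, value in row.items():
                outfh.write('{} = {}\n'.format(key, value))

# Demo with in-memory data; in real use, open the CSV and text files
# and call write_matching(infh, outfh, input('Alias? ')).
src = io.StringIO("Full_Name,PHONE_NO,ALIAS\n"
                  "Jack Flash,555 555-1212,Y4K\n")
out = io.StringIO()
write_matching(src, out, 'Y4K')
print(out.getvalue())
```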