Multiline CSV read using Python3 - python

Every day we get CSV files from a vendor, and we need to parse them and insert the data into a database. We use a single Python 3 program for all of these tasks.
The problem occurs with multiline CSV files, where the content on the second and subsequent lines is skipped.
48.11363;11.53402;81369;München;"";1.0;1962;I would need
help from
Stackoverflow;"";"";"";289500.0;true;""
Here the field "I would need help from Stackoverflow" is spread across 3 lines.
The problem is that Python 3 treats only "I would need" as part of the record and skips the rest.
At present I am using the options below to read the file:
with open(file_path, newline='', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        {MY LOGIC}
Is there any way to read a multiline CSV record as a single record?
I understand that PySpark has option("multiline", True), but we don't want to use PySpark in the first place.
Looking for options.
Thanks in advance.
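One possible approach, sketched here under assumptions rather than offered as a definitive fix: because the multiline field in the sample is not quoted, csv.reader cannot tell that the line breaks belong to the field. If every record is known to have a fixed number of fields (14 in the sample row, with ; as the delimiter), physical lines can be joined until that count is reached. The helper name read_logical_rows and the choice to join continuation lines with a space are hypothetical, and the sketch breaks down if a line break falls exactly on a field boundary.

```python
import csv
import io


def read_logical_rows(lines, expected_fields, delimiter=";"):
    """Yield parsed rows, joining physical lines until a row has enough fields.

    Hypothetical helper: assumes every record has exactly `expected_fields`
    fields and that a line break never falls exactly between two fields.
    """
    buffer = ""
    for line in lines:
        line = line.rstrip("\r\n")
        # Join a continuation line to the pending record with a space.
        buffer = f"{buffer} {line}" if buffer else line
        row = next(csv.reader([buffer], delimiter=delimiter))
        if len(row) >= expected_fields:
            yield row
            buffer = ""
    if buffer:
        # Trailing incomplete record: emit what we have.
        yield next(csv.reader([buffer], delimiter=delimiter))


# The sample record from the question, split over three physical lines.
sample = (
    '48.11363;11.53402;81369;München;"";1.0;1962;I would need\n'
    "help from\n"
    'Stackoverflow;"";"";"";289500.0;true;""\n'
)
rows = list(read_logical_rows(io.StringIO(sample), expected_fields=14))
print(rows[0][7])
```

If the vendor can be asked to quote fields that contain line breaks, the plain csv.reader with newline='' would handle them without any of this.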

Related

How to import csv files in Python containing badly formatted quote marks?

I'm trying to load the following test.csv file:
R1C1 R1C2 R1C3
R2C1 R2C2 R2C3
R3C1 "R3C2 R3C3
R4C1 R4C2 R4C3
... Using this Python script :
import csv
with open("test.csv") as f:
    for row in csv.reader(f, delimiter='\t'):
        print(row)
The result I got was the following :
['R1C1', 'R1C2', 'R1C3']
['R2C1', 'R2C2', 'R2C3']
['R3C1', 'R3C2\tR3C3\nR4C1\tR4C2\tR4C3\n']
It turns out that when Python finds a field whose first character is a quotation mark and there is no closing quotation mark, it will include all of the following content as part of the same field.
My question: What is the best approach for reading all rows in the file properly? Please note that I'm using Python 3.8.5 and the script should be able to read huge files (2 GB or more), so memory usage and performance should also be considered.
Thanks!
Honestly, if you're dealing with that much data, it'd be best to go in and clean it first. And if possible, fix whatever process is producing your bad data in the first place.
I haven't tested with a large file, but you may just be able to replace " characters as you read lines, assuming there's never a case where they're valid characters:
import csv
with open("test.csv") as f:
    line_generator = (line.replace('"', '') for line in f)
    for row in csv.reader(line_generator, delimiter='\t'):
        print(row)
Output:
['R1C1', 'R1C2', 'R1C3']
['R2C1', 'R2C2', 'R2C3']
['R3C1', 'R3C2', 'R3C3']
['R4C1', 'R4C2', 'R4C3']
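As an alternative sketch (not part of the answer above), the reader can be told to give quote characters no special meaning by passing quoting=csv.QUOTE_NONE. The stray quotation mark then stays in the field as a literal character instead of swallowing the following lines. The in-memory sample string below stands in for test.csv and assumes the columns are tab-separated, as in the question.

```python
import csv
import io

# Stand-in for test.csv: tab-separated, with one unmatched quotation mark.
sample = (
    "R1C1\tR1C2\tR1C3\n"
    "R2C1\tR2C2\tR2C3\n"
    'R3C1\t"R3C2\tR3C3\n'
    "R4C1\tR4C2\tR4C3\n"
)

# QUOTE_NONE: the reader performs no special processing of quote characters.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)
for row in rows:
    print(row)
```

Like the generator approach, this reads the input one line at a time, so it should not need to hold a large file in memory. The trade-off is that legitimately quoted fields, if any exist, are no longer unquoted.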

Collecting comments from Reddit, outputting to CSV file

I'm trying to scrape comments from a certain submission on Reddit and output them to a CSV file.
import praw
import csv
reddit = praw.Reddit(client_id='ClientID', client_secret='ClientSecret', user_agent='UserAgent')
submission = reddit.submission(id="SubmissionID")
with open('Reddit.csv', 'w') as csvfile:
    for comment in submission.comments:
        csvfile.write(comment.body)
The problem is that for each cell the comments seem to be randomly split up. I want each comment in its own cell. Any ideas on how to achieve this?
You are importing the csv library but you are not actually utilizing it. Utilize it and your problem may go away.
https://docs.python.org/3/library/csv.html#csv.DictWriter
import csv
# ...
comment = "this string was created from your code"
# ...
with open('names.csv', 'w', newline='') as csvfile:
    fieldnames = ['comment']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({'comment': comment})
To write a CSV file in Python, use the csv module, specifically csv.writer(). You import this module at the top of your code, but you never use it.
Using this in your code, this looks like:
with open('Reddit.csv', 'w', newline='') as csvfile:
    comment_writer = csv.writer(csvfile)
    for comment in submission.comments:
        comment_writer.writerow([comment.body])
Here, we use csv.writer() to create a CSV writer from the file that we've opened, and we call it comment_writer. Then, for each comment, we write another row to the CSV file. The row is represented as a list. Since we only have one piece of information to write on each row, the list contains just one item. The row is [comment.body].
The csv module takes care of making sure that values with new lines, commas, or other special characters are properly formatted as CSV values.
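A small sketch of that behaviour, using an in-memory buffer instead of Reddit.csv and made-up comment strings in place of real comment.body values: values containing line breaks, commas, and quotes are written with csv.writer and then read back unchanged with csv.reader.

```python
import csv
import io

# Made-up comment bodies standing in for comment.body values.
comments = [
    "First line\nsecond line",
    'Has a comma, and a "quote"',
    "plain",
]

# newline="" keeps the writer's line endings untranslated, as with open(..., newline='').
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
for body in comments:
    writer.writerow([body])  # one comment per row, one cell per comment

# Read the same data back and collect the single cell of each row.
buffer.seek(0)
read_back = [row[0] for row in csv.reader(buffer)]
print(read_back == comments)
```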
Note that, for some submissions with many comments, your PRAW code might raise an exception along the lines of 'MoreComments' object has no attribute 'body'. The PRAW docs discuss this, and I encourage you to read that to learn more, but know that we can avoid this happening in code by further modifying our loop:
from praw.models import Comment
# ...
with open('Reddit.csv', 'w', newline='') as csvfile:
    comment_writer = csv.writer(csvfile)
    for comment in submission.comments:
        # Skip MoreComments placeholders, which have no body attribute
        if isinstance(comment, Comment):
            comment_writer.writerow([comment.body])
Also, your code only gets the top level comments of a submission. If you're interested in more, see this question, which is about how to get more than just top-level comments from a submission.
I'm guessing that the cells are not being split randomly, but are being split at a comma, space, or semicolon. You can choose which character the cells are split at by using the delimiter parameter.
import csv
with open('Reddit.csv', 'w') as csvfile:
    comments = ['comment one', 'comment two', 'comment three']
    csv_writer = csv.writer(csvfile, delimiter='-')
    csv_writer.writerow(comments)

Issue with parsing csv from Django web form

I was hoping someone could help me with this. I'm getting a file from a form in Django. The file is a CSV, and I'm trying to read it with Python's csv library. The problem is that when I apply csv.reader and turn the result into a list in order to print it, I find that csv.reader is not splitting my file correctly.
Here are some images to show the problem
This is my csv file:
This is my code:
And this is the printed value of the variable file_readed:
As you can see in the picture, it seems to be splitting my file character by character with some exceptions.
I thank you for any help you can provide me.
If you are pulling from a web form, try getting the csv as a string, confirm in a print or debug tool that the result is correct, and then pass it to csv using StringIO.
from io import StringIO
import csv
# Read the uploaded file's bytes and decode them to a string
csv_string = form.files['carga_cie10'].read().decode(encoding="ISO-8859-1")
csv_file = StringIO(csv_string)
reader = csv.reader(csv_file, delimiter=',', quotechar='"')
for row in reader:
    print(row)
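To illustrate why wrapping the string matters, here is a minimal sketch with a made-up two-line CSV string (the column names and values are assumptions, not the asker's data). csv.reader iterates over whatever it is given: a plain str yields one character at a time, which matches the character-by-character splitting described in the question, while a StringIO yields whole lines.

```python
import csv
from io import StringIO

# Made-up CSV content standing in for the uploaded file.
csv_string = "code,description\nA00,Cholera\n"

# Passing the str directly: each character is treated as its own "line".
char_rows = list(csv.reader(csv_string))

# Wrapping it in StringIO: the reader receives one line per iteration.
line_rows = list(csv.reader(StringIO(csv_string)))

print(len(char_rows), "rows when passing the raw string")
print(line_rows)
```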
Another thing to check is the line endings. Note that csv.reader() ignores the lineterminator argument (it only affects the writer, where it defaults to \r\n) and recognises \r or \n as end-of-line. Inspect the string you get from the web form to confirm how its lines are terminated.
That CSV does not seem right: some lines have more fields than others.
Since CSV stands for Comma Separated Values, each line needs to have the same number of fields separated by commas, or else parsing will be thrown off.
I see from your lines that you're probably expecting 3 columns, but instead you have lines with 2 or 4 fields, and some of them have an opening " in one field, then a comma, then the closing " in the next field.
Check whether your script works with other CSV files.
Most likely you need to specify the delimiter. Since you haven't specified it explicitly, I guess the reader is confused.
csv.reader(csvfile, delimiter=',')
However, since there are quoted fields that contain the comma delimiter, you may also need to change the delimiter used when the CSV file is created, to a tab or something else.
The problem is here:
print(list(file_readed))
'list' is causing printing of every element within the csv as an individual unit.
Try this instead:
with open('carga_cie10') as f:
    reader = csv.reader(f)
    for row in reader:
        print(" ".join(row))
Edit:
import pandas as pd
file_readed = pd.read_csv(file_csv)
print(file_readed)
The output should look clean. Pandas is highly useful in situations where data needs to be read, manipulated, changed, etc.

How to read CSV with column with more than one element in Python

I have the following CSV file:
id;name;duration;predecessors;
10;A;7;;
20;B;10;10;
25;B2;3;10;
30;C;5;10;
40;D;5;20,30, 25;
That is, in the fourth column of the last row I have three elements (20, 30, 25) separated by commas.
I have the following code:
csv_file = open(path_to_csv, 'r')
csv_file_reader = csv.reader(csv_file, delimiter=',')
first_row = True
for row in csv_file_reader:
    if not first_row:
        print(row)
    else:
        first_row = False
but I get a weird output:
['10;A;7;;']
['20;B;10;10;']
['25;B2;3;10;']
['30;C;5;10;']
['40;D;5;20', '30', ' 25;']
Any ideas?
Thanks in advance
You have specified CSV in your description, which stands for Comma Separated Values. However, your data uses semicolons.
Consider specifying the delimiter as ; for the CSV library:
with open(path_to_csv, 'r') as csv_file:
    csv_file_reader = csv.reader(csv_file, delimiter=';')
    ...
And while we're here, note the change to using the with statement to open the file. The with statement lets you open the file robustly: no matter what happens (exception, quit, etc.), Python guarantees that the file will be closed and all resources accounted for. You don't need to close the file, just exit the block (unindent). It's "Pythonic" and a good habit to get into.
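Putting this together, a minimal sketch: read the rows with ; as the delimiter, then split the predecessors column on commas in a second step. The sample data is the file from the question held in a string; the tasks dictionary and its layout are assumptions made for illustration.

```python
import csv
import io

# The file from the question, held in memory for the example.
data = """id;name;duration;predecessors;
10;A;7;;
20;B;10;10;
25;B2;3;10;
30;C;5;10;
40;D;5;20,30, 25;
"""

reader = csv.reader(io.StringIO(data), delimiter=";")
next(reader)  # skip the header row

tasks = {}
for row in reader:
    # Each line ends with ';', so the row has a trailing empty field; keep the first four.
    task_id, name, duration, predecessors = row[:4]
    # Split the fourth column on commas and drop surrounding spaces and empty entries.
    tasks[task_id] = [p.strip() for p in predecessors.split(",") if p.strip()]

print(tasks["40"])
```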
@Antonio, I appreciate the above answer. As we know, CSV is a file with comma-separated values, and Python's csv module works on this basis by default.
No problem, you can still read the file without using the csv module.
Based on the input provided in your question, I have written another simple solution that reads the CSV without using any Python module (it's OK for simple tasks).
Please read it, try it, and comment if you are not satisfied with the code or if it fails for some of your test cases. I will modify it and make it work.
Data.csv
id;name;duration;predecessors;
10;A;7;;
20;B;10;10;
25;B2;3;10;
30;C;5;10;
40;D;5;20,30, 25;
Now, have a look at the code below (it finds and prints all the lines whose 4th column has more than one element):
with open("Data.csv") as csv_file:
    for line in csv_file.readlines()[1:]:
        arr = line.strip().split(";")
        if len(arr[3].split(",")) > 1:
            print(line)  # 40;D;5;20,30, 25;

Write matching rows in csvfile to a new csv file using Python

I am new to Python and I am trying to reduce the records in a CSV file by matching specific strings. I want to write the matching rows to a new CSV file.
Here is an example dataset:
What I am trying to do is go through all of the rows, search for specific keywords (e.g. only the rows containing WARRANT ARREST, as can be seen in the image), and write the matching rows to a new CSV file.
Here is my code so far:
import csv
with open('test.csv', 'a') as myfile:
    with open('train3.csv', 'rb') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=',')
        for r in spamreader:
            for field in row:
                if field == "OTHER OFFENSES":
                    myfile.write(r)
test.csv is empty and train3 contains all the records.
You can often learn a lot about what's going on by simply adding some else statements. For instance, after if field == "OTHER OFFENSES":, you could write else: print(field) or else: print(r). It might become obvious why your comparison fails once you see the actual data.
There might also be a newline character after each row that's messing up the comparison (that was the cause of the problem the last time someone asked about this and I answered). Perhaps Python sees OTHER OFFENSES\n, which does not equal OTHER OFFENSES. To match these, use a less strict comparison or strip() the field.
Try replacing if field == "OTHER OFFENSES" with if "OTHER OFFENSES" in field:. When you do == you're asking for an exact match whereas something in something_else will search the whole line of text for something.
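The difference between the three comparisons can be seen in a short sketch. The field value with a trailing newline is an assumed example of what the data might contain, not taken from the asker's file.

```python
# Assumed example: a field that carries a trailing newline.
field = "OTHER OFFENSES\n"
keyword = "OTHER OFFENSES"

exact_match = field == keyword             # strict equality, sensitive to the newline
stripped_match = field.strip() == keyword  # equality after removing surrounding whitespace
substring_match = keyword in field         # keyword found anywhere in the field

print(exact_match, stripped_match, substring_match)
```

Stripping keeps the match exact apart from whitespace, whereas the in test would also accept longer fields that merely contain the keyword.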
Try the following approach, it is a bit difficult to test as your data cannot be copy/pasted:
import csv
with open('test.csv', 'a', newline='') as f_outputcsv, open('train3.csv', 'r') as f_inputcsv:
    csv_spamreader = csv.reader(f_inputcsv)
    csv_writer = csv.writer(f_outputcsv)
    for row in csv_spamreader:
        for field in row:
            if field == "WARRANT ARREST":
                csv_writer.writerow(row)
                break
This uses a csv.writer instance to write whole rows back to your output file.
