Selectively feeding the Python csv reader with lines

I have a csv file with a few patterns. I only want to selectively load lines into Python's csv reader. As far as I can tell, csv.reader only takes a file object. Is there a way to get around this?
In other words, what I need is:
with open('filename') as f:
    for line in f:
        if condition(line):
            record = csv.reader(line)
But currently the csv module fails if it is given a line instead of a file object.

From the csv.reader docstring:
csvfile can be any object which supports the iterator protocol and returns a string each time its __next__() method is called
You can feed csv.reader with a generator iterator that yields only the selected rows.
with open('filename') as f:
    lines = (line for line in f if condition(line))
    for record in csv.reader(lines):
        do_something()
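For example, a minimal runnable sketch (the condition here, keeping only lines that contain a pattern, is hypothetical):
import csv

def condition(line):
    return 'pattern' in line  # hypothetical filter

with open('filename') as f:
    lines = (line for line in f if condition(line))
    for record in csv.reader(lines):
        print(record)  # each record is a list of fields from a kept line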

To read the file as a stream you can use io.open, which has the signature:
io.open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True)
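A sketch combining io.open with the generator approach above (condition is again a hypothetical filter):
import csv
import io

def condition(line):
    return not line.startswith('#')  # hypothetical filter

with io.open('filename', mode='r', encoding='utf-8') as f:
    # the file is consumed lazily, one line at a time, so large files stream
    for record in csv.reader(line for line in f if condition(line)):
        print(record)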

Alternatively, shlex can tokenize a quoted CSV line while honoring embedded commas:
import shlex

lex = shlex.shlex('"sreeraag","100,ABC,XYZ",112', posix=True)
lex.whitespace += ','        # treat commas as token separators
lex.whitespace_split = True  # split only on whitespace characters
print(list(lex))
yields
['sreeraag', '100,ABC,XYZ', '112']
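For comparison, csv.reader already handles the same quoted line natively, so shlex is only needed when you want its extra tokenizing behavior:
import csv

print(next(csv.reader(['"sreeraag","100,ABC,XYZ",112'])))
# ['sreeraag', '100,ABC,XYZ', '112']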

Found a solution: as csv expects an object that supports __next__(), I'm using the StringIO class to wrap each string in a StringIO object, which handles __next__() and returns one line every time for the csv reader:
with open('filename') as f:
    for line in f:
        if condition(line):
            record = csv.reader(StringIO.StringIO(line))
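On Python 3 the same idea works with io.StringIO; note that csv.reader returns a reader object, so you still pull the parsed row out with next() (a sketch, reusing the hypothetical condition from the question):
import csv
from io import StringIO

with open('filename') as f:
    for line in f:
        if condition(line):  # the question's filter
            record = next(csv.reader(StringIO(line)))  # list of fields for this line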

```
with open("xx.csv") as f:
    lines = f.readlines()  # renamed from csv so the name doesn't shadow the csv module
print(lines[0])
```
Life is short, you need pandas:
pip install pandas
```
import pandas as pd

df = pd.read_csv("xx.csv")  # read_csv accepts a file path or a URL
df.iloc[0]    # first row (df.ix is deprecated; use iloc for positional indexing)
df.iloc[1]
df.iloc[1:3]  # rows 1 and 2
```
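To keep the original question's selective loading and still end up in pandas, one sketch is to filter the lines first and hand the result to read_csv through a StringIO (the comment-skipping condition is hypothetical):
```
import io
import pandas as pd

with open("xx.csv") as f:
    filtered = "".join(line for line in f if not line.startswith("#"))

df = pd.read_csv(io.StringIO(filtered))
```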

Related

How to open a json.gz file and return to dictionary in Python

I have downloaded a compressed json file and want to open it as a dictionary.
I used json.load but the data type still gives me a string.
I want to extract a keyword list from the json file. Is there a way I can do it even though my data is a string?
Here is my code:
import gzip
import json

with gzip.open("19.04_association_data.json.gz", "r") as f:
    data = f.read()

with open('association.json', 'w') as json_file:
    json.dump(data.decode('utf-8'), json_file)

with open("association.json", "r") as read_it:
    association_data = json.load(read_it)

print(type(association_data))
# The actual output is 'str' but I expect 'dict'
In the first with block you already have the uncompressed JSON text, so there is no need to write it out and open it a second time. (Dumping a str with json.dump serializes it as a JSON string, which is why loading it back gives you a str.) Parse it directly:
import gzip
import json

with gzip.open("19.04_association_data.json.gz", "r") as f:
    data = f.read()

j = json.loads(data.decode('utf-8'))
print(type(j))
Open the file using the gzip package from the standard library (docs), then read it directly into json.loads(). (The encoding parameter of json.loads was removed in Python 3.9, so decode the bytes explicitly.)
import gzip
import json

with gzip.open("19.04_association_data.json.gz", "rb") as f:
    data = json.loads(f.read().decode("utf-8"))
To read from a json.gz, you can use the following snippet:
import json
import gzip

with gzip.open("file_path_to_read", "rt") as f:
    expected_dict = json.load(f)
The result is of type dict.
In case you want to write to a json.gz, you can use the following snippet:
import json
import gzip

with gzip.open("file_path_to_write", "wt") as f:
    json.dump(expected_dict, f)
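Once the data is loaded as a dict, the keyword list the question asks about is just its keys (a sketch, assuming the top-level JSON value is an object):
import gzip
import json

with gzip.open("19.04_association_data.json.gz", "rt") as f:
    data = json.load(f)

keywords = list(data.keys())  # assumes the top-level JSON value is an object
print(keywords[:10])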

Converting a .csv.gz to .csv in Python 2.7

I have read the documentation and a few additional posts on SO and other various places, but I can't quite figure out this concept:
When you call csvFilename = gzip.open(filename, 'rb') and then reader = csv.reader(open(csvFilename)), is that reader not a valid csv file?
I am trying to solve the problem outlined below, and am getting a coercing to Unicode: need string or buffer, GzipFile found error on lines 41 and 7 (highlighted below), leading me to believe that gzip.open and csv.reader do not work as I had previously thought.
Problem I am trying to solve
I am trying to take a results.csv.gz and convert it to a results.csv so that I can turn the results.csv into a python dictionary and then combine it with another python dictionary.
File 1:
alertFile = payload.get('results_file')
alertDataCSV = rh.dataToDict(alertFile) # LINE 41
alertDataTotal = rh.mergeTwoDicts(splunkParams, alertDataCSV)
Calls File 2:
import gzip
import csv

def dataToDict(filename):
    csvFilename = gzip.open(filename, 'rb')
    reader = csv.reader(open(csvFilename))  # LINE 7
    alertData = {}
    for row in reader:
        alertData[row[0]] = row[1:]
    return alertData

def mergeTwoDicts(dictA, dictB):
    dictC = dictA.copy()
    dictC.update(dictB)
    return dictC
*edit: also forgive my non-PEP style of naming in Python
gzip.open returns a file-like object (the same kind of thing plain open returns), not the name of a decompressed file. Simply pass the result directly to csv.reader and it will work: csv.reader will receive the decompressed lines. csv does expect text, though, so on Python 3 you need to open the gzip file in text mode. (On Python 2, 'rb' is fine; the gzip module doesn't deal with encodings there, but then, neither does the csv module.) Simply change:
csvFilename = gzip.open(filename, 'rb')
reader = csv.reader(open(csvFilename))
to:
# Python 2
csvFile = gzip.open(filename, 'rb')
reader = csv.reader(csvFile)  # no reopening involved

# Python 3
csvFile = gzip.open(filename, 'rt', newline='')  # open in text mode, not binary, with no line-ending translation
reader = csv.reader(csvFile)  # no reopening involved
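Putting that together, a corrected dataToDict might look like this (a Python 3 sketch, not verbatim from the question):
import csv
import gzip

def dataToDict(filename):
    # text-mode gzip.open yields decompressed lines, which csv.reader iterates directly
    with gzip.open(filename, 'rt', newline='') as csvFile:
        return {row[0]: row[1:] for row in csv.reader(csvFile)}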
The following worked for me on python==3.7.9:
import gzip

my_filename = 'my_compressed_file.csv.gz'

with gzip.open(my_filename, 'rt') as gz_file:
    data = gz_file.read()  # read decompressed data

with open(my_filename[:-3], 'wt') as out_file:
    out_file.write(data)  # write decompressed data
my_filename[:-3] strips the trailing '.gz' so the output file keeps the original .csv name.
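For large files, an alternative sketch streams the decompression with shutil.copyfileobj instead of reading everything into memory at once:
import gzip
import shutil

my_filename = 'my_compressed_file.csv.gz'
with gzip.open(my_filename, 'rb') as gz_file, open(my_filename[:-3], 'wb') as out_file:
    shutil.copyfileobj(gz_file, out_file)  # copies in chunks, constant memory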

Passing a file object through the csv reader in Python 3.6

Sample fileobject data contains the following,
b'QmFyY29kZSxRdHkKQTIzMjMsMTAKQTIzMjQsMTUKNjUxMDA1OTUzMjkyNSwxMgpBMjMyNCwxCkEyMzI0LDEKQTIzMjMsMTAK'
And the Python file contains the following code:
import base64
import csv
from io import BytesIO

string_data = BytesIO(base64.decodestring(csv_rec))
read_file = csv.reader(string_data, quotechar='"', delimiter=',')
next(read_file)
When I run the above code in Python, I get the following exception:
_csv.Error: iterator should return strings, not int (did you open the file in text mode?)
How can I open bytes data in text mode?
You are almost there. Indeed, csv.reader expects an iterator that returns strings (not bytes). Such an iterator is provided by BytesIO's sibling, io.StringIO:
import base64
import csv
from io import StringIO

csv_rec = b'QmFyY29kZSxRdHkKQTIzMjMsMTAKQTIzMjQsMTUKNjUxMDA1OTUzMjkyNSwxMgpBMjMyNCwxCkEyMzI0LDEKQTIzMjMsMTAK'
bytes_data = base64.decodebytes(csv_rec)  # decodestring is deprecated and was removed in Python 3.9
string_data = StringIO(bytes_data.decode())  # decode() turns the bytes into a str
read_file = csv.reader(string_data, quotechar='"', delimiter=',')
next(read_file)
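For reference, the base64 payload decodes to a small Barcode,Qty table, so iterating the reader yields one list per row:
import base64
import csv
from io import StringIO

csv_rec = b'QmFyY29kZSxRdHkKQTIzMjMsMTAKQTIzMjQsMTUKNjUxMDA1OTUzMjkyNSwxMgpBMjMyNCwxCkEyMzI0LDEKQTIzMjMsMTAK'
rows = list(csv.reader(StringIO(base64.decodebytes(csv_rec).decode())))
print(rows[0])  # header row: ['Barcode', 'Qty']
print(rows[1])  # first data row: ['A2323', '10']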

How to use string as input for csv reader without storing it to file

I'm trying to loop through rows in a csv file. I get csv file as string from a web location. I know how to create csv.reader using with when data is stored in a file. What I don't know is, how to get rows using csv.reader without storing string to a file. I'm using Python 2.7.12.
I've tried to create a StringIO object like this:
from StringIO import StringIO

csv_data = "some_string\nfor_example"
with StringIO(csv_data) as input_file:
    csv_reader = reader(csv_data, delimiter=",", quotechar='"')
However, I'm getting this error:
Traceback (most recent call last):
  File "scraper.py", line 228, in <module>
    with StringIO(csv_data) as input_file:
AttributeError: StringIO instance has no attribute '__exit__'
I understand that the StringIO class doesn't have the __exit__ method that gets called when a with block finishes with the object.
My question is: how do I do this correctly? I suppose I could alter the StringIO class by subclassing it and adding an __exit__ method, but I suspect there is an easier solution.
Update:
Also, I've tried different combinations that came to my mind:
with open(StringIO(csv_data)) as input_file:
with csv_data as input_file:
but, of course, none of those worked.
>>> import csv
>>> csv_data = "some,string\nfor,example"
>>> result = csv.reader(csv_data.splitlines())
>>> list(result)
[['some', 'string'], ['for', 'example']]
You should use the io module instead of the StringIO one, because io.BytesIO for byte strings and io.StringIO for Unicode strings both support the context-manager interface and can be used in with statements (this is Python 2 code, hence BytesIO for a plain str and the print statement):
from io import BytesIO
from csv import reader

csv_data = "some_string\nfor_example"
with BytesIO(csv_data) as input_file:
    csv_reader = reader(input_file, delimiter=",", quotechar='"')
    for row in csv_reader:
        print row
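On Python 3, where the csv module expects text, the equivalent sketch uses io.StringIO:
import csv
from io import StringIO

csv_data = "some,string\nfor,example"
with StringIO(csv_data) as input_file:
    for row in csv.reader(input_file, delimiter=",", quotechar='"'):
        print(row)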
If you like context managers, you can use tempfile instead:
import tempfile

with tempfile.NamedTemporaryFile(mode='w') as t:
    t.write(csv_data)  # write the data itself, not the literal string 'csv_data'
    t.seek(0)
    csv_reader = reader(open(t.name), delimiter=",", quotechar='"')
The advantage over passing csv_data.splitlines() straight to the csv reader is that you can write a file of any size and then safely read it back without memory issues.
The file is closed and deleted automatically.

Replace and overwrite instead of appending

I have the following code:
import re
#open the xml file for reading:
file = open('path/test.xml','r+')
#convert to string:
data = file.read()
file.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>",data))
file.close()
where I'd like to replace the old content that's in the file with the new content. However, when I execute my code, the file "test.xml" is appended to, i.e. I have the old content followed by the new "replaced" content. What can I do in order to delete the old stuff and only keep the new?
You need to seek to the beginning of the file before writing, and then use file.truncate() if you want to do an in-place replace:
import re

myfile = "path/test.xml"

with open(myfile, "r+") as f:
    data = f.read()
    f.seek(0)
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
    f.truncate()
The other way is to read the file and then open it again with open(myfile, 'w'):
with open(myfile, "r") as f:
    data = f.read()

with open(myfile, "w") as f:
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
Neither truncate nor open(..., 'w') will change the inode number of the file (I tested twice, once with Ubuntu 12.04 NFS and once with ext4).
By the way, this is not really related to Python. The interpreter calls the corresponding low level API. The method truncate() works the same in the C programming language: See http://man7.org/linux/man-pages/man2/truncate.2.html
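A quick sketch to check the inode claim yourself on a POSIX system:
import os

before = os.stat('path/test.xml').st_ino
with open('path/test.xml', 'w') as f:  # 'w' truncates the existing file in place
    f.write('new content')
after = os.stat('path/test.xml').st_ino
assert before == after  # same inode: the file was truncated, not recreated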
file = 'path/test.xml'
with open(file, 'w') as filetowrite:
    filetowrite.write('new content')
Open the file in 'w' mode and you will be able to replace its current text, saving the file with the new contents. Note that 'w' truncates the file as soon as it is opened, so read any old contents you need before opening it this way.
Using truncate(), the solution could be:
import re

# open the xml file for reading and writing:
with open('path/test.xml', 'r+') as f:
    # convert to string:
    data = f.read()
    f.seek(0)
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
    f.truncate()
import os  # must import this library

if os.path.exists('TwitterDB.csv'):
    os.remove('TwitterDB.csv')  # this deletes the file
else:
    print("The file does not exist")  # add this to prevent errors
I had a similar problem, and instead of overwriting my existing file using the different 'modes', I just deleted the file before using it again, so that it would be as if I were appending to a new file on each run of my code.
See How to Replace String in File; it works in a simple way and is an answer that uses replace:
fin = open("data.txt", "rt")
fout = open("out.txt", "wt")

for line in fin:
    fout.write(line.replace('pyton', 'python'))

fin.close()
fout.close()
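The same pattern with context managers closes both files automatically:
with open("data.txt", "rt") as fin, open("out.txt", "wt") as fout:
    for line in fin:
        fout.write(line.replace('pyton', 'python'))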
In my case the following code did the trick:
import json

# w+ mode creates the file if it does not exist and overwrites the existing content
with open("output.json", "w+") as outfile:
    json.dump(result_plot, outfile)
Using the Python 3 pathlib library:
import re
from pathlib import Path
import shutil
shutil.copy2("/tmp/test.xml", "/tmp/test.xml.bak") # create backup
filepath = Path("/tmp/test.xml")
content = filepath.read_text()
filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))
A similar method using a different approach to backups; note that Path.rename returns the new path, which is what you must read from, since the old name no longer exists after the rename:
import re
from pathlib import Path

filepath = Path("/tmp/test.xml")
backup = filepath.rename(filepath.with_suffix('.bak'))  # different approach to backups
content = backup.read_text()
filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))
