Regex to remove doubled double quotes from CSV - python

I have an Excel sheet that has a lot of data in one column in the form of a Python dictionary from a SQL database. I don't have access to the original database and I can't import the CSV back into SQL with the local infile command because the keys/values on each row of the CSV are not in the same order. When I export the Excel sheet to CSV I get:
"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"
"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"
What is the best way to remove the " before and after the curly brackets as well as the extra " around the keys/values?
I also need to leave the integers alone that don't have quotes around them.
I am then trying to import this into Python with the json module so that I can print specific keys, but I can't import them with the doubled double quotes. I ultimately need the data saved in a file that looks like:
{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}
Any help is most appreciated!

Easy:
text = re.sub(r'"(?!")', '', text)
Given the input file: TEST.TXT:
"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"
"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"
The script:
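import re
# the with statement closes the file once it has been read
with open("TEST.TXT", "r") as f:
    text_in = f.read()
# remove every double quote that is not followed by another double quote
text_out = re.sub(r'"(?!")', '', text_in)
print(text_out)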
produces the following output:
{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}
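Since the end goal is to load the rows with the json module, a quick way to check the result is to parse each cleaned line. The following is only a minimal sketch: it assumes every line holds exactly one JSON object and that no value is an empty string (an empty string would be exported as four consecutive quotes, which this regex does not handle). The sample text is inlined here instead of being read from TEST.TXT.

```python
import json
import re

# Sample rows as exported by Excel (inlined here for illustration).
text_in = (
    '"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"\n'
    '"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"\n'
)

# Drop every double quote that is not immediately followed by another one.
text_out = re.sub(r'"(?!")', '', text_in)

# Each remaining non-empty line should now be a valid JSON object.
records = [json.loads(line) for line in text_out.splitlines() if line]

for record in records:
    print(record["first_name"], record["age"])
```

If json.loads raises an error on some line, that line probably contains a case the regex does not cover.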

This should do it:
import re
with open('old.csv') as old, open('new.csv', 'w') as new:
    new.writelines(re.sub(r'"(?!")', '', line) for line in old)

If the input file is just as shown, and of the small size you mention, you can load the whole file in memory, make the substitutions, and then save it. IMHO, you don't need a RegEx to do this. The easiest to read code that does this is:
with open(filename) as f:
    text = f.read()
# undo the doubled quotes first, then strip the quotes wrapping each object
text = text.replace('""', '"')
text = text.replace('"{', '{')
text = text.replace('}"', '}')
with open(filename, "w") as f:
    f.write(text)
I tested it with the sample input and it produces:
{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}
Which is exactly what you want.
If you want, you can also pack the code and write
with open(inputFilename) as fin:
    with open(outputFilename, "w") as fout:
        fout.write(fin.read().replace('""','"').replace('"{','{').replace('}"','}'))
but I think the first one is much clearer and both do exactly the same.

I think you are overthinking the problem. Why not just replace the data?
l = list()
with open('foo.txt') as f:
    for line in f:
        l.append(line.replace('""','"').replace('"{','{').replace('}"','}'))
s = ''.join(l)
print(s) # or save it to file
It generates:
{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}
Storing the intermediate lines in a list and then calling .join improves performance, as explained in "Good way to append to a string".

You can actually use the csv module and regex to do this:
st='''\
"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"
"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"\
'''
import csv, re
data=[]
reader=csv.reader(st, dialect='excel')
for line in reader:
    data.extend(line)
# put quotes back around every word, then add a line break after each object
s=re.sub(r'(\w+)',r'"\1"',''.join(data))
s=re.sub(r'({[^}]+})',r'\1\n',s).strip()
print(s)
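Prints
{"first_name":"John","last_name":"Smith","age":"30"}
{"first_name":"Tim","last_name":"Johnson","age":"34"}
Note that the first substitution also wraps the integers in quotes, so age comes out as the string "30" rather than the number 30 the question asked for.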
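If the integers have to stay unquoted, the csv module on its own may be enough, because undoing doubled quotes is part of what a CSV reader does. This is a sketch rather than the answerer's method: it assumes each row holds one quoted field, and it passes csv.reader a list of lines (via splitlines) instead of the raw string.

```python
import csv
import json

st = (
    '"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"\n'
    '"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"\n'
)

# csv.reader expects an iterable of lines, so split the text first.
reader = csv.reader(st.splitlines(), dialect='excel')

# Each row has a single field: the JSON text with its quotes unescaped.
cleaned = [row[0] for row in reader if row]

# Parsing confirms the fields are valid JSON and that age is still a number.
records = [json.loads(field) for field in cleaned]

print('\n'.join(cleaned))
```

With a real export, the same reader can be given the open file object directly, provided the file is opened with newline=''.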

Related

How to replace characters in a csv file

I'm doing some measurements in the lab and want to transform them into some nice Python plots. The problem is the way the software exports CSV files, as I can't find a way to properly read the numbers. It looks like this:
-10;-0,0000026
-8;-0,00000139
-6;-0,000000546
-4;-0,000000112
-2;-5,11E-09
0,0000048;6,21E-09
2;0,000000318
4;0,00000304
6;0,0000129
8;0,0000724
10;0,000268
Separation by ; is fine, but I need every , to be a .
Ideally I would like Python to be able to read numbers such as 6.21E-09 as well, but I should be able to fix that in Excel...
My main issue: Change every , to . so Python can read them as a float.
The simplest way is to treat each value as a string and use the .replace() method to swap the characters. For example:
txt = "0,0000048;6,21E-09"
txt = txt.replace(',', '.')
You could also read the CSV file with a library (I don't know how you are reading the file); depending on the library, you can change the delimiter (to ; in your case). CSV stands for comma-separated values and, as the name implies, it separates columns by ',' by default.
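You can do whatever you want in Python, for example:
import csv
# read the semicolon-separated rows
with open('path_to_csv_file', 'r', newline='') as csv_file:
    data = list(csv.reader(csv_file, delimiter=';'))
# both columns can contain a decimal comma, so convert each one to float
data = [(float(row[0].replace(',', '.')), float(row[1].replace(',', '.'))) for row in data]
# write the converted values back
with open('path_to_csv_file', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=';')
    writer.writerows(data)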
You could also use a regex to match every ',' in the text and replace each match with '.'.
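On the side question about values such as 6,21E-09: once the comma is replaced, float() accepts scientific notation as well, so no extra step in Excel should be needed. A small sketch with a few rows from the question inlined (the helper name to_float is made up for this example):

```python
import csv
import io

# A few rows from the question, inlined for illustration.
raw = "-10;-0,0000026\n-2;-5,11E-09\n0,0000048;6,21E-09\n"


def to_float(field):
    """Convert a number written with a decimal comma to a float."""
    return float(field.replace(',', '.'))


# io.StringIO stands in for the opened CSV file.
rows = [
    tuple(to_float(field) for field in row)
    for row in csv.reader(io.StringIO(raw), delimiter=';')
]

print(rows)
```

The resulting list of (x, y) tuples can be unpacked with zip(*rows) to get separate sequences for plotting.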

Want to convert the csv file from line break mode to be separated by comma

Currently the csv file is saved in line break mode. But it should be separated by commas so that I can input the data as an array.
The current csv file:
test#eaxmple.com
test#eaxmple.com
test#eaxmple.com
The ideal csv file:
test#eaxmple.com, test#eaxmple.com, test#eaxmple.com
The code:
def get_addresses():
    with open('./addresses.csv') as f:
        addresses_file = csv.reader(f)
        # Need to be converted
How can I convert it? I hope to use Python.
I tried this:
with open('./addresses.txt') as input, open('./addresses.csv', 'w') as output:
    output.write(','.join(input.readlines()))
    output.write('\n')
the result:
test#eaxmple.com
,test#eaxmple.com
,test#eaxmple.com
with open('./addresses.txt') as f:
    print(",".join(f.read().splitlines()))
Load the original file into pandas using:
import pandas as pd
df = pd.read_csv({YOUR_FILE}, escapechar='\\')
Then export it back to .csv (by default this will be comma separated).
df.to_csv({YOUR_FILE})
For this simple task, just read them into an array, then join the array on commas.
with open('./addresses.txt') as input, open('./addresses.csv', 'w') as output:
    output.write(','.join(input.read().splitlines()))
    output.write('\n')
This ignores any complications in the CSV formatting - if your data could contain commas (which are reserved as the field separator) or double quotes (which are reserved for quoting other reserved characters) you will want to switch to the proper csv module for output and perhaps for input.
Overwriting your input file is also an unnecessary complication, so I suggest you rename the input file to addresses.txt and use addresses.csv only for output.
Demo: https://repl.it/repls/AdequateStunningVideogames
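To illustrate the caveat about reserved characters, here is a rough sketch of the same join done with csv.writer, which adds quotes only where a field needs them. The second address is made up and contains a comma on purpose, and io.StringIO stands in for the output file.

```python
import csv
import io

# Hypothetical input lines; the second one contains a comma on purpose.
lines = ["test#eaxmple.com\n", "doe, jane#eaxmple.com\n", "test#eaxmple.com\n"]

# Strip the line breaks so each address becomes one field.
addresses = [line.rstrip("\n") for line in lines]

# Stands in for open('./addresses.csv', 'w', newline='').
output = io.StringIO()
csv.writer(output).writerow(addresses)

print(output.getvalue())

# Reading the row back returns the original three fields.
round_trip = next(csv.reader(io.StringIO(output.getvalue())))
```

A plain ','.join() would have turned the second address into two fields when the file is read back.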
Another common trick is to read one line at a time, and write a separator before each output except the first. This is more scalable for large input files.
with open('./addresses.txt') as input, open('./addresses.csv', 'w') as output:
    separator = '' # for first line
    for line in input:
        output.write(separator)
        output.write(line.rstrip('\n')) # drop the line break before writing
        separator = ',' # for subsequent input lines
    output.write('\n')

Convert txt file to mixed delimited output using python

I have quite a messy txt file which I need to convert to a dataframe to use as reference data. An Excerpt is shown below:
http://amdc.in2p3.fr/nubase/nubase2016.txt
I've cleaned it up the best I can but to cut a long story short I would like to space delimit most of each line and then fixed delimit the last column. i.e. ignore the spaces in the last section.
Cleaned Data Text File
Can anyone point me in the right direction of a resource which can do this? Not sure if Pandas copes with this?
Kenny
P.S. I have found some great resources to clean up the multiple whitespaces and replace the line breaks. Sorry can't find the original reference, so see attached.
import re
with open("Input.txt", "rt") as fin, open("Ouput.txt", "wt") as fout:
    for line in fin:
        # collapse runs of spaces into one and trim the line
        fout.write(re.sub(' +', ' ', line).strip() + "\n")
What I would do is very simple: clean up the data as much as possible and then convert it to a csv file, because those are easy to use. Then I would load it step by step into a pandas dataframe and change it where needed.
import csv
with open("NudatClean.txt") as f:
    text = f.readlines()
with open('dat.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    for i in text:
        # split on spaces and drop the empty strings left by repeated spaces
        l = i.strip().split(' ')
        row = []
        for a in l:
            if a != '':
                row.append(a)
        print(row)
        writer.writerow(row)
That should do the job for a start. I don't know exactly what you want to remove, but the rest should be pretty clear from here.
The way I managed to do this was split the csv into two parts then recombine. Not particularly elegant but did the job I needed.
Split by Column
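One way to keep the spaces in the last section without splitting and recombining is the maxsplit argument of str.split. This is only a sketch: the sample lines and the number of leading columns (N_LEADING) are invented for illustration and would need to match the real file.

```python
# Hypothetical cleaned lines: three space-separated fields, then free text.
lines = [
    "A1 10 2.5 first entry with spaces\n",
    "B2 20 7.25 second entry, more text\n",
]

N_LEADING = 3  # assumed number of space-delimited columns before the last one


def split_line(line, n_leading=N_LEADING):
    """Split on whitespace n_leading times; the remainder stays one field."""
    return line.strip().split(maxsplit=n_leading)


rows = [split_line(line) for line in lines]

for row in rows:
    print(row)
```

The resulting lists can then be written with csv.writer or handed to a DataFrame constructor.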

How to delete a specific line by line number in a file?

I'm trying to write a simple Python script that always deletes line number 5 in a text file and replaces it with another string, always at line 5. I looked around but couldn't find a solution. Can anyone tell me the correct way to do that? Here is what I have so far:
#!/usr/bin/env python3
import os
import sys
import fileinput
f = open('prova.js', 'r')
filedata = f.read()
f.close()
newdata = "mynewstring"
f = open('prova.js', 'w')
f.write(newdata, 5)
f.close()
basically I need to add newdata at line 5.
One possible simple solution to remove/replace 5th line of file. This solution should be fine as long as the file is not too large:
fn = 'prova.js'
newdata = "mynewstring"
with open(fn, 'r') as f:
    lines = f.read().split('\n')
# to delete the line, use "del lines[4]"
# to replace the line:
lines[4] = newdata
with open(fn,'w') as f:
    f.write('\n'.join(lines))
I will try to point you in the right direction without giving you the answer directly. As you said in your comment you know how to open a file. So after you open a file you might want to split the data by the newlines (hint: .split("\n")). Now you have a list of each line from the file. Now you can use list methods to change the 5th item in the list (hint: change the item at list[4]). Then you can convert the list into a string and put the newlines back (hint: "\n".join(list)). Then write that string to the file which you know how to do. Now, see if you can write the code yourself. Have fun!
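The split / replace / join steps described above can be wrapped in a small helper. This is a sketch only: the name replace_line and the range check are additions for illustration, and it works on a string so that it can be tried without touching a file.

```python
def replace_line(text, line_number, new_line):
    """Return text with the 1-based line_number replaced by new_line."""
    lines = text.split("\n")
    if not 1 <= line_number <= len(lines):
        raise IndexError("line %d is out of range" % line_number)
    # Lists are 0-based, so line 5 lives at index 4.
    lines[line_number - 1] = new_line
    return "\n".join(lines)


sample = "one\ntwo\nthree\nfour\nfive\nsix\n"
result = replace_line(sample, 5, "mynewstring")
print(result)
```

To apply it to a file, read the contents with f.read(), call the helper, and write the returned string back.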

Print elements of a list to a .csv file

I am reading in a csv file and dealing with each line as a list. At the end, I'd like to reprint to a .csv file, but the lines aren't necessarily even. I obviously cannot just go "print row", since this will print it as a list. How can I print it in .csv format?
Read the manual: there is a CSV writer (with an example, too). Don't print the data; store the rows and then write them to a CSV file.
http://docs.python.org/library/csv.html#csv.writer
Assuming that "row" contains a list of strings, you could try using
print(",".join(row))
Though it looks like your question was more about writing to a csv file than printing to standard output, here is an example that does the latter using StringIO:
import io
import csv
a_list_from_csv_file = [['for', 'bar'], [1, 2]]
# in-memory text buffer that the csv writer can write to
out_fd = io.StringIO()
writer = csv.writer(out_fd, delimiter='|')
for row in a_list_from_csv_file:
    writer.writerow(row)
print(out_fd.getvalue())
Like this you can use the different delimiters or escape characters as used by the csv you are reading.
What do you mean by "the lines aren't necessarily even"? Are you using a homebrew CSV parser or are you using the csv module?
If the former and there's nothing special you need to escape, you could try something like
print(",".join(['"' + x.replace('"', '""') + '"' for x in row]))
If you don't want ""s around each field, maybe make a function like escape_field() that checks if it needs to be wrapped in double quotes and/or escaped.
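A function along those lines might look like the sketch below. The name escape_field comes from the suggestion above, but the body is an assumption about what it should do: quote a field only when it contains the delimiter, a double quote, or a line break.

```python
def escape_field(value, delimiter=","):
    """Quote a field only when it contains a delimiter, quote, or newline."""
    text = str(value)
    if any(ch in text for ch in (delimiter, '"', "\n", "\r")):
        # Double any embedded quotes and wrap the whole field in quotes.
        return '"' + text.replace('"', '""') + '"'
    return text


row = ["plain", 'say "hi"', "a,b", 42]
line = ",".join(escape_field(x) for x in row)
print(line)
```

The csv module's writer applies similar minimal quoting by default, so this is mainly useful when the csv module is not being used.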
