remove unwanted quotes and comma in csv file [duplicate] - python

This question already has answers here:
Remove unwanted parts from strings in a column
(10 answers)
Closed 3 years ago.
I need to remove unwanted quotes and commas from a csv file. Sample data as below
header1, header2, header3, header4
1, "ABC", BCD, "EDG",GHT\2\TST"
The last column has some free-text values which seem to form a new column, but when the file is opened in Excel the last field looks like this:
EDG",GHT\2\TST
Please guide me in fixing this last column.
I tried this:
sed 's/","/|/g' $filename | sed 's/|",/||/g' | sed 's/|,"/|/g' | sed 's/",/ /g' | sed 's/^.//' | awk '{print substr($0, 1, length($0)-1)}' | sed 's/,/ /g' | sed 's/"/ /g' | sed 's/|/,/g' > "out_"$filename

This should find " or , in the column values and replace them with nothing. Note that .str is only available on a Series, not a DataFrame, so apply it to the affected column (header4 in the sample):
df['header4'] = df['header4'].str.replace('[",]', '', regex=True)
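As a quick sanity check, here is a minimal sketch of that replacement on a throwaway Series (the values are modeled on the question's last column; this is not from the original answer):
import pandas as pd

s = pd.Series(['"EDG",GHT\\2\\TST"', '"GDV",DHZ,\\2RS"'])
print(s.str.replace('[",]', '', regex=True))
# 0    EDGGHT\2\TST
# 1    GDVDHZ\2RS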

You can do it like this:
with open("data.txt", "r") as f:
    for line in f:
        columns = line.strip().split(", ")  # split on comma followed by a space
        columns[3] = "".join(columns[3:])  # merge columns 4 .. last
        columns[3] = columns[3].replace("\"", "").replace(",", "")  # remove unwanted characters
        del columns[4:]  # drop the now-merged surplus columns
        print("%s | %s | %s | %s" % (columns[0], columns[1], columns[2], columns[3]))
My data.txt file :
1, "ABC", BCD, "EDG",GHT\2\TST"
2, "CBA", DCB, "GDV",DHZ,\2RS"
Output :
1 | "ABC" | BCD | EDGGHT\2\TST
2 | "CBA" | DCB | GDVDHZ\2RS
This solution only works if the last column is the only one that contains commas.
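An alternative sketch using the standard csv module instead of manual splitting (my addition, not from the original answers; QUOTE_NONE keeps the stray quotes in the fields so they can be scrubbed afterwards):
import csv

with open("data.txt") as f:
    for row in csv.reader(f, skipinitialspace=True, quoting=csv.QUOTE_NONE):
        # merge everything from the 4th field onward, then scrub quotes and commas
        tail = "".join(row[3:]).replace('"', '').replace(',', '')
        print(" | ".join(row[:3] + [tail]))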

Related

How to remove duplicates without pandas?

This is the data
row1| sbkjd nsdnak ABC
row2| vknfe edcmmi ABC
row3| fjnfn msmsle XYZ
row4| sdkmm tuiepd XYZ
row5| adjck rulsdl LMN
I have already tried this using pandas and got help from Stack Overflow. But I want to be able to remove the duplicates without using the pandas library, or any library at all. Only one of the rows containing "ABC" must be chosen, only one of the rows containing "XYZ" must be chosen, and the last row is unique, so it should be chosen. How do I do this?
So, my final output should contain this:
[ row1 or row2 + row3 or row4 + row5 ]
This selects only the unique rows from your original table. If two or more rows share duplicate data, it keeps the first of them.
data = [["sbkjd", "nsdnak", "ABC"],
["vknfe", "edcmmi", "ABC"],
["fjnfn", "msmsle", "XYZ"],
["sdkmm", "tuiepd", "XYZ"],
["adjck", "rulsdl", "LMN"]]
def check_list_uniqueness(candidate_row, unique_rows):
for element in candidate_row:
for unique_row in unique_rows:
if element in unique_row:
return False
return True
final_rows = []
for row in data:
if check_list_uniqueness(row, final_rows):
final_rows.append(row)
print(final_rows)
This Bash command would do it (assuming your data is in a file called test, and that values of column 4 do not appear in other columns):
cut -d ' ' -f 4 test | tr '\n' ' ' | sed 's/\([a-zA-Z][a-zA-Z]*[ ]\)\1/\1/g' | tr ' ' '\n' | while read str; do grep -m 1 $str test; done
cut -d ' ' -f 4 test chooses the data in the fourth column
tr '\n' ' ' turns the column into a row (translating new line character to a space)
sed 's/\([a-zA-Z][a-zA-Z]*[ ]\)\1/\1/g' collapses adjacent repetitions into a single copy
tr ' ' '\n' turns the row of unique values to a column
while read str; do grep -m 1 $str test; done reads the unique words and prints the first line from test that matches that word
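If the last column is the only deduplication key, a plain set keeps the pure-Python version simpler. A minimal sketch (my own, using the same data as above):
data = [["sbkjd", "nsdnak", "ABC"],
        ["vknfe", "edcmmi", "ABC"],
        ["fjnfn", "msmsle", "XYZ"],
        ["sdkmm", "tuiepd", "XYZ"],
        ["adjck", "rulsdl", "LMN"]]

seen = set()
final_rows = []
for row in data:
    if row[-1] not in seen:  # key on the last column only
        seen.add(row[-1])
        final_rows.append(row)
print(final_rows)  # keeps the first row of each ABC/XYZ/LMN group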

Only the last line of a multiline file / string is printed

I searched a bit on Stack Overflow and stumbled on different answers, but nothing fit my situation...
I got a map.txt file like this:
+----------------------+
|                      |
|                      |
|                      |
|         test         |
|                      |
|                      |
|                      |
+------------------------------------------------+
|                      |                         |
|                      |                         |
|                      |                         |
|       Science        |       Bibliothek        |
|                      |                         |
|                      |                         |
|                      |                         |
+----------------------+-------------------------+
when I want to print it using this:
def display_map():
    s = open("map.txt").read()
    return s
print display_map()
it just prints me:
+----------------------+-------------------------+
When I try the same method with another text file like:
line 1
line 2
line 3
it works perfectly.
What am I doing wrong?
I guess this file uses the CR (Carriage Return) character (ASCII 13, or '\r') for newlines; on Windows and Linux this would just move the cursor back to column 1, but not move it down to the beginning of a new line.
(Of course such line terminators would not survive copy-paste to Stack Overflow, which is why this cannot be replicated).
You can debug strange characters in a string with repr:
print(repr(read_map()))
It will print out the string with all special characters escaped.
If you see \r in the repr'd string, you could try this instead:
def read_map():
    with open('map.txt') as f:  # with ensures the file is closed properly
        return f.read().replace('\r', '\n')  # replace \r with \n
Alternatively, supply the U flag to open for universal newlines, which converts '\r', '\r\n' and '\n' all to '\n' upon reading, regardless of the underlying operating system's conventions:
def read_map():
    with open('map.txt', 'rU') as f:
        return f.read()
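Note that the U flag exists only in Python 2; it is deprecated in Python 3 and was removed in Python 3.11. In Python 3, text mode performs universal-newline translation by default, so a plain open() already suffices:
def read_map():
    with open('map.txt') as f:  # newline=None (the default) converts \r and \r\n to \n
        return f.read()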

String to Csv file using Python

I have the following string
string = "OGC Number | LT No | Job /n 9625878 | EPP3234 | 1206545/n" and continues on
I am trying to write it to a .CSV file where it will look like this:
OGC Number | LT No | Job
------------------------------
9625878 | EPP3234 | 1206545
9708562 | PGP43221 | 1105482
9887954 | BCP5466 | 1025454
where each newline in the string is a new row, and each "|" in the string is a new column.
I am having trouble getting the formatting.
I think I need to use:
string.split('/n')
string.split('|')
Thanks.
Windows 7, Python 2.6
Untested:
import csv

text = """OGC Number | LT No | Job
------------------------------
9625878 | EPP3234 | 1206545
9708562 | PGP43221 | 1105482
9887954 | BCP5466 | 1025454"""

lines = text.splitlines()
with open('outputfile.csv', 'wb') as fout:  # 'wb' for the csv module on Python 2
    csvout = csv.writer(fout)
    csvout.writerow([col.strip() for col in lines[0].split('|')])  # header
    for row in lines[2:]:  # content, skipping the dashed separator line
        csvout.writerow([col.strip() for col in row.split('|')])
If you are interested in using a third-party module, prettytable is very useful and has a nice set of features for dealing with and printing tabular data.
EDIT: Oops, I misunderstood your question! The code below uses two regular expressions to do the modifications.
import re

str = """OGC Number | LT No | Job
------------------------------
9625878 | EPP3234 | 1206545
9708562 | PGP43221 | 1105482
9887954 | BCP5466 | 1025454
"""
# just setup above

# remove all lines consisting of at least 4 dashes
str = re.sub(r'----+\n', '', str)

# replace each pipe symbol, together with its
# surrounding spaces, with a single semicolon
str = re.sub(r' +\| +', ';', str)

print str
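To get the semicolon-separated result into an actual file rather than just printing it, a final write suffices (my addition; 'outputfile.csv' is an arbitrary name):
with open('outputfile.csv', 'w') as f:
    f.write(str)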

Python -- how to read and change specific fields from file? (specifically, numbers)

I just started learning python scripting yesterday and I've already gotten stuck. :(
So I have a data file with a lot of different information in various fields.
Formatted basically like...
Name (tab) Start# (tab) End# (tab) A bunch of fields I need but do not do anything with
Repeat
I need to write a script that takes the start and end numbers and adds or subtracts a number to them, depending on whether another field says + or -.
I know that I can replace words with something like this:
x = open("infile")
y = open("outfile","a")
while 1:
line = f.readline()
if not line: break
line = line.replace("blah","blahblahblah")
y.write(line + "\n")
y.close()
But I've looked at all sorts of different places and I can't figure out how to extract specific fields from each line, read one field, and change other fields. I read that you can read the lines into arrays, but can't seem to find out how to do it.
Any help would be great!
EDIT:
Example of a few lines from the data (each | represents a tab character; the bold - and + mark the field that says + or -):
chr21 | 33025905 | 33031813 | ENST00000449339.1 | 0 | **-** | 33031813 | 33031813 | 0 | 3 | 1835,294,104, | 0,4341,5804,
chr21 | 33036618 | 33036795 | ENST00000458922.1 | 0 | **+** | 33036795 | 33036795 | 0 | 1 | 177, | 0,
The second and third columns would be the ones that I'd need to read/change.
You can use csv to do the splitting, although for these sorts of problems, I usually just use str.split:
with open(infile) as fin, open('outfile', 'w') as fout:
    for line in fin:
        # use line.split('\t', 3) if the name field can contain spaces
        name, start, end, rest = line.split(None, 3)
        # Do something to change start and end here.
        # Note that `start` and `end` are strings, but they can easily be
        # converted using the `int` or `float` builtins.
        fout.write('\t'.join((name, start, end, rest)))
csv is nice if you want to split lines like this:
this is a "single argument"
into:
['this','is','a','single argument']
but it doesn't seem like you need that here.
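To fill in the "do something" step: in the sample lines the +/- flag is the sixth tab-separated field, so the whole loop might look like this sketch (OFFSET is a made-up example value; substitute whatever your task requires):
OFFSET = 10  # hypothetical adjustment

with open('infile') as fin, open('outfile', 'w') as fout:
    for line in fin:
        fields = line.rstrip('\n').split('\t')
        delta = OFFSET if fields[5] == '+' else -OFFSET  # sixth field holds + or -
        fields[1] = str(int(fields[1]) + delta)  # start
        fields[2] = str(int(fields[2]) + delta)  # end
        fout.write('\t'.join(fields) + '\n')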

Extracting each line from a file and passing it as a variable to "foreach" loop

Could somebody help me figure out a simple way of doing this using any script ? I will be running the script on Linux
1 ) I have a file1 which has the following lines :
(Bank8GntR[3] | Bank8GntR[2] | Bank8GntR[1] | Bank8GntR[0] ),
(Bank7GntR[3] | Bank7GntR[2] | Bank7GntR[1] | Bank7GntR[0] ),
(Bank6GntR[3] | Bank6GntR[2] | Bank6GntR[1] | Bank6GntR[0] ),
(Bank5GntR[3] | Bank5GntR[2] | Bank5GntR[1] | Bank5GntR[0] ),
2 ) I need the contents of file1 to be modified as following and written to a file2
(Bank15GntR[3] | Bank15GntR[2] | Bank15GntR[1] | Bank15GntR[0] ),
(Bank14GntR[3] | Bank14GntR[2] | Bank14GntR[1] | Bank14GntR[0] ),
(Bank13GntR[3] | Bank13GntR[2] | Bank13GntR[1] | Bank13GntR[0] ),
(Bank12GntR[3] | Bank12GntR[2] | Bank12GntR[1] | Bank12GntR[0] ),
So I have to:
read each line from the file1,
use "search" using regular expression,
to match Bank[0-9]GntR,
replace \1 with "7 added to number matched",
insert it back into the line,
write the line into a new file.
How about something like this in Python:
import re

# a function that adds 7 to a matched group.
# we capture (Bank) as group 1 so that only the digits right after it
# (group 2) are changed, not the digits in brackets.
def plus7(matchobj):
    return '%s%d' % (matchobj.group(1), int(matchobj.group(2)) + 7)

# iterate over the input file, with access to the output file.
with open('in.txt') as fhi, open('out.txt', 'w') as fho:
    for line in fhi:
        fho.write(re.sub(r'(Bank)(\d+)', plus7, line))
Assuming you don't have to use Python, you can do this using awk (note that the three-argument form of match() is a GNU awk extension):
cat test.txt | awk 'match($0, /Bank([0-9]+)GntR/, nums) { d=nums[1]+7; gsub(/Bank[0-9]+GntR\[/, "Bank" d "GntR["); print }'
This gives the desired output.
The point here is that match matches your data and captures groups, which you can use to extract the number. Since awk supports arithmetic, you can add 7 within awk and then replace the value throughout the rest of the line. Note that I've assumed all the Bank references in a given line share the same number.
