Python removing substrings from strings - python

I'm trying to remove some substrings from a string in a csv file.
import csv
import string
input_file = open('in.csv', 'r')
output_file = open('out.csv', 'w')
data = csv.reader(input_file)
writer = csv.writer(output_file,quoting=csv.QUOTE_ALL)# dialect='excel')
specials = ("i'm", "hello", "bye")
for line in data:
line = str(line)
new_line = str.replace(line,specials,'')
writer.writerow(new_line.split(','))
input_file.close()
output_file.close()
So for this example:
hello. I'm obviously over the moon. If I am being honest I didn't think I'd get picked, so to get picked is obviously a big thing. bye.
I'd want the output to be:
obviously over the moon. If I am being honest I didn't think I'd get picked, so to get picked is obviously a big thing.
This however only works when im searching for a single word. So that specials = "I'm" for example. Do I need to add my words to a list or an array?

It looks like you aren't iterating through specials, since it's a tuple rather than a list, so it's only grabbing one of the values. Try this:
specials = ["i'm, "hello", "bye"]
for line in data:
new_line = str(line)
for word in specials:
new_line = str.replace(new_line, word, '')
writer.writerow(new_line.split(','))

It seems like you're already splitting the input via the csv.reader, but then you're throwing away all that goodness by turning the split line back into a string. It's best not to do this, but to keep working with the lists that are yielded from the csv reader. So, it becomes something like this:
for row in data:
new_row = [] # A place to hold the processed row data.
# look at each field in the row.
for field in row:
# remove all the special words.
new_field = field
for s in specials:
new_field = new_field.replace(s, '')
# add the sanitized field to the new "processed" row.
new_row.append(new_field)
# after all fields are processed, write it with the csv writer.
writer.writerow(new_row)

Related

To parse text file and create json out of it

I am new to Python. I want to parse a text file in which the first row contains the headers and are the keys and the next row(2nd row) has its corresponding values.
The problem I'm facing is that the content in the text file is not symmetric meaning there are uneven spaces between the first and the second row so, I'm not able to use the delimiter also.
Also, there is no necessity that the header will always have a corresponding value in the next row. It may be empty sometimes.
After that, I want to make it a JSON format having those key-value pairs.
Any help would be appreciated.
import re
with open("E:\\wipro\\samridh\\test.txt") as read_file:
line = read_file.readline()
while line:
#print(line,end='')
new_string = re.sub(' +',' ', line)
line= read_file.readline()
print(new_string)
PFA image of my text input
You can find the indices and matches of the header with the finditer of the re package. Then, use that to process the rest:
import re
import json
thefile = open("file.txt")
line = thefile.readline()
iter = re.finditer("\w+\s+", line)
columns = [(m.group(0), m.start(0), m.end(0)) for m in iter]
records = []
while line:
line = thefile.readline()
record = {}
for col in columns:
record[col[0]] = line[col[1]:col[2]]
records.append(record)
print(json.dumps(records))
I'll leave it up to OP to strip whitespace and filter out empty entries. Not to mention error handling ;-).
Not quite sure what you want to do, but if I understand it correctly and under these assumptions: - you only have 2 lines in the file. - you have the same number of keys and value. - no spaces allowed "inside" a value or key, meaning no spaces allowed except the ones separate between elements.
with open(fname) as f:
content = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]
after that, content[0] is your keys line and content[1] is your values.
now all you need to do is this:
key_value_dict = {}
for key,value in zip(content[0].split(), content[1].split()):
key_value_dict[key] = value
and your key_value_dict holds up a dictionary (JSON like) of keys and values.
I assume that each of the headers is a single word without intervening whitespace. Then, to find out where each column starts, you can do this, for example:
with open("E:\\wipro\\samridh\\test.txt") as read_file:
line = next(read_file)
headers = line.split()
l_bounds = [line.find(word) for word in headers]
When splitting data lines, you will also need the right boundaries. If you know, say, that none of your data lines is longer than 1000 characters, you could do something like this:
r_bounds = l_bounds[1:] + [1000]
When you walk over the data lines, you put together the left and right limits and the header_words like so:
out_str = json.dumps({name: line[l:r].strip()
for name, l, r in zip(headers, l_bounds, r_bounds)})
No regex required, by the way.
Assumptions the below makes:
Headers are one word (as they are in your example)
Headers and values don’t overlap... That is if header 1 goes from index 5 to 15, then its value in the row below will also be found within the same index of the row below
Benefits of this approach are that the values can have spaces in between them (as they do in your example). If you were to split both the header and value strings by spaces, then they would have a different number of elements and you wouldn’t be able to combine them. Also, you wouldn’t be able to find values that were empty (as in his example).
Here is the approach I would take...
If you are sure your file headers are only one words (no spaces), then find all indices of the first character of each word and store them in an array. Every time you figure out two indices, extract the header between them. So between (header1-firstchar, header2-firstchar - 1)...
Then get the second line and sequentially extract substrings from indices: (header1-firstchar, header2-firstchar - 1)...
Once you've done that combine the extracted header/key and values into a dictionary.
dictVerson = zip(headers, values)
Then call following:
import json
jsonVersion = json.dumps(dictVersion)

Parsing a text file with line breaks in python

I have a text file with about 20 entries. They look like this:
~
England
Link: http://imgur.com/foobar.jpg
Capital: London
~
Iceland
Link: http://imgur.com/foobar2.jpg
Capital: Reykjavik
...
etc.
I would like to take these entries and turn them into a CSV.
There is a '~' separating each entry. I'm scratching my head trying to figure out how to go thru line by line and create the CSV values for each country. Can anyone give me a clue on how to go about this?
Use the libraries luke :)
I'm assuming your data is well formatted. Most real world data isn't that way. So, here goes a solution.
>>> content.split('~')
['\nEngland\nLink: http://imgur.com/foobar.jpg\nCapital: London\n', '\nIceland\nLink: http://imgur.com/foobar2.jpg\nCapital: Reykjavik\n', '\nEngland\nLink: http://imgur.com/foobar.jpg\nCapital: London\n', '\nIceland\nLink: http://imgur.com/foobar2.jpg\nCapital: Reykjavik\n']
For writing the CSV, Python has standard library functions.
>>> import csv
>>> csvfile = open('foo.csv', 'wb')
>>> fieldnames = ['Country', 'Link', 'Capital']
>>> writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
>>> for entry in entries:
... cols = entry.strip().splitlines()
... writer.writerow({'Country': cols[0], 'Link':cols[1].split(': ')[1], 'Capital':cols[2].split(':')[1]})
...
If your data is more semi structured or badly formatted, consider using a library like PyParsing.
Edit:
Second column contains URLs, so we need to handle the splits well.
>>> cols[1]
'Link: http://imgur.com/foobar2.jpg'
>>> cols[1].split(':')[1]
' http'
>>> cols[1].split(': ')[1]
'http://imgur.com/foobar2.jpg'
The way that I would do that would be to use the open() function using the syntax of:
f = open('NameOfFile.extensionType', 'a+')
Where "a+" is append mode. The file will not be overwritten and new data can be appended. You could also use "r+" to open the file in read mode, but would lose the ability to edit. The "+" after a letter signifies that if the document does not exist, it will be created. The "a+" I've never found to work without the "+".
After that I would use a for loop like this:
data = []
tmp = []
for line in f:
line.strip() #Removes formatting marks made by python
if line == '~':
data.append(tmp)
tmp = []
continue
else:
tmp.append(line)
Now you have all of the data stored in a list, but you could also reformat it as a class object using a slightly different algorithm.
I have never edited CSV files using python, but I believe you can use a loop like this to add the data:
f2 = open('CSVfileName.csv', 'w') #Can change "w" for other needs i.e "a+"
for entry in data:
for subentry in entry:
f2.write(str(subentry) + '\n') #Use '\n' to create a new line
From my knowledge of CSV that loop would create a single column of all of the data. At the end remember to close the files in order to save the changes:
f.close()
f2.close()
You could combine the two loops into one in order to save space, but for the sake of explanation I have not.

identify csv in python

I have a data dump that is a "messed up" CSV. (About 100 files, each with about 1000 lines of actual CSV data.)
The dump has some other text in addition to CSV. How can I extract the CSV part separately, programmatically?
As an example the data file looks like something like this
Session:1
Data collection date: 09-09-2016
Related questions:
Question 1: parta, partb, partc,
Question 2: parta, partb, partc
"field1","field2","field3","field4"
"data11","data12","data13","data14"
"data21","data22","data23","data24"
"data31","data32","data33","data34"
"data41","data42","data43","data44"
"data51","data52","data53","data54"
I need to extract the csv part.
Caveats,
the text in the beginning is NOT limited to 4 - 5 lines.
the additional text is NOT just in the beginning of the file
I saw this post that suggests using re.split and/or csv.Sniffer,
however my attempt was not fruitful.
with open("untitled.csv") as csvfile:
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
print(dialect.__dict__)
csvstarts = False
csvdump = []
for ln in csvfile.readlines():
toks = re.split(r'[,]', ln)
print(toks)
if toks[0] == '"field1"' and not csvstarts: # identify by the header line
csvstarts = True
continue
if csvstarts:
if toks[0] == '"field1"': # identify the start of subsequent csv data
csvstarts = False
continue
csvdump.append(ln) # record the current line
print(csvdump)
For now I am able to identify the csv lines accurately ONLY if there is one bunch of data.
Is there anything better I can do?
How about this:
import re
my_pattern = re.compile("(\"[\w]+\",)+")
with open('<your_file>', 'rb') as fi:
for f in fi:
result = my_pattern.match(f)
if result:
print f
Assuming the csv data can be differentiated from the rest by having no special characters in them (we only accept each element to have letters or numbers surrounded by double quotes and a comma to separate from the next element)
If your csv lines and only those lines start with \", then you can do this:
import csv
data = list(csv.reader(open("test.csv", 'rb'), quotechar='¬'))
# for quotechar - use something that won't turn up in data
def importCSV(data):
# outputs list of list with required data
# works on the assumption that all required data starts with \"
# and that no text starts with \"
out = []
for line in data:
if (line != []) and (line[0][0] == "\""):
line = [el.replace("\"", "") for el in line]
out.append(line)
return out
useful = importCSV(data)
Can you not read each line and do a regex to see weather or not to pull the data?
Maybe something like:
^(["][\w]["][,])+["][\w]["]$
My regex is not the best and there may likely be a better way but that seemed to work for me.

Adding each item in list to end of specific lines in FASTA file

I solved this in the comments below.
So essentially what I am trying to do is add each element of a list of strings to the end of specific lines in a different file.
Hard to explain but essentially I want to parse a FASTA file, and every time it reaches a header (line.startswith('>')) I want it to replace parts of that header with an element in a list I've already made.
For example:
File1:
">seq1 unwanted here
AATATTATA
ATATATATA
>seq2 unwanted stuff here
GTGTGTGTG
GTGTGTGTG
>seq3 more stuff I don't want
ACACACACAC
ACACACACAC"
I want it to keep ">seq#" but replace everything after with the next item in the list below:
List:
mylist = "['things1', '', 'things3', 'things4', '' 'things6', 'things7']"
Result (modified file1):
">seq1 things1
AATATTATA
ATATATATA
>seq2 # adds nothing here due to mylist[1] = ''
GTGTGTGTG
GTGTGTGTG
>seq3 things3
ACACACACAC
ACACACACAC
As you can see I want it to add even the blank items in the list.
So once again, I want it to parse this FASTA file, and every time it gets to a header (there are thousands), I want it to replace everything after the first word with the next item in the separate list I have made.
What you have will work, but there are a few unnecessary lines so I've edited down to use a few less lines. Also, an important note is that you don't close your file handles. This could result in errors, specifically when writing to file, either way it's bad practice. code:
#!/usr/bin/python
import sys
# gets list of annotations
def get_annos(infile):
with open(infile, 'r') as fh: # makes sure the file is closed properly
annos = []
for line in fh:
annos.append( line.split('\t')[5] ) # added tab as separator
return annos
# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
annos = get_annos(infile1) # contains list of annos
with open(infile2, 'r') as f2, open(outfile, 'w') as output:
for line in f2:
if line.startswith('>'):
line_split = list(line.split()[0]) # split line on whitespace and store first element in list
line_split.append(annos.pop(0)) # append data of interest to current id line
output.write( ' '.join(line_split) + '\n' ) # join and write to file with a newline character
else:
output.write(line)
anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]
add_annos(anno, seq, out)
get_annos(anno)
This is not perfect but it cleans things up a bit. I'd might veer away from using pop() to associate the annotation data with the sequence IDs unless you are certain the files are in the same order every time.
There is a great library in python for Fasta and other DNA file parsing. It is totally helpful in Bioinformatics. You can also manipulate any data according to your need.
Here is a simple example extracted from the library website:
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
print(seq_record.id)
print(repr(seq_record.seq))
print(len(seq_record))
You should get something like this on your screen:
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
***********EDIT*********
I solved this before anyone could help. This is my code, can anyone tell me if I have any bad practices? Is there a way to do this without writing everything to a new file? Seems like it would take a long time/lots of memory.
#!/usr/bin/python
# Script takes unedited FASTA file, removed seq length and
# other header info, adds annotation after sequence name
# run as: $ python addanno.py testanno.out testseq.fasta out.txt
import sys
# gets list of annotations
def get_annos(infile):
f = open(infile)
list2 = []
for line in f:
columns = line.strip().split('\t')
list2.append(columns[5])
return list2
# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
mylist = get_annos(infile1) # contains list of annos
f2 = open(infile2, 'r')
output = open(out, 'w')
for line in f2:
if line.startswith('>'):
l = line.partition(" ")
list3 = list(l)
del list3[1:]
list3.append(' ')
list3.append(mylist.pop(0))
final = ''.join(list3)
line = line.replace(line, final)
output.write(line)
output.write('\n')
else:
output.write(line)
anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]
add_annos(anno, seq, out)
get_annos(anno)

python csv replace listitem

i have following output from a csv file:
word1|word2|word3|word4|word5|word6|01:12|word8
word1|word2|word3|word4|word5|word6|03:12|word8
word1|word2|word3|word4|word5|word6|01:12|word8
what i need to do is change the time string like this 00:01:12.
my idea is to extract the list item [7] and add a "00:" as string to the front.
import csv
with open('temp', 'r') as f:
reader = csv.reader(f, delimiter="|")
for row in reader:
fixed_time = (str("00:") + row[7])
begin = row[:6]
end = row[:8]
print begin + fixed_time +end
get error message:
TypeError: can only concatenate list (not "str") to list.
i also had a look on this post.
how to change [1,2,3,4] to '1234' using python
i neeed to know if my approach to soloution is the right way. maybe need to use split or anything else for this.
thx for any help
The line that's throwing the exception is
print begin + fixed_time +end
because begin and end are both lists and fixed_time is a string. Whenever you take a slice of a list (that's the row[:6] and row[:8] parts), a list is returned. If you just want to print it out, you can do
print begin, fixed_time, end
and you won't get an error.
Corrected code:
I'm opening a new file for writing (I'm calling it 'final', but you can call it whatever you want), and I'm just writing everything to it with the one modification. It's easiest to just change the one element of the list that has the line (row[6] here), and use '|'.join to write a pipe character between each column.
import csv
with open('temp', 'r') as f, open('final', 'w') as fw:
reader = csv.reader(f, delimiter="|")
for row in reader:
# just change the element in the row to have the extra zeros
row[6] = '00:' + row[6]
# 'write the row back out, separated by | characters, and a new line.
fw.write('|'.join(row) + '\n')
you can use regex for that:
>>> txt = """\
... word1|word2|word3|word4|word5|word6|01:12|word8
... word1|word2|word3|word4|word5|word6|03:12|word8
... word1|word2|word3|word4|word5|word6|01:12|word8"""
>>> import re
>>> print(re.sub(r'\|(\d\d:\d\d)\|', r'|00:\1|', txt))
word1|word2|word3|word4|word5|word6|00:01:12|word8
word1|word2|word3|word4|word5|word6|00:03:12|word8
word1|word2|word3|word4|word5|word6|00:01:12|word8

Categories

Resources