Splitting a file with multiple but not all delimiters in Python

I know there have been several answers to questions regarding multiple delimiters, but my issue involves needing to split on multiple delimiters, though not all of them. I have a file that contains the following:
((((((Anopheles_coluzzii:0.002798,Anopheles_arabiensis:0.005701):0.001405,(Anopheles_gambiae:0.002824,Anopheles_quadriannulatus:0.004249):0.002085):0,Anopheles_melas:0.008552):0.003211,Anopheles_merus:0.011152):0.068265,Anopheles_christyi:0.086784):0.023746,Anopheles_epiroticus:0.082921):1.101881;
It is Newick format, so all the information is on one long line. What I would like to do is isolate all the numbers that follow another number. So, for example, the first number I would like to isolate is 0.001405. I would like to put it in a list with all the other numbers that follow a number (not a name, etc.).
I tried to use the following code:
import re

with open("file.nh", "r") as f:
    for line in f:
        data = line
        z = re.findall(r"[\w']+", data)
The issue here is that this splits on "." as well as the other delimiters, which is a problem because all the numbers I require have decimal points.
I considered going along with this and converting the numbers in the list to ints and then removing all non-int values and 0 values. However, some of the files contain 0 as a value that needs to be kept.
So is there a way of choosing which delimiters to use and which to avoid when multiple delimiters are required?

It's not necessary to split on multiple but not all delimiters if you set up your regex to capture the wanted parts directly. By your definition, you want every number that follows ):. Using the re module, a possible solution is this:
import re

with open("file.nh", "r") as f:
    for line in f:
        z = re.findall(r"\):([0-9.]+)", line)
        print(z)
The result is:
['0.001405', '0.002085', '0', '0.003211', '0.068265', '0.023746', '1.101881']
r"\):([0-9.]+)" searches for ): followed by a run of digits and decimal points. That second part is the wanted result, so it is placed inside parentheses as a capturing group, which is what findall returns.
As Alex Hall mentioned, in most cases it's not a good idea to use a regex if the data is well structured. Look for a library that works with the given data format instead.
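For instance, Biopython ships a Newick parser. A minimal sketch, assuming the biopython package is installed; the internal (non-leaf) clades carry exactly the branch lengths that follow ):

from Bio import Phylo

tree = Phylo.read("file.nh", "newick")
# non-leaf clades correspond to the numbers that follow "):"
lengths = [c.branch_length for c in tree.get_nonterminals()
           if c.branch_length is not None]
print(lengths)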

Related

Python Reading Variable Whitespace Text Table Format

I have this weird output from another tool, which I cannot change or modify, that I need to parse and analyze. Any ideas on which pandas or Python library I should use? The output pads each column with spaces so that every column start lines up, which makes it difficult: whitespace and tabs are not the delimiter.
If the columns are consistent and every cell has a value, it should actually be pretty easy to parse this manually. You can do some variation of:
your_file = 'C:\\whatever.txt'
with open(your_file) as f:
    for line in f:
        total_capacity, existing_load, recallable_load, new_load, excess_load, excess_capacity_needed = line.strip().split()
        # analysis
If you have strings with spaces, you can alternatively do:
for line in f:
    row = (cell.strip() for cell in line.split('  ') if cell.strip())  # split on two spaces
    total_capacity, existing_load, recallable_load, new_load, excess_load, excess_capacity_needed = row
    # analysis
# analysis
This splits the line on two spaces at a time, and uses a generator expression to strip excess whitespace/tabs/newlines from all values before dropping any subsequently empty values. If you have strings that can contain multiple consecutive spaces... that would be an issue. You'd need a way to ensure those strings were enclosed in quotes so you could isolate them.
If by "white space and tabs are not the delimiter" you meant that the whitespace uses some weird blank Unicode character, you can just do
row = (cell.strip() for cell in line.split(unicode_character_here) if cell.strip())
Just make sure that for any of these solutions, you remember to add some checks for those ===== dividers and add a way to detect the titles of the columns if necessary.
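Since the columns are aligned by padding, this is essentially a fixed-width file, and pandas can read those directly with read_fwf, which infers the column boundaries from the alignment. A sketch (the file name and the skiprows value for the ===== dividers are assumptions to adapt to your file):

import pandas as pd

# infer column widths from the alignment; adjust skiprows for any ===== divider lines
df = pd.read_fwf("weird_output.txt", skiprows=[1])
print(df.head())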

Finding sub-strings in LARGE string

# read in csv file in the form "case, num, val\ncase1, 1, baz\n..."
# convert to the form FOO = "casenumval..."  (roughly 6 million characters)
for someString in List:  # 60,000 substrings
    if someString not in FOO:
        pass  # do stuff
    else:
        pass  # do other stuff
So my issue is that there are far too many substrings to check against this massive string. I have tried reading the file in line by line and checking the substrings against each line, but this still crashes the program. Are there any techniques for efficiently checking a lot of substrings against a very large string?
FOR CONTEXT:
I am performing data checks; suspect data is saved to a csv file to be reviewed/changed. This reviewed/changed file is then compared to the original file. Data which has not changed has been verified as good and must be saved to a new "exceptionFile". Data that has been changed and now passes is disregarded. And data which has been changed and is still suspect is sent off for review again.
The first thing you should do is convert your list of 60,000 strings to search for into one big regular expression:
import re
searcher = re.compile("|".join(re.escape(s) for s in List))
Now you can search for them all at once:
for m in searcher.finditer(FOO):
    print(m.group(0))  # prints the substring that matched
If all you care about is knowing which ones were found,
print(set(m.group(0) for m in searcher.finditer(FOO)))
This is still doing substantially more work than the absolute minimum, but it should be much more efficient than what you were doing before.
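If even the combined regex is too slow, a dedicated multi-pattern matcher such as Aho-Corasick gets close to that minimum: it scans the text once regardless of how many patterns there are. A sketch, assuming the third-party pyahocorasick package is installed:

import ahocorasick

automaton = ahocorasick.Automaton()
for s in List:
    automaton.add_word(s, s)  # store each pattern as its own payload
automaton.make_automaton()

# one pass over FOO finds every occurrence of every pattern
found = {payload for _end, payload in automaton.iter(FOO)}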
Also, if you know that your input is a CSV file and you also know that none of the strings-to-search-for contain a newline, you can operate line by line, which may or may not be faster than what you were doing depending on conditions, but will certainly use less memory:
with open("foo.csv") as FOO:
    for line in FOO:
        for m in searcher.finditer(line):
            pass  # do something with the substring that matched

How to ensure two line breaks between each paragraph in python

I am reading txt files into Python and want to make paragraph breaks consistent. Sometimes there are one, two, three, four... occasionally several tens or hundreds of blank lines between paragraphs.
It is obviously easy to strip out all the breaks, but I can only think of "botched" ways of making everything two breaks (i.e. a single blank line between each paragraph). All I can think of is specifying multiple strips/replaces for different possible combinations of breaks, which gets unwieldy when the number of breaks is very large, or iteratively removing excess breaks until two are left, which I guess would be slow and not particularly scalable to many tens of thousands of txt files.
Is there a moderately fast [/simple] way of achieving this?
You can try this, where x is your string containing all the paragraphs:
import re

x = re.sub(r"([\r\n]){2,}", r"\1\1", x)
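A quick demonstration: any run of two or more break characters collapses to exactly two, while single breaks are left alone:

import re

x = "Para one.\n\n\n\n\nPara two.\n\nPara three."
print(re.sub(r"([\r\n]){2,}", r"\1\1", x))
# Para one.
#
# Para two.
#
# Para three.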
Here's one way.
import os

f = open("text.txt")
r = f.read()
pars = [p for p in r.split(os.linesep) if p]
print((os.linesep * 2).join(pars))
This is assuming by paragraphs we mean a block of text not containing a linebreak.
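Note that a file opened in text mode already translates platform line endings to '\n' (universal newlines), so splitting on '\n' rather than os.linesep is usually safer; a minimal variant under that assumption:

with open("text.txt") as f:
    pars = [p for p in f.read().split("\n") if p]
print("\n\n".join(pars))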

Cleaning up commas in numbers w/ regular expressions in Python

I have been googling this one fervently, but I can't really narrow it down. I am attempting to interpret a csv file of values, common enough sort of behaviour. But I am being punished by values over a thousand, i.e. quoted values containing a comma. I have kinda gotten around it by using the csv reader, which creates a list of values from the row, but I then have to pick the commas out afterwards.
For purely academic reasons, is there a better way to edit a string with regular expressions? Going from 08/09/2010,"25,132","2,909",650 to 08/09/2010,25132,2909,650.
(If you are into Vim, basically I want to put Python on this:
:1,$s/"\([0-9]*\),\([0-9]*\)"/\1\2/g :D )
Use the csv module for first-stage parsing, and a regex only for seeing if the result can be transformed to a number.
import csv, re

num_re = re.compile('^[0-9]+[0-9,]+$')
for row in csv.reader(open('input_file.csv')):
    for el_num in range(len(row)):
        if num_re.match(row[el_num]):
            row[el_num] = row[el_num].replace(',', '')
...although it would probably be faster not to use the regular expression at all:
for row in ([item.replace(',', '') for item in row]
            for row in csv.reader(open('input_file.csv'))):
    do_something_with_your(row)
I think what you're looking for is, assuming that commas will only appear in numbers, and that those entries will always be quoted:
import re

def remove_commas(mystring):
    return re.sub(r'"(\d+?),(\d+?)"', r'\1\2', mystring)
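Applied to the sample row above:

>>> remove_commas('08/09/2010,"25,132","2,909",650')
'08/09/2010,25132,2909,650'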
UPDATE:
Adding cdarke's comments below, the following should work for arbitrary-length numbers:
import re

def remove_commas_and_quotes(mystring):
    return re.sub(r'","|",|"', ',', re.sub(r'(?:(\d+?),)', r'\1', mystring))
Python has a regular expressions module, "re":
http://docs.python.org/library/re.html
However, in this case, you might want to consider using the "partition" function:
>>> s = 'some_long_string,"12,345",more_string,"56,6789",and_some_more'
>>> left_part, quote_mark, right_part = s.partition('"')
>>> right_part
'12,345",more_string,"56,6789",and_some_more'
>>> number, quote_mark, remainder = right_part.partition('"')
>>> number
'12,345'
string.partition(character) splits a string into three parts: the stuff to the left of the first occurrence of character, character itself, and the stuff to the right.
Here's a simple regex for removing commas from numbers of any length (note that it also strips any comma that directly follows digits, such as a field separator after an unquoted number, so apply it only to the quoted values):
re.sub(r'(\d+),?([\d+]?)',r'\1\2',mystring)

How to add a comma to the end of a list efficiently?

I have a horizontal list of names that is too long to open in Excel: it's 90,000 names long. I need to add a comma after each name to put it into my program. I tried find/replace, but it freezes up my computer and crashes. Is there a clever way I can get a comma at the end of each name? My options to work with are Python and Excel. Thanks.
If you actually had a Python list, say names, then ','.join(names) would make it into a string with a comma between each name and the following one (if you need one at the end as well, just use + ',' to append one more comma to the result).
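For example:

>>> names = ['Alice', 'Bob', 'Carol']
>>> ','.join(names)
'Alice,Bob,Carol'
>>> ','.join(names) + ','
'Alice,Bob,Carol,'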
Even though you say you have "a list" I suspect you actually have a string instead, for example in a file, where the names are separated by...? You don't tell us, and therefore force us to guess. For example, if they're separated by line-ends (one name per line), your life is easiest:
with open('yourfile.txt') as f:
    result = ','.join(line.strip() for line in f)  # strip the trailing newlines
(again, supplement this with a + ',' after the join if you need that, of course). That's because separation by line-ends is the normal default behavior for a text file, of course.
If the separator is something different, you'll have to read the file's contents as a string (with f.read()) and split it up appropriately then join it up again with commas.
For example, if the separator is a tab character:
with open('yourfile.txt') as f:
    result = ','.join(f.read().split('\t'))
As you see, it's not much worse ;-).
