Python Reading Variable Whitespace Text Table Format - python

I have this weird output from another tool that I cannot change or modify, and I need to parse it and run analysis on it. Any ideas on what pandas or Python library I should use? The columns are padded with spaces so that each column start is aligned, which makes it difficult to parse. Whitespace and tabs are not the delimiter.

If the columns are consistent and every cell has a value, it should actually be pretty easy to parse this manually. You can do some variation of:
your_file = 'C:\\whatever.txt'
with open(your_file) as f:
    for line in f:
        total_capacity, existing_load, recallable_load, new_load, excess_load, excess_capacity_needed = line.strip().split()
        # analysis
If you have strings with spaces, you can alternatively do:
for line in f:
    row = (cell.strip() for cell in line.split('  ') if cell.strip())
    total_capacity, existing_load, recallable_load, new_load, excess_load, excess_capacity_needed = row
    # analysis
This splits the line on every pair of spaces and uses a generator expression to strip excess whitespace, tabs, and newlines from each value, dropping any values that end up empty. If you have strings that can contain multiple consecutive spaces, that would be an issue: you'd need a way to ensure those strings were enclosed in quotes so you could isolate them.
If by "white space and tabs are not the delimiter" you meant that the whitespace is some unusual blank Unicode character, you can just do:
row = (cell.strip() for cell in line.split(unicode_character_here) if cell.strip())
Just make sure that for any of these solutions, you remember to add some checks for those ===== dividers and add a way to detect the titles of the columns if necessary.
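As a rough sketch of the checks mentioned above, the following splits each line on runs of two or more whitespace characters with `re.split` and skips divider rows. The sample lines and the `parse_table` name are made up for illustration; the real tool's output may well look different.

```python
import re

def parse_table(lines):
    """Split aligned-column text on runs of 2+ whitespace, skipping dividers."""
    rows = []
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and divider rows made only of '=' and spaces
        if not stripped or set(stripped) <= {"=", " "}:
            continue
        rows.append(re.split(r"\s{2,}", stripped))
    return rows

# Hypothetical sample; column titles may contain single spaces
sample = [
    "Total Capacity   Existing Load   New Load",
    "==============   =============   ========",
    "100              40              25",
]
header, *data = parse_table(sample)
print(header)
print(data)
```

The first surviving row is treated as the column titles here, which is an assumption about the layout. A cell containing two or more consecutive spaces would still be split incorrectly.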

Related

Extract a column from a text file into a list

I have a text file which contains a table comprised of numbers e.g:
5,10,6
6,20,1
7,30,4
8,40,3
9,23,1
4,13,6
If, for example, I want only the numbers in the third column, how do I extract that column into a list?
I have tried the following:
myNumbers.append(line.split(',')[2])
The strip method will make sure that the newline character is stripped off. The split method is used here to make sure that the commas are used as a delimiter.
line.strip().split(',')[2]
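For completeness, here is a minimal sketch of the whole loop using the standard `csv` module instead of a manual `split`. The `third_column` name is hypothetical, and an in-memory `io.StringIO` stands in for the real file.

```python
import csv
import io

def third_column(text_stream):
    """Return the third field of every non-empty CSV row as an int."""
    return [int(row[2]) for row in csv.reader(text_stream) if row]

# io.StringIO stands in for open('yourfile.txt') in this sketch
sample = io.StringIO("5,10,6\n6,20,1\n7,30,4\n8,40,3\n9,23,1\n4,13,6\n")
my_numbers = third_column(sample)
print(my_numbers)  # [6, 1, 4, 3, 1, 6]
```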

Splitting a file with multiple but not all delimiters in Python

I know there's been several answers to questions regarding multiple delimiters, but my issue involves needing to delimit by multiple delimiters but not all of them. I have a file that contains the following:
((((((Anopheles_coluzzii:0.002798,Anopheles_arabiensis:0.005701):0.001405,(Anopheles_gambiae:0.002824,Anopheles_quadriannulatus:0.004249):0.002085):0,Anopheles_melas:0.008552):0.003211,Anopheles_merus:0.011152):0.068265,Anopheles_christyi:0.086784):0.023746,Anopheles_epiroticus:0.082921):1.101881;
It is newick format so all information is in one long line. What I would like to do is isolate all the numbers that follow another number. So for example the first number I would like to isolate is 0.001405. I would like to put that in a list with all the other numbers that follow a number (not a name etc).
I tried to use the following code:
import re

with open("file.nh", "r") as f:
    for line in f:
        data = line
        z = re.findall(r"[\w']+", data)
The issue here is that this splits the list using "." as well as the other delimiters and this is a problem because all the numbers I require have decimal points.
I considered going along with this and converting the numbers in the list to ints and then removing all non-int values and 0 values. However, some of the files contain 0 as a value that needs to be kept.
So is there a way of choosing which delimiters to use and which to avoid when multiple delimiters are required?
It's not necessary to split by multiple but not all delimiters if you set up your regex to catch the wanted parts. By your definition, you could use every number after ):. Using the re module a possible solution is this:
import re

with open("file.nh", "r") as f:
    for line in f:
        z = re.findall(r"\):([0-9.]+)", line)
        print(z)
The result is:
['0.001405', '0.002085', '0', '0.003211', '0.068265', '0.023746', '1.101881']
r"\):([0-9.]+)" searches for ): followed by a run of digits and decimal points. That second part is the value you want, so it is placed inside parentheses to form a capture group.
As Alex Hall mentioned, in most cases it's not a good idea to use regex if the data is well structured. Look for libraries that work with the given data structure instead.
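The same pattern can be tried without a file. This sketch runs it on the tree string from the question and converts the matches to floats; the conversion step is an addition for illustration, not part of the original answer.

```python
import re

newick = (
    "((((((Anopheles_coluzzii:0.002798,Anopheles_arabiensis:0.005701):0.001405,"
    "(Anopheles_gambiae:0.002824,Anopheles_quadriannulatus:0.004249):0.002085):0,"
    "Anopheles_melas:0.008552):0.003211,Anopheles_merus:0.011152):0.068265,"
    "Anopheles_christyi:0.086784):0.023746,Anopheles_epiroticus:0.082921):1.101881;"
)

# Capture the number that follows each closing parenthesis and colon
branch_lengths = [float(x) for x in re.findall(r"\):([0-9.]+)", newick)]
print(branch_lengths)
```

Note that the character class `[0-9.]` would not match values written in scientific notation such as `1e-05`, so this assumes the files only use plain decimals.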

HTML/XML special characters cause row breaks reading csv file

I'm trying to read a semicolon-delimited CSV file into Python. There is a column which contains some XML code, and in some rows this code contains special entities such as &lt; for < and so on. The semicolons in those entities cause wrong row breaks, resulting in an inconsistent number of columns for certain rows. Is there a way to avoid that without replacing every problematic character?
Here is an example of such a row (I've shortened it for readability):
20160210-12:45:43:047;C2ALLIANCE.EAM.EVENT.EAMEVENTREPORT.DPROB14;<?xml version="1.0"?><FAP:Message><eam:Data id="LOTTYPE">R&amp;D</eam:Data></FAP:Message>;EVENT;DPROB14;
There are actually 5 columns, but the semicolon in &amp; causes an extra break, so my code gets the wrong number of columns.
I need certain columns and I use numpy:
data = numpy.genfromtxt('csvfile.csv', delimiter=";", dtype='str',usecols=(0, 1, 3), skip_header=1)
If the offending semicolon was between quotation marks, it would be possible to ignore it using pandas; but here, it's completely taken for a delimiter (I'm not the author of the data).
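One possible workaround, sketched here under assumptions rather than offered as a definitive fix: temporarily protect the semicolon that terminates each entity, split on the remaining semicolons, then restore the protected ones. The `split_row` helper, the `ENTITY` pattern, and the NUL placeholder are all hypothetical choices; the approach assumes the placeholder never occurs in the data and that every `&name;` sequence really is an entity.

```python
import re

# Matches an entity such as &amp; or &#38; so its semicolon can be protected
ENTITY = re.compile(r"&(#?\w+);")
PLACEHOLDER = "\x00"  # assumed never to appear in the real data

def split_row(line, delimiter=";"):
    """Split on the delimiter while leaving entity semicolons intact."""
    protected = ENTITY.sub(lambda m: "&" + m.group(1) + PLACEHOLDER, line)
    return [field.replace(PLACEHOLDER, ";") for field in protected.split(delimiter)]

line = (
    '20160210-12:45:43:047;C2ALLIANCE.EAM.EVENT.EAMEVENTREPORT.DPROB14;'
    '<?xml version="1.0"?><FAP:Message><eam:Data id="LOTTYPE">R&amp;D</eam:Data>'
    '</FAP:Message>;EVENT;DPROB14;'
)
row = split_row(line)
print(len(row), row[2])
```

Because the sample row ends with a semicolon, the result has a trailing empty field after the five real columns.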

How can I format a txt file in python so that extra paragraph lines are removed as well as extra blank spaces?

I'm trying to format a file similar to this: (random.txt)
Hi, im trying to format a new txt document so
that extra spaces between words and paragraphs are only 1.
This should make this txt document look like:
This is how it should look below: (randomoutput.txt)
Hi, I'm trying to format a new txt document so
that extra spaces between words and paragraphs are only 1.
This should make this txt document look like:
So far the code I've managed to make has only removed the spaces, but I'm having trouble making it recognize where a new paragraph starts so that it doesn't remove the blank lines between paragraphs. This is what I have so far.
def removespaces(input, output):
    ivar = open(input, 'r')
    ovar = open(output, 'w')
    n = ivar.read()
    ovar.write(' '.join(n.split()))
    ivar.close()
    ovar.close()
Edit:
I've also found a way to create spaces between paragraphs, but right now it just takes every line break and creates a space between the old line and new line using:
m = ivar.readlines()
m[:] = [i for i in m if i != '\n']
ovar.write('\n'.join(m))
You should process the input line by line. Not only will this make your program simpler, it will also be easier on the system's memory.
The logic for normalizing horizontal white space in a line stays the same (split words and join with a single space).
What you'll need to do for the paragraphs is test whether line.strip() is empty (just use it as a boolean expression) and keep a flag recording whether the previous line was empty too. You simply throw away the empty lines, but if you encounter a non-empty line and the flag is set, print a single empty line before it.
with open('input.txt', 'r') as istr:
    new_par = False
    for line in istr:
        line = line.strip()
        if not line:  # blank
            new_par = True
            continue
        if new_par:
            print()  # print a single blank line
        print(' '.join(line.split()))
        new_par = False
If you want to suppress blank lines at the top of the file, you'll need an extra flag that you set only after encountering the first non-blank line.
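A minimal sketch of that extra flag, written as a function over any iterable of lines so it can be tried without a file. The names `normalize` and `seen_text` are made up for this example.

```python
def normalize(lines):
    """Collapse spaces within lines and blank-line runs between paragraphs."""
    out = []
    seen_text = False  # becomes True after the first non-blank line
    new_par = False    # True when at least one blank line was just skipped
    for line in lines:
        words = line.split()
        if not words:  # blank or whitespace-only line
            new_par = True
            continue
        if new_par and seen_text:
            out.append("")  # a single blank line between paragraphs
        out.append(" ".join(words))
        seen_text = True
        new_par = False
    return out

result = normalize(["\n", "Hi,   there\n", "\n", "\n", "next  par\n"])
print("\n".join(result))
```

Blank lines at the end of the input are dropped as well, because a blank line is only emitted when a non-blank line follows.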
If you want something fancier, have a look at the textwrap module, but be aware that it has (or, at least, used to have, from what I can say) some bad worst-case performance issues.
The trick here is that you want to turn any sequence of 2 or more \n into exactly 2 \n characters. This is hard to write with just split and join—but it's dead simple to write with re.sub:
n = re.sub(r'\n\n+', r'\n\n', n)
If you want lines with nothing but spaces to be treated as blank lines, do this after stripping spaces; if you want them to be treated as non-blank, do it before.
You probably also want to change your space-stripping code to use split(' ') rather than just split(), so it doesn't screw up newlines. (You could also use re.sub for that as well, but it isn't really necessary, because turning 1 or more spaces into exactly 1 isn't hard to write with split and join.)
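Putting those pieces together, here is one possible sketch. It uses `re.sub` for the spaces too, which the answer notes is optional; the `squeeze` name is hypothetical, and whitespace-only lines are treated as blank.

```python
import re

def squeeze(text):
    """Collapse runs of spaces and runs of blank lines in a string."""
    # Collapse runs of spaces or tabs, leaving newlines untouched
    text = re.sub(r"[ \t]+", " ", text)
    # Strip each line so whitespace-only lines become truly empty
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Turn any run of 2 or more newlines into exactly 2
    return re.sub(r"\n\n+", "\n\n", text)

cleaned = squeeze("Hi,   there\n  \n\n\nnext   par\n")
print(repr(cleaned))
```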
Alternatively, you could just go line by line, and keep track of the last line (either with an explicit variable inside the loop, or by writing a simple adjacent_pairs iterator, like i1, i2 = tee(ivar); next(i2); return zip_longest(i1, i2, fillvalue='')) and if the current line and the previous line are both blank, don't write the current line.
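The pairwise idea could look roughly like this. Since `adjacent_pairs` as described yields each line together with the line after it, this sketch drops a blank line when the following line is blank too, which keeps the last blank line of each run. The `drop_repeated_blanks` name is an assumption made for the example.

```python
from itertools import tee, zip_longest

def adjacent_pairs(iterable):
    """Yield (item, next_item) pairs; the last item is paired with ''."""
    i1, i2 = tee(iterable)
    next(i2, None)  # advance the second iterator by one, tolerating empty input
    return zip_longest(i1, i2, fillvalue="")

def drop_repeated_blanks(lines):
    kept = []
    for current, following in adjacent_pairs(lines):
        # Skip a blank line when the line after it is blank as well
        if not current.strip() and not following.strip():
            continue
        kept.append(current)
    return kept

result = drop_repeated_blanks(["a\n", "\n", "\n", "b\n"])
print(result)
```

Because the fill value is an empty string, a blank line at the very end of the input is dropped too.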
split without an argument will cut your string at each occurrence of whitespace (space, tab, newline, ...).
Write
n.split(" ")
and it will only split at spaces.
Instead of writing the output to a file, put it into a new variable, and repeat the step again, this time with
m.split("\n")
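To make the difference concrete, here is a small sketch. The `squeeze_spaces` helper is hypothetical; it applies the space-only split to each line separately so that newlines survive.

```python
text = "a  b\tc\nd"

# No argument: splits on any whitespace run and discards empty strings
any_whitespace = text.split()
# Explicit " ": splits only at single spaces, so empty strings can appear
spaces_only = text.split(" ")

def squeeze_spaces(s):
    """Collapse runs of spaces on each line while keeping every newline."""
    return "\n".join(
        " ".join(word for word in line.split(" ") if word)
        for line in s.split("\n")
    )

print(any_whitespace)
print(spaces_only)
print(repr(squeeze_spaces("a  b\n\nc   d")))
```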
Firstly, let's see, what exactly is the problem...
You cannot have more than 1 consecutive space or more than 2 consecutive newlines.
You know how to handle runs of spaces.
That approach won't work on newlines, as there are 3 possible situations:
- 1 newline
- 2 newlines
- more than 2 newlines
Great, so how do you solve this then?
There are many solutions. I'll list 3 of them.
Regex based.
This problem is very easy to solve iff [1] you know how to use regex...
So, here's the code:
s = re.sub(r'\n{2,}', r'\n\n', in_file.read())
If you have memory constraints, this is not the best way, as we read the entire file into memory.
While loop based.
This code is really self-explanatory, but I wrote this line anyway...
s = in_file.read()
while "\n\n\n" in s:
s = s.replace("\n\n\n", "\n\n")
Again, if you have memory constraints, note that we still read the entire file into memory.
State based.
Another way to approach this problem is line-by-line. By keeping track whether the last line we encountered was blank, we can decide what to do.
was_last_line_blank = False
for line in in_file:
    # Lines read from a file keep their trailing newline, so test the
    # stripped line; this also treats whitespace-only lines as blank
    if not line.strip():
        was_last_line_blank = True
        continue
    if was_last_line_blank:
        # Add a single blank line to the output file between paragraphs
        out_file.write("\n")
    # Write contents of `line` (including its newline) to the file
    out_file.write(line)
    was_last_line_blank = False
Now, 2 of them need you to load the entire file into memory, and the other one is somewhat more complicated. My point is: all of these work, but since there is a small difference in how they work, what they need from the system varies...
[1] The "iff" is intentional.
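As a quick sanity check, the regex-based and while-loop-based approaches can be run on the same in-memory string to confirm they agree. The sample text is invented for this sketch.

```python
import re

sample = "para one\n\n\n\npara two\nstill two\n\npara three\n"

# Regex based: any run of 2 or more newlines becomes exactly 2
by_regex = re.sub(r"\n{2,}", "\n\n", sample)

# While loop based: keep shrinking triple newlines until none remain
by_loop = sample
while "\n\n\n" in by_loop:
    by_loop = by_loop.replace("\n\n\n", "\n\n")

print(by_regex == by_loop)
print(repr(by_regex))
```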
Basically, you want to keep only the lines that are non-empty (for an empty line, line.strip() returns an empty string, which is False in a boolean context). You can do this using a list or generator comprehension on the result of str.splitlines(), with an if clause to filter out empty lines.
Then for each line you want to ensure, that all words are separated by single space - for this you can use ' '.join() on result of str.split().
So this should do the job for you:
compressed = '\n'.join(
    ' '.join(line.split()) for line in txt.splitlines()
    if line.strip()
)
or you can use filter and map with helper function to make it maybe more readable:
def squash_line(line):
    return ' '.join(line.split())
non_empty_lines = filter(str.strip, txt.splitlines())
compressed = '\n'.join(map(squash_line, non_empty_lines))
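Note that filtering out every empty line also removes the blank lines between paragraphs. If those should survive as a single blank line, one possible variation (a sketch, with the hypothetical name `squash_paragraphs`) is to split the text into paragraphs first and squash each one separately.

```python
import re

def squash_line(line):
    return " ".join(line.split())

def squash_paragraphs(txt):
    """Squash spaces per line and keep one blank line between paragraphs."""
    # A paragraph break is a newline, optional whitespace, then another newline
    paragraphs = re.split(r"\n\s*\n", txt.strip())
    return "\n\n".join(
        "\n".join(squash_line(line) for line in par.splitlines() if line.strip())
        for par in paragraphs
    )

squashed = squash_paragraphs("Hi,   there\nsecond  line\n\n\n\nnext   par\n")
print(squashed)
```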
To fix the paragraph issue:
import re
with open("data.txt") as f:
    data = f.read()
# Collapse runs of two or more newlines into exactly one blank line
result = re.sub(r"\n{2,}", "\n\n", data)
print(result)

How to add a comma to the end of a list efficiently?

I have a list of horizontal names that is too long to open in excel. It's 90,000 names long. I need to add a comma after each name to put into my program. I tried find/replace but it freezes up my computer and crashes. Is there a clever way I can get a comma at the end of each name? My options to work with are python and excel thanks.
If you actually had a Python list, say names, then ','.join(names) would make it into a string with a comma between each name and the following one (if you need one at the end as well, just use + ',' to append one more comma to the result).
Even though you say you have "a list" I suspect you actually have a string instead, for example in a file, where the names are separated by...? You don't tell us, and therefore force us to guess. For example, if they're separated by line-ends (one name per line), your life is easiest:
with open('yourfile.txt') as f:
    # Each line still ends with its newline, so strip it before joining
    result = ','.join(line.rstrip('\n') for line in f)
(again, supplement this with a + ',' after the join if you need that, of course). That's because separation by line-ends is the normal default behavior for a text file, of course.
If the separator is something different, you'll have to read the file's contents as a string (with f.read()) and split it up appropriately then join it up again with commas.
For example, if the separator is a tab character:
with open('yourfile.txt') as f:
    result = ','.join(f.read().split('\t'))
As you see, it's not so much worse;-).
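Both cases can be folded into one small helper. This is only a sketch: the `comma_join` name and the sample names are invented, and it always appends the trailing comma the question asks for.

```python
def comma_join(text, separator="\n"):
    """Join the separated names with commas and add a trailing comma."""
    # Strip stray whitespace and drop empty pieces, e.g. after a final newline
    names = [piece.strip() for piece in text.split(separator) if piece.strip()]
    return ",".join(names) + ","

print(comma_join("Ann\nBob\nCy\n"))
print(comma_join("Ann Lee\tBob\tCy\n", separator="\t"))
```

With 90,000 names this still runs in a single pass over the text, so it should avoid the find/replace freeze described in the question.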
