Removing tab delimited spaces from a text file using for loop - python

For my python class, I am working on opening a .tsv file and taking 15 rows of data, broken down in 4 columns, and turning it into lists for each line. To do this, I must remove the tabs in between each column.
I've been advised to use a for loop and loop through each line. This makes sense but I can't figure out how to remove the tabs.
Any help?

To read lines from a file, and split each line on the tab delimiter, you can do this:
rows = []
with open('file.tsv') as f:
    for line in f:
        rows.append(line.rstrip('\r\n').split('\t'))

Strictly speaking, this should be done with Python's csv module (as mentioned in another answer), since it handles escaped separators, quoted values and so on.
In the more general sense, this can be done with a list comprehension:
rows = [line.split('\t') for line in file]
And, as suggested in the comments, in some cases a generator expression would be a better choice:
rows = (line.split('\t') for line in file)
See Generator Expressions vs. List Comprehensions for some discussion on when to use each.
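As a rough illustration of the difference, the sketch below uses an in-memory io.StringIO as a stand-in for the opened file (the sample data is made up). The generator only reads as many lines as are asked for, which suits the "first 15 rows" case from the question.

```python
import io
from itertools import islice

# Stand-in for an open .tsv file; the contents are invented for illustration.
sample = io.StringIO("".join(f"a{i}\tb{i}\tc{i}\td{i}\n" for i in range(100)))

# Generator expression: nothing is read until rows are requested.
lazy_rows = (line.rstrip("\n").split("\t") for line in sample)

# Pull only the first 15 rows; the remaining lines are never split.
first_15 = list(islice(lazy_rows, 15))

print(len(first_15))   # 15
print(first_15[0])     # ['a0', 'b0', 'c0', 'd0']
```

With a list comprehension in place of the generator expression, all 100 lines would be split before the first row could be used.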

You should use Python's stdlib csv module, particularly the csv.reader function.
import csv
rows = list(csv.reader(open('yourfile.tsv', newline=''), delimiter='\t'))
There's also a dialect parameter that can take excel-tab to conform to Microsoft Excel's tab-delimited format.
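A minimal sketch of the dialect option, assuming made-up data held in an io.StringIO rather than a real file. It also shows why the csv module is worth using: a quoted field that contains a tab stays in one piece.

```python
import csv
import io

# Hypothetical tab-delimited data; the quoted field contains a literal tab.
data = 'name\tnote\nalice\t"likes\ttabs"\n'

# newline='' leaves line-ending handling to the csv module.
reader = csv.reader(io.StringIO(data, newline=''), dialect='excel-tab')
rows = list(reader)

print(rows)  # [['name', 'note'], ['alice', 'likes\ttabs']]
```

A plain split('\t') on the second line would have produced three fields and kept the quote characters.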

Check out the built-in string functions. split() should do the job.
>>> line = 'word1\tword2\tword3'
>>> line.split('\t')
['word1', 'word2', 'word3']

Related

How to parse a csv file with a custom delimiter

I have a csv file with a custom delimiter as "$&$" like this:
$&$value$&$,$&$type$&$,$&$text$&$
$&$N$&$,$&$N$&$,$&$text of the message$&$
$&$N$&$,$&$F$&$,$&$text of the message_2$&$
$&$B$&$,$&$N$&$,$&$text of the message_3$&$
and I'm not able to parse it with the following code:
df = pd.read_csv('messages.csv', delimiter='$\&$', engine='python')
can you help me, please??
From the docs:
... separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
So, to fix your case it should be like this:
df = pd.read_csv('messages.csv', delimiter=r'\$\&\$,\$\&\$|\$\&\$', usecols=[1,2,3], engine='python')
Note that this produces two additional, empty columns: the first one and the last one. They exist because every line starts and ends with $&$, while the delimiter between fields is actually $&$,$&$. Passing usecols=[1,2,3] gets rid of them.
This is the output from the provided sample:
value  type  text
N      N     text of the message
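To see where the two extra columns come from without pandas, here is a small sketch that applies the same alternation with the standard re module to one line of the sample. The variable names are only illustrative.

```python
import re

# One line from the sample file in the question.
line = "$&$N$&$,$&$N$&$,$&$text of the message$&$"

# Same alternation as the delimiter above: try the long "$&$,$&$" form first,
# then fall back to a bare "$&$" at the start or end of the line.
pattern = re.compile(r"\$&\$,\$&\$|\$&\$")

fields = pattern.split(line)
print(fields)        # ['', 'N', 'N', 'text of the message', '']
print(fields[1:-1])  # ['N', 'N', 'text of the message']
```

The empty strings at both ends correspond to the first and last columns that usecols discards.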

Python Reading Variable Whitespace Text Table Format

I have this weird output from another tool, which I cannot change or modify, that I need to parse and analyse. Any ideas on what pandas or Python library I should use? It has space filling between columns so that each column start is aligned properly, which makes it difficult. White space and tabs are not the delimiter.
If the columns are consistent and every cell has a value, it should actually be pretty easy to parse this manually. You can do some variation of:
your_file = 'C:\\whatever.txt'
with open(your_file) as f:
    for line in f:
        total_capacity, existing_load, recallable_load, new_load, excess_load, excess_capacity_needed = line.strip().split()
        # analysis
If you have strings with spaces, you can alternatively do:
for line in f:
    row = (cell.strip() for cell in line.split('  ') if cell.strip())
    total_capacity, existing_load, recallable_load, new_load, excess_load, excess_capacity_needed = row
    # analysis
This splits the line on every pair of spaces, and uses a generator expression to strip excess whitespace/tabs/newlines from each value and drop any values that end up empty. If you have strings that can contain multiple consecutive spaces... that would be an issue. You'd need a way to ensure those strings were enclosed in quotes so you could isolate them.
If by "white space and tabs are not the delimiter" you meant that the whitespace is some unusual blank Unicode character, you can just do:
row = (cell.strip() for cell in line.split(unicode_character_here) if cell.strip())
Just make sure that for any of these solutions, you remember to add some checks for those ===== dividers and add a way to detect the titles of the columns if necessary.
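A sketch of those checks, under the assumption that columns are separated by at least two spaces. The sample text and column titles are invented, since the tool's real output is not shown in the question.

```python
import re

# Invented sample; the real tool's output is not shown in the question.
text = """\
Total Capacity  Existing Load  New Load
==============  =============  ========
100.0           40.5           12.0
250.0           80.0           7.5
"""

rows = []
header = None
for line in text.splitlines():
    stripped = line.strip()
    # Skip blank lines and the ===== divider rows.
    if not stripped or set(stripped) <= {"=", " "}:
        continue
    # Two or more spaces separate columns, so single spaces inside a title survive.
    cells = re.split(r"\s{2,}", stripped)
    if header is None:
        header = cells  # first non-divider line holds the column titles
    else:
        rows.append([float(cell) for cell in cells])

print(header)  # ['Total Capacity', 'Existing Load', 'New Load']
print(rows)    # [[100.0, 40.5, 12.0], [250.0, 80.0, 7.5]]
```

If the file can contain empty cells, splitting on whitespace will misalign the columns, and a fixed-width approach based on column positions would be a safer choice.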

Does a fast Python built-in method for reading lines and then splitting them exist?

This method works just fine in Python:
with open(file) as f:
    for line in f:
        for field in line.rstrip().split('\t'):
            continue
However, it also means I read each line twice. First I loop over each character of the file and search for newline characters and second I loop over each character of the line and search for tab spaces. Is there a built-in method for splitting lines, while avoiding looping over the same set of characters twice? Apologies if this is a stupid question.
If you're worried about this level of efficiency then you probably shouldn't be programming in Python. Most of what is happening in that loop happens in C (if you're using the CPython implementation). You're not going to find a more efficient way to process your data using a pure python approach or without creating a very complicated looping structure.
If I wanted to avoid looping over the lines and handle the whole file in one go I would go with a regular expression. Also, regular expressions should be really fast.
import re
regexp = re.compile("\n+")
with open(file) as f:
    lines = re.split(regexp, f.read())
Here \n+ matches one or more newlines and splits the file there. The result is a Python list with all the lines. If you want to split on another character, for example whitespace (including tabs and newlines), replace \n+ with \s+. Depending on what you want to do with the lines, this might not be faster. timeit is your friend.
More on Python's regular expressions:
https://docs.python.org/2/library/re.html
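One way to check this on your own data is a small timeit comparison. The sketch below uses synthetic in-memory text rather than a real file, and the two function names are made up for the example. Timings will vary by machine, so no figures are claimed here.

```python
import re
import timeit

# Synthetic data standing in for the file contents.
text = "\n".join("\t".join(str(n) for n in range(10)) for _ in range(1000))

def split_per_line():
    # Split into lines first, then split each line on tabs.
    return [line.split("\t") for line in text.splitlines()]

newlines = re.compile(r"\n+")

def split_with_regex():
    # Split the whole text on newlines with a regex, then each line on tabs.
    return [line.split("\t") for line in newlines.split(text)]

# Both approaches should produce identical rows.
assert split_per_line() == split_with_regex()

for func in (split_per_line, split_with_regex):
    elapsed = timeit.timeit(func, number=100)
    print(f"{func.__name__}: {elapsed:.4f} s")
```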

Eliminating extra commas

I am having trouble replacing three commas with one comma in a text file of data.
I am processing a large text file to put it into comma delimited format so I can query it using a database.
I do the following at the command prompt and it works:
>>> import re
>>> line = 'one,,,two'
>>> line=re.sub(',+',',',line)
>>> print line
one,two
>>>
Below is my actual code:
import re
with open("dmis8.txt", "r") as ifp:
    with open("dmis7.txt", "w") as ofp:
        for line in ifp:
            #join lines by removing a line ending.
            line=re.sub('(?m)(MM/ANGDEC)[\r\n]+$','',line)
            #various replacements of text with nothing. This removes the text
            line=re.sub('IDENTIFIER','',line)
            line=re.sub('PART','50-1437',line)
            line=re.sub('Eval','',line)
            line=re.sub('Feat','',line)
            line=re.sub('=','',line)
            #line=re.sub('r"++++"','',line)
            line=re.sub('r"----|"',' ',line)
            line=re.sub('Nom','',line)
            line=re.sub('Act',' ',line)
            line=re.sub('Dev','',line)
            line=re.sub('LwTol','',line)
            line=re.sub('UpTol','',line)
            line=re.sub(':','',line)
            line=re.sub('(?m)(Trend)[\r\n]*$',' ',line)
            #Remove spaces replace with semicolon
            line=re.sub('[ \v\t\f]+', ',', line)
            #no worky line=re.sub(r",,,",',',line)
            line=re.sub(',+',',',line)
            #line=line.replace(",+", ",")
            #line=line.replace(",,,", ",")
            ofp.write(line)
This is what I get from the code above.
There are several commas together, and I don't understand why they aren't replaced by a single comma.
Never mind that I don't see how the extra commas got there in the first place.
50-1437,d
2012/05/01
00/08/27
232_PD_1_DIA,PED_HL1_CR,,,12.482,12.478,-0.004,-0.021,0.020,----|++++
232_PD_2_DIA_TOP,PED_HL2_TOP,,12.482,12.483,0.001,-0.021,0.020,----|++++
232_PD_2_DIA,PED_HL2_CR,,12.482,12.477,-0.005,-0.021,0.020,----|++++
232_PD_2_DIA_BOT,PED_HL2_BOT,,12.482,12.470,-0.012,-0.021,0.020,--|--++++
raw data for reference:
PART IDENTIFIER : d
2012/05/01
00/08/27
232_PD_1_DIA Eval Feat = PED_HL1_CR MM/ANGDEC
Nom Act Dev LwTol UpTol Trend
12.482 12.478 -0.004 -0.021 0.020 ----|++++
232_PD_2_DIA_TOP Eval Feat = PED_HL2_TOP MM/ANGDEC
12.482 12.483 0.001 -0.021 0.020 ----|++++
232_PD_2_DIA Eval Feat = PED_HL2_CR MM/ANGDEC
12.482 12.477 -0.005 -0.021 0.020 ----|++++
Can someone kindly point what I am doing wrong?
thanks in advance...
Your regex is working fine. The problem is that you concatenate the lines (by write()ing them) after you scrub them with your regex.
Instead, use "".join() on all of your lines, run re.sub() on the whole thing, and then write() it all to the file at once.
I think your problem is caused by the fact that removing line endings does not join lines, in combination with the fact that write does not add newlines to the end of each string. So you have multiple input lines that look like a single line in the output.
Looking at the comments, you seem to think that replacing the end of a line with an empty string will magically append the next line to it, but that isn't how it works. The three commas you're seeing are not replaced by your re.sub call because they're not on one line. They come from multiple input lines (which, after all the replacements, are empty except for commas) that end up on a single output line because you stripped their '\n' characters, and write doesn't automatically add '\n' to the end of each written string (unlike print).
To debug your code, just put print line after each line of code, to see what each "line" actually is - that should help you see what's going wrong.
In general, reading file formats where each "record" spans multiple lines requires more complicated methods than just a for line in file loop.
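The effect described in these answers can be reproduced in a few lines. This is only a sketch with made-up fragments standing in for the cleaned-up input lines. It compares substituting per piece with substituting after the pieces are joined.

```python
import re

# Made-up fragments, as they might look after the per-line replacements:
# each piece ends with a comma or is nothing but a comma.
pieces = ["PED_HL1_CR,", ",", ",12.482\n"]

# Cleaning each piece separately cannot merge commas that sit in different pieces.
per_piece = "".join(re.sub(",+", ",", piece) for piece in pieces)
print(repr(per_piece))  # 'PED_HL1_CR,,,12.482\n'

# Joining first and substituting afterwards sees the commas as one run.
whole = re.sub(",+", ",", "".join(pieces))
print(repr(whole))      # 'PED_HL1_CR,12.482\n'
```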

Python reading 2 strings from the same line

How can I read two strings at a time from a text file when they are written on the same line?
e.g.
francesco 10
# out is your file
out.readline().split() # result is ['francesco', '10']
This assumes that your two strings are separated by whitespace. You can also split on any other string (comma, colon, etc.) by passing it to split().
Why not read just the line and split it up later? You'd have to read byte-by-byte and look for the space character, which is very inefficient. Better to read the entire line, and then split the resulting string on the space, giving you two strings.
'francesco 10'.split()
will give you ['francesco', '10'].
for line in fi:
    line.split()
It's ideal to just iterate over the file object.
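Putting the pieces together, a short sketch that unpacks the two values from each line. An io.StringIO stands in for the opened file, and the second line of data and the variable names are invented for the example.

```python
import io

# Stand-in for the opened text file; the second line is invented.
fi = io.StringIO("francesco 10\nmaria 7\n")

for line in fi:
    # split() with no argument splits on any run of whitespace
    # and drops the trailing newline.
    name, score = line.split()
    print(name, int(score))
```

The unpacking raises ValueError if a line does not contain exactly two items, which is a cheap way to catch malformed lines.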
