Removing trailing whitespace after text in Python

Let's say that I have a file with several SQL queries, all separated by ";". If I put the contents of the file in a string and do:
with open(query_file) as f:
    query = f.read()
queries = query.split(';')
I'll get a list where each item is one of the queries, which is my final objective. However, how do I handle the case where the last item is only spaces, newlines, tabs or an empty string, and remove it? Can it be done via .split()? I want to avoid things like this:
>>> a = 'a;b;'
>>> a.split(';')
['a', 'b', '']
Or this (this is a bad example, but you get the idea):
>>> a = '''a;b;\n'''
>>> a.split(';')
['a', 'b', '\n']
Thanks!
EDIT: I'm open to other approaches as well; the goal is simply to separate the string into the individual queries.
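A minimal sketch of one approach (strip whitespace around each piece and drop pieces that are empty afterwards; query_file as in the question):
with open(query_file) as f:
    raw = f.read()

# strip surrounding whitespace from each query and drop whitespace-only leftovers
queries = [q.strip() for q in raw.split(';') if q.strip()]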

Related

Python 3 split()

When I'm splitting a string "abac" I'm getting undesired results.
Example
print("abac".split("a"))
Why does it print:
['', 'b', 'c']
instead of
['b', 'c']
Can anyone explain this behavior and guide me on how to get my desired output?
Thanks in advance.
As #DeepSpace pointed out (referring to the docs):
If sep is given, consecutive delimiters are not grouped together and are deemed to delimit empty strings (for example, '1,,2'.split(',') returns ['1', '', '2']).
Therefore I'd suggest using a better delimiter such as a comma. If this is the formatting you're stuck with, you can use the builtin filter() function as suggested in this answer; passing None as the function removes any "empty" (falsy) strings. Note that in Python 3 filter() returns a lazy iterator, so wrap it in list() to see the contents:
sample = 'abac'
filtered_sample = list(filter(None, sample.split('a')))
print(filtered_sample)
# ['b', 'c']
When you split a string in Python, you keep everything between your delimiters (even when it's an empty string!).
For example, if you had a list of letters separated by commas:
>>> "a,b,c,d".split(',')
['a','b','c','d']
If your list had some missing values you might leave the space in between the commas blank:
>>> "a,b,,d".split(',')
['a','b','','d']
The start and end of the string act as delimiters themselves, so if you have a leading or trailing delimiter you will also get this "empty string" sliced out of your main string:
>>> "a,b,c,d,,".split(',')
['a','b','c','d','','']
>>> ",a,b,c,d".split(',')
['','a','b','c','d']
If you want to get rid of any empty strings in your output, you can use the filter function.
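For example (wrapping the result in list(), since in Python 3 filter() returns a lazy iterator):
>>> list(filter(None, "a,b,c,d,,".split(',')))
['a', 'b', 'c', 'd']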
If instead you just want to get rid of this behavior near the edges of your main string, you can strip the delimiters off first:
>>> ",,a,b,c,d".strip(',')
"a,b,c,d"
>>> ",,a,b,c,d".strip(',').split(',')
['a','b','c','d']
In your example, "a" is what's called a delimiter. It acts as a boundary between the characters before it and after it. So, when you call split, it gets the characters before "a" and after "a" and inserts it into the list. Since there's nothing in front of the first "a" in the string "abac", it returns an empty string and inserts it into the list.
split will return the characters between the delimiters you specify (or between an end of the string and a delimiter), even if there aren't any, in which case it will return an empty string. (See the documentation for more information.)
In this case, if you don't want any empty strings in the output, you can use filter to remove them:
list(filter(lambda s: len(s) > 0, "abac".split("a")))
# ['b', 'c']

Splitting String with Multiple Delimiters in a Particular Order

I am dealing with a type of ASCII file where there are effectively 4 columns of data and each row is assigned to a line in the file. Below is an example of a row of data from this file:
'STOP.F 11966.0000:STOP DEPTH'
The data is always structured so that the delimiter between the first and second column is a period, the delimiter between the second and third column is a space and the delimiter between the third and fourth column is a colon.
Ideally, I would like to find a way to return the following result from the string above
['STOP', 'F', '11966.0000', 'STOP DEPTH']
I tried using a regular expression with the period, space and colon as delimiters, but it breaks down (see example below) because I don't know how to make the split respect a specific order of delimiters, or whether there is a way to cap the number of splits per delimiter inside the regular expression itself. I want the delimiters applied in the order given, each at most once.
import re
line = 'STOP.F 11966.0000:STOP DEPTH'
re.split("[. :]", line)
>>> ['STOP', 'F', '11966', '0000', 'STOP', 'DEPTH']
Any suggestions on a tidy way to do this?
This may work. Credit to Juan
import re
pattern = re.compile(r'^(.+)\.(.+) (.+):(.+)$')
line = 'STOP.F 11966.0000:STOP DEPTH'
pattern.search(line).groups()
Out[6]: ('STOP', 'F', '11966.0000', 'STOP DEPTH')
re.split() solution with a specific pattern (note: the variable-length lookbehinds below are not supported by the standard re module, which only allows fixed-width lookbehind, so this requires the third-party regex module):
import regex as re
s = 'STOP.F 11966.0000:STOP DEPTH'
result = re.split(r'(?<=^[^.]+)\.|(?<=^[^ ]+) |:', s)
print(result)
The output:
['STOP', 'F', '11966.0000', 'STOP DEPTH']
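For completeness, a standard-library alternative with no regular expressions: split on each delimiter at most once, in the required order (a sketch assuming every row contains all three delimiters):
line = 'STOP.F 11966.0000:STOP DEPTH'

first, rest = line.split('.', 1)    # split on the first period only
second, rest = rest.split(' ', 1)   # then on the first space
third, fourth = rest.split(':', 1)  # then on the first colon
print([first, second, third, fourth])
# ['STOP', 'F', '11966.0000', 'STOP DEPTH']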

read a file with single quote data and store it in a list in python

When I try to read a file and store its contents in a list, it fails to store a string that is inside single quotes as a single value in the list.
sample file:
12 3 'dsf dsf'
the list should contain
listname = [12, 3, 'dsf dsf']
What I get instead is:
listname = [12, 3, 'dsf', 'dsf']
Please help
Use the csv module.
Demo:
>>> import csv
>>> with open('input.txt') as inp:
...     print(list(csv.reader(inp, delimiter=' ', quotechar="'"))[0])
...
['12', '3', 'dsf dsf']
input.txt is the file containing your data in the example.
You can use the shlex module to split your data in a simple way.
import shlex

with open("sample file") as data:
    print(shlex.split(data.read()))
# ['12', '3', 'dsf dsf']
Try it:)
You can use regular expressions:
import re
my_regex = re.compile(r"(?<=')[\w\s]+(?=')|\w+")
with open ("filename.txt") as my_file:
my_list = my_regex.findall(my_file.read())
print(my_list)
Output for file content 12 3 'dsf dsf':
['12', '3', 'dsf dsf']
RegEx explanation:
(?<=') # matches if there's a single quote *before* the matched pattern
[\w\s]+ # matches one or more alphanumeric characters and spaces
(?=') # matches if there's a single quote *after* the matched pattern
| # match either the pattern above or below
\w+ # matches one or more alphanumeric characters
You can use:
>>> l = ['12', '3', 'dsf', 'dsf']
>>> l[2:] = [' '.join(l[2:])]
>>> l
['12', '3', 'dsf dsf']
Basically, you need to parse the data. Which is:
split it into tokens
interpret the resulting sequence
in your case, each token can be interpreted separately
For the 1st task:
each token is:
a run of non-space characters, or
a quote, then anything until another quote.
the separator is a single space (you didn't specify if runs of spaces/other whitespace characters are valid)
Interpretation:
quoted: take the enclosed text, discarding the quotes
non-quoted: convert to integer if possible (you didn't specify if it always is/should be an integer)
(you also didn't specify if it's always 2 integers + quoted string - i.e. if this combination should be enforced)
Since the syntax is very simple, the two tasks can be done at the same time:
import re

line = "12 3 'dsf dsf'"
tokens = []
re_sep = r"\s"
re_term = r"\S+"
re_quoted = r"'(?P<enclosed>[^']*)'"
# try the quoted form first so a leading quote is not swallowed by \S+
re_chunk = re.compile(
    "(?:(?P<quoted>%(re_quoted)s)|(?P<term>%(re_term)s))"
    "(?:%(re_sep)s|$)" % locals()
)

i = 0
while i < len(line):
    # compiled patterns accept a start position (plain re.match() does not)
    m = re_chunk.match(line, i)
    if not m:
        raise ValueError("invalid syntax at char %d" % i)
    gg = m.groupdict()
    token = gg['term']
    if token is not None:
        try:
            token = int(token)
        except ValueError:
            pass
    elif gg['quoted'] is not None:
        token = gg['enclosed']
    else:
        assert False, "invalid match: %r" % (m,)
    tokens.append(token)
    i = m.end()  # end() is an absolute position in the string, so assign rather than add

print(tokens)
# [12, 3, 'dsf dsf']
This is an example of how it can be done by hand. You can, however, reuse any existing parsing algorithm that can process the same syntax. csv and shlex suggested in other answers are examples. Do note though that they likely accept other syntax, too, which you may or may not want. E.g.:
shlex also accepts double quotes and constructs like "asd"fgh and 'asd'\''fgh'
csv allows multiple consecutive separators (producing an empty element) and things like 'asd'fgh (stripping the quotes) and asd'def' (leaving the quotes intact)
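For instance, shlex's quote handling is easy to see directly (shlex.split() runs in POSIX mode by default, which glues adjacent quoted and unquoted pieces together):
>>> import shlex
>>> shlex.split('12 3 "dsf dsf"')  # double quotes are accepted too
['12', '3', 'dsf dsf']
>>> shlex.split('"asd"fgh')        # adjacent pieces are concatenated
['asdfgh']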

How do I strip a string given a list of unwanted characters? Python

Is there a way to pass in a list instead of a single character to str.strip() in Python? I have been doing it this way:
unwanted = [c for c in '!##$%^&*(FGHJKmn']
s = 'FFFFoFob*&%ar**^'
for u in unwanted:
    s = s.strip(u)
print(s)
Desired output (the output above is correct, but there should surely be a more elegant way than how I'm coding it above):
oFob*&%ar
Strip and friends take a string representing a set of characters, so you can skip the loop:
>>> s = 'FFFFoFob*&%ar**^'
>>> s.strip('!##$%^&*(FGHJKmn')
'oFob*&%ar'
(the downside of this is that things like fn.rstrip(".png") seem to work for many filenames, but don't really work: the argument is treated as a set of characters to remove, not as a suffix)
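For instance, the character-set semantics can eat more than the intended suffix:
>>> "ping.png".rstrip(".png")
'pi'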
Since you only want to remove characters from the ends, not from the middle, you can just use:
>>> 'FFFFoFob*&%ar**^'.strip('!##$%^&*(FGHJKmn')
'oFob*&%ar'
Otherwise, use str.translate(). In Python 3 the deletion set is built with str.maketrans() (the two-argument form with None shown in older answers is the Python 2 API):
>>> 'FFFFoFob*&%ar**^'.translate(str.maketrans('', '', '!##$%^&*(FGHJKmn'))
'oobar'

splitting a string based on tab in the file

I have a file that contains values separated by tabs ("\t"). I am trying to create a list and store all the values of the file in it, but I ran into a problem. Here is my code:
line = "abc\tdef\tghi"
values = line.split("\t")
It works fine as long as there is only one tab between each value. But if there is more than one tab, the empty strings between the tabs end up in values as well. In my case the extra tab will mostly be after the last value in the file.
You can use regex here:
>>> import re
>>> strs = "foo\tbar\t\tspam"
>>> re.split(r'\t+', strs)
['foo', 'bar', 'spam']
update:
You can use str.rstrip to get rid of trailing '\t' and then apply regex.
>>> yas = "yas\t\tbs\tcda\t\t"
>>> re.split(r'\t+', yas.rstrip('\t'))
['yas', 'bs', 'cda']
Split on tab, but then remove all blank matches.
text = "hi\tthere\t\t\tmy main man"
print([splits for splits in text.split("\t") if splits])
Outputs:
['hi', 'there', 'my main man']
You can use regexp to do this:
>>> import re
>>> patt = re.compile("[^\t]+")
>>> s = "a\t\tbcde\t\tef"
>>> patt.findall(s)
['a', 'bcde', 'ef']
Another regex-based solution:
>>> import re
>>> strs = "foo\tbar\t\tspam"
>>> r = re.compile(r'([^\t]*)\t*')
>>> r.findall(strs)[:-1]
['foo', 'bar', 'spam']
Python has support for CSV files in the eponymous csv module. It is somewhat misnamed, since it supports much more than just comma-separated values.
If you need to go beyond basic word splitting, for example because you need to deal with quoted values, you should take a look.
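A minimal sketch for the tab-separated case (assuming a hypothetical file data.tsv; the delimiter parameter does the work):
import csv

# 'data.tsv' is a hypothetical tab-separated file
with open('data.tsv', newline='') as f:
    rows = list(csv.reader(f, delimiter='\t'))
print(rows)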
