How would I print all the instances when the "$" shows up? - python

I have this string and I'm basically trying to get the numbers after the "$" shows up. For example, I would want an output like:
>>> 100, 654, 123, 111.654
The variable and string:
file = """| $100 on the first line
| $654 on the second line
| $123 on the third line
| $111.654 on the fourth line"""
And as of right now, I have this bit of code that I think helps me separate the numbers. But I can't figure out why it's only separating the fourth line. It only prints out 111.654
txt = io.StringIO(file).getvalue()
idx = txt.rfind('$')
print(txt[idx+1:].split()[0])
Is there an easier way to do this or am I just forgetting something?

Your code finds only the last $ because that's exactly what you programmed it to do.
You take the entire input, find the last $, and then split the rest of the string. This specifically ignores any other $ in the input.
You cite "line" as if it's a unit of your program, but you've done nothing to iterate through lines. I recommend that you quit fiddling with io and simply use standard file operations. You can find these in any tutorial on Python files.
In the meantime, here's how you handle the input you have:
by_lines = txt.split('\n')  # Split on newline characters
for line in by_lines:
    idx = line.rfind('$')
    print(line[idx+1:].split()[0])
Output:
100
654
123
111.654
Does that get you moving?
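One caveat with the loop above: on a line that has no $ at all, rfind returns -1 and the first word of the line is printed instead. A minimal sketch of a guarded version follows; the function name amounts_by_line and the extra "no amount" line are hypothetical, and it uses str.partition, which splits at the first $ on the line rather than the last.

```python
def amounts_by_line(text):
    """Return the token that follows the first '$' on each line that has one."""
    found = []
    for line in text.splitlines():
        _, sep, rest = line.partition('$')
        tokens = rest.split()
        if sep and tokens:          # skip lines without '$' or with nothing after it
            found.append(tokens[0])
    return found

sample = """| $100 on the first line
| no amount on this line
| $111.654 on the fourth line"""
print(amounts_by_line(sample))
```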

Regular expressions yay:
import re
matches = re.findall(r'\$(\d+\.\d+|\d+)', file)
This finds all integer and float amounts, and ensures that a trailing '.' (full stop) is not captured as part of the number.

This should do it! For every character in txt: if it is '$' then continue until you find a space.
print(*[txt[i+1: i + txt[i:].find(' ')] for i in range(0, len(txt)) if txt[i]=='$'])
Output:
100 654 123 111.654

Your whole sequence appears to be a single string. Try using the split function to break it into separate lines. Then, I believe you need to iterate through the entire list, searching for $ at each iteration.
I'm not the most fluent in python, but maybe something like this:
for line in txt.split('\n'):
    idx = line.rfind('$')
    print(line[idx+1:].split()[0])

How about this?
re.findall(r'\$(\d+\.?\d*)', file)
# ['100', '654', '123', '111.654']
The regex looks for the dollar sign \$ then grabs the maximum sized group available () containing one or more digits \d+ and zero or one decimal points \.? and zero or more digits \d* after that.
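The two regex answers in this thread behave the same on the question's data and differ only when an amount is directly followed by a full stop. A small sketch of that difference, using a made-up sample string (the "total" line is an assumption, not part of the question):

```python
import re

# Hypothetical input: the last amount is followed by a sentence-ending full stop.
sample = "| $100 on the first line\n| $111.654 on the fourth line\n| total: $5."

# Alternation: a decimal point is only consumed when digits follow it.
strict = re.findall(r'\$(\d+\.\d+|\d+)', sample)
# Optional point: a trailing '.' is captured even with no digits after it.
loose = re.findall(r'\$(\d+\.?\d*)', sample)

print(strict)  # ['100', '111.654', '5']
print(loose)   # ['100', '111.654', '5.']
```

Both results convert cleanly with float(), so the difference matters mostly when the strings are printed or compared as text.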

Related

Want to replace comma with decimal point in text file where after each number there is a comma in python

For example, given:
Arun,Mishra,108,23,34,45,56,Mumbai
the output I want is:
Arun,Mishra,108.23,34,45,56,Mumbai
I tried text.replace(',', '.'), but that replaces every comma, including the delimiters, with a dot.
You can use a regex for this kind of task:
import re
old_str = 'Arun,Mishra,108,23,34,45,56,Mumbai'
new_str = re.sub(r'(\d+)(,)(\d+)', r'\1.\3', old_str, count=1)
# new_str is 'Arun,Mishra,108.23,34,45,56,Mumbai'
The search pattern r'(\d+)(,)(\d+)' finds a comma between two numbers. It has three capture groups, which the replacement r'\1.\3' can reference (\1 and \3 are the first and third groups). old_str is the input string, and count=1 tells re.sub to replace only the first occurrence (thus keeping the commas in 34,45 and so on).
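Since the question is about a text file, the same substitution would normally be applied line by line. A minimal sketch, assuming every record has its decimal pair as the first "digits,digits" occurrence; the helper name merge_first_pair and the second record are hypothetical:

```python
import re

def merge_first_pair(line):
    """Replace only the first comma that sits between two runs of digits."""
    return re.sub(r'(\d+),(\d+)', r'\1.\2', line, count=1)

lines = [
    'Arun,Mishra,108,23,34,45,56,Mumbai',
    'Ravi,Kumar,7,5,1,2,3,Delhi',  # hypothetical second record
]
fixed = [merge_first_pair(line) for line in lines]
print(fixed)
```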
It may be instructive to show how this can be done without additional module imports.
The idea is to search the string for all/any commas. Once the index of a comma has been identified, examine the characters either side (checking for digits). If such a pattern is observed, modify the string accordingly
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
pos = 1
while (pos := s.find(',', pos, len(s)-1)) > 0:
    if s[pos-1].isdigit() and s[pos+1].isdigit():
        s = s[:pos] + '.' + s[pos+1:]
        break
    pos += 1
print(s)
Output:
Arun,Mishra,108.23,34,45,56,Mumbai
Assuming you have a plain CSV file as in your single line example, we can assume there are 8 columns and you want to 'merge' columns 3 and 4 together. You can do this with a regular expression - as shown below.
Here I explicitly match the 8 columns into 8 groups - matching everything that is not a comma as a column value and then write out the 8 columns again with commas separating all except columns 3 and 4 where I put the period/dot you require.
$ echo "Arun,Mishra,108,23,34,45,56,Mumbai" | sed -r "s/([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)/\1,\2,\3.\4,\5,\6,\7,\8/"
Arun,Mishra,108.23,34,45,56,Mumbai
This regex is for your exact data. A generic regex that replaces any comma between two adjacent runs of digits might give false matches on other data, so I think explicitly matching the exact columns you have is the safest way to do it.
You can take the above regex and code it into your python code as shown below.
import re
inLine = 'Arun,Mishra,108,23,34,45,56,Mumbai'
outLine = re.sub(r'([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)'
, r'\1,\2,\3.\4,\5,\6,\7,\8', inLine, 0)
print(outLine)
As Tim Biegeleisen pointed out in an original comment, if you have access to the original source data you would be better fixing the formatting there. Of course that is not always possible.
First split the string with s.split(','), then join the third and fourth elements with a '.', and finally join the list back into a string.
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
ls = s.split(',')
ls[2] = '.'.join([ls[2], ls[3]])
ls.pop(3)
s = ','.join(ls)
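For a whole file, the same split-and-join idea can be expressed with the standard csv module, which also copes with quoted fields. This is only a sketch under the assumption that the integer and fractional parts always sit in the third and fourth columns; merge_columns is a hypothetical helper and io.StringIO stands in for real file objects.

```python
import csv
import io

def merge_columns(row, first=2):
    """Join row[first] and row[first + 1] with a decimal point."""
    return row[:first] + [row[first] + '.' + row[first + 1]] + row[first + 2:]

source = io.StringIO("Arun,Mishra,108,23,34,45,56,Mumbai\n")  # stand-in for open(...)
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for row in csv.reader(source):
    writer.writerow(merge_columns(row))

print(out.getvalue(), end="")
```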
This changes every comma to a dot if the comma has a digit directly before and after it.
txt = "2459,12 is the best number. lets change the dots . with commas , 458,45."
commaindex = txt.find(",")
while commaindex != -1:
    # Only touch commas that are not at either end and sit between two digits
    if 0 < commaindex < len(txt) - 1 and txt[commaindex-1].isnumeric() and txt[commaindex+1].isnumeric():
        txt = txt[:commaindex] + "." + txt[commaindex+1:]
    commaindex = txt.find(",", commaindex + 1)
print(txt)

Regular expression help to find space after a long string

My code is as follows:
list = re.findall(("PROGRAM S\d\d"), contents)
If I print the list, I only get the text up to S51, but I want the whole name.
I want to find everything like "PROGRAM S51_Mix_Station". I know how to match the digits, but I don't know how to match everything up to the next space; there is usually a space after the last character.
Thanks in advance.
You can also use \w+:
import re
s = "PROGRAM S51_Mix_Station"
new_data = re.findall(r'^PROGRAM\s\w+_\w+_\w+', s)
final_data = new_data[0] if new_data else new_data
Output:
'PROGRAM S51_Mix_Station'
OK, thanks. I found another solution:
lista = re.findall(r"PROGRAM S\d\d\S+", contents)
The \S+ matches the run of non-whitespace characters that follows the digits.
You could use this:
list = re.findall(r"PROGRAM S\d\d[^ ]*", contents)
This would match PROGRAM S followed by two digits, then followed by any number of non-space characters. If you want the match to stop at any whitespace character (tabs and newlines as well as spaces), then @Wiktor's comment would be better, i.e. use PROGRAM S\d\d\S*.
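The difference between [^ ]* and \S* only shows up when a name is followed by a tab or a newline instead of a space. A short sketch with made-up contents (the second and third program lines are assumptions):

```python
import re

# Hypothetical file contents: one name ends at a space, one at a tab.
contents = "PROGRAM S51_Mix_Station \nPROGRAM S52_Pump\tEND\nPROGRAM S5"

stop_at_space = re.findall(r"PROGRAM S\d\d[^ ]*", contents)
stop_at_whitespace = re.findall(r"PROGRAM S\d\d\S*", contents)

print(stop_at_space)       # second match runs on through the tab and newline
print(stop_at_whitespace)  # each match ends at the first whitespace character
```

The last line, "PROGRAM S5", is not matched by either pattern because it has only one digit.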

Elegant way test in python if string contains nothing except 0-9,e,+,-,spaces,tabs

I would like to find the most efficient and simple way to test in python if a string passes the following criteria:
contains nothing except:
digits (the numbers 0-9)
decimal points: '.'
the letter 'e'
the sign '+' or '-'
spaces (any number of them)
tabs (any number of them)
I can do this easily with nested 'if' loops, etc., but i'm wondering if there's a more convenient way...
For example, I would want the string:
0.0009017041601 5.13623e-05 0.00137531 0.00124203
to be 'true' and all the following to be 'false':
# File generated at 10:45am Tuesday, July 8th
# Velocity: 82.568
# Ambient Pressure: 150000.0
Time(seconds) Force_x Force_y Force_z
That's trivial for a regex, using a character class:
import re
if re.match(r"[0-9e \t+.-]*$", subject):
    print("Match!")
However, that will (according to the rules) also match eeeee or +-e-+ etc...
If what you actually want to do is check whether a given string is a valid number, you could simply use
try:
    num = float(subject)
except ValueError:
    print("Illegal value")
This will handle strings like "+34" or "-4e-50" or " 3.456e7 ".
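For the data lines in the question, which hold several numbers per line, the float() check can be applied to each whitespace-separated token. A minimal sketch; the function name is_numeric_row is hypothetical, and note that float() is more permissive than the character list in the question (it also accepts values such as "nan" and "inf").

```python
def is_numeric_row(line):
    """True if the line has at least one token and every token parses as a float."""
    tokens = line.split()
    if not tokens:
        return False
    try:
        for token in tokens:
            float(token)
    except ValueError:
        return False
    return True

print(is_numeric_row("0.0009017041601 5.13623e-05 0.00137531 0.00124203"))
print(is_numeric_row("# Velocity: 82.568"))
```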
import re
if re.match(r"^[0-9\te+. -]+$", x):
    print("yes")
else:
    print("no")
You can try this. If there is a match, the string passes; otherwise it fails. Here x is your string.
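The easiest way to check whether the string has only the required characters is to delete them with the str.translate method and see whether anything is left. In Python 3 the deletion table is built with str.maketrans.
num = "1234e+5"
# Delete every allowed character; anything left over is an invalid character
leftover = num.translate(str.maketrans("", "", "0123456789e+-. \t"))
if not leftover:
    print("pass")
else:
    print("Wrong character present!!!")
You can add any other allowed characters to the third argument of str.maketrans.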
You don't need regular expressions; just use a list of allowed characters (test_list) and the all function:
>>> from string import digits
>>> test_list=list(digits)+['+','-',' ','\t','e','.']
>>> all(i in test_list for i in s)
Demo:
>>> s ='+4534e '
>>> all(i in test_list for i in s)
True
>>> s='+9328a '
>>> all(i in test_list for i in s)
False
>>> s="0.0009017041601 5.13623e-05 0.00137531 0.00124203"
>>> all(i in test_list for i in s)
True
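A possible variation on the same idea, sketched here as an assumption rather than as part of the answer above: keep the allowed characters in a frozenset and use a subset test, which avoids scanning a list for every character. The names ALLOWED and only_allowed are hypothetical.

```python
from string import digits

# Allowed characters from the question: digits, '.', 'e', '+', '-', space, tab.
ALLOWED = frozenset(digits + "+-e. \t")

def only_allowed(s):
    """True if every character of s is in ALLOWED (also True for an empty string)."""
    return set(s) <= ALLOWED

print(only_allowed("0.0009017041601 5.13623e-05 0.00137531 0.00124203"))
print(only_allowed("# Ambient Pressure: 150000.0"))
```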
Performance-wise, running a regular expression check is costly, depending on the expression. Running a regex check on each valid line (i.e. each line for which the result should be "True") is also costly, especially because you end up scanning each line with a regex and then parsing the same line again to get the numbers.
You did not say what you want to do with the data, so I will assume a few things.
First off, in a case like this I would make sure the data source is always formatted the same way. Using your example as a template, I would define the following convention:
1. any line whose first non-blank character is a hash sign is ignored
2. any blank line is ignored
3. any line that contains only spaces is ignored
This kind of convention makes parsing much easier, since a single regular expression covers rules 1 to 3: ^\s*(#|$), i.e. any amount of whitespace followed by either a hash sign or the end of the line. On the performance side, this expression scans an entire line only when the line consists of nothing but whitespace, which should not happen very often. In all other cases it stops at the first non-space character, so comments are detected quickly: the scan stops as soon as the hash is encountered, at position 0 most of the time.
If you can also enforce the following convention:
the first non-blank line of the remaining lines is the header with column names
there are no blank lines between samples
there are no comments in samples
Your code would then do the following:
read lines into line for as long as re.match(r'^\s*(#|$)', line) finds a match;
then read the headers from the first line that does not match: headers = line.split() gives you the headers in a list.
You can use a namedtuple for your line layout, which I assume is constant throughout the same data table:
from collections import namedtuple

class WindSample(namedtuple('WindSample', 'time, force_x, force_y, force_z')):
    def __new__(cls, time, force_x, force_y, force_z):
        # Convert every field to float; an invalid value raises ValueError
        return super(WindSample, cls).__new__(
            cls,
            float(time),
            float(force_x),
            float(force_y),
            float(force_z)
        )
Parsing valid lines would then consist of the following, for each line:
try:
    data = WindSample(*line.split())
except ValueError as e:
    print(e)
Variable data would hold something such as:
>>> print(data)
WindSample(time=0.0009017041601, force_x=5.13623e-05, force_y=0.00137531, force_z=0.00124203)
The advantage is twofold:
you run costly regular expressions only for the smallest set of lines (i.e. blank lines and comments);
your code parses floats, raising an exception whenever parsing would yield something invalid.
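Put together, the procedure described above could look roughly like the following sketch. It assumes Python 3, the four-column layout from the question, and input that follows the conventions listed earlier; parse_table, SKIP and Sample are hypothetical names, and blank lines after the header are tolerated as a small extra safeguard.

```python
import re
from collections import namedtuple

SKIP = re.compile(r'^\s*(#|$)')  # comment, blank, or whitespace-only line
Sample = namedtuple('Sample', 'time force_x force_y force_z')

def parse_table(lines):
    """Return (header, samples) from an iterable of text lines."""
    lines = iter(lines)
    header = None
    for line in lines:
        if not SKIP.match(line):      # first real line is the header
            header = line.split()
            break
    # Every remaining non-blank line is a sample; float() raises ValueError on bad data.
    samples = [Sample(*map(float, line.split())) for line in lines if line.strip()]
    return header, samples

raw = """# File generated at 10:45am Tuesday, July 8th
# Velocity: 82.568

Time(seconds) Force_x Force_y Force_z
0.0009017041601 5.13623e-05 0.00137531 0.00124203
"""
header, samples = parse_table(raw.splitlines())
print(header)
print(samples[0])
```

The regular expression only ever runs on the leading comment and blank lines; the data rows are parsed once, by float().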

Using Python 2.4.3: Want to find the same regex multiple times in a text file

Super NOOB to Python (2.4.3): I am executing a function containing a regular expression which searches through a txt file that I'm importing. I am able to read and run re.search on the text file and the output is correct. I need to run this for multiple occurrences (the regex occurs 48 times in the text). The code is as follows:
#!/usr/bin/python
import re
dataRead = open('pd_usage_14-04-23.txt', 'r')
dataWrite = open('test_write.txt', 'w')
text = (dataRead.read()) #reads and initializes text for conversion to string
s = str(text) #converts text to string for reading
def user(str):
    re1='((?:[a-z][a-z]+))' # Word 1
    re2='(\\s+)' # White Space 1
    re3='((?:[a-z][a-z]+))' # Word 2
    re4='(\\s+)' # White Space 2
    re5='((?:[a-z][a-z]*[0-9]+[a-z0-9]*))' # Alphanum 1
    rg = re.compile(re1+re2+re3+re4+re5,re.IGNORECASE|re.DOTALL)
    #alphanum1=rg.group(5)
    re.findall(rg, s, flags=0)
    #print "("+alphanum1+")"+"\n"
    #if m:
    #    word1=m.group(1)
    #    ws1=m.group(2)
    #    word2=m.group(3)
    #    ws2=m.group(4)
    #    alphanum1=m.group(5)
    #    print "("+alphanum1+")"+"\n"
    return
user(s)
dataRead.close()
dataWrite.close()
OUTPUT: g706454
THIS OUTPUT IS CORRECT! BUT...!
I need to run it multiple times, reading text that's further down.
I have 2 other definitions that need to be run multiple times also. I need all 3 to run consecutively, and then run again but starting with the next line or something to search and output newer data. All the logic I tried to implement returns the same output.
So I have something like this:
for count in range (0,47):
    if stop_read:
        date(s)
        usage(s)
        user(s)
stop_read is a definition that finds the next line after the data that I'm looking for (date, usage, user). I figured I could call this to say If you hit stop_read, read the next line and run definitions all over again.
Any help is greatly appreciated!
Here is what I do for a regex in Python 3; it should be similar in Python 2. This is for a multiline search.
regex = re.compile("\\w+-\\d+\\b", re.MULTILINE)
Then later on in code I have something like:
myset.update([m.group(0) for m in regex.finditer(logmsg.text)])
Maybe you might want to update your Python if you can, 2.4 is old, old, and stale.
It looks like re.findall would solve your problem. From the documentation:
re.findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
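Applied to the question, this means the pattern can be compiled once and handed the whole text, so no loop over occurrences is needed. The sketch below is written for Python 3 and uses made-up sample text; it keeps the shape of the asker's pattern (two words, then an alphanumeric ID) but captures only the ID, so findall returns a flat list of strings. RECORD and the sample lines are assumptions.

```python
import re

# Two words followed by an alphanumeric ID; only the ID is captured.
RECORD = re.compile(r'[a-z]{2,}\s+[a-z]{2,}\s+([a-z]+[0-9]+[a-z0-9]*)', re.IGNORECASE)

# Hypothetical stand-in for the contents of the usage file.
text = "login name g706454\nlogin name h123456\n"

users = RECORD.findall(text)                             # every match at once
starts = [m.start(1) for m in RECORD.finditer(text)]     # match objects, if positions are needed

print(users)
print(starts)
```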

Remove -#### in zipcodes

How do I remove the +4 from zipcodes, in python?
I've got data like
85001
52804-3233
Winston-Salem
And I want that to become
85001
52804
Winston-Salem
>>> zip = '52804-3233'
>>> zip[:5]
'52804'
...and of course when you parse your lines from the original data you should add some kind of rule to distinguish between zip codes to fix and other strings, but I don't know what your data looks like, so I can't help much (you could check whether they contain only digits and the '-' symbol, maybe?).
>>> import re
>>> s = "52804-3233"
>>> # regex to remove a dash and 4 digits after the dash after 5 digits:
>>> re.sub(r'(\d{5})-\d{4}', r'\1', s)
'52804'
The \1 is a so-called backreference and gets replaced by the first group, which is the 5-digit zip code in this case.
You could try something like this:
for i, value in enumerate(inputs):
    if value[:5].isnumeric():
        # Keep only the first 5 characters of the string
        inputs[i] = value[:5]
Just keep the first 5 characters of anything that has digits in the first 5 positions.
re.sub(r'-\d{4}$', '', zipcode)
This grabs all items of the format 00000-0000 with a space or other word boundary before and after the number and replaces each with its first five digits. The other regexes posted will match some other number formats that you might not want. Note the raw strings: in a normal string literal, '\b' is a backspace character rather than a word boundary.
re.sub(r'\b(\d{5})-\d{4}\b', r'\1', zipcode)
Or without regex:
output = [line[:5] if line[:5].isnumeric() and line[6:].isnumeric() else line for line in text if line]
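For data that comes one value per line, as in the question, a stricter variant is to change a value only when the whole value has the ZIP+4 shape. A minimal sketch, assuming Python 3.4 or later for re.fullmatch; ZIP_PLUS4 and strip_plus4 are hypothetical names.

```python
import re

ZIP_PLUS4 = re.compile(r'(\d{5})-\d{4}')

def strip_plus4(value):
    """Return the 5-digit part of a ZIP+4 code; leave anything else unchanged."""
    match = ZIP_PLUS4.fullmatch(value.strip())
    return match.group(1) if match else value

data = ['85001', '52804-3233', 'Winston-Salem']
print([strip_plus4(item) for item in data])
```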
